* [PATCH v3 0/5] migration: improve multithreads
From: guangrong.xiao @ 2018-11-22  7:20 UTC
  To: pbonzini, mst, mtosatti
  Cc: kvm, quintela, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, cota, jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Changelog in v3:
Thanks to Emilio's comments and his example code, the changes in
this version are:
1. move @requests from the shared data struct to each individual thread
2. move the completion event from the shared data struct to each
   individual thread
3. move the bitmaps from the shared data struct to each individual thread
4. limit the number of requests that each thread handles to 64, so that
   a uint64_t can be used instead of a bitmap pointer (see the sketch
   below)
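
The resulting per-thread layout is roughly the following (a condensed
sketch of the ThreadLocal struct introduced in patch 2; see that patch
for the full definition and the cache-line alignment annotations):

    struct ThreadLocal {
        void *requests;              /* pre-allocated, per-thread requests */
        /* the user flips a bit here after filling a request */
        uint64_t request_fill_bitmap;
        /* the thread flips the matching bit after handling the request */
        uint64_t request_done_bitmap;
        QemuEvent request_valid_ev;  /* wakes up the thread */
        QemuEvent request_free_ev;   /* wakes up the user */
        ...
    };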
   
The performance is measured with the benchmark introduced in
this patchset:
	./tests/threaded-workqueue-bench -c 20 -m 16 -t N
The data is as follows:

The baseline of v2:
Thread #, Throughput
1, 0.428024
4, 1.668876
8, 3.501940
12, 5.026403
16, 1.912374
20, 1.174771
24, 1.074085
28, 0.747920
32, 0.651409
36, 0.533240
40, 0.517421
44, 0.482153
48, 0.525176
52, 0.492677
56, 0.798679
60, 0.733868
64, 0.751396

After this patchset:
Thread #, Throughput
1, 0.449192
4, 1.849271
8, 3.644339
12, 4.809391
16, 4.709095
20, 4.942153
24, 5.116967
28, 4.921542
32, 5.008816
36, 5.408070
40, 5.166064
44, 4.994953
48, 4.853351
52, 4.797540
56, 4.815153
60, 4.793704
64, 4.913544

For a more detailed look at the compression performance at each step,
please refer to:
   https://ibb.co/hq7u5V

Xiao Guangrong (5):
  bitops: introduce change_bit_atomic
  util: introduce threaded workqueue
  migration: use threaded workqueue for compression
  migration: use threaded workqueue for decompression
  tests: add threaded-workqueue-bench

 include/qemu/bitops.h             |  13 +
 include/qemu/threaded-workqueue.h | 106 ++++++++
 migration/ram.c                   | 530 ++++++++++++++------------------------
 tests/Makefile.include            |   5 +-
 tests/threaded-workqueue-bench.c  | 255 ++++++++++++++++++
 util/Makefile.objs                |   1 +
 util/threaded-workqueue.c         | 463 +++++++++++++++++++++++++++++++++
 7 files changed, 1029 insertions(+), 344 deletions(-)
 create mode 100644 include/qemu/threaded-workqueue.h
 create mode 100644 tests/threaded-workqueue-bench.c
 create mode 100644 util/threaded-workqueue.c

-- 
2.14.5

* [PATCH v3 1/5] bitops: introduce change_bit_atomic
From: guangrong.xiao @ 2018-11-22  7:20 UTC
  To: pbonzini, mst, mtosatti
  Cc: kvm, quintela, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, cota, jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

It will be used by the threaded workqueue.
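
For illustration, a minimal sketch of the intended use (the bitmap and
the bit indices below are made up for the example): two contexts can
toggle different bits of the same word without a lock and without losing
updates, which the plain, non-atomic change_bit() cannot guarantee.

    static unsigned long done_bitmap;

    /* e.g., a worker thread marks request 3 as done ... */
    change_bit_atomic(3, &done_bitmap);
    /* ... while another context concurrently toggles bit 5 */
    change_bit_atomic(5, &done_bitmap);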

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 include/qemu/bitops.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
index 3f0926cf40..c522958852 100644
--- a/include/qemu/bitops.h
+++ b/include/qemu/bitops.h
@@ -79,6 +79,19 @@ static inline void change_bit(long nr, unsigned long *addr)
     *p ^= mask;
 }
 
+/**
+ * change_bit_atomic - Toggle a bit in memory atomically
+ * @nr: Bit to change
+ * @addr: Address to start counting from
+ */
+static inline void change_bit_atomic(long nr, unsigned long *addr)
+{
+    unsigned long mask = BIT_MASK(nr);
+    unsigned long *p = addr + BIT_WORD(nr);
+
+    atomic_xor(p, mask);
+}
+
 /**
  * test_and_set_bit - Set a bit and return its old value
  * @nr: Bit to set
-- 
2.14.5

* [PATCH v3 2/5] util: introduce threaded workqueue
From: guangrong.xiao @ 2018-11-22  7:20 UTC
  To: pbonzini, mst, mtosatti
  Cc: kvm, quintela, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, cota, jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

This module implements a lockless and efficient threaded workqueue.

Three abstracted objects are used in this module:
- Request.
    It not only contains the data that the workqueue fetches out
    to finish the request but also offers the space to save the result
    after the workqueue handles the request.

    It flows between the user and the workqueue. The user fills the
    request data into it while it is owned by the user. After it is
    submitted to the workqueue, the workqueue fetches the data out and
    saves the result into it once the request has been handled.

    All the requests are pre-allocated and carefully partitioned between
    threads, so there is no contention on a request, which lets the
    threads run in parallel as much as possible.

- User, i.e., the submitter
    It is the one that fills a request and submits it to the workqueue;
    the result is collected after the request has been handled by the
    workqueue.

    The user can submit requests consecutively without waiting for the
    previous requests to be handled.
    Only a single submitter is supported; if you need more, serialize
    the submissions yourself, e.g., with a lock on your side.

- Workqueue, i.e., thread
    Each workqueue is represented by a running thread that fetches
    the requests submitted by the user, does the specified work, and
    saves the result back into the request.
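
For illustration, here is a minimal sketch of a user of this API (the
MyRequest type, the doubling handler, and all "my_*" names are made up
for the example; only the threaded-workqueue calls come from this patch):

    typedef struct {
        int in;
        int out;
    } MyRequest;

    static int my_request_init(void *request) { return 0; }
    static void my_request_uninit(void *request) { }

    /* runs in a workqueue thread */
    static void my_request_handler(void *request)
    {
        MyRequest *req = request;
        req->out = req->in * 2;
    }

    /* runs on the user side once the result is collected */
    static void my_request_done(void *request)
    {
        MyRequest *req = request;
        printf("result: %d\n", req->out);
    }

    static const ThreadedWorkqueueOps my_ops = {
        .thread_request_init = my_request_init,
        .thread_request_uninit = my_request_uninit,
        .thread_request_handler = my_request_handler,
        .thread_request_done = my_request_done,
        .request_size = sizeof(MyRequest),
    };

    Threads *threads = threaded_workqueue_create("demo", 4,
                           DEFAULT_THREAD_REQUEST_NR, &my_ops);
    MyRequest *req = threaded_workqueue_get_request(threads);
    if (req) {                 /* NULL means all requests are in flight */
        req->in = 21;
        threaded_workqueue_submit_request(threads, req);
    }
    threaded_workqueue_wait_for_requests(threads);
    threaded_workqueue_destroy(threads);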

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 include/qemu/threaded-workqueue.h | 106 +++++++++
 util/Makefile.objs                |   1 +
 util/threaded-workqueue.c         | 463 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 570 insertions(+)
 create mode 100644 include/qemu/threaded-workqueue.h
 create mode 100644 util/threaded-workqueue.c

diff --git a/include/qemu/threaded-workqueue.h b/include/qemu/threaded-workqueue.h
new file mode 100644
index 0000000000..e0ede496d0
--- /dev/null
+++ b/include/qemu/threaded-workqueue.h
@@ -0,0 +1,106 @@
+/*
+ * Lockless and Efficient Threaded Workqueue Abstraction
+ *
+ * Author:
+ *   Xiao Guangrong <xiaoguangrong@tencent.com>
+ *
+ * Copyright(C) 2018 Tencent Corporation.
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#ifndef QEMU_THREADED_WORKQUEUE_H
+#define QEMU_THREADED_WORKQUEUE_H
+
+#include "qemu/queue.h"
+#include "qemu/thread.h"
+
+/*
+ * This module implements a lockless and efficient threaded workqueue.
+ *
+ * Three abstracted objects are used in this module:
+ * - Request.
+ *   It not only contains the data that the workqueue fetches out
+ *   to finish the request but also offers the space to save the result
+ *   after the workqueue handles the request.
+ *
+ *   It flows between the user and the workqueue. The user fills the
+ *   request data into it while it is owned by the user. After it is
+ *   submitted to the workqueue, the workqueue fetches the data out and
+ *   saves the result into it once the request has been handled.
+ *
+ *   All the requests are pre-allocated and carefully partitioned between
+ *   threads, so there is no contention on a request, which lets the
+ *   threads run in parallel as much as possible.
+ *
+ * - User, i.e., the submitter
+ *   It is the one that fills a request and submits it to the workqueue;
+ *   the result is collected after the request has been handled by the
+ *   workqueue.
+ *
+ *   The user can submit requests consecutively without waiting for the
+ *   previous requests to be handled.
+ *   Only a single submitter is supported; if you need more, serialize
+ *   the submissions yourself, e.g., with a lock on your side.
+ *
+ * - Workqueue, i.e., thread
+ *   Each workqueue is represented by a running thread that fetches
+ *   the requests submitted by the user, does the specified work, and
+ *   saves the result back into the request.
+ */
+
+typedef struct Threads Threads;
+
+struct ThreadedWorkqueueOps {
+    /* constructor of the request */
+    int (*thread_request_init)(void *request);
+    /* destructor of the request */
+    void (*thread_request_uninit)(void *request);
+
+    /* the handler of the request that is called by the thread */
+    void (*thread_request_handler)(void *request);
+    /* called by the user after the request has been handled */
+    void (*thread_request_done)(void *request);
+
+    size_t request_size;
+};
+typedef struct ThreadedWorkqueueOps ThreadedWorkqueueOps;
+
+/* the default number of requests that each thread handles */
+#define DEFAULT_THREAD_REQUEST_NR 4
+/* the maximum number of requests that each thread can handle */
+#define MAX_THREAD_REQUEST_NR     (sizeof(uint64_t) * BITS_PER_BYTE)
+
+/*
+ * create a threaded workqueue. The other APIs operate on the Threads
+ * it returns
+ *
+ * @name: the identity of the workqueue; it is only used to construct
+ *    the names of the threads
+ * @threads_nr: the number of threads that the workqueue will create
+ * @thread_requests_nr: the number of requests that each single thread will
+ *    handle
+ * @ops: the handlers of the request
+ *
+ * Returns NULL on failure
+ */
+Threads *threaded_workqueue_create(const char *name, unsigned int threads_nr,
+                                   unsigned int thread_requests_nr,
+                                   const ThreadedWorkqueueOps *ops);
+void threaded_workqueue_destroy(Threads *threads);
+
+/*
+ * find a free request where the user can store the data that is needed to
+ * finish the request
+ *
+ * If all requests are used up, return NULL
+ */
+void *threaded_workqueue_get_request(Threads *threads);
+/* submit the request and notify the thread */
+void threaded_workqueue_submit_request(Threads *threads, void *request);
+
+/*
+ * wait for all threads to complete their requests, so that no previously
+ * submitted request is still outstanding
+ */
+void threaded_workqueue_wait_for_requests(Threads *threads);
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 0820923c18..f26dfe5182 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -50,5 +50,6 @@ util-obj-y += range.o
 util-obj-y += stats64.o
 util-obj-y += systemd.o
 util-obj-y += iova-tree.o
+util-obj-y += threaded-workqueue.o
 util-obj-$(CONFIG_LINUX) += vfio-helpers.o
 util-obj-$(CONFIG_OPENGL) += drm.o
diff --git a/util/threaded-workqueue.c b/util/threaded-workqueue.c
new file mode 100644
index 0000000000..2ab37cee8d
--- /dev/null
+++ b/util/threaded-workqueue.c
@@ -0,0 +1,463 @@
+/*
+ * Lockless and Efficient Threaded Workqueue Abstraction
+ *
+ * Author:
+ *   Xiao Guangrong <xiaoguangrong@tencent.com>
+ *
+ * Copyright(C) 2018 Tencent Corporation.
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/bitmap.h"
+#include "qemu/threaded-workqueue.h"
+
+#define SMP_CACHE_BYTES 64
+
+/*
+ * the request representation, which contains the internally used metadata;
+ * it is the header of the user-defined data.
+ *
+ * It should be aligned to the natural word size of the CPU.
+ */
+struct ThreadRequest {
+    /*
+     * the request has been handled by the thread and needs the user
+     * to fetch the result out.
+     */
+    uint8_t done;
+
+    /*
+     * the index into ThreadLocal::requests. It is saved in the padding
+     * space although it can be calculated at runtime.
+     */
+    uint8_t request_index;
+
+    /* the index to Threads::per_thread_data */
+    unsigned int thread_index;
+} QEMU_ALIGNED(sizeof(unsigned long));
+typedef struct ThreadRequest ThreadRequest;
+
+struct ThreadLocal {
+    struct Threads *threads;
+
+    /* the index of the thread */
+    int self;
+
+    /* thread is useless and needs to exit */
+    bool quit;
+
+    QemuThread thread;
+
+    void *requests;
+
+    /*
+     * each bit in these two bitmaps corresponds, by index, to a request
+     * in @requests. If the two bits are equal, the corresponding request
+     * is free and owned by the user, i.e., it is where the user fills a
+     * request. Otherwise, it is valid and owned by the thread, i.e., it
+     * is where the thread fetches the request and writes the result.
+     */
+
+    /* after the user fills the request, the bit is flipped. */
+    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
+    /* after handling the request, the thread flips the bit. */
+    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
+
+    /*
+     * the event used to wake up the thread whenever a valid request has
+     * been submitted
+     */
+    QemuEvent request_valid_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
+
+    /*
+     * the event is notified whenever a request has been completed
+     * (i.e, become free), which is used to wake up the user
+     */
+    QemuEvent request_free_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
+};
+typedef struct ThreadLocal ThreadLocal;
+
+/*
+ * the main data struct, which represents the multithreaded workqueue and
+ * is shared by all threads
+ */
+struct Threads {
+    /* the size of a request, the ThreadRequest header included */
+    unsigned int request_size;
+    unsigned int thread_requests_nr;
+    unsigned int threads_nr;
+
+    /* requests are pushed to the threads in a round-robin manner */
+    unsigned int current_thread_index;
+
+    const ThreadedWorkqueueOps *ops;
+
+    ThreadLocal per_thread_data[0];
+};
+typedef struct Threads Threads;
+
+static ThreadRequest *index_to_request(ThreadLocal *thread, int request_index)
+{
+    ThreadRequest *request;
+
+    request = thread->requests + request_index * thread->threads->request_size;
+    assert(request->request_index == request_index);
+    assert(request->thread_index == thread->self);
+    return request;
+}
+
+static int request_to_index(ThreadRequest *request)
+{
+    return request->request_index;
+}
+
+static int request_to_thread_index(ThreadRequest *request)
+{
+    return request->thread_index;
+}
+
+/*
+ * free request: the request is not used by any thread, however, it might
+ *   contain a result that needs the user to call thread_request_done()
+ *
+ * valid request: the request contains the request data and it's committed
+ *   to the thread, i.e., it's owned by the thread.
+ */
+static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
+{
+    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
+
+    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
+    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
+    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
+               threads->thread_requests_nr);
+
+    /*
+     * paired with smp_wmb() in mark_request_free() to make sure that we
+     * read request_done_bitmap before fetching the result out.
+     */
+    smp_rmb();
+
+    return result_bitmap;
+}
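+
+/*
+ * Example of the bitmap encoding (illustrative values, 4 requests):
+ *   request_fill_bitmap = 1011  (the user has submitted requests 0, 1, 3)
+ *   request_done_bitmap = 1001  (the thread has completed requests 0, 3)
+ *   fill ^ done         = 0010  (request 1 is still owned by the thread;
+ *                                the zero bits denote free requests)
+ */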
+
+static ThreadRequest *
+find_thread_free_request(Threads *threads, ThreadLocal *thread)
+{
+    uint64_t result_bitmap = get_free_request_bitmap(threads, thread);
+    int index;
+
+    index = find_first_zero_bit(&result_bitmap, threads->thread_requests_nr);
+    if (index >= threads->thread_requests_nr) {
+        return NULL;
+    }
+
+    return index_to_request(thread, index);
+}
+
+static ThreadRequest *threads_find_free_request(Threads *threads)
+{
+    ThreadLocal *thread;
+    ThreadRequest *request;
+    int cur_thread, thread_index;
+
+    cur_thread = threads->current_thread_index % threads->threads_nr;
+    thread_index = cur_thread;
+    do {
+        thread = threads->per_thread_data + thread_index++;
+        request = find_thread_free_request(threads, thread);
+        if (request) {
+            break;
+        }
+        thread_index %= threads->threads_nr;
+    } while (thread_index != cur_thread);
+
+    return request;
+}
+
+/*
+ * the change-bit operation combined with READ_ONCE and WRITE_ONCE; it
+ * only works on a single uint64_t word
+ */
+static void change_bit_once(long nr, uint64_t *addr)
+{
+    uint64_t value = atomic_rcu_read(addr) ^ BIT_MASK(nr);
+
+    atomic_rcu_set(addr, value);
+}
+
+static void mark_request_valid(Threads *threads, ThreadRequest *request)
+{
+    int thread_index = request_to_thread_index(request);
+    int request_index = request_to_index(request);
+    ThreadLocal *thread = threads->per_thread_data + thread_index;
+
+    /*
+     * paired with smp_rmb() in thread_find_first_valid_request_index() to
+     * make sure the request has been filled before the bit is flipped,
+     * which makes the request visible to the thread
+     */
+    smp_wmb();
+
+    change_bit_once(request_index, &thread->request_fill_bitmap);
+    qemu_event_set(&thread->request_valid_ev);
+}
+
+static int thread_find_first_valid_request_index(ThreadLocal *thread)
+{
+    Threads *threads = thread->threads;
+    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
+    int index;
+
+    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
+    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
+    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
+               threads->thread_requests_nr);
+    /*
+     * paired with smp_wmb() in mark_request_valid() to make sure that
+     * we read request_fill_bitmap before fetching the request out.
+     */
+    smp_rmb();
+
+    index = find_first_bit(&result_bitmap, threads->thread_requests_nr);
+    return index >= threads->thread_requests_nr ? -1 : index;
+}
+
+static void mark_request_free(ThreadLocal *thread, ThreadRequest *request)
+{
+    int index = request_to_index(request);
+
+    /*
+     * smp_wmb() is implied in change_bit_atomic() that is paired with
+     * smp_rmb() in get_free_request_bitmap() to make sure the result
+     * has been saved before the bit is flipped.
+     */
+    change_bit_atomic(index, &thread->request_done_bitmap);
+    qemu_event_set(&thread->request_free_ev);
+}
+
+/* retry to see if a request is available before actually going to wait. */
+#define BUSY_WAIT_COUNT 1000
+
+static ThreadRequest *
+thread_busy_wait_for_request(ThreadLocal *thread)
+{
+    int index, count = 0;
+
+    for (count = 0; count < BUSY_WAIT_COUNT; count++) {
+        index = thread_find_first_valid_request_index(thread);
+        if (index >= 0) {
+            return index_to_request(thread, index);
+        }
+
+        cpu_relax();
+    }
+
+    return NULL;
+}
+
+static void *thread_run(void *opaque)
+{
+    ThreadLocal *self_data = (ThreadLocal *)opaque;
+    Threads *threads = self_data->threads;
+    void (*handler)(void *request) = threads->ops->thread_request_handler;
+    ThreadRequest *request;
+
+    while (!atomic_read(&self_data->quit)) {
+        qemu_event_reset(&self_data->request_valid_ev);
+
+        request = thread_busy_wait_for_request(self_data);
+        if (!request) {
+            qemu_event_wait(&self_data->request_valid_ev);
+            continue;
+        }
+
+        assert(!request->done);
+
+        handler(request + 1);
+        request->done = true;
+        mark_request_free(self_data, request);
+    }
+
+    return NULL;
+}
+
+static void uninit_thread_requests(ThreadLocal *thread, int free_nr)
+{
+    Threads *threads = thread->threads;
+    ThreadRequest *request = thread->requests;
+    int i;
+
+    for (i = 0; i < free_nr; i++) {
+        threads->ops->thread_request_uninit(request + 1);
+        request = (void *)request + threads->request_size;
+    }
+    g_free(thread->requests);
+}
+
+static int init_thread_requests(ThreadLocal *thread)
+{
+    Threads *threads = thread->threads;
+    ThreadRequest *request;
+    int ret, i, thread_reqs_size;
+
+    thread_reqs_size = threads->thread_requests_nr * threads->request_size;
+    thread_reqs_size = QEMU_ALIGN_UP(thread_reqs_size, SMP_CACHE_BYTES);
+    thread->requests = g_malloc0(thread_reqs_size);
+
+    request = thread->requests;
+    for (i = 0; i < threads->thread_requests_nr; i++) {
+        ret = threads->ops->thread_request_init(request + 1);
+        if (ret < 0) {
+            goto exit;
+        }
+
+        request->request_index = i;
+        request->thread_index = thread->self;
+        request = (void *)request + threads->request_size;
+    }
+    return 0;
+
+exit:
+    uninit_thread_requests(thread, i);
+    return -1;
+}
+
+static void uninit_thread_data(Threads *threads, int free_nr)
+{
+    ThreadLocal *thread_local = threads->per_thread_data;
+    int i;
+
+    for (i = 0; i < free_nr; i++) {
+        thread_local[i].quit = true;
+        qemu_event_set(&thread_local[i].request_valid_ev);
+        qemu_thread_join(&thread_local[i].thread);
+        qemu_event_destroy(&thread_local[i].request_valid_ev);
+        qemu_event_destroy(&thread_local[i].request_free_ev);
+        uninit_thread_requests(&thread_local[i], threads->thread_requests_nr);
+    }
+}
+
+static int
+init_thread_data(Threads *threads, const char *thread_name, int thread_nr)
+{
+    ThreadLocal *thread_local = threads->per_thread_data;
+    char *name;
+    int i;
+
+    for (i = 0; i < thread_nr; i++) {
+        thread_local[i].threads = threads;
+        thread_local[i].self = i;
+
+        if (init_thread_requests(&thread_local[i]) < 0) {
+            goto exit;
+        }
+
+        qemu_event_init(&thread_local[i].request_free_ev, false);
+        qemu_event_init(&thread_local[i].request_valid_ev, false);
+
+        name = g_strdup_printf("%s/%d", thread_name, thread_local[i].self);
+        qemu_thread_create(&thread_local[i].thread, name,
+                           thread_run, &thread_local[i], QEMU_THREAD_JOINABLE);
+        g_free(name);
+    }
+    return 0;
+
+exit:
+    uninit_thread_data(threads, i);
+    return -1;
+}
+
+Threads *threaded_workqueue_create(const char *name, unsigned int threads_nr,
+                                   unsigned int thread_requests_nr,
+                                   const ThreadedWorkqueueOps *ops)
+{
+    Threads *threads;
+
+    if (thread_requests_nr > MAX_THREAD_REQUEST_NR) {
+        return NULL;
+    }
+
+    threads = g_malloc0(sizeof(*threads) + threads_nr * sizeof(ThreadLocal));
+    threads->ops = ops;
+    threads->threads_nr = threads_nr;
+    threads->thread_requests_nr = thread_requests_nr;
+
+    QEMU_BUILD_BUG_ON(!QEMU_IS_ALIGNED(sizeof(ThreadRequest), sizeof(long)));
+    threads->request_size = threads->ops->request_size;
+    threads->request_size = QEMU_ALIGN_UP(threads->request_size, sizeof(long));
+    threads->request_size += sizeof(ThreadRequest);
+
+    if (init_thread_data(threads, name, threads_nr) < 0) {
+        g_free(threads);
+        return NULL;
+    }
+
+    return threads;
+}
+
+void threaded_workqueue_destroy(Threads *threads)
+{
+    uninit_thread_data(threads, threads->threads_nr);
+    g_free(threads);
+}
+
+static void request_done(Threads *threads, ThreadRequest *request)
+{
+    if (!request->done) {
+        return;
+    }
+
+    threads->ops->thread_request_done(request + 1);
+    request->done = false;
+}
+
+void *threaded_workqueue_get_request(Threads *threads)
+{
+    ThreadRequest *request;
+
+    request = threads_find_free_request(threads);
+    if (!request) {
+        return NULL;
+    }
+
+    request_done(threads, request);
+    return request + 1;
+}
+
+void threaded_workqueue_submit_request(Threads *threads, void *request)
+{
+    ThreadRequest *req = request - sizeof(ThreadRequest);
+    int thread_index = request_to_thread_index(req);
+
+    assert(!req->done);
+    mark_request_valid(threads, req);
+    threads->current_thread_index = thread_index + 1;
+}
+
+void threaded_workqueue_wait_for_requests(Threads *threads)
+{
+    ThreadLocal *thread;
+    uint64_t result_bitmap;
+    int thread_index, index = 0;
+
+    for (thread_index = 0; thread_index < threads->threads_nr; thread_index++) {
+        thread = threads->per_thread_data + thread_index;
+        index = 0;
+retry:
+        qemu_event_reset(&thread->request_free_ev);
+        result_bitmap = get_free_request_bitmap(threads, thread);
+
+        for (; index < threads->thread_requests_nr; index++) {
+            if (test_bit(index, &result_bitmap)) {
+                qemu_event_wait(&thread->request_free_ev);
+                goto retry;
+            }
+
+            request_done(threads, index_to_request(thread, index));
+        }
+    }
+}
-- 
2.14.5

* [PATCH v3 3/5] migration: use threaded workqueue for compression
From: guangrong.xiao @ 2018-11-22  7:20 UTC
  To: pbonzini, mst, mtosatti
  Cc: kvm, quintela, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, cota, jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Adapt the compression code to the threaded workqueue.
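
With the threaded workqueue, the hot path of submitting a page for
compression reduces to roughly the following (a condensed sketch of
compress_page_with_multi_thread() from the diff below):

    CompressData *cd = threaded_workqueue_get_request(compress_threads);
    if (cd) {
        cd->block = block;
        cd->offset = offset;
        threaded_workqueue_submit_request(compress_threads, cd);
    }
    /* if cd is NULL, either wait for a free request or post the page
       uncompressed, depending on 'compress-wait-thread' */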

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 308 ++++++++++++++++++++------------------------------------
 1 file changed, 110 insertions(+), 198 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 7e7deec4d8..254c08f27b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -57,6 +57,7 @@
 #include "qemu/uuid.h"
 #include "savevm.h"
 #include "qemu/iov.h"
+#include "qemu/threaded-workqueue.h"
 
 /***********************************************************/
 /* ram save/restore */
@@ -349,22 +350,6 @@ typedef struct PageSearchStatus PageSearchStatus;
 
 CompressionStats compression_counters;
 
-struct CompressParam {
-    bool done;
-    bool quit;
-    bool zero_page;
-    QEMUFile *file;
-    QemuMutex mutex;
-    QemuCond cond;
-    RAMBlock *block;
-    ram_addr_t offset;
-
-    /* internally used fields */
-    z_stream stream;
-    uint8_t *originbuf;
-};
-typedef struct CompressParam CompressParam;
-
 struct DecompressParam {
     bool done;
     bool quit;
@@ -377,15 +362,6 @@ struct DecompressParam {
 };
 typedef struct DecompressParam DecompressParam;
 
-static CompressParam *comp_param;
-static QemuThread *compress_threads;
-/* comp_done_cond is used to wake up the migration thread when
- * one of the compression threads has finished the compression.
- * comp_done_lock is used to co-work with comp_done_cond.
- */
-static QemuMutex comp_done_lock;
-static QemuCond comp_done_cond;
-/* The empty QEMUFileOps will be used by file in CompressParam */
 static const QEMUFileOps empty_ops = { };
 
 static QEMUFile *decomp_file;
@@ -394,125 +370,6 @@ static QemuThread *decompress_threads;
 static QemuMutex decomp_done_lock;
 static QemuCond decomp_done_cond;
 
-static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
-                                 ram_addr_t offset, uint8_t *source_buf);
-
-static void *do_data_compress(void *opaque)
-{
-    CompressParam *param = opaque;
-    RAMBlock *block;
-    ram_addr_t offset;
-    bool zero_page;
-
-    qemu_mutex_lock(&param->mutex);
-    while (!param->quit) {
-        if (param->block) {
-            block = param->block;
-            offset = param->offset;
-            param->block = NULL;
-            qemu_mutex_unlock(&param->mutex);
-
-            zero_page = do_compress_ram_page(param->file, &param->stream,
-                                             block, offset, param->originbuf);
-
-            qemu_mutex_lock(&comp_done_lock);
-            param->done = true;
-            param->zero_page = zero_page;
-            qemu_cond_signal(&comp_done_cond);
-            qemu_mutex_unlock(&comp_done_lock);
-
-            qemu_mutex_lock(&param->mutex);
-        } else {
-            qemu_cond_wait(&param->cond, &param->mutex);
-        }
-    }
-    qemu_mutex_unlock(&param->mutex);
-
-    return NULL;
-}
-
-static void compress_threads_save_cleanup(void)
-{
-    int i, thread_count;
-
-    if (!migrate_use_compression() || !comp_param) {
-        return;
-    }
-
-    thread_count = migrate_compress_threads();
-    for (i = 0; i < thread_count; i++) {
-        /*
-         * we use it as a indicator which shows if the thread is
-         * properly init'd or not
-         */
-        if (!comp_param[i].file) {
-            break;
-        }
-
-        qemu_mutex_lock(&comp_param[i].mutex);
-        comp_param[i].quit = true;
-        qemu_cond_signal(&comp_param[i].cond);
-        qemu_mutex_unlock(&comp_param[i].mutex);
-
-        qemu_thread_join(compress_threads + i);
-        qemu_mutex_destroy(&comp_param[i].mutex);
-        qemu_cond_destroy(&comp_param[i].cond);
-        deflateEnd(&comp_param[i].stream);
-        g_free(comp_param[i].originbuf);
-        qemu_fclose(comp_param[i].file);
-        comp_param[i].file = NULL;
-    }
-    qemu_mutex_destroy(&comp_done_lock);
-    qemu_cond_destroy(&comp_done_cond);
-    g_free(compress_threads);
-    g_free(comp_param);
-    compress_threads = NULL;
-    comp_param = NULL;
-}
-
-static int compress_threads_save_setup(void)
-{
-    int i, thread_count;
-
-    if (!migrate_use_compression()) {
-        return 0;
-    }
-    thread_count = migrate_compress_threads();
-    compress_threads = g_new0(QemuThread, thread_count);
-    comp_param = g_new0(CompressParam, thread_count);
-    qemu_cond_init(&comp_done_cond);
-    qemu_mutex_init(&comp_done_lock);
-    for (i = 0; i < thread_count; i++) {
-        comp_param[i].originbuf = g_try_malloc(TARGET_PAGE_SIZE);
-        if (!comp_param[i].originbuf) {
-            goto exit;
-        }
-
-        if (deflateInit(&comp_param[i].stream,
-                        migrate_compress_level()) != Z_OK) {
-            g_free(comp_param[i].originbuf);
-            goto exit;
-        }
-
-        /* comp_param[i].file is just used as a dummy buffer to save data,
-         * set its ops to empty.
-         */
-        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops);
-        comp_param[i].done = true;
-        comp_param[i].quit = false;
-        qemu_mutex_init(&comp_param[i].mutex);
-        qemu_cond_init(&comp_param[i].cond);
-        qemu_thread_create(compress_threads + i, "compress",
-                           do_data_compress, comp_param + i,
-                           QEMU_THREAD_JOINABLE);
-    }
-    return 0;
-
-exit:
-    compress_threads_save_cleanup();
-    return -1;
-}
-
 /* Multiple fd's */
 
 #define MULTIFD_MAGIC 0x11223344U
@@ -1909,12 +1766,25 @@ exit:
     return zero_page;
 }
 
+struct CompressData {
+    /* filled by the migration thread */
+    RAMBlock *block;
+    ram_addr_t offset;
+
+    /* filled by the compression thread */
+    QEMUFile *file;
+    z_stream stream;
+    uint8_t *originbuf;
+    bool zero_page;
+};
+typedef struct CompressData CompressData;
+
 static void
-update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
+update_compress_thread_counts(CompressData *cd, int bytes_xmit)
 {
     ram_counters.transferred += bytes_xmit;
 
-    if (param->zero_page) {
+    if (cd->zero_page) {
         ram_counters.duplicate++;
         return;
     }
@@ -1924,81 +1794,123 @@ update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
     compression_counters.pages++;
 }
 
+static int compress_thread_data_init(void *request)
+{
+    CompressData *cd = request;
+
+    cd->originbuf = g_try_malloc(TARGET_PAGE_SIZE);
+    if (!cd->originbuf) {
+        return -1;
+    }
+
+    if (deflateInit(&cd->stream, migrate_compress_level()) != Z_OK) {
+        g_free(cd->originbuf);
+        return -1;
+    }
+
+    cd->file = qemu_fopen_ops(NULL, &empty_ops);
+    return 0;
+}
+
+static void compress_thread_data_fini(void *request)
+{
+    CompressData *cd = request;
+
+    qemu_fclose(cd->file);
+    deflateEnd(&cd->stream);
+    g_free(cd->originbuf);
+}
+
+static void compress_thread_data_handler(void *request)
+{
+    CompressData *cd = request;
+
+    /*
+     * if compression fails, it will be indicated by
+     * migrate_get_current()->to_dst_file.
+     */
+    cd->zero_page = do_compress_ram_page(cd->file, &cd->stream, cd->block,
+                                         cd->offset, cd->originbuf);
+}
+
+static void compress_thread_data_done(void *request)
+{
+    CompressData *cd = request;
+    RAMState *rs = ram_state;
+    int bytes_xmit;
+
+    bytes_xmit = qemu_put_qemu_file(rs->f, cd->file);
+    update_compress_thread_counts(cd, bytes_xmit);
+}
+
+static const ThreadedWorkqueueOps compress_ops = {
+    .thread_request_init = compress_thread_data_init,
+    .thread_request_uninit = compress_thread_data_fini,
+    .thread_request_handler = compress_thread_data_handler,
+    .thread_request_done = compress_thread_data_done,
+    .request_size = sizeof(CompressData),
+};
+
+static Threads *compress_threads;
+
 static bool save_page_use_compression(RAMState *rs);
 
 static void flush_compressed_data(RAMState *rs)
 {
-    int idx, len, thread_count;
-
     if (!save_page_use_compression(rs)) {
         return;
     }
-    thread_count = migrate_compress_threads();
 
-    qemu_mutex_lock(&comp_done_lock);
-    for (idx = 0; idx < thread_count; idx++) {
-        while (!comp_param[idx].done) {
-            qemu_cond_wait(&comp_done_cond, &comp_done_lock);
-        }
-    }
-    qemu_mutex_unlock(&comp_done_lock);
+    threaded_workqueue_wait_for_requests(compress_threads);
+}
 
-    for (idx = 0; idx < thread_count; idx++) {
-        qemu_mutex_lock(&comp_param[idx].mutex);
-        if (!comp_param[idx].quit) {
-            len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
-            /*
-             * it's safe to fetch zero_page without holding comp_done_lock
-             * as there is no further request submitted to the thread,
-             * i.e, the thread should be waiting for a request at this point.
-             */
-            update_compress_thread_counts(&comp_param[idx], len);
-        }
-        qemu_mutex_unlock(&comp_param[idx].mutex);
+static void compress_threads_save_cleanup(void)
+{
+    if (!compress_threads) {
+        return;
     }
+
+    threaded_workqueue_destroy(compress_threads);
+    compress_threads = NULL;
 }
 
-static inline void set_compress_params(CompressParam *param, RAMBlock *block,
-                                       ram_addr_t offset)
+static int compress_threads_save_setup(void)
 {
-    param->block = block;
-    param->offset = offset;
+    if (!migrate_use_compression()) {
+        return 0;
+    }
+
+    compress_threads = threaded_workqueue_create("compress",
+                                migrate_compress_threads(),
+                                DEFAULT_THREAD_REQUEST_NR, &compress_ops);
+    return compress_threads ? 0 : -1;
 }
 
 static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
                                            ram_addr_t offset)
 {
-    int idx, thread_count, bytes_xmit = -1, pages = -1;
+    CompressData *cd;
     bool wait = migrate_compress_wait_thread();
 
-    thread_count = migrate_compress_threads();
-    qemu_mutex_lock(&comp_done_lock);
 retry:
-    for (idx = 0; idx < thread_count; idx++) {
-        if (comp_param[idx].done) {
-            comp_param[idx].done = false;
-            bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
-            qemu_mutex_lock(&comp_param[idx].mutex);
-            set_compress_params(&comp_param[idx], block, offset);
-            qemu_cond_signal(&comp_param[idx].cond);
-            qemu_mutex_unlock(&comp_param[idx].mutex);
-            pages = 1;
-            update_compress_thread_counts(&comp_param[idx], bytes_xmit);
-            break;
+    cd = threaded_workqueue_get_request(compress_threads);
+    if (!cd) {
+        /*
+         * Wait for a free thread if the user specified
+         * 'compress-wait-thread'; otherwise post the page
+         * out in the main thread as a normal page.
+         */
+        if (wait) {
+            cpu_relax();
+            goto retry;
         }
-    }
 
-    /*
-     * wait for the free thread if the user specifies 'compress-wait-thread',
-     * otherwise we will post the page out in the main thread as normal page.
-     */
-    if (pages < 0 && wait) {
-        qemu_cond_wait(&comp_done_cond, &comp_done_lock);
-        goto retry;
-    }
-    qemu_mutex_unlock(&comp_done_lock);
-
-    return pages;
+        return -1;
+    }
+    cd->block = block;
+    cd->offset = offset;
+    threaded_workqueue_submit_request(compress_threads, cd);
+    return 1;
 }
 
 /**
-- 
2.14.5

^ permalink raw reply related	[flat|nested] 70+ messages in thread
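
The submission path this patch switches to is compact enough to read in isolation. Below is a minimal sketch of the producer side, using only the calls that appear in the diff (threaded_workqueue_get_request(), threaded_workqueue_submit_request(), cpu_relax()) and the CompressData layout introduced above; the helper name compress_page_sketch is illustrative, not part of the series:

static int compress_page_sketch(Threads *threads, RAMBlock *block,
                                ram_addr_t offset, bool wait)
{
    CompressData *cd;

    for (;;) {
        /* NULL means every request slot is currently in flight */
        cd = threaded_workqueue_get_request(threads);
        if (cd) {
            cd->block = block;                    /* fill the request */
            cd->offset = offset;
            threaded_workqueue_submit_request(threads, cd);
            return 1;                             /* one page queued */
        }
        if (!wait) {
            return -1;  /* caller posts the page out as a normal page */
        }
        cpu_relax();    /* spin until a slot frees up, as the diff does */
    }
}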

* [PATCH v3 4/5] migration: use threaded workqueue for decompression
  2018-11-22  7:20 ` [Qemu-devel] " guangrong.xiao
@ 2018-11-22  7:20   ` guangrong.xiao
  -1 siblings, 0 replies; 70+ messages in thread
From: guangrong.xiao @ 2018-11-22  7:20 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, quintela, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, cota, jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Adapt the decompression code to the threaded workqueue

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 222 ++++++++++++++++++++------------------------------------
 1 file changed, 77 insertions(+), 145 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 254c08f27b..ccec59c35e 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -350,25 +350,9 @@ typedef struct PageSearchStatus PageSearchStatus;
 
 CompressionStats compression_counters;
 
-struct DecompressParam {
-    bool done;
-    bool quit;
-    QemuMutex mutex;
-    QemuCond cond;
-    void *des;
-    uint8_t *compbuf;
-    int len;
-    z_stream stream;
-};
-typedef struct DecompressParam DecompressParam;
-
 static const QEMUFileOps empty_ops = { };
 
 static QEMUFile *decomp_file;
-static DecompressParam *decomp_param;
-static QemuThread *decompress_threads;
-static QemuMutex decomp_done_lock;
-static QemuCond decomp_done_cond;
 
 /* Multiple fd's */
 
@@ -3399,6 +3383,7 @@ void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
     }
 }
 
+
 /* return the size after decompression, or negative value on error */
 static int
 qemu_uncompress_data(z_stream *stream, uint8_t *dest, size_t dest_len,
@@ -3424,166 +3409,113 @@ qemu_uncompress_data(z_stream *stream, uint8_t *dest, size_t dest_len,
     return stream->total_out;
 }
 
-static void *do_data_decompress(void *opaque)
-{
-    DecompressParam *param = opaque;
-    unsigned long pagesize;
-    uint8_t *des;
-    int len, ret;
-
-    qemu_mutex_lock(&param->mutex);
-    while (!param->quit) {
-        if (param->des) {
-            des = param->des;
-            len = param->len;
-            param->des = 0;
-            qemu_mutex_unlock(&param->mutex);
-
-            pagesize = TARGET_PAGE_SIZE;
-
-            ret = qemu_uncompress_data(&param->stream, des, pagesize,
-                                       param->compbuf, len);
-            if (ret < 0 && migrate_get_current()->decompress_error_check) {
-                error_report("decompress data failed");
-                qemu_file_set_error(decomp_file, ret);
-            }
+struct DecompressData {
+    /* filled by the migration thread. */
+    void *des;
+    uint8_t *compbuf;
+    size_t len;
 
-            qemu_mutex_lock(&decomp_done_lock);
-            param->done = true;
-            qemu_cond_signal(&decomp_done_cond);
-            qemu_mutex_unlock(&decomp_done_lock);
+    z_stream stream;
+};
+typedef struct DecompressData DecompressData;
 
-            qemu_mutex_lock(&param->mutex);
-        } else {
-            qemu_cond_wait(&param->cond, &param->mutex);
-        }
+static Threads *decompress_threads;
+
+static int decompress_thread_data_init(void *request)
+{
+    DecompressData *dd = request;
+
+    if (inflateInit(&dd->stream) != Z_OK) {
+        return -1;
     }
-    qemu_mutex_unlock(&param->mutex);
 
-    return NULL;
+    dd->compbuf = g_malloc0(compressBound(TARGET_PAGE_SIZE));
+    return 0;
 }
 
-static int wait_for_decompress_done(void)
+static void decompress_thread_data_fini(void *request)
 {
-    int idx, thread_count;
+    DecompressData *dd = request;
 
-    if (!migrate_use_compression()) {
-        return 0;
-    }
+    inflateEnd(&dd->stream);
+    g_free(dd->compbuf);
+}
 
-    thread_count = migrate_decompress_threads();
-    qemu_mutex_lock(&decomp_done_lock);
-    for (idx = 0; idx < thread_count; idx++) {
-        while (!decomp_param[idx].done) {
-            qemu_cond_wait(&decomp_done_cond, &decomp_done_lock);
-        }
+static void decompress_thread_data_handler(void *request)
+{
+    DecompressData *dd = request;
+    unsigned long pagesize = TARGET_PAGE_SIZE;
+    int ret;
+
+    ret = qemu_uncompress_data(&dd->stream, dd->des, pagesize,
+                               dd->compbuf, dd->len);
+    if (ret < 0 && migrate_get_current()->decompress_error_check) {
+        error_report("decompress data failed");
+        qemu_file_set_error(decomp_file, ret);
     }
-    qemu_mutex_unlock(&decomp_done_lock);
-    return qemu_file_get_error(decomp_file);
 }
 
-static void compress_threads_load_cleanup(void)
+static void decompress_thread_data_done(void *request)
 {
-    int i, thread_count;
+}
+
+static const ThreadedWorkqueueOps decompress_ops = {
+    .thread_request_init = decompress_thread_data_init,
+    .thread_request_uninit = decompress_thread_data_fini,
+    .thread_request_handler = decompress_thread_data_handler,
+    .thread_request_done = decompress_thread_data_done,
+    .request_size = sizeof(DecompressData),
+};
 
+static int decompress_init(QEMUFile *f)
+{
     if (!migrate_use_compression()) {
-        return;
+        return 0;
     }
-    thread_count = migrate_decompress_threads();
-    for (i = 0; i < thread_count; i++) {
-        /*
-         * we use it as a indicator which shows if the thread is
-         * properly init'd or not
-         */
-        if (!decomp_param[i].compbuf) {
-            break;
-        }
 
-        qemu_mutex_lock(&decomp_param[i].mutex);
-        decomp_param[i].quit = true;
-        qemu_cond_signal(&decomp_param[i].cond);
-        qemu_mutex_unlock(&decomp_param[i].mutex);
-    }
-    for (i = 0; i < thread_count; i++) {
-        if (!decomp_param[i].compbuf) {
-            break;
-        }
+    decomp_file = f;
+    decompress_threads = threaded_workqueue_create("decompress",
+                                migrate_decompress_threads(),
+                                DEFAULT_THREAD_REQUEST_NR, &decompress_ops);
+    return decompress_threads ? 0 : -1;
+}
 
-        qemu_thread_join(decompress_threads + i);
-        qemu_mutex_destroy(&decomp_param[i].mutex);
-        qemu_cond_destroy(&decomp_param[i].cond);
-        inflateEnd(&decomp_param[i].stream);
-        g_free(decomp_param[i].compbuf);
-        decomp_param[i].compbuf = NULL;
+static void decompress_fini(void)
+{
+    if (!decompress_threads) {
+        return;
     }
-    g_free(decompress_threads);
-    g_free(decomp_param);
+
+    threaded_workqueue_destroy(decompress_threads);
     decompress_threads = NULL;
-    decomp_param = NULL;
     decomp_file = NULL;
 }
 
-static int compress_threads_load_setup(QEMUFile *f)
+static int flush_decompressed_data(void)
 {
-    int i, thread_count;
-
     if (!migrate_use_compression()) {
         return 0;
     }
 
-    thread_count = migrate_decompress_threads();
-    decompress_threads = g_new0(QemuThread, thread_count);
-    decomp_param = g_new0(DecompressParam, thread_count);
-    qemu_mutex_init(&decomp_done_lock);
-    qemu_cond_init(&decomp_done_cond);
-    decomp_file = f;
-    for (i = 0; i < thread_count; i++) {
-        if (inflateInit(&decomp_param[i].stream) != Z_OK) {
-            goto exit;
-        }
-
-        decomp_param[i].compbuf = g_malloc0(compressBound(TARGET_PAGE_SIZE));
-        qemu_mutex_init(&decomp_param[i].mutex);
-        qemu_cond_init(&decomp_param[i].cond);
-        decomp_param[i].done = true;
-        decomp_param[i].quit = false;
-        qemu_thread_create(decompress_threads + i, "decompress",
-                           do_data_decompress, decomp_param + i,
-                           QEMU_THREAD_JOINABLE);
-    }
-    return 0;
-exit:
-    compress_threads_load_cleanup();
-    return -1;
+    threaded_workqueue_wait_for_requests(decompress_threads);
+    return qemu_file_get_error(decomp_file);
 }
 
 static void decompress_data_with_multi_threads(QEMUFile *f,
-                                               void *host, int len)
+                                               void *host, size_t len)
 {
-    int idx, thread_count;
+    DecompressData *dd;
 
-    thread_count = migrate_decompress_threads();
-    qemu_mutex_lock(&decomp_done_lock);
-    while (true) {
-        for (idx = 0; idx < thread_count; idx++) {
-            if (decomp_param[idx].done) {
-                decomp_param[idx].done = false;
-                qemu_mutex_lock(&decomp_param[idx].mutex);
-                qemu_get_buffer(f, decomp_param[idx].compbuf, len);
-                decomp_param[idx].des = host;
-                decomp_param[idx].len = len;
-                qemu_cond_signal(&decomp_param[idx].cond);
-                qemu_mutex_unlock(&decomp_param[idx].mutex);
-                break;
-            }
-        }
-        if (idx < thread_count) {
-            break;
-        } else {
-            qemu_cond_wait(&decomp_done_cond, &decomp_done_lock);
-        }
+retry:
+    dd = threaded_workqueue_get_request(decompress_threads);
+    if (!dd) {
+        goto retry;
     }
-    qemu_mutex_unlock(&decomp_done_lock);
+
+    dd->des = host;
+    dd->len = len;
+    qemu_get_buffer(f, dd->compbuf, len);
+    threaded_workqueue_submit_request(decompress_threads, dd);
 }
 
 /*
@@ -3678,7 +3610,7 @@ void colo_release_ram_cache(void)
  */
 static int ram_load_setup(QEMUFile *f, void *opaque)
 {
-    if (compress_threads_load_setup(f)) {
+    if (decompress_init(f)) {
         return -1;
     }
 
@@ -3699,7 +3631,7 @@ static int ram_load_cleanup(void *opaque)
     }
 
     xbzrle_load_cleanup();
-    compress_threads_load_cleanup();
+    decompress_fini();
 
     RAMBLOCK_FOREACH_MIGRATABLE(rb) {
         g_free(rb->receivedmap);
@@ -4101,7 +4033,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
         }
     }
 
-    ret |= wait_for_decompress_done();
+    ret |= flush_decompressed_data();
     rcu_read_unlock();
     trace_ram_load_complete(ret, seq_iter);
 
-- 
2.14.5

^ permalink raw reply related	[flat|nested] 70+ messages in thread
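
The compression and decompression adaptations in this series plug into the same four-callback contract: thread_request_init() and thread_request_uninit() run once per request slot to set up and tear down per-request state (z_stream, buffers), thread_request_handler() does the work on a worker thread, and, as the compress side's use of rs->f implies, thread_request_done() harvests results back on the submitter side. A minimal hypothetical user of that contract, kept deliberately trivial (EchoData and the echo_* functions are illustrative, not part of the series):

/* assumes glib and the threaded-workqueue header from patch 2 */
typedef struct EchoData {
    char buf[64];                   /* per-request payload */
} EchoData;

static int echo_init(void *request)
{
    return 0;                       /* nothing per-slot to allocate */
}

static void echo_uninit(void *request)
{
}

/* runs on a worker thread, like do_compress_ram_page() above */
static void echo_handler(void *request)
{
    EchoData *d = request;
    char *p;

    for (p = d->buf; *p; p++) {
        *p = g_ascii_toupper(*p);
    }
}

/* runs on the submitter side, like the qemu_put_qemu_file() flush above */
static void echo_done(void *request)
{
    EchoData *d = request;

    printf("%s\n", d->buf);
}

static const ThreadedWorkqueueOps echo_ops = {
    .thread_request_init    = echo_init,
    .thread_request_uninit  = echo_uninit,
    .thread_request_handler = echo_handler,
    .thread_request_done    = echo_done,
    .request_size           = sizeof(EchoData),
};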

* [PATCH v3 5/5] tests: add threaded-workqueue-bench
  2018-11-22  7:20 ` [Qemu-devel] " guangrong.xiao
@ 2018-11-22  7:20   ` guangrong.xiao
  -1 siblings, 0 replies; 70+ messages in thread
From: guangrong.xiao @ 2018-11-22  7:20 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, quintela, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, cota, jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

It's the benchmark of threaded-workqueue; it is also a good
example of how threaded-workqueue is used

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 tests/Makefile.include           |   5 +-
 tests/threaded-workqueue-bench.c | 255 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 259 insertions(+), 1 deletion(-)
 create mode 100644 tests/threaded-workqueue-bench.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 613242bc6e..05ad27e75d 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -500,7 +500,8 @@ test-obj-y = tests/check-qnum.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-rcu-tailq.o \
 	tests/test-qdist.o tests/test-shift128.o \
 	tests/test-qht.o tests/qht-bench.o tests/test-qht-par.o \
-	tests/atomic_add-bench.o tests/atomic64-bench.o
+	tests/atomic_add-bench.o tests/atomic64-bench.o \
+	tests/threaded-workqueue-bench.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -557,6 +558,8 @@ tests/qht-bench$(EXESUF): tests/qht-bench.o $(test-util-obj-y)
 tests/test-bufferiszero$(EXESUF): tests/test-bufferiszero.o $(test-util-obj-y)
 tests/atomic_add-bench$(EXESUF): tests/atomic_add-bench.o $(test-util-obj-y)
 tests/atomic64-bench$(EXESUF): tests/atomic64-bench.o $(test-util-obj-y)
+tests/threaded-workqueue-bench$(EXESUF): tests/threaded-workqueue-bench.o migration/qemu-file.o \
+	$(test-util-obj-y)
 
 tests/fp/%:
 	$(MAKE) -C $(dir $@) $(notdir $@)
diff --git a/tests/threaded-workqueue-bench.c b/tests/threaded-workqueue-bench.c
new file mode 100644
index 0000000000..0d04948ed3
--- /dev/null
+++ b/tests/threaded-workqueue-bench.c
@@ -0,0 +1,255 @@
+/*
+ * Threaded Workqueue Benchmark
+ *
+ * Author:
+ *   Xiao Guangrong <xiaoguangrong@tencent.com>
+ *
+ * Copyright(C) 2018 Tencent Corporation.
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+#include <zlib.h>
+
+#include "qemu/osdep.h"
+#include "exec/cpu-common.h"
+#include "qemu/error-report.h"
+#include "migration/qemu-file.h"
+#include "qemu/threaded-workqueue.h"
+
+#define PAGE_SHIFT              12
+#define PAGE_SIZE               (1 << PAGE_SHIFT)
+#define DEFAULT_THREAD_NR       2
+#define DEFAULT_MEM_SIZE        1
+#define DEFAULT_REPEATED_COUNT  3
+
+static ssize_t test_writev_buffer(void *opaque, struct iovec *iov, int iovcnt,
+                                   int64_t pos)
+{
+    int i, size = 0;
+
+    for (i = 0; i < iovcnt; i++) {
+        size += iov[i].iov_len;
+    }
+    return size;
+}
+
+static int test_fclose(void *opaque)
+{
+    return 0;
+}
+
+static const QEMUFileOps test_write_ops = {
+    .writev_buffer  = test_writev_buffer,
+    .close          = test_fclose
+};
+
+static QEMUFile *dest_file;
+
+static const QEMUFileOps empty_ops = { };
+
+struct CompressData {
+    uint8_t *ram_addr;
+    QEMUFile *file;
+    z_stream stream;
+};
+typedef struct CompressData CompressData;
+
+static int compress_request_init(void *request)
+{
+    CompressData *cd = request;
+
+    if (deflateInit(&cd->stream, 1) != Z_OK) {
+        return -1;
+    }
+    cd->file = qemu_fopen_ops(NULL, &empty_ops);
+    return 0;
+}
+
+static void compress_request_uninit(void *request)
+{
+    CompressData *cd = request;
+
+    qemu_fclose(cd->file);
+    deflateEnd(&cd->stream);
+}
+
+static void compress_thread_data_handler(void *request)
+{
+    CompressData *cd = request;
+    int blen;
+
+    blen = qemu_put_compression_data(cd->file, &cd->stream, cd->ram_addr,
+                                     PAGE_SIZE);
+    if (blen < 0) {
+        error_report("compressed data failed!");
+        qemu_file_set_error(dest_file, blen);
+    }
+}
+
+struct CompressStats {
+    unsigned long pages;
+    unsigned long compressed_size;
+};
+typedef struct CompressStats CompressStats;
+
+static CompressStats comp_stats;
+
+static void compress_thread_data_done(void *request)
+{
+    CompressData *cd = request;
+    int bytes_xmit;
+
+    bytes_xmit = qemu_put_qemu_file(dest_file, cd->file);
+
+    comp_stats.pages++;
+    comp_stats.compressed_size += bytes_xmit;
+}
+
+static const ThreadedWorkqueueOps ops = {
+    .thread_request_init = compress_request_init,
+    .thread_request_uninit = compress_request_uninit,
+    .thread_request_handler = compress_thread_data_handler,
+    .thread_request_done = compress_thread_data_done,
+    .request_size = sizeof(CompressData),
+};
+
+static void compress_threads_save_cleanup(Threads *threads)
+{
+    threaded_workqueue_destroy(threads);
+    qemu_fclose(dest_file);
+}
+
+static Threads *compress_threads_save_setup(int threads_nr, int requests_nr)
+{
+    Threads *compress_threads;
+
+    dest_file = qemu_fopen_ops(NULL, &test_write_ops);
+    compress_threads = threaded_workqueue_create("compress", threads_nr,
+                                                 requests_nr, &ops);
+    assert(compress_threads);
+    return compress_threads;
+}
+
+static void compress_page_with_multi_thread(Threads *threads, uint8_t *addr)
+{
+    CompressData *cd;
+
+retry:
+    cd = threaded_workqueue_get_request(threads);
+    if (!cd) {
+        goto retry;
+    }
+
+    cd->ram_addr = addr;
+    threaded_workqueue_submit_request(threads, cd);
+}
+
+static void run(Threads *threads, uint8_t *mem, unsigned long mem_size,
+                int repeated_count)
+{
+    uint8_t *ptr = mem, *end = mem + mem_size;
+    uint64_t start_ts, spend, total_ts = 0, pages = mem_size >> PAGE_SHIFT;
+    double rate;
+    int i;
+
+    for (i = 0; i < repeated_count; i++) {
+        ptr = mem;
+        memset(&comp_stats, 0, sizeof(comp_stats));
+
+        start_ts = g_get_monotonic_time();
+        for (ptr = mem; ptr < end; ptr += PAGE_SIZE) {
+            *ptr = 0x10;
+            compress_page_with_multi_thread(threads, ptr);
+        }
+        threaded_workqueue_wait_for_requests(threads);
+        spend = g_get_monotonic_time() - start_ts;
+        total_ts += spend;
+
+        if (comp_stats.pages != pages) {
+            printf("ERROR: pages are compressed %ld, expect %ld.\n",
+                   comp_stats.pages, pages);
+            exit(-1);
+        }
+
+        rate = (double)(comp_stats.pages * PAGE_SIZE) /
+                        comp_stats.compressed_size;
+        printf("RUN %d: Request # %ld Cost %ld, Compression Rate %f.\n", i,
+               comp_stats.pages, spend, rate);
+    }
+
+    printf("AVG: Time Cost %ld\n", total_ts / repeated_count);
+    printf("AVG Throughput: %f GB/s\n",
+           (double)(mem_size >> 30) * repeated_count * 1e6 / total_ts);
+}
+
+static void usage(const char *arg0)
+{
+    printf("\nThreaded Workqueue Benchmark.\n");
+    printf("Usage:\n");
+    printf("  %s [OPTIONS]\n", arg0);
+    printf("Options:\n");
+    printf("   -t        the number of threads (default %d).\n",
+            DEFAULT_THREAD_NR);
+    printf("   -r:       the number of requests handled by each thread (default %d).\n",
+            DEFAULT_THREAD_REQUEST_NR);
+    printf("   -m:       the size of the memory (G) used to test (default %dG).\n",
+            DEFAULT_MEM_SIZE);
+    printf("   -c:       the repeated count (default %d).\n",
+            DEFAULT_REPEATED_COUNT);
+    printf("   -h        show this help info.\n");
+}
+
+int main(int argc, char *argv[])
+{
+    int c, threads_nr, requests_nr, repeated_count;
+    unsigned long mem_size;
+    uint8_t *mem;
+    Threads *threads;
+
+    threads_nr = DEFAULT_THREAD_NR;
+    requests_nr = DEFAULT_THREAD_REQUEST_NR;
+    mem_size = DEFAULT_MEM_SIZE;
+    repeated_count = DEFAULT_REPEATED_COUNT;
+
+    for (;;) {
+        c = getopt(argc, argv, "t:r:m:c:h");
+        if (c < 0) {
+            break;
+        }
+
+        switch (c) {
+        case 't':
+            threads_nr = atoi(optarg);
+            break;
+        case 'r':
+            requests_nr = atoi(optarg);
+            break;
+        case 'm':
+            mem_size = atol(optarg);
+            break;
+        case 'c':
+            repeated_count = atoi(optarg);
+            break;
+        default:
+            printf("Unkown option: %c.\n", c);
+        case 'h':
+            usage(argv[0]);
+            return -1;
+        }
+    }
+
+    printf("Run the benchmark: threads %d requests-per-thread: %d memory %ldG repeat %d.\n",
+            threads_nr, requests_nr, mem_size, repeated_count);
+
+    mem_size = mem_size << 30;
+    mem = qemu_memalign(PAGE_SIZE, mem_size);
+    memset(mem, 0, mem_size);
+
+    threads = compress_threads_save_setup(threads_nr, requests_nr);
+    run(threads, mem, mem_size, repeated_count);
+    compress_threads_save_cleanup(threads);
+
+    qemu_vfree(mem);
+    return 0;
+}
-- 
2.14.5

^ permalink raw reply related	[flat|nested] 70+ messages in thread
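
Going by the usage() text above, a run that exercises 8 worker threads over a 16 GB buffer, repeated 3 times, would be invoked as follows (the -r value is illustrative; its default, DEFAULT_THREAD_REQUEST_NR, comes from the workqueue header added earlier in the series):

    ./tests/threaded-workqueue-bench -t 8 -r 64 -m 16 -c 3

The throughput that run() prints is (mem_size >> 30) * repeated_count * 1e6 / total_ts, i.e. gigabytes processed per second, since g_get_monotonic_time() reports microseconds.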

* Re: [PATCH v3 0/5] migration: improve multithreads
  2018-11-22  7:20 ` [Qemu-devel] " guangrong.xiao
@ 2018-11-22 21:25   ` no-reply
  -1 siblings, 0 replies; 70+ messages in thread
From: no-reply @ 2018-11-22 21:25 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: famz, kvm, quintela, mtosatti, xiaoguangrong, qemu-devel, peterx,
	dgilbert, wei.w.wang, cota, mst, jiang.biao2, pbonzini

Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

Message-id: 20181122072028.22819-1-xiaoguangrong@tencent.com
Type: series
Subject: [Qemu-devel] [PATCH v3 0/5] migration: improve multithreads

=== TEST SCRIPT BEGIN ===
#!/bin/bash
time make docker-test-mingw@fedora SHOW_ENV=1 J=8
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
1f40f88 tests: add threaded-workqueue-bench
0c560ac migration: use threaded workqueue for decompression
effdcb4 migration: use threaded workqueue for compression
eb91c63 util: introduce threaded workqueue
3bf8b44 bitops: introduce change_bit_atomic

=== OUTPUT BEGIN ===
  BUILD   fedora
make[1]: Entering directory `/var/tmp/patchew-tester-tmp-1vusqh9v/src'
  GEN     /var/tmp/patchew-tester-tmp-1vusqh9v/src/docker-src.2018-11-22-16.24.23.23218/qemu.tar
Cloning into '/var/tmp/patchew-tester-tmp-1vusqh9v/src/docker-src.2018-11-22-16.24.23.23218/qemu.tar.vroot'...
done.
Submodule 'dtc' (https://git.qemu.org/git/dtc.git) registered for path 'dtc'
Cloning into 'dtc'...
Submodule path 'dtc': checked out '88f18909db731a627456f26d779445f84e449536'
Submodule 'ui/keycodemapdb' (https://git.qemu.org/git/keycodemapdb.git) registered for path 'ui/keycodemapdb'
Cloning into 'ui/keycodemapdb'...
Submodule path 'ui/keycodemapdb': checked out '6b3d716e2b6472eb7189d3220552280ef3d832ce'
  COPY    RUNNER
    RUN test-mingw in qemu:fedora 
Packages installed:
SDL2-devel-2.0.9-1.fc28.x86_64
bc-1.07.1-5.fc28.x86_64
bison-3.0.4-9.fc28.x86_64
bluez-libs-devel-5.50-1.fc28.x86_64
brlapi-devel-0.6.7-19.fc28.x86_64
bzip2-1.0.6-26.fc28.x86_64
bzip2-devel-1.0.6-26.fc28.x86_64
ccache-3.4.2-2.fc28.x86_64
clang-6.0.1-2.fc28.x86_64
device-mapper-multipath-devel-0.7.4-3.git07e7bd5.fc28.x86_64
findutils-4.6.0-19.fc28.x86_64
flex-2.6.1-7.fc28.x86_64
gcc-8.2.1-5.fc28.x86_64
gcc-c++-8.2.1-5.fc28.x86_64
gettext-0.19.8.1-14.fc28.x86_64
git-2.17.2-1.fc28.x86_64
glib2-devel-2.56.3-2.fc28.x86_64
glusterfs-api-devel-4.1.5-1.fc28.x86_64
gnutls-devel-3.6.4-1.fc28.x86_64
gtk3-devel-3.22.30-1.fc28.x86_64
hostname-3.20-3.fc28.x86_64
libaio-devel-0.3.110-11.fc28.x86_64
libasan-8.2.1-5.fc28.x86_64
libattr-devel-2.4.48-3.fc28.x86_64
libcap-devel-2.25-9.fc28.x86_64
libcap-ng-devel-0.7.9-4.fc28.x86_64
libcurl-devel-7.59.0-8.fc28.x86_64
libfdt-devel-1.4.7-1.fc28.x86_64
libpng-devel-1.6.34-6.fc28.x86_64
librbd-devel-12.2.8-1.fc28.x86_64
libssh2-devel-1.8.0-7.fc28.x86_64
libubsan-8.2.1-5.fc28.x86_64
libusbx-devel-1.0.22-1.fc28.x86_64
libxml2-devel-2.9.8-4.fc28.x86_64
llvm-6.0.1-8.fc28.x86_64
lzo-devel-2.08-12.fc28.x86_64
make-4.2.1-6.fc28.x86_64
mingw32-SDL2-2.0.9-1.fc28.noarch
mingw32-bzip2-1.0.6-9.fc27.noarch
mingw32-curl-7.57.0-1.fc28.noarch
mingw32-glib2-2.56.1-1.fc28.noarch
mingw32-gmp-6.1.2-2.fc27.noarch
mingw32-gnutls-3.6.3-1.fc28.noarch
mingw32-gtk3-3.22.30-1.fc28.noarch
mingw32-libjpeg-turbo-1.5.1-3.fc27.noarch
mingw32-libpng-1.6.29-2.fc27.noarch
mingw32-libssh2-1.8.0-3.fc27.noarch
mingw32-libtasn1-4.13-1.fc28.noarch
mingw32-nettle-3.4-1.fc28.noarch
mingw32-pixman-0.34.0-3.fc27.noarch
mingw32-pkg-config-0.28-9.fc27.x86_64
mingw64-SDL2-2.0.9-1.fc28.noarch
mingw64-bzip2-1.0.6-9.fc27.noarch
mingw64-curl-7.57.0-1.fc28.noarch
mingw64-glib2-2.56.1-1.fc28.noarch
mingw64-gmp-6.1.2-2.fc27.noarch
mingw64-gnutls-3.6.3-1.fc28.noarch
mingw64-gtk3-3.22.30-1.fc28.noarch
mingw64-libjpeg-turbo-1.5.1-3.fc27.noarch
mingw64-libpng-1.6.29-2.fc27.noarch
mingw64-libssh2-1.8.0-3.fc27.noarch
mingw64-libtasn1-4.13-1.fc28.noarch
mingw64-nettle-3.4-1.fc28.noarch
mingw64-pixman-0.34.0-3.fc27.noarch
mingw64-pkg-config-0.28-9.fc27.x86_64
ncurses-devel-6.1-5.20180224.fc28.x86_64
nettle-devel-3.4-2.fc28.x86_64
nss-devel-3.39.0-1.0.fc28.x86_64
numactl-devel-2.0.11-8.fc28.x86_64
package PyYAML is not installed
package libjpeg-devel is not installed
perl-5.26.2-414.fc28.x86_64
pixman-devel-0.34.0-8.fc28.x86_64
python3-3.6.6-1.fc28.x86_64
snappy-devel-1.1.7-5.fc28.x86_64
sparse-0.5.2-1.fc28.x86_64
spice-server-devel-0.14.0-4.fc28.x86_64
systemtap-sdt-devel-4.0-1.fc28.x86_64
tar-1.30-3.fc28.x86_64
usbredir-devel-0.8.0-1.fc28.x86_64
virglrenderer-devel-0.6.0-4.20170210git76b3da97b.fc28.x86_64
vte3-devel-0.36.5-6.fc28.x86_64
which-2.21-8.fc28.x86_64
xen-devel-4.10.2-2.fc28.x86_64
zlib-devel-1.2.11-8.fc28.x86_64

Environment variables:
TARGET_LIST=
PACKAGES=bc     bison     bluez-libs-devel     brlapi-devel     bzip2     bzip2-devel     ccache     clang     device-mapper-multipath-devel     findutils     flex     gcc     gcc-c++     gettext     git     glib2-devel     glusterfs-api-devel     gnutls-devel     gtk3-devel     hostname     libaio-devel     libasan     libattr-devel     libcap-devel     libcap-ng-devel     libcurl-devel     libfdt-devel     libjpeg-devel     libpng-devel     librbd-devel     libssh2-devel     libubsan     libusbx-devel     libxml2-devel     llvm     lzo-devel     make     mingw32-bzip2     mingw32-curl     mingw32-glib2     mingw32-gmp     mingw32-gnutls     mingw32-gtk3     mingw32-libjpeg-turbo     mingw32-libpng     mingw32-libssh2     mingw32-libtasn1     mingw32-nettle     mingw32-pixman     mingw32-pkg-config     mingw32-SDL2     mingw64-bzip2     mingw64-curl     mingw64-glib2     mingw64-gmp     mingw64-gnutls     mingw64-gtk3     mingw64-libjpeg-turbo     mingw64-libpng     mingw64-libssh2     mingw64-libtasn1     mingw64-nettle     mingw64-pixman     mingw64-pkg-config     mingw64-SDL2     ncurses-devel     nettle-devel     nss-devel     numactl-devel     perl     pixman-devel     python3     PyYAML     SDL2-devel     snappy-devel     sparse     spice-server-devel     systemtap-sdt-devel     tar     usbredir-devel     virglrenderer-devel     vte3-devel     which     xen-devel     zlib-devel
J=8
V=
HOSTNAME=1690db6f9c50
DEBUG=
SHOW_ENV=1
PWD=/
HOME=/
CCACHE_DIR=/var/tmp/ccache
FBR=f28
DISTTAG=f28container
QEMU_CONFIGURE_OPTS=--python=/usr/bin/python3
FGC=f28
TEST_DIR=/tmp/qemu-test
SHLVL=1
FEATURES=mingw clang pyyaml asan dtc
PATH=/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
MAKEFLAGS= -j8
EXTRA_CONFIGURE_OPTS=
_=/usr/bin/env

Configure options:
--enable-werror --target-list=x86_64-softmmu,aarch64-softmmu --prefix=/tmp/qemu-test/install --python=/usr/bin/python3 --cross-prefix=x86_64-w64-mingw32- --enable-trace-backends=simple --enable-gnutls --enable-nettle --enable-curl --enable-vnc --enable-bzip2 --enable-guest-agent --with-sdlabi=2.0
Install prefix    /tmp/qemu-test/install
BIOS directory    /tmp/qemu-test/install
firmware path     /tmp/qemu-test/install/share/qemu-firmware
binary directory  /tmp/qemu-test/install
library directory /tmp/qemu-test/install/lib
module directory  /tmp/qemu-test/install/lib
libexec directory /tmp/qemu-test/install/libexec
include directory /tmp/qemu-test/install/include
config directory  /tmp/qemu-test/install
local state directory   queried at runtime
Windows SDK       no
Source path       /tmp/qemu-test/src
GIT binary        git
GIT submodules    
C compiler        x86_64-w64-mingw32-gcc
Host C compiler   cc
C++ compiler      x86_64-w64-mingw32-g++
Objective-C compiler clang
ARFLAGS           rv
CFLAGS            -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -g 
QEMU_CFLAGS       -I/usr/x86_64-w64-mingw32/sys-root/mingw/include/pixman-1  -I$(SRC_PATH)/dtc/libfdt -Werror -DHAS_LIBSSH2_SFTP_FSYNC -I/usr/x86_64-w64-mingw32/sys-root/mingw/include  -mms-bitfields -I/usr/x86_64-w64-mingw32/sys-root/mingw/include/glib-2.0 -I/usr/x86_64-w64-mingw32/sys-root/mingw/lib/glib-2.0/include -I/usr/x86_64-w64-mingw32/sys-root/mingw/include  -m64 -mcx16 -mthreads -D__USE_MINGW_ANSI_STDIO=1 -DWIN32_LEAN_AND_MEAN -DWINVER=0x501 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv  -Wexpansion-to-defined -Wendif-labels -Wno-shift-negative-value -Wno-missing-include-dirs -Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-declaration -Wold-style-definition -Wtype-limits -fstack-protector-strong -I/usr/x86_64-w64-mingw32/sys-root/mingw/include -I/usr/x86_64-w64-mingw32/sys-root/mingw/include/p11-kit-1  -I/usr/x86_64-w64-mingw32/sys-root/mingw/include   -I/usr/x86_64-w64-mingw32/sys-root/mingw/include/libpng16 
LDFLAGS           -Wl,--nxcompat -Wl,--no-seh -Wl,--dynamicbase -Wl,--warn-common -m64 -g 
QEMU_LDFLAGS      -L$(BUILD_DIR)/dtc/libfdt 
make              make
install           install
python            /usr/bin/python3 -B
smbd              /usr/sbin/smbd
module support    no
host CPU          x86_64
host big endian   no
target list       x86_64-softmmu aarch64-softmmu
gprof enabled     no
sparse enabled    no
strip binaries    yes
profiler          no
static build      no
SDL support       yes (2.0.9)
GTK support       yes (3.22.30)
GTK GL support    no
VTE support       no 
TLS priority      NORMAL
GNUTLS support    yes
libgcrypt         no
nettle            yes (3.4)
libtasn1          yes
curses support    no
virgl support     no 
curl support      yes
mingw32 support   yes
Audio drivers     dsound
Block whitelist (rw) 
Block whitelist (ro) 
VirtFS support    no
Multipath support no
VNC support       yes
VNC SASL support  no
VNC JPEG support  yes
VNC PNG support   yes
xen support       no
brlapi support    no
bluez  support    no
Documentation     no
PIE               no
vde support       no
netmap support    no
Linux AIO support no
ATTR/XATTR support no
Install blobs     yes
KVM support       no
HAX support       yes
HVF support       no
WHPX support      no
TCG support       yes
TCG debug enabled no
TCG interpreter   no
malloc trim support no
RDMA support      no
PVRDMA support    no
fdt support       git
membarrier        no
preadv support    no
fdatasync         no
madvise           no
posix_madvise     no
posix_memalign    no
libcap-ng support no
vhost-net support no
vhost-crypto support no
vhost-scsi support no
vhost-vsock support no
vhost-user support no
Trace backends    simple
Trace output file trace-<pid>
spice support     no 
rbd support       no
xfsctl support    no
smartcard support no
libusb            no
usb net redir     no
OpenGL support    no
OpenGL dmabufs    no
libiscsi support  no
libnfs support    no
build guest agent yes
QGA VSS support   no
QGA w32 disk info yes
QGA MSI support   no
seccomp support   no
coroutine backend win32
coroutine pool    yes
debug stack usage no
mutex debugging   no
crypto afalg      no
GlusterFS support no
gcov              gcov
gcov enabled      no
TPM support       yes
libssh2 support   yes
TPM passthrough   no
TPM emulator      no
QOM debugging     yes
Live block migration yes
lzo support       no
snappy support    no
bzip2 support     yes
NUMA host support no
libxml2           no
tcmalloc support  no
jemalloc support  no
avx2 optimization yes
replication support yes
VxHS block device no
bochs support     yes
cloop support     yes
dmg support       yes
qcow v1 support   yes
vdi support       yes
vvfat support     yes
qed support       yes
parallels support yes
sheepdog support  yes
capstone          no
docker            no
libpmem support   no
libudev           no

NOTE: cross-compilers enabled:  'x86_64-w64-mingw32-gcc'
  GEN     x86_64-softmmu/config-devices.mak.tmp
  GEN     aarch64-softmmu/config-devices.mak.tmp
  GEN     config-host.h
  GEN     qemu-options.def
  GEN     qapi-gen
  GEN     trace/generated-tcg-tracers.h
  GEN     trace/generated-helpers-wrappers.h
  GEN     trace/generated-helpers.h
  GEN     trace/generated-helpers.c
  GEN     aarch64-softmmu/config-devices.mak
  GEN     x86_64-softmmu/config-devices.mak
  GEN     module_block.h
  GEN     ui/input-keymap-atset1-to-qcode.c
  GEN     ui/input-keymap-linux-to-qcode.c
  GEN     ui/input-keymap-qcode-to-atset1.c
  GEN     ui/input-keymap-qcode-to-atset2.c
  GEN     ui/input-keymap-qcode-to-atset3.c
  GEN     ui/input-keymap-qcode-to-linux.c
  GEN     ui/input-keymap-qcode-to-qnum.c
  GEN     ui/input-keymap-qcode-to-sun.c
  GEN     ui/input-keymap-qnum-to-qcode.c
  GEN     ui/input-keymap-usb-to-qcode.c
  GEN     ui/input-keymap-win32-to-qcode.c
  GEN     ui/input-keymap-x11-to-qcode.c
  GEN     ui/input-keymap-xorgevdev-to-qcode.c
  GEN     ui/input-keymap-xorgkbd-to-qcode.c
  GEN     ui/input-keymap-xorgxquartz-to-qcode.c
  GEN     ui/input-keymap-xorgxwin-to-qcode.c
  GEN     ui/input-keymap-osx-to-qcode.c
  GEN     tests/test-qapi-gen
  GEN     trace-root.h
  GEN     accel/kvm/trace.h
  GEN     accel/tcg/trace.h
  GEN     audio/trace.h
  GEN     block/trace.h
  GEN     chardev/trace.h
  GEN     crypto/trace.h
  GEN     hw/9pfs/trace.h
  GEN     hw/acpi/trace.h
  GEN     hw/alpha/trace.h
  GEN     hw/arm/trace.h
  GEN     hw/audio/trace.h
  GEN     hw/block/trace.h
  GEN     hw/block/dataplane/trace.h
  GEN     hw/char/trace.h
  GEN     hw/display/trace.h
  GEN     hw/dma/trace.h
  GEN     hw/hppa/trace.h
  GEN     hw/i2c/trace.h
  GEN     hw/i386/trace.h
  GEN     hw/i386/xen/trace.h
  GEN     hw/ide/trace.h
  GEN     hw/input/trace.h
  GEN     hw/intc/trace.h
  GEN     hw/isa/trace.h
  GEN     hw/mem/trace.h
  GEN     hw/misc/trace.h
  GEN     hw/misc/macio/trace.h
  GEN     hw/net/trace.h
  GEN     hw/nvram/trace.h
  GEN     hw/pci/trace.h
  GEN     hw/pci-host/trace.h
  GEN     hw/ppc/trace.h
  GEN     hw/rdma/trace.h
  GEN     hw/rdma/vmw/trace.h
  GEN     hw/s390x/trace.h
  GEN     hw/scsi/trace.h
  GEN     hw/sd/trace.h
  GEN     hw/sparc/trace.h
  GEN     hw/sparc64/trace.h
  GEN     hw/timer/trace.h
  GEN     hw/tpm/trace.h
  GEN     hw/usb/trace.h
  GEN     hw/vfio/trace.h
  GEN     hw/virtio/trace.h
  GEN     hw/watchdog/trace.h
  GEN     hw/xen/trace.h
  GEN     io/trace.h
  GEN     linux-user/trace.h
  GEN     migration/trace.h
  GEN     nbd/trace.h
  GEN     net/trace.h
  GEN     qapi/trace.h
  GEN     qom/trace.h
  GEN     scsi/trace.h
  GEN     target/arm/trace.h
  GEN     target/i386/trace.h
  GEN     target/mips/trace.h
  GEN     target/ppc/trace.h
  GEN     target/s390x/trace.h
  GEN     target/sparc/trace.h
  GEN     ui/trace.h
  GEN     util/trace.h
  GEN     trace-root.c
  GEN     accel/kvm/trace.c
  GEN     accel/tcg/trace.c
  GEN     audio/trace.c
  GEN     block/trace.c
  GEN     chardev/trace.c
  GEN     crypto/trace.c
  GEN     hw/9pfs/trace.c
  GEN     hw/acpi/trace.c
  GEN     hw/alpha/trace.c
  GEN     hw/arm/trace.c
  GEN     hw/audio/trace.c
  GEN     hw/block/trace.c
  GEN     hw/block/dataplane/trace.c
  GEN     hw/char/trace.c
  GEN     hw/display/trace.c
  GEN     hw/dma/trace.c
  GEN     hw/hppa/trace.c
  GEN     hw/i2c/trace.c
  GEN     hw/i386/trace.c
  GEN     hw/i386/xen/trace.c
  GEN     hw/ide/trace.c
  GEN     hw/input/trace.c
  GEN     hw/intc/trace.c
  GEN     hw/isa/trace.c
  GEN     hw/mem/trace.c
  GEN     hw/misc/trace.c
  GEN     hw/misc/macio/trace.c
  GEN     hw/net/trace.c
  GEN     hw/nvram/trace.c
  GEN     hw/pci/trace.c
  GEN     hw/pci-host/trace.c
  GEN     hw/ppc/trace.c
  GEN     hw/rdma/trace.c
  GEN     hw/rdma/vmw/trace.c
  GEN     hw/s390x/trace.c
  GEN     hw/scsi/trace.c
  GEN     hw/sd/trace.c
  GEN     hw/sparc/trace.c
  GEN     hw/sparc64/trace.c
  GEN     hw/timer/trace.c
  GEN     hw/tpm/trace.c
  GEN     hw/usb/trace.c
  GEN     hw/vfio/trace.c
  GEN     hw/virtio/trace.c
  GEN     hw/watchdog/trace.c
  GEN     hw/xen/trace.c
  GEN     io/trace.c
  GEN     linux-user/trace.c
  GEN     migration/trace.c
  GEN     nbd/trace.c
  GEN     net/trace.c
  GEN     qapi/trace.c
  GEN     qom/trace.c
  GEN     scsi/trace.c
  GEN     target/arm/trace.c
  GEN     target/i386/trace.c
  GEN     target/mips/trace.c
  GEN     target/ppc/trace.c
  GEN     target/s390x/trace.c
  GEN     target/sparc/trace.c
  GEN     ui/trace.c
  GEN     util/trace.c
  GEN     config-all-devices.mak
	 DEP /tmp/qemu-test/src/dtc/tests/dumptrees.c
	 DEP /tmp/qemu-test/src/dtc/tests/trees.S
	 DEP /tmp/qemu-test/src/dtc/tests/value-labels.c
	 DEP /tmp/qemu-test/src/dtc/tests/testutils.c
	 DEP /tmp/qemu-test/src/dtc/tests/asm_tree_dump.c
	 DEP /tmp/qemu-test/src/dtc/tests/truncated_memrsv.c
	 DEP /tmp/qemu-test/src/dtc/tests/truncated_string.c
	 DEP /tmp/qemu-test/src/dtc/tests/truncated_property.c
	 DEP /tmp/qemu-test/src/dtc/tests/check_full.c
	 DEP /tmp/qemu-test/src/dtc/tests/check_path.c
	 DEP /tmp/qemu-test/src/dtc/tests/check_header.c
	 DEP /tmp/qemu-test/src/dtc/tests/overlay_bad_fixup.c
	 DEP /tmp/qemu-test/src/dtc/tests/overlay.c
	 DEP /tmp/qemu-test/src/dtc/tests/subnode_iterate.c
	 DEP /tmp/qemu-test/src/dtc/tests/integer-expressions.c
	 DEP /tmp/qemu-test/src/dtc/tests/property_iterate.c
	 DEP /tmp/qemu-test/src/dtc/tests/utilfdt_test.c
	 DEP /tmp/qemu-test/src/dtc/tests/path_offset_aliases.c
	 DEP /tmp/qemu-test/src/dtc/tests/add_subnode_with_nops.c
	 DEP /tmp/qemu-test/src/dtc/tests/dtbs_equal_unordered.c
	 DEP /tmp/qemu-test/src/dtc/tests/dtb_reverse.c
	 DEP /tmp/qemu-test/src/dtc/tests/dtbs_equal_ordered.c
	 DEP /tmp/qemu-test/src/dtc/tests/extra-terminating-null.c
	 DEP /tmp/qemu-test/src/dtc/tests/incbin.c
	 DEP /tmp/qemu-test/src/dtc/tests/boot-cpuid.c
	 DEP /tmp/qemu-test/src/dtc/tests/phandle_format.c
	 DEP /tmp/qemu-test/src/dtc/tests/path-references.c
	 DEP /tmp/qemu-test/src/dtc/tests/references.c
	 DEP /tmp/qemu-test/src/dtc/tests/string_escapes.c
	 DEP /tmp/qemu-test/src/dtc/tests/propname_escapes.c
	 DEP /tmp/qemu-test/src/dtc/tests/appendprop2.c
	 DEP /tmp/qemu-test/src/dtc/tests/appendprop1.c
	 DEP /tmp/qemu-test/src/dtc/tests/del_node.c
	 DEP /tmp/qemu-test/src/dtc/tests/del_property.c
	 DEP /tmp/qemu-test/src/dtc/tests/setprop.c
	 DEP /tmp/qemu-test/src/dtc/tests/set_name.c
	 DEP /tmp/qemu-test/src/dtc/tests/rw_tree1.c
	 DEP /tmp/qemu-test/src/dtc/tests/open_pack.c
	 DEP /tmp/qemu-test/src/dtc/tests/nopulate.c
	 DEP /tmp/qemu-test/src/dtc/tests/mangle-layout.c
	 DEP /tmp/qemu-test/src/dtc/tests/move_and_save.c
	 DEP /tmp/qemu-test/src/dtc/tests/sw_states.c
	 DEP /tmp/qemu-test/src/dtc/tests/sw_tree1.c
	 DEP /tmp/qemu-test/src/dtc/tests/nop_node.c
	 DEP /tmp/qemu-test/src/dtc/tests/nop_property.c
	 DEP /tmp/qemu-test/src/dtc/tests/setprop_inplace.c
	 DEP /tmp/qemu-test/src/dtc/tests/stringlist.c
	 DEP /tmp/qemu-test/src/dtc/tests/addr_size_cells2.c
	 DEP /tmp/qemu-test/src/dtc/tests/addr_size_cells.c
	 DEP /tmp/qemu-test/src/dtc/tests/notfound.c
	 DEP /tmp/qemu-test/src/dtc/tests/sized_cells.c
	 DEP /tmp/qemu-test/src/dtc/tests/char_literal.c
	 DEP /tmp/qemu-test/src/dtc/tests/get_alias.c
	 DEP /tmp/qemu-test/src/dtc/tests/node_offset_by_compatible.c
	 DEP /tmp/qemu-test/src/dtc/tests/node_check_compatible.c
	 DEP /tmp/qemu-test/src/dtc/tests/node_offset_by_phandle.c
	 DEP /tmp/qemu-test/src/dtc/tests/node_offset_by_prop_value.c
	 DEP /tmp/qemu-test/src/dtc/tests/parent_offset.c
	 DEP /tmp/qemu-test/src/dtc/tests/supernode_atdepth_offset.c
	 DEP /tmp/qemu-test/src/dtc/tests/get_path.c
	 DEP /tmp/qemu-test/src/dtc/tests/get_phandle.c
	 DEP /tmp/qemu-test/src/dtc/tests/getprop.c
	 DEP /tmp/qemu-test/src/dtc/tests/get_name.c
	 DEP /tmp/qemu-test/src/dtc/tests/path_offset.c
	 DEP /tmp/qemu-test/src/dtc/tests/subnode_offset.c
	 DEP /tmp/qemu-test/src/dtc/tests/find_property.c
	 DEP /tmp/qemu-test/src/dtc/tests/root_node.c
	 DEP /tmp/qemu-test/src/dtc/tests/get_mem_rsv.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_overlay.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_addresses.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_empty_tree.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_strerror.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_rw.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_sw.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_wip.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_ro.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt.c
	 DEP /tmp/qemu-test/src/dtc/util.c
	 DEP /tmp/qemu-test/src/dtc/fdtoverlay.c
	 DEP /tmp/qemu-test/src/dtc/fdtput.c
	 DEP /tmp/qemu-test/src/dtc/fdtget.c
	 DEP /tmp/qemu-test/src/dtc/fdtdump.c
	 LEX convert-dtsv0-lexer.lex.c
	 DEP /tmp/qemu-test/src/dtc/srcpos.c
	 BISON dtc-parser.tab.c
	 LEX dtc-lexer.lex.c
	 DEP /tmp/qemu-test/src/dtc/treesource.c
	 DEP /tmp/qemu-test/src/dtc/livetree.c
	 DEP /tmp/qemu-test/src/dtc/fstree.c
	 DEP /tmp/qemu-test/src/dtc/flattree.c
	 DEP /tmp/qemu-test/src/dtc/dtc.c
	 DEP /tmp/qemu-test/src/dtc/data.c
	 DEP /tmp/qemu-test/src/dtc/checks.c
	 DEP convert-dtsv0-lexer.lex.c
	 DEP dtc-parser.tab.c
	 DEP dtc-lexer.lex.c
	CHK version_gen.h
	UPD version_gen.h
	 DEP /tmp/qemu-test/src/dtc/util.c
	 CC libfdt/fdt.o
	 CC libfdt/fdt_ro.o
	 CC libfdt/fdt_wip.o
	 CC libfdt/fdt_sw.o
	 CC libfdt/fdt_rw.o
	 CC libfdt/fdt_strerror.o
	 CC libfdt/fdt_empty_tree.o
	 CC libfdt/fdt_addresses.o
	 CC libfdt/fdt_overlay.o
	 AR libfdt/libfdt.a
x86_64-w64-mingw32-ar: creating libfdt/libfdt.a
a - libfdt/fdt.o
a - libfdt/fdt_ro.o
a - libfdt/fdt_wip.o
a - libfdt/fdt_sw.o
a - libfdt/fdt_rw.o
a - libfdt/fdt_strerror.o
a - libfdt/fdt_empty_tree.o
a - libfdt/fdt_addresses.o
a - libfdt/fdt_overlay.o
  RC      version.o
  GEN     qga/qapi-generated/qapi-gen
  CC      qapi/qapi-builtin-types.o
  CC      qapi/qapi-types.o
  CC      qapi/qapi-types-block-core.o
  CC      qapi/qapi-types-block.o
  CC      qapi/qapi-types-char.o
  CC      qapi/qapi-types-common.o
  CC      qapi/qapi-types-crypto.o
  CC      qapi/qapi-types-introspect.o
  CC      qapi/qapi-types-job.o
  CC      qapi/qapi-types-migration.o
  CC      qapi/qapi-types-misc.o
  CC      qapi/qapi-types-net.o
  CC      qapi/qapi-types-rocker.o
  CC      qapi/qapi-types-run-state.o
  CC      qapi/qapi-types-sockets.o
  CC      qapi/qapi-types-tpm.o
  CC      qapi/qapi-types-trace.o
  CC      qapi/qapi-types-transaction.o
  CC      qapi/qapi-types-ui.o
  CC      qapi/qapi-builtin-visit.o
  CC      qapi/qapi-visit.o
  CC      qapi/qapi-visit-block-core.o
  CC      qapi/qapi-visit-block.o
  CC      qapi/qapi-visit-common.o
  CC      qapi/qapi-visit-char.o
  CC      qapi/qapi-visit-crypto.o
  CC      qapi/qapi-visit-introspect.o
  CC      qapi/qapi-visit-migration.o
  CC      qapi/qapi-visit-job.o
  CC      qapi/qapi-visit-misc.o
  CC      qapi/qapi-visit-net.o
  CC      qapi/qapi-visit-rocker.o
  CC      qapi/qapi-visit-run-state.o
  CC      qapi/qapi-visit-sockets.o
  CC      qapi/qapi-visit-tpm.o
  CC      qapi/qapi-visit-trace.o
  CC      qapi/qapi-visit-ui.o
  CC      qapi/qapi-visit-transaction.o
  CC      qapi/qapi-events.o
  CC      qapi/qapi-events-block-core.o
  CC      qapi/qapi-events-block.o
  CC      qapi/qapi-events-char.o
  CC      qapi/qapi-events-common.o
  CC      qapi/qapi-events-crypto.o
  CC      qapi/qapi-events-introspect.o
  CC      qapi/qapi-events-job.o
  CC      qapi/qapi-events-migration.o
  CC      qapi/qapi-events-misc.o
  CC      qapi/qapi-events-net.o
  CC      qapi/qapi-events-rocker.o
  CC      qapi/qapi-events-run-state.o
  CC      qapi/qapi-events-sockets.o
  CC      qapi/qapi-events-tpm.o
  CC      qapi/qapi-events-trace.o
  CC      qapi/qapi-events-transaction.o
  CC      qapi/qapi-events-ui.o
  CC      qapi/qapi-introspect.o
  CC      qapi/qapi-visit-core.o
  CC      qapi/qapi-dealloc-visitor.o
  CC      qapi/qobject-input-visitor.o
  CC      qapi/qobject-output-visitor.o
  CC      qapi/qmp-registry.o
  CC      qapi/qmp-dispatch.o
  CC      qapi/string-input-visitor.o
  CC      qapi/string-output-visitor.o
  CC      qapi/opts-visitor.o
  CC      qapi/qapi-clone-visitor.o
  CC      qapi/qmp-event.o
  CC      qapi/qapi-util.o
  CC      qobject/qnull.o
  CC      qobject/qnum.o
  CC      qobject/qstring.o
  CC      qobject/qdict.o
  CC      qobject/qlist.o
  CC      qobject/qbool.o
  CC      qobject/qlit.o
  CC      qobject/qjson.o
  CC      qobject/qobject.o
  CC      qobject/json-lexer.o
  CC      qobject/json-streamer.o
  CC      qobject/json-parser.o
  CC      qobject/block-qdict.o
  CC      trace/simple.o
  CC      trace/control.o
  CC      trace/qmp.o
  CC      util/osdep.o
  CC      util/cutils.o
  CC      util/unicode.o
  CC      util/qemu-timer-common.o
  CC      util/bufferiszero.o
  CC      util/lockcnt.o
  CC      util/aiocb.o
  CC      util/async.o
  CC      util/aio-wait.o
  CC      util/thread-pool.o
  CC      util/qemu-timer.o
  CC      util/main-loop.o
  CC      util/iohandler.o
  CC      util/aio-win32.o
  CC      util/event_notifier-win32.o
  CC      util/oslib-win32.o
  CC      util/qemu-thread-win32.o
  CC      util/envlist.o
  CC      util/path.o
  CC      util/module.o
  CC      util/host-utils.o
  CC      util/bitmap.o
  CC      util/bitops.o
  CC      util/hbitmap.o
  CC      util/fifo8.o
  CC      util/acl.o
  CC      util/cacheinfo.o
  CC      util/error.o
  CC      util/qemu-error.o
  CC      util/id.o
  CC      util/iov.o
  CC      util/qemu-config.o
  CC      util/qemu-sockets.o
  CC      util/uri.o
  CC      util/notify.o
  CC      util/qemu-option.o
  CC      util/qemu-progress.o
  CC      util/keyval.o
  CC      util/hexdump.o
  CC      util/crc32c.o
  CC      util/uuid.o
  CC      util/throttle.o
  CC      util/getauxval.o
  CC      util/readline.o
  CC      util/rcu.o
  CC      util/qemu-coroutine.o
  CC      util/qemu-coroutine-lock.o
  CC      util/qemu-coroutine-io.o
  CC      util/qemu-coroutine-sleep.o
  CC      util/coroutine-win32.o
  CC      util/buffer.o
  CC      util/timed-average.o
  CC      util/base64.o
  CC      util/log.o
  CC      util/pagesize.o
  CC      util/qdist.o
  CC      util/qht.o
  CC      util/qsp.o
  CC      util/range.o
  CC      util/stats64.o
  CC      util/systemd.o
  CC      util/iova-tree.o
  CC      util/threaded-workqueue.o
  CC      trace-root.o
  CC      accel/kvm/trace.o
  CC      accel/tcg/trace.o
  CC      audio/trace.o
  CC      block/trace.o
  CC      chardev/trace.o
/tmp/qemu-test/src/util/threaded-workqueue.c: In function 'get_free_request_bitmap':
/tmp/qemu-test/src/util/threaded-workqueue.c:135:16: error: passing argument 1 of 'bitmap_xor' from incompatible pointer type [-Werror=incompatible-pointer-types]
     bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
                ^
In file included from /tmp/qemu-test/src/util/threaded-workqueue.c:14:0:
/tmp/qemu-test/src/include/qemu/bitmap.h:154:20: note: expected 'long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
                    ^~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c:135:32: error: passing argument 2 of 'bitmap_xor' from incompatible pointer type [-Werror=incompatible-pointer-types]
     bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
                                ^
In file included from /tmp/qemu-test/src/util/threaded-workqueue.c:14:0:
/tmp/qemu-test/src/include/qemu/bitmap.h:154:20: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
                    ^~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c:135:54: error: passing argument 3 of 'bitmap_xor' from incompatible pointer type [-Werror=incompatible-pointer-types]
     bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
                                                      ^
In file included from /tmp/qemu-test/src/util/threaded-workqueue.c:14:0:
/tmp/qemu-test/src/include/qemu/bitmap.h:154:20: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
                    ^~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c: In function 'find_thread_free_request':
/tmp/qemu-test/src/util/threaded-workqueue.c:153:34: error: passing argument 1 of 'find_first_zero_bit' from incompatible pointer type [-Werror=incompatible-pointer-types]
     index  = find_first_zero_bit(&result_bitmap, threads->thread_requests_nr);
                                  ^
In file included from /tmp/qemu-test/src/include/qemu/bitmap.h:16:0,
                 from /tmp/qemu-test/src/util/threaded-workqueue.c:14:
/tmp/qemu-test/src/include/qemu/bitops.h:211:29: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline unsigned long find_first_zero_bit(const unsigned long *addr,
                             ^~~~~~~~~~~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c: In function 'thread_find_first_valid_request_index':
/tmp/qemu-test/src/util/threaded-workqueue.c:217:16: error: passing argument 1 of 'bitmap_xor' from incompatible pointer type [-Werror=incompatible-pointer-types]
     bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
                ^
In file included from /tmp/qemu-test/src/util/threaded-workqueue.c:14:0:
/tmp/qemu-test/src/include/qemu/bitmap.h:154:20: note: expected 'long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
                    ^~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c:217:32: error: passing argument 2 of 'bitmap_xor' from incompatible pointer type [-Werror=incompatible-pointer-types]
     bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
                                ^
In file included from /tmp/qemu-test/src/util/threaded-workqueue.c:14:0:
/tmp/qemu-test/src/include/qemu/bitmap.h:154:20: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
                    ^~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c:217:54: error: passing argument 3 of 'bitmap_xor' from incompatible pointer type [-Werror=incompatible-pointer-types]
     bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
                                                      ^
In file included from /tmp/qemu-test/src/util/threaded-workqueue.c:14:0:
/tmp/qemu-test/src/include/qemu/bitmap.h:154:20: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
                    ^~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c:225:28: error: passing argument 1 of 'find_first_bit' from incompatible pointer type [-Werror=incompatible-pointer-types]
     index = find_first_bit(&result_bitmap, threads->thread_requests_nr);
                            ^
In file included from /tmp/qemu-test/src/include/qemu/bitmap.h:16:0,
                 from /tmp/qemu-test/src/util/threaded-workqueue.c:14:
/tmp/qemu-test/src/include/qemu/bitops.h:188:29: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline unsigned long find_first_bit(const unsigned long *addr,
                             ^~~~~~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c: In function 'mark_request_free':
/tmp/qemu-test/src/util/threaded-workqueue.c:238:30: error: passing argument 2 of 'change_bit_atomic' from incompatible pointer type [-Werror=incompatible-pointer-types]
     change_bit_atomic(index, &thread->request_done_bitmap);
                              ^
In file included from /tmp/qemu-test/src/include/qemu/bitmap.h:16:0,
                 from /tmp/qemu-test/src/util/threaded-workqueue.c:14:
/tmp/qemu-test/src/include/qemu/bitops.h:87:20: note: expected 'long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void change_bit_atomic(long nr, unsigned long *addr)
                    ^~~~~~~~~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c: In function 'threaded_workqueue_wait_for_requests':
/tmp/qemu-test/src/util/threaded-workqueue.c:455:33: error: passing argument 2 of 'test_bit' from incompatible pointer type [-Werror=incompatible-pointer-types]
             if (test_bit(index, &result_bitmap)) {
                                 ^
In file included from /tmp/qemu-test/src/include/qemu/bitmap.h:16:0,
                 from /tmp/qemu-test/src/util/threaded-workqueue.c:14:
/tmp/qemu-test/src/include/qemu/bitops.h:145:19: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline int test_bit(long nr, const unsigned long *addr)
                   ^~~~~~~~
cc1: all warnings being treated as errors
make: *** [/tmp/qemu-test/src/rules.mak:69: util/threaded-workqueue.o] Error 1
make: *** Waiting for unfinished jobs....
  CC      crypto/trace.o
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 563, in <module>
    sys.exit(main())
  File "./tests/docker/docker.py", line 560, in main
    return args.cmdobj.run(args, argv)
  File "./tests/docker/docker.py", line 306, in run
    return Docker().run(argv, args.keep, quiet=args.quiet)
  File "./tests/docker/docker.py", line 274, in run
    quiet=quiet)
  File "./tests/docker/docker.py", line 181, in _do_check
    return subprocess.check_call(self._command + cmd, **kwargs)
  File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=07fae0c4ee9d11e8b22468b59973b7d0', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '--net=none', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=8', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-1vusqh9v/src/docker-src.2018-11-22-16.24.23.23218:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-1vusqh9v/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real	1m30.699s
user	0m16.848s
sys	0m3.570s
=== OUTPUT END ===

Test command exited with code: 2
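
[All of the errors above come from one word-size mismatch: the series
keeps each thread's request bitmap in a uint64_t, while QEMU's bitmap
helpers (bitmap_xor(), find_first_bit(), test_bit(), ...) operate on
unsigned long *.  On LP64 Linux hosts both types are 64 bits wide, so
the build passes there, but x86_64-w64-mingw32 is LLP64: unsigned long
is 32 bits and uint64_t is unsigned long long, so the pointer types
are genuinely incompatible and -Werror turns the warning into a hard
failure.  Below is a minimal standalone sketch of one portable
approach, manipulating the 64-bit word directly instead of aliasing it
as unsigned long; it illustrates the kind of change needed and is not
necessarily the fix the series adopted:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Return the index of the first zero bit in a 64-bit word, or 64 if
 * every bit is set.  Behaves identically on LP64 and LLP64 targets. */
static unsigned find_first_zero_bit64(uint64_t word)
{
    unsigned i;

    for (i = 0; i < 64; i++) {
        if (!(word & (UINT64_C(1) << i))) {
            return i;
        }
    }
    return 64;
}

int main(void)
{
    uint64_t fill = UINT64_C(0x0f);   /* requests submitted */
    uint64_t done = UINT64_C(0x03);   /* requests completed */
    /* filled but not yet done: the XOR the workqueue code computes */
    uint64_t pending = fill ^ done;

    printf("first free slot: %u\n", find_first_zero_bit64(fill));
    printf("pending mask: 0x%" PRIx64 "\n", pending);
    return 0;
}

Because this sketch never passes &fill to an unsigned long * API, it
compiles cleanly with -Werror=incompatible-pointer-types under both
native gcc and x86_64-w64-mingw32-gcc.]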


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

     bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
                                                      ^
In file included from /tmp/qemu-test/src/util/threaded-workqueue.c:14:0:
/tmp/qemu-test/src/include/qemu/bitmap.h:154:20: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
                    ^~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c: In function 'find_thread_free_request':
/tmp/qemu-test/src/util/threaded-workqueue.c:153:34: error: passing argument 1 of 'find_first_zero_bit' from incompatible pointer type [-Werror=incompatible-pointer-types]
     index  = find_first_zero_bit(&result_bitmap, threads->thread_requests_nr);
                                  ^
In file included from /tmp/qemu-test/src/include/qemu/bitmap.h:16:0,
                 from /tmp/qemu-test/src/util/threaded-workqueue.c:14:
/tmp/qemu-test/src/include/qemu/bitops.h:211:29: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline unsigned long find_first_zero_bit(const unsigned long *addr,
                             ^~~~~~~~~~~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c: In function 'thread_find_first_valid_request_index':
/tmp/qemu-test/src/util/threaded-workqueue.c:217:16: error: passing argument 1 of 'bitmap_xor' from incompatible pointer type [-Werror=incompatible-pointer-types]
     bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
                ^
In file included from /tmp/qemu-test/src/util/threaded-workqueue.c:14:0:
/tmp/qemu-test/src/include/qemu/bitmap.h:154:20: note: expected 'long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
                    ^~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c:217:32: error: passing argument 2 of 'bitmap_xor' from incompatible pointer type [-Werror=incompatible-pointer-types]
     bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
                                ^
In file included from /tmp/qemu-test/src/util/threaded-workqueue.c:14:0:
/tmp/qemu-test/src/include/qemu/bitmap.h:154:20: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
                    ^~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c:217:54: error: passing argument 3 of 'bitmap_xor' from incompatible pointer type [-Werror=incompatible-pointer-types]
     bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
                                                      ^
In file included from /tmp/qemu-test/src/util/threaded-workqueue.c:14:0:
/tmp/qemu-test/src/include/qemu/bitmap.h:154:20: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void bitmap_xor(unsigned long *dst, const unsigned long *src1,
                    ^~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c:225:28: error: passing argument 1 of 'find_first_bit' from incompatible pointer type [-Werror=incompatible-pointer-types]
     index = find_first_bit(&result_bitmap, threads->thread_requests_nr);
                            ^
In file included from /tmp/qemu-test/src/include/qemu/bitmap.h:16:0,
                 from /tmp/qemu-test/src/util/threaded-workqueue.c:14:
/tmp/qemu-test/src/include/qemu/bitops.h:188:29: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline unsigned long find_first_bit(const unsigned long *addr,
                             ^~~~~~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c: In function 'mark_request_free':
/tmp/qemu-test/src/util/threaded-workqueue.c:238:30: error: passing argument 2 of 'change_bit_atomic' from incompatible pointer type [-Werror=incompatible-pointer-types]
     change_bit_atomic(index, &thread->request_done_bitmap);
                              ^
In file included from /tmp/qemu-test/src/include/qemu/bitmap.h:16:0,
                 from /tmp/qemu-test/src/util/threaded-workqueue.c:14:
/tmp/qemu-test/src/include/qemu/bitops.h:87:20: note: expected 'long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline void change_bit_atomic(long nr, unsigned long *addr)
                    ^~~~~~~~~~~~~~~~~
/tmp/qemu-test/src/util/threaded-workqueue.c: In function 'threaded_workqueue_wait_for_requests':
/tmp/qemu-test/src/util/threaded-workqueue.c:455:33: error: passing argument 2 of 'test_bit' from incompatible pointer type [-Werror=incompatible-pointer-types]
             if (test_bit(index, &result_bitmap)) {
                                 ^
In file included from /tmp/qemu-test/src/include/qemu/bitmap.h:16:0,
                 from /tmp/qemu-test/src/util/threaded-workqueue.c:14:
/tmp/qemu-test/src/include/qemu/bitops.h:145:19: note: expected 'const long unsigned int *' but argument is of type 'uint64_t * {aka long long unsigned int *}'
 static inline int test_bit(long nr, const unsigned long *addr)
                   ^~~~~~~~
cc1: all warnings being treated as errors
make: *** [/tmp/qemu-test/src/rules.mak:69: util/threaded-workqueue.o] Error 1
make: *** Waiting for unfinished jobs....
  CC      crypto/trace.o
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 563, in <module>
    sys.exit(main())
  File "./tests/docker/docker.py", line 560, in main
    return args.cmdobj.run(args, argv)
  File "./tests/docker/docker.py", line 306, in run
    return Docker().run(argv, args.keep, quiet=args.quiet)
  File "./tests/docker/docker.py", line 274, in run
    quiet=quiet)
  File "./tests/docker/docker.py", line 181, in _do_check
    return subprocess.check_call(self._command + cmd, **kwargs)
  File "/usr/lib64/python2.7/subprocess.py", line 542, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=07fae0c4ee9d11e8b22468b59973b7d0', '-u', '1001', '--security-opt', 'seccomp=unconfined', '--rm', '--net=none', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=8', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-1vusqh9v/src/docker-src.2018-11-22-16.24.23.23218:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-1vusqh9v/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real	1m30.699s
user	0m16.848s
sys	0m3.570s
=== OUTPUT END ===

Test command exited with code: 2
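
The root cause of every error above is the same: on LLP64 hosts such as
the x86_64-w64-mingw32 target used here, 'long' is 32 bits, so uint64_t
is 'unsigned long long' and a pointer to it is incompatible with the
'unsigned long *' that the bitmap helpers are declared to take. A
minimal sketch, illustrative only and not code from the series, that
reproduces the diagnostic:

    #include <stdint.h>

    int main(void)
    {
        uint64_t bitmap = 0;
        /* On LP64 Linux, uint64_t is 'unsigned long', so this compiles;
         * on LLP64 (mingw64) it is 'unsigned long long' and gcc emits
         * the same incompatible-pointer-types error as the log above. */
        unsigned long *p = &bitmap;
        (void)p;
        return 0;
    }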


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 0/5] migration: improve multithreads
  2018-11-22  7:20 ` [Qemu-devel] " guangrong.xiao
@ 2018-11-22 21:35   ` no-reply
  -1 siblings, 0 replies; 70+ messages in thread
From: no-reply @ 2018-11-22 21:35 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: famz, kvm, quintela, mtosatti, xiaoguangrong, qemu-devel, peterx,
	dgilbert, wei.w.wang, cota, mst, jiang.biao2, pbonzini

Hi,

This series seems to have some coding style problems. See output below for
more information:

Message-id: 20181122072028.22819-1-xiaoguangrong@tencent.com
Type: series
Subject: [Qemu-devel] [PATCH v3 0/5] migration: improve multithreads

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
1f40f88 tests: add threaded-workqueue-bench
0c560ac migration: use threaded workqueue for decompression
effdcb4 migration: use threaded workqueue for compression
eb91c63 util: introduce threaded workqueue
3bf8b44 bitops: introduce change_bit_atomic

=== OUTPUT BEGIN ===
Checking PATCH 1/5: bitops: introduce change_bit_atomic...
Checking PATCH 2/5: util: introduce threaded workqueue...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#41: 
new file mode 100644

ERROR: externs should be avoided in .c files
#233: FILE: util/threaded-workqueue.c:65:
+    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);

ERROR: externs should be avoided in .c files
#235: FILE: util/threaded-workqueue.c:67:
+    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);

ERROR: externs should be avoided in .c files
#241: FILE: util/threaded-workqueue.c:73:
+    QemuEvent request_valid_ev QEMU_ALIGNED(SMP_CACHE_BYTES);

ERROR: externs should be avoided in .c files
#247: FILE: util/threaded-workqueue.c:79:
+    QemuEvent request_free_ev QEMU_ALIGNED(SMP_CACHE_BYTES);

total: 4 errors, 1 warnings, 575 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 3/5: migration: use threaded workqueue for compression...
Checking PATCH 4/5: migration: use threaded workqueue for decompression...
Checking PATCH 5/5: tests: add threaded-workqueue-bench...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#36: 
new file mode 100644

WARNING: line over 80 characters
#234: FILE: tests/threaded-workqueue-bench.c:194:
+    printf("   -r:       the number of requests handled by each thread (default %d).\n",

WARNING: line over 80 characters
#236: FILE: tests/threaded-workqueue-bench.c:196:
+    printf("   -m:       the size of the memory (G) used to test (default %dG).\n",

ERROR: line over 90 characters
#282: FILE: tests/threaded-workqueue-bench.c:242:
+    printf("Run the benchmark: threads %d requests-per-thread: %d memory %ldG repeat %d.\n",

total: 1 errors, 3 warnings, 272 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 1/5] bitops: introduce change_bit_atomic
  2018-11-22  7:20   ` [Qemu-devel] " guangrong.xiao
@ 2018-11-23 10:23     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 70+ messages in thread
From: Dr. David Alan Gilbert @ 2018-11-23 10:23 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx, quintela,
	wei.w.wang, cota, jiang.biao2, pbonzini

* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> It will be used by the threaded workqueue
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  include/qemu/bitops.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
> index 3f0926cf40..c522958852 100644
> --- a/include/qemu/bitops.h
> +++ b/include/qemu/bitops.h
> @@ -79,6 +79,19 @@ static inline void change_bit(long nr, unsigned long *addr)
>      *p ^= mask;
>  }
>  
> +/**
> + * change_bit_atomic - Toggle a bit in memory atomically
> + * @nr: Bit to change
> + * @addr: Address to start counting from
> + */
> +static inline void change_bit_atomic(long nr, unsigned long *addr)
> +{
> +    unsigned long mask = BIT_MASK(nr);
> +    unsigned long *p = addr + BIT_WORD(nr);
> +
> +    atomic_xor(p, mask);
> +}
> +
>  /**
>   * test_and_set_bit - Set a bit and return its old value
>   * @nr: Bit to set
> -- 
> 2.14.5
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-22  7:20   ` [Qemu-devel] " guangrong.xiao
@ 2018-11-23 11:02     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 70+ messages in thread
From: Dr. David Alan Gilbert @ 2018-11-23 11:02 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx, quintela,
	wei.w.wang, cota, jiang.biao2, pbonzini

* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> This module implements a lockless and efficient threaded workqueue.
> 
> Three abstracted objects are used in this module:
> - Request.
>     It not only contains the data that the workqueue fetches out
>     to finish the request but also offers the space to save the result
>     after the workqueue handles the request.
> 
>     It flows between the user and the workqueue. The user fills the
>     request data into it while the request is owned by the user. After
>     it is submitted to the workqueue, the workqueue fetches the data
>     out and saves the result into it once the request is handled.
> 
>     All the requests are pre-allocated and carefully partitioned
>     between threads so there is no contention on a request, which lets
>     the threads run in parallel as much as possible.
> 
> - User, i.e., the submitter
>     It's the one that fills the request and submits it to the
>     workqueue; the result is collected after the workqueue has handled
>     the request.
> 
>     The user can consecutively submit requests without waiting for the
>     previous requests to be handled.
>     Only one submitter is supported; serialize the submissions yourself
>     if you need more, e.g., with a lock on your side.
> 
> - Workqueue, i.e., thread
>     Each workqueue is represented by a running thread that fetches
>     the requests submitted by the user, does the specified work and
>     saves the result into the request.
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  include/qemu/threaded-workqueue.h | 106 +++++++++
>  util/Makefile.objs                |   1 +
>  util/threaded-workqueue.c         | 463 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 570 insertions(+)
>  create mode 100644 include/qemu/threaded-workqueue.h
>  create mode 100644 util/threaded-workqueue.c
> 
> diff --git a/include/qemu/threaded-workqueue.h b/include/qemu/threaded-workqueue.h
> new file mode 100644
> index 0000000000..e0ede496d0
> --- /dev/null
> +++ b/include/qemu/threaded-workqueue.h
> @@ -0,0 +1,106 @@
> +/*
> + * Lockless and Efficient Threaded Workqueue Abstraction
> + *
> + * Author:
> + *   Xiao Guangrong <xiaoguangrong@tencent.com>
> + *
> + * Copyright(C) 2018 Tencent Corporation.
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
> + * See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#ifndef QEMU_THREADED_WORKQUEUE_H
> +#define QEMU_THREADED_WORKQUEUE_H
> +
> +#include "qemu/queue.h"
> +#include "qemu/thread.h"
> +
> +/*
> + * This module implements a lockless and efficient threaded workqueue.
> + *
> + * Three abstracted objects are used in this module:
> + * - Request.
> + *   It not only contains the data that the workqueue fetches out
> + *   to finish the request but also offers the space to save the result
> + *   after the workqueue handles the request.
> + *
> + *   It flows between the user and the workqueue. The user fills the
> + *   request data into it while the request is owned by the user. After
> + *   it is submitted to the workqueue, the workqueue fetches the data
> + *   out and saves the result into it once the request is handled.
> + *
> + *   All the requests are pre-allocated and carefully partitioned
> + *   between threads so there is no contention on a request, which lets
> + *   the threads run in parallel as much as possible.
> + *
> + * - User, i.e., the submitter
> + *   It's the one that fills the request and submits it to the
> + *   workqueue; the result is collected after the workqueue has handled
> + *   the request.
> + *
> + *   The user can consecutively submit requests without waiting for the
> + *   previous requests to be handled.
> + *   Only one submitter is supported; serialize the submissions yourself
> + *   if you need more, e.g., with a lock on your side.
> + *
> + * - Workqueue, i.e., thread
> + *   Each workqueue is represented by a running thread that fetches
> + *   the requests submitted by the user, does the specified work and
> + *   saves the result into the request.
> + */
> +
> +typedef struct Threads Threads;
> +
> +struct ThreadedWorkqueueOps {
> +    /* constructor of the request */
> +    int (*thread_request_init)(void *request);
> +    /*  destructor of the request */
> +    void (*thread_request_uninit)(void *request);
> +
> +    /* the handler of the request that is called by the thread */
> +    void (*thread_request_handler)(void *request);
> +    /* called by the user after the request has been handled */
> +    void (*thread_request_done)(void *request);
> +
> +    size_t request_size;
> +};
> +typedef struct ThreadedWorkqueueOps ThreadedWorkqueueOps;
> +
> +/* the default number of requests that each thread needs to handle */
> +#define DEFAULT_THREAD_REQUEST_NR 4
> +/* the max number of requests that each thread needs to handle */
> +#define MAX_THREAD_REQUEST_NR     (sizeof(uint64_t) * BITS_PER_BYTE)
> +
> +/*
> + * create a threaded workqueue. The other APIs operate on the Threads
> + * object it returns
> + *
> + * @name: the identity of the workqueue, used only to construct the
> + *    names of the threads
> + * @threads_nr: the number of threads that the workqueue will create
> + * @thread_requests_nr: the number of requests that each single thread
> + *    will handle
> + * @ops: the handlers of the request
> + *
> + * Returns NULL on failure
> + */
> +Threads *threaded_workqueue_create(const char *name, unsigned int threads_nr,
> +                                   unsigned int thread_requests_nr,
> +                                   const ThreadedWorkqueueOps *ops);
> +void threaded_workqueue_destroy(Threads *threads);
> +
> +/*
> + * find a free request where the user can store the data that is needed to
> + * finish the request
> + *
> + * If all requests are used up, return NULL
> + */
> +void *threaded_workqueue_get_request(Threads *threads);
> +/* submit the request and notify the thread */
> +void threaded_workqueue_submit_request(Threads *threads, void *request);
> +
> +/*
> + * wait for all threads to complete their requests, to make sure no
> + * previously submitted request is still pending
> + */
> +void threaded_workqueue_wait_for_requests(Threads *threads);
> +#endif
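
(As an aside, a minimal usage sketch of the API declared above; the
callback bodies and the payload size are illustrative assumptions, not
code from the series:)

    static int demo_init(void *request) { return 0; }
    static void demo_uninit(void *request) { }

    static void demo_handler(void *request)
    {
        /* runs in a workqueue thread: do the work, store the result */
    }

    static void demo_done(void *request)
    {
        /* runs in the submitter once the request has been handled */
    }

    static const ThreadedWorkqueueOps demo_ops = {
        .thread_request_init    = demo_init,
        .thread_request_uninit  = demo_uninit,
        .thread_request_handler = demo_handler,
        .thread_request_done    = demo_done,
        .request_size           = 128,    /* assumed payload size */
    };

    static void demo(void)
    {
        Threads *threads;
        void *req;

        threads = threaded_workqueue_create("demo", 4,
                      DEFAULT_THREAD_REQUEST_NR, &demo_ops);
        req = threaded_workqueue_get_request(threads);
        if (req) {
            /* fill in the request payload, then hand it to a thread */
            threaded_workqueue_submit_request(threads, req);
        }
        threaded_workqueue_wait_for_requests(threads);
        threaded_workqueue_destroy(threads);
    }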
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 0820923c18..f26dfe5182 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -50,5 +50,6 @@ util-obj-y += range.o
>  util-obj-y += stats64.o
>  util-obj-y += systemd.o
>  util-obj-y += iova-tree.o
> +util-obj-y += threaded-workqueue.o
>  util-obj-$(CONFIG_LINUX) += vfio-helpers.o
>  util-obj-$(CONFIG_OPENGL) += drm.o
> diff --git a/util/threaded-workqueue.c b/util/threaded-workqueue.c
> new file mode 100644
> index 0000000000..2ab37cee8d
> --- /dev/null
> +++ b/util/threaded-workqueue.c
> @@ -0,0 +1,463 @@
> +/*
> + * Lockless and Efficient Threaded Workqueue Abstraction
> + *
> + * Author:
> + *   Xiao Guangrong <xiaoguangrong@tencent.com>
> + *
> + * Copyright(C) 2018 Tencent Corporation.
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
> + * See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/bitmap.h"
> +#include "qemu/threaded-workqueue.h"
> +
> +#define SMP_CACHE_BYTES 64

That's architecture dependent, isn't it?
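
(For illustration, one way the constant could be made architecture
aware instead of hard-coded; the values below are typical assumptions,
not something taken from the series:)

    /* sketch: 128-byte lines are typical on POWER and on some AArch64
     * implementations; 64 bytes is a common default elsewhere */
    #if defined(__powerpc64__) || defined(__aarch64__)
    #define SMP_CACHE_BYTES 128
    #else
    #define SMP_CACHE_BYTES 64
    #endif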

> +
> +/*
> + * the request representation, which contains the internally used meta
> + * data; it is the header of the user-defined data.
> + *
> + * It should be aligned to the natural word size of the CPU.
> + */
> +struct ThreadRequest {
> +    /*
> +     * the request has been handled by the thread and needs the user
> +     * to fetch the result out.
> +     */
> +    uint8_t done;
> +
> +    /*
> +     * the index to Thread::requests.
> +     * Save it to the padding space although it can be calculated at runtime.
> +     */
> +    uint8_t request_index;
> +
> +    /* the index to Threads::per_thread_data */
> +    unsigned int thread_index;
> +} QEMU_ALIGNED(sizeof(unsigned long));
> +typedef struct ThreadRequest ThreadRequest;
> +
> +struct ThreadLocal {
> +    struct Threads *threads;
> +
> +    /* the index of the thread */
> +    int self;
> +
> +    /* thread is useless and needs to exit */
> +    bool quit;
> +
> +    QemuThread thread;
> +
> +    void *requests;
> +
> +    /*
> +     * each bit in these two bitmaps corresponds to the entry at the
> +     * same index in @requests. If the two bits are equal, the
> +     * corresponding request is free and owned by the user, i.e., the
> +     * user may fill the request. Otherwise, it is valid and owned by
> +     * the thread, i.e., the thread fetches the request and writes the
> +     * result.
> +     */
> +
> +    /* after the user fills the request, the bit is flipped. */
> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> +    /* after handling the request, the thread flips the bit. */
> +    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);

Patchew complained about some type mismatches; I think those are because
you're using the bitmap_* functions on these; those functions always
operate on 'long' not on uint64_t - and on some platforms they're
unfortunately not the same.


Dave
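
(A sketch of the direction this comment points at, as an assumed fix
rather than code from the series: declaring the per-thread bitmaps with
QEMU's DECLARE_BITMAP keeps their element type 'unsigned long', which
is what bitmap_xor()/find_first_bit()/test_bit() take on every host.
Note the single-uint64_t atomic_rcu_read() trick used below would then
need an array-aware equivalent.)

    /* sketch: bitmap fields typed as the bitmap helpers expect */
    DECLARE_BITMAP(request_fill_bitmap, MAX_THREAD_REQUEST_NR)
        QEMU_ALIGNED(SMP_CACHE_BYTES);
    DECLARE_BITMAP(request_done_bitmap, MAX_THREAD_REQUEST_NR)
        QEMU_ALIGNED(SMP_CACHE_BYTES);

    /* ... and locals in get_free_request_bitmap() and friends: */
    unsigned long result_bitmap[BITS_TO_LONGS(MAX_THREAD_REQUEST_NR)];
    bitmap_xor(result_bitmap, thread->request_fill_bitmap,
               thread->request_done_bitmap, threads->thread_requests_nr);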

> +    /*
> +     * the event used to wake up the thread whenever a valid request has
> +     * been submitted
> +     */
> +    QemuEvent request_valid_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
> +
> +    /*
> +     * the event is notified whenever a request has been completed
> +     * (i.e, become free), which is used to wake up the user
> +     */
> +    QemuEvent request_free_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
> +};
> +typedef struct ThreadLocal ThreadLocal;
> +
> +/*
> + * the main data struct represents multithreads which is shared by
> + * all threads
> + */
> +struct Threads {
> +    /* the request header, ThreadRequest, is contained */
> +    unsigned int request_size;
> +    unsigned int thread_requests_nr;
> +    unsigned int threads_nr;
> +
> +    /* the request is pushed to the thread with round-robin manner */
> +    unsigned int current_thread_index;
> +
> +    const ThreadedWorkqueueOps *ops;
> +
> +    ThreadLocal per_thread_data[0];
> +};
> +typedef struct Threads Threads;
> +
> +static ThreadRequest *index_to_request(ThreadLocal *thread, int request_index)
> +{
> +    ThreadRequest *request;
> +
> +    request = thread->requests + request_index * thread->threads->request_size;
> +    assert(request->request_index == request_index);
> +    assert(request->thread_index == thread->self);
> +    return request;
> +}
> +
> +static int request_to_index(ThreadRequest *request)
> +{
> +    return request->request_index;
> +}
> +
> +static int request_to_thread_index(ThreadRequest *request)
> +{
> +    return request->thread_index;
> +}
> +
> +/*
> + * free request: the request is not used by any thread; however, it
> + *   might contain a result that needs the user to call
> + *   thread_request_done()
> + *
> + * valid request: the request contains the request data and it's
> + *   committed to the thread, i.e., it's owned by the thread.
> + */
> +static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
> +{
> +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
> +
> +    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
> +    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
> +    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
> +               threads->thread_requests_nr);
> +
> +    /*
> +     * paired with smp_wmb() in mark_request_free() to make sure that we
> +     * read request_done_bitmap before fetching the result out.
> +     */
> +    smp_rmb();
> +
> +    return result_bitmap;
> +}
> +
> +static ThreadRequest
> +*find_thread_free_request(Threads *threads, ThreadLocal *thread)
> +{
> +    uint64_t result_bitmap = get_free_request_bitmap(threads, thread);
> +    int index;
> +
> +    index  = find_first_zero_bit(&result_bitmap, threads->thread_requests_nr);
> +    if (index >= threads->thread_requests_nr) {
> +        return NULL;
> +    }
> +
> +    return index_to_request(thread, index);
> +}
> +
> +static ThreadRequest *threads_find_free_request(Threads *threads)
> +{
> +    ThreadLocal *thread;
> +    ThreadRequest *request;
> +    int cur_thread, thread_index;
> +
> +    cur_thread = threads->current_thread_index % threads->threads_nr;
> +    thread_index = cur_thread;
> +    do {
> +        thread = threads->per_thread_data + thread_index++;
> +        request = find_thread_free_request(threads, thread);
> +        if (request) {
> +            break;
> +        }
> +        thread_index %= threads->threads_nr;
> +    } while (thread_index != cur_thread);
> +
> +    return request;
> +}
> +
> +/*
> + * the change-bit operation combined with READ_ONCE and WRITE_ONCE,
> + * which only works on a single uint64_t word
> + */
> +static void change_bit_once(long nr, uint64_t *addr)
> +{
> +    uint64_t value = atomic_rcu_read(addr) ^ BIT_MASK(nr);
> +
> +    atomic_rcu_set(addr, value);
> +}
> +
> +static void mark_request_valid(Threads *threads, ThreadRequest *request)
> +{
> +    int thread_index = request_to_thread_index(request);
> +    int request_index = request_to_index(request);
> +    ThreadLocal *thread = threads->per_thread_data + thread_index;
> +
> +    /*
> +     * paired with smp_rmb() in find_first_valid_request_index() to make
> +     * sure the request has been filled before the bit is flipped,
> +     * which makes the request visible to the thread
> +     */
> +    smp_wmb();
> +
> +    change_bit_once(request_index, &thread->request_fill_bitmap);
> +    qemu_event_set(&thread->request_valid_ev);
> +}
> +
> +static int thread_find_first_valid_request_index(ThreadLocal *thread)
> +{
> +    Threads *threads = thread->threads;
> +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
> +    int index;
> +
> +    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
> +    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
> +    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
> +               threads->thread_requests_nr);
> +    /*
> +     * paired with smp_wmb() in mark_request_valid() to make sure that
> +     * we read request_fill_bitmap before fetching the request out.
> +     */
> +    smp_rmb();
> +
> +    index = find_first_bit(&result_bitmap, threads->thread_requests_nr);
> +    return index >= threads->thread_requests_nr ? -1 : index;
> +}
> +
> +static void mark_request_free(ThreadLocal *thread, ThreadRequest *request)
> +{
> +    int index = request_to_index(request);
> +
> +    /*
> +     * smp_wmb() is implied in change_bit_atomic(), which is paired with
> +     * smp_rmb() in get_free_request_bitmap() to make sure the result
> +     * has been saved before the bit is flipped.
> +     */
> +    change_bit_atomic(index, &thread->request_done_bitmap);
> +    qemu_event_set(&thread->request_free_ev);
> +}
> +
> +/* retry to see if there is an available request before going to wait. */
> +#define BUSY_WAIT_COUNT 1000
> +
> +static ThreadRequest *
> +thread_busy_wait_for_request(ThreadLocal *thread)
> +{
> +    int index, count = 0;
> +
> +    for (count = 0; count < BUSY_WAIT_COUNT; count++) {
> +        index = thread_find_first_valid_request_index(thread);
> +        if (index >= 0) {
> +            return index_to_request(thread, index);
> +        }
> +
> +        cpu_relax();
> +    }
> +
> +    return NULL;
> +}
> +
> +static void *thread_run(void *opaque)
> +{
> +    ThreadLocal *self_data = (ThreadLocal *)opaque;
> +    Threads *threads = self_data->threads;
> +    void (*handler)(void *request) = threads->ops->thread_request_handler;
> +    ThreadRequest *request;
> +
> +    for ( ; !atomic_read(&self_data->quit); ) {
> +        qemu_event_reset(&self_data->request_valid_ev);
> +
> +        request = thread_busy_wait_for_request(self_data);
> +        if (!request) {
> +            qemu_event_wait(&self_data->request_valid_ev);
> +            continue;
> +        }
> +
> +        assert(!request->done);
> +
> +        handler(request + 1);
> +        request->done = true;
> +        mark_request_free(self_data, request);
> +    }
> +
> +    return NULL;
> +}
> +
> +static void uninit_thread_requests(ThreadLocal *thread, int free_nr)
> +{
> +    Threads *threads = thread->threads;
> +    ThreadRequest *request = thread->requests;
> +    int i;
> +
> +    for (i = 0; i < free_nr; i++) {
> +        threads->ops->thread_request_uninit(request + 1);
> +        request = (void *)request + threads->request_size;
> +    }
> +    g_free(thread->requests);
> +}
> +
> +static int init_thread_requests(ThreadLocal *thread)
> +{
> +    Threads *threads = thread->threads;
> +    ThreadRequest *request;
> +    int ret, i, thread_reqs_size;
> +
> +    thread_reqs_size = threads->thread_requests_nr * threads->request_size;
> +    thread_reqs_size = QEMU_ALIGN_UP(thread_reqs_size, SMP_CACHE_BYTES);
> +    thread->requests = g_malloc0(thread_reqs_size);
> +
> +    request = thread->requests;
> +    for (i = 0; i < threads->thread_requests_nr; i++) {
> +        ret = threads->ops->thread_request_init(request + 1);
> +        if (ret < 0) {
> +            goto exit;
> +        }
> +
> +        request->request_index = i;
> +        request->thread_index = thread->self;
> +        request = (void *)request + threads->request_size;
> +    }
> +    return 0;
> +
> +exit:
> +    uninit_thread_requests(thread, i);
> +    return -1;
> +}
> +
> +static void uninit_thread_data(Threads *threads, int free_nr)
> +{
> +    ThreadLocal *thread_local = threads->per_thread_data;
> +    int i;
> +
> +    for (i = 0; i < free_nr; i++) {
> +        thread_local[i].quit = true;
> +        qemu_event_set(&thread_local[i].request_valid_ev);
> +        qemu_thread_join(&thread_local[i].thread);
> +        qemu_event_destroy(&thread_local[i].request_valid_ev);
> +        qemu_event_destroy(&thread_local[i].request_free_ev);
> +        uninit_thread_requests(&thread_local[i], threads->thread_requests_nr);
> +    }
> +}
> +
> +static int
> +init_thread_data(Threads *threads, const char *thread_name, int thread_nr)
> +{
> +    ThreadLocal *thread_local = threads->per_thread_data;
> +    char *name;
> +    int i;
> +
> +    for (i = 0; i < thread_nr; i++) {
> +        thread_local[i].threads = threads;
> +        thread_local[i].self = i;
> +
> +        if (init_thread_requests(&thread_local[i]) < 0) {
> +            goto exit;
> +        }
> +
> +        qemu_event_init(&thread_local[i].request_free_ev, false);
> +        qemu_event_init(&thread_local[i].request_valid_ev, false);
> +
> +        name = g_strdup_printf("%s/%d", thread_name, thread_local[i].self);
> +        qemu_thread_create(&thread_local[i].thread, name,
> +                           thread_run, &thread_local[i], QEMU_THREAD_JOINABLE);
> +        g_free(name);
> +    }
> +    return 0;
> +
> +exit:
> +    uninit_thread_data(threads, i);
> +    return -1;
> +}
> +
> +Threads *threaded_workqueue_create(const char *name, unsigned int threads_nr,
> +                                   unsigned int thread_requests_nr,
> +                                   const ThreadedWorkqueueOps *ops)
> +{
> +    Threads *threads;
> +
> +    if (threads_nr > MAX_THREAD_REQUEST_NR) {
> +        return NULL;
> +    }
> +
> +    threads = g_malloc0(sizeof(*threads) + threads_nr * sizeof(ThreadLocal));
> +    threads->ops = ops;
> +    threads->threads_nr = threads_nr;
> +    threads->thread_requests_nr = thread_requests_nr;
> +
> +    QEMU_BUILD_BUG_ON(!QEMU_IS_ALIGNED(sizeof(ThreadRequest), sizeof(long)));
> +    threads->request_size = threads->ops->request_size;
> +    threads->request_size = QEMU_ALIGN_UP(threads->request_size, sizeof(long));
> +    threads->request_size += sizeof(ThreadRequest);
> +
> +    if (init_thread_data(threads, name, threads_nr) < 0) {
> +        g_free(threads);
> +        return NULL;
> +    }
> +
> +    return threads;
> +}
> +
> +void threaded_workqueue_destroy(Threads *threads)
> +{
> +    uninit_thread_data(threads, threads->threads_nr);
> +    g_free(threads);
> +}
> +
> +static void request_done(Threads *threads, ThreadRequest *request)
> +{
> +    if (!request->done) {
> +        return;
> +    }
> +
> +    threads->ops->thread_request_done(request + 1);
> +    request->done = false;
> +}
> +
> +void *threaded_workqueue_get_request(Threads *threads)
> +{
> +    ThreadRequest *request;
> +
> +    request = threads_find_free_request(threads);
> +    if (!request) {
> +        return NULL;
> +    }
> +
> +    request_done(threads, request);
> +    return request + 1;
> +}
> +
> +void threaded_workqueue_submit_request(Threads *threads, void *request)
> +{
> +    ThreadRequest *req = request - sizeof(ThreadRequest);
> +    int thread_index = request_to_thread_index(req);
> +
> +    assert(!req->done);
> +    mark_request_valid(threads, req);
> +    threads->current_thread_index = thread_index + 1;
> +}
> +
> +void threaded_workqueue_wait_for_requests(Threads *threads)
> +{
> +    ThreadLocal *thread;
> +    uint64_t result_bitmap;
> +    int thread_index, index = 0;
> +
> +    for (thread_index = 0; thread_index < threads->threads_nr; thread_index++) {
> +        thread = threads->per_thread_data + thread_index;
> +        index = 0;
> +retry:
> +        qemu_event_reset(&thread->request_free_ev);
> +        result_bitmap = get_free_request_bitmap(threads, thread);
> +
> +        for (; index < threads->thread_requests_nr; index++) {
> +            if (test_bit(index, &result_bitmap)) {
> +                qemu_event_wait(&thread->request_free_ev);
> +                goto retry;
> +            }
> +
> +            request_done(threads, index_to_request(thread, index));
> +        }
> +    }
> +}
> -- 
> 2.14.5
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue
@ 2018-11-23 11:02     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 70+ messages in thread
From: Dr. David Alan Gilbert @ 2018-11-23 11:02 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: pbonzini, mst, mtosatti, qemu-devel, kvm, peterx, wei.w.wang,
	jiang.biao2, eblake, quintela, cota, Xiao Guangrong

* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> This modules implements the lockless and efficient threaded workqueue.
> 
> Three abstracted objects are used in this module:
> - Request.
>      It not only contains the data that the workqueue fetches out
>     to finish the request but also offers the space to save the result
>     after the workqueue handles the request.
> 
>     It's flowed between user and workqueue. The user fills the request
>     data into it when it is owned by user. After it is submitted to the
>     workqueue, the workqueue fetched data out and save the result into
>     it after the request is handled.
> 
>     All the requests are pre-allocated and carefully partitioned between
>     threads so there is no contention on the request, that make threads
>     be parallel as much as possible.
> 
> - User, i.e, the submitter
>     It's the one fills the request and submits it to the workqueue,
>     the result will be collected after it is handled by the work queue.
> 
>     The user can consecutively submit requests without waiting the previous
>     requests been handled.
>     It only supports one submitter, you should do serial submission by
>     yourself if you want more, e.g, use lock on you side.
> 
> - Workqueue, i.e, thread
>     Each workqueue is represented by a running thread that fetches
>     the request submitted by the user, do the specified work and save
>     the result to the request.
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  include/qemu/threaded-workqueue.h | 106 +++++++++
>  util/Makefile.objs                |   1 +
>  util/threaded-workqueue.c         | 463 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 570 insertions(+)
>  create mode 100644 include/qemu/threaded-workqueue.h
>  create mode 100644 util/threaded-workqueue.c
> 
> diff --git a/include/qemu/threaded-workqueue.h b/include/qemu/threaded-workqueue.h
> new file mode 100644
> index 0000000000..e0ede496d0
> --- /dev/null
> +++ b/include/qemu/threaded-workqueue.h
> @@ -0,0 +1,106 @@
> +/*
> + * Lockless and Efficient Threaded Workqueue Abstraction
> + *
> + * Author:
> + *   Xiao Guangrong <xiaoguangrong@tencent.com>
> + *
> + * Copyright(C) 2018 Tencent Corporation.
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
> + * See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#ifndef QEMU_THREADED_WORKQUEUE_H
> +#define QEMU_THREADED_WORKQUEUE_H
> +
> +#include "qemu/queue.h"
> +#include "qemu/thread.h"
> +
> +/*
> + * This modules implements the lockless and efficient threaded workqueue.
> + *
> + * Three abstracted objects are used in this module:
> + * - Request.
> + *   It not only contains the data that the workqueue fetches out
> + *   to finish the request but also offers the space to save the result
> + *   after the workqueue handles the request.
> + *
> + *   It's flowed between user and workqueue. The user fills the request
> + *   data into it when it is owned by user. After it is submitted to the
> + *   workqueue, the workqueue fetched data out and save the result into
> + *   it after the request is handled.
> + *
> + *   All the requests are pre-allocated and carefully partitioned between
> + *   threads so there is no contention on the request, that make threads
> + *   be parallel as much as possible.
> + *
> + * - User, i.e, the submitter
> + *   It's the one fills the request and submits it to the workqueue,
> + *   the result will be collected after it is handled by the work queue.
> + *
> + *   The user can consecutively submit requests without waiting the previous
> + *   requests been handled.
> + *   It only supports one submitter, you should do serial submission by
> + *   yourself if you want more, e.g, use lock on you side.
> + *
> + * - Workqueue, i.e, thread
> + *   Each workqueue is represented by a running thread that fetches
> + *   the request submitted by the user, do the specified work and save
> + *   the result to the request.
> + */
> +
> +typedef struct Threads Threads;
> +
> +struct ThreadedWorkqueueOps {
> +    /* constructor of the request */
> +    int (*thread_request_init)(void *request);
> +    /*  destructor of the request */
> +    void (*thread_request_uninit)(void *request);
> +
> +    /* the handler of the request that is called by the thread */
> +    void (*thread_request_handler)(void *request);
> +    /* called by the user after the request has been handled */
> +    void (*thread_request_done)(void *request);
> +
> +    size_t request_size;
> +};
> +typedef struct ThreadedWorkqueueOps ThreadedWorkqueueOps;
> +
> +/* the default number of requests that thread need handle */
> +#define DEFAULT_THREAD_REQUEST_NR 4
> +/* the max number of requests that thread need handle */
> +#define MAX_THREAD_REQUEST_NR     (sizeof(uint64_t) * BITS_PER_BYTE)
> +
> +/*
> + * create a threaded queue. Other APIs will work on the Threads it returned
> + *
> + * @name: the identity of the workqueue which is used to construct the name
> + *    of threads only
> + * @threads_nr: the number of threads that the workqueue will create
> + * @thread_requests_nr: the number of requests that each single thread will
> + *    handle
> + * @ops: the handlers of the request
> + *
> + * Return NULL if it failed
> + */
> +Threads *threaded_workqueue_create(const char *name, unsigned int threads_nr,
> +                                   unsigned int thread_requests_nr,
> +                                   const ThreadedWorkqueueOps *ops);
> +void threaded_workqueue_destroy(Threads *threads);
> +
> +/*
> + * find a free request where the user can store the data that is needed to
> + * finish the request
> + *
> + * If all requests are used up, return NULL
> + */
> +void *threaded_workqueue_get_request(Threads *threads);
> +/* submit the request and notify the thread */
> +void threaded_workqueue_submit_request(Threads *threads, void *request);
> +
> +/*
> + * wait all threads to complete the request to make sure there is no
> + * previous request exists
> + */
> +void threaded_workqueue_wait_for_requests(Threads *threads);
> +#endif
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 0820923c18..f26dfe5182 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -50,5 +50,6 @@ util-obj-y += range.o
>  util-obj-y += stats64.o
>  util-obj-y += systemd.o
>  util-obj-y += iova-tree.o
> +util-obj-y += threaded-workqueue.o
>  util-obj-$(CONFIG_LINUX) += vfio-helpers.o
>  util-obj-$(CONFIG_OPENGL) += drm.o
> diff --git a/util/threaded-workqueue.c b/util/threaded-workqueue.c
> new file mode 100644
> index 0000000000..2ab37cee8d
> --- /dev/null
> +++ b/util/threaded-workqueue.c
> @@ -0,0 +1,463 @@
> +/*
> + * Lockless and Efficient Threaded Workqueue Abstraction
> + *
> + * Author:
> + *   Xiao Guangrong <xiaoguangrong@tencent.com>
> + *
> + * Copyright(C) 2018 Tencent Corporation.
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
> + * See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/bitmap.h"
> +#include "qemu/threaded-workqueue.h"
> +
> +#define SMP_CACHE_BYTES 64

That's architecture dependent isn't it?

> +
> +/*
> + * the request representation which contains the internally used mete data,
> + * it is the header of user-defined data.
> + *
> + * It should be aligned to the nature size of CPU.
> + */
> +struct ThreadRequest {
> +    /*
> +     * the request has been handled by the thread and need the user
> +     * to fetch result out.
> +     */
> +    uint8_t done;
> +
> +    /*
> +     * the index to Thread::requests.
> +     * Save it to the padding space although it can be calculated at runtime.
> +     */
> +    uint8_t request_index;
> +
> +    /* the index to Threads::per_thread_data */
> +    unsigned int thread_index;
> +} QEMU_ALIGNED(sizeof(unsigned long));
> +typedef struct ThreadRequest ThreadRequest;
> +
> +struct ThreadLocal {
> +    struct Threads *threads;
> +
> +    /* the index of the thread */
> +    int self;
> +
> +    /* thread is useless and needs to exit */
> +    bool quit;
> +
> +    QemuThread thread;
> +
> +    void *requests;
> +
> +   /*
> +     * the bit in these two bitmaps indicates the index of the @requests
> +     * respectively. If it's the same, the corresponding request is free
> +     * and owned by the user, i.e, where the user fills a request. Otherwise,
> +     * it is valid and owned by the thread, i.e, where the thread fetches
> +     * the request and write the result.
> +     */
> +
> +    /* after the user fills the request, the bit is flipped. */
> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> +    /* after handles the request, the thread flips the bit. */
> +    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);

Patchew complained about some type mismatches; I think those are because
you're using the bitmap_* functions on these; those functions always
operate on 'long' not on uint64_t - and on some platforms they're
unfortunately not the same.


Dave

> +    /*
> +     * the event used to wake up the thread whenever a valid request has
> +     * been submitted
> +     */
> +    QemuEvent request_valid_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
> +
> +    /*
> +     * the event is notified whenever a request has been completed
> +     * (i.e, become free), which is used to wake up the user
> +     */
> +    QemuEvent request_free_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
> +};
> +typedef struct ThreadLocal ThreadLocal;
> +
> +/*
> + * the main data struct represents multithreads which is shared by
> + * all threads
> + */
> +struct Threads {
> +    /* the request header, ThreadRequest, is contained */
> +    unsigned int request_size;
> +    unsigned int thread_requests_nr;
> +    unsigned int threads_nr;
> +
> +    /* the request is pushed to the thread with round-robin manner */
> +    unsigned int current_thread_index;
> +
> +    const ThreadedWorkqueueOps *ops;
> +
> +    ThreadLocal per_thread_data[0];
> +};
> +typedef struct Threads Threads;
> +
> +static ThreadRequest *index_to_request(ThreadLocal *thread, int request_index)
> +{
> +    ThreadRequest *request;
> +
> +    request = thread->requests + request_index * thread->threads->request_size;
> +    assert(request->request_index == request_index);
> +    assert(request->thread_index == thread->self);
> +    return request;
> +}
> +
> +static int request_to_index(ThreadRequest *request)
> +{
> +    return request->request_index;
> +}
> +
> +static int request_to_thread_index(ThreadRequest *request)
> +{
> +    return request->thread_index;
> +}
> +
> +/*
> + * free request: the request is not used by any thread, however, it might
> + *   contain the result need the user to call thread_request_done()
> + *
> + * valid request: the request contains the request data and it's committed
> + *   to the thread, i,e. it's owned by thread.
> + */
> +static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
> +{
> +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
> +
> +    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
> +    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
> +    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
> +               threads->thread_requests_nr);
> +
> +    /*
> +     * paired with smp_wmb() in mark_request_free() to make sure that we
> +     * read request_done_bitmap before fetching the result out.
> +     */
> +    smp_rmb();
> +
> +    return result_bitmap;
> +}
> +
> +static ThreadRequest
> +*find_thread_free_request(Threads *threads, ThreadLocal *thread)
> +{
> +    uint64_t result_bitmap = get_free_request_bitmap(threads, thread);
> +    int index;
> +
> +    index  = find_first_zero_bit(&result_bitmap, threads->thread_requests_nr);
> +    if (index >= threads->thread_requests_nr) {
> +        return NULL;
> +    }
> +
> +    return index_to_request(thread, index);
> +}
> +
> +static ThreadRequest *threads_find_free_request(Threads *threads)
> +{
> +    ThreadLocal *thread;
> +    ThreadRequest *request;
> +    int cur_thread, thread_index;
> +
> +    cur_thread = threads->current_thread_index % threads->threads_nr;
> +    thread_index = cur_thread;
> +    do {
> +        thread = threads->per_thread_data + thread_index++;
> +        request = find_thread_free_request(threads, thread);
> +        if (request) {
> +            break;
> +        }
> +        thread_index %= threads->threads_nr;
> +    } while (thread_index != cur_thread);
> +
> +    return request;
> +}
> +
> +/*
> + * a change-bit operation combined with READ_ONCE and WRITE_ONCE, which
> + * only works on a single uint64_t word
> + */
> +static void change_bit_once(long nr, uint64_t *addr)
> +{
> +    uint64_t value = atomic_rcu_read(addr) ^ BIT_MASK(nr);
> +
> +    atomic_rcu_set(addr, value);
> +}
> +
> +static void mark_request_valid(Threads *threads, ThreadRequest *request)
> +{
> +    int thread_index = request_to_thread_index(request);
> +    int request_index = request_to_index(request);
> +    ThreadLocal *thread = threads->per_thread_data + thread_index;
> +
> +    /*
> +     * paired with smp_rmb() in thread_find_first_valid_request_index() to
> +     * make sure the request has been filled before the bit is flipped,
> +     * which makes the request visible to the thread
> +     */
> +    smp_wmb();
> +
> +    change_bit_once(request_index, &thread->request_fill_bitmap);
> +    qemu_event_set(&thread->request_valid_ev);
> +}
> +
> +static int thread_find_first_valid_request_index(ThreadLocal *thread)
> +{
> +    Threads *threads = thread->threads;
> +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
> +    int index;
> +
> +    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
> +    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
> +    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
> +               threads->thread_requests_nr);
> +    /*
> +     * paired with smp_wmb() in mark_request_valid() to make sure that
> +     * we read request_fill_bitmap before fetching the request out.
> +     */
> +    smp_rmb();
> +
> +    index = find_first_bit(&result_bitmap, threads->thread_requests_nr);
> +    return index >= threads->thread_requests_nr ? -1 : index;
> +}
> +
> +static void mark_request_free(ThreadLocal *thread, ThreadRequest *request)
> +{
> +    int index = request_to_index(request);
> +
> +    /*
> +     * smp_wmb() is implied in change_bit_atomic() that is paired with
> +     * smp_rmb() in get_free_request_bitmap() to make sure the result
> +     * has been saved before the bit is flipped.
> +     */
> +    change_bit_atomic(index, &thread->request_done_bitmap);
> +    qemu_event_set(&thread->request_free_ev);
> +}
> +
> +/* retry to see if there is an available request before actually going to wait. */
> +#define BUSY_WAIT_COUNT 1000
> +
> +static ThreadRequest *
> +thread_busy_wait_for_request(ThreadLocal *thread)
> +{
> +    int index, count = 0;
> +
> +    for (count = 0; count < BUSY_WAIT_COUNT; count++) {
> +        index = thread_find_first_valid_request_index(thread);
> +        if (index >= 0) {
> +            return index_to_request(thread, index);
> +        }
> +
> +        cpu_relax();
> +    }
> +
> +    return NULL;
> +}
> +
> +static void *thread_run(void *opaque)
> +{
> +    ThreadLocal *self_data = (ThreadLocal *)opaque;
> +    Threads *threads = self_data->threads;
> +    void (*handler)(void *request) = threads->ops->thread_request_handler;
> +    ThreadRequest *request;
> +
> +    for ( ; !atomic_read(&self_data->quit); ) {
> +        qemu_event_reset(&self_data->request_valid_ev);
> +
> +        request = thread_busy_wait_for_request(self_data);
> +        if (!request) {
> +            qemu_event_wait(&self_data->request_valid_ev);
> +            continue;
> +        }
> +
> +        assert(!request->done);
> +
> +        handler(request + 1);
> +        request->done = true;
> +        mark_request_free(self_data, request);
> +    }
> +
> +    return NULL;
> +}
> +
> +static void uninit_thread_requests(ThreadLocal *thread, int free_nr)
> +{
> +    Threads *threads = thread->threads;
> +    ThreadRequest *request = thread->requests;
> +    int i;
> +
> +    for (i = 0; i < free_nr; i++) {
> +        threads->ops->thread_request_uninit(request + 1);
> +        request = (void *)request + threads->request_size;
> +    }
> +    g_free(thread->requests);
> +}
> +
> +static int init_thread_requests(ThreadLocal *thread)
> +{
> +    Threads *threads = thread->threads;
> +    ThreadRequest *request;
> +    int ret, i, thread_reqs_size;
> +
> +    thread_reqs_size = threads->thread_requests_nr * threads->request_size;
> +    thread_reqs_size = QEMU_ALIGN_UP(thread_reqs_size, SMP_CACHE_BYTES);
> +    thread->requests = g_malloc0(thread_reqs_size);
> +
> +    request = thread->requests;
> +    for (i = 0; i < threads->thread_requests_nr; i++) {
> +        ret = threads->ops->thread_request_init(request + 1);
> +        if (ret < 0) {
> +            goto exit;
> +        }
> +
> +        request->request_index = i;
> +        request->thread_index = thread->self;
> +        request = (void *)request + threads->request_size;
> +    }
> +    return 0;
> +
> +exit:
> +    uninit_thread_requests(thread, i);
> +    return -1;
> +}
> +
> +static void uninit_thread_data(Threads *threads, int free_nr)
> +{
> +    ThreadLocal *thread_local = threads->per_thread_data;
> +    int i;
> +
> +    for (i = 0; i < free_nr; i++) {
> +        thread_local[i].quit = true;
> +        qemu_event_set(&thread_local[i].request_valid_ev);
> +        qemu_thread_join(&thread_local[i].thread);
> +        qemu_event_destroy(&thread_local[i].request_valid_ev);
> +        qemu_event_destroy(&thread_local[i].request_free_ev);
> +        uninit_thread_requests(&thread_local[i], threads->thread_requests_nr);
> +    }
> +}
> +
> +static int
> +init_thread_data(Threads *threads, const char *thread_name, int thread_nr)
> +{
> +    ThreadLocal *thread_local = threads->per_thread_data;
> +    char *name;
> +    int i;
> +
> +    for (i = 0; i < thread_nr; i++) {
> +        thread_local[i].threads = threads;
> +        thread_local[i].self = i;
> +
> +        if (init_thread_requests(&thread_local[i]) < 0) {
> +            goto exit;
> +        }
> +
> +        qemu_event_init(&thread_local[i].request_free_ev, false);
> +        qemu_event_init(&thread_local[i].request_valid_ev, false);
> +
> +        name = g_strdup_printf("%s/%d", thread_name, thread_local[i].self);
> +        qemu_thread_create(&thread_local[i].thread, name,
> +                           thread_run, &thread_local[i], QEMU_THREAD_JOINABLE);
> +        g_free(name);
> +    }
> +    return 0;
> +
> +exit:
> +    uninit_thread_data(threads, i);
> +    return -1;
> +}
> +
> +Threads *threaded_workqueue_create(const char *name, unsigned int threads_nr,
> +                                   unsigned int thread_requests_nr,
> +                                   const ThreadedWorkqueueOps *ops)
> +{
> +    Threads *threads;
> +
> +    if (threads_nr > MAX_THREAD_REQUEST_NR) {
> +        return NULL;
> +    }
> +
> +    threads = g_malloc0(sizeof(*threads) + threads_nr * sizeof(ThreadLocal));
> +    threads->ops = ops;
> +    threads->threads_nr = threads_nr;
> +    threads->thread_requests_nr = thread_requests_nr;
> +
> +    QEMU_BUILD_BUG_ON(!QEMU_IS_ALIGNED(sizeof(ThreadRequest), sizeof(long)));
> +    threads->request_size = threads->ops->request_size;
> +    threads->request_size = QEMU_ALIGN_UP(threads->request_size, sizeof(long));
> +    threads->request_size += sizeof(ThreadRequest);
> +
> +    if (init_thread_data(threads, name, threads_nr) < 0) {
> +        g_free(threads);
> +        return NULL;
> +    }
> +
> +    return threads;
> +}
> +
> +void threaded_workqueue_destroy(Threads *threads)
> +{
> +    uninit_thread_data(threads, threads->threads_nr);
> +    g_free(threads);
> +}
> +
> +static void request_done(Threads *threads, ThreadRequest *request)
> +{
> +    if (!request->done) {
> +        return;
> +    }
> +
> +    threads->ops->thread_request_done(request + 1);
> +    request->done = false;
> +}
> +
> +void *threaded_workqueue_get_request(Threads *threads)
> +{
> +    ThreadRequest *request;
> +
> +    request = threads_find_free_request(threads);
> +    if (!request) {
> +        return NULL;
> +    }
> +
> +    request_done(threads, request);
> +    return request + 1;
> +}
> +
> +void threaded_workqueue_submit_request(Threads *threads, void *request)
> +{
> +    ThreadRequest *req = request - sizeof(ThreadRequest);
> +    int thread_index = request_to_thread_index(request);
> +
> +    assert(!req->done);
> +    mark_request_valid(threads, req);
> +    threads->current_thread_index = thread_index + 1;
> +}
> +
> +void threaded_workqueue_wait_for_requests(Threads *threads)
> +{
> +    ThreadLocal *thread;
> +    uint64_t result_bitmap;
> +    int thread_index, index = 0;
> +
> +    for (thread_index = 0; thread_index < threads->threads_nr; thread_index++) {
> +        thread = threads->per_thread_data + thread_index;
> +        index = 0;
> +retry:
> +        qemu_event_reset(&thread->request_free_ev);
> +        result_bitmap = get_free_request_bitmap(threads, thread);
> +
> +        for (; index < threads->thread_requests_nr; index++) {
> +            if (test_bit(index, &result_bitmap)) {
> +                qemu_event_wait(&thread->request_free_ev);
> +                goto retry;
> +            }
> +
> +            request_done(threads, index_to_request(thread, index));
> +        }
> +    }
> +}
> -- 
> 2.14.5
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
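
PS: for anyone skimming the thread, a minimal caller of this API looks
roughly like the following -- my sketch against the signatures quoted
above, untested:

    #include "qemu/threaded-workqueue.h"

    struct MyRequest {
        int input;
        int output;                      /* filled by a worker thread */
    };

    static int my_request_init(void *request) { return 0; }
    static void my_request_uninit(void *request) { }

    static void my_request_handler(void *request)
    {
        struct MyRequest *req = request;

        req->output = req->input * 2;    /* runs in a worker thread */
    }

    static void my_request_done(void *request)
    {
        /* runs in the submitter's context once the result is back */
    }

    static const ThreadedWorkqueueOps my_ops = {
        .thread_request_init    = my_request_init,
        .thread_request_uninit  = my_request_uninit,
        .thread_request_handler = my_request_handler,
        .thread_request_done    = my_request_done,
        .request_size           = sizeof(struct MyRequest),
    };

    static void demo(void)
    {
        /* 4 worker threads, each with 16 request slots */
        Threads *threads = threaded_workqueue_create("demo", 4, 16,
                                                     &my_ops);
        struct MyRequest *req = threaded_workqueue_get_request(threads);

        if (req) {
            req->input = 21;
            threaded_workqueue_submit_request(threads, req);
        }
        threaded_workqueue_wait_for_requests(threads);
        threaded_workqueue_destroy(threads);
    }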

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 3/5] migration: use threaded workqueue for compression
  2018-11-22  7:20   ` [Qemu-devel] " guangrong.xiao
@ 2018-11-23 18:17     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 70+ messages in thread
From: Dr. David Alan Gilbert @ 2018-11-23 18:17 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx, quintela,
	wei.w.wang, cota, jiang.biao2, pbonzini

* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Adapt the compression code to the threaded workqueue
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  migration/ram.c | 308 ++++++++++++++++++++------------------------------------
>  1 file changed, 110 insertions(+), 198 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 7e7deec4d8..254c08f27b 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -57,6 +57,7 @@
>  #include "qemu/uuid.h"
>  #include "savevm.h"
>  #include "qemu/iov.h"
> +#include "qemu/threaded-workqueue.h"
>  
>  /***********************************************************/
>  /* ram save/restore */
> @@ -349,22 +350,6 @@ typedef struct PageSearchStatus PageSearchStatus;
>  
>  CompressionStats compression_counters;
>  
> -struct CompressParam {
> -    bool done;
> -    bool quit;
> -    bool zero_page;
> -    QEMUFile *file;
> -    QemuMutex mutex;
> -    QemuCond cond;
> -    RAMBlock *block;
> -    ram_addr_t offset;
> -
> -    /* internally used fields */
> -    z_stream stream;
> -    uint8_t *originbuf;
> -};
> -typedef struct CompressParam CompressParam;
> -
>  struct DecompressParam {
>      bool done;
>      bool quit;
> @@ -377,15 +362,6 @@ struct DecompressParam {
>  };
>  typedef struct DecompressParam DecompressParam;
>  
> -static CompressParam *comp_param;
> -static QemuThread *compress_threads;
> -/* comp_done_cond is used to wake up the migration thread when
> - * one of the compression threads has finished the compression.
> - * comp_done_lock is used to co-work with comp_done_cond.
> - */
> -static QemuMutex comp_done_lock;
> -static QemuCond comp_done_cond;
> -/* The empty QEMUFileOps will be used by file in CompressParam */
>  static const QEMUFileOps empty_ops = { };
>  
>  static QEMUFile *decomp_file;
> @@ -394,125 +370,6 @@ static QemuThread *decompress_threads;
>  static QemuMutex decomp_done_lock;
>  static QemuCond decomp_done_cond;
>  
> -static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
> -                                 ram_addr_t offset, uint8_t *source_buf);
> -
> -static void *do_data_compress(void *opaque)
> -{
> -    CompressParam *param = opaque;
> -    RAMBlock *block;
> -    ram_addr_t offset;
> -    bool zero_page;
> -
> -    qemu_mutex_lock(&param->mutex);
> -    while (!param->quit) {
> -        if (param->block) {
> -            block = param->block;
> -            offset = param->offset;
> -            param->block = NULL;
> -            qemu_mutex_unlock(&param->mutex);
> -
> -            zero_page = do_compress_ram_page(param->file, &param->stream,
> -                                             block, offset, param->originbuf);
> -
> -            qemu_mutex_lock(&comp_done_lock);
> -            param->done = true;
> -            param->zero_page = zero_page;
> -            qemu_cond_signal(&comp_done_cond);
> -            qemu_mutex_unlock(&comp_done_lock);
> -
> -            qemu_mutex_lock(&param->mutex);
> -        } else {
> -            qemu_cond_wait(&param->cond, &param->mutex);
> -        }
> -    }
> -    qemu_mutex_unlock(&param->mutex);
> -
> -    return NULL;
> -}
> -
> -static void compress_threads_save_cleanup(void)
> -{
> -    int i, thread_count;
> -
> -    if (!migrate_use_compression() || !comp_param) {
> -        return;
> -    }
> -
> -    thread_count = migrate_compress_threads();
> -    for (i = 0; i < thread_count; i++) {
> -        /*
> -         * we use it as a indicator which shows if the thread is
> -         * properly init'd or not
> -         */
> -        if (!comp_param[i].file) {
> -            break;
> -        }
> -
> -        qemu_mutex_lock(&comp_param[i].mutex);
> -        comp_param[i].quit = true;
> -        qemu_cond_signal(&comp_param[i].cond);
> -        qemu_mutex_unlock(&comp_param[i].mutex);
> -
> -        qemu_thread_join(compress_threads + i);
> -        qemu_mutex_destroy(&comp_param[i].mutex);
> -        qemu_cond_destroy(&comp_param[i].cond);
> -        deflateEnd(&comp_param[i].stream);
> -        g_free(comp_param[i].originbuf);
> -        qemu_fclose(comp_param[i].file);
> -        comp_param[i].file = NULL;
> -    }
> -    qemu_mutex_destroy(&comp_done_lock);
> -    qemu_cond_destroy(&comp_done_cond);
> -    g_free(compress_threads);
> -    g_free(comp_param);
> -    compress_threads = NULL;
> -    comp_param = NULL;
> -}
> -
> -static int compress_threads_save_setup(void)
> -{
> -    int i, thread_count;
> -
> -    if (!migrate_use_compression()) {
> -        return 0;
> -    }
> -    thread_count = migrate_compress_threads();
> -    compress_threads = g_new0(QemuThread, thread_count);
> -    comp_param = g_new0(CompressParam, thread_count);
> -    qemu_cond_init(&comp_done_cond);
> -    qemu_mutex_init(&comp_done_lock);
> -    for (i = 0; i < thread_count; i++) {
> -        comp_param[i].originbuf = g_try_malloc(TARGET_PAGE_SIZE);
> -        if (!comp_param[i].originbuf) {
> -            goto exit;
> -        }
> -
> -        if (deflateInit(&comp_param[i].stream,
> -                        migrate_compress_level()) != Z_OK) {
> -            g_free(comp_param[i].originbuf);
> -            goto exit;
> -        }
> -
> -        /* comp_param[i].file is just used as a dummy buffer to save data,
> -         * set its ops to empty.
> -         */
> -        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops);
> -        comp_param[i].done = true;
> -        comp_param[i].quit = false;
> -        qemu_mutex_init(&comp_param[i].mutex);
> -        qemu_cond_init(&comp_param[i].cond);
> -        qemu_thread_create(compress_threads + i, "compress",
> -                           do_data_compress, comp_param + i,
> -                           QEMU_THREAD_JOINABLE);
> -    }
> -    return 0;
> -
> -exit:
> -    compress_threads_save_cleanup();
> -    return -1;
> -}
> -
>  /* Multiple fd's */
>  
>  #define MULTIFD_MAGIC 0x11223344U
> @@ -1909,12 +1766,25 @@ exit:
>      return zero_page;
>  }
>  
> +struct CompressData {
> +    /* filled by migration thread. */
> +    RAMBlock *block;
> +    ram_addr_t offset;
> +
> +    /* filled by compress thread. */
> +    QEMUFile *file;
> +    z_stream stream;
> +    uint8_t *originbuf;
> +    bool zero_page;
> +};
> +typedef struct CompressData CompressData;
> +
>  static void
> -update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
> +update_compress_thread_counts(CompressData *cd, int bytes_xmit)

Keep the const?
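
i.e. presumably keeping:

    static void
    update_compress_thread_counts(const CompressData *cd, int bytes_xmit)
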
>  {
>      ram_counters.transferred += bytes_xmit;
>  
> -    if (param->zero_page) {
> +    if (cd->zero_page) {
>          ram_counters.duplicate++;
>          return;
>      }
> @@ -1924,81 +1794,123 @@ update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
>      compression_counters.pages++;
>  }
>  
> +static int compress_thread_data_init(void *request)
> +{
> +    CompressData *cd = request;
> +
> +    cd->originbuf = g_try_malloc(TARGET_PAGE_SIZE);
> +    if (!cd->originbuf) {
> +        return -1;
> +    }
> +
> +    if (deflateInit(&cd->stream, migrate_compress_level()) != Z_OK) {
> +        g_free(cd->originbuf);
> +        return -1;
> +    }

Please print errors if you fail in any case so we can easily tell what
happened.
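
E.g. something along these lines -- an untested sketch, assuming
error_report() from qemu/error-report.h as used elsewhere in migration/:

    if (deflateInit(&cd->stream, migrate_compress_level()) != Z_OK) {
        error_report("compress: deflateInit failed");
        g_free(cd->originbuf);
        return -1;
    }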

> +    cd->file = qemu_fopen_ops(NULL, &empty_ops);
> +    return 0;
> +}
> +
> +static void compress_thread_data_fini(void *request)
> +{
> +    CompressData *cd = request;
> +
> +    qemu_fclose(cd->file);
> +    deflateEnd(&cd->stream);
> +    g_free(cd->originbuf);
> +}
> +
> +static void compress_thread_data_handler(void *request)
> +{
> +    CompressData *cd = request;
> +
> +    /*
> +     * if compression fails, it will be indicated by
> +     * migrate_get_current()->to_dst_file.
> +     */
> +    cd->zero_page = do_compress_ram_page(cd->file, &cd->stream, cd->block,
> +                                         cd->offset, cd->originbuf);
> +}
> +
> +static void compress_thread_data_done(void *request)
> +{
> +    CompressData *cd = request;
> +    RAMState *rs = ram_state;
> +    int bytes_xmit;
> +
> +    bytes_xmit = qemu_put_qemu_file(rs->f, cd->file);
> +    update_compress_thread_counts(cd, bytes_xmit);
> +}
> +
> +static const ThreadedWorkqueueOps compress_ops = {
> +    .thread_request_init = compress_thread_data_init,
> +    .thread_request_uninit = compress_thread_data_fini,
> +    .thread_request_handler = compress_thread_data_handler,
> +    .thread_request_done = compress_thread_data_done,
> +    .request_size = sizeof(CompressData),
> +};
> +
> +static Threads *compress_threads;
> +
>  static bool save_page_use_compression(RAMState *rs);
>  
>  static void flush_compressed_data(RAMState *rs)
>  {
> -    int idx, len, thread_count;
> -
>      if (!save_page_use_compression(rs)) {
>          return;
>      }
> -    thread_count = migrate_compress_threads();
>  
> -    qemu_mutex_lock(&comp_done_lock);
> -    for (idx = 0; idx < thread_count; idx++) {
> -        while (!comp_param[idx].done) {
> -            qemu_cond_wait(&comp_done_cond, &comp_done_lock);
> -        }
> -    }
> -    qemu_mutex_unlock(&comp_done_lock);
> +    threaded_workqueue_wait_for_requests(compress_threads);
> +}
>  
> -    for (idx = 0; idx < thread_count; idx++) {
> -        qemu_mutex_lock(&comp_param[idx].mutex);
> -        if (!comp_param[idx].quit) {
> -            len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
> -            /*
> -             * it's safe to fetch zero_page without holding comp_done_lock
> -             * as there is no further request submitted to the thread,
> -             * i.e, the thread should be waiting for a request at this point.
> -             */
> -            update_compress_thread_counts(&comp_param[idx], len);
> -        }
> -        qemu_mutex_unlock(&comp_param[idx].mutex);
> +static void compress_threads_save_cleanup(void)
> +{
> +    if (!compress_threads) {
> +        return;
>      }
> +
> +    threaded_workqueue_destroy(compress_threads);
> +    compress_threads = NULL;
>  }
>  
> -static inline void set_compress_params(CompressParam *param, RAMBlock *block,
> -                                       ram_addr_t offset)
> +static int compress_threads_save_setup(void)
>  {
> -    param->block = block;
> -    param->offset = offset;
> +    if (!migrate_use_compression()) {
> +        return 0;
> +    }
> +
> +    compress_threads = threaded_workqueue_create("compress",
> +                                migrate_compress_threads(),
> +                                DEFAULT_THREAD_REQUEST_NR, &compress_ops);
> +    return compress_threads ? 0 : -1;
>  }
>  
>  static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
>                                             ram_addr_t offset)
>  {
> -    int idx, thread_count, bytes_xmit = -1, pages = -1;
> +    CompressData *cd;
>      bool wait = migrate_compress_wait_thread();
>  
> -    thread_count = migrate_compress_threads();
> -    qemu_mutex_lock(&comp_done_lock);
>  retry:
> -    for (idx = 0; idx < thread_count; idx++) {
> -        if (comp_param[idx].done) {
> -            comp_param[idx].done = false;
> -            bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
> -            qemu_mutex_lock(&comp_param[idx].mutex);
> -            set_compress_params(&comp_param[idx], block, offset);
> -            qemu_cond_signal(&comp_param[idx].cond);
> -            qemu_mutex_unlock(&comp_param[idx].mutex);
> -            pages = 1;
> -            update_compress_thread_counts(&comp_param[idx], bytes_xmit);
> -            break;
> +    cd = threaded_workqueue_get_request(compress_threads);
> +    if (!cd) {
> +        /*
> +         * wait for a free thread if the user specifies
> +         * 'compress-wait-thread', otherwise we will post
> +         * the page out in the main thread as a normal page.
> +         */
> +        if (wait) {
> +            cpu_relax();
> +            goto retry;

Is there nothing better we can use to wait without eating CPU time?
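
E.g. perhaps something like this (hypothetical -- the workqueue would
have to export it; it reuses request_free_ev and the internals quoted
in patch 2, and only watches one thread's event, so it's a sketch
rather than a finished design):

    static void threads_wait_for_free_request(Threads *threads)
    {
        ThreadLocal *thread = threads->per_thread_data +
                  threads->current_thread_index % threads->threads_nr;

        qemu_event_reset(&thread->request_free_ev);
        if (!find_thread_free_request(threads, thread)) {
            /* sleep until this worker marks a request free */
            qemu_event_wait(&thread->request_free_ev);
        }
    }

so the retry loop above could call it instead of cpu_relax().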

Dave

>          }
> -    }
>  
> -    /*
> -     * wait for the free thread if the user specifies 'compress-wait-thread',
> -     * otherwise we will post the page out in the main thread as normal page.
> -     */
> -    if (pages < 0 && wait) {
> -        qemu_cond_wait(&comp_done_cond, &comp_done_lock);
> -        goto retry;
> -    }
> -    qemu_mutex_unlock(&comp_done_lock);
> -
> -    return pages;
> +        return -1;
> +     }
> +    cd->block = block;
> +    cd->offset = offset;
> +    threaded_workqueue_submit_request(compress_threads, cd);
> +    return 1;
>  }
>  
>  /**
> -- 
> 2.14.5
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 3/5] migration: use threaded workqueue for compression
  2018-11-23 18:17     ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-11-23 18:22       ` Paolo Bonzini
  -1 siblings, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2018-11-23 18:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx, quintela,
	wei.w.wang, cota, jiang.biao2

On 23/11/18 19:17, Dr. David Alan Gilbert wrote:
> * guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Adapt the compression code to the threaded workqueue
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
>> ---
>>  migration/ram.c | 308 ++++++++++++++++++++------------------------------------
>>  1 file changed, 110 insertions(+), 198 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 7e7deec4d8..254c08f27b 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -57,6 +57,7 @@
>>  #include "qemu/uuid.h"
>>  #include "savevm.h"
>>  #include "qemu/iov.h"
>> +#include "qemu/threaded-workqueue.h"
>>  
>>  /***********************************************************/
>>  /* ram save/restore */
>> @@ -349,22 +350,6 @@ typedef struct PageSearchStatus PageSearchStatus;
>>  
>>  CompressionStats compression_counters;
>>  
>> -struct CompressParam {
>> -    bool done;
>> -    bool quit;
>> -    bool zero_page;
>> -    QEMUFile *file;
>> -    QemuMutex mutex;
>> -    QemuCond cond;
>> -    RAMBlock *block;
>> -    ram_addr_t offset;
>> -
>> -    /* internally used fields */
>> -    z_stream stream;
>> -    uint8_t *originbuf;
>> -};
>> -typedef struct CompressParam CompressParam;
>> -
>>  struct DecompressParam {
>>      bool done;
>>      bool quit;
>> @@ -377,15 +362,6 @@ struct DecompressParam {
>>  };
>>  typedef struct DecompressParam DecompressParam;
>>  
>> -static CompressParam *comp_param;
>> -static QemuThread *compress_threads;
>> -/* comp_done_cond is used to wake up the migration thread when
>> - * one of the compression threads has finished the compression.
>> - * comp_done_lock is used to co-work with comp_done_cond.
>> - */
>> -static QemuMutex comp_done_lock;
>> -static QemuCond comp_done_cond;
>> -/* The empty QEMUFileOps will be used by file in CompressParam */
>>  static const QEMUFileOps empty_ops = { };
>>  
>>  static QEMUFile *decomp_file;
>> @@ -394,125 +370,6 @@ static QemuThread *decompress_threads;
>>  static QemuMutex decomp_done_lock;
>>  static QemuCond decomp_done_cond;
>>  
>> -static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
>> -                                 ram_addr_t offset, uint8_t *source_buf);
>> -
>> -static void *do_data_compress(void *opaque)
>> -{
>> -    CompressParam *param = opaque;
>> -    RAMBlock *block;
>> -    ram_addr_t offset;
>> -    bool zero_page;
>> -
>> -    qemu_mutex_lock(&param->mutex);
>> -    while (!param->quit) {
>> -        if (param->block) {
>> -            block = param->block;
>> -            offset = param->offset;
>> -            param->block = NULL;
>> -            qemu_mutex_unlock(&param->mutex);
>> -
>> -            zero_page = do_compress_ram_page(param->file, &param->stream,
>> -                                             block, offset, param->originbuf);
>> -
>> -            qemu_mutex_lock(&comp_done_lock);
>> -            param->done = true;
>> -            param->zero_page = zero_page;
>> -            qemu_cond_signal(&comp_done_cond);
>> -            qemu_mutex_unlock(&comp_done_lock);
>> -
>> -            qemu_mutex_lock(&param->mutex);
>> -        } else {
>> -            qemu_cond_wait(&param->cond, &param->mutex);
>> -        }
>> -    }
>> -    qemu_mutex_unlock(&param->mutex);
>> -
>> -    return NULL;
>> -}
>> -
>> -static void compress_threads_save_cleanup(void)
>> -{
>> -    int i, thread_count;
>> -
>> -    if (!migrate_use_compression() || !comp_param) {
>> -        return;
>> -    }
>> -
>> -    thread_count = migrate_compress_threads();
>> -    for (i = 0; i < thread_count; i++) {
>> -        /*
>> -         * we use it as a indicator which shows if the thread is
>> -         * properly init'd or not
>> -         */
>> -        if (!comp_param[i].file) {
>> -            break;
>> -        }
>> -
>> -        qemu_mutex_lock(&comp_param[i].mutex);
>> -        comp_param[i].quit = true;
>> -        qemu_cond_signal(&comp_param[i].cond);
>> -        qemu_mutex_unlock(&comp_param[i].mutex);
>> -
>> -        qemu_thread_join(compress_threads + i);
>> -        qemu_mutex_destroy(&comp_param[i].mutex);
>> -        qemu_cond_destroy(&comp_param[i].cond);
>> -        deflateEnd(&comp_param[i].stream);
>> -        g_free(comp_param[i].originbuf);
>> -        qemu_fclose(comp_param[i].file);
>> -        comp_param[i].file = NULL;
>> -    }
>> -    qemu_mutex_destroy(&comp_done_lock);
>> -    qemu_cond_destroy(&comp_done_cond);
>> -    g_free(compress_threads);
>> -    g_free(comp_param);
>> -    compress_threads = NULL;
>> -    comp_param = NULL;
>> -}
>> -
>> -static int compress_threads_save_setup(void)
>> -{
>> -    int i, thread_count;
>> -
>> -    if (!migrate_use_compression()) {
>> -        return 0;
>> -    }
>> -    thread_count = migrate_compress_threads();
>> -    compress_threads = g_new0(QemuThread, thread_count);
>> -    comp_param = g_new0(CompressParam, thread_count);
>> -    qemu_cond_init(&comp_done_cond);
>> -    qemu_mutex_init(&comp_done_lock);
>> -    for (i = 0; i < thread_count; i++) {
>> -        comp_param[i].originbuf = g_try_malloc(TARGET_PAGE_SIZE);
>> -        if (!comp_param[i].originbuf) {
>> -            goto exit;
>> -        }
>> -
>> -        if (deflateInit(&comp_param[i].stream,
>> -                        migrate_compress_level()) != Z_OK) {
>> -            g_free(comp_param[i].originbuf);
>> -            goto exit;
>> -        }
>> -
>> -        /* comp_param[i].file is just used as a dummy buffer to save data,
>> -         * set its ops to empty.
>> -         */
>> -        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops);
>> -        comp_param[i].done = true;
>> -        comp_param[i].quit = false;
>> -        qemu_mutex_init(&comp_param[i].mutex);
>> -        qemu_cond_init(&comp_param[i].cond);
>> -        qemu_thread_create(compress_threads + i, "compress",
>> -                           do_data_compress, comp_param + i,
>> -                           QEMU_THREAD_JOINABLE);
>> -    }
>> -    return 0;
>> -
>> -exit:
>> -    compress_threads_save_cleanup();
>> -    return -1;
>> -}
>> -
>>  /* Multiple fd's */
>>  
>>  #define MULTIFD_MAGIC 0x11223344U
>> @@ -1909,12 +1766,25 @@ exit:
>>      return zero_page;
>>  }
>>  
>> +struct CompressData {
>> +    /* filled by migration thread. */
>> +    RAMBlock *block;
>> +    ram_addr_t offset;
>> +
>> +    /* filled by compress thread. */
>> +    QEMUFile *file;
>> +    z_stream stream;
>> +    uint8_t *originbuf;
>> +    bool zero_page;
>> +};
>> +typedef struct CompressData CompressData;
>> +
>>  static void
>> -update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
>> +update_compress_thread_counts(CompressData *cd, int bytes_xmit)
> 
> Keep the const?
>>  {
>>      ram_counters.transferred += bytes_xmit;
>>  
>> -    if (param->zero_page) {
>> +    if (cd->zero_page) {
>>          ram_counters.duplicate++;
>>          return;
>>      }
>> @@ -1924,81 +1794,123 @@ update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
>>      compression_counters.pages++;
>>  }
>>  
>> +static int compress_thread_data_init(void *request)
>> +{
>> +    CompressData *cd = request;
>> +
>> +    cd->originbuf = g_try_malloc(TARGET_PAGE_SIZE);
>> +    if (!cd->originbuf) {
>> +        return -1;
>> +    }
>> +
>> +    if (deflateInit(&cd->stream, migrate_compress_level()) != Z_OK) {
>> +        g_free(cd->originbuf);
>> +        return -1;
>> +    }
> 
> Please print errors if you fail in any case so we can easily tell what
> happened.
> 
>> +    cd->file = qemu_fopen_ops(NULL, &empty_ops);
>> +    return 0;
>> +}
>> +
>> +static void compress_thread_data_fini(void *request)
>> +{
>> +    CompressData *cd = request;
>> +
>> +    qemu_fclose(cd->file);
>> +    deflateEnd(&cd->stream);
>> +    g_free(cd->originbuf);
>> +}
>> +
>> +static void compress_thread_data_handler(void *request)
>> +{
>> +    CompressData *cd = request;
>> +
>> +    /*
>> +     * if compression fails, it will be indicated by
>> +     * migrate_get_current()->to_dst_file.
>> +     */
>> +    cd->zero_page = do_compress_ram_page(cd->file, &cd->stream, cd->block,
>> +                                         cd->offset, cd->originbuf);
>> +}
>> +
>> +static void compress_thread_data_done(void *request)
>> +{
>> +    CompressData *cd = request;
>> +    RAMState *rs = ram_state;
>> +    int bytes_xmit;
>> +
>> +    bytes_xmit = qemu_put_qemu_file(rs->f, cd->file);
>> +    update_compress_thread_counts(cd, bytes_xmit);
>> +}
>> +
>> +static const ThreadedWorkqueueOps compress_ops = {
>> +    .thread_request_init = compress_thread_data_init,
>> +    .thread_request_uninit = compress_thread_data_fini,
>> +    .thread_request_handler = compress_thread_data_handler,
>> +    .thread_request_done = compress_thread_data_done,
>> +    .request_size = sizeof(CompressData),
>> +};
>> +
>> +static Threads *compress_threads;
>> +
>>  static bool save_page_use_compression(RAMState *rs);
>>  
>>  static void flush_compressed_data(RAMState *rs)
>>  {
>> -    int idx, len, thread_count;
>> -
>>      if (!save_page_use_compression(rs)) {
>>          return;
>>      }
>> -    thread_count = migrate_compress_threads();
>>  
>> -    qemu_mutex_lock(&comp_done_lock);
>> -    for (idx = 0; idx < thread_count; idx++) {
>> -        while (!comp_param[idx].done) {
>> -            qemu_cond_wait(&comp_done_cond, &comp_done_lock);
>> -        }
>> -    }
>> -    qemu_mutex_unlock(&comp_done_lock);
>> +    threaded_workqueue_wait_for_requests(compress_threads);
>> +}
>>  
>> -    for (idx = 0; idx < thread_count; idx++) {
>> -        qemu_mutex_lock(&comp_param[idx].mutex);
>> -        if (!comp_param[idx].quit) {
>> -            len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
>> -            /*
>> -             * it's safe to fetch zero_page without holding comp_done_lock
>> -             * as there is no further request submitted to the thread,
>> -             * i.e, the thread should be waiting for a request at this point.
>> -             */
>> -            update_compress_thread_counts(&comp_param[idx], len);
>> -        }
>> -        qemu_mutex_unlock(&comp_param[idx].mutex);
>> +static void compress_threads_save_cleanup(void)
>> +{
>> +    if (!compress_threads) {
>> +        return;
>>      }
>> +
>> +    threaded_workqueue_destroy(compress_threads);
>> +    compress_threads = NULL;
>>  }
>>  
>> -static inline void set_compress_params(CompressParam *param, RAMBlock *block,
>> -                                       ram_addr_t offset)
>> +static int compress_threads_save_setup(void)
>>  {
>> -    param->block = block;
>> -    param->offset = offset;
>> +    if (!migrate_use_compression()) {
>> +        return 0;
>> +    }
>> +
>> +    compress_threads = threaded_workqueue_create("compress",
>> +                                migrate_compress_threads(),
>> +                                DEFAULT_THREAD_REQUEST_NR, &compress_ops);
>> +    return compress_threads ? 0 : -1;
>>  }
>>  
>>  static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
>>                                             ram_addr_t offset)
>>  {
>> -    int idx, thread_count, bytes_xmit = -1, pages = -1;
>> +    CompressData *cd;
>>      bool wait = migrate_compress_wait_thread();
>>  
>> -    thread_count = migrate_compress_threads();
>> -    qemu_mutex_lock(&comp_done_lock);
>>  retry:
>> -    for (idx = 0; idx < thread_count; idx++) {
>> -        if (comp_param[idx].done) {
>> -            comp_param[idx].done = false;
>> -            bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
>> -            qemu_mutex_lock(&comp_param[idx].mutex);
>> -            set_compress_params(&comp_param[idx], block, offset);
>> -            qemu_cond_signal(&comp_param[idx].cond);
>> -            qemu_mutex_unlock(&comp_param[idx].mutex);
>> -            pages = 1;
>> -            update_compress_thread_counts(&comp_param[idx], bytes_xmit);
>> -            break;
>> +    cd = threaded_workqueue_get_request(compress_threads);
>> +    if (!cd) {
>> +        /*
>> +         * wait for a free thread if the user specifies
>> +         * 'compress-wait-thread', otherwise we will post
>> +         * the page out in the main thread as a normal page.
>> +         */
>> +        if (wait) {
>> +            cpu_relax();
>> +            goto retry;
> 
> Is there nothing better we can use to wait without eating CPU time?

There is a mechanism to wait without eating CPU time in the data
structure, but it makes sense to busy wait.  There are 4 threads in the
workqueue, so you have to compare 1/4th of the time spent compressing a
page with the trip into the kernel to wake you up.  You're adding 20%
CPU usage, but I'm not surprised it's worthwhile.
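
To put illustrative numbers on it (assumed figures, not measurements):
if compressing one 4KiB page takes ~40us, then with 4 workers a slot
frees up every ~10us on average, while a futex sleep/wakeup round trip
plus scheduler latency can itself cost several microseconds -- the same
order of magnitude as the expected wait, so spinning for a bounded
number of iterations first is the cheaper bet.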

Paolo

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 3/5] migration: use threaded workqueue for compression
@ 2018-11-23 18:22       ` Paolo Bonzini
  0 siblings, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2018-11-23 18:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, guangrong.xiao
  Cc: mst, mtosatti, qemu-devel, kvm, peterx, wei.w.wang, jiang.biao2,
	eblake, quintela, cota, Xiao Guangrong

On 23/11/18 19:17, Dr. David Alan Gilbert wrote:
> * guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Adapt the compression code to the threaded workqueue
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
>> ---
>>  migration/ram.c | 308 ++++++++++++++++++++------------------------------------
>>  1 file changed, 110 insertions(+), 198 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 7e7deec4d8..254c08f27b 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -57,6 +57,7 @@
>>  #include "qemu/uuid.h"
>>  #include "savevm.h"
>>  #include "qemu/iov.h"
>> +#include "qemu/threaded-workqueue.h"
>>  
>>  /***********************************************************/
>>  /* ram save/restore */
>> @@ -349,22 +350,6 @@ typedef struct PageSearchStatus PageSearchStatus;
>>  
>>  CompressionStats compression_counters;
>>  
>> -struct CompressParam {
>> -    bool done;
>> -    bool quit;
>> -    bool zero_page;
>> -    QEMUFile *file;
>> -    QemuMutex mutex;
>> -    QemuCond cond;
>> -    RAMBlock *block;
>> -    ram_addr_t offset;
>> -
>> -    /* internally used fields */
>> -    z_stream stream;
>> -    uint8_t *originbuf;
>> -};
>> -typedef struct CompressParam CompressParam;
>> -
>>  struct DecompressParam {
>>      bool done;
>>      bool quit;
>> @@ -377,15 +362,6 @@ struct DecompressParam {
>>  };
>>  typedef struct DecompressParam DecompressParam;
>>  
>> -static CompressParam *comp_param;
>> -static QemuThread *compress_threads;
>> -/* comp_done_cond is used to wake up the migration thread when
>> - * one of the compression threads has finished the compression.
>> - * comp_done_lock is used to co-work with comp_done_cond.
>> - */
>> -static QemuMutex comp_done_lock;
>> -static QemuCond comp_done_cond;
>> -/* The empty QEMUFileOps will be used by file in CompressParam */
>>  static const QEMUFileOps empty_ops = { };
>>  
>>  static QEMUFile *decomp_file;
>> @@ -394,125 +370,6 @@ static QemuThread *decompress_threads;
>>  static QemuMutex decomp_done_lock;
>>  static QemuCond decomp_done_cond;
>>  
>> -static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
>> -                                 ram_addr_t offset, uint8_t *source_buf);
>> -
>> -static void *do_data_compress(void *opaque)
>> -{
>> -    CompressParam *param = opaque;
>> -    RAMBlock *block;
>> -    ram_addr_t offset;
>> -    bool zero_page;
>> -
>> -    qemu_mutex_lock(&param->mutex);
>> -    while (!param->quit) {
>> -        if (param->block) {
>> -            block = param->block;
>> -            offset = param->offset;
>> -            param->block = NULL;
>> -            qemu_mutex_unlock(&param->mutex);
>> -
>> -            zero_page = do_compress_ram_page(param->file, &param->stream,
>> -                                             block, offset, param->originbuf);
>> -
>> -            qemu_mutex_lock(&comp_done_lock);
>> -            param->done = true;
>> -            param->zero_page = zero_page;
>> -            qemu_cond_signal(&comp_done_cond);
>> -            qemu_mutex_unlock(&comp_done_lock);
>> -
>> -            qemu_mutex_lock(&param->mutex);
>> -        } else {
>> -            qemu_cond_wait(&param->cond, &param->mutex);
>> -        }
>> -    }
>> -    qemu_mutex_unlock(&param->mutex);
>> -
>> -    return NULL;
>> -}
>> -
>> -static void compress_threads_save_cleanup(void)
>> -{
>> -    int i, thread_count;
>> -
>> -    if (!migrate_use_compression() || !comp_param) {
>> -        return;
>> -    }
>> -
>> -    thread_count = migrate_compress_threads();
>> -    for (i = 0; i < thread_count; i++) {
>> -        /*
>> -         * we use it as a indicator which shows if the thread is
>> -         * properly init'd or not
>> -         */
>> -        if (!comp_param[i].file) {
>> -            break;
>> -        }
>> -
>> -        qemu_mutex_lock(&comp_param[i].mutex);
>> -        comp_param[i].quit = true;
>> -        qemu_cond_signal(&comp_param[i].cond);
>> -        qemu_mutex_unlock(&comp_param[i].mutex);
>> -
>> -        qemu_thread_join(compress_threads + i);
>> -        qemu_mutex_destroy(&comp_param[i].mutex);
>> -        qemu_cond_destroy(&comp_param[i].cond);
>> -        deflateEnd(&comp_param[i].stream);
>> -        g_free(comp_param[i].originbuf);
>> -        qemu_fclose(comp_param[i].file);
>> -        comp_param[i].file = NULL;
>> -    }
>> -    qemu_mutex_destroy(&comp_done_lock);
>> -    qemu_cond_destroy(&comp_done_cond);
>> -    g_free(compress_threads);
>> -    g_free(comp_param);
>> -    compress_threads = NULL;
>> -    comp_param = NULL;
>> -}
>> -
>> -static int compress_threads_save_setup(void)
>> -{
>> -    int i, thread_count;
>> -
>> -    if (!migrate_use_compression()) {
>> -        return 0;
>> -    }
>> -    thread_count = migrate_compress_threads();
>> -    compress_threads = g_new0(QemuThread, thread_count);
>> -    comp_param = g_new0(CompressParam, thread_count);
>> -    qemu_cond_init(&comp_done_cond);
>> -    qemu_mutex_init(&comp_done_lock);
>> -    for (i = 0; i < thread_count; i++) {
>> -        comp_param[i].originbuf = g_try_malloc(TARGET_PAGE_SIZE);
>> -        if (!comp_param[i].originbuf) {
>> -            goto exit;
>> -        }
>> -
>> -        if (deflateInit(&comp_param[i].stream,
>> -                        migrate_compress_level()) != Z_OK) {
>> -            g_free(comp_param[i].originbuf);
>> -            goto exit;
>> -        }
>> -
>> -        /* comp_param[i].file is just used as a dummy buffer to save data,
>> -         * set its ops to empty.
>> -         */
>> -        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops);
>> -        comp_param[i].done = true;
>> -        comp_param[i].quit = false;
>> -        qemu_mutex_init(&comp_param[i].mutex);
>> -        qemu_cond_init(&comp_param[i].cond);
>> -        qemu_thread_create(compress_threads + i, "compress",
>> -                           do_data_compress, comp_param + i,
>> -                           QEMU_THREAD_JOINABLE);
>> -    }
>> -    return 0;
>> -
>> -exit:
>> -    compress_threads_save_cleanup();
>> -    return -1;
>> -}
>> -
>>  /* Multiple fd's */
>>  
>>  #define MULTIFD_MAGIC 0x11223344U
>> @@ -1909,12 +1766,25 @@ exit:
>>      return zero_page;
>>  }
>>  
>> +struct CompressData {
>> +    /* filled by migration thread. */
>> +    RAMBlock *block;
>> +    ram_addr_t offset;
>> +
>> +    /* filled by compress thread. */
>> +    QEMUFile *file;
>> +    z_stream stream;
>> +    uint8_t *originbuf;
>> +    bool zero_page;
>> +};
>> +typedef struct CompressData CompressData;
>> +
>>  static void
>> -update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
>> +update_compress_thread_counts(CompressData *cd, int bytes_xmit)
> 
> Keep the const?
>>  {
>>      ram_counters.transferred += bytes_xmit;
>>  
>> -    if (param->zero_page) {
>> +    if (cd->zero_page) {
>>          ram_counters.duplicate++;
>>          return;
>>      }
>> @@ -1924,81 +1794,123 @@ update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
>>      compression_counters.pages++;
>>  }
>>  
>> +static int compress_thread_data_init(void *request)
>> +{
>> +    CompressData *cd = request;
>> +
>> +    cd->originbuf = g_try_malloc(TARGET_PAGE_SIZE);
>> +    if (!cd->originbuf) {
>> +        return -1;
>> +    }
>> +
>> +    if (deflateInit(&cd->stream, migrate_compress_level()) != Z_OK) {
>> +        g_free(cd->originbuf);
>> +        return -1;
>> +    }
> 
> Please print errors if you fail in any case so we can easily tell what
> happened.
> 
>> +    cd->file = qemu_fopen_ops(NULL, &empty_ops);
>> +    return 0;
>> +}
>> +
>> +static void compress_thread_data_fini(void *request)
>> +{
>> +    CompressData *cd = request;
>> +
>> +    qemu_fclose(cd->file);
>> +    deflateEnd(&cd->stream);
>> +    g_free(cd->originbuf);
>> +}
>> +
>> +static void compress_thread_data_handler(void *request)
>> +{
>> +    CompressData *cd = request;
>> +
>> +    /*
>> +     * if compression fails, it will be indicated by
>> +     * migrate_get_current()->to_dst_file.
>> +     */
>> +    cd->zero_page = do_compress_ram_page(cd->file, &cd->stream, cd->block,
>> +                                         cd->offset, cd->originbuf);
>> +}
>> +
>> +static void compress_thread_data_done(void *request)
>> +{
>> +    CompressData *cd = request;
>> +    RAMState *rs = ram_state;
>> +    int bytes_xmit;
>> +
>> +    bytes_xmit = qemu_put_qemu_file(rs->f, cd->file);
>> +    update_compress_thread_counts(cd, bytes_xmit);
>> +}
>> +
>> +static const ThreadedWorkqueueOps compress_ops = {
>> +    .thread_request_init = compress_thread_data_init,
>> +    .thread_request_uninit = compress_thread_data_fini,
>> +    .thread_request_handler = compress_thread_data_handler,
>> +    .thread_request_done = compress_thread_data_done,
>> +    .request_size = sizeof(CompressData),
>> +};
>> +
>> +static Threads *compress_threads;
>> +
>>  static bool save_page_use_compression(RAMState *rs);
>>  
>>  static void flush_compressed_data(RAMState *rs)
>>  {
>> -    int idx, len, thread_count;
>> -
>>      if (!save_page_use_compression(rs)) {
>>          return;
>>      }
>> -    thread_count = migrate_compress_threads();
>>  
>> -    qemu_mutex_lock(&comp_done_lock);
>> -    for (idx = 0; idx < thread_count; idx++) {
>> -        while (!comp_param[idx].done) {
>> -            qemu_cond_wait(&comp_done_cond, &comp_done_lock);
>> -        }
>> -    }
>> -    qemu_mutex_unlock(&comp_done_lock);
>> +    threaded_workqueue_wait_for_requests(compress_threads);
>> +}
>>  
>> -    for (idx = 0; idx < thread_count; idx++) {
>> -        qemu_mutex_lock(&comp_param[idx].mutex);
>> -        if (!comp_param[idx].quit) {
>> -            len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
>> -            /*
>> -             * it's safe to fetch zero_page without holding comp_done_lock
>> -             * as there is no further request submitted to the thread,
>> -             * i.e, the thread should be waiting for a request at this point.
>> -             */
>> -            update_compress_thread_counts(&comp_param[idx], len);
>> -        }
>> -        qemu_mutex_unlock(&comp_param[idx].mutex);
>> +static void compress_threads_save_cleanup(void)
>> +{
>> +    if (!compress_threads) {
>> +        return;
>>      }
>> +
>> +    threaded_workqueue_destroy(compress_threads);
>> +    compress_threads = NULL;
>>  }
>>  
>> -static inline void set_compress_params(CompressParam *param, RAMBlock *block,
>> -                                       ram_addr_t offset)
>> +static int compress_threads_save_setup(void)
>>  {
>> -    param->block = block;
>> -    param->offset = offset;
>> +    if (!migrate_use_compression()) {
>> +        return 0;
>> +    }
>> +
>> +    compress_threads = threaded_workqueue_create("compress",
>> +                                migrate_compress_threads(),
>> +                                DEFAULT_THREAD_REQUEST_NR, &compress_ops);
>> +    return compress_threads ? 0 : -1;
>>  }
>>  
>>  static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
>>                                             ram_addr_t offset)
>>  {
>> -    int idx, thread_count, bytes_xmit = -1, pages = -1;
>> +    CompressData *cd;
>>      bool wait = migrate_compress_wait_thread();
>>  
>> -    thread_count = migrate_compress_threads();
>> -    qemu_mutex_lock(&comp_done_lock);
>>  retry:
>> -    for (idx = 0; idx < thread_count; idx++) {
>> -        if (comp_param[idx].done) {
>> -            comp_param[idx].done = false;
>> -            bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
>> -            qemu_mutex_lock(&comp_param[idx].mutex);
>> -            set_compress_params(&comp_param[idx], block, offset);
>> -            qemu_cond_signal(&comp_param[idx].cond);
>> -            qemu_mutex_unlock(&comp_param[idx].mutex);
>> -            pages = 1;
>> -            update_compress_thread_counts(&comp_param[idx], bytes_xmit);
>> -            break;
>> +    cd = threaded_workqueue_get_request(compress_threads);
>> +    if (!cd) {
>> +        /*
>> +         * wait for the free thread if the user specifies
>> +         * 'compress-wait-thread', otherwise we will post
>> +         *  the page out in the main thread as normal page.
>> +         */
>> +        if (wait) {
>> +            cpu_relax();
>> +            goto retry;
> 
> Is there nothing better we can use to wait without eating CPU time?

There is a mechanism to wait without eating CPU time in the data
structure, but it makes sense to busy wait.  There are 4 threads in the
workqueue, so you have to compare 1/4th of the time spent compressing a
page, with the trip into the kernel to wake you up.  You're adding 20%
CPU usage, but I'm not surprised it's worthwhile.
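For concreteness, a minimal sketch of the two waiting strategies being
compared (illustrative only: threaded_workqueue_get_request() and
cpu_relax() follow the patch, while "free_ev" is a hypothetical QemuEvent
that the workqueue would set whenever a slot frees up):

    CompressData *cd;

    /* busy wait: cheap when a slot frees within ~1/4 of a compression */
    while (!(cd = threaded_workqueue_get_request(compress_threads))) {
        cpu_relax();
    }

    /* sleeping wait: pays a kernel round trip on every wakeup */
    while (!(cd = threaded_workqueue_get_request(compress_threads))) {
        qemu_event_wait(&threads->free_ev);
    }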

Paolo

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 3/5] migration: use threaded workqueue for compression
  2018-11-23 18:22       ` [Qemu-devel] " Paolo Bonzini
@ 2018-11-23 18:29         ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 70+ messages in thread
From: Dr. David Alan Gilbert @ 2018-11-23 18:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx, quintela,
	wei.w.wang, cota, guangrong.xiao, jiang.biao2

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 23/11/18 19:17, Dr. David Alan Gilbert wrote:
> > * guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> >> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> >>
> >> Adapt the compression code to the threaded workqueue
> >>
> >> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> >> ---
> >>  migration/ram.c | 308 ++++++++++++++++++++------------------------------------
> >>  1 file changed, 110 insertions(+), 198 deletions(-)
> >>
> >> diff --git a/migration/ram.c b/migration/ram.c
> >> index 7e7deec4d8..254c08f27b 100644
> >> --- a/migration/ram.c
> >> +++ b/migration/ram.c
> >> @@ -57,6 +57,7 @@
> >>  #include "qemu/uuid.h"
> >>  #include "savevm.h"
> >>  #include "qemu/iov.h"
> >> +#include "qemu/threaded-workqueue.h"
> >>  
> >>  /***********************************************************/
> >>  /* ram save/restore */
> >> @@ -349,22 +350,6 @@ typedef struct PageSearchStatus PageSearchStatus;
> >>  
> >>  CompressionStats compression_counters;
> >>  
> >> -struct CompressParam {
> >> -    bool done;
> >> -    bool quit;
> >> -    bool zero_page;
> >> -    QEMUFile *file;
> >> -    QemuMutex mutex;
> >> -    QemuCond cond;
> >> -    RAMBlock *block;
> >> -    ram_addr_t offset;
> >> -
> >> -    /* internally used fields */
> >> -    z_stream stream;
> >> -    uint8_t *originbuf;
> >> -};
> >> -typedef struct CompressParam CompressParam;
> >> -
> >>  struct DecompressParam {
> >>      bool done;
> >>      bool quit;
> >> @@ -377,15 +362,6 @@ struct DecompressParam {
> >>  };
> >>  typedef struct DecompressParam DecompressParam;
> >>  
> >> -static CompressParam *comp_param;
> >> -static QemuThread *compress_threads;
> >> -/* comp_done_cond is used to wake up the migration thread when
> >> - * one of the compression threads has finished the compression.
> >> - * comp_done_lock is used to co-work with comp_done_cond.
> >> - */
> >> -static QemuMutex comp_done_lock;
> >> -static QemuCond comp_done_cond;
> >> -/* The empty QEMUFileOps will be used by file in CompressParam */
> >>  static const QEMUFileOps empty_ops = { };
> >>  
> >>  static QEMUFile *decomp_file;
> >> @@ -394,125 +370,6 @@ static QemuThread *decompress_threads;
> >>  static QemuMutex decomp_done_lock;
> >>  static QemuCond decomp_done_cond;
> >>  
> >> -static bool do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
> >> -                                 ram_addr_t offset, uint8_t *source_buf);
> >> -
> >> -static void *do_data_compress(void *opaque)
> >> -{
> >> -    CompressParam *param = opaque;
> >> -    RAMBlock *block;
> >> -    ram_addr_t offset;
> >> -    bool zero_page;
> >> -
> >> -    qemu_mutex_lock(&param->mutex);
> >> -    while (!param->quit) {
> >> -        if (param->block) {
> >> -            block = param->block;
> >> -            offset = param->offset;
> >> -            param->block = NULL;
> >> -            qemu_mutex_unlock(&param->mutex);
> >> -
> >> -            zero_page = do_compress_ram_page(param->file, &param->stream,
> >> -                                             block, offset, param->originbuf);
> >> -
> >> -            qemu_mutex_lock(&comp_done_lock);
> >> -            param->done = true;
> >> -            param->zero_page = zero_page;
> >> -            qemu_cond_signal(&comp_done_cond);
> >> -            qemu_mutex_unlock(&comp_done_lock);
> >> -
> >> -            qemu_mutex_lock(&param->mutex);
> >> -        } else {
> >> -            qemu_cond_wait(&param->cond, &param->mutex);
> >> -        }
> >> -    }
> >> -    qemu_mutex_unlock(&param->mutex);
> >> -
> >> -    return NULL;
> >> -}
> >> -
> >> -static void compress_threads_save_cleanup(void)
> >> -{
> >> -    int i, thread_count;
> >> -
> >> -    if (!migrate_use_compression() || !comp_param) {
> >> -        return;
> >> -    }
> >> -
> >> -    thread_count = migrate_compress_threads();
> >> -    for (i = 0; i < thread_count; i++) {
> >> -        /*
> >> -         * we use it as a indicator which shows if the thread is
> >> -         * properly init'd or not
> >> -         */
> >> -        if (!comp_param[i].file) {
> >> -            break;
> >> -        }
> >> -
> >> -        qemu_mutex_lock(&comp_param[i].mutex);
> >> -        comp_param[i].quit = true;
> >> -        qemu_cond_signal(&comp_param[i].cond);
> >> -        qemu_mutex_unlock(&comp_param[i].mutex);
> >> -
> >> -        qemu_thread_join(compress_threads + i);
> >> -        qemu_mutex_destroy(&comp_param[i].mutex);
> >> -        qemu_cond_destroy(&comp_param[i].cond);
> >> -        deflateEnd(&comp_param[i].stream);
> >> -        g_free(comp_param[i].originbuf);
> >> -        qemu_fclose(comp_param[i].file);
> >> -        comp_param[i].file = NULL;
> >> -    }
> >> -    qemu_mutex_destroy(&comp_done_lock);
> >> -    qemu_cond_destroy(&comp_done_cond);
> >> -    g_free(compress_threads);
> >> -    g_free(comp_param);
> >> -    compress_threads = NULL;
> >> -    comp_param = NULL;
> >> -}
> >> -
> >> -static int compress_threads_save_setup(void)
> >> -{
> >> -    int i, thread_count;
> >> -
> >> -    if (!migrate_use_compression()) {
> >> -        return 0;
> >> -    }
> >> -    thread_count = migrate_compress_threads();
> >> -    compress_threads = g_new0(QemuThread, thread_count);
> >> -    comp_param = g_new0(CompressParam, thread_count);
> >> -    qemu_cond_init(&comp_done_cond);
> >> -    qemu_mutex_init(&comp_done_lock);
> >> -    for (i = 0; i < thread_count; i++) {
> >> -        comp_param[i].originbuf = g_try_malloc(TARGET_PAGE_SIZE);
> >> -        if (!comp_param[i].originbuf) {
> >> -            goto exit;
> >> -        }
> >> -
> >> -        if (deflateInit(&comp_param[i].stream,
> >> -                        migrate_compress_level()) != Z_OK) {
> >> -            g_free(comp_param[i].originbuf);
> >> -            goto exit;
> >> -        }
> >> -
> >> -        /* comp_param[i].file is just used as a dummy buffer to save data,
> >> -         * set its ops to empty.
> >> -         */
> >> -        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops);
> >> -        comp_param[i].done = true;
> >> -        comp_param[i].quit = false;
> >> -        qemu_mutex_init(&comp_param[i].mutex);
> >> -        qemu_cond_init(&comp_param[i].cond);
> >> -        qemu_thread_create(compress_threads + i, "compress",
> >> -                           do_data_compress, comp_param + i,
> >> -                           QEMU_THREAD_JOINABLE);
> >> -    }
> >> -    return 0;
> >> -
> >> -exit:
> >> -    compress_threads_save_cleanup();
> >> -    return -1;
> >> -}
> >> -
> >>  /* Multiple fd's */
> >>  
> >>  #define MULTIFD_MAGIC 0x11223344U
> >> @@ -1909,12 +1766,25 @@ exit:
> >>      return zero_page;
> >>  }
> >>  
> >> +struct CompressData {
> >> +    /* filled by migration thread.*/
> >> +    RAMBlock *block;
> >> +    ram_addr_t offset;
> >> +
> >> +    /* filled by compress thread. */
> >> +    QEMUFile *file;
> >> +    z_stream stream;
> >> +    uint8_t *originbuf;
> >> +    bool zero_page;
> >> +};
> >> +typedef struct CompressData CompressData;
> >> +
> >>  static void
> >> -update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
> >> +update_compress_thread_counts(CompressData *cd, int bytes_xmit)
> > 
> > Keep the const?
> >>  {
> >>      ram_counters.transferred += bytes_xmit;
> >>  
> >> -    if (param->zero_page) {
> >> +    if (cd->zero_page) {
> >>          ram_counters.duplicate++;
> >>          return;
> >>      }
> >> @@ -1924,81 +1794,123 @@ update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
> >>      compression_counters.pages++;
> >>  }
> >>  
> >> +static int compress_thread_data_init(void *request)
> >> +{
> >> +    CompressData *cd = request;
> >> +
> >> +    cd->originbuf = g_try_malloc(TARGET_PAGE_SIZE);
> >> +    if (!cd->originbuf) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    if (deflateInit(&cd->stream, migrate_compress_level()) != Z_OK) {
> >> +        g_free(cd->originbuf);
> >> +        return -1;
> >> +    }
> > 
> > Please print errors if you fail in any case so we can easily tell what
> > happened.
> > 
> >> +    cd->file = qemu_fopen_ops(NULL, &empty_ops);
> >> +    return 0;
> >> +}
> >> +
> >> +static void compress_thread_data_fini(void *request)
> >> +{
> >> +    CompressData *cd = request;
> >> +
> >> +    qemu_fclose(cd->file);
> >> +    deflateEnd(&cd->stream);
> >> +    g_free(cd->originbuf);
> >> +}
> >> +
> >> +static void compress_thread_data_handler(void *request)
> >> +{
> >> +    CompressData *cd = request;
> >> +
> >> +    /*
> >> +     * if compression fails, it will be indicated by
> >> +     * migrate_get_current()->to_dst_file.
> >> +     */
> >> +    cd->zero_page = do_compress_ram_page(cd->file, &cd->stream, cd->block,
> >> +                                         cd->offset, cd->originbuf);
> >> +}
> >> +
> >> +static void compress_thread_data_done(void *request)
> >> +{
> >> +    CompressData *cd = request;
> >> +    RAMState *rs = ram_state;
> >> +    int bytes_xmit;
> >> +
> >> +    bytes_xmit = qemu_put_qemu_file(rs->f, cd->file);
> >> +    update_compress_thread_counts(cd, bytes_xmit);
> >> +}
> >> +
> >> +static const ThreadedWorkqueueOps compress_ops = {
> >> +    .thread_request_init = compress_thread_data_init,
> >> +    .thread_request_uninit = compress_thread_data_fini,
> >> +    .thread_request_handler = compress_thread_data_handler,
> >> +    .thread_request_done = compress_thread_data_done,
> >> +    .request_size = sizeof(CompressData),
> >> +};
> >> +
> >> +static Threads *compress_threads;
> >> +
> >>  static bool save_page_use_compression(RAMState *rs);
> >>  
> >>  static void flush_compressed_data(RAMState *rs)
> >>  {
> >> -    int idx, len, thread_count;
> >> -
> >>      if (!save_page_use_compression(rs)) {
> >>          return;
> >>      }
> >> -    thread_count = migrate_compress_threads();
> >>  
> >> -    qemu_mutex_lock(&comp_done_lock);
> >> -    for (idx = 0; idx < thread_count; idx++) {
> >> -        while (!comp_param[idx].done) {
> >> -            qemu_cond_wait(&comp_done_cond, &comp_done_lock);
> >> -        }
> >> -    }
> >> -    qemu_mutex_unlock(&comp_done_lock);
> >> +    threaded_workqueue_wait_for_requests(compress_threads);
> >> +}
> >>  
> >> -    for (idx = 0; idx < thread_count; idx++) {
> >> -        qemu_mutex_lock(&comp_param[idx].mutex);
> >> -        if (!comp_param[idx].quit) {
> >> -            len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
> >> -            /*
> >> -             * it's safe to fetch zero_page without holding comp_done_lock
> >> -             * as there is no further request submitted to the thread,
> >> -             * i.e, the thread should be waiting for a request at this point.
> >> -             */
> >> -            update_compress_thread_counts(&comp_param[idx], len);
> >> -        }
> >> -        qemu_mutex_unlock(&comp_param[idx].mutex);
> >> +static void compress_threads_save_cleanup(void)
> >> +{
> >> +    if (!compress_threads) {
> >> +        return;
> >>      }
> >> +
> >> +    threaded_workqueue_destroy(compress_threads);
> >> +    compress_threads = NULL;
> >>  }
> >>  
> >> -static inline void set_compress_params(CompressParam *param, RAMBlock *block,
> >> -                                       ram_addr_t offset)
> >> +static int compress_threads_save_setup(void)
> >>  {
> >> -    param->block = block;
> >> -    param->offset = offset;
> >> +    if (!migrate_use_compression()) {
> >> +        return 0;
> >> +    }
> >> +
> >> +    compress_threads = threaded_workqueue_create("compress",
> >> +                                migrate_compress_threads(),
> >> +                                DEFAULT_THREAD_REQUEST_NR, &compress_ops);
> >> +    return compress_threads ? 0 : -1;
> >>  }
> >>  
> >>  static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
> >>                                             ram_addr_t offset)
> >>  {
> >> -    int idx, thread_count, bytes_xmit = -1, pages = -1;
> >> +    CompressData *cd;
> >>      bool wait = migrate_compress_wait_thread();
> >>  
> >> -    thread_count = migrate_compress_threads();
> >> -    qemu_mutex_lock(&comp_done_lock);
> >>  retry:
> >> -    for (idx = 0; idx < thread_count; idx++) {
> >> -        if (comp_param[idx].done) {
> >> -            comp_param[idx].done = false;
> >> -            bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
> >> -            qemu_mutex_lock(&comp_param[idx].mutex);
> >> -            set_compress_params(&comp_param[idx], block, offset);
> >> -            qemu_cond_signal(&comp_param[idx].cond);
> >> -            qemu_mutex_unlock(&comp_param[idx].mutex);
> >> -            pages = 1;
> >> -            update_compress_thread_counts(&comp_param[idx], bytes_xmit);
> >> -            break;
> >> +    cd = threaded_workqueue_get_request(compress_threads);
> >> +    if (!cd) {
> >> +        /*
> >> +         * wait for the free thread if the user specifies
> >> +         * 'compress-wait-thread', otherwise we will post
> >> +         *  the page out in the main thread as normal page.
> >> +         */
> >> +        if (wait) {
> >> +            cpu_relax();
> >> +            goto retry;
> > 
> > Is there nothing better we can use to wait without eating CPU time?
> 
> There is a mechanism to wait without eating CPU time in the data
> structure, but it makes sense to busy wait.  There are 4 threads in the
> workqueue, so you have to compare 1/4th of the time spent compressing a
> page, with the trip into the kernel to wake you up.  You're adding 20%
> CPU usage, but I'm not surprised it's worthwhile.

Hmm OK; in that case it does at least need a comment because it's a bit
odd, and we should watch out for how that scales - I guess it's less of
an overhead the more threads you use.

Dave

> Paolo
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-22  7:20   ` [Qemu-devel] " guangrong.xiao
@ 2018-11-24  0:12     ` Emilio G. Cota
  -1 siblings, 0 replies; 70+ messages in thread
From: Emilio G. Cota @ 2018-11-24  0:12 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	quintela, wei.w.wang, jiang.biao2, pbonzini

On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.xiao@gmail.com wrote:
> +   /*
> +     * the bit in these two bitmaps indicates the index of the @requests

This @ is not ASCII, is it?

> +     * respectively. If it's the same, the corresponding request is free
> +     * and owned by the user, i.e, where the user fills a request. Otherwise,
> +     * it is valid and owned by the thread, i.e, where the thread fetches
> +     * the request and write the result.
> +     */
> +
> +    /* after the user fills the request, the bit is flipped. */
> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> +    /* after handles the request, the thread flips the bit. */
> +    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);

Use DECLARE_BITMAP, otherwise you'll get type errors as David
pointed out.
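
A minimal sketch of what that would look like (field names and alignment
taken from your patch; the bitmap_* helpers expect unsigned long storage,
which DECLARE_BITMAP provides):

    /* DECLARE_BITMAP(name, bits) expands to an unsigned long array of
     * BITS_TO_LONGS(bits) elements -- the type bitmap_* operate on. */
    DECLARE_BITMAP(request_fill_bitmap, 64) QEMU_ALIGNED(SMP_CACHE_BYTES);
    DECLARE_BITMAP(request_done_bitmap, 64) QEMU_ALIGNED(SMP_CACHE_BYTES);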

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-22  7:20   ` [Qemu-devel] " guangrong.xiao
@ 2018-11-24  0:17     ` Emilio G. Cota
  -1 siblings, 0 replies; 70+ messages in thread
From: Emilio G. Cota @ 2018-11-24  0:17 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	quintela, wei.w.wang, jiang.biao2, pbonzini

On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.xiao@gmail.com wrote:
> +static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
> +{
> +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
> +
> +    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
> +    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
> +    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
> +               threads->thread_requests_nr);

This is not wrong, but it's a bit ugly. Instead, I would:

- Introduce bitmap_xor_atomic in a previous patch
- Use bitmap_xor_atomic here, getting rid of the rcu reads
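
Something along these lines, perhaps (a sketch only -- bitmap_xor_atomic
does not exist in the tree, and the relaxed per-word loads are the whole
point):

    static inline void bitmap_xor_atomic(unsigned long *dst,
                                         const unsigned long *src1,
                                         const unsigned long *src2,
                                         long nbits)
    {
        long k, nr = BITS_TO_LONGS(nbits);

        for (k = 0; k < nr; k++) {
            /* read each word once, atomically, then xor the snapshots */
            dst[k] = atomic_read(&src1[k]) ^ atomic_read(&src2[k]);
        }
    }

That would keep the atomicity reasoning inside the helper instead of at
every call site.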

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-23 11:02     ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-11-26  7:57       ` Xiao Guangrong
  -1 siblings, 0 replies; 70+ messages in thread
From: Xiao Guangrong @ 2018-11-26  7:57 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx, quintela,
	wei.w.wang, cota, jiang.biao2, pbonzini



On 11/23/18 7:02 PM, Dr. David Alan Gilbert wrote:

>> +#include "qemu/osdep.h"
>> +#include "qemu/bitmap.h"
>> +#include "qemu/threaded-workqueue.h"
>> +
>> +#define SMP_CACHE_BYTES 64
> 
> That's architecture dependent isn't it?
> 

Yes, it's arch dependent indeed.

I just used 64 for simplicity, and I think the line size is <= 64 on most
CPU arches, so that should work.

Should I introduce a statically defined cache line size for all arches? :(
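
Something like the following could work if a compile-time value is wanted
(a sketch; the 128-byte case is just an example, and util/cacheinfo.c
already probes the real line size at startup if a runtime value is
acceptable):

    /* conservative per-arch defaults, illustrative only */
    #if defined(__powerpc64__)
    # define SMP_CACHE_BYTES 128
    #else
    # define SMP_CACHE_BYTES 64
    #endif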

>> +   /*
>> +     * the bit in these two bitmaps indicates the index of the @requests
>> +     * respectively. If it's the same, the corresponding request is free
>> +     * and owned by the user, i.e, where the user fills a request. Otherwise,
>> +     * it is valid and owned by the thread, i.e, where the thread fetches
>> +     * the request and write the result.
>> +     */
>> +
>> +    /* after the user fills the request, the bit is flipped. */
>> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
>> +    /* after handles the request, the thread flips the bit. */
>> +    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> 
> Patchew complained about some type mismatches; I think those are because
> you're using the bitmap_* functions on these; those functions always
> operate on 'long' not on uint64_t - and on some platforms they're
> unfortunately not the same.

I guess you were talking about this error:
ERROR: externs should be avoided in .c files
#233: FILE: util/threaded-workqueue.c:65:
+    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);

The thing being complained about is "QEMU_ALIGNED(SMP_CACHE_BYTES)", as the
error goes away once the alignment attribute is removed...

The issue you pointed out can be avoided by using a type cast, like:
bitmap_xor(..., (void *)&thread->request_fill_bitmap)
can't it?

Thanks!

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 3/5] migration: use threaded workqueue for compression
  2018-11-23 18:29         ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-11-26  8:00           ` Xiao Guangrong
  -1 siblings, 0 replies; 70+ messages in thread
From: Xiao Guangrong @ 2018-11-26  8:00 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Paolo Bonzini
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx, quintela,
	wei.w.wang, cota, jiang.biao2



On 11/24/18 2:29 AM, Dr. David Alan Gilbert wrote:

>>>>   static void
>>>> -update_compress_thread_counts(const CompressParam *param, int bytes_xmit)
>>>> +update_compress_thread_counts(CompressData *cd, int bytes_xmit)
>>>
>>> Keep the const?

Yes, indeed. Will correct it in the next version.

>>>> +    if (deflateInit(&cd->stream, migrate_compress_level()) != Z_OK) {
>>>> +        g_free(cd->originbuf);
>>>> +        return -1;
>>>> +    }
>>>
>>> Please print errors if you fail in any case so we can easily tell what
>>> happened.

Sure, will do.

>>>> +        if (wait) {
>>>> +            cpu_relax();
>>>> +            goto retry;
>>>
>>> Is there nothing better we can use to wait without eating CPU time?
>>
>> There is a mechanism to wait without eating CPU time in the data
>> structure, but it makes sense to busy wait.  There are 4 threads in the
>> workqueue, so you have to compare 1/4th of the time spent compressing a
>> page, with the trip into the kernel to wake you up.  You're adding 20%
>> CPU usage, but I'm not surprised it's worthwhile.
> 
> Hmm OK; in that case it does at least need a comment because it's a bit
> odd, and we should watch out for how that scales - I guess it's less of
> an overhead the more threads you use.
> 

Sure, will add some comments to explain the purpose.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-24  0:12     ` [Qemu-devel] " Emilio G. Cota
@ 2018-11-26  8:06       ` Xiao Guangrong
  -1 siblings, 0 replies; 70+ messages in thread
From: Xiao Guangrong @ 2018-11-26  8:06 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	quintela, wei.w.wang, jiang.biao2, pbonzini



On 11/24/18 8:12 AM, Emilio G. Cota wrote:
> On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.xiao@gmail.com wrote:
>> +   /*
>> +     * the bit in these two bitmaps indicates the index of the @requests
> 
> This @ is not ASCII, is it?
> 

Good eyes. :)

Will fix it.

>> +     * respectively. If it's the same, the corresponding request is free
>> +     * and owned by the user, i.e, where the user fills a request. Otherwise,
>> +     * it is valid and owned by the thread, i.e, where the thread fetches
>> +     * the request and write the result.
>> +     */
>> +
>> +    /* after the user fills the request, the bit is flipped. */
>> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
>> +    /* after handles the request, the thread flips the bit. */
>> +    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> 
> Use DECLARE_BITMAP, otherwise you'll get type errors as David
> pointed out.

If we do it, the field becomes a pointer... that complicates
things.

Hmm, I am using the same trick the kvm module applies when it handles
vcpu->requests:
static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
{
	return test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
}

Is it good?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-24  0:17     ` [Qemu-devel] " Emilio G. Cota
@ 2018-11-26  8:18       ` Xiao Guangrong
  -1 siblings, 0 replies; 70+ messages in thread
From: Xiao Guangrong @ 2018-11-26  8:18 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	quintela, wei.w.wang, jiang.biao2, pbonzini



On 11/24/18 8:17 AM, Emilio G. Cota wrote:
> On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.xiao@gmail.com wrote:
>> +static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
>> +{
>> +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
>> +
>> +    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
>> +    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
>> +    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
>> +               threads->thread_requests_nr);
> 
> This is not wrong, but it's a bit ugly. Instead, I would:
> 
> - Introduce bitmap_xor_atomic in a previous patch
> - Use bitmap_xor_atomic here, getting rid of the rcu reads

Hmm, however, we do not need an atomic xor operation here... that should be
slower than just two READ_ONCE calls.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-26  8:18       ` [Qemu-devel] " Xiao Guangrong
@ 2018-11-26 10:28         ` Paolo Bonzini
  -1 siblings, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2018-11-26 10:28 UTC (permalink / raw)
  To: Xiao Guangrong, Emilio G. Cota
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	quintela, wei.w.wang, jiang.biao2

On 26/11/18 09:18, Xiao Guangrong wrote:
>>
>>> +static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
>>> +{
>>> +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
>>> +
>>> +    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
>>> +    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
>>> +    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
>>> +               threads->thread_requests_nr);
>>
>> This is not wrong, but it's a bit ugly. Instead, I would:
>>
>> - Introduce bitmap_xor_atomic in a previous patch
>> - Use bitmap_xor_atomic here, getting rid of the rcu reads
> 
> Hmm, however, we do not need atomic xor operation here... that should be
> slower than
> just two READ_ONCE calls.

Yeah, I'd just go with Guangrong's version.  Alternatively, add
find_{first,next}_{same,different}_bit functions (whatever subset of the
4 you need).
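
For instance, one of the four might look like this (hypothetical, a sketch
only):

    /* return the index of the first bit that differs, or nbits if none */
    static inline long find_first_different_bit(const unsigned long *a,
                                                const unsigned long *b,
                                                long nbits)
    {
        long i;

        for (i = 0; i < nbits; i += BITS_PER_LONG) {
            unsigned long diff = a[BIT_WORD(i)] ^ b[BIT_WORD(i)];
            if (diff) {
                return MIN(i + __builtin_ctzl(diff), nbits);
            }
        }
        return nbits;
    }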

Paolo

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-26  7:57       ` [Qemu-devel] " Xiao Guangrong
@ 2018-11-26 10:56         ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 70+ messages in thread
From: Dr. David Alan Gilbert @ 2018-11-26 10:56 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx, quintela,
	wei.w.wang, cota, jiang.biao2, pbonzini

* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> 
> On 11/23/18 7:02 PM, Dr. David Alan Gilbert wrote:
> 
> > > +#include "qemu/osdep.h"
> > > +#include "qemu/bitmap.h"
> > > +#include "qemu/threaded-workqueue.h"
> > > +
> > > +#define SMP_CACHE_BYTES 64
> > 
> > That's architecture dependent isn't it?
> > 
> 
> Yes, it's arch dependent indeed.
> 
> I just used 64 for simplicity and I think it is <= 64 on most CPU arches,
> so that can work.
> 
> Should I introduce a statically defined cache line size for all arches? :(

I think it depends why you need it; but we shouldn't have a constant
that is wrong, and we shouldn't define something architecture dependent
in here.

> > > +   /*
> > > +     * the bit in these two bitmaps indicates the index of the @requests
> > > +     * respectively. If it's the same, the corresponding request is free
> > > +     * and owned by the user, i.e, where the user fills a request. Otherwise,
> > > +     * it is valid and owned by the thread, i.e, where the thread fetches
> > > +     * the request and write the result.
> > > +     */
> > > +
> > > +    /* after the user fills the request, the bit is flipped. */
> > > +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> > > +    /* after handles the request, the thread flips the bit. */
> > > +    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> > 
> > Patchew complained about some type mismatches; I think those are because
> > you're using the bitmap_* functions on these; those functions always
> > operate on 'long' not on uint64_t - and on some platforms they're
> > unfortunately not the same.
> 
> I guess you were talking about this error:
> ERROR: externs should be avoided in .c files
> #233: FILE: util/threaded-workqueue.c:65:
> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> 
> The complaint is about "QEMU_ALIGNED(SMP_CACHE_BYTES)", as it goes away
> when the alignment is removed...
> 
> The issue you pointed out can be avoided by using type-casting, like:
> bitmap_xor(..., (void *)&thread->request_fill_bitmap)
> can't we?

I thought the error was just due to long vs uint64_t rather than the
QEMU_ALIGNED.  I don't think it's just a casting problem, since I don't
think longs are always 64-bit.

Dave

> Thanks!
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-26  8:06       ` [Qemu-devel] " Xiao Guangrong
@ 2018-11-26 18:49         ` Emilio G. Cota
  -1 siblings, 0 replies; 70+ messages in thread
From: Emilio G. Cota @ 2018-11-26 18:49 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	quintela, wei.w.wang, jiang.biao2, pbonzini

On Mon, Nov 26, 2018 at 16:06:37 +0800, Xiao Guangrong wrote:
> > > +    /* after the user fills the request, the bit is flipped. */
> > > +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> > > +    /* after handles the request, the thread flips the bit. */
> > > +    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> > 
> > Use DECLARE_BITMAP, otherwise you'll get type errors as David
> > pointed out.
> 
> If we do it, the field becomes a pointer... that complicates
> things.

Not necessarily, see below.

On Mon, Nov 26, 2018 at 16:18:24 +0800, Xiao Guangrong wrote:
> On 11/24/18 8:17 AM, Emilio G. Cota wrote:
> > On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.xiao@gmail.com wrote:
> > > +static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
> > > +{
> > > +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
> > > +
> > > +    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
> > > +    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
> > > +    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
> > > +               threads->thread_requests_nr);
> > 
> > This is not wrong, but it's a bit ugly. Instead, I would:
> > 
> > - Introduce bitmap_xor_atomic in a previous patch
> > - Use bitmap_xor_atomic here, getting rid of the rcu reads
> 
> Hmm, however, we do not need an atomic xor operation here... that should be slower than
> just two READ_ONCE calls.

If you use DECLARE_BITMAP, you get an in-place array. On a 64-bit
host, that'd be
	unsigned long foo[1]; /* [2] on 32-bit */

Then again on 64-bit hosts, bitmap_xor_atomic would reduce
to 2 atomic reads:

static inline void bitmap_xor_atomic(unsigned long *dst,
                                     const unsigned long *src1,
                                     const unsigned long *src2, long nbits)
{
    if (small_nbits(nbits)) {
        *dst = atomic_read(src1) ^ atomic_read(src2);
    } else {
        slow_bitmap_xor_atomic(dst, src1, src2, nbits);
    }
}

So you can either do the above, or just define an unsigned long
instead of a u64 and keep doing what you're doing in this series,
but bearing in mind that the max on 32-bit hosts will be 32. But
that's no big deal since those machines won't have many cores
anyway.
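
For illustration, a DECLARE_BITMAP version of the two fields from the patch
(a sketch only; the QEMU_ALIGNED directives are elided here):

    /* expands to: unsigned long name[BITS_TO_LONGS(bits)], i.e. in place */
    DECLARE_BITMAP(request_fill_bitmap, MAX_THREAD_REQUEST_NR);
    DECLARE_BITMAP(request_done_bitmap, MAX_THREAD_REQUEST_NR);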

		Emilio

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-26  7:57       ` [Qemu-devel] " Xiao Guangrong
@ 2018-11-26 18:55         ` Emilio G. Cota
  -1 siblings, 0 replies; 70+ messages in thread
From: Emilio G. Cota @ 2018-11-26 18:55 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	Dr. David Alan Gilbert, quintela, wei.w.wang, jiang.biao2,
	pbonzini

On Mon, Nov 26, 2018 at 15:57:25 +0800, Xiao Guangrong wrote:
> 
> 
> On 11/23/18 7:02 PM, Dr. David Alan Gilbert wrote:
> 
> > > +#include "qemu/osdep.h"
> > > +#include "qemu/bitmap.h"
> > > +#include "qemu/threaded-workqueue.h"
> > > +
> > > +#define SMP_CACHE_BYTES 64
> > 
> > That's architecture dependent isn't it?
> > 
> 
> Yes, it's arch dependent indeed.
> 
> I just used 64 for simplicity and I think it is <= 64 on most CPU arches,
> so that can work.
> 
> Should I introduce a statically defined cache line size for all arches? :(

No, at compile-time this is impossible to know.

We do query this info at run-time though (see util/cacheinfo.c),
but using that info here would complicate things too much.

You can just give it a different name, and perhaps add a comment.
See for instance what we do in qht.c with QHT_BUCKET_ALIGN.
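
A sketch of that approach here (THREAD_QUEUE_ALIGN is the name picked up
later in this thread; 64 is an assumed conservative value, not a queried one):

/*
 * Alignment used to keep hot fields on separate cache lines.  The real
 * cache line size is only known at run time (see util/cacheinfo.c);
 * 64 bytes covers the common case on current hosts.
 */
#define THREAD_QUEUE_ALIGN 64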

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-26 10:56         ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-11-27  7:17           ` Xiao Guangrong
  -1 siblings, 0 replies; 70+ messages in thread
From: Xiao Guangrong @ 2018-11-27  7:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx, quintela,
	wei.w.wang, cota, jiang.biao2, pbonzini



On 11/26/18 6:56 PM, Dr. David Alan Gilbert wrote:
> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>
>>
>> On 11/23/18 7:02 PM, Dr. David Alan Gilbert wrote:
>>
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu/bitmap.h"
>>>> +#include "qemu/threaded-workqueue.h"
>>>> +
>>>> +#define SMP_CACHE_BYTES 64
>>>
>>> That's architecture dependent isn't it?
>>>
>>
>> Yes, it's arch dependent indeed.
>>
>> I just used 64 for simplicity and I think it is <= 64 on most CPU arches,
>> so that can work.
>>
>> Should I introduce a statically defined cache line size for all arches? :(
> 
> I think it depends why you need it; but we shouldn't have a constant
> that is wrong, and we shouldn't define something architecture dependent
> in here.
> 

I see. I will address Emilio's suggestion to rename SMP_CACHE_BYTES to
THREAD_QUEUE_ALIGN and add additional comments.

>>>> +   /*
>>>> +     * the bit in these two bitmaps indicates the index of the @requests
>>>> +     * respectively. If it's the same, the corresponding request is free
>>>> +     * and owned by the user, i.e, where the user fills a request. Otherwise,
>>>> +     * it is valid and owned by the thread, i.e, where the thread fetches
>>>> +     * the request and write the result.
>>>> +     */
>>>> +
>>>> +    /* after the user fills the request, the bit is flipped. */
>>>> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
>>>> +    /* after handles the request, the thread flips the bit. */
>>>> +    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
>>>
>>> Patchew complained about some type mismatches; I think those are because
>>> you're using the bitmap_* functions on these; those functions always
>>> operate on 'long' not on uint64_t - and on some platforms they're
>>> unfortunately not the same.
>>
>> I guess you were talking about this error:
>> ERROR: externs should be avoided in .c files
>> #233: FILE: util/threaded-workqueue.c:65:
>> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
>>
>> The complaint is about "QEMU_ALIGNED(SMP_CACHE_BYTES)", as it goes away
>> when the alignment is removed...
>>
>> The issue you pointed out can be avoided by using type-casting, like:
>> bitmap_xor(..., (void *)&thread->request_fill_bitmap)
>> can't we?
> 
> I thought the error was just due to long vs uint64_t rather than the
> QEMU_ALIGNED.  I don't think it's just a casting problem, since I don't
> think longs are always 64-bit.

Well, I made some adjustments that make checkpatch.pl really happy :),
as follows:
$ git diff util/
diff --git a/util/threaded-workqueue.c b/util/threaded-workqueue.c
index 2ab37cee8d..e34c65a8eb 100644
--- a/util/threaded-workqueue.c
+++ b/util/threaded-workqueue.c
@@ -62,21 +62,30 @@ struct ThreadLocal {
       */

      /* after the user fills the request, the bit is flipped. */
-    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
+    struct {
+        uint64_t request_fill_bitmap;
+    } QEMU_ALIGNED(SMP_CACHE_BYTES);
+
      /* after handles the request, the thread flips the bit. */
-    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
+    struct {
+        uint64_t request_done_bitmap;
+    } QEMU_ALIGNED(SMP_CACHE_BYTES);

      /*
       * the event used to wake up the thread whenever a valid request has
       * been submitted
       */
-    QemuEvent request_valid_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
+    struct {
+        QemuEvent request_valid_ev;
+    } QEMU_ALIGNED(SMP_CACHE_BYTES);

      /*
       * the event is notified whenever a request has been completed
       * (i.e, become free), which is used to wake up the user
       */
-    QemuEvent request_free_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
+    struct {
+        QemuEvent request_free_ev;
+    } QEMU_ALIGNED(SMP_CACHE_BYTES);
  };
  typedef struct ThreadLocal ThreadLocal;

$ ./scripts/checkpatch.pl -f util/threaded-workqueue.c
total: 0 errors, 0 warnings, 472 lines checked

util/threaded-workqueue.c has no obvious style problems and is ready for submission.

checkpatch.pl somehow treated QEMU_ALIGNED as a function before the modification.

And yes, u64 is not a long type on a 32-bit arch, it's long[2] instead. That's fine
when we pass the &(u64) to a function whose parameter is (long *). I think this
trick is widely used, e.g., the example in kvm that I mentioned to Emilio:

static inline bool kvm_test_request(int req, struct kvm_vcpu *vcpu)
{
     return test_bit(req & KVM_REQUEST_MASK, (void *)&vcpu->requests);
}

^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-26 18:49         ` [Qemu-devel] " Emilio G. Cota
@ 2018-11-27  8:29           ` Xiao Guangrong
  -1 siblings, 0 replies; 70+ messages in thread
From: Xiao Guangrong @ 2018-11-27  8:29 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	quintela, wei.w.wang, jiang.biao2, pbonzini



On 11/27/18 2:49 AM, Emilio G. Cota wrote:
> On Mon, Nov 26, 2018 at 16:06:37 +0800, Xiao Guangrong wrote:
>>>> +    /* after the user fills the request, the bit is flipped. */
>>>> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
>>>> +    /* after handles the request, the thread flips the bit. */
>>>> +    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
>>>
>>> Use DECLARE_BITMAP, otherwise you'll get type errors as David
>>> pointed out.
>>
>> If we do it, the field becomes a pointer... that complicates the
>> thing.
> 
> Not necessarily, see below.
> 
> On Mon, Nov 26, 2018 at 16:18:24 +0800, Xiao Guangrong wrote:
>> On 11/24/18 8:17 AM, Emilio G. Cota wrote:
>>> On Thu, Nov 22, 2018 at 15:20:25 +0800, guangrong.xiao@gmail.com wrote:
>>>> +static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
>>>> +{
>>>> +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
>>>> +
>>>> +    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
>>>> +    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
>>>> +    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
>>>> +               threads->thread_requests_nr);
>>>
>>> This is not wrong, but it's a bit ugly. Instead, I would:
>>>
>>> - Introduce bitmap_xor_atomic in a previous patch
>>> - Use bitmap_xor_atomic here, getting rid of the rcu reads
>>
>> Hmm, however, we do not need an atomic xor operation here... that should be slower than
>> just two READ_ONCE calls.
> 
> If you use DECLARE_BITMAP, you get an in-place array. On a 64-bit
> host, that'd be
> 	unsigned long foo[1]; /* [2] on 32-bit */
> 
> Then again on 64-bit hosts, bitmap_xor_atomic would reduce
> to 2 atomic reads:
> 
> static inline void bitmap_xor_atomic(unsigned long *dst,
>                                      const unsigned long *src1,
>                                      const unsigned long *src2, long nbits)
> {
>      if (small_nbits(nbits)) {
>          *dst = atomic_read(src1) ^ atomic_read(src2);
>      } else {
>          slow_bitmap_xor_atomic(dst, src1, src2, nbits);

We needn't do the xor operation in place, i.e., we just fetch the bitmaps
into local variables and do the xor locally.

So we need additional complexity to handle the !small_nbits(nbits) case...
but it is really not a big deal, as you said, it's just a couple of lines.

However, using u64 to express that only 64 indexes are allowed is more
straightforward and can be naturally understood. :)

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-26 18:55         ` [Qemu-devel] " Emilio G. Cota
@ 2018-11-27  8:30           ` Xiao Guangrong
  -1 siblings, 0 replies; 70+ messages in thread
From: Xiao Guangrong @ 2018-11-27  8:30 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	Dr. David Alan Gilbert, quintela, wei.w.wang, jiang.biao2,
	pbonzini



On 11/27/18 2:55 AM, Emilio G. Cota wrote:
> On Mon, Nov 26, 2018 at 15:57:25 +0800, Xiao Guangrong wrote:
>>
>>
>> On 11/23/18 7:02 PM, Dr. David Alan Gilbert wrote:
>>
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu/bitmap.h"
>>>> +#include "qemu/threaded-workqueue.h"
>>>> +
>>>> +#define SMP_CACHE_BYTES 64
>>>
>>> That's architecture dependent isn't it?
>>>
>>
>> Yes, it's arch dependent indeed.
>>
>> I just used 64 for simplicity and I think it is <= 64 on most CPU arches,
>> so that can work.
>>
>> Should I introduce a statically defined cache line size for all arches? :(
> 
> No, at compile-time this is impossible to know.
> 
> We do query this info at run-time though (see util/cacheinfo.c),
> but using that info here would complicate things too much.

I see.

> 
> You can just give it a different name, and perhaps add a comment.
> See for instance what we do in qht.c with QHT_BUCKET_ALIGN.

That's really a good lesson for me, I will follow it. :)

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-26 10:28         ` [Qemu-devel] " Paolo Bonzini
@ 2018-11-27  8:31           ` Xiao Guangrong
  -1 siblings, 0 replies; 70+ messages in thread
From: Xiao Guangrong @ 2018-11-27  8:31 UTC (permalink / raw)
  To: Paolo Bonzini, Emilio G. Cota
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	quintela, wei.w.wang, jiang.biao2



On 11/26/18 6:28 PM, Paolo Bonzini wrote:
> On 26/11/18 09:18, Xiao Guangrong wrote:
>>>
>>>> +static uint64_t get_free_request_bitmap(Threads *threads,
>>>> ThreadLocal *thread)
>>>> +{
>>>> +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
>>>> +
>>>> +    request_fill_bitmap =
>>>> atomic_rcu_read(&thread->request_fill_bitmap);
>>>> +    request_done_bitmap =
>>>> atomic_rcu_read(&thread->request_done_bitmap);
>>>> +    bitmap_xor(&result_bitmap, &request_fill_bitmap,
>>>> &request_done_bitmap,
>>>> +               threads->thread_requests_nr);
>>>
>>> This is not wrong, but it's a bit ugly. Instead, I would:
>>>
>>> - Introduce bitmap_xor_atomic in a previous patch
>>> - Use bitmap_xor_atomic here, getting rid of the rcu reads
>>
>> Hmm, however, we do not need an atomic xor operation here... that should
>> be slower than just two READ_ONCE calls.
> 
> Yeah, I'd just go with Guangrong's version.  Alternatively, add
> find_{first,next}_{same,different}_bit functions (whatever subset of the
> 4 you need).

That's good to me. Will try it. ;)

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-22  7:20   ` [Qemu-devel] " guangrong.xiao
@ 2018-11-27 12:49     ` Christophe de Dinechin
  -1 siblings, 0 replies; 70+ messages in thread
From: Christophe de Dinechin @ 2018-11-27 12:49 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	quintela, wei.w.wang, cota, jiang.biao2, Paolo Bonzini

(I did not finish the review, but decided to send what I already had).


> On 22 Nov 2018, at 08:20, guangrong.xiao@gmail.com wrote:
> 
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> This modules implements the lockless and efficient threaded workqueue.

I’m not entirely convinced that it’s either “lockless” or “efficient”
in the current iteration. I believe that it’s relatively easy to fix, though.

> 
> Three abstracted objects are used in this module:
> - Request.
>     It not only contains the data that the workqueue fetches out
>    to finish the request but also offers the space to save the result
>    after the workqueue handles the request.
> 
>    It's flowed between user and workqueue. The user fills the request
>    data into it when it is owned by user. After it is submitted to the
>    workqueue, the workqueue fetched data out and save the result into
>    it after the request is handled.

fetched -> fetches
save -> saves

> 
>    All the requests are pre-allocated and carefully partitioned between
>    threads so there is no contention on the request, that make threads
>    be parallel as much as possible.

That sentence confused me (it’s also in a comment in the text).
I think I’m mostly confused by “there is no contention”. Perhaps you
meant “so as to avoid contention if possible”? If there is a reason
why there would never be any contention even if requests arrive faster than
completions, I did not figure it out.

I personally see serious contention on the fields in the Threads structure,
for example, but also possibly on the targets of the “modulo” operation in
threads_find_free_request. Specifically, if three CPUs are entering
threads_find_free_request at the same time, they will all run the same
loop and all, presumably, “attack” the same memory locations.

Sorry if I mis-read the code, but at the moment, it does not seem to
avoid contention as intended. I don’t see how it could without having
some way to discriminate between CPUs to start with, which I did not find.


> 
> - User, i.e, the submitter
>    It's the one fills the request and submits it to the workqueue,
the one -> the one who
>    the result will be collected after it is handled by the work queue.
> 
>    The user can consecutively submit requests without waiting the previous
waiting -> waiting for
>    requests been handled.
>    It only supports one submitter, you should do serial submission by
>    yourself if you want more, e.g, use lock on you side.

I’m also confused by this last statement. The proposal purports
to be “lockless”, which I read as working correctly without a lock…
Reading the code, I indeed see issues if different threads
try to place requests at the same time. So I believe the word
“lockless” is a bit misleading.

> 
> - Workqueue, i.e, thread
>    Each workqueue is represented by a running thread that fetches
>    the request submitted by the user, do the specified work and save
>    the result to the request.
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
> include/qemu/threaded-workqueue.h | 106 +++++++++
> util/Makefile.objs                |   1 +
> util/threaded-workqueue.c         | 463 ++++++++++++++++++++++++++++++++++++++
> 3 files changed, 570 insertions(+)
> create mode 100644 include/qemu/threaded-workqueue.h
> create mode 100644 util/threaded-workqueue.c
> 
> diff --git a/include/qemu/threaded-workqueue.h b/include/qemu/threaded-workqueue.h
> new file mode 100644
> index 0000000000..e0ede496d0
> --- /dev/null
> +++ b/include/qemu/threaded-workqueue.h
> @@ -0,0 +1,106 @@
> +/*
> + * Lockless and Efficient Threaded Workqueue Abstraction
> + *
> + * Author:
> + *   Xiao Guangrong <xiaoguangrong@tencent.com>
> + *
> + * Copyright(C) 2018 Tencent Corporation.
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
> + * See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#ifndef QEMU_THREADED_WORKQUEUE_H
> +#define QEMU_THREADED_WORKQUEUE_H
> +
> +#include "qemu/queue.h"
> +#include "qemu/thread.h"
> +
> +/*
> + * This modules implements the lockless and efficient threaded workqueue.
> + *
> + * Three abstracted objects are used in this module:
> + * - Request.
> + *   It not only contains the data that the workqueue fetches out
> + *   to finish the request but also offers the space to save the result
> + *   after the workqueue handles the request.
> + *
> + *   It's flowed between user and workqueue. The user fills the request
> + *   data into it when it is owned by user. After it is submitted to the
> + *   workqueue, the workqueue fetched data out and save the result into
> + *   it after the request is handled.
> + *
> + *   All the requests are pre-allocated and carefully partitioned between
> + *   threads so there is no contention on the request, that make threads
> + *   be parallel as much as possible.
> + *
> + * - User, i.e, the submitter
> + *   It's the one fills the request and submits it to the workqueue,
> + *   the result will be collected after it is handled by the work queue.
> + *
> + *   The user can consecutively submit requests without waiting the previous
> + *   requests been handled.
> + *   It only supports one submitter, you should do serial submission by
> + *   yourself if you want more, e.g, use lock on you side.
> + *
> + * - Workqueue, i.e, thread
> + *   Each workqueue is represented by a running thread that fetches
> + *   the request submitted by the user, do the specified work and save
> + *   the result to the request.
> + */
> +
> +typedef struct Threads Threads;
> +
> +struct ThreadedWorkqueueOps {
> +    /* constructor of the request */
> +    int (*thread_request_init)(void *request);
> +    /*  destructor of the request */
> +    void (*thread_request_uninit)(void *request);
> +
> +    /* the handler of the request that is called by the thread */
> +    void (*thread_request_handler)(void *request);
> +    /* called by the user after the request has been handled */
> +    void (*thread_request_done)(void *request);
> +
> +    size_t request_size;
> +};
> +typedef struct ThreadedWorkqueueOps ThreadedWorkqueueOps;
> +
> +/* the default number of requests that thread need handle */
> +#define DEFAULT_THREAD_REQUEST_NR 4
> +/* the max number of requests that thread need handle */
> +#define MAX_THREAD_REQUEST_NR     (sizeof(uint64_t) * BITS_PER_BYTE)
> +
> +/*
> + * create a threaded queue. Other APIs will work on the Threads it returned
> + *
> + * @name: the identity of the workqueue which is used to construct the name
> + *    of threads only
> + * @threads_nr: the number of threads that the workqueue will create
> + * @thread_requests_nr: the number of requests that each single thread will
> + *    handle
> + * @ops: the handlers of the request
> + *
> + * Return NULL if it failed
> + */
> +Threads *threaded_workqueue_create(const char *name, unsigned int threads_nr,
> +                                   unsigned int thread_requests_nr,
> +                                   const ThreadedWorkqueueOps *ops);
> +void threaded_workqueue_destroy(Threads *threads);
> +
> +/*
> + * find a free request where the user can store the data that is needed to
> + * finish the request
> + *
> + * If all requests are used up, return NULL
> + */
> +void *threaded_workqueue_get_request(Threads *threads);

Using void * to represent the payload makes it easy to get
the wrong pointer in there without the compiler noticing.
Consider adding a type for the payload?
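
For instance, a sketch with a hypothetical opaque request type in place of
void * (only the ThreadedWorkqueueRequest name is new, the functions are
from the patch):

typedef struct ThreadedWorkqueueRequest ThreadedWorkqueueRequest;

ThreadedWorkqueueRequest *threaded_workqueue_get_request(Threads *threads);
void threaded_workqueue_submit_request(Threads *threads,
                                       ThreadedWorkqueueRequest *request);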


> +/* submit the request and notify the thread */
> +void threaded_workqueue_submit_request(Threads *threads, void *request);
> +
> +/*
> + * wait all threads to complete the request to make sure there is no
> + * previous request exists
> + */
> +void threaded_workqueue_wait_for_requests(Threads *threads);
> +#endif
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index 0820923c18..f26dfe5182 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -50,5 +50,6 @@ util-obj-y += range.o
> util-obj-y += stats64.o
> util-obj-y += systemd.o
> util-obj-y += iova-tree.o
> +util-obj-y += threaded-workqueue.o
> util-obj-$(CONFIG_LINUX) += vfio-helpers.o
> util-obj-$(CONFIG_OPENGL) += drm.o
> diff --git a/util/threaded-workqueue.c b/util/threaded-workqueue.c
> new file mode 100644
> index 0000000000..2ab37cee8d
> --- /dev/null
> +++ b/util/threaded-workqueue.c
> @@ -0,0 +1,463 @@
> +/*
> + * Lockless and Efficient Threaded Workqueue Abstraction


> + *
> + * Author:
> + *   Xiao Guangrong <xiaoguangrong@tencent.com>
> + *
> + * Copyright(C) 2018 Tencent Corporation.
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
> + * See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/bitmap.h"
> +#include "qemu/threaded-workqueue.h"
> +
> +#define SMP_CACHE_BYTES 64

+1 on comments already made by others

> +
> +/*
> + * the request representation which contains the internally used mete data,

mete -> meta

> + * it is the header of user-defined data.
> + *
> + * It should be aligned to the nature size of CPU.
> + */
> +struct ThreadRequest {
> +    /*
> +     * the request has been handled by the thread and need the user
> +     * to fetch result out.
> +     */
> +    uint8_t done;
> +
> +    /*
> +     * the index to Thread::requests.
> +     * Save it to the padding space although it can be calculated at runtime.
> +     */
> +    uint8_t request_index;

So no more than 256?

This is blocked by the MAX_THREAD_REQUEST_NR test at the beginning
of threaded_workqueue_create, but I would make it more explicit either
with a compile-time assert that MAX_THREAD_REQUEST_NR is
below UINT8_MAX, or by adding a second test for UINT8_MAX in
threaded_workqueue_create.
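
For example (illustrative):

QEMU_BUILD_BUG_ON(MAX_THREAD_REQUEST_NR > UINT8_MAX);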

Also, an obvious extension would be to make bitmaps into arrays.

Do you think someone would want to use the package to assign
requests per CPU or per VCPU? If so, that could quickly go above 64.


> +
> +    /* the index to Threads::per_thread_data */
> +    unsigned int thread_index;

Don’t you want to use a size_t for that?

> +} QEMU_ALIGNED(sizeof(unsigned long));

Nit: the alignment type is inconsistent with that given
to QEMU_BUILD_BUG_ON in threaded_workqueue_create.
(long vs. unsigned long).

Also, why is the alignment required? Aren’t you more interested
in cache-line alignment?


> +typedef struct ThreadRequest ThreadRequest;


> +
> +struct ThreadLocal {
> +    struct Threads *threads;
> +
> +    /* the index of the thread */
> +    int self;

Why signed?

> +
> +    /* thread is useless and needs to exit */
> +    bool quit;
> +
> +    QemuThread thread;
> +
> +    void *requests;
> +
> +   /*
> +     * the bit in these two bitmaps indicates the index of the @requests
> +     * respectively. If it's the same, the corresponding request is free
> +     * and owned by the user, i.e, where the user fills a request. Otherwise,
> +     * it is valid and owned by the thread, i.e, where the thread fetches
> +     * the request and write the result.
> +     */
> +
> +    /* after the user fills the request, the bit is flipped. */
> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);

I believe you are trying to ensure that data accessed from multiple CPUs
is on different cache lines. As others have pointed out, the real value for
SMP_CACHE_BYTES can only be known at run-time. So this is not really
helping. Also, the ThreadLocal structure itself is not necessarily aligned
within struct Threads. Therefore, it’s possible that “requests” for example
could be on the same cache line as request_fill_bitmap if planets align
the wrong way.

In order to mitigate these effects, I would group the data that the user
writes and the data that the thread writes, i.e. reorder declarations,
put request_fill_bitmap and request_valid_ev together, and try
to put them in the same cache line so that only one cache line is invalidated
from within mark_request_valid instead of two.

Then you end up with a single alignment directive instead of 4, to
separate requests from completions.

That being said, I’m not sure why you use a bitmap here. What is the
expected benefit relative to atomic lists (which would also make it really
lock-free)?
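
For reference, a sketch of the kind of atomic-list construct meant here,
using the QSLIST_*_ATOMIC macros from qemu/queue.h (the list and field
names are illustrative):

struct ThreadRequest {
    QSLIST_ENTRY(ThreadRequest) node;
    /* ... payload ... */
};

QSLIST_HEAD(, ThreadRequest) free_list;

/* thread side: return a completed request, lock-free */
QSLIST_INSERT_HEAD_ATOMIC(&free_list, request, node);

/* user side: steal the whole free list in one shot */
QSLIST_HEAD(, ThreadRequest) local = QSLIST_HEAD_INITIALIZER(local);
QSLIST_MOVE_ATOMIC(&local, &free_list);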

> +    /* after handles the request, the thread flips the bit. */
> +    uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> +
> +    /*
> +     * the event used to wake up the thread whenever a valid request has
> +     * been submitted
> +     */
> +    QemuEvent request_valid_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
> +
> +    /*
> +     * the event is notified whenever a request has been completed
> +     * (i.e, become free), which is used to wake up the user
> +     */
> +    QemuEvent request_free_ev QEMU_ALIGNED(SMP_CACHE_BYTES);
> +};
> +typedef struct ThreadLocal ThreadLocal;


> +
> +/*
> + * the main data struct represents multithreads which is shared by
> + * all threads
> + */
> +struct Threads {
> +    /* the request header, ThreadRequest, is contained */
> +    unsigned int request_size;

size_t?

> +    unsigned int thread_requests_nr;
> +    unsigned int threads_nr;
> +
> +    /* the request is pushed to the thread with round-robin manner */
> +    unsigned int current_thread_index;
> +
> +    const ThreadedWorkqueueOps *ops;


> +
> +    ThreadLocal per_thread_data[0];
> +};
> +typedef struct Threads Threads;
> +
> +static ThreadRequest *index_to_request(ThreadLocal *thread, int request_index)
> +{
> +    ThreadRequest *request;
> +
> +    request = thread->requests + request_index * thread->threads->request_size;
> +    assert(request->request_index == request_index);
> +    assert(request->thread_index == thread->self);
> +    return request;
> +}
> +
> +static int request_to_index(ThreadRequest *request)
> +{
> +    return request->request_index;
> +}
> +
> +static int request_to_thread_index(ThreadRequest *request)
> +{
> +    return request->thread_index;
> +}
> +
> +/*
> + * free request: the request is not used by any thread, however, it might
> + *   contain the result need the user to call thread_request_done()

might contain the result -> might still contain the result
result need the user to call -> result. The user needs to call

> + *
> + * valid request: the request contains the request data and it's committed
> + *   to the thread, i,e. it's owned by thread.
> + */
> +static uint64_t get_free_request_bitmap(Threads *threads, ThreadLocal *thread)
> +{
> +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
> +
> +    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
> +    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
> +    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
> +               threads->thread_requests_nr);
> +
> +    /*
> +     * paired with smp_wmb() in mark_request_free() to make sure that we
> +     * read request_done_bitmap before fetching the result out.
> +     */
> +    smp_rmb();
> +
> +    return result_bitmap;
> +}

It seems that this part would be much simpler to understand using atomic lists.

> +
> +static ThreadRequest
> +*find_thread_free_request(Threads *threads, ThreadLocal *thread)
> +{
> +    uint64_t result_bitmap = get_free_request_bitmap(threads, thread);
> +    int index;
> +
> +    index  = find_first_zero_bit(&result_bitmap, threads->thread_requests_nr);
> +    if (index >= threads->thread_requests_nr) {
> +        return NULL;
> +    }
> +
> +    return index_to_request(thread, index);
> +}
> +
> +static ThreadRequest *threads_find_free_request(Threads *threads)
> +{
> +    ThreadLocal *thread;
> +    ThreadRequest *request;
> +    int cur_thread, thread_index;
> +
> +    cur_thread = threads->current_thread_index % threads->threads_nr;
> +    thread_index = cur_thread;
> +    do {
> +        thread = threads->per_thread_data + thread_index++;
> +        request = find_thread_free_request(threads, thread);
> +        if (request) {
> +            break;
> +        }
> +        thread_index %= threads->threads_nr;
> +    } while (thread_index != cur_thread);
> +
> +    return request;
> +}
> +
> +/*
> + * the change bit operation combined with READ_ONCE and WRITE_ONCE which
> + * only works on single uint64_t width
> + */
> +static void change_bit_once(long nr, uint64_t *addr)
> +{
> +    uint64_t value = atomic_rcu_read(addr) ^ BIT_MASK(nr);
> +
> +    atomic_rcu_set(addr, value);
> +}
> +
> +static void mark_request_valid(Threads *threads, ThreadRequest *request)
> +{
> +    int thread_index = request_to_thread_index(request);
> +    int request_index = request_to_index(request);
> +    ThreadLocal *thread = threads->per_thread_data + thread_index;
> +
> +    /*
> +     * paired with smp_rmb() in find_first_valid_request_index() to make
> +     * sure the request has been filled before the bit is flipped that
> +     * will make the request be visible to the thread
> +     */
> +    smp_wmb();
> +
> +    change_bit_once(request_index, &thread->request_fill_bitmap);
> +    qemu_event_set(&thread->request_valid_ev);
> +}
> +
> +static int thread_find_first_valid_request_index(ThreadLocal *thread)
> +{
> +    Threads *threads = thread->threads;
> +    uint64_t request_fill_bitmap, request_done_bitmap, result_bitmap;
> +    int index;
> +
> +    request_fill_bitmap = atomic_rcu_read(&thread->request_fill_bitmap);
> +    request_done_bitmap = atomic_rcu_read(&thread->request_done_bitmap);
> +    bitmap_xor(&result_bitmap, &request_fill_bitmap, &request_done_bitmap,
> +               threads->thread_requests_nr);
> +    /*
> +     * paired with smp_wmb() in mark_request_valid() to make sure that
> +     * we read request_fill_bitmap before fetch the request out.
> +     */
> +    smp_rmb();
> +
> +    index = find_first_bit(&result_bitmap, threads->thread_requests_nr);
> +    return index >= threads->thread_requests_nr ? -1 : index;
> +}
> +
> +static void mark_request_free(ThreadLocal *thread, ThreadRequest *request)
> +{
> +    int index = request_to_index(request);
> +
> +    /*
> +     * smp_wmb() is implied in change_bit_atomic() that is paired with
> +     * smp_rmb() in get_free_request_bitmap() to make sure the result
> +     * has been saved before the bit is flipped.
> +     */
> +    change_bit_atomic(index, &thread->request_done_bitmap);
> +    qemu_event_set(&thread->request_free_ev);
> +}
> +
> +/* retry to see if there is available request before actually go to wait. */
> +#define BUSY_WAIT_COUNT 1000
> +
> +static ThreadRequest *
> +thread_busy_wait_for_request(ThreadLocal *thread)
> +{
> +    int index, count = 0;
> +
> +    for (count = 0; count < BUSY_WAIT_COUNT; count++) {
> +        index = thread_find_first_valid_request_index(thread);
> +        if (index >= 0) {
> +            return index_to_request(thread, index);
> +        }
> +
> +        cpu_relax();
> +    }
> +
> +    return NULL;
> +}
> +
> +static void *thread_run(void *opaque)
> +{
> +    ThreadLocal *self_data = (ThreadLocal *)opaque;
> +    Threads *threads = self_data->threads;
> +    void (*handler)(void *request) = threads->ops->thread_request_handler;
> +    ThreadRequest *request;
> +
> +    for ( ; !atomic_read(&self_data->quit); ) {
> +        qemu_event_reset(&self_data->request_valid_ev);
> +
> +        request = thread_busy_wait_for_request(self_data);
> +        if (!request) {
> +            qemu_event_wait(&self_data->request_valid_ev);
> +            continue;
> +        }
> +
> +        assert(!request->done);
> +
> +        handler(request + 1);
> +        request->done = true;
> +        mark_request_free(self_data, request);
> +    }
> +
> +    return NULL;
> +}
> +
> +static void uninit_thread_requests(ThreadLocal *thread, int free_nr)
> +{
> +    Threads *threads = thread->threads;
> +    ThreadRequest *request = thread->requests;
> +    int i;
> +
> +    for (i = 0; i < free_nr; i++) {
> +        threads->ops->thread_request_uninit(request + 1);
> +        request = (void *)request + threads->request_size;

Despite GCC’s tolerance for it and rather lengthy debates,
pointer arithmetic on void * is illegal in C [1].

Consider using char * arithmetic, and using macros such as:

#define request_to_payload(req) (((ThreadRequest *) (req)) + 1)
#define payload_to_request(req) (((ThreadRequest *) (req)) - 1)
#define request_to_next(req, threads) \
    ((ThreadRequest *) ((char *) (req) + (threads)->request_size))

where appropriate, that would clarify the intent.
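
With those, the loop body in uninit_thread_requests, for example, could
read (illustrative):

    threads->ops->thread_request_uninit(request_to_payload(request));
    request = request_to_next(request, threads);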

[1] https://stackoverflow.com/questions/3523145/pointer-arithmetic-for-void-pointer-in-c

> +    }
> +    g_free(thread->requests);
> +}
> +
> +static int init_thread_requests(ThreadLocal *thread)
> +{
> +    Threads *threads = thread->threads;
> +    ThreadRequest *request;
> +    int ret, i, thread_reqs_size;
> +
> +    thread_reqs_size = threads->thread_requests_nr * threads->request_size;
> +    thread_reqs_size = QEMU_ALIGN_UP(thread_reqs_size, SMP_CACHE_BYTES);
> +    thread->requests = g_malloc0(thread_reqs_size);
> +
> +    request = thread->requests;
> +    for (i = 0; i < threads->thread_requests_nr; i++) {
> +        ret = threads->ops->thread_request_init(request + 1);
> +        if (ret < 0) {
> +            goto exit;
> +        }
> +
> +        request->request_index = i;
> +        request->thread_index = thread->self;
> +        request = (void *)request + threads->request_size;

Pointer arithmetic on void * is illegal in C, see above.

> +    }
> +    return 0;
> +
> +exit:
> +    uninit_thread_requests(thread, i);
> +    return -1;
> +}
> +
> +static void uninit_thread_data(Threads *threads, int free_nr)
> +{
> +    ThreadLocal *thread_local = threads->per_thread_data;

thread_local is a keyword in C++11, so I would avoid it as a name;
consider replacing it with “per_thread_data” as in struct Threads?


> +    int i;
> +
> +    for (i = 0; i < free_nr; i++) {
> +        thread_local[i].quit = true;
> +        qemu_event_set(&thread_local[i].request_valid_ev);
> +        qemu_thread_join(&thread_local[i].thread);
> +        qemu_event_destroy(&thread_local[i].request_valid_ev);
> +        qemu_event_destroy(&thread_local[i].request_free_ev);
> +        uninit_thread_requests(&thread_local[i], threads->thread_requests_nr);
> +    }
> +}
> +
> +static int
> +init_thread_data(Threads *threads, const char *thread_name, int thread_nr)
> +{
> +    ThreadLocal *thread_local = threads->per_thread_data;
> +    char *name;
> +    int i;
> +
> +    for (i = 0; i < thread_nr; i++) {
> +        thread_local[i].threads = threads;
> +        thread_local[i].self = i;
> +
> +        if (init_thread_requests(&thread_local[i]) < 0) {
> +            goto exit;
> +        }
> +
> +        qemu_event_init(&thread_local[i].request_free_ev, false);
> +        qemu_event_init(&thread_local[i].request_valid_ev, false);
> +
> +        name = g_strdup_printf("%s/%d", thread_name, thread_local[i].self);
> +        qemu_thread_create(&thread_local[i].thread, name,
> +                           thread_run, &thread_local[i], QEMU_THREAD_JOINABLE);
> +        g_free(name);
> +    }
> +    return 0;
> +
> +exit:
> +    uninit_thread_data(threads, i);
> +    return -1;
> +}
> +
> +Threads *threaded_workqueue_create(const char *name, unsigned int threads_nr,
> +                                   unsigned int thread_requests_nr,
> +                                   const ThreadedWorkqueueOps *ops)
> +{
> +    Threads *threads;
> +
> +    if (threads_nr > MAX_THREAD_REQUEST_NR) {
> +        return NULL;
> +    }
> +
> +    threads = g_malloc0(sizeof(*threads) + threads_nr * sizeof(ThreadLocal));
> +    threads->ops = ops;
> +    threads->threads_nr = threads_nr;
> +    threads->thread_requests_nr = thread_requests_nr;
> +
> +    QEMU_BUILD_BUG_ON(!QEMU_IS_ALIGNED(sizeof(ThreadRequest), sizeof(long)));
> +    threads->request_size = threads->ops->request_size;
> +    threads->request_size = QEMU_ALIGN_UP(threads->request_size, sizeof(long));
> +    threads->request_size += sizeof(ThreadRequest);
> +
> +    if (init_thread_data(threads, name, threads_nr) < 0) {
> +        g_free(threads);
> +        return NULL;
> +    }
> +
> +    return threads;
> +}
> +
> +void threaded_workqueue_destroy(Threads *threads)
> +{
> +    uninit_thread_data(threads, threads->threads_nr);
> +    g_free(threads);
> +}
> +
> +static void request_done(Threads *threads, ThreadRequest *request)
> +{
> +    if (!request->done) {
> +        return;
> +    }
> +
> +    threads->ops->thread_request_done(request + 1);
> +    request->done = false;
> +}
> +
> +void *threaded_workqueue_get_request(Threads *threads)
> +{
> +    ThreadRequest *request;
> +
> +    request = threads_find_free_request(threads);
> +    if (!request) {
> +        return NULL;
> +    }
> +
> +    request_done(threads, request);
> +    return request + 1;
> +}
> +
> +void threaded_workqueue_submit_request(Threads *threads, void *request)
> +{
> +    ThreadRequest *req = request - sizeof(ThreadRequest);

Pointer arithmetic on void *…

Please consider rewriting as:

	ThreadRequest *req = (ThreadRequest *) request - 1;

which achieves the same objective, is legal C, and is the symmetric
counterpart of “return request + 1” above.


> +    int thread_index = request_to_thread_index(request);
> +
> +    assert(!req->done);
> +    mark_request_valid(threads, req);
> +    threads->current_thread_index = thread_index  + 1;
> +}
> +
> +void threaded_workqueue_wait_for_requests(Threads *threads)
> +{
> +    ThreadLocal *thread;
> +    uint64_t result_bitmap;
> +    int thread_index, index = 0;
> +
> +    for (thread_index = 0; thread_index < threads->threads_nr; thread_index++) {
> +        thread = threads->per_thread_data + thread_index;
> +        index = 0;
> +retry:
> +        qemu_event_reset(&thread->request_free_ev);
> +        result_bitmap = get_free_request_bitmap(threads, thread);
> +
> +        for (; index < threads->thread_requests_nr; index++) {
> +            if (test_bit(index, &result_bitmap)) {
> +                qemu_event_wait(&thread->request_free_ev);
> +                goto retry;
> +            }
> +
> +            request_done(threads, index_to_request(thread, index));
> +        }
> +    }
> +}
> -- 
> 2.14.5

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-27 12:49     ` [Qemu-devel] " Christophe de Dinechin
@ 2018-11-27 13:51       ` Paolo Bonzini
  -1 siblings, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2018-11-27 13:51 UTC (permalink / raw)
  To: Christophe de Dinechin, Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	quintela, wei.w.wang, cota, jiang.biao2

On 27/11/18 13:49, Christophe de Dinechin wrote:
> So this is not really
> helping. Also, the ThreadLocal structure itself is not necessarily aligned
> within struct Threads. Therefore, it’s possible that “requests” for example
> could be on the same cache line as request_fill_bitmap if planets align
> the wrong way.

I think this is a bit exaggerated.  Linux and QEMU's own qht work just
fine with compile-time directives.

> In order to mitigate these effects, I would group the data that the user
> writes and the data that the thread writes, i.e. reorder declarations,
> put request_fill_bitmap and request_valid_ev together, and try
> to put them in the same cache line so that only one cache line is invalidated
> from within mark_request_valid instead of two.
> 
> Then you end up with a single alignment directive instead of 4, to
> separate requests from completions.

Yeah, I agree with this.
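
Something along these lines, for instance (a rough sketch only; the
field names are from the patch, the exact grouping is illustrative):

    struct ThreadLocal {
        struct Threads *threads;
        int self;
        bool quit;
        QemuThread thread;
        void *requests;

        /* written by the submitter, read by the worker */
        uint64_t request_fill_bitmap;
        QemuEvent request_valid_ev;

        /* written by the worker, read by the submitter */
        uint64_t request_done_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
        QemuEvent request_free_ev;
    };

A single directive then keeps the completion side on its own cache
line(s) instead of aligning all four fields separately.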

> That being said, I’m not sure why you use a bitmap here. What is the
> expected benefit relative to atomic lists (which would also make it really
> lock-free)?
> 

I don't think lock-free lists are easier.  Bitmaps smaller than 64
elements are both faster and easier to manage.
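
For example, with a single uint64_t per thread, finding a free slot is
one xor plus one count-trailing-zeros (a sketch of what
find_thread_free_request() boils down to, ignoring the
thread_requests_nr bound):

    uint64_t busy = fill_bitmap ^ done_bitmap;  /* 1 = request in flight */
    int slot = (busy == ~0ULL) ? -1 : __builtin_ctzll(~busy);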

Paolo

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-27 12:49     ` [Qemu-devel] " Christophe de Dinechin
@ 2018-11-27 17:39       ` Emilio G. Cota
  -1 siblings, 0 replies; 70+ messages in thread
From: Emilio G. Cota @ 2018-11-27 17:39 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	quintela, wei.w.wang, Xiao Guangrong, jiang.biao2, Paolo Bonzini

On Tue, Nov 27, 2018 at 13:49:13 +0100, Christophe de Dinechin wrote:
> (I did not finish the review, but decided to send what I already had).
> 
> > On 22 Nov 2018, at 08:20, guangrong.xiao@gmail.com wrote:
> > 
> > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > 
> > This modules implements the lockless and efficient threaded workqueue.
> 
> I’m not entirely convinced that it’s either “lockless” or “efficient”
> in the current iteration. I believe that it’s relatively easy to fix, though.
(snip)
> > 
> >    All the requests are pre-allocated and carefully partitioned between
> >    threads so there is no contention on the request, that make threads
> >    be parallel as much as possible.
> 
> That sentence confused me (it’s also in a comment in the text).
> I think I’m mostly confused by “there is no contention”. Perhaps you
> meant “so as to avoid contention if possible”? If there is a reason
> why there would never be any contention even if requests arrive faster than
> completions, I did not figure it out.
> 
> I personally see serious contention on the fields in the Threads structure,
> for example, but also possibly on the targets of the “modulo” operation in
> thread_find_free_request. Specifically, if three CPUs are entering
> thread_find_free_request at the same time, they will all run the same
> loop and all, presumably, “attack” the same memory locations.
> 
> Sorry if I mis-read the code, but at the moment, it does not seem to
> avoid contention as intended. I don’t see how it could without having
> some way to discriminate between CPUs to start with, which I did not find.

You might have missed that only one thread can request jobs. So contention
should only happen between that thread and the worker threads, but
not among worker threads (they should only share cache lines with the
requester thread).
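
In other words, the intended pattern is a single submitter driving all
the workers, e.g. (a sketch on top of the API in this patch;
more_work() and fill_request() are hypothetical user callbacks):

    void *req;

    while (more_work()) {
        req = threaded_workqueue_get_request(threads);
        if (!req) {
            continue;    /* every slot is in flight, retry */
        }
        fill_request(req);
        threaded_workqueue_submit_request(threads, req);
    }
    threaded_workqueue_wait_for_requests(threads);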

> > - User, i.e, the submitter
> >    It's the one fills the request and submits it to the workqueue,
> the one -> the one who
> >    the result will be collected after it is handled by the work queue.
> > 
> >    The user can consecutively submit requests without waiting the previous
> waiting -> waiting for
> >    requests been handled.
> >    It only supports one submitter, you should do serial submission by
> >    yourself if you want more, e.g, use lock on you side.
> 
> I’m also confused by this last statement. The proposal purports
> to be “lockless”, which I read as working correctly without a lock…
> Reading the code, I indeed see issues if different threads
> try to place requests at the same time. So I believe the word
> “lockless” is a bit misleading.

ditto, it is lockless as presented here, i.e. one requester thread.

> > +static void uninit_thread_requests(ThreadLocal *thread, int free_nr)
> > +{
> > +    Threads *threads = thread->threads;
> > +    ThreadRequest *request = thread->requests;
> > +    int i;
> > +
> > +    for (i = 0; i < free_nr; i++) {
> > +        threads->ops->thread_request_uninit(request + 1);
> > +        request = (void *)request + threads->request_size;
> 
> Despite GCC’s tolerance for it and rather lengthy debates,
> pointer arithmetic on void * is illegal in C [1].
> 
> Consider using char * arithmetic, and using macros such as:
> 
> #define request_to_payload(req) (((ThreadRequest *) req) + 1)
> #define payload_to_request(req) (((ThreadRequest *) req) - 1)
> #define request_to_next(req, threads) ((ThreadRequest *) ((char *) (req) + (threads)->request_size))
> 
> where appropriate, that would clarify the intent.
> 
> [1] https://stackoverflow.com/questions/3523145/pointer-arithmetic-for-void-pointer-in-c

FWIW, we use void pointer arithmetic in other places in QEMU, so
I wouldn't worry about it being illegal.

I like those little macros though; even better as inlines.
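
For example (a sketch; these helpers are review suggestions, not part
of the patch):

    static inline void *request_to_payload(ThreadRequest *req)
    {
        return req + 1;
    }

    static inline ThreadRequest *payload_to_request(void *payload)
    {
        return (ThreadRequest *)payload - 1;
    }

    static inline ThreadRequest *request_to_next(ThreadRequest *req,
                                                 Threads *threads)
    {
        return (ThreadRequest *)((char *)req + threads->request_size);
    }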

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-27 12:49     ` [Qemu-devel] " Christophe de Dinechin
@ 2018-11-28  8:55       ` Xiao Guangrong
  -1 siblings, 0 replies; 70+ messages in thread
From: Xiao Guangrong @ 2018-11-28  8:55 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	quintela, wei.w.wang, cota, jiang.biao2, Paolo Bonzini



On 11/27/18 8:49 PM, Christophe de Dinechin wrote:
> (I did not finish the review, but decided to send what I already had).
> 
> 
>> On 22 Nov 2018, at 08:20, guangrong.xiao@gmail.com wrote:
>>
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> This modules implements the lockless and efficient threaded workqueue.
> 
> I’m not entirely convinced that it’s either “lockless” or “efficient”
> in the current iteration. I believe that it’s relatively easy to fix, though.
> 

I think Emilio has already replied to your concern about why it is "lockless". :)

>>
>> Three abstracted objects are used in this module:
>> - Request.
>>      It not only contains the data that the workqueue fetches out
>>     to finish the request but also offers the space to save the result
>>     after the workqueue handles the request.
>>
>>     It's flowed between user and workqueue. The user fills the request
>>     data into it when it is owned by user. After it is submitted to the
>>     workqueue, the workqueue fetched data out and save the result into
>>     it after the request is handled.
> 
> fetched -> fetches
> save -> saves

Will fix... My English is even worse than C. :(

>> +
>> +/*
>> + * find a free request where the user can store the data that is needed to
>> + * finish the request
>> + *
>> + * If all requests are used up, return NULL
>> + */
>> +void *threaded_workqueue_get_request(Threads *threads);
> 
> Using void * to represent the payload makes it easy to get
> the wrong pointer in there without the compiler noticing.
> Consider adding a type for the payload?
> 

Another option would be to export ThreadRequest to the user and put it
at the very beginning of the user-defined data struct.

However, that would expose the internal design to the user, so I am
not sure it is a good idea...
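
For instance (a sketch of that alternative; UserRequest and its fields
are hypothetical):

    typedef struct {
        ThreadRequest header;   /* must stay the first member */
        uint8_t *data;          /* user payload follows */
        size_t size;
    } UserRequest;

It would give the compiler something to check, but every user would
then see ThreadRequest.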

>> + *
>> + * Author:
>> + *   Xiao Guangrong <xiaoguangrong@tencent.com>
>> + *
>> + * Copyright(C) 2018 Tencent Corporation.
>> + *
>> + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
>> + * See the COPYING.LIB file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/bitmap.h"
>> +#include "qemu/threaded-workqueue.h"
>> +
>> +#define SMP_CACHE_BYTES 64
> 
> +1 on comments already made by others

Will improve it.

> 
>> +
>> +/*
>> + * the request representation which contains the internally used mete data,
> 
> mete -> meta

Will fix.

> 
>> + * it is the header of user-defined data.
>> + *
>> + * It should be aligned to the nature size of CPU.
>> + */
>> +struct ThreadRequest {
>> +    /*
>> +     * the request has been handled by the thread and need the user
>> +     * to fetch result out.
>> +     */
>> +    uint8_t done;
>> +
>> +    /*
>> +     * the index to Thread::requests.
>> +     * Save it to the padding space although it can be calculated at runtime.
>> +     */
>> +    uint8_t request_index;
> 
> So no more than 256?
> 
> This is blocked by MAX_THREAD_REQUEST_NR test at the beginning
> of threaded_workqueue_create, but I would make it more explicit either
> with a compile-time assert that MAX_THREAD_REQUEST_NR is
> below UINT8_MAX, or by adding a second test for UINT8_MAX in
> threaded_workqueue_create.

That works for me.

I prefer the former: a compile-time assert that MAX_THREAD_REQUEST_NR
is below UINT8_MAX.
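
Something like:

    QEMU_BUILD_BUG_ON(MAX_THREAD_REQUEST_NR > UINT8_MAX);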

> 
> Also, an obvious extension would be to make bitmaps into arrays.
> 
> Do you think someone would want to use the package to assign
> requests per CPU or per VCPU? If so, that could quickly go above 64.
> 

Well... it specifies the queue depth of each single thread, and a
larger depth has a negative effect: it makes
threaded_workqueue_wait_for_requests() too slow, since at that point
the user has to wait for all the threads to exhaust all of their
requests.

Another impact is that a plain uint64_t is more efficient than a
bitmap, as we can see from the performance data:
    https://ibb.co/hq7u5V

Based on that, I think 64 should be enough, at least for the present
user, the migration thread.

> 
>> +
>> +    /* the index to Threads::per_thread_data */
>> +    unsigned int thread_index;
> 
> Don’t you want to use a size_t for that?

size_t is 8 bytes... I'd like to keep the request header as small as possible...

> 
>> +} QEMU_ALIGNED(sizeof(unsigned long));
> 
> Nit: the alignment type is inconsistent with that given
> to QEMU_BUILD_BUG_ON in threaded_workqueue_create.
> (long vs. unsigned long).
> 

Yup, will make them consistent.

> Also, why is the alignment required? Aren’t you more interested
> in cache-line alignment?
> 

ThreadRequest is actually the header placed at the very beginning of
the request. If it were not aligned to "long", the user-defined data
struct that follows it could end up improperly aligned (that is also
why threaded_workqueue_create() rounds request_size up to
sizeof(long)).

> 
>> +typedef struct ThreadRequest ThreadRequest;
> 
> 
>> +
>> +struct ThreadLocal {
>> +    struct Threads *threads;
>> +
>> +    /* the index of the thread */
>> +    int self;
> 
> Why signed?

Mistake, will fix.

> 
>> +
>> +    /* thread is useless and needs to exit */
>> +    bool quit;
>> +
>> +    QemuThread thread;
>> +
>> +    void *requests;
>> +
>> +   /*
>> +     * the bit in these two bitmaps indicates the index of the @requests
>> +     * respectively. If it's the same, the corresponding request is free
>> +     * and owned by the user, i.e, where the user fills a request. Otherwise,
>> +     * it is valid and owned by the thread, i.e, where the thread fetches
>> +     * the request and write the result.
>> +     */
>> +
>> +    /* after the user fills the request, the bit is flipped. */
>> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> 
> I believe you are trying to ensure that data accessed from multiple CPUs
> is on different cache lines. As others have pointed out, the real value for
> SMP_CACHE_BYTES can only be known at run-time. So this is not really
> helping. Also, the ThreadLocal structure itself is not necessarily aligned
> within struct Threads. Therefore, it’s possible that “requests” for example
> could be on the same cache line as request_fill_bitmap if planets align
> the wrong way.
> 
> In order to mitigate these effects, I would group the data that the user
> writes and the data that the thread writes, i.e. reorder declarations,
> put request_fill_bitmap and request_valid_ev together, and try
> to put them in the same cache line so that only one cache line is invalidated
> from within mark_request_valid instead of two.
> 

However, QemuEvent is atomically updated by both sides, so is it
really a good idea to mix it with other fields?


> Then you end up with a single alignment directive instead of 4, to
> separate requests from completions.
> 
> That being said, I’m not sure why you use a bitmap here. What is the
> expected benefit relative to atomic lists (which would also make it really
> lock-free)?
> 

I agree with Paolo's comments in another mail. :)

> 
>> +
>> +/*
>> + * the main data struct represents multithreads which is shared by
>> + * all threads
>> + */
>> +struct Threads {
>> +    /* the request header, ThreadRequest, is contained */
>> +    unsigned int request_size;
> 
> size_t?

Please see the comments above about "unsigned int thread_index;" in
ThreadRequest.

>> +/*
>> + * free request: the request is not used by any thread, however, it might
>> + *   contain the result need the user to call thread_request_done()
> 
> might contain the result -> might still contain the result
> result need the user to call -> result. The user needs to call
> 

Will fix.

>> +static void uninit_thread_requests(ThreadLocal *thread, int free_nr)
>> +{
>> +    Threads *threads = thread->threads;
>> +    ThreadRequest *request = thread->requests;
>> +    int i;
>> +
>> +    for (i = 0; i < free_nr; i++) {
>> +        threads->ops->thread_request_uninit(request + 1);
>> +        request = (void *)request + threads->request_size;
> 
> Despite GCC’s tolerance for it and rather lengthy debates,
> pointer arithmetic on void * is illegal in C [1].
> 
> Consider using char * arithmetic, and using macros such as:
> 
> #define request_to_payload(req) (((ThreadRequest *) req) + 1)
> #define payload_to_request(req) (((ThreadRequest *) req) - 1)
> #define request_to_next(req, threads) ((ThreadRequest *) ((char *) (req) + (threads)->request_size))
> 

These definitions are really nice, will use them instead.

>> +static void uninit_thread_data(Threads *threads, int free_nr)
>> +{
>> +    ThreadLocal *thread_local = threads->per_thread_data;
> 
> thread_local is a keyword in C++11. I would avoid it as a name,
> consider replacing with “per_thread_data” as in struct Threads?
> 

Sure, that works for me.

>> +void threaded_workqueue_submit_request(Threads *threads, void *request)
>> +{
>> +    ThreadRequest *req = request - sizeof(ThreadRequest);
> 
> Pointer arithmetic on void *…
> 
> Please consider rewriting as:
> 
> 	ThreadRequest *req = (ThreadRequest *) request - 1;
> 
> which achieves the same objective, is legal C, and is the symmetric
> counterpart of “return request + 1” above.
> 

It's nice, indeed.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [Qemu-devel] [PATCH v3 2/5] util: introduce threaded workqueue
@ 2018-11-28  8:55       ` Xiao Guangrong
  0 siblings, 0 replies; 70+ messages in thread
From: Xiao Guangrong @ 2018-11-28  8:55 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: Paolo Bonzini, mst, mtosatti, qemu-devel, kvm, dgilbert, peterx,
	wei.w.wang, jiang.biao2, eblake, quintela, cota, Xiao Guangrong



On 11/27/18 8:49 PM, Christophe de Dinechin wrote:
> (I did not finish the review, but decided to send what I already had).
> 
> 
>> On 22 Nov 2018, at 08:20, guangrong.xiao@gmail.com wrote:
>>
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> This modules implements the lockless and efficient threaded workqueue.
> 
> I’m not entirely convinced that it’s either “lockless” or “efficient”
> in the current iteration. I believe that it’s relatively easy to fix, though.
> 

I think Emilio has already replied to your concern about why it is "lockless". :)

>>
>> Three abstracted objects are used in this module:
>> - Request.
>>      It not only contains the data that the workqueue fetches out
>>     to finish the request but also offers the space to save the result
>>     after the workqueue handles the request.
>>
>>     It's flowed between user and workqueue. The user fills the request
>>     data into it when it is owned by user. After it is submitted to the
>>     workqueue, the workqueue fetched data out and save the result into
>>     it after the request is handled.
> 
> fetched -> fetches
> save -> saves

Will fix... My English is even worse than C. :(

>> +
>> +/*
>> + * find a free request where the user can store the data that is needed to
>> + * finish the request
>> + *
>> + * If all requests are used up, return NULL
>> + */
>> +void *threaded_workqueue_get_request(Threads *threads);
> 
> Using void * to represent the payload makes it easy to get
> the wrong pointer in there without the compiler noticing.
> Consider adding a type for the payload?
> 

Another option would be to export ThreadRequest to the user and put it
at the very beginning of the user-defined data struct.

However, that would expose the internal design to the user, so I am
not sure it is a good idea...

>> + *
>> + * Author:
>> + *   Xiao Guangrong <xiaoguangrong@tencent.com>
>> + *
>> + * Copyright(C) 2018 Tencent Corporation.
>> + *
>> + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
>> + * See the COPYING.LIB file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/bitmap.h"
>> +#include "qemu/threaded-workqueue.h"
>> +
>> +#define SMP_CACHE_BYTES 64
> 
> +1 on comments already made by others

Will improve it.

> 
>> +
>> +/*
>> + * the request representation which contains the internally used mete data,
> 
> mete -> meta

Will fix.

> 
>> + * it is the header of user-defined data.
>> + *
>> + * It should be aligned to the nature size of CPU.
>> + */
>> +struct ThreadRequest {
>> +    /*
>> +     * the request has been handled by the thread and need the user
>> +     * to fetch result out.
>> +     */
>> +    uint8_t done;
>> +
>> +    /*
>> +     * the index to Thread::requests.
>> +     * Save it to the padding space although it can be calculated at runtime.
>> +     */
>> +    uint8_t request_index;
> 
> So no more than 256?
> 
> This is blocked by MAX_THREAD_REQUEST_NR test at the beginning
> of threaded_workqueue_create, but I would make it more explicit either
> with a compile-time assert that MAX_THREAD_REQUEST_NR is
> below UINT8_MAX, or by adding a second test for UINT8_MAX in
> threaded_workqueue_create.

That works for me.

I prefer the former: a compile-time assert that MAX_THREAD_REQUEST_NR
is below UINT8_MAX.
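
A minimal sketch of that check, using QEMU's existing compile-time
assert (where exactly it sits in the file is only illustrative):

    QEMU_BUILD_BUG_ON(MAX_THREAD_REQUEST_NR > UINT8_MAX);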

> 
> Also, an obvious extension would be to make bitmaps into arrays.
> 
> Do you think someone would want to use the package to assign
> requests per CPU or per VCPU? If so, that could quickly go above 64.
> 

Well... it specifies the request queue depth of each single thread.
Using a larger depth has a negative effect: it makes
threaded_workqueue_wait_for_requests() too slow, since at that point the
user needs to wait for all the threads to exhaust all of their requests.

Another impact is that a plain uint64_t is more efficient than a bitmap;
we can see that from the performance data:
    https://ibb.co/hq7u5V

Based on those, I think 64 should be enough, at least for the present
user, the migration thread.
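
For illustration, the kind of single-word test this enables (a sketch,
not the patch's exact code; a slot is free when its bit is equal in the
fill and done words):

    static int find_free_request_index(uint64_t fill_bitmap,
                                       uint64_t done_bitmap)
    {
        uint64_t inflight = fill_bitmap ^ done_bitmap; /* 1 = busy slot */

        if (inflight == UINT64_MAX) {
            return -1;                  /* all 64 requests are in use */
        }
        return __builtin_ctzll(~inflight);  /* index of first free slot */
    }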

> 
>> +
>> +    /* the index to Threads::per_thread_data */
>> +    unsigned int thread_index;
> 
> Don’t you want to use a size_t for that?

size_t is 8 bytes... I'd like to keep the request header as tiny as
possible...

> 
>> +} QEMU_ALIGNED(sizeof(unsigned long));
> 
> Nit: the alignment type is inconsistent with that given
> to QEMU_BUILD_BUG_ON in threaded_workqueue_create.
> (long vs. unsigned long).
> 

Yup, will make them consistent.

> Also, why is the alignment required? Aren’t you more interested
> in cache-line alignment?
> 

ThreadRequest is actually the header placed at the very beginning of
the request. If it were not aligned to "long", the user-defined data
struct that follows it could be accessed without proper alignment.
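
A quick sketch of the layout in question:

    /*
     * | ThreadRequest (aligned to long) | user-defined data ...
     *                                   ^
     * threaded_workqueue_get_request() hands out this address, i.e.
     * (ThreadRequest *)req + 1, so the user's struct inherits the
     * header's alignment.
     */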

> 
>> +typedef struct ThreadRequest ThreadRequest;
> 
> 
>> +
>> +struct ThreadLocal {
>> +    struct Threads *threads;
>> +
>> +    /* the index of the thread */
>> +    int self;
> 
> Why signed?

Mistake, will fix.

> 
>> +
>> +    /* thread is useless and needs to exit */
>> +    bool quit;
>> +
>> +    QemuThread thread;
>> +
>> +    void *requests;
>> +
>> +   /*
>> +     * the bit in these two bitmaps indicates the index of the @requests
>> +     * respectively. If it's the same, the corresponding request is free
>> +     * and owned by the user, i.e, where the user fills a request. Otherwise,
>> +     * it is valid and owned by the thread, i.e, where the thread fetches
>> +     * the request and write the result.
>> +     */
>> +
>> +    /* after the user fills the request, the bit is flipped. */
>> +    uint64_t request_fill_bitmap QEMU_ALIGNED(SMP_CACHE_BYTES);
> 
> I believe you are trying to ensure that data accessed from multiple CPUs
> is on different cache lines. As others have pointed out, the real value for
> SMP_CACHE_BYTES can only be known at run-time. So this is not really
> helping. Also, the ThreadLocal structure itself is not necessarily aligned
> within struct Threads. Therefore, it’s possible that “requests” for example
> could be on the same cache line as request_fill_bitmap if planets align
> the wrong way.
> 
> In order to mitigate these effects, I would group the data that the user
> writes and the data that the thread writes, i.e. reorder declarations,
> put request_fill_bitmap and request_valid_ev together, and try
> to put them in the same cache line so that only one cache line is invalidated
> from within mark_request_valid instead of two.
> 

However, QemuEvent is atomically updated by both sides; it is not good
to mix it with other fields, is it?


> Then you end up with a single alignment directive instead of 4, to
> separate requests from completions.
> 
> That being said, I’m not sure why you use a bitmap here. What is the
> expected benefit relative to atomic lists (which would also make it really
> lock-free)?
> 

I agree with Paolo's comments in another mail. :)

> 
>> +
>> +/*
>> + * the main data struct represents multithreads which is shared by
>> + * all threads
>> + */
>> +struct Threads {
>> +    /* the request header, ThreadRequest, is contained */
>> +    unsigned int request_size;
> 
> size_t?

Please see the comments above about "unsigned int thread_index;" in
ThreadRequest.

>> +/*
>> + * free request: the request is not used by any thread, however, it might
>> + *   contain the result need the user to call thread_request_done()
> 
> might contain the result -> might still contain the result
> result need the user to call -> result. The user needs to call
> 

Will fix.

>> +static void uninit_thread_requests(ThreadLocal *thread, int free_nr)
>> +{
>> +    Threads *threads = thread->threads;
>> +    ThreadRequest *request = thread->requests;
>> +    int i;
>> +
>> +    for (i = 0; i < free_nr; i++) {
>> +        threads->ops->thread_request_uninit(request + 1);
>> +        request = (void *)request + threads->request_size;
> 
> Despite GCC’s tolerance for it and rather lengthy debates,
> pointer arithmetic on void * is illegal in C [1].
> 
> Consider using char * arithmetic, and using macros such as:
> 
> #define request_to_payload(req) (((ThreadRequest *) req) + 1)
> #define payload_to_request(req) (((ThreadRequest *) req) - 1)
> #define request_to_next(req, threads) ((ThreadRequest *) ((char *) (req) + (threads)->request_size))
> 

These definitions are really nice, will use them instead.

>> +static void uninit_thread_data(Threads *threads, int free_nr)
>> +{
>> +    ThreadLocal *thread_local = threads->per_thread_data;
> 
> thread_local is a keyword in C++11. I would avoid it as a name,
> consider replacing with “per_thread_data” as in struct Threads?
> 

Sure, that works for me.

>> +void threaded_workqueue_submit_request(Threads *threads, void *request)
>> +{
>> +    ThreadRequest *req = request - sizeof(ThreadRequest);
> 
> Pointer arithmetic on void *…
> 
> Please consider rewriting as:
> 
> 	ThreadRequest *req = (ThreadRequest *) request - 1;
> 
> which achieves the same objective, is legal C, and is the symmetric
> counterpart of “return request + 1” above.
> 

It's nice, indeed.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 1/5] bitops: introduce change_bit_atomic
  2018-11-22  7:20   ` [Qemu-devel] " guangrong.xiao
@ 2018-11-28  9:35     ` Juan Quintela
  -1 siblings, 0 replies; 70+ messages in thread
From: Juan Quintela @ 2018-11-28  9:35 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, peterx, qemu-devel,
	wei.w.wang, cota, jiang.biao2, pbonzini

guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>
> It will be used by threaded workqueue
>
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-11-27 13:51       ` [Qemu-devel] " Paolo Bonzini
@ 2018-12-04 15:49         ` Christophe de Dinechin
  -1 siblings, 0 replies; 70+ messages in thread
From: Christophe de Dinechin @ 2018-12-04 15:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KVM list, Michael S. Tsirkin, Marcelo Tosatti, Xiao Guangrong,
	Dr. David Alan Gilbert, Peter Xu, qemu-devel, Juan Quintela,
	Wei Wang, Emilio G. Cota, Xiao Guangrong, jiang.biao2



> On 27 Nov 2018, at 14:51, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 27/11/18 13:49, Christophe de Dinechin wrote:
>> So this is not really
>> helping. Also, the ThreadLocal structure itself is not necessarily aligned
>> within struct Threads. Therefore, it’s possible that “requests” for example
>> could be on the same cache line as request_fill_bitmap if planets align
>> the wrong way.
> 
> I think this is a bit exaggerated.

Hence my “if planets align the wrong way” :-)

But I understand that my wording came out too strong. My apologies.

I think the fix is to align ThreadLocal as well.


>  Linux and QEMU's own qht work just fine with compile-time directives.

Wouldn’t it work fine without any compile-time directive at all?
Alignment is just a performance optimization.

> 
>> In order to mitigate these effects, I would group the data that the user
>> writes and the data that the thread writes, i.e. reorder declarations,
>> put request_fill_bitmap and request_valid_ev together, and try
>> to put them in the same cache line so that only one cache line is invalidated
>> from within mark_request_valid instead of two.
>> 
>> Then you end up with a single alignment directive instead of 4, to
>> separate requests from completions.
> 
> Yeah, I agree with this.
> 
>> That being said, I’m not sure why you use a bitmap here. What is the
>> expected benefit relative to atomic lists (which would also make it really
>> lock-free)?
>> 
> 
> I don't think lock-free lists are easier.  Bitmaps smaller than 64
> elements are both faster and easier to manage.

I believe that this is only true if you use a linked list for both freelist
management and for thread notification (i.e. to replace the bitmaps).
However, if you use an atomic list only for the free list, and keep
bitmaps for signaling, then performance is at least equal, often better.
Plus you get the added benefit of having a thread-safe API, i.e.
something that is truly lock-free.
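
For concreteness, the free list here is essentially a Treiber stack; a
minimal C11 sketch of the idea (an illustration only, not the code on
the branch; it ignores the ABA problem, which is tolerable only because
the request nodes are preallocated and never handed back to the
allocator while the list is live):

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct FreeNode {
        struct FreeNode *next;
    } FreeNode;

    typedef struct {
        _Atomic(FreeNode *) head;
    } FreeList;

    static void free_list_push(FreeList *list, FreeNode *node)
    {
        FreeNode *old = atomic_load_explicit(&list->head,
                                             memory_order_relaxed);
        do {
            node->next = old;  /* link on top of the current head */
        } while (!atomic_compare_exchange_weak_explicit(
                     &list->head, &old, node,
                     memory_order_release, memory_order_relaxed));
    }

    static FreeNode *free_list_pop(FreeList *list)
    {
        FreeNode *old = atomic_load_explicit(&list->head,
                                             memory_order_acquire);
        while (old && !atomic_compare_exchange_weak_explicit(
                          &list->head, &old, old->next,
                          memory_order_acquire, memory_order_acquire)) {
            /* 'old' was refreshed by the failed CAS; just retry */
        }
        return old;  /* NULL when the list is empty */
    }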

I did a small experiment to test / prove this. Last commit on branch:
https://github.com/c3d/recorder/commits/181122-xiao_guangdong_introduce-threaded-workqueue
Take with a grain of salt, microbenchmarks are always suspect ;-)

The code in “thread_test.c” includes Xiao’s code with two variations,
plus some testing code lifted from the flight recorder library.
1. The FREE_LIST variation (sl_test) is what I would like to propose.
2. The BITMAP variation (bm_test) is the baseline
3. The DOUBLE_LIST variation (ll_test) is the slow double-list approach

To run it, you need to do “make opt-test”, then run “test_script”
which outputs a CSV file. The summary of my findings testing on
a ThinkPad, a Xeon machine and a MacBook is here:
https://imgur.com/a/4HmbB9K

Overall, the proposed approach:

- makes the API thread safe and lock free, addressing the one
drawback that Xiao was mentioning.

- delivers up to 30% more requests on the Macbook, while being
“within noise” (sometimes marginally better) for the other two.
I suspect an optimization opportunity found by clang, because
the Macbook delivers really high numbers.

- spends less time blocking when all threads are busy, which
accounts for the higher number of client loops.

If you think that makes sense, then either Xiao can adapt the code
from the branch above, or I can send a follow-up patch.


Thanks
Christophe

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-12-04 15:49         ` [Qemu-devel] " Christophe de Dinechin
@ 2018-12-04 17:16           ` Paolo Bonzini
  -1 siblings, 0 replies; 70+ messages in thread
From: Paolo Bonzini @ 2018-12-04 17:16 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: KVM list, Michael S. Tsirkin, Marcelo Tosatti, Xiao Guangrong,
	Dr. David Alan Gilbert, Peter Xu, qemu-devel, Juan Quintela,
	Wei Wang, Emilio G. Cota, Xiao Guangrong, jiang.biao2

On 04/12/18 16:49, Christophe de Dinechin wrote:
>>  Linux and QEMU's own qht work just fine with compile-time directives.
> 
> Wouldn’t it work fine without any compile-time directive at all?

Yes, that's what I meant.  Though there are certainly cases in which the
difference without proper cacheline alignment is an order of magnitude
less throughput or something like that; it would certainly be noticeable.
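
As a toy illustration of the effect (hypothetical code, not from the
patch): two threads bumping adjacent counters keep invalidating the same
cache line, while aligning each counter to 64 bytes, a common line size,
gives each thread its own line:

    typedef struct {
        unsigned long a;    /* shares a cache line with 'b' */
        unsigned long b;
    } Unpadded;

    typedef struct {
        unsigned long a __attribute__((aligned(64)));
        unsigned long b __attribute__((aligned(64)));
    } Padded;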

>> I don't think lock-free lists are easier.  Bitmaps smaller than 64
>> elements are both faster and easier to manage.
> 
> I believe that this is only true if you use a linked list for both freelist
> management and for thread notification (i.e. to replace the bitmaps).
> However, if you use an atomic list only for the free list, and keep
> bitmaps for signaling, then performance is at least equal, often better.
> Plus you get the added benefit of having a thread-safe API, i.e.
> something that is truly lock-free.
> 
> I did a small experiment to test / prove this. Last commit on branch:
> https://github.com/c3d/recorder/commits/181122-xiao_guangdong_introduce-threaded-workqueue
> Take with a grain of salt, microbenchmarks are always suspect ;-)
> 
> The code in “thread_test.c” includes Xiao’s code with two variations,
> plus some testing code lifted from the flight recorder library.
> 1. The FREE_LIST variation (sl_test) is what I would like to propose.
> 2. The BITMAP variation (bm_test) is the baseline
> 3. The DOUBLE_LIST variation (ll_test) is the slow double-list approach
> 
> To run it, you need to do “make opt-test”, then run “test_script”
> which outputs a CSV file. The summary of my findings testing on
> a ThinkPad, a Xeon machine and a MacBook is here:
> https://imgur.com/a/4HmbB9K
> 
> Overall, the proposed approach:
> 
> - makes the API thread safe and lock free, addressing the one
> drawback that Xiao was mentioning.
> 
> - delivers up to 30% more requests on the Macbook, while being
> “within noise” (sometimes marginally better) for the other two.
> I suspect an optimization opportunity found by clang, because
> the Macbook delivers really high numbers.
> 
> - spends less time blocking when all threads are busy, which
> accounts for the higher number of client loops.
> 
> If you think that makes sense, then either Xiao can adapt the code
> from the branch above, or I can send a follow-up patch.

Having a follow-up patch would be best I think.  Thanks for
experimenting with this, it's always fun stuff. :)

Paolo

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v3 2/5] util: introduce threaded workqueue
  2018-12-04 17:16           ` [Qemu-devel] " Paolo Bonzini
@ 2018-12-10  3:23             ` Xiao Guangrong
  -1 siblings, 0 replies; 70+ messages in thread
From: Xiao Guangrong @ 2018-12-10  3:23 UTC (permalink / raw)
  To: Paolo Bonzini, Christophe de Dinechin
  Cc: KVM list, Michael S. Tsirkin, Marcelo Tosatti, Xiao Guangrong,
	Dr. David Alan Gilbert, Peter Xu, qemu-devel, Juan Quintela,
	Wei Wang, Emilio G. Cota, jiang.biao2



On 12/5/18 1:16 AM, Paolo Bonzini wrote:
> On 04/12/18 16:49, Christophe de Dinechin wrote:
>>>   Linux and QEMU's own qht work just fine with compile-time directives.
>>
>> Wouldn’t it work fine without any compile-time directive at all?
> 
> Yes, that's what I meant.  Though there are certainly cases in which the
> difference without proper cacheline alignment is an order of magnitude
> less throughput or something like that; it would certainly be noticeable.
> 
>>> I don't think lock-free lists are easier.  Bitmaps smaller than 64
>>> elements are both faster and easier to manage.
>>
>> I believe that this is only true if you use a linked list for both freelist
>> management and for thread notification (i.e. to replace the bitmaps).
>> However, if you use an atomic list only for the free list, and keep
>> bitmaps for signaling, then performance is at least equal, often better.
>> Plus you get the added benefit of having a thread-safe API, i.e.
>> something that is truly lock-free.
>>
>> I did a small experiment to test / prove this. Last commit on branch:
>> https://github.com/c3d/recorder/commits/181122-xiao_guangdong_introduce-threaded-workqueue
>> Take with a grain of salt, microbenchmarks are always suspect ;-)
>>
>> The code in “thread_test.c” includes Xiao’s code with two variations,
>> plus some testing code lifted from the flight recorder library.
>> 1. The FREE_LIST variation (sl_test) is what I would like to propose.
>> 2. The BITMAP variation (bm_test) is the baseline
>> 3. The DOUBLE_LIST variation (ll_test) is the slow double-list approach
>>
>> To run it, you need to do “make opt-test”, then run “test_script”
>> which outputs a CSV file. The summary of my findings testing on
>> a ThinkPad, a Xeon machine and a MacBook is here:
>> https://imgur.com/a/4HmbB9K
>>
>> Overall, the proposed approach:
>>
>> - makes the API thread safe and lock free, addressing the one
>> drawback that Xiao was mentioning.
>>
>> - delivers up to 30% more requests on the Macbook, while being
>> “within noise” (sometimes marginally better) for the other two.
>> I suspect an optimization opportunity found by clang, because
>> the Macbook delivers really high numbers.
>>
>> - spends less time blocking when all threads are busy, which
>> accounts for the higher number of client loops.
>>
>> If you think that makes sense, then either Xiao can adapt the code
>> from the branch above, or I can send a follow-up patch.
> 
> Having a follow-up patch would be best I think.  Thanks for
> experimenting with this, it's always fun stuff. :)
> 

Yup, Christophe, please post the follow-up patches and add yourself
to the author list if you like. I am looking forward to it. :)

Thanks!

^ permalink raw reply	[flat|nested] 70+ messages in thread

Thread overview: 70+ messages
2018-11-22  7:20 [PATCH v3 0/5] migration: improve multithreads guangrong.xiao
2018-11-22  7:20 ` [PATCH v3 1/5] bitops: introduce change_bit_atomic guangrong.xiao
2018-11-23 10:23   ` Dr. David Alan Gilbert
2018-11-28  9:35   ` Juan Quintela
2018-11-22  7:20 ` [PATCH v3 2/5] util: introduce threaded workqueue guangrong.xiao
2018-11-23 11:02   ` Dr. David Alan Gilbert
2018-11-26  7:57     ` Xiao Guangrong
2018-11-26 10:56       ` Dr. David Alan Gilbert
2018-11-27  7:17         ` Xiao Guangrong
2018-11-26 18:55       ` Emilio G. Cota
2018-11-27  8:30         ` Xiao Guangrong
2018-11-24  0:12   ` Emilio G. Cota
2018-11-26  8:06     ` Xiao Guangrong
2018-11-26 18:49       ` Emilio G. Cota
2018-11-27  8:29         ` Xiao Guangrong
2018-11-24  0:17   ` Emilio G. Cota
2018-11-26  8:18     ` Xiao Guangrong
2018-11-26 10:28       ` Paolo Bonzini
2018-11-27  8:31         ` Xiao Guangrong
2018-11-27 12:49   ` Christophe de Dinechin
2018-11-27 13:51     ` Paolo Bonzini
2018-12-04 15:49       ` Christophe de Dinechin
2018-12-04 17:16         ` Paolo Bonzini
2018-12-10  3:23           ` Xiao Guangrong
2018-11-27 17:39     ` Emilio G. Cota
2018-11-28  8:55     ` Xiao Guangrong
2018-11-22  7:20 ` [PATCH v3 3/5] migration: use threaded workqueue for compression guangrong.xiao
2018-11-23 18:17   ` Dr. David Alan Gilbert
2018-11-23 18:22     ` Paolo Bonzini
2018-11-23 18:29       ` Dr. David Alan Gilbert
2018-11-26  8:00         ` Xiao Guangrong
2018-11-22  7:20 ` [PATCH v3 4/5] migration: use threaded workqueue for decompression guangrong.xiao
2018-11-22  7:20 ` [PATCH v3 5/5] tests: add threaded-workqueue-bench guangrong.xiao
2018-11-22 21:25 ` [PATCH v3 0/5] migration: improve multithreads no-reply
2018-11-22 21:35 ` no-reply
