* [PATCH 00/12] migration: improve multithreads for compression and decompression
@ 2018-06-04  9:55 ` guangrong.xiao
  0 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Background
----------
The current implementation of compression and decompression is very
hard to enable in production. We noticed that too many wait/wake
operations go through kernel space and CPU usage stays very low even
when the system is largely idle.

The reasons are:
1) too many locks are used for synchronization: there is a global
   lock and each thread has its own lock, so the migration thread
   and the worker threads have to go to sleep whenever these locks
   are busy

2) the migration thread submits requests to each thread separately,
   but only one request can be pending per thread, which means a
   thread has to go back to sleep after finishing its request

Our Ideas
---------
To make it work better, we introduce a new multithread model. The
user (currently the migration thread) submits requests to the worker
threads in a round-robin manner; each thread has its own request ring
with a capacity of 4 and puts its results into a global ring that is
lockless for multiple producers. The user then fetches the results
from the global ring and performs the remaining work for each
request, e.g. posting the compressed data out for migration on the
source QEMU.
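
For illustration only, here is a tiny, self-contained sketch of the
submission policy described above (round-robin over worker rings of
capacity 4, falling back when every worker is busy). The names and
structures are mine, not the patchset's; the real interfaces live in
migration/threads.h, and the worker side and all synchronization are
intentionally left out.

    /* Hypothetical sketch of the round-robin submission policy.
     * Not the patchset's code; synchronization and the worker side
     * are omitted. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NR_WORKERS       4
    #define WORKER_RING_SIZE 4   /* per-thread request ring capacity */

    struct Worker {
        int ring[WORKER_RING_SIZE];
        unsigned int in, out;    /* free-running; in - out = used slots */
    };

    static struct Worker workers[NR_WORKERS];
    static unsigned int next_worker;     /* round-robin cursor */

    /* Try each worker once, starting from the round-robin cursor. */
    static bool submit_request(int req)
    {
        int tried;

        for (tried = 0; tried < NR_WORKERS; tried++) {
            struct Worker *w = &workers[next_worker];

            next_worker = (next_worker + 1) % NR_WORKERS;
            if (w->in - w->out < WORKER_RING_SIZE) {
                w->ring[w->in % WORKER_RING_SIZE] = req;
                w->in++;
                return true;     /* a worker thread would pick this up */
            }
        }
        return false;            /* all workers busy: caller falls back */
    }

    int main(void)
    {
        int i;

        for (i = 0; i < 20; i++) {
            printf("request %2d: %s\n", i,
                   submit_request(i) ? "queued" : "all busy, fall back");
        }
        return 0;
    }

In the patchset, the fallback path corresponds to sending the page
out uncompressed instead of waiting for a thread (see patch 01).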

The other work in this patchset adds statistics so we can see whether
compression behaves as expected, and makes the migration thread
faster so it can feed more requests to the threads.

Implementation of the Ring
--------------------------
The key component is the ring, which supports both single-producer /
single-consumer and multiple-producer / single-consumer operation.

Many lessons were learned from the Linux kernel's kfifo (1) and DPDK's
rte_ring (2) before I wrote this implementation. It corrects some
memory-barrier bugs found in kfifo and is a simpler lockless version
of rte_ring, since multiple access is currently only allowed on the
producer side.

With a single producer and a single consumer it is a traditional FIFO.
With multiple producers it uses the following algorithm:

The producer updates the ring in two steps:
   - first step, claim an entry in the ring:

retry:
      in = ring->in
      if (cmpxchg(&ring->in, in, in + 1) != in)
            goto retry;

     After that, the entry ring->data[in] is owned by the producer.

     assert(ring->data[in] == NULL);

     Note that no other producer can touch this entry, so it is still
     in its initialized (NULL) state at this point.

   - second step, write the data to the entry:

     ring->data[in] = data;

For the consumer, it first checks whether an entry is available and
fetches it from the ring:

     if (!ring_is_empty(ring))
          entry = &ring->data[ring->out];

     Note: ring->out has not been updated yet, so the entry it points
     to is exclusively owned by the consumer.

Then it checks if the data is ready:

retry:
     if (*entry == NULL)
            goto retry;

If the entry is still NULL, the producer has advanced the index but
has not yet written the data to it.

Finally, it fetches the valid data, resets the entry to its
initialized state, and updates ring->out to make the entry usable by
producers again:

      data = *entry;
      *entry = NULL;
      ring->out++;

Memory barriers are omitted here; please refer to the comments in the
code.
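
To make the two halves above concrete, here is a compilable C11
approximation of the multi-producer / single-consumer path. It is not
the patchset's migration/ring.h (which spells out its own barrier
scheme); the atomics below stand in for those barriers, and
RING_SIZE, ring_put() and ring_get() are illustrative names.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 256                 /* must be a power of two */

    struct Ring {
        _Atomic unsigned int in;          /* next slot a producer claims */
        _Atomic unsigned int out;         /* only advanced by the consumer */
        void *_Atomic data[RING_SIZE];
    };

    /* Producer: claim a slot with cmpxchg, then publish the data. */
    static bool ring_put(struct Ring *ring, void *item)
    {
        unsigned int in = atomic_load(&ring->in);

        do {
            /* Full if the slot to claim still holds an unconsumed entry. */
            if (in - atomic_load(&ring->out) >= RING_SIZE) {
                return false;
            }
        } while (!atomic_compare_exchange_weak(&ring->in, &in, in + 1));

        /* The slot is exclusively ours now; the release store pairs with
         * the acquire load in ring_get() so the consumer sees the item
         * fully initialized. */
        atomic_store_explicit(&ring->data[in & (RING_SIZE - 1)], item,
                              memory_order_release);
        return true;
    }

    /* Consumer: wait for the claimed slot to be filled, take the item,
     * reset the slot to NULL and only then advance out. */
    static void *ring_get(struct Ring *ring)
    {
        unsigned int out = atomic_load(&ring->out);
        void *item;

        if (out == atomic_load(&ring->in)) {
            return NULL;                  /* empty */
        }

        /* A producer may have advanced 'in' without having written the
         * data yet, so poll until the entry becomes non-NULL. */
        do {
            item = atomic_load_explicit(&ring->data[out & (RING_SIZE - 1)],
                                        memory_order_acquire);
        } while (item == NULL);

        atomic_store_explicit(&ring->data[out & (RING_SIZE - 1)], NULL,
                              memory_order_relaxed);
        atomic_store_explicit(&ring->out, out + 1, memory_order_release);
        return item;
    }

In the single-producer case described earlier, the cmpxchg loop
degenerates to a plain increment and the structure behaves like a
traditional FIFO.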

Performance Result
------------------
The test was based on top of the patch:
   ring: introduce lockless ring buffer
That means the preceding optimizations are included in both the
baseline run and the run with the new multithread model.

We tested live migration between two hosts:
   Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz * 64 + 256G memory
migrating a VM with 16 vCPUs and 60G of memory back and forth between
them; during the migration, multiple threads repeatedly write to the
VM's memory.

We used 16 threads on the destination to decompress the data; on the
source, we tried both 8 and 16 compression threads.

--- Before our work ---
Migration could not finish with either 8 or 16 compression threads.
The data is as follows:

Use 8 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       70%          some use 36%, others are very low ~20%
- on the destination:
            main thread        decompress-threads
CPU usage       100%         some use ~40%, others are very low ~2%

Migration status (CAN NOT FINISH):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: active
total time: 1019540 milliseconds
expected downtime: 2263 milliseconds
setup: 218 milliseconds
transferred ram: 252419995 kbytes
throughput: 2469.45 mbps
remaining ram: 15611332 kbytes
total ram: 62931784 kbytes
duplicate: 915323 pages
skipped: 0 pages
normal: 59673047 pages
normal bytes: 238692188 kbytes
dirty sync count: 28
page size: 4 kbytes
dirty pages rate: 170551 pages
compression pages: 121309323 pages
compression busy: 60588337
compression busy rate: 0.36
compression reduced size: 484281967178
compression rate: 0.97
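
As a rough way to read these compression figures (the patchset
computes the rate per sync period, see patch 05, so treating the
cumulative counters this way is only an approximation): the
compression rate is the bytes saved divided by the raw size of the
compressed pages,

    484281967178 / (121309323 pages * 4096 bytes/page) ~= 0.97

i.e. roughly 97% of the data handled by the compression threads was
saved relative to sending those pages raw.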

Use 16 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       96%          some use 45%, others are very low ~6%
- on the destination:
            main thread        decompress-threads
CPU usage       96%         some use 58%, others are very low ~10%

Migration status (CAN NOT FINISH):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: active
total time: 1189221 milliseconds
expected downtime: 6824 milliseconds
setup: 220 milliseconds
transferred ram: 90620052 kbytes
throughput: 840.41 mbps
remaining ram: 3678760 kbytes
total ram: 62931784 kbytes
duplicate: 195893 pages
skipped: 0 pages
normal: 17290715 pages
normal bytes: 69162860 kbytes
dirty sync count: 33
page size: 4 kbytes
dirty pages rate: 175039 pages
compression pages: 186739419 pages
compression busy: 17486568
compression busy rate: 0.09
compression reduced size: 744546683892
compression rate: 0.97

--- After our work ---
Migration finished quickly with both 8 and 16 compression threads.
The data is as follows:

Use 8 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       30%               30% (all threads have same CPU usage)
- on the destination:
            main thread        decompress-threads
CPU usage       100%              50% (all threads have same CPU usage)

Migration status (finished in 219467 ms):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: completed
total time: 219467 milliseconds
downtime: 115 milliseconds
setup: 222 milliseconds
transferred ram: 88510173 kbytes
throughput: 3303.81 mbps
remaining ram: 0 kbytes
total ram: 62931784 kbytes
duplicate: 2211775 pages
skipped: 0 pages
normal: 21166222 pages
normal bytes: 84664888 kbytes
dirty sync count: 15
page size: 4 kbytes
compression pages: 32045857 pages
compression busy: 23377968
compression busy rate: 0.34
compression reduced size: 127767894329
compression rate: 0.97

Use 16 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       60%               60% (all threads have same CPU usage)
- on the destination:
            main thread        decompress-threads
CPU usage       100%              75% (all threads have same CPU usage)

Migration status (finished in 64118 ms):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: completed
total time: 64118 milliseconds
downtime: 29 milliseconds
setup: 223 milliseconds
transferred ram: 13345135 kbytes
throughput: 1705.10 mbps
remaining ram: 0 kbytes
total ram: 62931784 kbytes
duplicate: 574921 pages
skipped: 0 pages
normal: 2570281 pages
normal bytes: 10281124 kbytes
dirty sync count: 9
page size: 4 kbytes
compression pages: 28007024 pages
compression busy: 3145182
compression busy rate: 0.08
compression reduced size: 111829024985
compression rate: 0.97


(1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2) http://dpdk.org/doc/api/rte__ring_8h.html 

Xiao Guangrong (12):
  migration: do not wait if no free thread
  migration: fix counting normal page for compression
  migration: fix counting xbzrle cache_miss_rate
  migration: introduce migration_update_rates
  migration: show the statistics of compression
  migration: do not detect zero page for compression
  migration: hold the lock only if it is really needed
  migration: do not flush_compressed_data at the end of each iteration
  ring: introduce lockless ring buffer
  migration: introduce lockless multithreads model
  migration: use lockless Multithread model for compression
  migration: use lockless Multithread model for decompression

 hmp.c                   |  13 +
 include/qemu/queue.h    |   1 +
 migration/Makefile.objs |   1 +
 migration/migration.c   |  11 +
 migration/ram.c         | 898 ++++++++++++++++++++++--------------------------
 migration/ram.h         |   1 +
 migration/ring.h        | 265 ++++++++++++++
 migration/threads.c     | 265 ++++++++++++++
 migration/threads.h     | 116 +++++++
 qapi/migration.json     |  25 +-
 10 files changed, 1109 insertions(+), 487 deletions(-)
 create mode 100644 migration/ring.h
 create mode 100644 migration/threads.c
 create mode 100644 migration/threads.h

-- 
2.14.4

^ permalink raw reply	[flat|nested] 156+ messages in thread

* [PATCH 01/12] migration: do not wait if no free thread
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Instead of putting the main thread to sleep to wait for a free
compression thread, we can post the page out directly as a normal
page, which reduces latency and uses CPUs more efficiently.

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 5bcbf7a9f9..0caf32ab0a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1423,25 +1423,18 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
 
     thread_count = migrate_compress_threads();
     qemu_mutex_lock(&comp_done_lock);
-    while (true) {
-        for (idx = 0; idx < thread_count; idx++) {
-            if (comp_param[idx].done) {
-                comp_param[idx].done = false;
-                bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
-                qemu_mutex_lock(&comp_param[idx].mutex);
-                set_compress_params(&comp_param[idx], block, offset);
-                qemu_cond_signal(&comp_param[idx].cond);
-                qemu_mutex_unlock(&comp_param[idx].mutex);
-                pages = 1;
-                ram_counters.normal++;
-                ram_counters.transferred += bytes_xmit;
-                break;
-            }
-        }
-        if (pages > 0) {
+    for (idx = 0; idx < thread_count; idx++) {
+        if (comp_param[idx].done) {
+            comp_param[idx].done = false;
+            bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+            qemu_mutex_lock(&comp_param[idx].mutex);
+            set_compress_params(&comp_param[idx], block, offset);
+            qemu_cond_signal(&comp_param[idx].cond);
+            qemu_mutex_unlock(&comp_param[idx].mutex);
+            pages = 1;
+            ram_counters.normal++;
+            ram_counters.transferred += bytes_xmit;
             break;
-        } else {
-            qemu_cond_wait(&comp_done_cond, &comp_done_lock);
         }
     }
     qemu_mutex_unlock(&comp_done_lock);
@@ -1755,7 +1748,10 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
      * CPU resource.
      */
     if (block == rs->last_sent_block && save_page_use_compression(rs)) {
-        return compress_page_with_multi_thread(rs, block, offset);
+        res = compress_page_with_multi_thread(rs, block, offset);
+        if (res > 0) {
+            return res;
+        }
     }
 
     return ram_save_page(rs, pss, last_stage);
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 02/12] migration: fix counting normal page for compression
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

A compressed page is not a normal page, so it should not be counted
in the normal-page statistics.

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index 0caf32ab0a..dbf24d8c87 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1432,7 +1432,6 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
             qemu_cond_signal(&comp_param[idx].cond);
             qemu_mutex_unlock(&comp_param[idx].mutex);
             pages = 1;
-            ram_counters.normal++;
             ram_counters.transferred += bytes_xmit;
             break;
         }
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 03/12] migration: fix counting xbzrle cache_miss_rate
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Sync up xbzrle_cache_miss_prev only after the migration iteration has
moved forward.

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index dbf24d8c87..dd1283dd45 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1189,9 +1189,9 @@ static void migration_bitmap_sync(RAMState *rs)
                    (double)(xbzrle_counters.cache_miss -
                             rs->xbzrle_cache_miss_prev) /
                    (rs->iterations - rs->iterations_prev);
+                rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
             }
             rs->iterations_prev = rs->iterations;
-            rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
         }
 
         /* reset period counters */
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 04/12] migration: introduce migration_update_rates
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

This slightly cleans up the code; no logic is changed.

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index dd1283dd45..ee03b28435 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1130,6 +1130,25 @@ uint64_t ram_pagesize_summary(void)
     return summary;
 }
 
+static void migration_update_rates(RAMState *rs, int64_t end_time)
+{
+    uint64_t iter_count = rs->iterations - rs->iterations_prev;
+
+    /* calculate period counters */
+    ram_counters.dirty_pages_rate = rs->num_dirty_pages_period * 1000
+                / (end_time - rs->time_last_bitmap_sync);
+
+    if (!iter_count) {
+        return;
+    }
+
+    if (migrate_use_xbzrle()) {
+        xbzrle_counters.cache_miss_rate = (double)(xbzrle_counters.cache_miss -
+            rs->xbzrle_cache_miss_prev) / iter_count;
+        rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
+    }
+}
+
 static void migration_bitmap_sync(RAMState *rs)
 {
     RAMBlock *block;
@@ -1159,9 +1178,6 @@ static void migration_bitmap_sync(RAMState *rs)
 
     /* more than 1 second = 1000 millisecons */
     if (end_time > rs->time_last_bitmap_sync + 1000) {
-        /* calculate period counters */
-        ram_counters.dirty_pages_rate = rs->num_dirty_pages_period * 1000
-            / (end_time - rs->time_last_bitmap_sync);
         bytes_xfer_now = ram_counters.transferred;
 
         /* During block migration the auto-converge logic incorrectly detects
@@ -1183,16 +1199,9 @@ static void migration_bitmap_sync(RAMState *rs)
             }
         }
 
-        if (migrate_use_xbzrle()) {
-            if (rs->iterations_prev != rs->iterations) {
-                xbzrle_counters.cache_miss_rate =
-                   (double)(xbzrle_counters.cache_miss -
-                            rs->xbzrle_cache_miss_prev) /
-                   (rs->iterations - rs->iterations_prev);
-                rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
-            }
-            rs->iterations_prev = rs->iterations;
-        }
+        migration_update_rates(rs, end_time);
+
+        rs->iterations_prev = rs->iterations;
 
         /* reset period counters */
         rs->time_last_bitmap_sync = end_time;
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 05/12] migration: show the statistics of compression
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Then users can adjust the parameters based on this info.

Currently, it includes:
pages: number of pages compressed and transferred to the target VM
busy: number of times no free thread was available to compress data
busy-rate: ratio of busy events to migration iterations
reduced-size: number of bytes saved by compression
compression-rate: ratio of the bytes saved by compression to the
  original size of the compressed pages

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 hmp.c                 | 13 +++++++++++++
 migration/migration.c | 11 +++++++++++
 migration/ram.c       | 37 +++++++++++++++++++++++++++++++++++++
 migration/ram.h       |  1 +
 qapi/migration.json   | 25 ++++++++++++++++++++++++-
 5 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/hmp.c b/hmp.c
index ef93f4878b..5c2d3bd318 100644
--- a/hmp.c
+++ b/hmp.c
@@ -269,6 +269,19 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
                        info->xbzrle_cache->overflow);
     }
 
+    if (info->has_compression) {
+        monitor_printf(mon, "compression pages: %" PRIu64 " pages\n",
+                       info->compression->pages);
+        monitor_printf(mon, "compression busy: %" PRIu64 "\n",
+                       info->compression->busy);
+        monitor_printf(mon, "compression busy rate: %0.2f\n",
+                       info->compression->busy_rate);
+        monitor_printf(mon, "compression reduced size: %" PRIu64 "\n",
+                       info->compression->reduced_size);
+        monitor_printf(mon, "compression rate: %0.2f\n",
+                       info->compression->compression_rate);
+    }
+
     if (info->has_cpu_throttle_percentage) {
         monitor_printf(mon, "cpu throttle percentage: %" PRIu64 "\n",
                        info->cpu_throttle_percentage);
diff --git a/migration/migration.c b/migration/migration.c
index 05aec2c905..bf7c63a5a2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -693,6 +693,17 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
         info->xbzrle_cache->overflow = xbzrle_counters.overflow;
     }
 
+    if (migrate_use_compression()) {
+        info->has_compression = true;
+        info->compression = g_malloc0(sizeof(*info->compression));
+        info->compression->pages = compression_counters.pages;
+        info->compression->busy = compression_counters.busy;
+        info->compression->busy_rate = compression_counters.busy_rate;
+        info->compression->reduced_size = compression_counters.reduced_size;
+        info->compression->compression_rate =
+                                    compression_counters.compression_rate;
+    }
+
     if (cpu_throttle_active()) {
         info->has_cpu_throttle_percentage = true;
         info->cpu_throttle_percentage = cpu_throttle_get_percentage();
diff --git a/migration/ram.c b/migration/ram.c
index ee03b28435..80914b747e 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -292,6 +292,15 @@ struct RAMState {
     uint64_t num_dirty_pages_period;
     /* xbzrle misses since the beginning of the period */
     uint64_t xbzrle_cache_miss_prev;
+
+    /* compression statistics since the beginning of the period */
+    /* amount of count that no free thread to compress data */
+    uint64_t compress_thread_busy_prev;
+    /* amount bytes reduced by compression */
+    uint64_t compress_reduced_size_prev;
+    /* amount of compressed pages */
+    uint64_t compress_pages_prev;
+
     /* number of iterations at the beginning of period */
     uint64_t iterations_prev;
     /* Iterations since start */
@@ -329,6 +338,8 @@ struct PageSearchStatus {
 };
 typedef struct PageSearchStatus PageSearchStatus;
 
+CompressionStats compression_counters;
+
 struct CompressParam {
     bool done;
     bool quit;
@@ -1147,6 +1158,24 @@ static void migration_update_rates(RAMState *rs, int64_t end_time)
             rs->xbzrle_cache_miss_prev) / iter_count;
         rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
     }
+
+    if (migrate_use_compression()) {
+        uint64_t comp_pages;
+
+        compression_counters.busy_rate = (double)(compression_counters.busy -
+            rs->compress_thread_busy_prev) / iter_count;
+        rs->compress_thread_busy_prev = compression_counters.busy;
+
+        comp_pages = compression_counters.pages - rs->compress_pages_prev;
+        if (comp_pages) {
+            compression_counters.compression_rate =
+                (double)(compression_counters.reduced_size -
+                rs->compress_reduced_size_prev) /
+                (comp_pages * TARGET_PAGE_SIZE);
+            rs->compress_pages_prev = compression_counters.pages;
+            rs->compress_reduced_size_prev = compression_counters.reduced_size;
+        }
+    }
 }
 
 static void migration_bitmap_sync(RAMState *rs)
@@ -1412,6 +1441,9 @@ static void flush_compressed_data(RAMState *rs)
         qemu_mutex_lock(&comp_param[idx].mutex);
         if (!comp_param[idx].quit) {
             len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
+            /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
+            compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;
+            compression_counters.pages++;
             ram_counters.transferred += len;
         }
         qemu_mutex_unlock(&comp_param[idx].mutex);
@@ -1441,6 +1473,10 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
             qemu_cond_signal(&comp_param[idx].cond);
             qemu_mutex_unlock(&comp_param[idx].mutex);
             pages = 1;
+            /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
+            compression_counters.reduced_size += TARGET_PAGE_SIZE -
+                                                 bytes_xmit + 8;
+            compression_counters.pages++;
             ram_counters.transferred += bytes_xmit;
             break;
         }
@@ -1760,6 +1796,7 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
         if (res > 0) {
             return res;
         }
+        compression_counters.busy++;
     }
 
     return ram_save_page(rs, pss, last_stage);
diff --git a/migration/ram.h b/migration/ram.h
index d386f4d641..7b009b23e5 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -36,6 +36,7 @@
 
 extern MigrationStats ram_counters;
 extern XBZRLECacheStats xbzrle_counters;
+extern CompressionStats compression_counters;
 
 int xbzrle_cache_resize(int64_t new_size, Error **errp);
 uint64_t ram_bytes_remaining(void);
diff --git a/qapi/migration.json b/qapi/migration.json
index 3ec418dabf..a11987cdc4 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -72,6 +72,26 @@
            'cache-miss': 'int', 'cache-miss-rate': 'number',
            'overflow': 'int' } }
 
+##
+# @CompressionStats:
+#
+# Detailed compression migration statistics
+#
+# @pages: amount of pages compressed and transferred to the target VM
+#
+# @busy: amount of count that no free thread to compress data
+#
+# @busy-rate: rate of thread busy
+#
+# @reduced-size: amount of bytes reduced by compression
+#
+# @compression-rate: rate of compressed size
+#
+##
+{ 'struct': 'CompressionStats',
+  'data': {'pages': 'int', 'busy': 'int', 'busy-rate': 'number',
+	   'reduced-size': 'int', 'compression-rate': 'number' } }
+
 ##
 # @MigrationStatus:
 #
@@ -169,6 +189,8 @@
 #           only present when the postcopy-blocktime migration capability
 #           is enabled. (Since 2.13)
 #
+# @compression: compression migration statistics, only returned if compression
+#           feature is on and status is 'active' or 'completed' (Since 2.14)
 #
 # Since: 0.14.0
 ##
@@ -183,7 +205,8 @@
            '*cpu-throttle-percentage': 'int',
            '*error-desc': 'str',
            '*postcopy-blocktime' : 'uint32',
-           '*postcopy-vcpu-blocktime': ['uint32']} }
+           '*postcopy-vcpu-blocktime': ['uint32'],
+           '*compression': 'CompressionStats'} }
 
 ##
 # @query-migrate:
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 06/12] migration: do not detect zero page for compression
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Detecting zero pages is not lightweight work; we can skip it when
compression is enabled, since compression handles all-zero data very
well.

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 80914b747e..15b20d3f70 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1760,15 +1760,30 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
         return res;
     }
 
-    /*
-     * When starting the process of a new block, the first page of
-     * the block should be sent out before other pages in the same
-     * block, and all the pages in last block should have been sent
-     * out, keeping this order is important, because the 'cont' flag
-     * is used to avoid resending the block name.
-     */
-    if (block != rs->last_sent_block && save_page_use_compression(rs)) {
+    if (save_page_use_compression(rs)) {
+        /*
+         * When starting the process of a new block, the first page of
+         * the block should be sent out before other pages in the same
+         * block, and all the pages in last block should have been sent
+         * out, keeping this order is important, because the 'cont' flag
+         * is used to avoid resending the block name.
+         *
+         * We post the first page as a normal page because compression
+         * will take much CPU resource.
+         */
+        if (block != rs->last_sent_block) {
             flush_compressed_data(rs);
+        } else {
+            /*
+             * Do not detect zero pages here; all-zero data is handled
+             * very well by compression.
+             */
+            res = compress_page_with_multi_thread(rs, block, offset);
+            if (res > 0) {
+                return res;
+            }
+            compression_counters.busy++;
+        }
     }
 
     res = save_zero_page(rs, block, offset);
@@ -1785,19 +1800,6 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
         return res;
     }
 
-    /*
-     * Make sure the first page is sent out before other pages.
-     *
-     * we post it as normal page as compression will take much
-     * CPU resource.
-     */
-    if (block == rs->last_sent_block && save_page_use_compression(rs)) {
-        res = compress_page_with_multi_thread(rs, block, offset);
-        if (res > 0) {
-            return res;
-        }
-        compression_counters.busy++;
-    }
 
     return ram_save_page(rs, pss, last_stage);
 }
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 07/12] migration: hold the lock only if it is really needed
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Try to hold src_page_req_mutex only if the queue is not
empty
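
The idea is a double-checked test: a racy, lock-free peek filters out the
common empty case, and the emptiness check is repeated under the mutex
before actually dequeuing. A minimal sketch of the pattern, with purely
illustrative names (not the actual migration code):

    /* check / lock / re-check: 'queue', 'lock' and dequeue_locked() are
     * illustrative names only. */
    static Request *try_dequeue(void)
    {
        Request *request = NULL;

        /* lock-free peek: cheap way to skip the common empty case */
        if (atomic_read(&queue.head) == NULL) {
            return NULL;
        }

        qemu_mutex_lock(&lock);
        /* re-check under the lock: the queue may have been drained meanwhile */
        if (queue.head != NULL) {
            request = dequeue_locked(&queue);
        }
        qemu_mutex_unlock(&lock);

        return request;
    }

Missing a request that arrives just after the lock-free peek is acceptable
here because unqueue_page() is simply polled again later.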

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 include/qemu/queue.h | 1 +
 migration/ram.c      | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index 59fd1203a1..ac418efc43 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -341,6 +341,7 @@ struct {                                                                \
 /*
  * Simple queue access methods.
  */
+#define QSIMPLEQ_EMPTY_ATOMIC(head) (atomic_read(&((head)->sqh_first)) == NULL)
 #define QSIMPLEQ_EMPTY(head)        ((head)->sqh_first == NULL)
 #define QSIMPLEQ_FIRST(head)        ((head)->sqh_first)
 #define QSIMPLEQ_NEXT(elm, field)   ((elm)->field.sqe_next)
diff --git a/migration/ram.c b/migration/ram.c
index 15b20d3f70..f9a8646520 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1550,6 +1550,10 @@ static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
 {
     RAMBlock *block = NULL;
 
+    if (QSIMPLEQ_EMPTY_ATOMIC(&rs->src_page_requests)) {
+        return NULL;
+    }
+
     qemu_mutex_lock(&rs->src_page_req_mutex);
     if (!QSIMPLEQ_EMPTY(&rs->src_page_requests)) {
         struct RAMSrcPageRequest *entry =
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 08/12] migration: do not flush_compressed_data at the end of each iteration
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

flush_compressed_data() needs to wait for all compression threads to
finish their work; after that, all threads are idle until the migration
thread feeds new requests to them. Reducing the number of calls improves
throughput and uses CPU resources more effectively.

We do not need to flush all threads at the end of each iteration; the
data can be kept locally until the memory block changes

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index f9a8646520..0a38c1c61e 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1994,6 +1994,7 @@ static void ram_save_cleanup(void *opaque)
     }
 
     xbzrle_cleanup();
+    flush_compressed_data(*rsp);
     compress_threads_save_cleanup();
     ram_state_cleanup(rsp);
 }
@@ -2690,7 +2691,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
         }
         i++;
     }
-    flush_compressed_data(rs);
     rcu_read_unlock();
 
     /*
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

This is a simple lockless ring buffer implementation which supports both
single producer vs. single consumer and multiple producers vs.
single consumer.

Many lessons were learned from the Linux kernel's kfifo (1) and DPDK's
rte_ring (2) before I wrote this implementation. It corrects some memory
barrier bugs in kfifo and it is a simpler lockless version of rte_ring,
as currently multiple access is only allowed on the producer side.

With a single producer and a single consumer, it is a traditional FIFO.
With multiple producers, it uses the following algorithm:

For the producer, it uses two steps to update the ring:
   - first step, occupy the entry in the ring:

retry:
      in = ring->in
      if (cmpxchg(&ring->in, in, in + 1) != in)
            goto retry;

     after that the entry pointed by ring->data[in] has been owned by
     the producer.

     assert(ring->data[in] == NULL);

     Note, no other producer can touch this entry so that this entry
     should always be the initialized state.

   - second step, write the data to the entry:

     ring->data[in] = data;

For the consumer, it first checks if there is an available entry in the
ring and fetches the entry from the ring:

     if (!ring_is_empty(ring))
          entry = &ring[ring->out];

     Note: the ring->out has not been updated so that the entry pointed
     by ring->out is completely owned by the consumer.

Then it checks if the data is ready:

retry:
     if (*entry == NULL)
            goto retry;
That means the producer has updated the index but has not yet written
any data to it.

Finally, it fetches the valid data out, sets the entry back to the
initialized state and updates ring->out to make the entry usable by the producer:

      data = *entry;
      *entry = NULL;
      ring->out++;

Memory barriers are omitted here; please refer to the comments in the code.

(1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
(2) http://dpdk.org/doc/api/rte__ring_8h.html
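
As a usage illustration (not part of the patch), the API below is expected
to be driven roughly as follows; the ring sizes and the 'request' pointer
are made-up examples:

    #include "ring.h"   /* the header added by this patch */

    static void ring_usage_example(void *request)
    {
        Ring *sp_ring = ring_alloc(4, 0);                     /* SP/SC ring */
        Ring *mp_ring = ring_alloc(16, RING_MULTI_PRODUCER);  /* MP/SC ring */
        void *data;

        if (ring_put(sp_ring, request) < 0) {
            /* -ENOBUFS: the ring is full, the caller decides how to retry */
        }

        data = ring_get(sp_ring);   /* NULL when the ring is empty */
        if (data) {
            /* process data; the slot is already reusable by the producer */
        }

        ring_free(sp_ring);
        ring_free(mp_ring);
    }

Note that ring->in and ring->out are free-running unsigned counters: with
size = 4 (mask = 3), ring_index() maps position 5 to slot 1, and the ring
is full exactly when in - out exceeds the mask; unsigned wrap-around keeps
this arithmetic correct.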

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ring.h | 265 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 265 insertions(+)
 create mode 100644 migration/ring.h

diff --git a/migration/ring.h b/migration/ring.h
new file mode 100644
index 0000000000..da9b8bdcbb
--- /dev/null
+++ b/migration/ring.h
@@ -0,0 +1,265 @@
+/*
+ * Ring Buffer
+ *
+ * Multiple producers and single consumer are supported with lock free.
+ *
+ * Copyright (c) 2018 Tencent Inc
+ *
+ * Authors:
+ *  Xiao Guangrong <xiaoguangrong@tencent.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef _RING__
+#define _RING__
+
+#define CACHE_LINE  64
+#define cache_aligned __attribute__((__aligned__(CACHE_LINE)))
+
+#define RING_MULTI_PRODUCER 0x1
+
+struct Ring {
+    unsigned int flags;
+    unsigned int size;
+    unsigned int mask;
+
+    unsigned int in cache_aligned;
+
+    unsigned int out cache_aligned;
+
+    void *data[0] cache_aligned;
+};
+typedef struct Ring Ring;
+
+/*
+ * allocate and initialize the ring
+ *
+ * @size: the number of elements; it must be a power of 2
+ * @flags: set to RING_MULTI_PRODUCER if the ring has multiple producers,
+ *         otherwise set it to 0, i.e. single producer and single consumer.
+ *
+ * return the ring.
+ */
+static inline Ring *ring_alloc(unsigned int size, unsigned int flags)
+{
+    Ring *ring;
+
+    assert(is_power_of_2(size));
+
+    ring = g_malloc0(sizeof(*ring) + size * sizeof(void *));
+    ring->size = size;
+    ring->mask = ring->size - 1;
+    ring->flags = flags;
+    return ring;
+}
+
+static inline void ring_free(Ring *ring)
+{
+    g_free(ring);
+}
+
+static inline bool __ring_is_empty(unsigned int in, unsigned int out)
+{
+    return in == out;
+}
+
+static inline bool ring_is_empty(Ring *ring)
+{
+    return ring->in == ring->out;
+}
+
+static inline unsigned int ring_len(unsigned int in, unsigned int out)
+{
+    return in - out;
+}
+
+static inline bool
+__ring_is_full(Ring *ring, unsigned int in, unsigned int out)
+{
+    return ring_len(in, out) > ring->mask;
+}
+
+static inline bool ring_is_full(Ring *ring)
+{
+    return __ring_is_full(ring, ring->in, ring->out);
+}
+
+static inline unsigned int ring_index(Ring *ring, unsigned int pos)
+{
+    return pos & ring->mask;
+}
+
+static inline int __ring_put(Ring *ring, void *data)
+{
+    unsigned int index, out;
+
+    out = atomic_load_acquire(&ring->out);
+    /*
+     * smp_mb()
+     *
+     * should read ring->out before updating the entry, see the comments in
+     * __ring_get().
+     */
+
+    if (__ring_is_full(ring, ring->in, out)) {
+        return -ENOBUFS;
+    }
+
+    index = ring_index(ring, ring->in);
+
+    atomic_set(&ring->data[index], data);
+
+    /*
+     * should make sure the entry is updated before increasing ring->in,
+     * otherwise the consumer will get an entry whose content is not yet valid.
+     */
+    smp_wmb();
+    atomic_set(&ring->in, ring->in + 1);
+    return 0;
+}
+
+static inline void *__ring_get(Ring *ring)
+{
+    unsigned int index, in;
+    void *data;
+
+    in = atomic_read(&ring->in);
+
+    /*
+     * should read ring->in first to make sure the entry pointed by this
+     * index is available, see the comments in __ring_put().
+     */
+    smp_rmb();
+    if (__ring_is_empty(in, ring->out)) {
+        return NULL;
+    }
+
+    index = ring_index(ring, ring->out);
+
+    data = atomic_read(&ring->data[index]);
+
+    /*
+     * smp_mb()
+     *
+     * once ring->out is updated, the entry originally indicated by the
+     * index becomes visible and usable to the producer, so we should
+     * make sure the entry has been read out before updating ring->out
+     * to avoid it being overwritten by the producer.
+     */
+    atomic_store_release(&ring->out, ring->out + 1);
+
+    return data;
+}
+
+static inline int ring_mp_put(Ring *ring, void *data)
+{
+    unsigned int index, in, in_next, out;
+
+    do {
+        in = atomic_read(&ring->in);
+        out = atomic_read(&ring->out);
+
+        if (__ring_is_full(ring, in, out)) {
+            if (atomic_read(&ring->in) == in &&
+                atomic_read(&ring->out) == out) {
+                return -ENOBUFS;
+            }
+
+            /* an entry has been fetched out, retry. */
+            continue;
+        }
+
+        in_next = in + 1;
+    } while (atomic_cmpxchg(&ring->in, in, in_next) != in);
+
+    index = ring_index(ring, in);
+
+    /*
+     * smp_rmb() paired with the memory barrier of (A) in ring_mp_get()
+     * is implied in atomic_cmpxchg() as we should read ring->out first
+     * before fetching the entry, otherwise this assert will fail.
+     */
+    assert(!atomic_read(&ring->data[index]));
+
+    /*
+     * smp_mb() paired with the memory barrier of (B) in ring_mp_get() is
+     * implied in atomic_cmpxchg(); it is needed here as we should read
+     * ring->out before updating the entry, the same as we did in
+     * __ring_put().
+     *
+     * smp_wmb() paired with the memory barrier of (C) in ring_mp_get()
+     * is implied in atomic_cmpxchg(), that is needed as we should increase
+     * ring->in before updating the entry.
+     */
+    atomic_set(&ring->data[index], data);
+
+    return 0;
+}
+
+static inline void *ring_mp_get(Ring *ring)
+{
+    unsigned int index, in;
+    void *data;
+
+    do {
+        in = atomic_read(&ring->in);
+
+        /*
+         * (C) should read ring->in first to make sure the entry pointed by this
+         * index is available
+         */
+        smp_rmb();
+
+        if (!__ring_is_empty(in, ring->out)) {
+            break;
+        }
+
+        if (atomic_read(&ring->in) == in) {
+            return NULL;
+        }
+        /* new entry has been added in, retry. */
+    } while (1);
+
+    index = ring_index(ring, ring->out);
+
+    do {
+        data = atomic_read(&ring->data[index]);
+        if (data) {
+            break;
+        }
+        /* the producer is updating the entry, retry */
+        cpu_relax();
+    } while (1);
+
+    atomic_set(&ring->data[index], NULL);
+
+    /*
+     * (B) smp_mb() is needed as we should read the entry out before
+     * updating ring->out as we did in __ring_get().
+     *
+     * (A) smp_wmb() is needed as we should make the entry be NULL before
+     * updating ring->out (which will make the entry be visible and usable).
+     */
+    atomic_store_release(&ring->out, ring->out + 1);
+
+    return data;
+}
+
+static inline int ring_put(Ring *ring, void *data)
+{
+    if (ring->flags & RING_MULTI_PRODUCER) {
+        return ring_mp_put(ring, data);
+    }
+    return __ring_put(ring, data);
+}
+
+static inline void *ring_get(Ring *ring)
+{
+    if (ring->flags & RING_MULTI_PRODUCER) {
+        return ring_mp_get(ring);
+    }
+    return __ring_get(ring);
+}
+#endif
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [PATCH 10/12] migration: introduce lockless multithreads model
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Current implementation of compression and decompression are very
hard to be enabled on productions. We noticed that too many wait-wakes
go to kernel space and CPU usages are very low even if the system
is really free

The reasons are:
1) too many locks are used for synchronization: there
   is a global lock and each single thread has its own lock,
   so the migration thread and the worker threads need to go
   to sleep if these locks are busy

2) the migration thread submits requests to each thread
   separately, and only one request can be pending, which means
   the thread has to go to sleep after finishing each request

To make it work better, we introduce a new multithread model:
the user, currently the migration thread, submits requests
to the threads in a round-robin manner; each thread has its own
ring whose capacity is 4 and puts the result into a global ring
which is lockless for multiple producers; the user fetches results
out of the global ring and does the remaining operations for the
request, e.g. posting the compressed data out for migration on
the source QEMU
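
A sketch of how a user is expected to drive this abstraction is shown
below; the compress_* callback names and CompressData are illustrative
assumptions, the real compression user is wired up later in the series:

    #include "threads.h"   /* the abstraction added by this patch */

    static Threads *compress_threads;

    static void compression_setup_example(void)
    {
        compress_threads = threads_create(8, "compress",
                                          compress_request_init,    /* allocate one request */
                                          compress_request_uninit,  /* free one request */
                                          compress_request_handler, /* runs in a worker thread */
                                          compress_request_done);   /* runs in the user context */
    }

    static bool compress_one_page_example(void)
    {
        ThreadRequest *request = threads_submit_request_prepare(compress_threads);

        if (!request) {
            /* no free request or the chosen worker is busy: fall back */
            return false;
        }

        /* fill the request, e.g. via container_of(request, CompressData, base) */
        threads_submit_request_commit(compress_threads, request);
        return true;
    }

    static void compression_flush_example(void)
    {
        /* wait until every submitted request has been completed and reaped */
        threads_wait_done(compress_threads);
    }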

Performance Result:
The test was based on top of the patch:
   ring: introduce lockless ring buffer
that means the previous optimizations are applied to both the original
case and the new multithread model

We tested live migration on two hosts:
   Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz * 64 + 256G memory
to migrate a VM, which has 16 vCPUs and 60G of memory, between them;
during the migration, multiple threads inside the VM are repeatedly
writing its memory

We used 16 threads on the destination to decompress the data and on the
source, we tried 8 threads and 16 threads to compress the data

--- Before our work ---
migration cannot finish with either 8 or 16 compression threads. The data
is as follows:

Use 8 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       70%          some use 36%, others are very low ~20%
- on the destination:
            main thread        decompress-threads
CPU usage       100%         some use ~40%, other are very low ~2%

Migration status (CAN NOT FINISH):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: active
total time: 1019540 milliseconds
expected downtime: 2263 milliseconds
setup: 218 milliseconds
transferred ram: 252419995 kbytes
throughput: 2469.45 mbps
remaining ram: 15611332 kbytes
total ram: 62931784 kbytes
duplicate: 915323 pages
skipped: 0 pages
normal: 59673047 pages
normal bytes: 238692188 kbytes
dirty sync count: 28
page size: 4 kbytes
dirty pages rate: 170551 pages
compression pages: 121309323 pages
compression busy: 60588337
compression busy rate: 0.36
compression reduced size: 484281967178
compression rate: 0.97

Use 16 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       96%          some use 45%, others are very low ~6%
- on the destination:
            main thread        decompress-threads
CPU usage       96%         some use 58%, other are very low ~10%

Migration status (CAN NOT FINISH):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: active
total time: 1189221 milliseconds
expected downtime: 6824 milliseconds
setup: 220 milliseconds
transferred ram: 90620052 kbytes
throughput: 840.41 mbps
remaining ram: 3678760 kbytes
total ram: 62931784 kbytes
duplicate: 195893 pages
skipped: 0 pages
normal: 17290715 pages
normal bytes: 69162860 kbytes
dirty sync count: 33
page size: 4 kbytes
dirty pages rate: 175039 pages
compression pages: 186739419 pages
compression busy: 17486568
compression busy rate: 0.09
compression reduced size: 744546683892
compression rate: 0.97

--- After our work ---
Migration can be finished quickly with either 8 or 16 compression threads.
The data is as follows:

Use 8 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       30%               30% (all threads have same CPU usage)
- on the destination:
            main thread        decompress-threads
CPU usage       100%              50% (all threads have same CPU usage)

Migration status (finished in 219467 ms):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: completed
total time: 219467 milliseconds
downtime: 115 milliseconds
setup: 222 milliseconds
transferred ram: 88510173 kbytes
throughput: 3303.81 mbps
remaining ram: 0 kbytes
total ram: 62931784 kbytes
duplicate: 2211775 pages
skipped: 0 pages
normal: 21166222 pages
normal bytes: 84664888 kbytes
dirty sync count: 15
page size: 4 kbytes
compression pages: 32045857 pages
compression busy: 23377968
compression busy rate: 0.34
compression reduced size: 127767894329
compression rate: 0.97

Use 16 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       60%               60% (all threads have same CPU usage)
- on the destination:
            main thread        decompress-threads
CPU usage       100%              75% (all threads have same CPU usage)

Migration status (finished in 64118 ms):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: completed
total time: 64118 milliseconds
downtime: 29 milliseconds
setup: 223 milliseconds
transferred ram: 13345135 kbytes
throughput: 1705.10 mbps
remaining ram: 0 kbytes
total ram: 62931784 kbytes
duplicate: 574921 pages
skipped: 0 pages
normal: 2570281 pages
normal bytes: 10281124 kbytes
dirty sync count: 9
page size: 4 kbytes
compression pages: 28007024 pages
compression busy: 3145182
compression busy rate: 0.08
compression reduced size: 111829024985
compression rate: 0.97

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/Makefile.objs |   1 +
 migration/threads.c     | 265 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/threads.h     | 116 +++++++++++++++++++++
 3 files changed, 382 insertions(+)
 create mode 100644 migration/threads.c
 create mode 100644 migration/threads.h

diff --git a/migration/Makefile.objs b/migration/Makefile.objs
index c83ec47ba8..bdb61a7983 100644
--- a/migration/Makefile.objs
+++ b/migration/Makefile.objs
@@ -7,6 +7,7 @@ common-obj-y += qemu-file-channel.o
 common-obj-y += xbzrle.o postcopy-ram.o
 common-obj-y += qjson.o
 common-obj-y += block-dirty-bitmap.o
+common-obj-y += threads.o
 
 common-obj-$(CONFIG_RDMA) += rdma.o
 
diff --git a/migration/threads.c b/migration/threads.c
new file mode 100644
index 0000000000..eecd3229b7
--- /dev/null
+++ b/migration/threads.c
@@ -0,0 +1,265 @@
+#include "threads.h"
+
+/* retry to see if there is an available request before actually going to wait. */
+#define BUSY_WAIT_COUNT 1000
+
+static void *thread_run(void *opaque)
+{
+    ThreadLocal *self_data = (ThreadLocal *)opaque;
+    Threads *threads = self_data->threads;
+    void (*handler)(ThreadRequest *data) = threads->thread_request_handler;
+    ThreadRequest *request;
+    int count, ret;
+
+    for ( ; !atomic_read(&self_data->quit); ) {
+        qemu_event_reset(&self_data->ev);
+
+        count = 0;
+        while ((request = ring_get(self_data->request_ring)) ||
+            count < BUSY_WAIT_COUNT) {
+            /*
+             * Busy-wait for a while before going to sleep so that the
+             * user does not need to enter kernel space to wake up the
+             * worker threads.
+             *
+             * That wastes some CPU resource indeed, however it can
+             * significantly improve the case where the next request
+             * becomes available soon.
+             */
+            if (!request) {
+                cpu_relax();
+                count++;
+                continue;
+            }
+            count = 0;
+
+            handler(request);
+
+            do {
+                ret = ring_put(threads->request_done_ring, request);
+                /*
+                 * request_done_ring has enough room to contain all
+                 * requests; however, theoretically, the put can still
+                 * fail if the ring's indexes overflow, which would
+                 * happen if more than 2^32 requests were handled
+                 * between two calls of threads_wait_done().
+                 * So we retry to make the code more robust.
+                 *
+                 * That is an unlikely case for migration as a block's
+                 * memory is unlikely to exceed 16T (2^32 pages).
+                 */
+                if (ret) {
+                    fprintf(stderr,
+                            "Potential BUG if it is triggered by migration.\n");
+                }
+            } while (ret);
+        }
+
+        qemu_event_wait(&self_data->ev);
+    }
+
+    return NULL;
+}
+
+static void add_free_request(Threads *threads, ThreadRequest *request)
+{
+    QSLIST_INSERT_HEAD(&threads->free_requests, request, node);
+    threads->free_requests_nr++;
+}
+
+static ThreadRequest *get_and_remove_first_free_request(Threads *threads)
+{
+    ThreadRequest *request;
+
+    if (QSLIST_EMPTY(&threads->free_requests)) {
+        return NULL;
+    }
+
+    request = QSLIST_FIRST(&threads->free_requests);
+    QSLIST_REMOVE_HEAD(&threads->free_requests, node);
+    threads->free_requests_nr--;
+    return request;
+}
+
+static void uninit_requests(Threads *threads, int free_nr)
+{
+    ThreadRequest *request;
+
+    /*
+     * all requests should have been released back to the list before the
+     * threads are destroyed, i.e. threads_wait_done() should be called first.
+     */
+    assert(threads->free_requests_nr == free_nr);
+
+    while ((request = get_and_remove_first_free_request(threads))) {
+        threads->thread_request_uninit(request);
+    }
+
+    assert(ring_is_empty(threads->request_done_ring));
+    ring_free(threads->request_done_ring);
+}
+
+static int init_requests(Threads *threads)
+{
+    ThreadRequest *request;
+    unsigned int done_ring_size = pow2roundup32(threads->total_requests);
+    int i, free_nr = 0;
+
+    threads->request_done_ring = ring_alloc(done_ring_size,
+                                            RING_MULTI_PRODUCER);
+
+    QSLIST_INIT(&threads->free_requests);
+    for (i = 0; i < threads->total_requests; i++) {
+        request = threads->thread_request_init();
+        if (!request) {
+            goto cleanup;
+        }
+
+        free_nr++;
+        add_free_request(threads, request);
+    }
+    return 0;
+
+cleanup:
+    uninit_requests(threads, free_nr);
+    return -1;
+}
+
+static void uninit_thread_data(Threads *threads)
+{
+    ThreadLocal *thread_local = threads->per_thread_data;
+    int i;
+
+    for (i = 0; i < threads->threads_nr; i++) {
+        thread_local[i].quit = true;
+        qemu_event_set(&thread_local[i].ev);
+        qemu_thread_join(&thread_local[i].thread);
+        qemu_event_destroy(&thread_local[i].ev);
+        assert(ring_is_empty(thread_local[i].request_ring));
+        ring_free(thread_local[i].request_ring);
+    }
+}
+
+static void init_thread_data(Threads *threads)
+{
+    ThreadLocal *thread_local = threads->per_thread_data;
+    char *name;
+    int i;
+
+    for (i = 0; i < threads->threads_nr; i++) {
+        qemu_event_init(&thread_local[i].ev, false);
+
+        thread_local[i].threads = threads;
+        thread_local[i].self = i;
+        thread_local[i].request_ring = ring_alloc(threads->thread_ring_size, 0);
+        name = g_strdup_printf("%s/%d", threads->name, thread_local[i].self);
+        qemu_thread_create(&thread_local[i].thread, name,
+                           thread_run, &thread_local[i], QEMU_THREAD_JOINABLE);
+        g_free(name);
+    }
+}
+
+/* the size of thread local request ring */
+#define THREAD_REQ_RING_SIZE 4
+
+Threads *threads_create(unsigned int threads_nr, const char *name,
+                        ThreadRequest *(*thread_request_init)(void),
+                        void (*thread_request_uninit)(ThreadRequest *request),
+                        void (*thread_request_handler)(ThreadRequest *request),
+                        void (*thread_request_done)(ThreadRequest *request))
+{
+    Threads *threads;
+    int ret;
+
+    threads = g_malloc0(sizeof(*threads) + threads_nr * sizeof(ThreadLocal));
+    threads->threads_nr = threads_nr;
+    threads->thread_ring_size = THREAD_REQ_RING_SIZE;
+    threads->total_requests = threads->thread_ring_size * threads_nr;
+
+    threads->name = name;
+    threads->thread_request_init = thread_request_init;
+    threads->thread_request_uninit = thread_request_uninit;
+    threads->thread_request_handler = thread_request_handler;
+    threads->thread_request_done = thread_request_done;
+
+    ret = init_requests(threads);
+    if (ret) {
+        g_free(threads);
+        return NULL;
+    }
+
+    init_thread_data(threads);
+    return threads;
+}
+
+void threads_destroy(Threads *threads)
+{
+    uninit_thread_data(threads);
+    uninit_requests(threads, threads->total_requests);
+    g_free(threads);
+}
+
+ThreadRequest *threads_submit_request_prepare(Threads *threads)
+{
+    ThreadRequest *request;
+    unsigned int index;
+
+    index = threads->current_thread_index % threads->threads_nr;
+
+    /* the thread is busy */
+    if (ring_is_full(threads->per_thread_data[index].request_ring)) {
+        return NULL;
+    }
+
+    /* try to get the request from the list */
+    request = get_and_remove_first_free_request(threads);
+    if (request) {
+        goto got_request;
+    }
+
+    /* reuse a request that has already been handled by the threads */
+    request = ring_get(threads->request_done_ring);
+    if (request) {
+        threads->thread_request_done(request);
+        goto got_request;
+    }
+    return NULL;
+
+got_request:
+    threads->current_thread_index++;
+    request->thread_index = index;
+    return request;
+}
+
+void threads_submit_request_commit(Threads *threads, ThreadRequest *request)
+{
+    int ret, index = request->thread_index;
+    ThreadLocal *thread_local = &threads->per_thread_data[index];
+
+    ret = ring_put(thread_local->request_ring, request);
+
+    /*
+     * we have detected that the thread's ring is not full in
+     * threads_submit_request_prepare(), there should be free
+     * room in the ring
+     */
+    assert(!ret);
+    /* new request arrived, notify the thread */
+    qemu_event_set(&thread_local->ev);
+}
+
+void threads_wait_done(Threads *threads)
+{
+    ThreadRequest *request;
+
+retry:
+    while ((request = ring_get(threads->request_done_ring))) {
+        threads->thread_request_done(request);
+        add_free_request(threads, request);
+    }
+
+    if (threads->free_requests_nr != threads->total_requests) {
+        cpu_relax();
+        goto retry;
+    }
+}
diff --git a/migration/threads.h b/migration/threads.h
new file mode 100644
index 0000000000..eced913065
--- /dev/null
+++ b/migration/threads.h
@@ -0,0 +1,116 @@
+#ifndef QEMU_MIGRATION_THREAD_H
+#define QEMU_MIGRATION_THREAD_H
+
+/*
+ * Multithreads abstraction
+ *
+ * This is the abstraction layer for multithreads management which is
+ * used to speed up migration.
+ *
+ * Note: currently only one producer is allowed.
+ *
+ * Copyright(C) 2018 Tencent Corporation.
+ *
+ * Author:
+ *   Xiao Guangrong <xiaoguangrong@tencent.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/boards.h"
+
+#include "ring.h"
+
+/*
+ * the request representation which contains the internally used metadata;
+ * it can be embedded in the user's self-defined data struct and the user
+ * can use container_of() to get back the self-defined data
+ */
+struct ThreadRequest {
+    QSLIST_ENTRY(ThreadRequest) node;
+    unsigned int thread_index;
+};
+typedef struct ThreadRequest ThreadRequest;
+
+struct Threads;
+
+struct ThreadLocal {
+    QemuThread thread;
+
+    /* the event used to wake up the thread */
+    QemuEvent ev;
+
+    struct Threads *threads;
+
+    /* local request ring which is filled by the user */
+    Ring *request_ring;
+
+    /* the index of the thread */
+    int self;
+
+    /* thread is useless and needs to exit */
+    bool quit;
+};
+typedef struct ThreadLocal ThreadLocal;
+
+/*
+ * the main data struct represents multithreads which is shared by
+ * all threads
+ */
+struct Threads {
+    const char *name;
+    unsigned int threads_nr;
+    /* requests are pushed to the threads in a round-robin manner */
+    unsigned int current_thread_index;
+
+    int thread_ring_size;
+    int total_requests;
+
+    /* the request is pre-allocated and linked in the list */
+    int free_requests_nr;
+    QSLIST_HEAD(, ThreadRequest) free_requests;
+
+    /* the constructor of request */
+    ThreadRequest *(*thread_request_init)(void);
+    /* the destructor of request */
+    void (*thread_request_uninit)(ThreadRequest *request);
+    /* the handler of the request which is called in the thread */
+    void (*thread_request_handler)(ThreadRequest *request);
+    /*
+     * the handler to process the result which is called in the
+     * user's context
+     */
+    void (*thread_request_done)(ThreadRequest *request);
+
+    /* the thread push the result to this ring so it has multiple producers */
+    Ring *request_done_ring;
+
+    ThreadLocal per_thread_data[0];
+};
+typedef struct Threads Threads;
+
+Threads *threads_create(unsigned int threads_nr, const char *name,
+                        ThreadRequest *(*thread_request_init)(void),
+                        void (*thread_request_uninit)(ThreadRequest *request),
+                        void (*thread_request_handler)(ThreadRequest *request),
+                        void (*thread_request_done)(ThreadRequest *request));
+void threads_destroy(Threads *threads);
+
+/*
+ * find a free request and associate it with a free thread.
+ * If no request or no thread is free, return NULL
+ */
+ThreadRequest *threads_submit_request_prepare(Threads *threads);
+/*
+ * push the request to its thread's local ring and notify the thread
+ */
+void threads_submit_request_commit(Threads *threads, ThreadRequest *request);
+
+/*
+ * wait for all threads to complete the requests in their local rings
+ * so that no previously submitted request is still outstanding.
+ */
+void threads_wait_done(Threads *threads);
+#endif
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* [Qemu-devel] [PATCH 10/12] migration: introduce lockless multithreads model
@ 2018-06-04  9:55   ` guangrong.xiao
  0 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: qemu-devel, kvm, dgilbert, peterx, jiang.biao2, wei.w.wang,
	Xiao Guangrong

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Current implementation of compression and decompression are very
hard to be enabled on productions. We noticed that too many wait-wakes
go to kernel space and CPU usages are very low even if the system
is really free

The reasons are:
1) there are two many locks used to do synchronous,there
  is a global lock and each single thread has its own lock,
  migration thread and work threads need to go to sleep if
  these locks are busy

2) migration thread separately submits request to the thread
   however, only one request can be pended, that means, the
   thread has to go to sleep after finishing the request

To make it work better, we introduce a new multithread model,
the user, currently it is the migration thread, submits request
to each thread with round-robin manner, the thread has its own
ring whose capacity is 4 and puts the result to a global ring
which is lockless for multiple producers, the user fetches result
out from the global ring and do remaining operations for the
request, e.g, posting the compressed data out for migration on
the source QEMU

Performance Result:
The test was based on top of the patch:
   ring: introduce lockless ring buffer
that means, previous optimizations are used for both of original case
and applying the new multithread model

We tested live migration on two hosts:
   Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz * 64 + 256G memory
to migration a VM between each other, which has 16 vCPUs and 60G
memory, during the migration, multiple threads are repeatedly writing
the memory in the VM

We used 16 threads on the destination to decompress the data and on the
source, we tried 8 threads and 16 threads to compress the data

--- Before our work ---
migration can not be finished for both 8 threads and 16 threads. The data
is as followings:

Use 8 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       70%          some use 36%, others are very low ~20%
- on the destination:
            main thread        decompress-threads
CPU usage       100%         some use ~40%, other are very low ~2%

Migration status (CAN NOT FINISH):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: active
total time: 1019540 milliseconds
expected downtime: 2263 milliseconds
setup: 218 milliseconds
transferred ram: 252419995 kbytes
throughput: 2469.45 mbps
remaining ram: 15611332 kbytes
total ram: 62931784 kbytes
duplicate: 915323 pages
skipped: 0 pages
normal: 59673047 pages
normal bytes: 238692188 kbytes
dirty sync count: 28
page size: 4 kbytes
dirty pages rate: 170551 pages
compression pages: 121309323 pages
compression busy: 60588337
compression busy rate: 0.36
compression reduced size: 484281967178
compression rate: 0.97

Use 16 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       96%          some use 45%, others are very low ~6%
- on the destination:
            main thread        decompress-threads
CPU usage       96%         some use 58%, other are very low ~10%

Migration status (CAN NOT FINISH):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: active
total time: 1189221 milliseconds
expected downtime: 6824 milliseconds
setup: 220 milliseconds
transferred ram: 90620052 kbytes
throughput: 840.41 mbps
remaining ram: 3678760 kbytes
total ram: 62931784 kbytes
duplicate: 195893 pages
skipped: 0 pages
normal: 17290715 pages
normal bytes: 69162860 kbytes
dirty sync count: 33
page size: 4 kbytes
dirty pages rate: 175039 pages
compression pages: 186739419 pages
compression busy: 17486568
compression busy rate: 0.09
compression reduced size: 744546683892
compression rate: 0.97

--- After our work ---
Migration can be finished quickly for both 8 threads and 16 threads. The
data is as followings:

Use 8 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       30%               30% (all threads have same CPU usage)
- on the destination:
            main thread        decompress-threads
CPU usage       100%              50% (all threads have same CPU usage)

Migration status (finished in 219467 ms):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: completed
total time: 219467 milliseconds
downtime: 115 milliseconds
setup: 222 milliseconds
transferred ram: 88510173 kbytes
throughput: 3303.81 mbps
remaining ram: 0 kbytes
total ram: 62931784 kbytes
duplicate: 2211775 pages
skipped: 0 pages
normal: 21166222 pages
normal bytes: 84664888 kbytes
dirty sync count: 15
page size: 4 kbytes
compression pages: 32045857 pages
compression busy: 23377968
compression busy rate: 0.34
compression reduced size: 127767894329
compression rate: 0.97

Use 16 threads to compress:
- on the source:
	    migration thread   compress-threads
CPU usage       60%               60% (all threads have same CPU usage)
- on the destination:
            main thread        decompress-threads
CPU usage       100%              75% (all threads have same CPU usage)

Migration status (finished in 64118 ms):
info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
Migration status: completed
total time: 64118 milliseconds
downtime: 29 milliseconds
setup: 223 milliseconds
transferred ram: 13345135 kbytes
throughput: 1705.10 mbps
remaining ram: 0 kbytes
total ram: 62931784 kbytes
duplicate: 574921 pages
skipped: 0 pages
normal: 2570281 pages
normal bytes: 10281124 kbytes
dirty sync count: 9
page size: 4 kbytes
compression pages: 28007024 pages
compression busy: 3145182
compression busy rate: 0.08
compression reduced size: 111829024985
compression rate: 0.97

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/Makefile.objs |   1 +
 migration/threads.c     | 265 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/threads.h     | 116 +++++++++++++++++++++
 3 files changed, 382 insertions(+)
 create mode 100644 migration/threads.c
 create mode 100644 migration/threads.h

diff --git a/migration/Makefile.objs b/migration/Makefile.objs
index c83ec47ba8..bdb61a7983 100644
--- a/migration/Makefile.objs
+++ b/migration/Makefile.objs
@@ -7,6 +7,7 @@ common-obj-y += qemu-file-channel.o
 common-obj-y += xbzrle.o postcopy-ram.o
 common-obj-y += qjson.o
 common-obj-y += block-dirty-bitmap.o
+common-obj-y += threads.o
 
 common-obj-$(CONFIG_RDMA) += rdma.o
 
diff --git a/migration/threads.c b/migration/threads.c
new file mode 100644
index 0000000000..eecd3229b7
--- /dev/null
+++ b/migration/threads.c
@@ -0,0 +1,265 @@
+#include "threads.h"
+
+/* retry to see if a request is available before actually going to wait. */
+#define BUSY_WAIT_COUNT 1000
+
+static void *thread_run(void *opaque)
+{
+    ThreadLocal *self_data = (ThreadLocal *)opaque;
+    Threads *threads = self_data->threads;
+    void (*handler)(ThreadRequest *data) = threads->thread_request_handler;
+    ThreadRequest *request;
+    int count, ret;
+
+    for ( ; !atomic_read(&self_data->quit); ) {
+        qemu_event_reset(&self_data->ev);
+
+        count = 0;
+        while ((request = ring_get(self_data->request_ring)) ||
+            count < BUSY_WAIT_COUNT) {
+             /*
+              * Busy-wait for a while before going to sleep so that the
+              * user does not need to enter kernel space to wake this
+              * thread up again.
+              *
+              * That wastes some CPU indeed, but it significantly
+              * improves the case where the next request becomes
+              * available soon.
+              */
+             if (!request) {
+                cpu_relax();
+                count++;
+                continue;
+            }
+            count = 0;
+
+            handler(request);
+
+            do {
+                ret = ring_put(threads->request_done_ring, request);
+                /*
+                 * request_done_ring has enough room to hold all
+                 * requests.  However, it can theoretically still fail
+                 * if the ring indexes overflow, which would happen if
+                 * more than 2^32 requests were handled between two
+                 * calls of threads_wait_done().  Retry to make the
+                 * code more robust.
+                 *
+                 * That is unlikely for migration, as a RAM block is
+                 * unlikely to hold more than 16T (2^32 pages of 4K).
+                 */
+                if (ret) {
+                    fprintf(stderr,
+                            "Potential BUG if it is triggered by migration.\n");
+                }
+            } while (ret);
+        }
+
+        qemu_event_wait(&self_data->ev);
+    }
+
+    return NULL;
+}
+
+static void add_free_request(Threads *threads, ThreadRequest *request)
+{
+    QSLIST_INSERT_HEAD(&threads->free_requests, request, node);
+    threads->free_requests_nr++;
+}
+
+static ThreadRequest *get_and_remove_first_free_request(Threads *threads)
+{
+    ThreadRequest *request;
+
+    if (QSLIST_EMPTY(&threads->free_requests)) {
+        return NULL;
+    }
+
+    request = QSLIST_FIRST(&threads->free_requests);
+    QSLIST_REMOVE_HEAD(&threads->free_requests, node);
+    threads->free_requests_nr--;
+    return request;
+}
+
+static void uninit_requests(Threads *threads, int free_nr)
+{
+    ThreadRequest *request;
+
+    /*
+     * All requests should have been returned to the free list before
+     * the threads are destroyed, i.e. threads_wait_done() should be
+     * called first.
+     */
+    assert(threads->free_requests_nr == free_nr);
+
+    while ((request = get_and_remove_first_free_request(threads))) {
+        threads->thread_request_uninit(request);
+    }
+
+    assert(ring_is_empty(threads->request_done_ring));
+    ring_free(threads->request_done_ring);
+}
+
+static int init_requests(Threads *threads)
+{
+    ThreadRequest *request;
+    unsigned int done_ring_size = pow2roundup32(threads->total_requests);
+    int i, free_nr = 0;
+
+    threads->request_done_ring = ring_alloc(done_ring_size,
+                                            RING_MULTI_PRODUCER);
+
+    QSLIST_INIT(&threads->free_requests);
+    for (i = 0; i < threads->total_requests; i++) {
+        request = threads->thread_request_init();
+        if (!request) {
+            goto cleanup;
+        }
+
+        free_nr++;
+        add_free_request(threads, request);
+    }
+    return 0;
+
+cleanup:
+    uninit_requests(threads, free_nr);
+    return -1;
+}
+
+static void uninit_thread_data(Threads *threads)
+{
+    ThreadLocal *thread_local = threads->per_thread_data;
+    int i;
+
+    for (i = 0; i < threads->threads_nr; i++) {
+        thread_local[i].quit = true;
+        qemu_event_set(&thread_local[i].ev);
+        qemu_thread_join(&thread_local[i].thread);
+        qemu_event_destroy(&thread_local[i].ev);
+        assert(ring_is_empty(thread_local[i].request_ring));
+        ring_free(thread_local[i].request_ring);
+    }
+}
+
+static void init_thread_data(Threads *threads)
+{
+    ThreadLocal *thread_local = threads->per_thread_data;
+    char *name;
+    int i;
+
+    for (i = 0; i < threads->threads_nr; i++) {
+        qemu_event_init(&thread_local[i].ev, false);
+
+        thread_local[i].threads = threads;
+        thread_local[i].self = i;
+        thread_local[i].request_ring = ring_alloc(threads->thread_ring_size, 0);
+        name = g_strdup_printf("%s/%d", threads->name, thread_local[i].self);
+        qemu_thread_create(&thread_local[i].thread, name,
+                           thread_run, &thread_local[i], QEMU_THREAD_JOINABLE);
+        g_free(name);
+    }
+}
+
+/* the size of thread local request ring */
+#define THREAD_REQ_RING_SIZE 4
+
+Threads *threads_create(unsigned int threads_nr, const char *name,
+                        ThreadRequest *(*thread_request_init)(void),
+                        void (*thread_request_uninit)(ThreadRequest *request),
+                        void (*thread_request_handler)(ThreadRequest *request),
+                        void (*thread_request_done)(ThreadRequest *request))
+{
+    Threads *threads;
+    int ret;
+
+    threads = g_malloc0(sizeof(*threads) + threads_nr * sizeof(ThreadLocal));
+    threads->threads_nr = threads_nr;
+    threads->thread_ring_size = THREAD_REQ_RING_SIZE;
+    threads->total_requests = threads->thread_ring_size * threads_nr;
+
+    threads->name = name;
+    threads->thread_request_init = thread_request_init;
+    threads->thread_request_uninit = thread_request_uninit;
+    threads->thread_request_handler = thread_request_handler;
+    threads->thread_request_done = thread_request_done;
+
+    ret = init_requests(threads);
+    if (ret) {
+        g_free(threads);
+        return NULL;
+    }
+
+    init_thread_data(threads);
+    return threads;
+}
+
+void threads_destroy(Threads *threads)
+{
+    uninit_thread_data(threads);
+    uninit_requests(threads, threads->total_requests);
+    g_free(threads);
+}
+
+ThreadRequest *threads_submit_request_prepare(Threads *threads)
+{
+    ThreadRequest *request;
+    unsigned int index;
+
+    index = threads->current_thread_index % threads->threads_nr;
+
+    /* the thread is busy */
+    if (ring_is_full(threads->per_thread_data[index].request_ring)) {
+        return NULL;
+    }
+
+    /* try to get the request from the list */
+    request = get_and_remove_first_free_request(threads);
+    if (request) {
+        goto got_request;
+    }
+
+    /* reuse a request that has already been completed by the threads */
+    request = ring_get(threads->request_done_ring);
+    if (request) {
+        threads->thread_request_done(request);
+        goto got_request;
+    }
+    return NULL;
+
+got_request:
+    threads->current_thread_index++;
+    request->thread_index = index;
+    return request;
+}
+
+void threads_submit_request_commit(Threads *threads, ThreadRequest *request)
+{
+    int ret, index = request->thread_index;
+    ThreadLocal *thread_local = &threads->per_thread_data[index];
+
+    ret = ring_put(thread_local->request_ring, request);
+
+    /*
+     * threads_submit_request_prepare() has already checked that the
+     * thread's ring is not full, so there must be free room in the
+     * ring
+     */
+    assert(!ret);
+    /* new request arrived, notify the thread */
+    qemu_event_set(&thread_local->ev);
+}
+
+void threads_wait_done(Threads *threads)
+{
+    ThreadRequest *request;
+
+retry:
+    while ((request = ring_get(threads->request_done_ring))) {
+        threads->thread_request_done(request);
+        add_free_request(threads, request);
+    }
+
+    if (threads->free_requests_nr != threads->total_requests) {
+        cpu_relax();
+        goto retry;
+    }
+}
diff --git a/migration/threads.h b/migration/threads.h
new file mode 100644
index 0000000000..eced913065
--- /dev/null
+++ b/migration/threads.h
@@ -0,0 +1,116 @@
+#ifndef QEMU_MIGRATION_THREAD_H
+#define QEMU_MIGRATION_THREAD_H
+
+/*
+ * Multithreads abstraction
+ *
+ * This is the abstraction layer for managing multiple threads, which
+ * is used to speed up migration.
+ *
+ * Note: currently only one producer is allowed.
+ *
+ * Copyright(C) 2018 Tencent Corporation.
+ *
+ * Author:
+ *   Xiao Guangrong <xiaoguangrong@tencent.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "hw/boards.h"
+
+#include "ring.h"
+
+/*
+ * the request representation, which contains the internally used metadata;
+ * it can be embedded into the user's self-defined data struct and the user
+ * can use container_of() to get the self-defined data back
+ */
+struct ThreadRequest {
+    QSLIST_ENTRY(ThreadRequest) node;
+    unsigned int thread_index;
+};
+typedef struct ThreadRequest ThreadRequest;
+
+struct Threads;
+
+struct ThreadLocal {
+    QemuThread thread;
+
+    /* the event used to wake up the thread */
+    QemuEvent ev;
+
+    struct Threads *threads;
+
+    /* local request ring which is filled by the user */
+    Ring *request_ring;
+
+    /* the index of the thread */
+    int self;
+
+    /* set to true when the thread should exit */
+    bool quit;
+};
+typedef struct ThreadLocal ThreadLocal;
+
+/*
+ * the main data struct that represents the thread pool; it is shared
+ * by all threads
+ */
+struct Threads {
+    const char *name;
+    unsigned int threads_nr;
+    /* requests are pushed to the threads in a round-robin manner */
+    unsigned int current_thread_index;
+
+    int thread_ring_size;
+    int total_requests;
+
+    /* requests are pre-allocated and linked in the free list */
+    int free_requests_nr;
+    QSLIST_HEAD(, ThreadRequest) free_requests;
+
+    /* the constructor of request */
+    ThreadRequest *(*thread_request_init)(void);
+    /* the destructor of request */
+    void (*thread_request_uninit)(ThreadRequest *request);
+    /* the handler of the request which is called in the thread */
+    void (*thread_request_handler)(ThreadRequest *request);
+    /*
+     * the handler to process the result which is called in the
+     * user's context
+     */
+    void (*thread_request_done)(ThreadRequest *request);
+
+    /* the threads push results to this ring, so it has multiple producers */
+    Ring *request_done_ring;
+
+    ThreadLocal per_thread_data[0];
+};
+typedef struct Threads Threads;
+
+Threads *threads_create(unsigned int threads_nr, const char *name,
+                        ThreadRequest *(*thread_request_init)(void),
+                        void (*thread_request_uninit)(ThreadRequest *request),
+                        void (*thread_request_handler)(ThreadRequest *request),
+                        void (*thread_request_done)(ThreadRequest *request));
+void threads_destroy(Threads *threads);
+
+/*
+ * find a free request and associate it with a free thread.
+ * If no request or no thread is free, return NULL
+ */
+ThreadRequest *threads_submit_request_prepare(Threads *threads);
+/*
+ * push the request to its thread's local ring and notify the thread
+ */
+void threads_submit_request_commit(Threads *threads, ThreadRequest *request);
+
+/*
+ * wait for all threads to complete the requests in their local rings
+ * so that no previously submitted request is still pending
+ */
+void threads_wait_done(Threads *threads);
+#endif
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread
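
To make the interface above concrete, here is a minimal, hypothetical user of
the API declared in threads.h.  Only ThreadRequest, container_of() and the
threads_*() functions come from the patch; the DemoRequest struct and the
demo_*() callbacks are invented purely for illustration.

#include "migration/threads.h"

/* hypothetical request type embedding ThreadRequest */
typedef struct DemoRequest {
    int input;              /* filled by the submitter */
    int output;             /* filled by the worker thread */
    ThreadRequest request;  /* recovered via container_of() */
} DemoRequest;

static ThreadRequest *demo_request_init(void)
{
    DemoRequest *req = g_new0(DemoRequest, 1);

    return &req->request;
}

static void demo_request_uninit(ThreadRequest *request)
{
    g_free(container_of(request, DemoRequest, request));
}

/* runs in a worker thread */
static void demo_request_handler(ThreadRequest *request)
{
    DemoRequest *req = container_of(request, DemoRequest, request);

    req->output = req->input * 2;
}

/* runs in the submitter's context, from prepare() or threads_wait_done() */
static void demo_request_done(ThreadRequest *request)
{
    DemoRequest *req = container_of(request, DemoRequest, request);

    printf("%d -> %d\n", req->input, req->output);
}

static void demo(void)
{
    Threads *threads = threads_create(4, "demo",
                                      demo_request_init,
                                      demo_request_uninit,
                                      demo_request_handler,
                                      demo_request_done);
    int i;

    if (!threads) {
        return;
    }

    for (i = 0; i < 128; i++) {
        ThreadRequest *request = threads_submit_request_prepare(threads);

        if (!request) {
            /* all worker rings are full and nothing can be recycled yet */
            continue;
        }

        container_of(request, DemoRequest, request)->input = i;
        threads_submit_request_commit(threads, request);
    }

    threads_wait_done(threads);     /* drain every outstanding request */
    threads_destroy(threads);
}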

* [PATCH 11/12] migration: use lockless Multithread model for compression
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Adapt the compression code to the lockless multithread model

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 412 ++++++++++++++++++++++----------------------------------
 1 file changed, 161 insertions(+), 251 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 0a38c1c61e..58ecf5caa0 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -55,6 +55,7 @@
 #include "sysemu/sysemu.h"
 #include "qemu/uuid.h"
 #include "savevm.h"
+#include "migration/threads.h"
 
 /***********************************************************/
 /* ram save/restore */
@@ -340,21 +341,6 @@ typedef struct PageSearchStatus PageSearchStatus;
 
 CompressionStats compression_counters;
 
-struct CompressParam {
-    bool done;
-    bool quit;
-    QEMUFile *file;
-    QemuMutex mutex;
-    QemuCond cond;
-    RAMBlock *block;
-    ram_addr_t offset;
-
-    /* internally used fields */
-    z_stream stream;
-    uint8_t *originbuf;
-};
-typedef struct CompressParam CompressParam;
-
 struct DecompressParam {
     bool done;
     bool quit;
@@ -367,15 +353,6 @@ struct DecompressParam {
 };
 typedef struct DecompressParam DecompressParam;
 
-static CompressParam *comp_param;
-static QemuThread *compress_threads;
-/* comp_done_cond is used to wake up the migration thread when
- * one of the compression threads has finished the compression.
- * comp_done_lock is used to co-work with comp_done_cond.
- */
-static QemuMutex comp_done_lock;
-static QemuCond comp_done_cond;
-/* The empty QEMUFileOps will be used by file in CompressParam */
 static const QEMUFileOps empty_ops = { };
 
 static QEMUFile *decomp_file;
@@ -384,131 +361,6 @@ static QemuThread *decompress_threads;
 static QemuMutex decomp_done_lock;
 static QemuCond decomp_done_cond;
 
-static int do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
-                                ram_addr_t offset, uint8_t *source_buf);
-
-static void *do_data_compress(void *opaque)
-{
-    CompressParam *param = opaque;
-    RAMBlock *block;
-    ram_addr_t offset;
-
-    qemu_mutex_lock(&param->mutex);
-    while (!param->quit) {
-        if (param->block) {
-            block = param->block;
-            offset = param->offset;
-            param->block = NULL;
-            qemu_mutex_unlock(&param->mutex);
-
-            do_compress_ram_page(param->file, &param->stream, block, offset,
-                                 param->originbuf);
-
-            qemu_mutex_lock(&comp_done_lock);
-            param->done = true;
-            qemu_cond_signal(&comp_done_cond);
-            qemu_mutex_unlock(&comp_done_lock);
-
-            qemu_mutex_lock(&param->mutex);
-        } else {
-            qemu_cond_wait(&param->cond, &param->mutex);
-        }
-    }
-    qemu_mutex_unlock(&param->mutex);
-
-    return NULL;
-}
-
-static inline void terminate_compression_threads(void)
-{
-    int idx, thread_count;
-
-    thread_count = migrate_compress_threads();
-
-    for (idx = 0; idx < thread_count; idx++) {
-        qemu_mutex_lock(&comp_param[idx].mutex);
-        comp_param[idx].quit = true;
-        qemu_cond_signal(&comp_param[idx].cond);
-        qemu_mutex_unlock(&comp_param[idx].mutex);
-    }
-}
-
-static void compress_threads_save_cleanup(void)
-{
-    int i, thread_count;
-
-    if (!migrate_use_compression()) {
-        return;
-    }
-    terminate_compression_threads();
-    thread_count = migrate_compress_threads();
-    for (i = 0; i < thread_count; i++) {
-        /*
-         * we use it as a indicator which shows if the thread is
-         * properly init'd or not
-         */
-        if (!comp_param[i].file) {
-            break;
-        }
-        qemu_thread_join(compress_threads + i);
-        qemu_mutex_destroy(&comp_param[i].mutex);
-        qemu_cond_destroy(&comp_param[i].cond);
-        deflateEnd(&comp_param[i].stream);
-        g_free(comp_param[i].originbuf);
-        qemu_fclose(comp_param[i].file);
-        comp_param[i].file = NULL;
-    }
-    qemu_mutex_destroy(&comp_done_lock);
-    qemu_cond_destroy(&comp_done_cond);
-    g_free(compress_threads);
-    g_free(comp_param);
-    compress_threads = NULL;
-    comp_param = NULL;
-}
-
-static int compress_threads_save_setup(void)
-{
-    int i, thread_count;
-
-    if (!migrate_use_compression()) {
-        return 0;
-    }
-    thread_count = migrate_compress_threads();
-    compress_threads = g_new0(QemuThread, thread_count);
-    comp_param = g_new0(CompressParam, thread_count);
-    qemu_cond_init(&comp_done_cond);
-    qemu_mutex_init(&comp_done_lock);
-    for (i = 0; i < thread_count; i++) {
-        comp_param[i].originbuf = g_try_malloc(TARGET_PAGE_SIZE);
-        if (!comp_param[i].originbuf) {
-            goto exit;
-        }
-
-        if (deflateInit(&comp_param[i].stream,
-                        migrate_compress_level()) != Z_OK) {
-            g_free(comp_param[i].originbuf);
-            goto exit;
-        }
-
-        /* comp_param[i].file is just used as a dummy buffer to save data,
-         * set its ops to empty.
-         */
-        comp_param[i].file = qemu_fopen_ops(NULL, &empty_ops);
-        comp_param[i].done = true;
-        comp_param[i].quit = false;
-        qemu_mutex_init(&comp_param[i].mutex);
-        qemu_cond_init(&comp_param[i].cond);
-        qemu_thread_create(compress_threads + i, "compress",
-                           do_data_compress, comp_param + i,
-                           QEMU_THREAD_JOINABLE);
-    }
-    return 0;
-
-exit:
-    compress_threads_save_cleanup();
-    return -1;
-}
-
 /* Multiple fd's */
 
 #define MULTIFD_MAGIC 0x11223344U
@@ -965,6 +817,151 @@ static void mig_throttle_guest_down(void)
     }
 }
 
+static void ram_release_pages(const char *rbname, uint64_t offset, int pages)
+{
+    if (!migrate_release_ram() || !migration_in_postcopy()) {
+        return;
+    }
+
+    ram_discard_range(rbname, offset, pages << TARGET_PAGE_BITS);
+}
+
+static int do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
+                                ram_addr_t offset, uint8_t *source_buf)
+{
+    RAMState *rs = ram_state;
+    int bytes_sent, blen;
+    uint8_t *p = block->host + (offset & TARGET_PAGE_MASK);
+
+    bytes_sent = save_page_header(rs, f, block, offset |
+                                  RAM_SAVE_FLAG_COMPRESS_PAGE);
+
+    /*
+     * copy it to an internal buffer to avoid it being modified by the
+     * VM so that we can catch errors during compression and
+     * decompression
+     */
+    memcpy(source_buf, p, TARGET_PAGE_SIZE);
+    blen = qemu_put_compression_data(f, stream, source_buf, TARGET_PAGE_SIZE);
+    if (blen < 0) {
+        bytes_sent = 0;
+        qemu_file_set_error(migrate_get_current()->to_dst_file, blen);
+        error_report("compressed data failed!");
+    } else {
+        bytes_sent += blen;
+        ram_release_pages(block->idstr, offset & TARGET_PAGE_MASK, 1);
+    }
+
+    return bytes_sent;
+}
+
+struct CompressData {
+    /* filled by the migration thread */
+    RAMBlock *block;
+    ram_addr_t offset;
+
+    /* filled by compress thread. */
+    QEMUFile *file;
+    z_stream stream;
+    uint8_t *originbuf;
+
+    ThreadRequest request;
+};
+typedef struct CompressData CompressData;
+
+static ThreadRequest *compress_thread_data_init(void)
+{
+    CompressData *cd = g_new0(CompressData, 1);
+
+    cd->originbuf = g_try_malloc(TARGET_PAGE_SIZE);
+    if (!cd->originbuf) {
+        goto exit;
+    }
+
+    if (deflateInit(&cd->stream, migrate_compress_level()) != Z_OK) {
+        g_free(cd->originbuf);
+        goto exit;
+    }
+
+    cd->file = qemu_fopen_ops(NULL, &empty_ops);
+    return &cd->request;
+
+exit:
+    g_free(cd);
+    return NULL;
+}
+
+static void compress_thread_data_fini(ThreadRequest *request)
+{
+    CompressData *cd = container_of(request, CompressData, request);
+
+    qemu_fclose(cd->file);
+    deflateEnd(&cd->stream);
+    g_free(cd->originbuf);
+    g_free(cd);
+}
+
+static void compress_thread_data_handler(ThreadRequest *request)
+{
+    CompressData *cd = container_of(request, CompressData, request);
+
+    /*
+     * if compression fails, it will be indicated by
+     * migrate_get_current()->to_dst_file.
+     */
+    do_compress_ram_page(cd->file, &cd->stream, cd->block, cd->offset,
+                         cd->originbuf);
+}
+
+static void compress_thread_data_done(ThreadRequest *request)
+{
+    CompressData *cd = container_of(request, CompressData, request);
+    RAMState *rs = ram_state;
+    int bytes_xmit;
+
+    bytes_xmit = qemu_put_qemu_file(rs->f, cd->file);
+    /* 8 means a header with RAM_SAVE_FLAG_CONTINUE */
+    compression_counters.reduced_size += TARGET_PAGE_SIZE - bytes_xmit + 8;
+    compression_counters.pages++;
+    ram_counters.transferred += bytes_xmit;
+}
+
+static Threads *compress_threads;
+
+static void flush_compressed_data(void)
+{
+    if (!migrate_use_compression()) {
+        return;
+    }
+
+    threads_wait_done(compress_threads);
+}
+
+static void compress_threads_save_cleanup(void)
+{
+    if (!compress_threads) {
+        return;
+    }
+
+    threads_destroy(compress_threads);
+    compress_threads = NULL;
+}
+
+static int compress_threads_save_setup(void)
+{
+    if (!migrate_use_compression()) {
+        return 0;
+    }
+
+    compress_threads = threads_create(migrate_compress_threads(),
+                                      "compress",
+                                      compress_thread_data_init,
+                                      compress_thread_data_fini,
+                                      compress_thread_data_handler,
+                                      compress_thread_data_done);
+    return compress_threads ? 0 : -1;
+}
+
 /**
  * xbzrle_cache_zero_page: insert a zero page in the XBZRLE cache
  *
@@ -1268,15 +1265,6 @@ static int save_zero_page(RAMState *rs, RAMBlock *block, ram_addr_t offset)
     return pages;
 }
 
-static void ram_release_pages(const char *rbname, uint64_t offset, int pages)
-{
-    if (!migrate_release_ram() || !migration_in_postcopy()) {
-        return;
-    }
-
-    ram_discard_range(rbname, offset, pages << TARGET_PAGE_BITS);
-}
-
 /*
  * @pages: the number of pages written by the control path,
  *        < 0 - error
@@ -1391,99 +1379,22 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss, bool last_stage)
     return pages;
 }
 
-static int do_compress_ram_page(QEMUFile *f, z_stream *stream, RAMBlock *block,
-                                ram_addr_t offset, uint8_t *source_buf)
-{
-    RAMState *rs = ram_state;
-    int bytes_sent, blen;
-    uint8_t *p = block->host + (offset & TARGET_PAGE_MASK);
-
-    bytes_sent = save_page_header(rs, f, block, offset |
-                                  RAM_SAVE_FLAG_COMPRESS_PAGE);
-
-    /*
-     * copy it to a internal buffer to avoid it being modified by VM
-     * so that we can catch up the error during compression and
-     * decompression
-     */
-    memcpy(source_buf, p, TARGET_PAGE_SIZE);
-    blen = qemu_put_compression_data(f, stream, source_buf, TARGET_PAGE_SIZE);
-    if (blen < 0) {
-        bytes_sent = 0;
-        qemu_file_set_error(migrate_get_current()->to_dst_file, blen);
-        error_report("compressed data failed!");
-    } else {
-        bytes_sent += blen;
-        ram_release_pages(block->idstr, offset & TARGET_PAGE_MASK, 1);
-    }
-
-    return bytes_sent;
-}
-
-static void flush_compressed_data(RAMState *rs)
-{
-    int idx, len, thread_count;
-
-    if (!migrate_use_compression()) {
-        return;
-    }
-    thread_count = migrate_compress_threads();
-
-    qemu_mutex_lock(&comp_done_lock);
-    for (idx = 0; idx < thread_count; idx++) {
-        while (!comp_param[idx].done) {
-            qemu_cond_wait(&comp_done_cond, &comp_done_lock);
-        }
-    }
-    qemu_mutex_unlock(&comp_done_lock);
-
-    for (idx = 0; idx < thread_count; idx++) {
-        qemu_mutex_lock(&comp_param[idx].mutex);
-        if (!comp_param[idx].quit) {
-            len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
-            /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
-            compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;
-            compression_counters.pages++;
-            ram_counters.transferred += len;
-        }
-        qemu_mutex_unlock(&comp_param[idx].mutex);
-    }
-}
-
-static inline void set_compress_params(CompressParam *param, RAMBlock *block,
-                                       ram_addr_t offset)
-{
-    param->block = block;
-    param->offset = offset;
-}
-
 static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
                                            ram_addr_t offset)
 {
-    int idx, thread_count, bytes_xmit = -1, pages = -1;
+    CompressData *cd;
+    ThreadRequest *request = threads_submit_request_prepare(compress_threads);
 
-    thread_count = migrate_compress_threads();
-    qemu_mutex_lock(&comp_done_lock);
-    for (idx = 0; idx < thread_count; idx++) {
-        if (comp_param[idx].done) {
-            comp_param[idx].done = false;
-            bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
-            qemu_mutex_lock(&comp_param[idx].mutex);
-            set_compress_params(&comp_param[idx], block, offset);
-            qemu_cond_signal(&comp_param[idx].cond);
-            qemu_mutex_unlock(&comp_param[idx].mutex);
-            pages = 1;
-            /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
-            compression_counters.reduced_size += TARGET_PAGE_SIZE -
-                                                 bytes_xmit + 8;
-            compression_counters.pages++;
-            ram_counters.transferred += bytes_xmit;
-            break;
-        }
-    }
-    qemu_mutex_unlock(&comp_done_lock);
+    if (!request) {
+        compression_counters.busy++;
+        return -1;
+    }
 
-    return pages;
+    cd = container_of(request, CompressData, request);
+    cd->block = block;
+    cd->offset = offset;
+    threads_submit_request_commit(compress_threads, request);
+    return 1;
 }
 
 /**
@@ -1522,7 +1433,7 @@ static bool find_dirty_block(RAMState *rs, PageSearchStatus *pss, bool *again)
                 /* If xbzrle is on, stop using the data compression at this
                  * point. In theory, xbzrle can do better than compression.
                  */
-                flush_compressed_data(rs);
+                flush_compressed_data();
             }
         }
         /* Didn't find anything this time, but try again on the new block */
@@ -1776,7 +1687,7 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
          * much CPU resource.
          */
         if (block != rs->last_sent_block) {
-            flush_compressed_data(rs);
+            flush_compressed_data();
         } else {
             /*
              * do not detect zero page as it can be handled very well
@@ -1786,7 +1697,6 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
             if (res > 0) {
                 return res;
             }
-            compression_counters.busy++;
         }
     }
 
@@ -1994,7 +1904,7 @@ static void ram_save_cleanup(void *opaque)
     }
 
     xbzrle_cleanup();
-    flush_compressed_data(*rsp);
+    flush_compressed_data();
     compress_threads_save_cleanup();
     ram_state_cleanup(rsp);
 }
@@ -2747,7 +2657,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         }
     }
 
-    flush_compressed_data(rs);
+    flush_compressed_data();
     ram_control_after_iterate(f, RAM_CONTROL_FINISH);
 
     rcu_read_unlock();
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread
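
A note on the data flow above: each CompressData owns a long-lived z_stream
(deflateInit() in compress_thread_data_init(), deflateEnd() in
compress_thread_data_fini()), and do_compress_ram_page() first copies the
guest page into originbuf so the VM cannot modify it while it is being
compressed.  The per-page zlib step a worker performs then reduces to
something like the standalone sketch below; QEMU's real path goes through
qemu_put_compression_data() and a QEMUFile, which are not reproduced here,
so this only illustrates the stream-reuse idea.

#include <stddef.h>
#include <stdint.h>
#include <zlib.h>

/*
 * Compress one page into 'out' with a z_stream that was set up once by
 * deflateInit(); 'out_size' should be at least compressBound(page_size).
 * Returns the compressed length, or 0 on error.
 */
static size_t compress_one_page(z_stream *stream, const uint8_t *page,
                                size_t page_size, uint8_t *out,
                                size_t out_size)
{
    /* reuse the stream instead of paying for deflateInit() per page */
    if (deflateReset(stream) != Z_OK) {
        return 0;
    }

    stream->next_in = (Bytef *)page;
    stream->avail_in = page_size;
    stream->next_out = out;
    stream->avail_out = out_size;

    /* Z_FINISH produces one complete, self-contained compressed stream */
    if (deflate(stream, Z_FINISH) != Z_STREAM_END) {
        return 0;
    }

    return out_size - stream->avail_out;
}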

* [PATCH 12/12] migration: use lockless Multithread model for decompression
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04  9:55   ` guangrong.xiao
  -1 siblings, 0 replies; 156+ messages in thread
From: guangrong.xiao @ 2018-06-04  9:55 UTC (permalink / raw)
  To: pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

From: Xiao Guangrong <xiaoguangrong@tencent.com>

Adapt the decompression code to the lockless multithread model

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 381 ++++++++++++++++++++++++++------------------------------
 1 file changed, 175 insertions(+), 206 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 58ecf5caa0..0a0ef0ee57 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -341,25 +341,9 @@ typedef struct PageSearchStatus PageSearchStatus;
 
 CompressionStats compression_counters;
 
-struct DecompressParam {
-    bool done;
-    bool quit;
-    QemuMutex mutex;
-    QemuCond cond;
-    void *des;
-    uint8_t *compbuf;
-    int len;
-    z_stream stream;
-};
-typedef struct DecompressParam DecompressParam;
-
 static const QEMUFileOps empty_ops = { };
 
 static QEMUFile *decomp_file;
-static DecompressParam *decomp_param;
-static QemuThread *decompress_threads;
-static QemuMutex decomp_done_lock;
-static QemuCond decomp_done_cond;
 
 /* Multiple fd's */
 
@@ -962,6 +946,178 @@ static int compress_threads_save_setup(void)
     return compress_threads ? 0 : -1;
 }
 
+/* return the size after decompression, or a negative value on error */
+static int
+qemu_uncompress_data(z_stream *stream, uint8_t *dest, size_t dest_len,
+                     const uint8_t *source, size_t source_len)
+{
+    int err;
+
+    err = inflateReset(stream);
+    if (err != Z_OK) {
+        return -1;
+    }
+
+    stream->avail_in = source_len;
+    stream->next_in = (uint8_t *)source;
+    stream->avail_out = dest_len;
+    stream->next_out = dest;
+
+    err = inflate(stream, Z_NO_FLUSH);
+    if (err != Z_STREAM_END) {
+        return -1;
+    }
+
+    return stream->total_out;
+}
+
+struct DecompressData {
+    /* filled by the migration thread */
+    void *des;
+    uint8_t *compbuf;
+    size_t len;
+
+    z_stream stream;
+    ThreadRequest request;
+};
+typedef struct DecompressData DecompressData;
+
+static ThreadRequest *decompress_thread_data_init(void)
+{
+    DecompressData *dd = g_new0(DecompressData, 1);
+
+    if (inflateInit(&dd->stream) != Z_OK) {
+        g_free(dd);
+        return NULL;
+    }
+
+    dd->compbuf = g_malloc0(compressBound(TARGET_PAGE_SIZE));
+    return &dd->request;
+}
+
+static void decompress_thread_data_fini(ThreadRequest *request)
+{
+    DecompressData *dd = container_of(request, DecompressData, request);
+
+    inflateEnd(&dd->stream);
+    g_free(dd->compbuf);
+    g_free(dd);
+}
+
+static void decompress_thread_data_handler(ThreadRequest *request)
+{
+    DecompressData *dd = container_of(request, DecompressData, request);
+    unsigned long pagesize = TARGET_PAGE_SIZE;
+    int ret;
+
+    ret = qemu_uncompress_data(&dd->stream, dd->des, pagesize,
+                               dd->compbuf, dd->len);
+    if (ret < 0) {
+        error_report("decompress data failed");
+        qemu_file_set_error(decomp_file, ret);
+    }
+}
+
+static void decompress_thread_data_done(ThreadRequest *data)
+{
+}
+
+struct CompressLoad {
+    Threads *decompress_threads;
+
+    /*
+     * used to decompress data in the migration thread if all
+     * decompress threads are busy.
+     */
+    z_stream stream;
+    uint8_t *compbuf;
+};
+typedef struct CompressLoad CompressLoad;
+
+static CompressLoad compress_load;
+
+static int decompress_init(QEMUFile *f)
+{
+    Threads *threads;
+
+    threads = threads_create(migrate_decompress_threads(), "decompress",
+                             decompress_thread_data_init,
+                             decompress_thread_data_fini,
+                             decompress_thread_data_handler,
+                             decompress_thread_data_done);
+    if (!threads) {
+        return -1;
+    }
+
+    if (inflateInit(&compress_load.stream) != Z_OK) {
+        threads_destroy(threads);
+        return -1;
+    }
+
+    compress_load.decompress_threads = threads;
+    compress_load.compbuf = g_malloc0(compressBound(TARGET_PAGE_SIZE));
+    decomp_file = f;
+    return 0;
+}
+
+static void decompress_fini(void)
+{
+    if (!compress_load.compbuf) {
+        return;
+    }
+
+    threads_destroy(compress_load.decompress_threads);
+    compress_load.decompress_threads = NULL;
+    g_free(compress_load.compbuf);
+    compress_load.compbuf = NULL;
+    inflateEnd(&compress_load.stream);
+    decomp_file = NULL;
+}
+
+static int flush_decompressed_data(void)
+{
+    if (!migrate_use_compression()) {
+        return 0;
+    }
+
+    threads_wait_done(compress_load.decompress_threads);
+    return qemu_file_get_error(decomp_file);
+}
+
+static void decompress_data_with_multi_threads(QEMUFile *f,
+                                               void *host, size_t len)
+{
+    ThreadRequest *request;
+    Threads *threads = compress_load.decompress_threads;
+    unsigned long pagesize = TARGET_PAGE_SIZE;
+    uint8_t *compbuf = compress_load.compbuf;
+    int ret;
+
+    request = threads_submit_request_prepare(threads);
+    if (request) {
+        DecompressData *dd;
+
+        dd = container_of(request, DecompressData, request);
+        dd->des = host;
+        dd->len = len;
+        qemu_get_buffer(f, dd->compbuf, len);
+        threads_submit_request_commit(threads, request);
+        return;
+    }
+
+    /* load data and decompress in the main thread */
+
+    /* this may redirect compbuf to point to an internal buffer */
+    qemu_get_buffer_in_place(f, &compbuf, len);
+
+    ret = qemu_uncompress_data(&compress_load.stream, host, pagesize,
+                               compbuf, len);
+    if (ret < 0) {
+        error_report("decompress data failed");
+        qemu_file_set_error(decomp_file, ret);
+    }
+}
+
 /**
  * xbzrle_cache_zero_page: insert a zero page in the XBZRLE cache
  *
@@ -2794,193 +2950,6 @@ void ram_handle_compressed(void *host, uint8_t ch, uint64_t size)
     }
 }
 
-/* return the size after decompression, or negative value on error */
-static int
-qemu_uncompress_data(z_stream *stream, uint8_t *dest, size_t dest_len,
-                     const uint8_t *source, size_t source_len)
-{
-    int err;
-
-    err = inflateReset(stream);
-    if (err != Z_OK) {
-        return -1;
-    }
-
-    stream->avail_in = source_len;
-    stream->next_in = (uint8_t *)source;
-    stream->avail_out = dest_len;
-    stream->next_out = dest;
-
-    err = inflate(stream, Z_NO_FLUSH);
-    if (err != Z_STREAM_END) {
-        return -1;
-    }
-
-    return stream->total_out;
-}
-
-static void *do_data_decompress(void *opaque)
-{
-    DecompressParam *param = opaque;
-    unsigned long pagesize;
-    uint8_t *des;
-    int len, ret;
-
-    qemu_mutex_lock(&param->mutex);
-    while (!param->quit) {
-        if (param->des) {
-            des = param->des;
-            len = param->len;
-            param->des = 0;
-            qemu_mutex_unlock(&param->mutex);
-
-            pagesize = TARGET_PAGE_SIZE;
-
-            ret = qemu_uncompress_data(&param->stream, des, pagesize,
-                                       param->compbuf, len);
-            if (ret < 0) {
-                error_report("decompress data failed");
-                qemu_file_set_error(decomp_file, ret);
-            }
-
-            qemu_mutex_lock(&decomp_done_lock);
-            param->done = true;
-            qemu_cond_signal(&decomp_done_cond);
-            qemu_mutex_unlock(&decomp_done_lock);
-
-            qemu_mutex_lock(&param->mutex);
-        } else {
-            qemu_cond_wait(&param->cond, &param->mutex);
-        }
-    }
-    qemu_mutex_unlock(&param->mutex);
-
-    return NULL;
-}
-
-static int wait_for_decompress_done(void)
-{
-    int idx, thread_count;
-
-    if (!migrate_use_compression()) {
-        return 0;
-    }
-
-    thread_count = migrate_decompress_threads();
-    qemu_mutex_lock(&decomp_done_lock);
-    for (idx = 0; idx < thread_count; idx++) {
-        while (!decomp_param[idx].done) {
-            qemu_cond_wait(&decomp_done_cond, &decomp_done_lock);
-        }
-    }
-    qemu_mutex_unlock(&decomp_done_lock);
-    return qemu_file_get_error(decomp_file);
-}
-
-static void compress_threads_load_cleanup(void)
-{
-    int i, thread_count;
-
-    if (!migrate_use_compression()) {
-        return;
-    }
-    thread_count = migrate_decompress_threads();
-    for (i = 0; i < thread_count; i++) {
-        /*
-         * we use it as a indicator which shows if the thread is
-         * properly init'd or not
-         */
-        if (!decomp_param[i].compbuf) {
-            break;
-        }
-
-        qemu_mutex_lock(&decomp_param[i].mutex);
-        decomp_param[i].quit = true;
-        qemu_cond_signal(&decomp_param[i].cond);
-        qemu_mutex_unlock(&decomp_param[i].mutex);
-    }
-    for (i = 0; i < thread_count; i++) {
-        if (!decomp_param[i].compbuf) {
-            break;
-        }
-
-        qemu_thread_join(decompress_threads + i);
-        qemu_mutex_destroy(&decomp_param[i].mutex);
-        qemu_cond_destroy(&decomp_param[i].cond);
-        inflateEnd(&decomp_param[i].stream);
-        g_free(decomp_param[i].compbuf);
-        decomp_param[i].compbuf = NULL;
-    }
-    g_free(decompress_threads);
-    g_free(decomp_param);
-    decompress_threads = NULL;
-    decomp_param = NULL;
-    decomp_file = NULL;
-}
-
-static int compress_threads_load_setup(QEMUFile *f)
-{
-    int i, thread_count;
-
-    if (!migrate_use_compression()) {
-        return 0;
-    }
-
-    thread_count = migrate_decompress_threads();
-    decompress_threads = g_new0(QemuThread, thread_count);
-    decomp_param = g_new0(DecompressParam, thread_count);
-    qemu_mutex_init(&decomp_done_lock);
-    qemu_cond_init(&decomp_done_cond);
-    decomp_file = f;
-    for (i = 0; i < thread_count; i++) {
-        if (inflateInit(&decomp_param[i].stream) != Z_OK) {
-            goto exit;
-        }
-
-        decomp_param[i].compbuf = g_malloc0(compressBound(TARGET_PAGE_SIZE));
-        qemu_mutex_init(&decomp_param[i].mutex);
-        qemu_cond_init(&decomp_param[i].cond);
-        decomp_param[i].done = true;
-        decomp_param[i].quit = false;
-        qemu_thread_create(decompress_threads + i, "decompress",
-                           do_data_decompress, decomp_param + i,
-                           QEMU_THREAD_JOINABLE);
-    }
-    return 0;
-exit:
-    compress_threads_load_cleanup();
-    return -1;
-}
-
-static void decompress_data_with_multi_threads(QEMUFile *f,
-                                               void *host, int len)
-{
-    int idx, thread_count;
-
-    thread_count = migrate_decompress_threads();
-    qemu_mutex_lock(&decomp_done_lock);
-    while (true) {
-        for (idx = 0; idx < thread_count; idx++) {
-            if (decomp_param[idx].done) {
-                decomp_param[idx].done = false;
-                qemu_mutex_lock(&decomp_param[idx].mutex);
-                qemu_get_buffer(f, decomp_param[idx].compbuf, len);
-                decomp_param[idx].des = host;
-                decomp_param[idx].len = len;
-                qemu_cond_signal(&decomp_param[idx].cond);
-                qemu_mutex_unlock(&decomp_param[idx].mutex);
-                break;
-            }
-        }
-        if (idx < thread_count) {
-            break;
-        } else {
-            qemu_cond_wait(&decomp_done_cond, &decomp_done_lock);
-        }
-    }
-    qemu_mutex_unlock(&decomp_done_lock);
-}
-
 /**
  * ram_load_setup: Setup RAM for migration incoming side
  *
@@ -2991,7 +2960,7 @@ static void decompress_data_with_multi_threads(QEMUFile *f,
  */
 static int ram_load_setup(QEMUFile *f, void *opaque)
 {
-    if (compress_threads_load_setup(f)) {
+    if (decompress_init(f)) {
         return -1;
     }
 
@@ -3004,7 +2973,7 @@ static int ram_load_cleanup(void *opaque)
 {
     RAMBlock *rb;
     xbzrle_load_cleanup();
-    compress_threads_load_cleanup();
+    decompress_fini();
 
     RAMBLOCK_FOREACH(rb) {
         g_free(rb->receivedmap);
@@ -3346,7 +3315,7 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
         }
     }
 
-    ret |= wait_for_decompress_done();
+    ret |= flush_decompressed_data();
     rcu_read_unlock();
     trace_ram_load_complete(ret, seq_iter);
     return ret;
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* Re: [PATCH 05/12] migration: show the statistics of compression
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-04 22:31     ` Eric Blake
  -1 siblings, 0 replies; 156+ messages in thread
From: Eric Blake @ 2018-06-04 22:31 UTC (permalink / raw)
  To: guangrong.xiao, pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2

On 06/04/2018 04:55 AM, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Then the uses can adjust the parameters based on this info
> 
> Currently, it includes:
> pages: amount of pages compressed and transferred to the target VM
> busy: amount of count that no free thread to compress data
> busy-rate: rate of thread busy
> reduced-size: amount of bytes reduced by compression
> compression-rate: rate of compressed size
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---

> +++ b/qapi/migration.json
> @@ -72,6 +72,26 @@
>              'cache-miss': 'int', 'cache-miss-rate': 'number',
>              'overflow': 'int' } }
>   
> +##
> +# @CompressionStats:
> +#
> +# Detailed compression migration statistics

Sounds better as s/compression migration/migration compression/

> +#
> +# @pages: amount of pages compressed and transferred to the target VM
> +#
> +# @busy: amount of count that no free thread to compress data

Not sure what was meant, maybe:

@busy: count of times that no free thread was available to compress data

> +#
> +# @busy-rate: rate of thread busy

In what unit? pages per second?

> +#
> +# @reduced-size: amount of bytes reduced by compression
> +#
> +# @compression-rate: rate of compressed size

In what unit?

> +#
> +##

Missing a 'Since: 3.0' tag

> +{ 'struct': 'CompressionStats',
> +  'data': {'pages': 'int', 'busy': 'int', 'busy-rate': 'number',
> +	   'reduced-size': 'int', 'compression-rate': 'number' } }
> +
>   ##
>   # @MigrationStatus:
>   #
> @@ -169,6 +189,8 @@
>   #           only present when the postcopy-blocktime migration capability
>   #           is enabled. (Since 2.13)

Pre-existing - we need to fix this 2.13 to be 3.0 (if it isn't already 
fixed)

>   #
> +# @compression: compression migration statistics, only returned if compression
> +#           feature is on and status is 'active' or 'completed' (Since 2.14)

There will not be a 2.14 (for that matter, not even a 2.13).  The next 
release is 3.0.

>   #
>   # Since: 0.14.0
>   ##
> @@ -183,7 +205,8 @@
>              '*cpu-throttle-percentage': 'int',
>              '*error-desc': 'str',
>              '*postcopy-blocktime' : 'uint32',
> -           '*postcopy-vcpu-blocktime': ['uint32']} }
> +           '*postcopy-vcpu-blocktime': ['uint32'],
> +           '*compression': 'CompressionStats'} }
>   
>   ##
>   # @query-migrate:
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 05/12] migration: show the statistics of compression
  2018-06-04 22:31     ` [Qemu-devel] " Eric Blake
@ 2018-06-06 12:44       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-06 12:44 UTC (permalink / raw)
  To: Eric Blake, pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2



On 06/05/2018 06:31 AM, Eric Blake wrote:
> On 06/04/2018 04:55 AM, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Then the uses can adjust the parameters based on this info
>>
>> Currently, it includes:
>> pages: amount of pages compressed and transferred to the target VM
>> busy: amount of count that no free thread to compress data
>> busy-rate: rate of thread busy
>> reduced-size: amount of bytes reduced by compression
>> compression-rate: rate of compressed size
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
>> ---
> 
>> +++ b/qapi/migration.json
>> @@ -72,6 +72,26 @@
>>              'cache-miss': 'int', 'cache-miss-rate': 'number',
>>              'overflow': 'int' } }
>> +##
>> +# @CompressionStats:
>> +#
>> +# Detailed compression migration statistics
> 
> Sounds better as s/compression migration/migration compression/

Indeed.

> 
>> +#
>> +# @pages: amount of pages compressed and transferred to the target VM
>> +#
>> +# @busy: amount of count that no free thread to compress data
> 
> Not sure what was meant, maybe:
> 
> @busy: count of times that no free thread was available to compress data
> 

Yup, that's better.

>> +#
>> +# @busy-rate: rate of thread busy
> 
> In what unit? pages per second?

It's a ratio, calculated as:
    pages-directly-posted-out-without-compression / total-pages-posted-out

> 
>> +#
>> +# @reduced-size: amount of bytes reduced by compression
>> +#
>> +# @compression-rate: rate of compressed size
> 
> In what unit?
> 

It's a ratio as well, calculated as:
    size-posted-out-after-compression / (compressed-pages * page_size)
where the denominator is the raw size of those pages before compression.
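
Just to illustrate, a minimal sketch of how the two rates above could be
computed (the variable names are mine for illustration only, they are not
the actual counters used in the patchset):

    /* fraction of pages posted out uncompressed because every
     * compression thread was busy */
    double busy_rate = (double)pages_directly_posted_without_compression /
                       (double)total_pages_posted_out;

    /* compressed output size over the raw size of the pages that
     * actually went through compression */
    double compression_rate = (double)size_posted_out_after_compression /
                              ((double)compressed_pages * page_size);

For example, with made-up numbers: if 100 of 1000 pages were posted out
directly, busy-rate is 0.1; if 900 compressed 4KiB pages (3686400 raw
bytes) were posted out as 1228800 bytes, compression-rate is about 0.33.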

>> +#
>> +##
> 
> Missing a 'Since: 3.0' tag
> 

Wow, directly upgrade to 3.0, big step. :-)
Will add this tag in the next version.

>> +{ 'struct': 'CompressionStats',
>> +  'data': {'pages': 'int', 'busy': 'int', 'busy-rate': 'number',
>> +       'reduced-size': 'int', 'compression-rate': 'number' } }
>> +
>>   ##
>>   # @MigrationStatus:
>>   #
>> @@ -169,6 +189,8 @@
>>   #           only present when the postcopy-blocktime migration capability
>>   #           is enabled. (Since 2.13)
> 
> Pre-existing - we need to fix this 2.13 to be 3.0 (if it isn't already fixed)

I should re-sync the repo before making patches next time.

> 
>>   #
>> +# @compression: compression migration statistics, only returned if compression
>> +#           feature is on and status is 'active' or 'completed' (Since 2.14)
> 
> There will not be a 2.14 (for that matter, not even a 2.13).  The next release is 3.0.

Okay, will fix.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 01/12] migration: do not wait if no free thread
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-11  7:39     ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-06-11  7:39 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini

On Mon, Jun 04, 2018 at 05:55:09PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Instead of putting the main thread to sleep state to wait for
> free compression thread, we can directly post it out as normal
> page that reduces the latency and uses CPUs more efficiently

The feature looks good, though I'm not sure whether we should make a
capability flag for this feature since otherwise it'll be hard to
switch back to the old full-compression way no matter for what
reason.  Would that be a problem?

> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  migration/ram.c | 34 +++++++++++++++-------------------
>  1 file changed, 15 insertions(+), 19 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 5bcbf7a9f9..0caf32ab0a 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1423,25 +1423,18 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
>  
>      thread_count = migrate_compress_threads();
>      qemu_mutex_lock(&comp_done_lock);

Can we drop this lock in this case?

> -    while (true) {
> -        for (idx = 0; idx < thread_count; idx++) {
> -            if (comp_param[idx].done) {
> -                comp_param[idx].done = false;
> -                bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
> -                qemu_mutex_lock(&comp_param[idx].mutex);
> -                set_compress_params(&comp_param[idx], block, offset);
> -                qemu_cond_signal(&comp_param[idx].cond);
> -                qemu_mutex_unlock(&comp_param[idx].mutex);
> -                pages = 1;
> -                ram_counters.normal++;
> -                ram_counters.transferred += bytes_xmit;
> -                break;
> -            }
> -        }
> -        if (pages > 0) {
> +    for (idx = 0; idx < thread_count; idx++) {
> +        if (comp_param[idx].done) {
> +            comp_param[idx].done = false;
> +            bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
> +            qemu_mutex_lock(&comp_param[idx].mutex);
> +            set_compress_params(&comp_param[idx], block, offset);
> +            qemu_cond_signal(&comp_param[idx].cond);
> +            qemu_mutex_unlock(&comp_param[idx].mutex);
> +            pages = 1;
> +            ram_counters.normal++;
> +            ram_counters.transferred += bytes_xmit;
>              break;
> -        } else {
> -            qemu_cond_wait(&comp_done_cond, &comp_done_lock);
>          }
>      }
>      qemu_mutex_unlock(&comp_done_lock);

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 00/12] migration: improve multithreads for compression and decompression
  2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
@ 2018-06-11  8:00   ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-06-11  8:00 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini

On Mon, Jun 04, 2018 at 05:55:08PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Background
> ----------
> Current implementation of compression and decompression are very
> hard to be enabled on productions. We noticed that too many wait-wakes
> go to kernel space and CPU usages are very low even if the system
> is really free
> 
> The reasons are:
> 1) there are two many locks used to do synchronous,there
>   is a global lock and each single thread has its own lock,
>   migration thread and work threads need to go to sleep if
>   these locks are busy
> 
> 2) migration thread separately submits request to the thread
>    however, only one request can be pended, that means, the
>    thread has to go to sleep after finishing the request
> 
> Our Ideas
> ---------
> To make it work better, we introduce a new multithread model,
> the user, currently it is the migration thread, submits request
> to each thread with round-robin manner, the thread has its own
> ring whose capacity is 4 and puts the result to a global ring
> which is lockless for multiple producers, the user fetches result
> out from the global ring and do remaining operations for the
> request, e.g, posting the compressed data out for migration on
> the source QEMU
> 
> Other works in this patchset is offering some statistics to see
> if compression works as we expected and making the migration thread
> work fast so it can feed more requests to the threads

Hi, Guangrong,

I'm not sure whether my understanding is correct, but AFAIU the old
code has a major defect in that it depends too much on the big lock.  The
critical section of the small lock seems to be always very short, and
it is also per-thread.  However we use the big lock in lots of
places: flushing compressed data, queueing every page, or sending the
notifications in the compression thread.

I haven't yet read the whole work; it seems to be quite nice
according to your test results.  However, have you thought about
first removing the big lock without touching much of the rest of the
code, and then continuing to improve it?  Or have you ever tried to do so?
I don't think you need to do extra work for this, but I would
appreciate it if you have existing test results to share.

In other words, would it be nicer to separate the work into two
pieces?

- one to refactor the existing locks, to see what we can gain by
  simplifying the locks to a minimum.  AFAIU the locking used now is still
  not ideal; my thinking is that _maybe_ we can start by removing the
  big lock, and use a semaphore or something to replace the "done"
  notification while still keeping the small lock?  Even some busy
  looping?

- one to introduce the lockless ring buffer, to demonstrate how the
  lockless data structure helps compared to the locking approach

Then we can know which item contributed how much to the performance
numbers.  After all the new ring and thread model seems to be a big
chunk of work (sorry I haven't read them yet, but I will).

What do you think?

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 01/12] migration: do not wait if no free thread
  2018-06-11  7:39     ` [Qemu-devel] " Peter Xu
@ 2018-06-12  2:42       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-12  2:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini



On 06/11/2018 03:39 PM, Peter Xu wrote:
> On Mon, Jun 04, 2018 at 05:55:09PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Instead of putting the main thread to sleep state to wait for
>> free compression thread, we can directly post it out as normal
>> page that reduces the latency and uses CPUs more efficiently
> 
> The feature looks good, though I'm not sure whether we should make a
> capability flag for this feature since otherwise it'll be hard to
> switch back to the old full-compression way no matter for what
> reason.  Would that be a problem?
> 

We expect this optimization to be a win in all cases; in particular, we
introduced the compression statistics so that the user can adjust the
parameters based on them if anything works worse.

Furthermore, if it hurts any case we really need to improve the
optimization itself rather than leave an option to the user. :)

>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
>> ---
>>   migration/ram.c | 34 +++++++++++++++-------------------
>>   1 file changed, 15 insertions(+), 19 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 5bcbf7a9f9..0caf32ab0a 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1423,25 +1423,18 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
>>   
>>       thread_count = migrate_compress_threads();
>>       qemu_mutex_lock(&comp_done_lock);
> 
> Can we drop this lock in this case?

The lock is used to protect comp_param[].done...

Well, we could possibly remove it if we redesigned the implementation, e.g., used atomic
access for comp_param.done; however, I believe it still would not work efficiently. Please see
more in my later reply to your comments on the cover letter.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 01/12] migration: do not wait if no free thread
  2018-06-12  2:42       ` [Qemu-devel] " Xiao Guangrong
@ 2018-06-12  3:15         ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-06-12  3:15 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini

On Tue, Jun 12, 2018 at 10:42:25AM +0800, Xiao Guangrong wrote:
> 
> 
> On 06/11/2018 03:39 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:09PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Instead of putting the main thread to sleep state to wait for
> > > free compression thread, we can directly post it out as normal
> > > page that reduces the latency and uses CPUs more efficiently
> > 
> > The feature looks good, though I'm not sure whether we should make a
> > capability flag for this feature since otherwise it'll be hard to
> > switch back to the old full-compression way no matter for what
> > reason.  Would that be a problem?
> > 
> 
> We expect this optimization to be a win in all cases; in particular, we
> introduced the compression statistics so that the user can adjust the
> parameters based on them if anything works worse.

Ah, that'll be good.

> 
> Furthermore, if it hurts any case we really need to improve the
> optimization itself rather than leave an option to the user. :)

Yeah, even if we make it a parameter/capability we can still turn that
on by default in new versions but keep the old behavior in old
versions. :) The major difference is that we would still _have_ a
way to compress every page. I'm just thinking that if we don't have a
switch for it and someone wants to measure e.g. how a new
compression algorithm could help VM migration, then he/she won't be
able to do that anymore, since the numbers will be meaningless when we
cannot control which pages get compressed.

Though I don't know how much use it'll bring...  But if that won't be
too hard, it still seems good.  Not a strong opinion.

> 
> > > 
> > > Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > ---
> > >   migration/ram.c | 34 +++++++++++++++-------------------
> > >   1 file changed, 15 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index 5bcbf7a9f9..0caf32ab0a 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -1423,25 +1423,18 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
> > >       thread_count = migrate_compress_threads();
> > >       qemu_mutex_lock(&comp_done_lock);
> > 
> > Can we drop this lock in this case?
> 
> The lock is used to protect comp_param[].done...

IMHO it's okay?

It's used in this way:

  if (done) {
    done = false;
  }

So it only switches done from true->false.

And the compression thread is the only one that does the other switch
(false->true).  IMHO this special case allows going lock-free: as long
as "done" is true here, the current thread is the only one that will
modify it, so there is no race at all.
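
For illustration only, a minimal sketch of that idea with C11 atomics
(the struct and field names are simplified stand-ins, not the actual
QEMU code; memory ordering is shown explicitly):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct CompParam {
        /* set true by the compression thread when a request is finished,
         * set false by the migration thread when it claims the result */
        atomic_bool done;
        /* ... request/result fields ... */
    } CompParam;

    /* compression thread: publish completion of the current request */
    static void publish_done(CompParam *p)
    {
        atomic_store_explicit(&p->done, true, memory_order_release);
    }

    /* migration thread: claim a finished worker without taking a lock */
    static bool try_claim_worker(CompParam *p)
    {
        /* acquire pairs with the release store above, so everything the
         * worker wrote before setting 'done' is visible here */
        if (atomic_load_explicit(&p->done, memory_order_acquire)) {
            atomic_store_explicit(&p->done, false, memory_order_relaxed);
            return true;
        }
        return false;
    }

The nice property is that each side only ever flips the flag in one
direction, so the consumer's check-then-clear above does not even need a
compare-and-swap.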

> 
> Well, we could possibly remove it if we redesigned the implementation, e.g., used atomic
> access for comp_param.done; however, I believe it still would not work efficiently. Please see
> more in my later reply to your comments on the cover letter.

I'll read that when it arrives, though I haven't received a reply yet.
Did you miss clicking the "send" button? ;)

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 00/12] migration: improve multithreads for compression and decompression
  2018-06-11  8:00   ` [Qemu-devel] " Peter Xu
@ 2018-06-12  3:19     ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-12  3:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini



On 06/11/2018 04:00 PM, Peter Xu wrote:
> On Mon, Jun 04, 2018 at 05:55:08PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Background
>> ----------
>> Current implementation of compression and decompression are very
>> hard to be enabled on productions. We noticed that too many wait-wakes
>> go to kernel space and CPU usages are very low even if the system
>> is really free
>>
>> The reasons are:
>> 1) there are two many locks used to do synchronous,there
>>   is a global lock and each single thread has its own lock,
>>   migration thread and work threads need to go to sleep if
>>   these locks are busy
>>
>> 2) migration thread separately submits request to the thread
>>     however, only one request can be pended, that means, the
>>     thread has to go to sleep after finishing the request
>>
>> Our Ideas
>> ---------
>> To make it work better, we introduce a new multithread model,
>> the user, currently it is the migration thread, submits request
>> to each thread with round-robin manner, the thread has its own
>> ring whose capacity is 4 and puts the result to a global ring
>> which is lockless for multiple producers, the user fetches result
>> out from the global ring and do remaining operations for the
>> request, e.g, posting the compressed data out for migration on
>> the source QEMU
>>
>> Other works in this patchset is offering some statistics to see
>> if compression works as we expected and making the migration thread
>> work fast so it can feed more requests to the threads
> 
> Hi, Guangrong,
> 
> I'm not sure whether my understanding is correct, but AFAIU the old
> code has a major defect in that it depends too much on the big lock.  The
> critical section of the small lock seems to be always very short, and
> it is also per-thread.  However we use the big lock in lots of
> places: flushing compressed data, queueing every page, or sending the
> notifications in the compression thread.
> 

The lock is one issue; another issue is that the thread has
to go to sleep after finishing one request, and the main thread (the live
migration thread) needs to go to kernel space to wake the thread up
for every single request.

We also observed that linearly scanning the threads one by one to
see which is free is not cache-friendly...

> I haven't yet read the whole work; it seems to be quite nice
> according to your test results.  However, have you thought about
> first removing the big lock without touching much of the rest of the
> code, and then continuing to improve it?  Or have you ever tried to do so?
> I don't think you need to do extra work for this, but I would
> appreciate it if you have existing test results to share.
> 

If you really want the performance results, I will try it...

Actually, the first version we used in production was a lockless
multi-thread model (only one atomic operation is needed for both producer
and consumer), but only one request could be fed to each thread. It's
comparable to your suggestion (and should be far faster than it).

We observed that the shortcoming of this solution is that too many waits
and wakeups trap into the kernel, so the CPUs are idle and the bandwidth
is low.

> In other words, would it be nicer to separate the work into two
> pieces?
> 
> - one to refactor the existing locks, to see what we can gain by
>    simplifying the locks to a minimum.  AFAIU the locking used now is still
>    not ideal; my thinking is that _maybe_ we can start by removing the
>    big lock, and use a semaphore or something to replace the "done"
>    notification while still keeping the small lock?  Even some busy
>    looping?
> 

Note: no lock is used after this patchset...

> - one to introduce the lockless ring buffer, to demonstrate how the
>    lockless data structure helps compared to the locking approach
> 
> Then we can know which item contributed how much to the performance
> numbers.  After all the new ring and thread model seems to be a big
> chunk of work (sorry I haven't read them yet, but I will).

It is really a huge burden to refactor the old code and then later remove
it completely.

We redesigned the data structures and the algorithm completely and
abstracted the model to clean up the code used for compression and
decompression; it's not easy to modify the old code part by part... :(

But... if it is really needed, I will try to figure out a way
to address your suggestion. :)

Thanks!

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 00/12] migration: improve multithreads for compression and decompression
  2018-06-12  3:19     ` [Qemu-devel] " Xiao Guangrong
@ 2018-06-12  5:36       ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-06-12  5:36 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini

On Tue, Jun 12, 2018 at 11:19:14AM +0800, Xiao Guangrong wrote:
> 
> 
> On 06/11/2018 04:00 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:08PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Background
> > > ----------
> > > Current implementation of compression and decompression are very
> > > hard to be enabled on productions. We noticed that too many wait-wakes
> > > go to kernel space and CPU usages are very low even if the system
> > > is really free
> > > 
> > > The reasons are:
> > > 1) there are two many locks used to do synchronous,there
> > >   is a global lock and each single thread has its own lock,
> > >   migration thread and work threads need to go to sleep if
> > >   these locks are busy
> > > 
> > > 2) migration thread separately submits request to the thread
> > >     however, only one request can be pended, that means, the
> > >     thread has to go to sleep after finishing the request
> > > 
> > > Our Ideas
> > > ---------
> > > To make it work better, we introduce a new multithread model,
> > > the user, currently it is the migration thread, submits request
> > > to each thread with round-robin manner, the thread has its own
> > > ring whose capacity is 4 and puts the result to a global ring
> > > which is lockless for multiple producers, the user fetches result
> > > out from the global ring and do remaining operations for the
> > > request, e.g, posting the compressed data out for migration on
> > > the source QEMU
> > > 
> > > Other works in this patchset is offering some statistics to see
> > > if compression works as we expected and making the migration thread
> > > work fast so it can feed more requests to the threads
> > 
> > Hi, Guangrong,
> > 
> > I'm not sure whether my understanding is correct, but AFAIU the old
> > code has a major defect that it depends too much on the big lock.  The
> > critial section of the small lock seems to be very short always, and
> > also that's per-thread.  However we use the big lock in lots of
> > places: flush compress data, queue every page, or send the notifies in
> > the compression thread.
> > 
> 
> The lock is one issue, however, another issue is that, the thread has
> to go to sleep after finishing one request and the main thread (live
> migration thread) needs to go to kernel space and wake the thread up
> for every single request.
> 
> And we also observed that linearly scan the threads one by one to
> see which is free is not cache-friendly...

I don't quite understand how this can be fixed from a cache point of view,
but I'll read the series first before asking further.

> 
> > I haven't yet read the whole work, this work seems to be quite nice
> > according to your test results.  However have you thought about
> > firstly remove the big lock without touching much of other part of the
> > code, then continue to improve it?  Or have you ever tried to do so?
> > I don't think you need to do extra work for this, but I would
> > appreciate if you have existing test results to share.
> > 
> 
> If you really want the performance result, I will try it...

Then that'll be enough for me.  Please only provide the performance
numbers if there are more people asking for that.  Otherwise please
feel free to put that aside.

> 
> Actually, the first version we used on our production systems was a
> lockless multi-thread model (only one atomic operation is needed for
> both the producer and the consumer), but only one request could be fed
> to each thread at a time. It's comparable to your suggestion (and should
> be far faster than it).
> 
> The shortcoming we observed with this solution is that too many waits
> and wakeups trap into the kernel, so the CPUs sit idle and bandwidth is
> low.

Okay.

> 
> > In other words, would it be nicer to separate the work into two
> > pieces?
> > 
> > - one to refactor the existing locks, to see what we can gain by
> >    simplify the locks to minimum.  AFAIU now the locking used is still
> >    not ideal, my thinking is that _maybe_ we can start by removing the
> >    big lock, and use a semaphore or something to replace the "done"
> >    notification while still keep the small lock?  Even some busy
> >    looping?
> > 
> 
> Note: no lock is used after this patchset...
> 
> > - one to introduce the lockless ring buffer, to demostrate how the
> >    lockless data structure helps comparing to the locking ways
> > 
> > Then we can know which item contributed how much to the performance
> > numbers.  After all the new ring and thread model seems to be a big
> > chunk of work (sorry I haven't read them yet, but I will).
> 
> It is really a huge burden to refactor the old code first and then
> completely remove it later.
> 
> We completely redesigned the data structures and algorithms and abstracted
> the model to clean up the code used for compression and decompression; it's
> not easy to modify the old code part by part... :(

Yeah; my suggestion above is based on the possibility that removing
the big lock won't be that hard (I _feel_ like it could be <100 LOC, but
I can't tell).  If it proves to be very hard by itself, then I'm totally
fine without it.

> 
> But... if you think it is really needed, I will try to figure out a way
> to address your suggestion. :)

Not necessary.  I won't spend your time on my solo question.  :)

My point is not strong, it's just how I think about it - generally I
would prefer incremental changes along the way, especially when the first
step seems obvious (for this, again, I would start by operating on the
big lock).  But this is of course not a reason to reject your work.

I'll study your whole series soon; meanwhile let's see how other
people think.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 01/12] migration: do not wait if no free thread
  2018-06-12  3:15         ` [Qemu-devel] " Peter Xu
@ 2018-06-13 15:43           ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-06-13 15:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, wei.w.wang,
	Xiao Guangrong, jiang.biao2, pbonzini

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Jun 12, 2018 at 10:42:25AM +0800, Xiao Guangrong wrote:
> > 
> > 
> > On 06/11/2018 03:39 PM, Peter Xu wrote:
> > > On Mon, Jun 04, 2018 at 05:55:09PM +0800, guangrong.xiao@gmail.com wrote:
> > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > 
> > > > Instead of putting the main thread to sleep state to wait for
> > > > free compression thread, we can directly post it out as normal
> > > > page that reduces the latency and uses CPUs more efficiently
> > > 
> > > The feature looks good, though I'm not sure whether we should make a
> > > capability flag for this feature since otherwise it'll be hard to
> > > switch back to the old full-compression way no matter for what
> > > reason.  Would that be a problem?
> > > 
> > 
> > We assume this optimization should always be optimistic for all cases,
> > particularly, we introduced the statistics of compression, then the user
> > should adjust its parameters based on those statistics if anything works
> > worse.
> 
> Ah, that'll be good.
> 
> > 
> > Furthermore, we really need to improve this optimization if it hurts
> > any case rather than leaving a option to the user. :)
> 
> Yeah, even if we make it a parameter/capability we can still turn that
> on by default in new versions but keep the old behavior in old
> versions. :) The major difference is that, then we can still _have_ a
> way to compress every page. I'm just thinking if we don't have a
> switch for that then if someone wants to measure e.g.  how a new
> compression algo could help VM migration, then he/she won't be
> possible to do that again since the numbers will be meaningless if
> that bit is out of control on which page will be compressed.
> 
> Though I don't know how much use it'll bring...  But if that won't be
> too hard, it still seems good.  Not a strong opinion.

I think that is needed; it might be that some users have really awful
networking and genuinely need the compression.  I'd expect that people
who turn on compression really expect the slowdown because they need it
for their network, so changing that behaviour is a bit odd.

Dave
> > 
> > > > 
> > > > Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > ---
> > > >   migration/ram.c | 34 +++++++++++++++-------------------
> > > >   1 file changed, 15 insertions(+), 19 deletions(-)
> > > > 
> > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > index 5bcbf7a9f9..0caf32ab0a 100644
> > > > --- a/migration/ram.c
> > > > +++ b/migration/ram.c
> > > > @@ -1423,25 +1423,18 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
> > > >       thread_count = migrate_compress_threads();
> > > >       qemu_mutex_lock(&comp_done_lock);
> > > 
> > > Can we drop this lock in this case?
> > 
> > The lock is used to protect comp_param[].done...
> 
> IMHO it's okay?
> 
> It's used in this way:
> 
>   if (done) {
>     done = false;
>   }
> 
> So it only switches done from true->false.
> 
> And the compression thread is the only one that did the other switch
> (false->true).  IMHO this special case will allow no-lock since as
> long as "done" is true here then current thread will be the only one
> to modify it, then no race at all.
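
A minimal sketch of that single-writer-in-each-direction pattern, written with
C11 atomics rather than the actual QEMU helpers (the names are illustrative,
not from the patchset):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_bool done;
        /* ... request/result fields ... */
    } Param;

    /* compression thread: publish a finished request (false -> true) */
    static void publish_done(Param *p)
    {
        /* release: make the result written above visible first */
        atomic_store_explicit(&p->done, true, memory_order_release);
    }

    /* migration thread: claim a free worker (true -> false) */
    static bool try_claim(Param *p)
    {
        if (atomic_load_explicit(&p->done, memory_order_acquire)) {
            /* only this thread ever flips true -> false, so no race */
            atomic_store_explicit(&p->done, false, memory_order_relaxed);
            return true;
        }
        return false;
    }

Since each direction of the flip has exactly one writer, the flag itself needs
no lock; the acquire/release pair only keeps the data written before
"done = true" visible to the thread that claims it.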
> 
> > 
> > Well, we are able to possibly remove it if we redesign the implementation, e.g, use atomic
> > access for comp_param.done, however, it still can not work efficiently i believe. Please see
> > more in the later reply to your comments in the cover-letter.
> 
> Will read that after it arrives; though I didn't receive a reply.
> Have you missed clicking the "send" button? ;)
> 
> Regards,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 02/12] migration: fix counting normal page for compression
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-13 15:51     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-06-13 15:51 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini

* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> The compressed page is not normal page

Is this the right reason?
I think we always increment some counter for a page - so
what gets incremented for a compressed page?
Is the real answer that we do:

  ram_save_target_page
     control_save_page
     compress_page_with_multi_thread

and control_save_page already increments the counter?

Dave

> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  migration/ram.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 0caf32ab0a..dbf24d8c87 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1432,7 +1432,6 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
>              qemu_cond_signal(&comp_param[idx].cond);
>              qemu_mutex_unlock(&comp_param[idx].mutex);
>              pages = 1;
> -            ram_counters.normal++;

>              ram_counters.transferred += bytes_xmit;
>              break;
>          }
> -- 
> 2.14.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 03/12] migration: fix counting xbzrle cache_miss_rate
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-13 16:09     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-06-13 16:09 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini

* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Sync up xbzrle_cache_miss_prev only after migration iteration goes
> forward
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>

Yeh, I think you're right.


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/ram.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index dbf24d8c87..dd1283dd45 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1189,9 +1189,9 @@ static void migration_bitmap_sync(RAMState *rs)
>                     (double)(xbzrle_counters.cache_miss -
>                              rs->xbzrle_cache_miss_prev) /
>                     (rs->iterations - rs->iterations_prev);
> +                rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
>              }
>              rs->iterations_prev = rs->iterations;
> -            rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
>          }
>  
>          /* reset period counters */
> -- 
> 2.14.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 04/12] migration: introduce migration_update_rates
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-13 16:17     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-06-13 16:17 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini

* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> It is used to slightly clean the code up, no logic is changed

Actually, there is a slight change: iterations_prev is now always updated,
whereas previously it was only updated with xbzrle on; still, the change
makes more sense.


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  migration/ram.c | 35 ++++++++++++++++++++++-------------
>  1 file changed, 22 insertions(+), 13 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index dd1283dd45..ee03b28435 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1130,6 +1130,25 @@ uint64_t ram_pagesize_summary(void)
>      return summary;
>  }
>  
> +static void migration_update_rates(RAMState *rs, int64_t end_time)
> +{
> +    uint64_t iter_count = rs->iterations - rs->iterations_prev;
> +
> +    /* calculate period counters */
> +    ram_counters.dirty_pages_rate = rs->num_dirty_pages_period * 1000
> +                / (end_time - rs->time_last_bitmap_sync);
> +
> +    if (!iter_count) {
> +        return;
> +    }
> +
> +    if (migrate_use_xbzrle()) {
> +        xbzrle_counters.cache_miss_rate = (double)(xbzrle_counters.cache_miss -
> +            rs->xbzrle_cache_miss_prev) / iter_count;
> +        rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
> +    }
> +}
> +
>  static void migration_bitmap_sync(RAMState *rs)
>  {
>      RAMBlock *block;
> @@ -1159,9 +1178,6 @@ static void migration_bitmap_sync(RAMState *rs)
>  
>      /* more than 1 second = 1000 millisecons */
>      if (end_time > rs->time_last_bitmap_sync + 1000) {
> -        /* calculate period counters */
> -        ram_counters.dirty_pages_rate = rs->num_dirty_pages_period * 1000
> -            / (end_time - rs->time_last_bitmap_sync);
>          bytes_xfer_now = ram_counters.transferred;
>  
>          /* During block migration the auto-converge logic incorrectly detects
> @@ -1183,16 +1199,9 @@ static void migration_bitmap_sync(RAMState *rs)
>              }
>          }
>  
> -        if (migrate_use_xbzrle()) {
> -            if (rs->iterations_prev != rs->iterations) {
> -                xbzrle_counters.cache_miss_rate =
> -                   (double)(xbzrle_counters.cache_miss -
> -                            rs->xbzrle_cache_miss_prev) /
> -                   (rs->iterations - rs->iterations_prev);
> -                rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
> -            }
> -            rs->iterations_prev = rs->iterations;
> -        }
> +        migration_update_rates(rs, end_time);
> +
> +        rs->iterations_prev = rs->iterations;
>  
>          /* reset period counters */
>          rs->time_last_bitmap_sync = end_time;
> -- 
> 2.14.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 05/12] migration: show the statistics of compression
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-13 16:25     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-06-13 16:25 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini

* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Then the uses can adjust the parameters based on this info
> 
> Currently, it includes:
> pages: amount of pages compressed and transferred to the target VM
> busy: amount of count that no free thread to compress data
> busy-rate: rate of thread busy
> reduced-size: amount of bytes reduced by compression
> compression-rate: rate of compressed size
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  hmp.c                 | 13 +++++++++++++
>  migration/migration.c | 11 +++++++++++
>  migration/ram.c       | 37 +++++++++++++++++++++++++++++++++++++
>  migration/ram.h       |  1 +
>  qapi/migration.json   | 25 ++++++++++++++++++++++++-
>  5 files changed, 86 insertions(+), 1 deletion(-)
> 
> diff --git a/hmp.c b/hmp.c
> index ef93f4878b..5c2d3bd318 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -269,6 +269,19 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>                         info->xbzrle_cache->overflow);
>      }
>  
> +    if (info->has_compression) {
> +        monitor_printf(mon, "compression pages: %" PRIu64 " pages\n",
> +                       info->compression->pages);
> +        monitor_printf(mon, "compression busy: %" PRIu64 "\n",
> +                       info->compression->busy);
> +        monitor_printf(mon, "compression busy rate: %0.2f\n",
> +                       info->compression->busy_rate);
> +        monitor_printf(mon, "compression reduced size: %" PRIu64 "\n",
> +                       info->compression->reduced_size);
> +        monitor_printf(mon, "compression rate: %0.2f\n",
> +                       info->compression->compression_rate);
> +    }
> +
>      if (info->has_cpu_throttle_percentage) {
>          monitor_printf(mon, "cpu throttle percentage: %" PRIu64 "\n",
>                         info->cpu_throttle_percentage);
> diff --git a/migration/migration.c b/migration/migration.c
> index 05aec2c905..bf7c63a5a2 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -693,6 +693,17 @@ static void populate_ram_info(MigrationInfo *info, MigrationState *s)
>          info->xbzrle_cache->overflow = xbzrle_counters.overflow;
>      }
>  
> +    if (migrate_use_compression()) {
> +        info->has_compression = true;
> +        info->compression = g_malloc0(sizeof(*info->compression));
> +        info->compression->pages = compression_counters.pages;
> +        info->compression->busy = compression_counters.busy;
> +        info->compression->busy_rate = compression_counters.busy_rate;
> +        info->compression->reduced_size = compression_counters.reduced_size;
> +        info->compression->compression_rate =
> +                                    compression_counters.compression_rate;
> +    }
> +
>      if (cpu_throttle_active()) {
>          info->has_cpu_throttle_percentage = true;
>          info->cpu_throttle_percentage = cpu_throttle_get_percentage();
> diff --git a/migration/ram.c b/migration/ram.c
> index ee03b28435..80914b747e 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -292,6 +292,15 @@ struct RAMState {
>      uint64_t num_dirty_pages_period;
>      /* xbzrle misses since the beginning of the period */
>      uint64_t xbzrle_cache_miss_prev;
> +
> +    /* compression statistics since the beginning of the period */
> +    /* amount of count that no free thread to compress data */
> +    uint64_t compress_thread_busy_prev;
> +    /* amount bytes reduced by compression */
> +    uint64_t compress_reduced_size_prev;
> +    /* amount of compressed pages */
> +    uint64_t compress_pages_prev;
> +
>      /* number of iterations at the beginning of period */
>      uint64_t iterations_prev;
>      /* Iterations since start */
> @@ -329,6 +338,8 @@ struct PageSearchStatus {
>  };
>  typedef struct PageSearchStatus PageSearchStatus;
>  
> +CompressionStats compression_counters;
> +
>  struct CompressParam {
>      bool done;
>      bool quit;
> @@ -1147,6 +1158,24 @@ static void migration_update_rates(RAMState *rs, int64_t end_time)
>              rs->xbzrle_cache_miss_prev) / iter_count;
>          rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
>      }
> +
> +    if (migrate_use_compression()) {
> +        uint64_t comp_pages;
> +
> +        compression_counters.busy_rate = (double)(compression_counters.busy -
> +            rs->compress_thread_busy_prev) / iter_count;
> +        rs->compress_thread_busy_prev = compression_counters.busy;
> +
> +        comp_pages = compression_counters.pages - rs->compress_pages_prev;
> +        if (comp_pages) {
> +            compression_counters.compression_rate =
> +                (double)(compression_counters.reduced_size -
> +                rs->compress_reduced_size_prev) /
> +                (comp_pages * TARGET_PAGE_SIZE);
> +            rs->compress_pages_prev = compression_counters.pages;
> +            rs->compress_reduced_size_prev = compression_counters.reduced_size;
> +        }
> +    }
>  }
>  
>  static void migration_bitmap_sync(RAMState *rs)
> @@ -1412,6 +1441,9 @@ static void flush_compressed_data(RAMState *rs)
>          qemu_mutex_lock(&comp_param[idx].mutex);
>          if (!comp_param[idx].quit) {
>              len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
> +            /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
> +            compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;

I think I'd rather save just len+8 rather than the subtraction.

I think other than that, and Eric's comments, it's OK.

Dave
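
One way to follow that suggestion (a sketch only; compressed_size and
compress_size_prev are assumed names, not fields from this series) is to
accumulate the bytes that actually go onto the wire and derive the saving
only when the rate is reported:

    /* per compressed page, instead of the subtraction */
    compression_counters.compressed_size += len + 8;

    /* in migration_update_rates(), derive the saving on demand */
    uint64_t comp_bytes = compression_counters.compressed_size -
                          rs->compress_size_prev;
    compression_counters.compression_rate =
        (double)(comp_pages * TARGET_PAGE_SIZE - comp_bytes) /
        (comp_pages * TARGET_PAGE_SIZE);
    rs->compress_size_prev = compression_counters.compressed_size;

Whether the 8-byte header is charged to the compressed side is a detail for
the real patch; the point is that the counter only ever adds what was
transferred.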

> +            compression_counters.pages++;
>              ram_counters.transferred += len;
>          }
>          qemu_mutex_unlock(&comp_param[idx].mutex);
> @@ -1441,6 +1473,10 @@ static int compress_page_with_multi_thread(RAMState *rs, RAMBlock *block,
>              qemu_cond_signal(&comp_param[idx].cond);
>              qemu_mutex_unlock(&comp_param[idx].mutex);
>              pages = 1;
> +            /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
> +            compression_counters.reduced_size += TARGET_PAGE_SIZE -
> +                                                 bytes_xmit + 8;
> +            compression_counters.pages++;
>              ram_counters.transferred += bytes_xmit;
>              break;
>          }
> @@ -1760,6 +1796,7 @@ static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss,
>          if (res > 0) {
>              return res;
>          }
> +        compression_counters.busy++;
>      }
>  
>      return ram_save_page(rs, pss, last_stage);
> diff --git a/migration/ram.h b/migration/ram.h
> index d386f4d641..7b009b23e5 100644
> --- a/migration/ram.h
> +++ b/migration/ram.h
> @@ -36,6 +36,7 @@
>  
>  extern MigrationStats ram_counters;
>  extern XBZRLECacheStats xbzrle_counters;
> +extern CompressionStats compression_counters;
>  
>  int xbzrle_cache_resize(int64_t new_size, Error **errp);
>  uint64_t ram_bytes_remaining(void);
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 3ec418dabf..a11987cdc4 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -72,6 +72,26 @@
>             'cache-miss': 'int', 'cache-miss-rate': 'number',
>             'overflow': 'int' } }
>  
> +##
> +# @CompressionStats:
> +#
> +# Detailed compression migration statistics
> +#
> +# @pages: amount of pages compressed and transferred to the target VM
> +#
> +# @busy: amount of count that no free thread to compress data
> +#
> +# @busy-rate: rate of thread busy
> +#
> +# @reduced-size: amount of bytes reduced by compression
> +#
> +# @compression-rate: rate of compressed size
> +#
> +##
> +{ 'struct': 'CompressionStats',
> +  'data': {'pages': 'int', 'busy': 'int', 'busy-rate': 'number',
> +	   'reduced-size': 'int', 'compression-rate': 'number' } }
> +
>  ##
>  # @MigrationStatus:
>  #
> @@ -169,6 +189,8 @@
>  #           only present when the postcopy-blocktime migration capability
>  #           is enabled. (Since 2.13)
>  #
> +# @compression: compression migration statistics, only returned if compression
> +#           feature is on and status is 'active' or 'completed' (Since 2.14)
>  #
>  # Since: 0.14.0
>  ##
> @@ -183,7 +205,8 @@
>             '*cpu-throttle-percentage': 'int',
>             '*error-desc': 'str',
>             '*postcopy-blocktime' : 'uint32',
> -           '*postcopy-vcpu-blocktime': ['uint32']} }
> +           '*postcopy-vcpu-blocktime': ['uint32'],
> +           '*compression': 'CompressionStats'} }
>  
>  ##
>  # @query-migrate:
> -- 
> 2.14.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 01/12] migration: do not wait if no free thread
  2018-06-13 15:43           ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-06-14  3:19             ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-14  3:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, wei.w.wang,
	pbonzini, jiang.biao2



On 06/13/2018 11:43 PM, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
>> On Tue, Jun 12, 2018 at 10:42:25AM +0800, Xiao Guangrong wrote:
>>>
>>>
>>> On 06/11/2018 03:39 PM, Peter Xu wrote:
>>>> On Mon, Jun 04, 2018 at 05:55:09PM +0800, guangrong.xiao@gmail.com wrote:
>>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>>
>>>>> Instead of putting the main thread to sleep state to wait for
>>>>> free compression thread, we can directly post it out as normal
>>>>> page that reduces the latency and uses CPUs more efficiently
>>>>
>>>> The feature looks good, though I'm not sure whether we should make a
>>>> capability flag for this feature since otherwise it'll be hard to
>>>> switch back to the old full-compression way no matter for what
>>>> reason.  Would that be a problem?
>>>>
>>>
>>> We assume this optimization should always be optimistic for all cases,
>>> particularly, we introduced the statistics of compression, then the user
>>> should adjust its parameters based on those statistics if anything works
>>> worse.
>>
>> Ah, that'll be good.
>>
>>>
>>> Furthermore, we really need to improve this optimization if it hurts
>>> any case rather than leaving a option to the user. :)
>>
>> Yeah, even if we make it a parameter/capability we can still turn that
>> on by default in new versions but keep the old behavior in old
>> versions. :) The major difference is that, then we can still _have_ a
>> way to compress every page. I'm just thinking if we don't have a
>> switch for that then if someone wants to measure e.g.  how a new
>> compression algo could help VM migration, then he/she won't be
>> possible to do that again since the numbers will be meaningless if
>> that bit is out of control on which page will be compressed.
>>
>> Though I don't know how much use it'll bring...  But if that won't be
>> too hard, it still seems good.  Not a strong opinion.
> 
> I think that is needed; it might be that some users have really awful
> networking and need the compression; I'd expect that for people who turn
> on compression they really expect the slowdown because they need it for
> their network, so changing that is a bit odd.

People should also make sure the system has enough CPU resources to do
compression, so ideally the 'busy-rate' should stay low enough,
I think.

However, it's not a big deal; I will introduce a parameter,
maybe compress-wait-free-thread.
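
For what it's worth, here is a minimal standalone sketch (not QEMU code; the
knob name and helpers are made up) of the behaviour being discussed: try to
hand the page to a free compression thread, and fall back to posting it as a
normal page unless the new parameter asks us to keep waiting:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_WORKERS 4

/* one lock per compression worker; held while the worker is busy */
static pthread_mutex_t worker_lock[NR_WORKERS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};
static bool compress_wait_free_thread;  /* the proposed knob, off by default */

/* returns true if a worker took the page, false if the caller should
 * send it out as a normal (uncompressed) page */
static bool try_compress_page(const void *page)
{
    do {
        for (int i = 0; i < NR_WORKERS; i++) {
            if (pthread_mutex_trylock(&worker_lock[i]) == 0) {
                /* ... hand 'page' to worker i; it unlocks when done ... */
                pthread_mutex_unlock(&worker_lock[i]);  /* demo only */
                return true;
            }
        }
        /* old behaviour: keep polling until a thread frees up */
    } while (compress_wait_free_thread);

    return false;  /* new behaviour: do not sleep, post a normal page */
}

int main(void)
{
    char page[4096] = { 0 };

    puts(try_compress_page(page) ? "compressed" : "sent as normal page");
    return 0;
}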

Thank you all, Dave and Peter! :)

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 02/12] migration: fix counting normal page for compression
  2018-06-13 15:51     ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-06-14  3:32       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-14  3:32 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini



On 06/13/2018 11:51 PM, Dr. David Alan Gilbert wrote:
> * guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> The compressed page is not normal page
> 
> Is this the right reason?

I think 'normal' pages shouldn't include compressed
pages or XBZRLE-ed pages (the current code does not treat
xbzrle pages as normal either).

> I think we always increment some counter for a page - so
> what gets incremented for a compressed page?

In a later patch, we will introduce the compression
statistics, which contain "pages":
    @pages: number of pages compressed and transferred to the target VM

> Is the real answer that we do:
> 
>    ram_save_target_page
>       control_save_page
>       compress_page_with_multi_thread
> 
> and control_save_page already increments the counter?

No :), control_save_page increments the counter only if it posted the
data out; in that case, the compression path is not invoked.
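
To spell out the rule being described, a standalone sketch (not the actual
QEMU counters or call chain): every page that goes out bumps exactly one
counter, and a page that went through the compression path bumps the
compression counter rather than 'normal':

#include <stdio.h>

struct page_counters {
    long normal;      /* full pages posted as-is                  */
    long duplicate;   /* zero pages (flag only)                   */
    long xbzrle;      /* pages sent as XBZRLE deltas              */
    long compressed;  /* pages handed to the compression threads  */
};

enum how_sent { SENT_NORMAL, SENT_ZERO, SENT_XBZRLE, SENT_COMPRESSED };

static void account_page(struct page_counters *c, enum how_sent how)
{
    switch (how) {
    case SENT_NORMAL:     c->normal++;     break;
    case SENT_ZERO:       c->duplicate++;  break;
    case SENT_XBZRLE:     c->xbzrle++;     break;
    case SENT_COMPRESSED: c->compressed++; break;  /* not counted as normal */
    }
}

int main(void)
{
    struct page_counters c = { 0 };

    account_page(&c, SENT_COMPRESSED);
    account_page(&c, SENT_NORMAL);
    printf("normal=%ld compressed=%ld\n", c.normal, c.compressed);
    return 0;
}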

Thanks!

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 04/12] migration: introduce migration_update_rates
  2018-06-13 16:17     ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-06-14  3:35       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-14  3:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini



On 06/14/2018 12:17 AM, Dr. David Alan Gilbert wrote:
> * guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> It is used to slightly clean the code up, no logic is changed
> 
> Actually, there is a slight change; iterations_prev is always updated
> when previously it was only updated with xbzrle on; still the change
> makes more sense.

Yes, indeed. I will update the changelog.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 05/12] migration: show the statistics of compression
  2018-06-13 16:25     ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-06-14  6:48       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-14  6:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini



On 06/14/2018 12:25 AM, Dr. David Alan Gilbert wrote:
  }
>>   
>>   static void migration_bitmap_sync(RAMState *rs)
>> @@ -1412,6 +1441,9 @@ static void flush_compressed_data(RAMState *rs)
>>           qemu_mutex_lock(&comp_param[idx].mutex);
>>           if (!comp_param[idx].quit) {
>>               len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
>> +            /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
>> +            compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;
> 
> I think I'd rather save just len+8 rather than than the subtraction.
>
Hmmmmmm, is this what you want?
       compression_counters.reduced_size += len - 8;

Then calculate the real reduced size in populate_ram_info() where we return this
info to the user:
       info->compression->reduced_size = compression_counters.pages * PAGE_SIZE - compression_counters.reduced_size;

Right?
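
In case it helps, a standalone sketch (not QEMU code, constants simplified) of
the accounting proposed above: accumulate the per-page compressed size during
migration and only derive the saving when the statistics are reported:

#include <stdio.h>

#define PAGE_SIZE 4096

static long comp_pages;  /* pages that went through compression             */
static long comp_bytes;  /* compressed payload transferred, header excluded */

/* called once per compressed page; 'len' is what was written to the
 * migration stream, including the 8-byte RAM_SAVE_FLAG_CONTINUE header */
static void account_compressed_page(long len)
{
    comp_pages++;
    comp_bytes += len - 8;
}

/* called only when the user queries the statistics */
static long reduced_size(void)
{
    return comp_pages * PAGE_SIZE - comp_bytes;
}

int main(void)
{
    account_compressed_page(108);   /* a 4096-byte page shrank to 100 + 8 */
    account_compressed_page(508);
    printf("pages=%ld reduced=%ld bytes\n", comp_pages, reduced_size());
    return 0;
}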

> I think other than that, and Eric's comments, it's OK.
> 

Thanks.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 03/12] migration: fix counting xbzrle cache_miss_rate
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-15 11:30     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-06-15 11:30 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini

* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Sync up xbzrle_cache_miss_prev only after migration iteration goes
> forward
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>

This patch (not the whole set) queued

> ---
>  migration/ram.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index dbf24d8c87..dd1283dd45 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1189,9 +1189,9 @@ static void migration_bitmap_sync(RAMState *rs)
>                     (double)(xbzrle_counters.cache_miss -
>                              rs->xbzrle_cache_miss_prev) /
>                     (rs->iterations - rs->iterations_prev);
> +                rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
>              }
>              rs->iterations_prev = rs->iterations;
> -            rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
>          }
>  
>          /* reset period counters */
> -- 
> 2.14.4
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 04/12] migration: introduce migration_update_rates
  2018-06-13 16:17     ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-06-15 11:32       ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-06-15 11:32 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > 
> > It is used to slightly clean the code up, no logic is changed
> 
> Actually, there is a slight change; iterations_prev is always updated
> when previously it was only updated with xbzrle on; still the change
> makes more sense.
> 
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

This patch queued.

> > Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> > ---
> >  migration/ram.c | 35 ++++++++++++++++++++++-------------
> >  1 file changed, 22 insertions(+), 13 deletions(-)
> > 
> > diff --git a/migration/ram.c b/migration/ram.c
> > index dd1283dd45..ee03b28435 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1130,6 +1130,25 @@ uint64_t ram_pagesize_summary(void)
> >      return summary;
> >  }
> >  
> > +static void migration_update_rates(RAMState *rs, int64_t end_time)
> > +{
> > +    uint64_t iter_count = rs->iterations - rs->iterations_prev;
> > +
> > +    /* calculate period counters */
> > +    ram_counters.dirty_pages_rate = rs->num_dirty_pages_period * 1000
> > +                / (end_time - rs->time_last_bitmap_sync);
> > +
> > +    if (!iter_count) {
> > +        return;
> > +    }
> > +
> > +    if (migrate_use_xbzrle()) {
> > +        xbzrle_counters.cache_miss_rate = (double)(xbzrle_counters.cache_miss -
> > +            rs->xbzrle_cache_miss_prev) / iter_count;
> > +        rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
> > +    }
> > +}
> > +
> >  static void migration_bitmap_sync(RAMState *rs)
> >  {
> >      RAMBlock *block;
> > @@ -1159,9 +1178,6 @@ static void migration_bitmap_sync(RAMState *rs)
> >  
> >      /* more than 1 second = 1000 millisecons */
> >      if (end_time > rs->time_last_bitmap_sync + 1000) {
> > -        /* calculate period counters */
> > -        ram_counters.dirty_pages_rate = rs->num_dirty_pages_period * 1000
> > -            / (end_time - rs->time_last_bitmap_sync);
> >          bytes_xfer_now = ram_counters.transferred;
> >  
> >          /* During block migration the auto-converge logic incorrectly detects
> > @@ -1183,16 +1199,9 @@ static void migration_bitmap_sync(RAMState *rs)
> >              }
> >          }
> >  
> > -        if (migrate_use_xbzrle()) {
> > -            if (rs->iterations_prev != rs->iterations) {
> > -                xbzrle_counters.cache_miss_rate =
> > -                   (double)(xbzrle_counters.cache_miss -
> > -                            rs->xbzrle_cache_miss_prev) /
> > -                   (rs->iterations - rs->iterations_prev);
> > -                rs->xbzrle_cache_miss_prev = xbzrle_counters.cache_miss;
> > -            }
> > -            rs->iterations_prev = rs->iterations;
> > -        }
> > +        migration_update_rates(rs, end_time);
> > +
> > +        rs->iterations_prev = rs->iterations;
> >  
> >          /* reset period counters */
> >          rs->time_last_bitmap_sync = end_time;
> > -- 
> > 2.14.4
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 06/12] migration: do not detect zero page for compression
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-19  7:30     ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-06-19  7:30 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini

On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Detecting zero page is not a light work, we can disable it
> for compression that can handle all zero data very well

Are there any numbers showing how the compression algo performs better
than the zero-detect algo?  Asking since AFAIU buffer_is_zero() might
be fast, depending on how init_accel() is done in util/bufferiszero.c.

From a compression-rate POV, of course the zero page algo wins since it
contains no data (only a flag).
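
For reference, a standalone sketch (not the actual QEMU code paths) of the
control flow the patch title describes: skip the explicit zero-page scan when
the page is going to a compression thread anyway, since the compressor
collapses all-zero input to a few bytes by itself:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* simple zero check standing in for buffer_is_zero() */
static bool page_is_zero(const char *p)
{
    return p[0] == 0 && memcmp(p, p + 1, PAGE_SIZE - 1) == 0;
}

/* returns a label describing how the page would be sent */
static const char *save_page(const char *page, bool compression)
{
    if (!compression && page_is_zero(page)) {
        return "zero page (flag only)";
    }
    return compression ? "handed to a compression thread" : "normal page";
}

int main(void)
{
    static char page[PAGE_SIZE];    /* all zeroes */

    printf("compression on:  %s\n", save_page(page, true));
    printf("compression off: %s\n", save_page(page, false));
    return 0;
}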

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 07/12] migration: hold the lock only if it is really needed
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-19  7:36     ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-06-19  7:36 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini

On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Try to hold src_page_req_mutex only if the queue is not
> empty

Pure question: how much would this patch help?  Basically, if you are
running compression tests then I think it means you are using precopy
(since postcopy cannot work with compression yet), and in that case the
lock here has no contention at all.
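
For readers following along, a standalone sketch (plain C11/pthreads, not the
QEMU code) of the pattern named in the subject: peek at the queue without the
lock and only take src_page_req_mutex when there apparently is something to
dequeue, rechecking under the lock because the unlocked peek can race:

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct req {
    struct req *next;
    int page;
};

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static struct req *_Atomic queue_head;   /* NULL means the queue is empty */

static struct req *unqueue_page(void)
{
    struct req *r;

    /* cheap unlocked peek: skip the lock on the common empty case */
    if (atomic_load_explicit(&queue_head, memory_order_acquire) == NULL) {
        return NULL;
    }

    pthread_mutex_lock(&queue_lock);
    r = atomic_load_explicit(&queue_head, memory_order_relaxed);
    if (r) {                             /* recheck: the peek may have raced */
        atomic_store_explicit(&queue_head, r->next, memory_order_relaxed);
    }
    pthread_mutex_unlock(&queue_lock);

    return r;
}

int main(void)
{
    puts(unqueue_page() ? "got a request" : "queue empty, lock skipped");
    return 0;
}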

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-20  4:52     ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-06-20  4:52 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> It's the simple lockless ring buffer implement which supports both
> single producer vs. single consumer and multiple producers vs.
> single consumer.
> 
> Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's
> rte_ring (2) before i wrote this implement. It corrects some bugs of
> memory barriers in kfifo and it is the simpler lockless version of
> rte_ring as currently multiple access is only allowed for producer.

Could you provide some more information about the kfifo bug?  Any
pointer would be appreciated.

> 
> If has single producer vs. single consumer, it is the traditional fifo,
> If has multiple producers, it uses the algorithm as followings:
> 
> For the producer, it uses two steps to update the ring:
>    - first step, occupy the entry in the ring:
> 
> retry:
>       in = ring->in
>       if (cmpxhg(&ring->in, in, in +1) != in)
>             goto retry;
> 
>      after that the entry pointed by ring->data[in] has been owned by
>      the producer.
> 
>      assert(ring->data[in] == NULL);
> 
>      Note, no other producer can touch this entry so that this entry
>      should always be the initialized state.
> 
>    - second step, write the data to the entry:
> 
>      ring->data[in] = data;
> 
> For the consumer, it first checks if there is available entry in the
> ring and fetches the entry from the ring:
> 
>      if (!ring_is_empty(ring))
>           entry = &ring[ring->out];
> 
>      Note: the ring->out has not been updated so that the entry pointed
>      by ring->out is completely owned by the consumer.
> 
> Then it checks if the data is ready:
> 
> retry:
>      if (*entry == NULL)
>             goto retry;
> That means, the producer has updated the index but haven't written any
> data to it.
> 
> Finally, it fetches the valid data out, set the entry to the initialized
> state and update ring->out to make the entry be usable to the producer:
> 
>       data = *entry;
>       *entry = NULL;
>       ring->out++;
> 
> Memory barrier is omitted here, please refer to the comment in the code.
> 
> (1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
> (2) http://dpdk.org/doc/api/rte__ring_8h.html
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  migration/ring.h | 265 +++++++++++++++++++++++++++++++++++++++++++++++++++++++

If this is a very general implementation, I'm not sure whether we could
move it to the util/ directory so that it can be used even outside the
migration code.

>  1 file changed, 265 insertions(+)
>  create mode 100644 migration/ring.h
> 
> diff --git a/migration/ring.h b/migration/ring.h
> new file mode 100644
> index 0000000000..da9b8bdcbb
> --- /dev/null
> +++ b/migration/ring.h
> @@ -0,0 +1,265 @@
> +/*
> + * Ring Buffer
> + *
> + * Multiple producers and single consumer are supported with lock free.
> + *
> + * Copyright (c) 2018 Tencent Inc
> + *
> + * Authors:
> + *  Xiao Guangrong <xiaoguangrong@tencent.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef _RING__
> +#define _RING__
> +
> +#define CACHE_LINE  64

Is this for x86_64?  Is the cache line size the same for all arches?

> +#define cache_aligned __attribute__((__aligned__(CACHE_LINE)))
> +
> +#define RING_MULTI_PRODUCER 0x1
> +
> +struct Ring {
> +    unsigned int flags;
> +    unsigned int size;
> +    unsigned int mask;
> +
> +    unsigned int in cache_aligned;
> +
> +    unsigned int out cache_aligned;
> +
> +    void *data[0] cache_aligned;
> +};
> +typedef struct Ring Ring;
> +
> +/*
> + * allocate and initialize the ring
> + *
> + * @size: the number of element, it should be power of 2
> + * @flags: set to RING_MULTI_PRODUCER if the ring has multiple producer,
> + *         otherwise set it to 0, i,e. single producer and single consumer.
> + *
> + * return the ring.
> + */
> +static inline Ring *ring_alloc(unsigned int size, unsigned int flags)
> +{
> +    Ring *ring;
> +
> +    assert(is_power_of_2(size));
> +
> +    ring = g_malloc0(sizeof(*ring) + size * sizeof(void *));
> +    ring->size = size;
> +    ring->mask = ring->size - 1;
> +    ring->flags = flags;
> +    return ring;
> +}
> +
> +static inline void ring_free(Ring *ring)
> +{
> +    g_free(ring);
> +}
> +
> +static inline bool __ring_is_empty(unsigned int in, unsigned int out)
> +{
> +    return in == out;
> +}

(some of the helpers are a bit confusing to me like this one; I would
 prefer some of the helpers be directly squashed into code, but it's a
 personal preference only)

> +
> +static inline bool ring_is_empty(Ring *ring)
> +{
> +    return ring->in == ring->out;
> +}
> +
> +static inline unsigned int ring_len(unsigned int in, unsigned int out)
> +{
> +    return in - out;
> +}

(this too)

> +
> +static inline bool
> +__ring_is_full(Ring *ring, unsigned int in, unsigned int out)
> +{
> +    return ring_len(in, out) > ring->mask;
> +}
> +
> +static inline bool ring_is_full(Ring *ring)
> +{
> +    return __ring_is_full(ring, ring->in, ring->out);
> +}
> +
> +static inline unsigned int ring_index(Ring *ring, unsigned int pos)
> +{
> +    return pos & ring->mask;
> +}
> +
> +static inline int __ring_put(Ring *ring, void *data)
> +{
> +    unsigned int index, out;
> +
> +    out = atomic_load_acquire(&ring->out);
> +    /*
> +     * smp_mb()
> +     *
> +     * should read ring->out before updating the entry, see the comments in
> +     * __ring_get().

Nit: here I think it means the comment in [1] below.  Maybe:

  "see the comments in __ring_get() when calling
   atomic_store_release()"

?

> +     */
> +
> +    if (__ring_is_full(ring, ring->in, out)) {
> +        return -ENOBUFS;
> +    }
> +
> +    index = ring_index(ring, ring->in);
> +
> +    atomic_set(&ring->data[index], data);
> +
> +    /*
> +     * should make sure the entry is updated before increasing ring->in
> +     * otherwise the consumer will get a entry but its content is useless.
> +     */
> +    smp_wmb();
> +    atomic_set(&ring->in, ring->in + 1);

Pure question: could we use store_release() instead of a mixture of
acquire/release accesses and raw memory barriers in the function?  Or is
there a performance consideration behind that?

It would be nice to mention the performance considerations if there are any.
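
To illustrate the alternative being asked about, a standalone C11-atomics
sketch (not the patch's code): the explicit smp_wmb() plus plain store can be
folded into a single release store on 'in', which pairs with the consumer's
acquire of 'in':

#include <stdatomic.h>
#include <stdio.h>

#define RING_SIZE 8                     /* must be a power of two */

static void *slot[RING_SIZE];
static atomic_uint ring_in, ring_out;   /* producer / consumer indices */

static int ring_put(void *data)
{
    unsigned int in  = atomic_load_explicit(&ring_in,  memory_order_relaxed);
    unsigned int out = atomic_load_explicit(&ring_out, memory_order_acquire);

    if (in - out >= RING_SIZE) {
        return -1;                      /* ring is full */
    }

    slot[in & (RING_SIZE - 1)] = data;
    /* release: the slot write above becomes visible before the new 'in' */
    atomic_store_explicit(&ring_in, in + 1, memory_order_release);
    return 0;
}

int main(void)
{
    int value = 42;

    printf("put: %d\n", ring_put(&value));
    return 0;
}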

> +    return 0;
> +}
> +
> +static inline void *__ring_get(Ring *ring)
> +{
> +    unsigned int index, in;
> +    void *data;
> +
> +    in = atomic_read(&ring->in);
> +
> +    /*
> +     * should read ring->in first to make sure the entry pointed by this
> +     * index is available, see the comments in __ring_put().
> +     */

Nit: similar to above, mentioning which comment is meant would be a
bit nicer.

> +    smp_rmb();
> +    if (__ring_is_empty(in, ring->out)) {
> +        return NULL;
> +    }
> +
> +    index = ring_index(ring, ring->out);
> +
> +    data = atomic_read(&ring->data[index]);
> +
> +    /*
> +     * smp_mb()
> +     *
> +     * once the ring->out is updated the entry originally indicated by the
> +     * the index is visible and usable to the producer so that we should
> +     * make sure reading the entry out before updating ring->out to avoid
> +     * the entry being overwritten by the producer.
> +     */
> +    atomic_store_release(&ring->out, ring->out + 1);

[1]

> +
> +    return data;
> +}

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-20  5:55     ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-06-20  5:55 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com wrote:

[...]

(Some more comments/questions for the MP implementation...)

> +static inline int ring_mp_put(Ring *ring, void *data)
> +{
> +    unsigned int index, in, in_next, out;
> +
> +    do {
> +        in = atomic_read(&ring->in);
> +        out = atomic_read(&ring->out);

[0]

Do we need to fetch "out" with load_acquire()?  Otherwise, what is the
store_release() below at [1] pairing with?

This barrier exists in the SP-SC case, which makes sense to me; I assume
it's also needed for the MP-SC case, am I right?
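
As a concrete illustration of the pairing being asked about, a standalone
C11-atomics sketch (not the patch's code) of the MP producer path with an
acquire load of 'out' that pairs with the consumer's release store when it
advances 'out':

#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>

#define RING_SIZE 8                     /* must be a power of two */

static void *_Atomic slot[RING_SIZE];
static atomic_uint ring_in, ring_out;

static int ring_mp_put(void *data)
{
    unsigned int in, out;

    do {
        in  = atomic_load_explicit(&ring_in,  memory_order_relaxed);
        /* acquire pairs with the consumer's release store on 'out' */
        out = atomic_load_explicit(&ring_out, memory_order_acquire);

        if (in - out >= RING_SIZE) {
            return -ENOBUFS;            /* full: let the caller decide */
        }
        /* reserve the slot; the cmpxchg uses seq_cst ordering */
    } while (!atomic_compare_exchange_weak(&ring_in, &in, in + 1));

    /* the consumer spins until the slot turns non-NULL, so publish it
     * with a release store */
    atomic_store_explicit(&slot[in & (RING_SIZE - 1)], data,
                          memory_order_release);
    return 0;
}

int main(void)
{
    int value = 1;

    printf("mp put: %d\n", ring_mp_put(&value));
    return 0;
}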

> +
> +        if (__ring_is_full(ring, in, out)) {
> +            if (atomic_read(&ring->in) == in &&
> +                atomic_read(&ring->out) == out) {

Why read them again?  After all, the ring API seems to be designed as
non-blocking.  E.g., the poll at [2] below makes more sense to me,
since reaching [2] means there must be a producer that is
_doing_ the queuing, so polling is very likely to complete fast.
Here, however, it seems to be a pure busy poll without any hint.  So I'm
not sure whether we should just let the caller decide whether it wants
to call ring_put() again.

> +                return -ENOBUFS;
> +            }
> +
> +            /* a entry has been fetched out, retry. */
> +            continue;
> +        }
> +
> +        in_next = in + 1;
> +    } while (atomic_cmpxchg(&ring->in, in, in_next) != in);
> +
> +    index = ring_index(ring, in);
> +
> +    /*
> +     * smp_rmb() paired with the memory barrier of (A) in ring_mp_get()
> +     * is implied in atomic_cmpxchg() as we should read ring->out first
> +     * before fetching the entry, otherwise this assert will fail.

Thanks for all these comments!  These are really helpful for
reviewers.

However, I'm not sure whether I understand the memory barrier of
(A) for ring_mp_get() correctly - AFAIU that should correspond to a smp_rmb()
at [0] above when reading the "out" variable, rather than to this
assertion, and that's why I thought at [0] we should have something
like a load_acquire() there (which contains an rmb()).

Content-wise, I think the code here is correct, since
atomic_cmpxchg() should have an implicit smp_mb() after all, so we
don't need any further barriers here.

> +     */
> +    assert(!atomic_read(&ring->data[index]));
> +
> +    /*
> +     * smp_mb() paired with the memory barrier of (B) in ring_mp_get() is
> +     * implied in atomic_cmpxchg(), that is needed here as  we should read
> +     * ring->out before updating the entry, it is the same as we did in
> +     * __ring_put().
> +     *
> +     * smp_wmb() paired with the memory barrier of (C) in ring_mp_get()
> +     * is implied in atomic_cmpxchg(), that is needed as we should increase
> +     * ring->in before updating the entry.
> +     */
> +    atomic_set(&ring->data[index], data);
> +
> +    return 0;
> +}
> +
> +static inline void *ring_mp_get(Ring *ring)
> +{
> +    unsigned int index, in;
> +    void *data;
> +
> +    do {
> +        in = atomic_read(&ring->in);
> +
> +        /*
> +         * (C) should read ring->in first to make sure the entry pointed by this
> +         * index is available
> +         */
> +        smp_rmb();
> +
> +        if (!__ring_is_empty(in, ring->out)) {
> +            break;
> +        }
> +
> +        if (atomic_read(&ring->in) == in) {
> +            return NULL;
> +        }
> +        /* new entry has been added in, retry. */
> +    } while (1);
> +
> +    index = ring_index(ring, ring->out);
> +
> +    do {
> +        data = atomic_read(&ring->data[index]);
> +        if (data) {
> +            break;
> +        }
> +        /* the producer is updating the entry, retry */
> +        cpu_relax();

[2]

> +    } while (1);
> +
> +    atomic_set(&ring->data[index], NULL);
> +
> +    /*
> +     * (B) smp_mb() is needed as we should read the entry out before
> +     * updating ring->out as we did in __ring_get().
> +     *
> +     * (A) smp_wmb() is needed as we should make the entry be NULL before
> +     * updating ring->out (which will make the entry be visible and usable).
> +     */
> +    atomic_store_release(&ring->out, ring->out + 1);

[1]

> +
> +    return data;
> +}
> +
> +static inline int ring_put(Ring *ring, void *data)
> +{
> +    if (ring->flags & RING_MULTI_PRODUCER) {
> +        return ring_mp_put(ring, data);
> +    }
> +    return __ring_put(ring, data);
> +}
> +
> +static inline void *ring_get(Ring *ring)
> +{
> +    if (ring->flags & RING_MULTI_PRODUCER) {
> +        return ring_mp_get(ring);
> +    }
> +    return __ring_get(ring);
> +}
> +#endif
> -- 
> 2.14.4
> 

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 10/12] migration: introduce lockless multithreads model
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-20  6:52     ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-06-20  6:52 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini

On Mon, Jun 04, 2018 at 05:55:18PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> Current implementation of compression and decompression are very
> hard to be enabled on productions. We noticed that too many wait-wakes
> go to kernel space and CPU usages are very low even if the system
> is really free
> 
> The reasons are:
> 1) there are two many locks used to do synchronous,there
>   is a global lock and each single thread has its own lock,
>   migration thread and work threads need to go to sleep if
>   these locks are busy
> 
> 2) migration thread separately submits request to the thread
>    however, only one request can be pended, that means, the
>    thread has to go to sleep after finishing the request
> 
> To make it work better, we introduce a new multithread model,
> the user, currently it is the migration thread, submits request
> to each thread with round-robin manner, the thread has its own
> ring whose capacity is 4 and puts the result to a global ring
> which is lockless for multiple producers, the user fetches result
> out from the global ring and do remaining operations for the
> request, e.g, posting the compressed data out for migration on
> the source QEMU
> 
> Performance Result:
> The test was based on top of the patch:
>    ring: introduce lockless ring buffer
> that means, previous optimizations are used for both of original case
> and applying the new multithread model
> 
> We tested live migration on two hosts:
>    Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz * 64 + 256G memory
> to migration a VM between each other, which has 16 vCPUs and 60G
> memory, during the migration, multiple threads are repeatedly writing
> the memory in the VM
> 
> We used 16 threads on the destination to decompress the data and on the
> source, we tried 8 threads and 16 threads to compress the data
> 
> --- Before our work ---
> migration can not be finished for both 8 threads and 16 threads. The data
> is as followings:
> 
> Use 8 threads to compress:
> - on the source:
> 	    migration thread   compress-threads
> CPU usage       70%          some use 36%, others are very low ~20%
> - on the destination:
>             main thread        decompress-threads
> CPU usage       100%         some use ~40%, other are very low ~2%
> 
> Migration status (CAN NOT FINISH):
> info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> Migration status: active
> total time: 1019540 milliseconds
> expected downtime: 2263 milliseconds
> setup: 218 milliseconds
> transferred ram: 252419995 kbytes
> throughput: 2469.45 mbps
> remaining ram: 15611332 kbytes
> total ram: 62931784 kbytes
> duplicate: 915323 pages
> skipped: 0 pages
> normal: 59673047 pages
> normal bytes: 238692188 kbytes
> dirty sync count: 28
> page size: 4 kbytes
> dirty pages rate: 170551 pages
> compression pages: 121309323 pages
> compression busy: 60588337
> compression busy rate: 0.36
> compression reduced size: 484281967178
> compression rate: 0.97
> 
> Use 16 threads to compress:
> - on the source:
> 	    migration thread   compress-threads
> CPU usage       96%          some use 45%, others are very low ~6%
> - on the destination:
>             main thread        decompress-threads
> CPU usage       96%         some use 58%, other are very low ~10%
> 
> Migration status (CAN NOT FINISH):
> info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> Migration status: active
> total time: 1189221 milliseconds
> expected downtime: 6824 milliseconds
> setup: 220 milliseconds
> transferred ram: 90620052 kbytes
> throughput: 840.41 mbps
> remaining ram: 3678760 kbytes
> total ram: 62931784 kbytes
> duplicate: 195893 pages
> skipped: 0 pages
> normal: 17290715 pages
> normal bytes: 69162860 kbytes
> dirty sync count: 33
> page size: 4 kbytes
> dirty pages rate: 175039 pages
> compression pages: 186739419 pages
> compression busy: 17486568
> compression busy rate: 0.09
> compression reduced size: 744546683892
> compression rate: 0.97
> 
> --- After our work ---
> Migration can be finished quickly with both 8 threads and 16 threads. The
> data is as follows:
> 
> Use 8 threads to compress:
> - on the source:
> 	    migration thread   compress-threads
> CPU usage       30%               30% (all threads have same CPU usage)
> - on the destination:
>             main thread        decompress-threads
> CPU usage       100%              50% (all threads have same CPU usage)
> 
> Migration status (finished in 219467 ms):
> info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> Migration status: completed
> total time: 219467 milliseconds
> downtime: 115 milliseconds
> setup: 222 milliseconds
> transferred ram: 88510173 kbytes
> throughput: 3303.81 mbps
> remaining ram: 0 kbytes
> total ram: 62931784 kbytes
> duplicate: 2211775 pages
> skipped: 0 pages
> normal: 21166222 pages
> normal bytes: 84664888 kbytes
> dirty sync count: 15
> page size: 4 kbytes
> compression pages: 32045857 pages
> compression busy: 23377968
> compression busy rate: 0.34
> compression reduced size: 127767894329
> compression rate: 0.97
> 
> Use 16 threads to compress:
> - on the source:
> 	    migration thread   compress-threads
> CPU usage       60%               60% (all threads have same CPU usage)
> - on the destination:
>             main thread        decompress-threads
> CPU usage       100%              75% (all threads have same CPU usage)
> 
> Migration status (finished in 64118 ms):
> info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> Migration status: completed
> total time: 64118 milliseconds
> downtime: 29 milliseconds
> setup: 223 milliseconds
> transferred ram: 13345135 kbytes
> throughput: 1705.10 mbps
> remaining ram: 0 kbytes
> total ram: 62931784 kbytes
> duplicate: 574921 pages
> skipped: 0 pages
> normal: 2570281 pages
> normal bytes: 10281124 kbytes
> dirty sync count: 9
> page size: 4 kbytes
> compression pages: 28007024 pages
> compression busy: 3145182
> compression busy rate: 0.08
> compression reduced size: 111829024985
> compression rate: 0.97

Not sure how other people think, but for me this information suits a
cover letter better.  For the commit message, I would prefer to know
something like: what this thread model can do; how the APIs are
designed and used; what the limitations are, etc.  After all, until
this patch nothing is using the new model yet, so these numbers are a
bit misleading.
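
For illustration, the intended call pattern on the user side, pieced
together from the APIs below, looks roughly like this (the compress_*
callbacks and the page helpers are made-up placeholders, not code from
this series):

    /* rough sketch only; compress_request_*() and the page helpers
     * are hypothetical placeholders, not part of the patch */
    Threads *threads = threads_create(16, "compress",
                                      compress_request_init,
                                      compress_request_uninit,
                                      compress_request_handler,
                                      compress_request_done);

    while (have_dirty_pages()) {                      /* hypothetical */
        ThreadRequest *req = threads_submit_request_prepare(threads);

        if (!req) {
            /* every per-thread ring is full, or no free request yet */
            continue;
        }
        fill_request_with_page(req);                  /* hypothetical */
        threads_submit_request_commit(threads, req);  /* wakes the thread */
    }

    threads_wait_done(threads);    /* drain all in-flight requests */
    threads_destroy(threads);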

> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  migration/Makefile.objs |   1 +
>  migration/threads.c     | 265 ++++++++++++++++++++++++++++++++++++++++++++++++
>  migration/threads.h     | 116 +++++++++++++++++++++

Again, this model seems to be suitable for scenarios even outside
migration.  So I'm not sure whether you'd like to generalize it (I
still see e.g. constants and comments related to migration, but there
aren't many) and put it into util/.

>  3 files changed, 382 insertions(+)
>  create mode 100644 migration/threads.c
>  create mode 100644 migration/threads.h
> 
> diff --git a/migration/Makefile.objs b/migration/Makefile.objs
> index c83ec47ba8..bdb61a7983 100644
> --- a/migration/Makefile.objs
> +++ b/migration/Makefile.objs
> @@ -7,6 +7,7 @@ common-obj-y += qemu-file-channel.o
>  common-obj-y += xbzrle.o postcopy-ram.o
>  common-obj-y += qjson.o
>  common-obj-y += block-dirty-bitmap.o
> +common-obj-y += threads.o
>  
>  common-obj-$(CONFIG_RDMA) += rdma.o
>  
> diff --git a/migration/threads.c b/migration/threads.c
> new file mode 100644
> index 0000000000..eecd3229b7
> --- /dev/null
> +++ b/migration/threads.c
> @@ -0,0 +1,265 @@
> +#include "threads.h"
> +
> +/* retry to see if there is an available request before actually going to wait. */
> +#define BUSY_WAIT_COUNT 1000
> +
> +static void *thread_run(void *opaque)
> +{
> +    ThreadLocal *self_data = (ThreadLocal *)opaque;
> +    Threads *threads = self_data->threads;
> +    void (*handler)(ThreadRequest *data) = threads->thread_request_handler;
> +    ThreadRequest *request;
> +    int count, ret;
> +
> +    for ( ; !atomic_read(&self_data->quit); ) {
> +        qemu_event_reset(&self_data->ev);
> +
> +        count = 0;
> +        while ((request = ring_get(self_data->request_ring)) ||
> +            count < BUSY_WAIT_COUNT) {
> +             /*
> +             * wait for a while before going to sleep so that the user
> +             * does not need to enter kernel space to wake up the
> +             * consumer threads.
> +             *
> +             * That indeed wastes some CPU, but it significantly
> +             * improves the case where the next request becomes
> +             * available soon.
> +             */
> +             if (!request) {
> +                cpu_relax();
> +                count++;
> +                continue;
> +            }
> +            count = 0;
> +
> +            handler(request);
> +
> +            do {
> +                ret = ring_put(threads->request_done_ring, request);
> +                /*
> +                 * request_done_ring has enough room to contain all
> +                 * requests; however, theoretically, it can still
> +                 * fail if the ring's indexes overflow, which would
> +                 * happen if more than 2^32 requests are

Could you elaborate why this ring_put() could fail, and why failure is
somehow related to 2^32 overflow?

Firstly, I don't understand why it will fail.

Meanwhile, AFAIU your ring can live well even with that 2^32 overflow.
Or did I misunderstand?
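
For context on the overflow question: with free-running unsigned
indices and a power-of-two ring size, the occupancy computation stays
correct across 32-bit wraparound.  A tiny self-contained check (not
part of the patch):

    /* sketch: ring occupancy with free-running 32-bit indices; the
     * producer index wrapping past UINT32_MAX does not break the
     * difference, since unsigned subtraction is modulo 2^32 */
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t out = UINT32_MAX - 1;  /* consumer index just before wrap */
        uint32_t in  = out + 3;         /* producer has wrapped around to 1 */

        assert(in - out == 3);          /* what ring_len(in, out) computes */
        return 0;
    }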

> +                 * handled between two calls of threads_wait_done().
> +                 * So we retry to make the code more robust.
> +                 *
> +                 * That is an unlikely case for migration as a block's
> +                 * memory is unlikely to be more than 16T (2^32 pages).

(some migration-related comments; maybe we can remove that)

> +                 */
> +                if (ret) {
> +                    fprintf(stderr,
> +                            "Potential BUG if it is triggered by migration.\n");
> +                }
> +            } while (ret);
> +        }
> +
> +        qemu_event_wait(&self_data->ev);
> +    }
> +
> +    return NULL;
> +}
> +
> +static void add_free_request(Threads *threads, ThreadRequest *request)
> +{
> +    QSLIST_INSERT_HEAD(&threads->free_requests, request, node);
> +    threads->free_requests_nr++;
> +}
> +
> +static ThreadRequest *get_and_remove_first_free_request(Threads *threads)
> +{
> +    ThreadRequest *request;
> +
> +    if (QSLIST_EMPTY(&threads->free_requests)) {
> +        return NULL;
> +    }
> +
> +    request = QSLIST_FIRST(&threads->free_requests);
> +    QSLIST_REMOVE_HEAD(&threads->free_requests, node);
> +    threads->free_requests_nr--;
> +    return request;
> +}
> +
> +static void uninit_requests(Threads *threads, int free_nr)
> +{
> +    ThreadRequest *request;
> +
> +    /*
> +     * all requests should be released to the list if the threads are
> +     * being destroyed, i.e. threads_wait_done() should be called first.
> +     */
> +    assert(threads->free_requests_nr == free_nr);
> +
> +    while ((request = get_and_remove_first_free_request(threads))) {
> +        threads->thread_request_uninit(request);
> +    }
> +
> +    assert(ring_is_empty(threads->request_done_ring));
> +    ring_free(threads->request_done_ring);
> +}
> +
> +static int init_requests(Threads *threads)
> +{
> +    ThreadRequest *request;
> +    unsigned int done_ring_size = pow2roundup32(threads->total_requests);
> +    int i, free_nr = 0;
> +
> +    threads->request_done_ring = ring_alloc(done_ring_size,
> +                                            RING_MULTI_PRODUCER);
> +
> +    QSLIST_INIT(&threads->free_requests);
> +    for (i = 0; i < threads->total_requests; i++) {
> +        request = threads->thread_request_init();
> +        if (!request) {
> +            goto cleanup;
> +        }
> +
> +        free_nr++;
> +        add_free_request(threads, request);
> +    }
> +    return 0;
> +
> +cleanup:
> +    uninit_requests(threads, free_nr);
> +    return -1;
> +}
> +
> +static void uninit_thread_data(Threads *threads)
> +{
> +    ThreadLocal *thread_local = threads->per_thread_data;
> +    int i;
> +
> +    for (i = 0; i < threads->threads_nr; i++) {
> +        thread_local[i].quit = true;
> +        qemu_event_set(&thread_local[i].ev);
> +        qemu_thread_join(&thread_local[i].thread);
> +        qemu_event_destroy(&thread_local[i].ev);
> +        assert(ring_is_empty(thread_local[i].request_ring));
> +        ring_free(thread_local[i].request_ring);
> +    }
> +}
> +
> +static void init_thread_data(Threads *threads)
> +{
> +    ThreadLocal *thread_local = threads->per_thread_data;
> +    char *name;
> +    int i;
> +
> +    for (i = 0; i < threads->threads_nr; i++) {
> +        qemu_event_init(&thread_local[i].ev, false);
> +
> +        thread_local[i].threads = threads;
> +        thread_local[i].self = i;
> +        thread_local[i].request_ring = ring_alloc(threads->thread_ring_size, 0);
> +        name = g_strdup_printf("%s/%d", threads->name, thread_local[i].self);
> +        qemu_thread_create(&thread_local[i].thread, name,
> +                           thread_run, &thread_local[i], QEMU_THREAD_JOINABLE);
> +        g_free(name);
> +    }
> +}
> +
> +/* the size of thread local request ring */
> +#define THREAD_REQ_RING_SIZE 4
> +
> +Threads *threads_create(unsigned int threads_nr, const char *name,
> +                        ThreadRequest *(*thread_request_init)(void),
> +                        void (*thread_request_uninit)(ThreadRequest *request),
> +                        void (*thread_request_handler)(ThreadRequest *request),
> +                        void (*thread_request_done)(ThreadRequest *request))
> +{
> +    Threads *threads;
> +    int ret;
> +
> +    threads = g_malloc0(sizeof(*threads) + threads_nr * sizeof(ThreadLocal));
> +    threads->threads_nr = threads_nr;
> +    threads->thread_ring_size = THREAD_REQ_RING_SIZE;

(If we're going to generalize this thread model, maybe you'd consider
 allowing the ring size to be specified as well?)
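
Something along these lines, presumably (purely illustrative; the
extra parameter is not in the patch and would need to be a power of
two for ring_alloc()):

    Threads *threads_create(unsigned int threads_nr,
                            unsigned int thread_ring_size, /* hypothetical */
                            const char *name,
                            ThreadRequest *(*thread_request_init)(void),
                            void (*thread_request_uninit)(ThreadRequest *request),
                            void (*thread_request_handler)(ThreadRequest *request),
                            void (*thread_request_done)(ThreadRequest *request));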

> +    threads->total_requests = threads->thread_ring_size * threads_nr;
> +
> +    threads->name = name;
> +    threads->thread_request_init = thread_request_init;
> +    threads->thread_request_uninit = thread_request_uninit;
> +    threads->thread_request_handler = thread_request_handler;
> +    threads->thread_request_done = thread_request_done;
> +
> +    ret = init_requests(threads);
> +    if (ret) {
> +        g_free(threads);
> +        return NULL;
> +    }
> +
> +    init_thread_data(threads);
> +    return threads;
> +}
> +
> +void threads_destroy(Threads *threads)
> +{
> +    uninit_thread_data(threads);
> +    uninit_requests(threads, threads->total_requests);
> +    g_free(threads);
> +}
> +
> +ThreadRequest *threads_submit_request_prepare(Threads *threads)
> +{
> +    ThreadRequest *request;
> +    unsigned int index;
> +
> +    index = threads->current_thread_index % threads->threads_nr;

Why round-robin rather than simply finding an idle thread (one that
still has valid free requests) and putting the request onto that?

I ask since I don't see much difficulty in achieving that.  Meanwhile,
with round-robin I'm not sure whether it can happen that one thread
gets stuck for some reason (e.g., a scheduling reason?) while the rest
of the threads are idle; would threads_submit_request_prepare() then
be stuck behind that hanging thread?
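
Roughly what that alternative could look like, as a hypothetical
helper (not something the patch provides):

    /* hypothetical alternative to strict round-robin: pick the first
     * thread whose request ring still has room, starting from the last
     * used position for some fairness; returns -1 if all are busy */
    static int find_free_thread(Threads *threads)
    {
        unsigned int i, index;

        for (i = 0; i < threads->threads_nr; i++) {
            index = (threads->current_thread_index + i) % threads->threads_nr;
            if (!ring_is_full(threads->per_thread_data[index].request_ring)) {
                return index;
            }
        }
        return -1;
    }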

> +
> +    /* the thread is busy */
> +    if (ring_is_full(threads->per_thread_data[index].request_ring)) {
> +        return NULL;
> +    }
> +
> +    /* try to get the request from the list */
> +    request = get_and_remove_first_free_request(threads);
> +    if (request) {
> +        goto got_request;
> +    }
> +
> +    /* get a request that has already been handled by the threads */
> +    request = ring_get(threads->request_done_ring);
> +    if (request) {
> +        threads->thread_request_done(request);
> +        goto got_request;
> +    }
> +    return NULL;
> +
> +got_request:
> +    threads->current_thread_index++;
> +    request->thread_index = index;
> +    return request;
> +}
> +
> +void threads_submit_request_commit(Threads *threads, ThreadRequest *request)
> +{
> +    int ret, index = request->thread_index;
> +    ThreadLocal *thread_local = &threads->per_thread_data[index];
> +
> +    ret = ring_put(thread_local->request_ring, request);
> +
> +    /*
> +     * we have detected in threads_submit_request_prepare() that the
> +     * thread's ring is not full, so there should be free room in
> +     * the ring
> +     */
> +    assert(!ret);
> +    /* new request arrived, notify the thread */
> +    qemu_event_set(&thread_local->ev);
> +}
> +
> +void threads_wait_done(Threads *threads)
> +{
> +    ThreadRequest *request;
> +
> +retry:
> +    while ((request = ring_get(threads->request_done_ring))) {
> +        threads->thread_request_done(request);
> +        add_free_request(threads, request);
> +    }
> +
> +    if (threads->free_requests_nr != threads->total_requests) {
> +        cpu_relax();
> +        goto retry;
> +    }
> +}
> diff --git a/migration/threads.h b/migration/threads.h
> new file mode 100644
> index 0000000000..eced913065
> --- /dev/null
> +++ b/migration/threads.h
> @@ -0,0 +1,116 @@
> +#ifndef QEMU_MIGRATION_THREAD_H
> +#define QEMU_MIGRATION_THREAD_H
> +
> +/*
> + * Multithreads abstraction
> + *
> + * This is the abstraction layer for multithreads management which is
> + * used to speed up migration.
> + *
> + * Note: currently only one producer is allowed.
> + *
> + * Copyright(C) 2018 Tencent Corporation.
> + *
> + * Author:
> + *   Xiao Guangrong <xiaoguangrong@tencent.com>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
> + * See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"

I was told (more than once) that we should not include "osdep.h" in
headers. :) I'd suggest including it in the source file instead.

> +#include "hw/boards.h"

Why do we need this header?

> +
> +#include "ring.h"
> +
> +/*
> + * the request representation, which contains the internally used
> + * metadata; it can be embedded into the user's self-defined data struct
> + * and the user can use container_of() to get the self-defined data back
> + */
> +struct ThreadRequest {
> +    QSLIST_ENTRY(ThreadRequest) node;
> +    unsigned int thread_index;
> +};
> +typedef struct ThreadRequest ThreadRequest;
> +
> +struct Threads;
> +
> +struct ThreadLocal {
> +    QemuThread thread;
> +
> +    /* the event used to wake up the thread */
> +    QemuEvent ev;
> +
> +    struct Threads *threads;
> +
> +    /* local request ring which is filled by the user */
> +    Ring *request_ring;
> +
> +    /* the index of the thread */
> +    int self;
> +
> +    /* thread is useless and needs to exit */
> +    bool quit;
> +};
> +typedef struct ThreadLocal ThreadLocal;
> +
> +/*
> + * the main data struct representing the multithreads model, which is
> + * shared by all threads
> + */
> +struct Threads {
> +    const char *name;
> +    unsigned int threads_nr;
> +    /* requests are pushed to the threads in a round-robin manner */
> +    unsigned int current_thread_index;
> +
> +    int thread_ring_size;
> +    int total_requests;
> +
> +    /* the requests are pre-allocated and linked in the list */
> +    int free_requests_nr;
> +    QSLIST_HEAD(, ThreadRequest) free_requests;
> +
> +    /* the constructor of request */
> +    ThreadRequest *(*thread_request_init)(void);
> +    /* the destructor of request */
> +    void (*thread_request_uninit)(ThreadRequest *request);
> +    /* the handler of the request which is called in the thread */
> +    void (*thread_request_handler)(ThreadRequest *request);
> +    /*
> +     * the handler to process the result which is called in the
> +     * user's context
> +     */
> +    void (*thread_request_done)(ThreadRequest *request);
> +
> +    /* the threads push results to this ring, so it has multiple producers */
> +    Ring *request_done_ring;
> +
> +    ThreadLocal per_thread_data[0];
> +};
> +typedef struct Threads Threads;

Not sure whether we can move the Threads/ThreadLocal definitions into
the source file, so that the header only exposes an opaque struct
declaration along with the APIs.
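
In that case threads.h would shrink to something like this
(illustrative only; ThreadRequest presumably has to stay public since
callers embed it and use container_of() on it):

    /* public because it is embedded in the user's own request struct */
    struct ThreadRequest {
        QSLIST_ENTRY(ThreadRequest) node;
        unsigned int thread_index;
    };
    typedef struct ThreadRequest ThreadRequest;

    /* Threads/ThreadLocal definitions move into threads.c */
    typedef struct Threads Threads;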

Regards,

> +
> +Threads *threads_create(unsigned int threads_nr, const char *name,
> +                        ThreadRequest *(*thread_request_init)(void),
> +                        void (*thread_request_uninit)(ThreadRequest *request),
> +                        void (*thread_request_handler)(ThreadRequest *request),
> +                        void (*thread_request_done)(ThreadRequest *request));
> +void threads_destroy(Threads *threads);
> +
> +/*
> + * find a free request and associate it with a free thread.
> + * If no request or no thread is free, return NULL
> + */
> +ThreadRequest *threads_submit_request_prepare(Threads *threads);
> +/*
> + * push the request to its thread's local ring and notify the thread
> + */
> +void threads_submit_request_commit(Threads *threads, ThreadRequest *request);
> +
> +/*
> + * wait for all threads to complete the requests in their local rings,
> + * to make sure no previous request is still pending.
> + */
> +void threads_wait_done(Threads *threads);
> +#endif
> -- 
> 2.14.4
> 

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-20 12:38     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 156+ messages in thread
From: Michael S. Tsirkin @ 2018-06-20 12:38 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mtosatti, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, jiang.biao2, pbonzini

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> It's a simple lockless ring buffer implementation which supports both
> single producer vs. single consumer and multiple producers vs.
> single consumer.
> 
> Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's
> rte_ring (2) before I wrote this implementation. It corrects some
> memory barrier bugs in kfifo and is a simpler lockless version of
> rte_ring, as currently multiple access is only allowed for the producer.
> 
> With a single producer vs. a single consumer, it is the traditional fifo.
> With multiple producers, it uses the following algorithm:
> 
> For the producer, it uses two steps to update the ring:
>    - first step, occupy the entry in the ring:
> 
> retry:
>       in = ring->in
>       if (cmpxchg(&ring->in, in, in + 1) != in)
>             goto retry;
> 
>      after that the entry pointed by ring->data[in] has been owned by
>      the producer.
> 
>      assert(ring->data[in] == NULL);
> 
>      Note, no other producer can touch this entry so that this entry
>      should always be in the initialized state.
> 
>    - second step, write the data to the entry:
> 
>      ring->data[in] = data;
> 
> For the consumer, it first checks if there is available entry in the
> ring and fetches the entry from the ring:
> 
>      if (!ring_is_empty(ring))
>           entry = &ring[ring->out];
> 
>      Note: the ring->out has not been updated so that the entry pointed
>      by ring->out is completely owned by the consumer.
> 
> Then it checks if the data is ready:
> 
> retry:
>      if (*entry == NULL)
>             goto retry;
> That means the producer has updated the index but hasn't written any
> data to it yet.
> 
> Finally, it fetches the valid data out, sets the entry to the initialized
> state and updates ring->out to make the entry usable to the producer:
> 
>       data = *entry;
>       *entry = NULL;
>       ring->out++;
> 
> Memory barrier is omitted here, please refer to the comment in the code.
>
>
>
> (1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
> (2) http://dpdk.org/doc/api/rte__ring_8h.html
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>

So instead of all this super-optimized trickiness, how about
a simple port of ptr_ring from linux?

That one isn't lockless but it's known to outperform
most others for a single producer/single consumer case.
And with a ton of networking going on,
who said it's such a hot spot? OTOH this implementation
has more barriers which slows down each individual thread.
It's also a source of bugs.

Further, the atomic tricks this one uses are not fair, so some threads can get
completely starved while others make progress. There's also no
chance to mix aggressive polling and sleeping with this
kind of scheme, so the starved thread will consume lots of
CPU.

So I'd like to see a simple ring used, and then a patch on top
switching to this tricky one with performance comparison
along with that.

> ---
>  migration/ring.h | 265 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 265 insertions(+)
>  create mode 100644 migration/ring.h
> 
> diff --git a/migration/ring.h b/migration/ring.h
> new file mode 100644
> index 0000000000..da9b8bdcbb
> --- /dev/null
> +++ b/migration/ring.h
> @@ -0,0 +1,265 @@
> +/*
> + * Ring Buffer
> + *
> + * Multiple producers and single consumer are supported with lock free.
> + *
> + * Copyright (c) 2018 Tencent Inc
> + *
> + * Authors:
> + *  Xiao Guangrong <xiaoguangrong@tencent.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef _RING__
> +#define _RING__

Prefix Ring is too short.


> +
> +#define CACHE_LINE  64
> +#define cache_aligned __attribute__((__aligned__(CACHE_LINE)))
> +
> +#define RING_MULTI_PRODUCER 0x1
> +
> +struct Ring {
> +    unsigned int flags;
> +    unsigned int size;
> +    unsigned int mask;
> +
> +    unsigned int in cache_aligned;
> +
> +    unsigned int out cache_aligned;
> +
> +    void *data[0] cache_aligned;
> +};
> +typedef struct Ring Ring;
> +
> +/*
> + * allocate and initialize the ring
> + *
> + * @size: the number of elements, it should be a power of 2
> + * @flags: set to RING_MULTI_PRODUCER if the ring has multiple producers,
> + *         otherwise set it to 0, i.e. single producer and single consumer.
> + *
> + * return the ring.
> + */
> +static inline Ring *ring_alloc(unsigned int size, unsigned int flags)
> +{
> +    Ring *ring;
> +
> +    assert(is_power_of_2(size));
> +
> +    ring = g_malloc0(sizeof(*ring) + size * sizeof(void *));
> +    ring->size = size;
> +    ring->mask = ring->size - 1;
> +    ring->flags = flags;
> +    return ring;
> +}
> +
> +static inline void ring_free(Ring *ring)
> +{
> +    g_free(ring);
> +}
> +
> +static inline bool __ring_is_empty(unsigned int in, unsigned int out)
> +{
> +    return in == out;
> +}
> +
> +static inline bool ring_is_empty(Ring *ring)
> +{
> +    return ring->in == ring->out;
> +}
> +
> +static inline unsigned int ring_len(unsigned int in, unsigned int out)
> +{
> +    return in - out;
> +}
> +
> +static inline bool
> +__ring_is_full(Ring *ring, unsigned int in, unsigned int out)
> +{
> +    return ring_len(in, out) > ring->mask;
> +}
> +
> +static inline bool ring_is_full(Ring *ring)
> +{
> +    return __ring_is_full(ring, ring->in, ring->out);
> +}
> +
> +static inline unsigned int ring_index(Ring *ring, unsigned int pos)
> +{
> +    return pos & ring->mask;
> +}
> +
> +static inline int __ring_put(Ring *ring, void *data)
> +{
> +    unsigned int index, out;
> +
> +    out = atomic_load_acquire(&ring->out);
> +    /*
> +     * smp_mb()
> +     *
> +     * should read ring->out before updating the entry, see the comments in
> +     * __ring_get().
> +     */
> +
> +    if (__ring_is_full(ring, ring->in, out)) {
> +        return -ENOBUFS;
> +    }
> +
> +    index = ring_index(ring, ring->in);
> +
> +    atomic_set(&ring->data[index], data);
> +
> +    /*
> +     * should make sure the entry is updated before increasing ring->in,
> +     * otherwise the consumer will get an entry but its content is useless.
> +     */
> +    smp_wmb();
> +    atomic_set(&ring->in, ring->in + 1);
> +    return 0;
> +}
> +
> +static inline void *__ring_get(Ring *ring)
> +{
> +    unsigned int index, in;
> +    void *data;
> +
> +    in = atomic_read(&ring->in);
> +
> +    /*
> +     * should read ring->in first to make sure the entry pointed by this
> +     * index is available, see the comments in __ring_put().
> +     */
> +    smp_rmb();
> +    if (__ring_is_empty(in, ring->out)) {
> +        return NULL;
> +    }
> +
> +    index = ring_index(ring, ring->out);
> +
> +    data = atomic_read(&ring->data[index]);
> +
> +    /*
> +     * smp_mb()
> +     *
> +     * once ring->out is updated, the entry originally indicated by the
> +     * index becomes visible and usable to the producer, so we should
> +     * make sure the entry is read out before updating ring->out to avoid
> +     * the entry being overwritten by the producer.
> +     */
> +    atomic_store_release(&ring->out, ring->out + 1);
> +
> +    return data;
> +}
> +
> +static inline int ring_mp_put(Ring *ring, void *data)
> +{
> +    unsigned int index, in, in_next, out;
> +
> +    do {
> +        in = atomic_read(&ring->in);
> +        out = atomic_read(&ring->out);
> +
> +        if (__ring_is_full(ring, in, out)) {
> +            if (atomic_read(&ring->in) == in &&
> +                atomic_read(&ring->out) == out) {
> +                return -ENOBUFS;
> +            }
> +
> +            /* an entry has been fetched out, retry. */
> +            continue;
> +        }
> +
> +        in_next = in + 1;
> +    } while (atomic_cmpxchg(&ring->in, in, in_next) != in);
> +
> +    index = ring_index(ring, in);
> +
> +    /*
> +     * smp_rmb() paired with the memory barrier of (A) in ring_mp_get()
> +     * is implied in atomic_cmpxchg() as we should read ring->out first
> +     * before fetching the entry, otherwise this assert will fail.
> +     */
> +    assert(!atomic_read(&ring->data[index]));
> +
> +    /*
> +     * smp_mb() paired with the memory barrier of (B) in ring_mp_get() is
> +     * implied in atomic_cmpxchg(), that is needed here as  we should read
> +     * ring->out before updating the entry, it is the same as we did in
> +     * __ring_put().
> +     *
> +     * smp_wmb() paired with the memory barrier of (C) in ring_mp_get()
> +     * is implied in atomic_cmpxchg(), that is needed as we should increase
> +     * ring->in before updating the entry.
> +     */
> +    atomic_set(&ring->data[index], data);
> +
> +    return 0;
> +}
> +
> +static inline void *ring_mp_get(Ring *ring)
> +{
> +    unsigned int index, in;
> +    void *data;
> +
> +    do {
> +        in = atomic_read(&ring->in);
> +
> +        /*
> +         * (C) should read ring->in first to make sure the entry pointed by this
> +         * index is available
> +         */
> +        smp_rmb();
> +
> +        if (!__ring_is_empty(in, ring->out)) {
> +            break;
> +        }
> +
> +        if (atomic_read(&ring->in) == in) {
> +            return NULL;
> +        }
> +        /* new entry has been added in, retry. */
> +    } while (1);
> +
> +    index = ring_index(ring, ring->out);
> +
> +    do {
> +        data = atomic_read(&ring->data[index]);
> +        if (data) {
> +            break;
> +        }
> +        /* the producer is updating the entry, retry */
> +        cpu_relax();
> +    } while (1);
> +
> +    atomic_set(&ring->data[index], NULL);
> +
> +    /*
> +     * (B) smp_mb() is needed as we should read the entry out before
> +     * updating ring->out as we did in __ring_get().
> +     *
> +     * (A) smp_wmb() is needed as we should make the entry be NULL before
> +     * updating ring->out (which will make the entry be visible and usable).
> +     */

I can't say I understand this all.
And the interaction of acquire/release semantics with smp_*
barriers is even scarier.

> +    atomic_store_release(&ring->out, ring->out + 1);
> +
> +    return data;
> +}
> +
> +static inline int ring_put(Ring *ring, void *data)
> +{
> +    if (ring->flags & RING_MULTI_PRODUCER) {
> +        return ring_mp_put(ring, data);
> +    }
> +    return __ring_put(ring, data);
> +}
> +
> +static inline void *ring_get(Ring *ring)
> +{
> +    if (ring->flags & RING_MULTI_PRODUCER) {
> +        return ring_mp_get(ring);
> +    }
> +    return __ring_get(ring);
> +}
> +#endif


A bunch of tricky barriers, retries, etc. all over the place.  This sorely
needs *a lot of* unit tests. Where are they?
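
For a sense of what such a test could look like, a minimal
single-producer/single-consumer check against the API above (just a
sketch, not part of the series):

    /* sketch of an SPSC sanity test for the Ring API quoted above; a
     * real test would also cover the multi-producer mode with threads */
    static void test_ring_spsc(void)
    {
        Ring *ring = ring_alloc(4, 0);
        int values[5] = {0, 1, 2, 3, 4};
        int i;

        g_assert(ring_is_empty(ring));

        for (i = 0; i < 4; i++) {
            g_assert(ring_put(ring, &values[i]) == 0);
        }
        g_assert(ring_is_full(ring));
        g_assert(ring_put(ring, &values[4]) != 0);   /* no room left */

        for (i = 0; i < 4; i++) {
            g_assert(ring_get(ring) == &values[i]);  /* FIFO order */
        }
        g_assert(ring_is_empty(ring));

        ring_free(ring);
    }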



> -- 
> 2.14.4

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Qemu-devel] [PATCH 09/12] ring: introduce lockless ring buffer
@ 2018-06-20 12:38     ` Michael S. Tsirkin
  0 siblings, 0 replies; 156+ messages in thread
From: Michael S. Tsirkin @ 2018-06-20 12:38 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: pbonzini, mtosatti, qemu-devel, kvm, dgilbert, peterx,
	jiang.biao2, wei.w.wang, Xiao Guangrong

On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> It's a simple lockless ring buffer implementation which supports both
> single producer vs. single consumer and multiple producers vs.
> single consumer.
> 
> Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's
> rte_ring (2) before I wrote this implementation. It corrects some
> memory barrier bugs in kfifo and is a simpler lockless version of
> rte_ring, as currently multiple access is only allowed for the producer.
> 
> With a single producer vs. a single consumer, it is the traditional fifo.
> With multiple producers, it uses the following algorithm:
> 
> For the producer, it uses two steps to update the ring:
>    - first step, occupy the entry in the ring:
> 
> retry:
>       in = ring->in
>       if (cmpxchg(&ring->in, in, in + 1) != in)
>             goto retry;
> 
>      after that the entry pointed by ring->data[in] has been owned by
>      the producer.
> 
>      assert(ring->data[in] == NULL);
> 
>      Note, no other producer can touch this entry so that this entry
>      should always be in the initialized state.
> 
>    - second step, write the data to the entry:
> 
>      ring->data[in] = data;
> 
> For the consumer, it first checks if there is available entry in the
> ring and fetches the entry from the ring:
> 
>      if (!ring_is_empty(ring))
>           entry = &ring[ring->out];
> 
>      Note: the ring->out has not been updated so that the entry pointed
>      by ring->out is completely owned by the consumer.
> 
> Then it checks if the data is ready:
> 
> retry:
>      if (*entry == NULL)
>             goto retry;
> That means the producer has updated the index but hasn't written any
> data to it yet.
> 
> Finally, it fetches the valid data out, sets the entry to the initialized
> state and updates ring->out to make the entry usable to the producer:
> 
>       data = *entry;
>       *entry = NULL;
>       ring->out++;
> 
> Memory barrier is omitted here, please refer to the comment in the code.
>
>
>
> (1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
> (2) http://dpdk.org/doc/api/rte__ring_8h.html
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>

So instead of all this super-optimized trickiness, how about
a simple port of ptr_ring from linux?

That one isn't lockless but it's known to outperform
most others for a single producer/single consumer case.
And with a ton of networking going on,
who said it's such a hot spot? OTOH this implementation
has more barriers which slows down each individual thread.
It's also a source of bugs.

Further, the atomic tricks this one uses are not fair, so some threads can get
completely starved while others make progress. There's also no
chance to mix aggressive polling and sleeping with this
kind of scheme, so the starved thread will consume lots of
CPU.

So I'd like to see a simple ring used, and then a patch on top
switching to this tricky one with performance comparison
along with that.

> ---
>  migration/ring.h | 265 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 265 insertions(+)
>  create mode 100644 migration/ring.h
> 
> diff --git a/migration/ring.h b/migration/ring.h
> new file mode 100644
> index 0000000000..da9b8bdcbb
> --- /dev/null
> +++ b/migration/ring.h
> @@ -0,0 +1,265 @@
> +/*
> + * Ring Buffer
> + *
> + * Multiple producers and single consumer are supported with lock free.
> + *
> + * Copyright (c) 2018 Tencent Inc
> + *
> + * Authors:
> + *  Xiao Guangrong <xiaoguangrong@tencent.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#ifndef _RING__
> +#define _RING__

Prefix Ring is too short.


> +
> +#define CACHE_LINE  64
> +#define cache_aligned __attribute__((__aligned__(CACHE_LINE)))
> +
> +#define RING_MULTI_PRODUCER 0x1
> +
> +struct Ring {
> +    unsigned int flags;
> +    unsigned int size;
> +    unsigned int mask;
> +
> +    unsigned int in cache_aligned;
> +
> +    unsigned int out cache_aligned;
> +
> +    void *data[0] cache_aligned;
> +};
> +typedef struct Ring Ring;
> +
> +/*
> + * allocate and initialize the ring
> + *
> + * @size: the number of elements; it must be a power of 2
> + * @flags: set to RING_MULTI_PRODUCER if the ring has multiple producers,
> + *         otherwise set it to 0, i.e. single producer and single consumer.
> + *
> + * return the ring.
> + */
> +static inline Ring *ring_alloc(unsigned int size, unsigned int flags)
> +{
> +    Ring *ring;
> +
> +    assert(is_power_of_2(size));
> +
> +    ring = g_malloc0(sizeof(*ring) + size * sizeof(void *));
> +    ring->size = size;
> +    ring->mask = ring->size - 1;
> +    ring->flags = flags;
> +    return ring;
> +}
> +
> +static inline void ring_free(Ring *ring)
> +{
> +    g_free(ring);
> +}
> +
> +static inline bool __ring_is_empty(unsigned int in, unsigned int out)
> +{
> +    return in == out;
> +}
> +
> +static inline bool ring_is_empty(Ring *ring)
> +{
> +    return ring->in == ring->out;
> +}
> +
> +static inline unsigned int ring_len(unsigned int in, unsigned int out)
> +{
> +    return in - out;
> +}
> +
> +static inline bool
> +__ring_is_full(Ring *ring, unsigned int in, unsigned int out)
> +{
> +    return ring_len(in, out) > ring->mask;
> +}
> +
> +static inline bool ring_is_full(Ring *ring)
> +{
> +    return __ring_is_full(ring, ring->in, ring->out);
> +}
> +
> +static inline unsigned int ring_index(Ring *ring, unsigned int pos)
> +{
> +    return pos & ring->mask;
> +}
> +
> +static inline int __ring_put(Ring *ring, void *data)
> +{
> +    unsigned int index, out;
> +
> +    out = atomic_load_acquire(&ring->out);
> +    /*
> +     * smp_mb()
> +     *
> +     * should read ring->out before updating the entry, see the comments in
> +     * __ring_get().
> +     */
> +
> +    if (__ring_is_full(ring, ring->in, out)) {
> +        return -ENOBUFS;
> +    }
> +
> +    index = ring_index(ring, ring->in);
> +
> +    atomic_set(&ring->data[index], data);
> +
> +    /*
> +     * should make sure the entry is updated before increasing ring->in
> +     * otherwise the consumer will get an entry but its content is useless.
> +     */
> +    smp_wmb();
> +    atomic_set(&ring->in, ring->in + 1);
> +    return 0;
> +}
> +
> +static inline void *__ring_get(Ring *ring)
> +{
> +    unsigned int index, in;
> +    void *data;
> +
> +    in = atomic_read(&ring->in);
> +
> +    /*
> +     * should read ring->in first to make sure the entry pointed by this
> +     * index is available, see the comments in __ring_put().
> +     */
> +    smp_rmb();
> +    if (__ring_is_empty(in, ring->out)) {
> +        return NULL;
> +    }
> +
> +    index = ring_index(ring, ring->out);
> +
> +    data = atomic_read(&ring->data[index]);
> +
> +    /*
> +     * smp_mb()
> +     *
> +     * once ring->out is updated, the entry originally indicated by the
> +     * index becomes visible and usable to the producer, so we should
> +     * make sure the entry has been read out before updating ring->out to
> +     * avoid it being overwritten by the producer.
> +     */
> +    atomic_store_release(&ring->out, ring->out + 1);
> +
> +    return data;
> +}
> +
> +static inline int ring_mp_put(Ring *ring, void *data)
> +{
> +    unsigned int index, in, in_next, out;
> +
> +    do {
> +        in = atomic_read(&ring->in);
> +        out = atomic_read(&ring->out);
> +
> +        if (__ring_is_full(ring, in, out)) {
> +            if (atomic_read(&ring->in) == in &&
> +                atomic_read(&ring->out) == out) {
> +                return -ENOBUFS;
> +            }
> +
> +            /* an entry has been fetched out, retry. */
> +            continue;
> +        }
> +
> +        in_next = in + 1;
> +    } while (atomic_cmpxchg(&ring->in, in, in_next) != in);
> +
> +    index = ring_index(ring, in);
> +
> +    /*
> +     * smp_rmb() paired with the memory barrier of (A) in ring_mp_get()
> +     * is implied in atomic_cmpxchg() as we should read ring->out first
> +     * before fetching the entry, otherwise this assert will fail.
> +     */
> +    assert(!atomic_read(&ring->data[index]));
> +
> +    /*
> +     * smp_mb() paired with the memory barrier of (B) in ring_mp_get() is
> +     * implied in atomic_cmpxchg(), which is needed here as we should read
> +     * ring->out before updating the entry, the same as we did in
> +     * __ring_put().
> +     *
> +     * smp_wmb() paired with the memory barrier of (C) in ring_mp_get()
> +     * is implied in atomic_cmpxchg(), that is needed as we should increase
> +     * ring->in before updating the entry.
> +     */
> +    atomic_set(&ring->data[index], data);
> +
> +    return 0;
> +}
> +
> +static inline void *ring_mp_get(Ring *ring)
> +{
> +    unsigned int index, in;
> +    void *data;
> +
> +    do {
> +        in = atomic_read(&ring->in);
> +
> +        /*
> +         * (C) should read ring->in first to make sure the entry pointed by this
> +         * index is available
> +         */
> +        smp_rmb();
> +
> +        if (!__ring_is_empty(in, ring->out)) {
> +            break;
> +        }
> +
> +        if (atomic_read(&ring->in) == in) {
> +            return NULL;
> +        }
> +        /* new entry has been added in, retry. */
> +    } while (1);
> +
> +    index = ring_index(ring, ring->out);
> +
> +    do {
> +        data = atomic_read(&ring->data[index]);
> +        if (data) {
> +            break;
> +        }
> +        /* the producer is updating the entry, retry */
> +        cpu_relax();
> +    } while (1);
> +
> +    atomic_set(&ring->data[index], NULL);
> +
> +    /*
> +     * (B) smp_mb() is needed as we should read the entry out before
> +     * updating ring->out as we did in __ring_get().
> +     *
> +     * (A) smp_wmb() is needed as we should make the entry be NULL before
> +     * updating ring->out (which will make the entry be visible and usable).
> +     */

I can't say I understand this all.
And the interaction of acquire/release semantics with smp_*
barriers is even scarier.

> +    atomic_store_release(&ring->out, ring->out + 1);
> +
> +    return data;
> +}
> +
> +static inline int ring_put(Ring *ring, void *data)
> +{
> +    if (ring->flags & RING_MULTI_PRODUCER) {
> +        return ring_mp_put(ring, data);
> +    }
> +    return __ring_put(ring, data);
> +}
> +
> +static inline void *ring_get(Ring *ring)
> +{
> +    if (ring->flags & RING_MULTI_PRODUCER) {
> +        return ring_mp_get(ring);
> +    }
> +    return __ring_get(ring);
> +}
> +#endif


A bunch of tricky barriers, retries, etc. all over the place.  This sorely
needs *a lot of* unit tests. Where are they?
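
(For illustration, a minimal smoke test for the header above could look like
the sketch below. The file name tests/test-ring.c is hypothetical and the test
is not part of this series; it only covers the uncontended single-producer
paths, so it says nothing about the multi-producer barrier questions raised
above, which would need threaded stress tests.)

    /* hypothetical tests/test-ring.c: single-threaded smoke test only */
    #include "qemu/osdep.h"
    #include "migration/ring.h"

    static void test_ring_sp_basic(void)
    {
        Ring *ring = ring_alloc(4, 0);   /* single producer, single consumer */
        int values[4] = { 1, 2, 3, 4 };
        int i;

        g_assert(ring_is_empty(ring));

        for (i = 0; i < 4; i++) {
            g_assert_cmpint(ring_put(ring, &values[i]), ==, 0);
        }
        g_assert(ring_is_full(ring));
        g_assert_cmpint(ring_put(ring, &values[0]), ==, -ENOBUFS);

        for (i = 0; i < 4; i++) {
            g_assert(ring_get(ring) == &values[i]);   /* FIFO order */
        }
        g_assert(ring_get(ring) == NULL);             /* empty again */

        ring_free(ring);
    }

    int main(int argc, char **argv)
    {
        g_test_init(&argc, &argv, NULL);
        g_test_add_func("/migration/ring/sp-basic", test_ring_sp_basic);
        return g_test_run();
    }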



> -- 
> 2.14.4

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 06/12] migration: do not detect zero page for compression
  2018-06-19  7:30     ` [Qemu-devel] " Peter Xu
@ 2018-06-28  9:12       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-28  9:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini


Hi Peter,

Sorry for the delay, as I was busy with other things.

On 06/19/2018 03:30 PM, Peter Xu wrote:
> On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Detecting a zero page is not light work; we can disable it
>> for compression, which can handle all-zero data very well
> 
> Is there any number shows how the compression algo performs better
> than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> be fast, depending on how init_accel() is done in util/bufferiszero.c.

This is the comparison between zero-detection and compression (the target
buffer is all zero bits):

Zero 810 ns Compression: 26905 ns.
Zero 417 ns Compression: 8022 ns.
Zero 408 ns Compression: 7189 ns.
Zero 400 ns Compression: 7255 ns.
Zero 412 ns Compression: 7016 ns.
Zero 411 ns Compression: 7035 ns.
Zero 413 ns Compression: 6994 ns.
Zero 399 ns Compression: 7024 ns.
Zero 416 ns Compression: 7053 ns.
Zero 405 ns Compression: 7041 ns.

Indeed, zero-detection is faster than compression.
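
(For reference, numbers of this shape can be reproduced with a micro-benchmark
along the lines of the sketch below. It is hypothetical and not the code used
for the measurements above: it assumes QEMU's buffer_is_zero() from
util/bufferiszero.c and plain zlib compress2(), whereas the migration code uses
a zlib deflate stream.)

    /* sketch: time buffer_is_zero() vs. zlib compression of one zero page */
    #include "qemu/osdep.h"
    #include "qemu/cutils.h"    /* buffer_is_zero() */
    #include <zlib.h>
    #include <time.h>

    #define PAGE_SIZE 4096

    static int64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(void)
    {
        static uint8_t page[PAGE_SIZE];       /* all-zero source page */
        uint8_t out[PAGE_SIZE * 2];
        int i;

        for (i = 0; i < 10; i++) {
            uLongf out_len = sizeof(out);
            int64_t t0 = now_ns();
            bool zero = buffer_is_zero(page, PAGE_SIZE);
            int64_t t1 = now_ns();
            compress2(out, &out_len, page, PAGE_SIZE, Z_BEST_SPEED);
            int64_t t2 = now_ns();

            printf("Zero %" PRId64 " ns (%d)  Compression: %" PRId64 " ns (%lu bytes).\n",
                   t1 - t0, zero, t2 - t1, (unsigned long)out_len);
        }
        return 0;
    }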

However, during our profiling of the live_migration thread (after reverting this
patch), we noticed that zero-detection costs lots of CPU:

  12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
   7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
   6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
   5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
   5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
   4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
   4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
   3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
   2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
   2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
   2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
   2.25%  kqemu  qemu-system-x86_64            [.] ring_get
   1.96%  kqemu  libc-2.12.so                  [.] memcpy

After this patch, the workload is moved to the worker thread. Is that
acceptable?

> 
>  From compression rate POV of course zero page algo wins since it
> contains no data (but only a flag).
> 

Yes, it is. The compressed zero page is 45 bytes, which is small enough, I think.

Hmm, if you do not like that, how about moving zero-page detection to the worker thread?

Thanks!

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 07/12] migration: hold the lock only if it is really needed
  2018-06-19  7:36     ` [Qemu-devel] " Peter Xu
@ 2018-06-28  9:33       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-28  9:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini



On 06/19/2018 03:36 PM, Peter Xu wrote:
> On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Try to hold src_page_req_mutex only if the queue is not
>> empty
> 
> Pure question: how much this patch would help?  Basically if you are
> running compression tests then I think it means you are with precopy
> (since postcopy cannot work with compression yet), then here the lock
> has no contention at all.

Yes, you are right. However, we can observe that it is among the top functions
(after reverting this patch):

Samples: 29K of event 'cycles', Event count (approx.): 22263412260
+   7.99%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
+   6.95%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
+   6.23%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
+   6.20%  kqemu  qemu-system-x86_64       [.] qemu_event_set
+   5.80%  kqemu  qemu-system-x86_64       [.] __ring_put
+   4.82%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
+   4.11%  kqemu  qemu-system-x86_64       [.] ring_is_full
+   3.07%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
+   2.83%  kqemu  qemu-system-x86_64       [.] ring_mp_get
+   2.71%  kqemu  qemu-system-x86_64       [.] __ring_is_full
+   2.46%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
+   2.40%  kqemu  qemu-system-x86_64       [.] add_to_iovec
+   2.21%  kqemu  qemu-system-x86_64       [.] ring_get
+   1.96%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
+   1.90%  kqemu  libc-2.12.so             [.] memcpy
+   1.55%  kqemu  qemu-system-x86_64       [.] ring_len
+   1.12%  kqemu  libpthread-2.12.so       [.] pthread_mutex_unlock
+   1.11%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
+   1.07%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
+   1.04%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
+   0.97%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
+   0.96%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
+   0.93%  kqemu  libpthread-2.12.so       [.] pthread_mutex_lock

I guess its atomic operations cost CPU resources, and check-before-lock is
a common technique; I think it shouldn't have any side effects, right? :)
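
(For readers following along, the check-before-lock pattern being discussed
looks roughly like the generic sketch below; the structure and field names are
made up for illustration and are not the actual migration code. The unlocked
check must be an atomic read, and its result is only a hint, so the state is
re-checked under the lock:)

    #include <pthread.h>
    #include <stddef.h>

    typedef struct Request Request;
    struct Request {
        Request *next;
        /* ... payload ... */
    };

    typedef struct {
        pthread_mutex_t lock;
        Request *head;               /* protected by 'lock' */
        int      nr_pending;         /* updated under 'lock', read locklessly */
    } RequestQueue;

    static Request *queue_pop(RequestQueue *q)
    {
        Request *req = NULL;

        /* cheap unlocked check: skip the mutex on the common empty path.
         * A stale zero is harmless here; we simply poll again later. */
        if (__atomic_load_n(&q->nr_pending, __ATOMIC_RELAXED) == 0) {
            return NULL;
        }

        pthread_mutex_lock(&q->lock);
        if (q->nr_pending) {         /* re-check under the lock */
            req = q->head;
            q->head = req->next;
            q->nr_pending--;
        }
        pthread_mutex_unlock(&q->lock);
        return req;
    }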

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 06/12] migration: do not detect zero page for compression
  2018-06-28  9:12       ` [Qemu-devel] " Xiao Guangrong
@ 2018-06-28  9:36         ` Daniel P. Berrangé
  -1 siblings, 0 replies; 156+ messages in thread
From: Daniel P. Berrangé @ 2018-06-28  9:36 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, Peter Xu,
	qemu-devel, wei.w.wang, pbonzini, jiang.biao2

On Thu, Jun 28, 2018 at 05:12:39PM +0800, Xiao Guangrong wrote:
> 
> Hi Peter,
> 
> Sorry for the delay as i was busy on other things.
> 
> On 06/19/2018 03:30 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Detecting a zero page is not light work; we can disable it
> > > for compression, which can handle all-zero data very well
> > 
> > Is there any number shows how the compression algo performs better
> > than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> > be fast, depending on how init_accel() is done in util/bufferiszero.c.
> 
> This is the comparison between zero-detection and compression (the target
> buffer is all zero bit):
> 
> Zero 810 ns Compression: 26905 ns.
> Zero 417 ns Compression: 8022 ns.
> Zero 408 ns Compression: 7189 ns.
> Zero 400 ns Compression: 7255 ns.
> Zero 412 ns Compression: 7016 ns.
> Zero 411 ns Compression: 7035 ns.
> Zero 413 ns Compression: 6994 ns.
> Zero 399 ns Compression: 7024 ns.
> Zero 416 ns Compression: 7053 ns.
> Zero 405 ns Compression: 7041 ns.
> 
> Indeed, zero-detection is faster than compression.
> 
> However during our profiling for the live_migration thread (after reverted this patch),
> we noticed zero-detection cost lots of CPU:
> 
>  12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
>   7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
>   6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
>   5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
>   5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
>   4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
>   4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
>   3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
>   2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
>   2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
>   2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
>   2.25%  kqemu  qemu-system-x86_64            [.] ring_get
>   1.96%  kqemu  libc-2.12.so                  [.] memcpy
> 
> After this patch, the workload is moved to the worker thread, is it
> acceptable?

It depends on your point of view. If you have spare / idle CPUs on the host,
then moving workload to a thread is ok, despite the CPU cost of compression
in that thread being much higher than what was replaced, since you won't be
taking CPU resources away from other contending workloads.

I'd venture to suggest though that we should probably *not* be optimizing for
the case of idle CPUs on the host. More realistic is to expect that the host
CPUs are near fully committed to work, and thus the (default) goal should be
to minimize CPU overhead for the host as a whole. From this POV, zero-page
detection is better than compression due to > x10 better speed.

Given the CPU overheads of compression, I think it has fairly narrow use
in migration in general when considering hosts are often highly committed
on CPU.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression
@ 2018-06-28  9:36         ` Daniel P. Berrangé
  0 siblings, 0 replies; 156+ messages in thread
From: Daniel P. Berrangé @ 2018-06-28  9:36 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: Peter Xu, kvm, mst, mtosatti, Xiao Guangrong, dgilbert,
	qemu-devel, wei.w.wang, jiang.biao2, pbonzini

On Thu, Jun 28, 2018 at 05:12:39PM +0800, Xiao Guangrong wrote:
> 
> Hi Peter,
> 
> Sorry for the delay as i was busy on other things.
> 
> On 06/19/2018 03:30 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Detecting zero page is not a light work, we can disable it
> > > for compression that can handle all zero data very well
> > 
> > Is there any number shows how the compression algo performs better
> > than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> > be fast, depending on how init_accel() is done in util/bufferiszero.c.
> 
> This is the comparison between zero-detection and compression (the target
> buffer is all zero bit):
> 
> Zero 810 ns Compression: 26905 ns.
> Zero 417 ns Compression: 8022 ns.
> Zero 408 ns Compression: 7189 ns.
> Zero 400 ns Compression: 7255 ns.
> Zero 412 ns Compression: 7016 ns.
> Zero 411 ns Compression: 7035 ns.
> Zero 413 ns Compression: 6994 ns.
> Zero 399 ns Compression: 7024 ns.
> Zero 416 ns Compression: 7053 ns.
> Zero 405 ns Compression: 7041 ns.
> 
> Indeed, zero-detection is faster than compression.
> 
> However during our profiling for the live_migration thread (after reverted this patch),
> we noticed zero-detection cost lots of CPU:
> 
>  12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2                                                                                                                                                                           ◆
>   7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total                                                                                                                                                                            ▒
>   6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set                                                                                                                                                                             ▒
>   5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file                                                                                                                                                                         ▒
>   5.00%  kqemu  qemu-system-x86_64            [.] __ring_put                                                                                                                                                                                 ▒
>   4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string                                                                                                                                                             ▒
>   4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done                                                                                                                                                                  ▒
>   3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full                                                                                                                                                                               ▒
>   2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full                                                                                                                                                                             ▒
>   2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare                                                                                                                                                             ▒
>   2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get                                                                                                                                                                                ▒
>   2.25%  kqemu  qemu-system-x86_64            [.] ring_get                                                                                                                                                                                   ▒
>   1.96%  kqemu  libc-2.12.so                  [.] memcpy
> 
> After this patch, the workload is moved to the worker thread, is it
> acceptable?

It depends on your point of view. If you have spare / idle CPUs on the host,
then moving workload to a thread is ok, despite the CPU cost of compression
in that thread being much higher what what was replaced, since you won't be
taking CPU resources away from other contending workloads.

I'd venture to suggest though that we should probably *not* be optimizing for
the case of idle CPUs on the host. More realistic is to expect that the host
CPUs are near fully committed to work, and thus the (default) goal should be
to minimize CPU overhead for the host as a whole. From this POV, zero-page
detection is better than compression due to > x10 better speed.

Given the CPU overheads of compression, I think it has fairly narrow use
in migration in general when considering hosts are often highly committed
on CPU.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-20  4:52     ` [Qemu-devel] " Peter Xu
@ 2018-06-28 10:02       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-28 10:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, mst, peterz, Lai Jiangshan, stefani, mtosatti,
	Xiao Guangrong, dgilbert, qemu-devel, wei.w.wang, jiang.biao2,
	pbonzini, paulmck


CC: Paul, Peter Zijlstra, Stefani, Lai, who are all good at memory barriers.


On 06/20/2018 12:52 PM, Peter Xu wrote:
> On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> It's a simple lockless ring buffer implementation which supports both
>> single producer vs. single consumer and multiple producers vs.
>> single consumer.
>>
>> Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's
>> rte_ring (2) before I wrote this implementation. It corrects some
>> memory-barrier bugs in kfifo and is a simpler lockless version of
>> rte_ring, as multiple access is currently only allowed for the producer.
> 
> Could you provide some more information about the kfifo bug?  Any
> pointer would be appreciated.
> 

Sure, I reported one of the memory barrier issues to the Linux kernel:
    https://lkml.org/lkml/2018/5/11/58

Actually, besides that, there is another memory barrier issue in kfifo;
please consider this case:

    at the beginning
    ring->size = 4
    ring->out = 0
    ring->in = 4

      Consumer                            Producer
  ---------------                     --------------
    index = ring->out; /* index == 0 */
    ring->out++; /* ring->out == 1 */
    < Re-Order >
                                     out = ring->out;
                                     if (ring->in - out >= ring->mask)
                                         return -EFULL;
                                     /* see the ring is not full */
                                     index = ring->in & ring->mask; /* index == 0 */
                                     ring->data[index] = new_data;
                                     ring->in++;

    data = ring->data[index];
    !!!!!! the old data is lost !!!!!!

So we need to make sure:
1) for the consumer, we should read the ring->data[] out before updating ring->out
2) for the producer, we should read ring->out before updating ring->data[]

as follows:
       Producer                                       Consumer
   ------------------------------------         ------------------------
       Reading ring->out                            Reading ring->data[index]
       smp_mb()                                     smp_mb()
       Setting ring->data[index] = data             ring->out++

[ I used atomic_store_release() and atomic_load_acquire() instead of smp_mb() in the
   patch. ]

But I am not sure whether we can use smp_acquire__after_ctrl_dep() in the producer.
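
(For what it's worth, expressed with plain C11 atomics instead of the
kernel/QEMU macros, the pairing described above for the single-producer,
single-consumer case is roughly the sketch below. It is an illustration of the
ordering argument only, not the patch code:)

    #include <stdatomic.h>
    #include <errno.h>
    #include <stddef.h>

    /* SPSC ring sketch; 'mask' is size - 1 and size is a power of two */
    typedef struct {
        unsigned int mask;
        _Atomic unsigned int in;     /* written by the producer only */
        _Atomic unsigned int out;    /* written by the consumer only */
        void *data[];
    } SpscRing;

    static int spsc_put(SpscRing *r, void *data)
    {
        unsigned int in = atomic_load_explicit(&r->in, memory_order_relaxed);
        /* acquire pairs with the consumer's release store to 'out': we may
         * only see the advanced 'out' (and thus reuse the slot) after the
         * consumer has finished reading data[out] */
        unsigned int out = atomic_load_explicit(&r->out, memory_order_acquire);

        if (in - out > r->mask) {
            return -ENOBUFS;                      /* full */
        }
        r->data[in & r->mask] = data;
        /* release pairs with the consumer's acquire load of 'in': the data
         * store above becomes visible before the new 'in' does */
        atomic_store_explicit(&r->in, in + 1, memory_order_release);
        return 0;
    }

    static void *spsc_get(SpscRing *r)
    {
        unsigned int out = atomic_load_explicit(&r->out, memory_order_relaxed);
        unsigned int in = atomic_load_explicit(&r->in, memory_order_acquire);
        void *data;

        if (in == out) {
            return NULL;                          /* empty */
        }
        data = r->data[out & r->mask];
        atomic_store_explicit(&r->out, out + 1, memory_order_release);
        return data;
    }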

>>
>> With a single producer and a single consumer it behaves as a traditional
>> FIFO. With multiple producers it uses the following algorithm:
>>
>> For the producer, it uses two steps to update the ring:
>>     - first step, occupy the entry in the ring:
>>
>> retry:
>>        in = ring->in;
>>        if (cmpxchg(&ring->in, in, in + 1) != in)
>>              goto retry;
>>
>>       after that the entry pointed by ring->data[in] has been owned by
>>       the producer.
>>
>>       assert(ring->data[in] == NULL);
>>
>>       Note, no other producer can touch this entry so that this entry
>>       should always be the initialized state.
>>
>>     - second step, write the data to the entry:
>>
>>       ring->data[in] = data;
>>
>> For the consumer, it first checks if there is available entry in the
>> ring and fetches the entry from the ring:
>>
>>       if (!ring_is_empty(ring))
>>            entry = &ring[ring->out];
>>
>>       Note: the ring->out has not been updated so that the entry pointed
>>       by ring->out is completely owned by the consumer.
>>
>> Then it checks if the data is ready:
>>
>> retry:
>>       if (*entry == NULL)
>>              goto retry;
>> That means, the producer has updated the index but haven't written any
>> data to it.
>>
>> Finally, it fetches the valid data out, set the entry to the initialized
>> state and update ring->out to make the entry be usable to the producer:
>>
>>        data = *entry;
>>        *entry = NULL;
>>        ring->out++;
>>
>> Memory barrier is omitted here, please refer to the comment in the code.
>>
>> (1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
>> (2) http://dpdk.org/doc/api/rte__ring_8h.html
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
>> ---
>>   migration/ring.h | 265 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> If this is a very general implementation, not sure whether we can move
> this to util/ directory so that it can be used even outside migration
> codes.

I thought about that too. Currently migration is the only user, so I put
it near the migration code. I am fine with moving it to util/ if you
prefer.

> 
>>   1 file changed, 265 insertions(+)
>>   create mode 100644 migration/ring.h
>>
>> diff --git a/migration/ring.h b/migration/ring.h
>> new file mode 100644
>> index 0000000000..da9b8bdcbb
>> --- /dev/null
>> +++ b/migration/ring.h
>> @@ -0,0 +1,265 @@
>> +/*
>> + * Ring Buffer
>> + *
>> + * Multiple producers and single consumer are supported with lock free.
>> + *
>> + * Copyright (c) 2018 Tencent Inc
>> + *
>> + * Authors:
>> + *  Xiao Guangrong <xiaoguangrong@tencent.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef _RING__
>> +#define _RING__
>> +
>> +#define CACHE_LINE  64
> 
> Is this for x86_64?  Is the cache line size the same for all arch?

64 bytes is just a common size. :)

Does QEMU support pre-configured CACHE_SIZE?
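
(For what it's worth, on Linux with glibc the L1 data cache line size can be
queried at run time, e.g. with the sketch below; this is just an aside about
how the value could be discovered, not a claim about what the header should do:)

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* glibc-specific; returns 0 or -1 when the value is unknown */
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

        if (line <= 0) {
            line = 64;    /* fall back to the common 64-byte guess */
        }
        printf("L1 dcache line size: %ld bytes\n", line);
        return 0;
    }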

> 
>> +#define cache_aligned __attribute__((__aligned__(CACHE_LINE)))
>> +
>> +#define RING_MULTI_PRODUCER 0x1
>> +
>> +struct Ring {
>> +    unsigned int flags;
>> +    unsigned int size;
>> +    unsigned int mask;
>> +
>> +    unsigned int in cache_aligned;
>> +
>> +    unsigned int out cache_aligned;
>> +
>> +    void *data[0] cache_aligned;
>> +};
>> +typedef struct Ring Ring;
>> +
>> +/*
>> + * allocate and initialize the ring
>> + *
>> + * @size: the number of element, it should be power of 2
>> + * @flags: set to RING_MULTI_PRODUCER if the ring has multiple producer,
>> + *         otherwise set it to 0, i,e. single producer and single consumer.
>> + *
>> + * return the ring.
>> + */
>> +static inline Ring *ring_alloc(unsigned int size, unsigned int flags)
>> +{
>> +    Ring *ring;
>> +
>> +    assert(is_power_of_2(size));
>> +
>> +    ring = g_malloc0(sizeof(*ring) + size * sizeof(void *));
>> +    ring->size = size;
>> +    ring->mask = ring->size - 1;
>> +    ring->flags = flags;
>> +    return ring;
>> +}
>> +
>> +static inline void ring_free(Ring *ring)
>> +{
>> +    g_free(ring);
>> +}
>> +
>> +static inline bool __ring_is_empty(unsigned int in, unsigned int out)
>> +{
>> +    return in == out;
>> +}
> 
> (some of the helpers are a bit confusing to me like this one; I would
>   prefer some of the helpers be directly squashed into code, but it's a
>   personal preference only)
> 

I will carefully consider it in the next version...

>> +
>> +static inline bool ring_is_empty(Ring *ring)
>> +{
>> +    return ring->in == ring->out;
>> +}
>> +
>> +static inline unsigned int ring_len(unsigned int in, unsigned int out)
>> +{
>> +    return in - out;
>> +}
> 
> (this too)
> 
>> +
>> +static inline bool
>> +__ring_is_full(Ring *ring, unsigned int in, unsigned int out)
>> +{
>> +    return ring_len(in, out) > ring->mask;
>> +}
>> +
>> +static inline bool ring_is_full(Ring *ring)
>> +{
>> +    return __ring_is_full(ring, ring->in, ring->out);
>> +}
>> +
>> +static inline unsigned int ring_index(Ring *ring, unsigned int pos)
>> +{
>> +    return pos & ring->mask;
>> +}
>> +
>> +static inline int __ring_put(Ring *ring, void *data)
>> +{
>> +    unsigned int index, out;
>> +
>> +    out = atomic_load_acquire(&ring->out);
>> +    /*
>> +     * smp_mb()
>> +     *
>> +     * should read ring->out before updating the entry, see the comments in
>> +     * __ring_get().
> 
> Nit: here I think it means the comment in [1] below.  Maybe:
> 
>    "see the comments in __ring_get() when calling
>     atomic_store_release()"
> 
> ?

Yes, you are right; I will address your suggestion.

> 
>> +     */
>> +
>> +    if (__ring_is_full(ring, ring->in, out)) {
>> +        return -ENOBUFS;
>> +    }
>> +
>> +    index = ring_index(ring, ring->in);
>> +
>> +    atomic_set(&ring->data[index], data);
>> +
>> +    /*
>> +     * should make sure the entry is updated before increasing ring->in
>> +     * otherwise the consumer will get a entry but its content is useless.
>> +     */
>> +    smp_wmb();
>> +    atomic_set(&ring->in, ring->in + 1);
> 
> Pure question: could we use store_release() instead of a mixture of
> store/release and raw memory barriers in the function?  Or is there
> any performance consideration behind?
> 
> It'll be nice to mention the performance considerations if there is.

I think atomic_mb_read() and atomic_mb_set() are what you are
talking about. These operations speed up read accesses but
slow down write accesses, which is not suitable for our case.

> 
>> +    return 0;
>> +}
>> +
>> +static inline void *__ring_get(Ring *ring)
>> +{
>> +    unsigned int index, in;
>> +    void *data;
>> +
>> +    in = atomic_read(&ring->in);
>> +
>> +    /*
>> +     * should read ring->in first to make sure the entry pointed by this
>> +     * index is available, see the comments in __ring_put().
>> +     */
> 
> Nit: similar to above, maybe mention about which comment would be a
> bit nicer.

Yes, will improve it.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-28 10:02       ` [Qemu-devel] " Xiao Guangrong
@ 2018-06-28 11:55         ` Wei Wang
  -1 siblings, 0 replies; 156+ messages in thread
From: Wei Wang @ 2018-06-28 11:55 UTC (permalink / raw)
  To: Xiao Guangrong, Peter Xu
  Cc: kvm, mst, peterz, Lai Jiangshan, stefani, mtosatti,
	Xiao Guangrong, dgilbert, qemu-devel, jiang.biao2, pbonzini,
	paulmck

On 06/28/2018 06:02 PM, Xiao Guangrong wrote:
>
> CC: Paul, Peter Zijlstra, Stefani, Lai who are all good at memory 
> barrier.
>
>
> On 06/20/2018 12:52 PM, Peter Xu wrote:
>> On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com 
>> wrote:
>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>
>>> It's the simple lockless ring buffer implement which supports both
>>> single producer vs. single consumer and multiple producers vs.
>>> single consumer.
>>>
>>> Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's
>>> rte_ring (2) before i wrote this implement. It corrects some bugs of
>>> memory barriers in kfifo and it is the simpler lockless version of
>>> rte_ring as currently multiple access is only allowed for producer.
>>
>> Could you provide some more information about the kfifo bug? Any
>> pointer would be appreciated.
>>
>
> Sure, i reported one of the memory barrier issue to linux kernel:
>    https://lkml.org/lkml/2018/5/11/58
>
> Actually, beside that, there is another memory barrier issue in kfifo,
> please consider this case:
>
>    at the beginning
>    ring->size = 4
>    ring->out = 0
>    ring->in = 4
>
>      Consumer                            Producer
>  ---------------                     --------------
>    index = ring->out; /* index == 0 */
>    ring->out++; /* ring->out == 1 */
>    < Re-Order >
>                                     out = ring->out;
>                                     if (ring->in - out >= ring->mask)
>                                         return -EFULL;
>                                     /* see the ring is not full */
>                                     index = ring->in & ring->mask; /* index == 0 */
>                                     ring->data[index] = new_data;
>                                     ring->in++;
>
>    data = ring->data[index];
>    !!!!!! the old data is lost !!!!!!
>
> So we need to make sure:
> 1) for the consumer, we should read the ring->data[] out before 
> updating ring->out
> 2) for the producer, we should read ring->out before updating 
> ring->data[]
>
> as followings:
>       Producer                              Consumer
>   ------------------------------------      ------------------------
>       Reading ring->out                      Reading ring->data[index]
>       smp_mb()                               smp_mb()
>       Setting ring->data[index] = data       ring->out++
>
> [ i used atomic_store_release() and atomic_load_acquire() instead of 
> smp_mb() in the
>   patch. ]
>
> But i am not sure if we can use smp_acquire__after_ctrl_dep() in the 
> producer?


I wonder if this could be solved by simply tweaking the above consumer 
implementation:

[1] index = ring->out;
[2] data = ring->data[index];
[3] index++;
[4] ring->out = index;

Now [2] and [3] form a WAR dependency, which avoids the reordering.


Best,
Wei

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-06-28 13:36     ` Jason Wang
  -1 siblings, 0 replies; 156+ messages in thread
From: Jason Wang @ 2018-06-28 13:36 UTC (permalink / raw)
  To: guangrong.xiao, pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2



On 2018年06月04日 17:55, guangrong.xiao@gmail.com wrote:
> From: Xiao Guangrong<xiaoguangrong@tencent.com>
>
> It's the simple lockless ring buffer implement which supports both
> single producer vs. single consumer and multiple producers vs.
> single consumer.
>
> Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's
> rte_ring (2) before i wrote this implement. It corrects some bugs of
> memory barriers in kfifo and it is the simpler lockless version of
> rte_ring as currently multiple access is only allowed for producer.
>
> If has single producer vs. single consumer, it is the traditional fifo,
> If has multiple producers, it uses the algorithm as followings:
>
> For the producer, it uses two steps to update the ring:
>     - first step, occupy the entry in the ring:
>
> retry:
>        in = ring->in
>        if (cmpxhg(&ring->in, in, in +1) != in)
>              goto retry;
>
>       after that the entry pointed by ring->data[in] has been owned by
>       the producer.
>
>       assert(ring->data[in] == NULL);
>
>       Note, no other producer can touch this entry so that this entry
>       should always be the initialized state.
>
>     - second step, write the data to the entry:
>
>       ring->data[in] = data;
>
> For the consumer, it first checks if there is available entry in the
> ring and fetches the entry from the ring:
>
>       if (!ring_is_empty(ring))
>            entry = &ring[ring->out];
>
>       Note: the ring->out has not been updated so that the entry pointed
>       by ring->out is completely owned by the consumer.
>
> Then it checks if the data is ready:
>
> retry:
>       if (*entry == NULL)
>              goto retry;
> That means, the producer has updated the index but haven't written any
> data to it.
>
> Finally, it fetches the valid data out, set the entry to the initialized
> state and update ring->out to make the entry be usable to the producer:
>
>        data = *entry;
>        *entry = NULL;
>        ring->out++;
>
> Memory barrier is omitted here, please refer to the comment in the code.
>
> (1)https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
> (2)http://dpdk.org/doc/api/rte__ring_8h.html
>
> Signed-off-by: Xiao Guangrong<xiaoguangrong@tencent.com>
> ---

May I ask why you need an MPSC ring here? Can we just use N SPSC rings for 
submitting pages and another N SPSC rings for passing back results?

Thanks

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-20  5:55     ` [Qemu-devel] " Peter Xu
@ 2018-06-28 14:00       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-28 14:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini



On 06/20/2018 01:55 PM, Peter Xu wrote:
> On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com wrote:
> 
> [...]
> 
> (Some more comments/questions for the MP implementation...)
> 
>> +static inline int ring_mp_put(Ring *ring, void *data)
>> +{
>> +    unsigned int index, in, in_next, out;
>> +
>> +    do {
>> +        in = atomic_read(&ring->in);
>> +        out = atomic_read(&ring->out);
> 
> [0]
> 
> Do we need to fetch "out" with load_acquire()?  Otherwise what's the
> pairing of below store_release() at [1]?
> 

The barrier paired with [1] is the full barrier implied in atomic_cmpxchg().

> This barrier exists in SP-SC case which makes sense to me, I assume
> that's also needed for MP-SC case, am I right?

We needn't put a memory barrier here as we do not need to care about the
order between these two indexes (in and out); instead, the memory barrier
(for the SP-SC case as well) is used to order ring->out against updating
ring->data[], as we explained in the previous mail.
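
Put differently, the pairing being described looks roughly like this (a
sketch following the names used in the quoted patch, not the literal code):

    /* ring_mp_put() side: read ring->out, then touch the entry; the
     * ordering comes from the full barrier implied by atomic_cmpxchg() */
    out = atomic_read(&ring->out);
    /* ... atomic_cmpxchg(&ring->in, in, in_next) ... */
    assert(ring->data[index] == NULL);
    ring->data[index] = data;

    /* ring_mp_get() side, barrier (A): reset the entry before recycling
     * it by advancing ring->out */
    data = ring->data[index];
    ring->data[index] = NULL;
    smp_wmb();
    ring->out = out + 1;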

> 
>> +
>> +        if (__ring_is_full(ring, in, out)) {
>> +            if (atomic_read(&ring->in) == in &&
>> +                atomic_read(&ring->out) == out) {
> 
> Why read again?  After all the ring API seems to be designed as
> non-blocking.  E.g., I see the poll at [2] below makes more sense
> since when reaches [2] it means that there must be a producer that is
> _doing_ the queuing, so polling is very possible to complete fast.
> However here it seems to be a pure busy poll without any hint.  Then
> not sure whether we should just let the caller decide whether it wants
> to call ring_put() again.
> 

Without it we can easily observe a "strange" behavior where the thread fails
to put its result into the global ring even though we allocated enough room
for the global ring (its capacity >= total requests). That is because these
two indexes can be updated at any time; consider that multiple get and put
operations can complete between reading ring->in and ring->out, so ring->in
can very possibly pass the value read from ring->out.

With this code, the negative case only happens if these two indexes (32 bits)
overflow to the same value, which also helps us catch potential bugs in the
code.

>> +                return -ENOBUFS;
>> +            }
>> +
>> +            /* a entry has been fetched out, retry. */
>> +            continue;
>> +        }
>> +
>> +        in_next = in + 1;
>> +    } while (atomic_cmpxchg(&ring->in, in, in_next) != in);
>> +
>> +    index = ring_index(ring, in);
>> +
>> +    /*
>> +     * smp_rmb() paired with the memory barrier of (A) in ring_mp_get()
>> +     * is implied in atomic_cmpxchg() as we should read ring->out first
>> +     * before fetching the entry, otherwise this assert will fail.
> 
> Thanks for all these comments!  These are really helpful for
> reviewers.
> 
> However I'm not sure whether I understand it correctly here on MB of
> (A) for ring_mp_get() - AFAIU that should corresponds to a smp_rmb()
> at [0] above when reading the "out" variable rather than this
> assertion, and that's why I thought at [0] we should have something
> like a load_acquire() there (which contains a rmb()).

Memory barrier (A) in ring_mp_get() ensures the ordering of:
    ring->data[index] = NULL;
    smp_wmb();
    ring->out = out + 1;

And the memory barrier at [0] ensures the ordering of:
    out = ring->out;
    /* smp_rmb() */
    compxchg();
    value = ring->data[index];
    assert(value);

[ note: the assertion and the read of ring->out sit on opposite sides of cmpxchg(). ]

Did I understand your question correctly?

> 
>  From content-wise, I think the code here is correct, since
> atomic_cmpxchg() should have one implicit smp_mb() after all so we
> don't need anything further barriers here.

Yes, it is.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 10/12] migration: introduce lockless multithreads model
  2018-06-20  6:52     ` [Qemu-devel] " Peter Xu
@ 2018-06-28 14:25       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-28 14:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini



On 06/20/2018 02:52 PM, Peter Xu wrote:
> On Mon, Jun 04, 2018 at 05:55:18PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> Current implementation of compression and decompression are very
>> hard to be enabled on productions. We noticed that too many wait-wakes
>> go to kernel space and CPU usages are very low even if the system
>> is really free
>>

> Not sure how other people think, for me these information suites
> better as cover letter.  For commit message, I would prefer to know
> about something like: what this thread model can do; how the APIs are
> designed and used; what's the limitations, etc.  After all until this
> patch nowhere is using the new model yet, so these numbers are a bit
> misleading.
> 

Yes, I completely agree with you; I will remove it from the changelog.

>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
>> ---
>>   migration/Makefile.objs |   1 +
>>   migration/threads.c     | 265 ++++++++++++++++++++++++++++++++++++++++++++++++
>>   migration/threads.h     | 116 +++++++++++++++++++++
> 
> Again, this model seems to be suitable for scenarios even outside
> migration.  So I'm not sure whether you'd like to generalize it (I
> still see e.g. constants and comments related to migration, but there
> aren't much) and put it into util/.

Sure, that works for me. :)

> 
>>   3 files changed, 382 insertions(+)
>>   create mode 100644 migration/threads.c
>>   create mode 100644 migration/threads.h
>>
>> diff --git a/migration/Makefile.objs b/migration/Makefile.objs
>> index c83ec47ba8..bdb61a7983 100644
>> --- a/migration/Makefile.objs
>> +++ b/migration/Makefile.objs
>> @@ -7,6 +7,7 @@ common-obj-y += qemu-file-channel.o
>>   common-obj-y += xbzrle.o postcopy-ram.o
>>   common-obj-y += qjson.o
>>   common-obj-y += block-dirty-bitmap.o
>> +common-obj-y += threads.o
>>   
>>   common-obj-$(CONFIG_RDMA) += rdma.o
>>   
>> diff --git a/migration/threads.c b/migration/threads.c
>> new file mode 100644
>> index 0000000000..eecd3229b7
>> --- /dev/null
>> +++ b/migration/threads.c
>> @@ -0,0 +1,265 @@
>> +#include "threads.h"
>> +
>> +/* retry to see if there is avilable request before actually go to wait. */
>> +#define BUSY_WAIT_COUNT 1000
>> +
>> +static void *thread_run(void *opaque)
>> +{
>> +    ThreadLocal *self_data = (ThreadLocal *)opaque;
>> +    Threads *threads = self_data->threads;
>> +    void (*handler)(ThreadRequest *data) = threads->thread_request_handler;
>> +    ThreadRequest *request;
>> +    int count, ret;
>> +
>> +    for ( ; !atomic_read(&self_data->quit); ) {
>> +        qemu_event_reset(&self_data->ev);
>> +
>> +        count = 0;
>> +        while ((request = ring_get(self_data->request_ring)) ||
>> +            count < BUSY_WAIT_COUNT) {
>> +             /*
>> +             * wait some while before go to sleep so that the user
>> +             * needn't go to kernel space to wake up the consumer
>> +             * threads.
>> +             *
>> +             * That will waste some CPU resource indeed however it
>> +             * can significantly improve the case that the request
>> +             * will be available soon.
>> +             */
>> +             if (!request) {
>> +                cpu_relax();
>> +                count++;
>> +                continue;
>> +            }
>> +            count = 0;
>> +
>> +            handler(request);
>> +
>> +            do {
>> +                ret = ring_put(threads->request_done_ring, request);
>> +                /*
>> +                 * request_done_ring has enough room to contain all
>> +                 * requests, however, theoretically, it still can be
>> +                 * fail if the ring's indexes are overflow that would
>> +                 * happen if there is more than 2^32 requests are
> 
> Could you elaborate why this ring_put() could fail, and why failure is
> somehow related to 2^32 overflow?
> 
> Firstly, I don't understand why it will fail.

As we explained in the previous mail:

| Without it we can easily observe a "strange" behavior that the thread will
| put the result to the global ring failed even if we allocated enough room
| for the global ring (its capability >= total requests), that's because
| these two indexes can be updated at anytime, consider the case that multiple
| get and put operations can be finished between reading ring->in and ring->out
| so that very possibly ring->in can pass the value readed from ring->out.
|
| Having this code, the negative case only happens if these two indexes (32 bits)
| overflows to the same value, that can help us to catch potential bug in the
| code.
> 
> Meanwhile, AFAIU your ring can even live well with that 2^32 overflow.
> Or did I misunderstood?

Please refer to the code:
+        if (__ring_is_full(ring, in, out)) {
+            if (atomic_read(&ring->in) == in &&
+                atomic_read(&ring->out) == out) {
+                return -ENOBUFS;
+            }

As we allocated enough room for this global ring, the only case in which
putting data can fail is when the indexes have overflowed to the same value.

Those (possibly 2^32) get/put operations would have happened on the other
threads and the main thread while this thread was reading the two indexes.


>> +                 * handled between two calls of threads_wait_done().
>> +                 * So we do retry to make the code more robust.
>> +                 *
>> +                 * It is unlikely the case for migration as the block's
>> +                 * memory is unlikely more than 16T (2^32 pages) memory.
> 
> (some migration-related comments; maybe we can remove that)

Okay, I will consider doing that to make it more general.

>> +Threads *threads_create(unsigned int threads_nr, const char *name,
>> +                        ThreadRequest *(*thread_request_init)(void),
>> +                        void (*thread_request_uninit)(ThreadRequest *request),
>> +                        void (*thread_request_handler)(ThreadRequest *request),
>> +                        void (*thread_request_done)(ThreadRequest *request))
>> +{
>> +    Threads *threads;
>> +    int ret;
>> +
>> +    threads = g_malloc0(sizeof(*threads) + threads_nr * sizeof(ThreadLocal));
>> +    threads->threads_nr = threads_nr;
>> +    threads->thread_ring_size = THREAD_REQ_RING_SIZE;
> 
> (If we're going to generalize this thread model, maybe you'd consider
>   to allow specify this ring size as well?)

Good point, will do it.

> 
>> +    threads->total_requests = threads->thread_ring_size * threads_nr;
>> +
>> +    threads->name = name;
>> +    threads->thread_request_init = thread_request_init;
>> +    threads->thread_request_uninit = thread_request_uninit;
>> +    threads->thread_request_handler = thread_request_handler;
>> +    threads->thread_request_done = thread_request_done;
>> +
>> +    ret = init_requests(threads);
>> +    if (ret) {
>> +        g_free(threads);
>> +        return NULL;
>> +    }
>> +
>> +    init_thread_data(threads);
>> +    return threads;
>> +}
>> +
>> +void threads_destroy(Threads *threads)
>> +{
>> +    uninit_thread_data(threads);
>> +    uninit_requests(threads, threads->total_requests);
>> +    g_free(threads);
>> +}
>> +
>> +ThreadRequest *threads_submit_request_prepare(Threads *threads)
>> +{
>> +    ThreadRequest *request;
>> +    unsigned int index;
>> +
>> +    index = threads->current_thread_index % threads->threads_nr;
> 
> Why round-robin rather than simply find a idle thread (still with
> valid free requests) and put the request onto that?
> 
> Asked since I don't see much difficulty to achieve that, meanwhile for
> round-robin I'm not sure whether it can happen that one thread stuck
> due to some reason (e.g., scheduling reason?), while the rest of the
> threads are idle, then would threads_submit_request_prepare() be stuck
> for that hanging thread?
> 

Your concern is reasonable indeed; however, round-robin is the simplest
algorithm for pushing a request to the threads without probing each thread
to find the least loaded one, which keeps the main thread fast enough.

And I think it generally works reasonably well on a load-balanced system.
Furthermore, the configuration we recommend is that if the user uses N
threads for compression, he should make sure the system has enough CPU
resources to run those N threads.

We can improve it after this basic framework gets merged, by using a more
advanced distribution approach if we see it's needed in the future.
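
For reference, the round-robin selection being defended here boils down to
something like the condensed sketch below (field names follow the quoted
patch; get_free_request() is a hypothetical helper standing in for popping
a pre-allocated request for that thread):

    ThreadRequest *threads_submit_request_prepare(Threads *threads)
    {
        unsigned int index;

        /* round-robin: no per-thread load probing, so the main
         * thread stays cheap */
        index = threads->current_thread_index++ % threads->threads_nr;

        /* returns NULL when that thread has no free request, letting
         * the caller fall back to the normal (uncompressed) path */
        return get_free_request(threads, index);   /* hypothetical helper */
    }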

>> diff --git a/migration/threads.h b/migration/threads.h
>> new file mode 100644
>> index 0000000000..eced913065
>> --- /dev/null
>> +++ b/migration/threads.h
>> @@ -0,0 +1,116 @@
>> +#ifndef QEMU_MIGRATION_THREAD_H
>> +#define QEMU_MIGRATION_THREAD_H
>> +
>> +/*
>> + * Multithreads abstraction
>> + *
>> + * This is the abstraction layer for multithreads management which is
>> + * used to speed up migration.
>> + *
>> + * Note: currently only one producer is allowed.
>> + *
>> + * Copyright(C) 2018 Tencent Corporation.
>> + *
>> + * Author:
>> + *   Xiao Guangrong <xiaoguangrong@tencent.com>
>> + *
>> + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
>> + * See the COPYING.LIB file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
> 
> I was told (more than once) that we should not include "osdep.h" in
> headers. :) I'll suggest you include that in the source file.

Okay, good to know. :)

> 
>> +#include "hw/boards.h"
> 
> Why do we need this header?

Well, I need to figure out the right header files to include for the
declarations we use. :)

> 
>> +
>> +#include "ring.h"
>> +
>> +/*
>> + * the request representation which contains the internally used mete data,
>> + * it can be embedded to user's self-defined data struct and the user can
>> + * use container_of() to get the self-defined data
>> + */
>> +struct ThreadRequest {
>> +    QSLIST_ENTRY(ThreadRequest) node;
>> +    unsigned int thread_index;
>> +};
>> +typedef struct ThreadRequest ThreadRequest;
>> +
>> +struct Threads;
>> +
>> +struct ThreadLocal {
>> +    QemuThread thread;
>> +
>> +    /* the event used to wake up the thread */
>> +    QemuEvent ev;
>> +
>> +    struct Threads *threads;
>> +
>> +    /* local request ring which is filled by the user */
>> +    Ring *request_ring;
>> +
>> +    /* the index of the thread */
>> +    int self;
>> +
>> +    /* thread is useless and needs to exit */
>> +    bool quit;
>> +};
>> +typedef struct ThreadLocal ThreadLocal;
>> +
>> +/*
>> + * the main data struct represents multithreads which is shared by
>> + * all threads
>> + */
>> +struct Threads {
>> +    const char *name;
>> +    unsigned int threads_nr;
>> +    /* the request is pushed to the thread with round-robin manner */
>> +    unsigned int current_thread_index;
>> +
>> +    int thread_ring_size;
>> +    int total_requests;
>> +
>> +    /* the request is pre-allocated and linked in the list */
>> +    int free_requests_nr;
>> +    QSLIST_HEAD(, ThreadRequest) free_requests;
>> +
>> +    /* the constructor of request */
>> +    ThreadRequest *(*thread_request_init)(void);
>> +    /* the destructor of request */
>> +    void (*thread_request_uninit)(ThreadRequest *request);
>> +    /* the handler of the request which is called in the thread */
>> +    void (*thread_request_handler)(ThreadRequest *request);
>> +    /*
>> +     * the handler to process the result which is called in the
>> +     * user's context
>> +     */
>> +    void (*thread_request_done)(ThreadRequest *request);
>> +
>> +    /* the thread push the result to this ring so it has multiple producers */
>> +    Ring *request_done_ring;
>> +
>> +    ThreadLocal per_thread_data[0];
>> +};
>> +typedef struct Threads Threads;
> 
> Not sure whether we can move Threads/ThreadLocal definition into the
> source file, then we only expose the struct definition, along with the
> APIs.

Yup, that's better indeed, thank you, Peter!

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 06/12] migration: do not detect zero page for compression
  2018-06-28  9:36         ` [Qemu-devel] " Daniel P. Berrangé
@ 2018-06-29  3:50           ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-29  3:50 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, Peter Xu,
	qemu-devel, wei.w.wang, pbonzini, jiang.biao2


Hi Daniel,

On 06/28/2018 05:36 PM, Daniel P. Berrangé wrote:
> On Thu, Jun 28, 2018 at 05:12:39PM +0800, Xiao Guangrong wrote:
>>

>> After this patch, the workload is moved to the worker thread, is it
>> acceptable?
> 
> It depends on your point of view. If you have spare / idle CPUs on the host,
> then moving workload to a thread is ok, despite the CPU cost of compression
> in that thread being much higher what what was replaced, since you won't be
> taking CPU resources away from other contending workloads.
> 
> I'd venture to suggest though that we should probably *not* be optimizing for
> the case of idle CPUs on the host. More realistic is to expect that the host
> CPUs are near fully committed to work, and thus the (default) goal should be
> to minimize CPU overhead for the host as a whole. From this POV, zero-page
> detection is better than compression due to > x10 better speed.
> 
> Given the CPU overheads of compression, I think it has fairly narrow use
> in migration in general when considering hosts are often highly committed
> on CPU.

I understand your concern; however, it is not as bad as it looks.

First, we tolerate the case where a thread runs a little slowly - we do
not wait for the thread to become free; instead, we directly post the page
out as normal (zero-page detection still works for the normal page), so it
at least keeps the performance no worse than the case where compression is
not used.
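
A hedged sketch of that fallback path, where threads_submit_request_prepare()
comes from this patchset and save_normal_page() plus the surrounding variable
names are only stand-ins for the existing uncompressed path:

    request = threads_submit_request_prepare(compress_threads);
    if (!request) {
        /* every worker is busy: do not wait, send the page through the
         * normal path (zero-page detection still applies there) */
        return save_normal_page(rs, block, offset, p);  /* stand-in name */
    }
    /* otherwise hand the page to a worker thread for compression */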

Second, we think a reasonable configuration is one where the system has
enough CPU resources to run the number of threads the user configured.
BTW, the work we will post later makes these parameters configurable at
runtime, so a control daemon (e.g., libvirt) can adjust them based on the
current load of the system and the statistics from QEMU. This is the topic
we submitted to KVM Forum this year; we hope it gets accepted.

Finally, we have a watermark to trigger live migration on a load-balanced
production system; the watermark makes sure there is some free CPU
resource left for other work.

Thanks!

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-28 11:55         ` [Qemu-devel] " Wei Wang
@ 2018-06-29  3:55           ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-29  3:55 UTC (permalink / raw)
  To: Wei Wang, Peter Xu
  Cc: kvm, mst, peterz, Lai Jiangshan, stefani, mtosatti,
	Xiao Guangrong, dgilbert, qemu-devel, jiang.biao2, pbonzini,
	paulmck



On 06/28/2018 07:55 PM, Wei Wang wrote:
> On 06/28/2018 06:02 PM, Xiao Guangrong wrote:
>>
>> CC: Paul, Peter Zijlstra, Stefani, Lai who are all good at memory barrier.
>>
>>
>> On 06/20/2018 12:52 PM, Peter Xu wrote:
>>> On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> It's the simple lockless ring buffer implement which supports both
>>>> single producer vs. single consumer and multiple producers vs.
>>>> single consumer.
>>>>
>>>> Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's
>>>> rte_ring (2) before i wrote this implement. It corrects some bugs of
>>>> memory barriers in kfifo and it is the simpler lockless version of
>>>> rte_ring as currently multiple access is only allowed for producer.
>>>
>>> Could you provide some more information about the kfifo bug? Any
>>> pointer would be appreciated.
>>>
>>
>> Sure, i reported one of the memory barrier issue to linux kernel:
>>    https://lkml.org/lkml/2018/5/11/58
>>
>> Actually, beside that, there is another memory barrier issue in kfifo,
>> please consider this case:
>>
>>    at the beginning
>>    ring->size = 4
>>    ring->out = 0
>>    ring->in = 4
>>
>>      Consumer                            Producer
>>  ---------------                     --------------
>>    index = ring->out; /* index == 0 */
>>    ring->out++; /* ring->out == 1 */
>>    < Re-Order >
>>                                     out = ring->out;
>>                                     if (ring->in - out >= ring->mask)
>>                                         return -EFULL;
>>                                     /* see the ring is not full */
>>                                     index = ring->in & ring->mask; /* index == 0 */
>>                                     ring->data[index] = new_data;
>>                      ring->in++;
>>
>>    data = ring->data[index];
>>    !!!!!! the old data is lost !!!!!!
>>
>> So we need to make sure:
>> 1) for the consumer, we should read the ring->data[] out before updating ring->out
>> 2) for the producer, we should read ring->out before updating ring->data[]
>>
>> as followings:
>>       Producer                              Consumer
>>   ------------------------------------      ------------------------
>>       Reading ring->out                      Reading ring->data[index]
>>       smp_mb()                               smp_mb()
>>       Setting ring->data[index] = data       ring->out++
>>
>> [ i used atomic_store_release() and atomic_load_acquire() instead of smp_mb() in the
>>   patch. ]
>>
>> But i am not sure if we can use smp_acquire__after_ctrl_dep() in the producer?
> 
> 
> I wonder if this could be solved by simply tweaking the above consumer implementation:
> 
> [1] index = ring->out;
> [2] data = ring->data[index];
> [3] index++;
> [4] ring->out = index;
> 
> Now [2] and [3] forms a WAR dependency, which avoids the reordering.

It cannot. [2] and [4] still do not have any dependency between them; the CPU
and the compiler can simply omit the intermediate 'index'.
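
To actually forbid that reordering, the consumer needs a release (or an
explicit barrier) between [2] and [4], e.g. a sketch using the helper
already discussed in this thread:

    index = ring->out;
    data  = ring->data[index];                    /* [2] */
    atomic_store_release(&ring->out, index + 1);  /* [4]: the release keeps
                                                     the load at [2] from
                                                     moving past it */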

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-28 13:36     ` [Qemu-devel] " Jason Wang
@ 2018-06-29  3:59       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-29  3:59 UTC (permalink / raw)
  To: Jason Wang, pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2



On 06/28/2018 09:36 PM, Jason Wang wrote:
> 
> 
> On 2018年06月04日 17:55, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong<xiaoguangrong@tencent.com>
>>
>> It's the simple lockless ring buffer implement which supports both
>> single producer vs. single consumer and multiple producers vs.
>> single consumer.
>>

>> Finally, it fetches the valid data out, set the entry to the initialized
>> state and update ring->out to make the entry be usable to the producer:
>>
>>        data = *entry;
>>        *entry = NULL;
>>        ring->out++;
>>
>> Memory barrier is omitted here, please refer to the comment in the code.
>>
>> (1)https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
>> (2)http://dpdk.org/doc/api/rte__ring_8h.html
>>
>> Signed-off-by: Xiao Guangrong<xiaoguangrong@tencent.com>
>> ---
> 
> May I ask why you need a MPSC ring here? Can we just use N SPSC ring for submitting pages and another N SPSC ring for passing back results?

Sure.

We had this option in mind; however, it is not scalable and would slow the
main thread down. Instead, we'd rather speed up the main thread and move a
reasonable amount of workload to the worker threads.
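
A rough sketch of what the main thread would have to do in the two designs
(ring_get()/ring_mp_get() and the Threads fields are from this patchset;
the per-thread result_ring is hypothetical):

    /* N SPSC result rings: the main thread polls every ring */
    for (i = 0; i < threads->threads_nr; i++) {
        while ((request = ring_get(threads->per_thread_data[i].result_ring))) {
            threads->thread_request_done(request);
        }
    }

    /* one MPSC ring: a single poll point, whatever the thread count */
    while ((request = ring_mp_get(threads->request_done_ring))) {
        threads->thread_request_done(request);
    }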

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-28 13:36     ` [Qemu-devel] " Jason Wang
@ 2018-06-29  4:23       ` Michael S. Tsirkin
  -1 siblings, 0 replies; 156+ messages in thread
From: Michael S. Tsirkin @ 2018-06-29  4:23 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, mtosatti, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, guangrong.xiao, jiang.biao2, pbonzini

On Thu, Jun 28, 2018 at 09:36:00PM +0800, Jason Wang wrote:
> 
> 
> On 2018年06月04日 17:55, guangrong.xiao@gmail.com wrote:
> > From: Xiao Guangrong<xiaoguangrong@tencent.com>
> > 
> > It's the simple lockless ring buffer implement which supports both
> > single producer vs. single consumer and multiple producers vs.
> > single consumer.
> > 
> > Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's
> > rte_ring (2) before i wrote this implement. It corrects some bugs of
> > memory barriers in kfifo and it is the simpler lockless version of
> > rte_ring as currently multiple access is only allowed for producer.
> > 
> > If has single producer vs. single consumer, it is the traditional fifo,
> > If has multiple producers, it uses the algorithm as followings:
> > 
> > For the producer, it uses two steps to update the ring:
> >     - first step, occupy the entry in the ring:
> > 
> > retry:
> >        in = ring->in
> >        if (cmpxhg(&ring->in, in, in +1) != in)
> >              goto retry;
> > 
> >       after that the entry pointed by ring->data[in] has been owned by
> >       the producer.
> > 
> >       assert(ring->data[in] == NULL);
> > 
> >       Note, no other producer can touch this entry so that this entry
> >       should always be the initialized state.
> > 
> >     - second step, write the data to the entry:
> > 
> >       ring->data[in] = data;
> > 
> > For the consumer, it first checks if there is available entry in the
> > ring and fetches the entry from the ring:
> > 
> >       if (!ring_is_empty(ring))
> >            entry = &ring[ring->out];
> > 
> >       Note: the ring->out has not been updated so that the entry pointed
> >       by ring->out is completely owned by the consumer.
> > 
> > Then it checks if the data is ready:
> > 
> > retry:
> >       if (*entry == NULL)
> >              goto retry;
> > That means, the producer has updated the index but haven't written any
> > data to it.
> > 
> > Finally, it fetches the valid data out, set the entry to the initialized
> > state and update ring->out to make the entry be usable to the producer:
> > 
> >        data = *entry;
> >        *entry = NULL;
> >        ring->out++;
> > 
> > Memory barrier is omitted here, please refer to the comment in the code.
> > 
> > (1)https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
> > (2)http://dpdk.org/doc/api/rte__ring_8h.html
> > 
> > Signed-off-by: Xiao Guangrong<xiaoguangrong@tencent.com>
> > ---
> 
> May I ask why you need a MPSC ring here? Can we just use N SPSC ring for
> submitting pages and another N SPSC ring for passing back results?
> 
> Thanks

Or just an SPSC ring + a lock.
How big of a gain is lockless access to a trivial structure
like the ring?
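
A rough sketch of that alternative, built on the patch's single-producer fast
path (ring_locked_put and the extra lock argument are made-up names, not part
of the patch):

/* "SPSC ring + a lock": serialize the producers with a mutex in front of
 * the single-producer put path; the single consumer stays lock-free. */
static inline int ring_locked_put(Ring *ring, QemuMutex *lock, void *data)
{
    int ret;

    qemu_mutex_lock(lock);
    ret = __ring_put(ring, data);
    qemu_mutex_unlock(lock);
    return ret;
}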

-- 
MST

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-29  3:59       ` [Qemu-devel] " Xiao Guangrong
@ 2018-06-29  6:15         ` Jason Wang
  -1 siblings, 0 replies; 156+ messages in thread
From: Jason Wang @ 2018-06-29  6:15 UTC (permalink / raw)
  To: Xiao Guangrong, pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2



On 2018年06月29日 11:59, Xiao Guangrong wrote:
>
>
> On 06/28/2018 09:36 PM, Jason Wang wrote:
>>
>>
>> On 2018年06月04日 17:55, guangrong.xiao@gmail.com wrote:
>>> From: Xiao Guangrong<xiaoguangrong@tencent.com>
>>>
>>> It's the simple lockless ring buffer implement which supports both
>>> single producer vs. single consumer and multiple producers vs.
>>> single consumer.
>>>
>
>>> Finally, it fetches the valid data out, set the entry to the 
>>> initialized
>>> state and update ring->out to make the entry be usable to the producer:
>>>
>>>        data = *entry;
>>>        *entry = NULL;
>>>        ring->out++;
>>>
>>> Memory barrier is omitted here, please refer to the comment in the 
>>> code.
>>>
>>> (1)https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h 
>>>
>>> (2)http://dpdk.org/doc/api/rte__ring_8h.html
>>>
>>> Signed-off-by: Xiao Guangrong<xiaoguangrong@tencent.com>
>>> ---
>>
>> May I ask why you need a MPSC ring here? Can we just use N SPSC ring 
>> for submitting pages and another N SPSC ring for passing back results?
>
> Sure.
>
> We had this option in our mind, however, it is not scalable which will 
> slow
> the main thread down, instead, we'd rather to speed up main thread and 
> move
> reasonable workload to the threads.

I don't quite understand the scalability issue here. Is it because the
main thread would need to go through all N rings (which I think it would not)?

Thanks

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-20 12:38     ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-06-29  7:30       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-29  7:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, mtosatti, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, jiang.biao2, pbonzini

[-- Attachment #1: Type: text/plain, Size: 3859 bytes --]


Hi Michael,

On 06/20/2018 08:38 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>

>>
>>
>> (1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
>> (2) http://dpdk.org/doc/api/rte__ring_8h.html
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> So instead of all this super-optimized trickiness, how about
> a simple port of ptr_ring from linux?
> 
> That one isn't lockless but it's known to outperform
> most others for a single producer/single consumer case.
> And with a ton of networking going on,
> who said it's such a hot spot? OTOH this implementation
> has more barriers which slows down each individual thread.
> It's also a source of bugs.
> 

Thank you for pointing it out.

I just quickly went through the ptr_ring code; it is very nice and really
impressive. I will consider porting it to QEMU.
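
(For reference, the key property that makes ptr_ring cheap is that the producer
and the consumer never share an index: each side keeps a private cursor and
they synchronize only through the slot contents, so no 'in'/'out' cache line
bounces between them. A conceptual sketch with made-up names, not the Linux
kernel API:)

/* Conceptual ptr_ring-style SPSC queue; illustrative only. */
typedef struct {
    void **slot;            /* 'size' entries, NULL means the slot is free */
    unsigned int size;
    unsigned int prod;      /* private to the producer */
    unsigned int cons;      /* private to the consumer */
} PtrQueueSketch;

static inline int ptrq_produce(PtrQueueSketch *q, void *ptr)
{
    if (atomic_read(&q->slot[q->prod])) {
        return -1;                          /* slot still in use: full */
    }
    atomic_store_release(&q->slot[q->prod], ptr);
    q->prod = (q->prod + 1) % q->size;      /* private index, no atomics */
    return 0;
}

static inline void *ptrq_consume(PtrQueueSketch *q)
{
    void *ptr = atomic_load_acquire(&q->slot[q->cons]);

    if (ptr) {
        atomic_set(&q->slot[q->cons], NULL);    /* hand the slot back */
        q->cons = (q->cons + 1) % q->size;
    }
    return ptr;
}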

> Further, atomic tricks this one uses are not fair so some threads can get
> completely starved while others make progress. There's also no
> chance to mix aggressive polling and sleeping with this
> kind of scheme, so the starved thread will consume lots of
> CPU.
> 
> So I'd like to see a simple ring used, and then a patch on top
> switching to this tricky one with performance comparison
> along with that.
> 

I agree with you. I will make a version that uses a lock for multiple
producers and do incremental optimizations on top of it.

>> ---
>>   migration/ring.h | 265 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 265 insertions(+)
>>   create mode 100644 migration/ring.h
>>
>> diff --git a/migration/ring.h b/migration/ring.h
>> new file mode 100644
>> index 0000000000..da9b8bdcbb
>> --- /dev/null
>> +++ b/migration/ring.h
>> @@ -0,0 +1,265 @@
>> +/*
>> + * Ring Buffer
>> + *
>> + * Multiple producers and single consumer are supported with lock free.
>> + *
>> + * Copyright (c) 2018 Tencent Inc
>> + *
>> + * Authors:
>> + *  Xiao Guangrong <xiaoguangrong@tencent.com>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef _RING__
>> +#define _RING__
> 
> Prefix Ring is too short.
> 

Okay, will improve it.

>> +    atomic_set(&ring->data[index], NULL);
>> +
>> +    /*
>> +     * (B) smp_mb() is needed as we should read the entry out before
>> +     * updating ring->out as we did in __ring_get().
>> +     *
>> +     * (A) smp_wmb() is needed as we should make the entry be NULL before
>> +     * updating ring->out (which will make the entry be visible and usable).
>> +     */
> 
> I can't say I understand this all.
> And the interaction of acquire/release semantics with smp_*
> barriers is even scarier.
> 

Hmm... the parallel accesses to these two indices and to the data stored
in the ring are indeed subtle. :(

>> +    atomic_store_release(&ring->out, ring->out + 1);
>> +
>> +    return data;
>> +}
>> +
>> +static inline int ring_put(Ring *ring, void *data)
>> +{
>> +    if (ring->flags & RING_MULTI_PRODUCER) {
>> +        return ring_mp_put(ring, data);
>> +    }
>> +    return __ring_put(ring, data);
>> +}
>> +
>> +static inline void *ring_get(Ring *ring)
>> +{
>> +    if (ring->flags & RING_MULTI_PRODUCER) {
>> +        return ring_mp_get(ring);
>> +    }
>> +    return __ring_get(ring);
>> +}
>> +#endif
> 
> 
> A bunch of tricky barriers retries etc all over the place.  This sorely
> needs *a lot of* unit tests. Where are they?

I used the code attached to this mail to test & benchmark the patches during
development. It is not dedicated to the Ring itself; instead it is built on
top of the compression framework.

Yes, test cases are useful and really needed, I will write them... :)
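
For the ring itself, a first unit test could be as small as the single-threaded
round trip below. ring_put()/ring_get() are the names from the patch;
ring_alloc() and its arguments are assumed placeholders for whatever
constructor the final version exposes:

/* Single-threaded smoke test: pointers must come out in FIFO order and an
 * empty ring must return NULL. ring_alloc(size, flags) is an assumed name. */
static void test_ring_spsc(void)
{
    Ring *ring = ring_alloc(8, 0);
    int values[4];
    int i;

    g_assert(ring_get(ring) == NULL);

    for (i = 0; i < 4; i++) {
        g_assert(ring_put(ring, &values[i]) == 0);  /* assuming 0 on success */
    }
    for (i = 0; i < 4; i++) {
        g_assert(ring_get(ring) == &values[i]);
    }
    g_assert(ring_get(ring) == NULL);
}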


[-- Attachment #2: migration-threads-test.c --]
[-- Type: text/x-csrc, Size: 7400 bytes --]

#include "qemu/osdep.h"

#include "libqtest.h"
#include <zlib.h>

#include "qemu/osdep.h"
#include <zlib.h>
#include "qemu/cutils.h"
#include "qemu/bitops.h"
#include "qemu/bitmap.h"
#include "qemu/main-loop.h"
#include "migration/ram.h"
#include "migration/migration.h"
#include "migration/register.h"
#include "migration/misc.h"
#include "migration/page_cache.h"
#include "qemu/error-report.h"
#include "qapi/error.h"
#include "qapi/qapi-events-migration.h"
#include "qapi/qmp/qerror.h"
#include "trace.h"
//#include "exec/ram_addr.h"
#include "exec/target_page.h"
#include "qemu/rcu_queue.h"
#include "migration/colo.h"
#include "migration/block.h"
#include "migration/threads.h"

#include "migration/qemu-file.h"
#include "migration/threads.h"

CompressionStats compression_counters;

#define PAGE_SIZE 4096
#define PAGE_MASK ~(PAGE_SIZE - 1)

static ssize_t test_writev_buffer(void *opaque, struct iovec *iov, int iovcnt,
                                   int64_t pos)
{
    int i, size = 0;

    for (i = 0; i < iovcnt; i++) {
        size += iov[i].iov_len;
    }
    return size;
}

static int test_fclose(void *opaque)
{
    return 0;
}

static const QEMUFileOps test_write_ops = {
    .writev_buffer  = test_writev_buffer,
    .close          = test_fclose
};

QEMUFile *dest_file;

static const QEMUFileOps empty_ops = { };

static int do_compress_ram_page(QEMUFile *f, z_stream *stream, uint8_t *ram_addr,
                                ram_addr_t offset, uint8_t *source_buf)
{
    int bytes_sent = 0, blen;
    uint8_t *p = ram_addr;

    /*
     * copy it to an internal buffer to avoid it being modified by the VM
     * so that we can catch any error during compression and
     * decompression
     */
    memcpy(source_buf, p, PAGE_SIZE);
    blen = qemu_put_compression_data(f, stream, source_buf, PAGE_SIZE);
    if (blen < 0) {
        bytes_sent = 0;
        qemu_file_set_error(dest_file, blen);
        error_report("compressed data failed!");
    } else {
        printf("Compressed size %d.\n", blen);
        bytes_sent += blen;
    }

    return bytes_sent;
}

struct CompressData {
    /* filled by migration thread.*/
    uint8_t *ram_addr;
    ram_addr_t offset;

    /* filled by compress thread. */
    QEMUFile *file;
    z_stream stream;
    uint8_t *originbuf;

    ThreadRequest data;
};
typedef struct CompressData CompressData;

static ThreadRequest *compress_thread_data_init(void)
{
    CompressData *cd = g_new0(CompressData, 1);

    cd->originbuf = g_try_malloc(PAGE_SIZE);
    if (!cd->originbuf) {
        goto exit;
    }

    if (deflateInit(&cd->stream, 1) != Z_OK) {
        g_free(cd->originbuf);
        goto exit;
    }

    cd->file = qemu_fopen_ops(NULL, &empty_ops);
    return &cd->data;

exit:
    g_free(cd);
    return NULL;
}

static void compress_thread_data_fini(ThreadRequest *data)
{
    CompressData *cd = container_of(data, CompressData, data);

    qemu_fclose(cd->file);
    deflateEnd(&cd->stream);
    g_free(cd->originbuf);
    g_free(cd);
}

static void compress_thread_data_handler(ThreadRequest *data)
{
    CompressData *cd = container_of(data, CompressData, data);

    /*
     * if compression fails, it will indicate by
     * migrate_get_current()->to_dst_file.
     */
    do_compress_ram_page(cd->file, &cd->stream, cd->ram_addr, cd->offset,
                         cd->originbuf);
}

static void compress_thread_data_done(ThreadRequest *data)
{
    CompressData *cd = container_of(data, CompressData, data);
    int bytes_xmit;

    bytes_xmit = qemu_put_qemu_file(dest_file, cd->file);
    /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
    compression_counters.reduced_size += 4096 - bytes_xmit + 8;
    compression_counters.pages++;
}

static Threads *compress_threads;

static void flush_compressed_data(void)
{
    threads_wait_done(compress_threads);
}

static void compress_threads_save_cleanup(void)
{
    if (!compress_threads) {
        return;
    }

    threads_destroy(compress_threads);
    compress_threads = NULL;
    qemu_fclose(dest_file);
    dest_file = NULL;
}

static int compress_threads_save_setup(void)
{
    dest_file = qemu_fopen_ops(NULL, &test_write_ops);
    compress_threads = threads_create(16,
                                      "compress",
                                      compress_thread_data_init,
                                      compress_thread_data_fini,
                                      compress_thread_data_handler,
                                      compress_thread_data_done);
    assert(compress_threads);
    return 0;
}

static int compress_page_with_multi_thread(uint8_t *addr)
{
    CompressData *cd;
    ThreadRequest *thread_data;
    thread_data = threads_submit_request_prepare(compress_threads);
    if (!thread_data) {
        compression_counters.busy++;
        return -1;
    }

    cd = container_of(thread_data, CompressData, data);
    cd->ram_addr = addr;
    threads_submit_request_commit(compress_threads, thread_data);
    return 1;
}

#define MEM_SIZE (30ULL << 30)
#define COUNT    5 

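/*
 * Touch every page of a 30GB buffer and submit it to the compression
 * threads, COUNT times in a row; report how often submission found all
 * worker threads busy and how long each pass took.
 */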
static void run(void)
{
    void *mem = qemu_memalign(PAGE_SIZE, MEM_SIZE);
    uint8_t *ptr = mem, *end = mem + MEM_SIZE;
    uint64_t start_time, total_time = 0, spend, total_busy = 0;
    int i;

    memset(mem, 0, MEM_SIZE);

    start_time = g_get_monotonic_time();
    for (i = 0; i < COUNT; i++) {
        ptr = mem;
        start_time = g_get_monotonic_time();
        while (ptr < end) {
            *ptr = 0x10;
            compress_page_with_multi_thread(ptr);
            ptr += PAGE_SIZE;
        }
        flush_compressed_data();
        spend = g_get_monotonic_time() - start_time;
        total_time += spend;
        printf("RUN %d: BUSY %ld Time Cost %ld.\n", i, compression_counters.busy, spend);
        total_busy += compression_counters.busy;
        compression_counters.busy = 0;
    }

    printf("AVG: BUSY %ld Time Cost %ld.\n", total_busy / COUNT, total_time / COUNT);
}

static void compare_zero_and_compression(void)
{
    ThreadRequest *data = compress_thread_data_init();
    CompressData *cd;
    uint64_t start_time, zero_time, compress_time;
    char page[PAGE_SIZE];

    if (!data) {
        printf("Init compression failed.\n");
        return;
    }

    cd = container_of(data, CompressData, data);
    cd->ram_addr = (uint8_t *)page;

    memset(page, 0, sizeof(page));
    dest_file = qemu_fopen_ops(NULL, &test_write_ops);

    start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    buffer_is_zero(page, PAGE_SIZE);
    zero_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time;

    start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    compress_thread_data_handler(data);
    compress_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time;

    printf("Zero %ld ns Compression: %ld ns.\n", zero_time, compress_time);
    compress_thread_data_fini(data);

}

static void migration_threads(void)
{
    int i;

    printf("Zero Test vs. compression.\n");
    for (i = 0; i < 10; i++) {
        compare_zero_and_compression();
    }

    printf("test migration threads.\n");
    compress_threads_save_setup();
    run();
    compress_threads_save_cleanup();
}

int main(int argc, char **argv)
{
    QTestState *s = NULL;
    int ret;

    g_test_init(&argc, &argv, NULL);

    qtest_add_func("/migration/threads", migration_threads);
    ret = g_test_run();

    if (s) {
        qtest_quit(s);
    }

    return ret;
}


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-29  4:23       ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-06-29  7:44         ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-29  7:44 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: kvm, mtosatti, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, jiang.biao2, pbonzini



On 06/29/2018 12:23 PM, Michael S. Tsirkin wrote:
> On Thu, Jun 28, 2018 at 09:36:00PM +0800, Jason Wang wrote:
>>
>>
>> On 2018年06月04日 17:55, guangrong.xiao@gmail.com wrote:
>>> From: Xiao Guangrong<xiaoguangrong@tencent.com>
>>>

>>> Memory barrier is omitted here, please refer to the comment in the code.
>>>
>>> (1)https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
>>> (2)http://dpdk.org/doc/api/rte__ring_8h.html
>>>
>>> Signed-off-by: Xiao Guangrong<xiaoguangrong@tencent.com>
>>> ---
>>
>> May I ask why you need a MPSC ring here? Can we just use N SPSC ring for
>> submitting pages and another N SPSC ring for passing back results?
>>
>> Thanks
> 
> Or just an SPSC ring + a lock.
> How big of a gain is lockless access to a trivial structure
> like the ring?
> 

Okay, I will give it a try.

BTW, we tried both a global ring + lock and a lockless ring for the input side;
the former did not show better performance. But we haven't tried a global
ring + lock for the output side yet.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-29  6:15         ` [Qemu-devel] " Jason Wang
@ 2018-06-29  7:47           ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-06-29  7:47 UTC (permalink / raw)
  To: Jason Wang, pbonzini, mst, mtosatti
  Cc: kvm, Xiao Guangrong, qemu-devel, peterx, dgilbert, wei.w.wang,
	jiang.biao2



On 06/29/2018 02:15 PM, Jason Wang wrote:
> 
> 
> On 2018年06月29日 11:59, Xiao Guangrong wrote:
>>
>>
>> On 06/28/2018 09:36 PM, Jason Wang wrote:
>>>
>>>
>>> On 2018年06月04日 17:55, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong<xiaoguangrong@tencent.com>
>>>>
>>>> It's the simple lockless ring buffer implement which supports both
>>>> single producer vs. single consumer and multiple producers vs.
>>>> single consumer.
>>>>
>>
>>>> Finally, it fetches the valid data out, set the entry to the initialized
>>>> state and update ring->out to make the entry be usable to the producer:
>>>>
>>>>        data = *entry;
>>>>        *entry = NULL;
>>>>        ring->out++;
>>>>
>>>> Memory barrier is omitted here, please refer to the comment in the code.
>>>>
>>>> (1)https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
>>>> (2)http://dpdk.org/doc/api/rte__ring_8h.html
>>>>
>>>> Signed-off-by: Xiao Guangrong<xiaoguangrong@tencent.com>
>>>> ---
>>>
>>> May I ask why you need a MPSC ring here? Can we just use N SPSC ring for submitting pages and another N SPSC ring for passing back results?
>>
>> Sure.
>>
>> We had this option in our mind, however, it is not scalable which will slow
>> the main thread down, instead, we'd rather to speed up main thread and move
>> reasonable workload to the threads.
> 
> I'm not quite understand the scalability issue here. Is it because of main thread need go through all N rings (which I think not)?

Yes, it is.

The main thread would need to check every single thread and wait for
each of them to finish, one by one...
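
In other words, with per-thread result rings the flush path has to loop over
every ring, while a shared MPSC ring reduces it to draining one queue. A rough
sketch of the difference (all of the type and field names below are made up
for illustration):

/* Per-thread SPSC result rings: the migration thread must poll N rings. */
static void flush_results_per_thread_rings(ResultThreads *t)
{
    ThreadRequest *req;
    int i;

    for (i = 0; i < t->nr_threads; i++) {
        while ((req = ring_get(t->result_rings[i]))) {
            t->ops->done(req);
        }
    }
}

/* One shared MPSC result ring: the migration thread drains a single queue. */
static void flush_results_shared_ring(ResultThreads *t)
{
    ThreadRequest *req;

    while ((req = ring_get(t->result_ring))) {
        t->ops->done(req);
    }
}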

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 06/12] migration: do not detect zero page for compression
  2018-06-28  9:12       ` [Qemu-devel] " Xiao Guangrong
@ 2018-06-29  9:42         ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-06-29  9:42 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, Peter Xu,
	wei.w.wang, jiang.biao2, pbonzini

* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> Hi Peter,
> 
> Sorry for the delay as i was busy on other things.
> 
> On 06/19/2018 03:30 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Detecting zero page is not a light work, we can disable it
> > > for compression that can handle all zero data very well
> > 
> > Is there any number shows how the compression algo performs better
> > than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> > be fast, depending on how init_accel() is done in util/bufferiszero.c.
> 
> This is the comparison between zero-detection and compression (the target
> buffer is all zero bit):
> 
> Zero 810 ns Compression: 26905 ns.
> Zero 417 ns Compression: 8022 ns.
> Zero 408 ns Compression: 7189 ns.
> Zero 400 ns Compression: 7255 ns.
> Zero 412 ns Compression: 7016 ns.
> Zero 411 ns Compression: 7035 ns.
> Zero 413 ns Compression: 6994 ns.
> Zero 399 ns Compression: 7024 ns.
> Zero 416 ns Compression: 7053 ns.
> Zero 405 ns Compression: 7041 ns.
> 
> Indeed, zero-detection is faster than compression.
> 
> However during our profiling for the live_migration thread (after reverted this patch),
> we noticed zero-detection cost lots of CPU:
> 
>  12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2

Interesting; what host are you running on?
Some hosts have support for the faster buffer_zero_sse4/avx2 variants.

>   7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
>   6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
>   5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
>   5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
>   4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
>   4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
>   3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
>   2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
>   2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
>   2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
>   2.25%  kqemu  qemu-system-x86_64            [.] ring_get
>   1.96%  kqemu  libc-2.12.so                  [.] memcpy
> 
> After this patch, the workload is moved to the worker thread, is it
> acceptable?
> 
> > 
> >  From compression rate POV of course zero page algo wins since it
> > contains no data (but only a flag).
> > 
> 
> Yes it is. The compressed zero page is 45 bytes that is small enough i think.

So the compression is ~20x slower and produces 10x the size; not a great
improvement!

However, the tricky thing is that for a guest which is mostly non-zero,
this patch would save the time spent on zero detection, so it would be
faster.

> Hmm, if you do not like, how about move detecting zero page to the work thread?

That would be interesting to try.
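
For the archive, a rough sketch of that idea on top of the test harness
attached earlier in this thread: the worker's handler would do the zero check
itself and skip compression for all-zero pages. The 'zero' field is assumed,
and the real patch would also have to pass that result back so the done()
callback emits a zero-page flag instead of compressed data.

static void compress_thread_data_handler(ThreadRequest *data)
{
    CompressData *cd = container_of(data, CompressData, data);

    /* Hypothetical: detect the zero page in the worker thread. */
    if (buffer_is_zero(cd->ram_addr, PAGE_SIZE)) {
        cd->zero = true;        /* assumed field, consumed by the done() hook */
        return;
    }

    do_compress_ram_page(cd->file, &cd->stream, cd->ram_addr, cd->offset,
                         cd->originbuf);
}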

Dave

> Thanks!
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 06/12] migration: do not detect zero page for compression
  2018-06-28  9:36         ` [Qemu-devel] " Daniel P. Berrangé
@ 2018-06-29  9:54           ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-06-29  9:54 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, Peter Xu,
	wei.w.wang, Xiao Guangrong, pbonzini, jiang.biao2

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Thu, Jun 28, 2018 at 05:12:39PM +0800, Xiao Guangrong wrote:
> > 
> > Hi Peter,
> > 
> > Sorry for the delay as i was busy on other things.
> > 
> > On 06/19/2018 03:30 PM, Peter Xu wrote:
> > > On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > 
> > > > Detecting zero page is not a light work, we can disable it
> > > > for compression that can handle all zero data very well
> > > 
> > > Is there any number shows how the compression algo performs better
> > > than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> > > be fast, depending on how init_accel() is done in util/bufferiszero.c.
> > 
> > This is the comparison between zero-detection and compression (the target
> > buffer is all zero bit):
> > 
> > Zero 810 ns Compression: 26905 ns.
> > Zero 417 ns Compression: 8022 ns.
> > Zero 408 ns Compression: 7189 ns.
> > Zero 400 ns Compression: 7255 ns.
> > Zero 412 ns Compression: 7016 ns.
> > Zero 411 ns Compression: 7035 ns.
> > Zero 413 ns Compression: 6994 ns.
> > Zero 399 ns Compression: 7024 ns.
> > Zero 416 ns Compression: 7053 ns.
> > Zero 405 ns Compression: 7041 ns.
> > 
> > Indeed, zero-detection is faster than compression.
> > 
> > However during our profiling for the live_migration thread (after reverted this patch),
> > we noticed zero-detection cost lots of CPU:
> > 
> >  12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
> >   7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
> >   6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
> >   5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
> >   5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
> >   4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
> >   4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
> >   3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
> >   2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
> >   2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
> >   2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
> >   2.25%  kqemu  qemu-system-x86_64            [.] ring_get
> >   1.96%  kqemu  libc-2.12.so                  [.] memcpy
> > 
> > After this patch, the workload is moved to the worker thread; is that
> > acceptable?
> 
> It depends on your point of view. If you have spare / idle CPUs on the host,
> then moving the workload to a thread is OK, despite the CPU cost of compression
> in that thread being much higher than what it replaced, since you won't be
> taking CPU resources away from other contending workloads.

It depends on the VM as well; if the VM is mostly non-zero, the zero
checks happen and are overhead (although if the pages are non-zero then
the zero check will usually finish much faster, unless you're unlucky and
the non-zero byte is the last one on the page).

> I'd venture to suggest though that we should probably *not* be optimizing for
> the case of idle CPUs on the host. More realistic is to expect that the host
> CPUs are near fully committed to work, and thus the (default) goal should be
> to minimize CPU overhead for the host as a whole. From this POV, zero-page
> detection is better than compression due to its more than 10x better speed.

Note that this is only happening if compression is enabled.

> Given the CPU overheads of compression, I think it has fairly narrow use
> in migration in general, considering that hosts are often highly committed
> on CPU.

Also, this compression series was originally written by Intel for the
case where there's compression accelerator hardware (which I've never
managed to find to try); in that case I guess it saves that CPU overhead.

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
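
For reference, a minimal, non-vectorized sketch of the early-exit zero check
being discussed above (illustrative only, not QEMU's actual buffer_zero_sse2 /
buffer_is_zero implementation; page_is_zero is a made-up name) looks roughly
like this:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Early-exit zero check over one page: a page whose first byte is
 * non-zero is rejected almost immediately, while an all-zero page
 * (or one whose only non-zero byte is the last) must be scanned
 * completely -- the worst case mentioned above. */
static bool page_is_zero(const uint8_t *page, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (page[i]) {
            return false;
        }
    }
    return true;
}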

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 07/12] migration: hold the lock only if it is really needed
  2018-06-28  9:33       ` [Qemu-devel] " Xiao Guangrong
@ 2018-06-29 11:22         ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-06-29 11:22 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, Peter Xu,
	wei.w.wang, jiang.biao2, pbonzini

* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> 
> On 06/19/2018 03:36 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Try to hold src_page_req_mutex only if the queue is not
> > > empty
> > 
> > Pure question: how much would this patch help?  Basically if you are
> > running compression tests then I think it means you are using precopy
> > (since postcopy cannot work with compression yet), so here the lock
> > has no contention at all.
> 
> Yes, you are right; however, we can observe it among the top functions
> (after reverting this patch):

Can you show the matching trace with the patch in?

Dave

> Samples: 29K of event 'cycles', Event count (approx.): 22263412260
> +   7.99%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
> +   6.95%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
> +   6.23%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
> +   6.20%  kqemu  qemu-system-x86_64       [.] qemu_event_set
> +   5.80%  kqemu  qemu-system-x86_64       [.] __ring_put
> +   4.82%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
> +   4.11%  kqemu  qemu-system-x86_64       [.] ring_is_full
> +   3.07%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
> +   2.83%  kqemu  qemu-system-x86_64       [.] ring_mp_get
> +   2.71%  kqemu  qemu-system-x86_64       [.] __ring_is_full
> +   2.46%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
> +   2.40%  kqemu  qemu-system-x86_64       [.] add_to_iovec
> +   2.21%  kqemu  qemu-system-x86_64       [.] ring_get
> +   1.96%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
> +   1.90%  kqemu  libc-2.12.so             [.] memcpy
> +   1.55%  kqemu  qemu-system-x86_64       [.] ring_len
> +   1.12%  kqemu  libpthread-2.12.so       [.] pthread_mutex_unlock
> +   1.11%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
> +   1.07%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
> +   1.04%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
> +   0.97%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
> +   0.96%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
> +   0.93%  kqemu  libpthread-2.12.so       [.] pthread_mutex_lock
> 
> I guess its atomic operations cost CPU resources, and check-before-lock is
> a common technique; I think it shouldn't have side effects, right? :)
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
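
For context, the check-before-lock pattern referred to above is roughly the
following (a generic pthread sketch, not the actual QEMU migration code; the
Queue type and function names are illustrative):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node {
    struct Node *next;
    void *data;
} Node;

typedef struct {
    Node *head;
    pthread_mutex_t lock;
} Queue;

/* Unlocked peek: racy, but acceptable here because a request queued
 * right after the peek is simply picked up on the next poll. */
static bool queue_looks_empty(const Queue *q)
{
    return q->head == NULL;
}

/* Check-before-lock: only take the mutex when the unlocked peek
 * suggests there is work, then re-check under the lock. */
static void *queue_try_pop(Queue *q)
{
    void *data = NULL;

    if (queue_looks_empty(q)) {
        return NULL;                 /* fast path: no lock taken */
    }
    pthread_mutex_lock(&q->lock);
    if (q->head) {                   /* re-check under the lock */
        Node *n = q->head;
        q->head = n->next;
        data = n->data;
        /* a full implementation would also free or recycle the node */
    }
    pthread_mutex_unlock(&q->lock);
    return data;
}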

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-29  7:30       ` [Qemu-devel] " Xiao Guangrong
@ 2018-06-29 13:08         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 156+ messages in thread
From: Michael S. Tsirkin @ 2018-06-29 13:08 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mtosatti, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, jiang.biao2, pbonzini

On Fri, Jun 29, 2018 at 03:30:44PM +0800, Xiao Guangrong wrote:
> 
> Hi Michael,
> 
> On 06/20/2018 08:38 PM, Michael S. Tsirkin wrote:
> > On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> > > 
> > > 
> > > (1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
> > > (2) http://dpdk.org/doc/api/rte__ring_8h.html
> > > 
> > > Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> > 
> > So instead of all this super-optimized trickiness, how about
> > a simple port of ptr_ring from linux?
> > 
> > That one isn't lockless but it's known to outperform
> > most others for a single producer/single consumer case.
> > And with a ton of networking going on,
> > who said it's such a hot spot? OTOH this implementation
> > has more barriers which slows down each individual thread.
> > It's also a source of bugs.
> > 
> 
> Thank you for pointing it out.
> 
> I just quickly went through the code of ptr_ring, which is very nice and
> really impressive. I will consider porting it to QEMU.

The port is pretty trivial. See below. It's an SPSC structure though, so
you need to use it with a lock.  Given the critical section is small, I
put in QemuSpin, not a mutex.  To reduce the cost of locks, it helps if you
can use the batched API to consume. I assume producers can't batch,
but if they can, we should add an API for that; it will help too.


---

qemu/ptr_ring.h: straight port from Linux 4.17

Port done by author.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

diff --git a/include/qemu/ptr_ring.h b/include/qemu/ptr_ring.h
new file mode 100644
index 0000000000..f7446678de
--- /dev/null
+++ b/include/qemu/ptr_ring.h
@@ -0,0 +1,464 @@
+/*
+ *	Definitions for the 'struct ptr_ring' datastructure.
+ *
+ *	Author:
+ *		Michael S. Tsirkin <mst@redhat.com>
+ *
+ *	Copyright (C) 2016 Red Hat, Inc.
+ *
+ *	This program is free software; you can redistribute it and/or modify it
+ *	under the terms of the GNU General Public License as published by the
+ *	Free Software Foundation; either version 2 of the License, or (at your
+ *	option) any later version.
+ *
+ *	This is a limited-size FIFO maintaining pointers in FIFO order, with
+ *	one CPU producing entries and another consuming entries from a FIFO.
+ *
+ *	This implementation tries to minimize cache-contention when there is a
+ *	single producer and a single consumer CPU.
+ */
+
+#ifndef QEMU_PTR_RING_H
+#define QEMU_PTR_RING_H 1
+
+#include "qemu/thread.h"
+
+#define PTR_RING_CACHE_BYTES 64
+#define PTR_RING_CACHE_ALIGNED __attribute__((__aligned__(PTR_RING_CACHE_BYTES)))
+#define PTR_RING_WRITE_ONCE(p, v) (*(volatile typeof(p) *)&(p) = (v))
+#define PTR_RING_READ_ONCE(p) (*(volatile typeof(p) *)&(p))
+
+struct ptr_ring {
+	int producer PTR_RING_CACHE_ALIGNED;
+	QemuSpin producer_lock;
+	int consumer_head PTR_RING_CACHE_ALIGNED; /* next valid entry */
+	int consumer_tail; /* next entry to invalidate */
+	QemuSpin consumer_lock;
+	/* Shared consumer/producer data */
+	/* Read-only by both the producer and the consumer */
+	int size PTR_RING_CACHE_ALIGNED; /* max entries in queue */
+	int batch; /* number of entries to consume in a batch */
+	void **queue;
+};
+
+/* Note: callers invoking this in a loop must use a compiler barrier,
+ * for example cpu_relax().
+ *
+ * NB: this is unlike __ptr_ring_empty in that callers must hold producer_lock:
+ * see e.g. ptr_ring_full.
+ */
+static inline bool __ptr_ring_full(struct ptr_ring *r)
+{
+	return r->queue[r->producer];
+}
+
+static inline bool ptr_ring_full(struct ptr_ring *r)
+{
+	bool ret;
+
+	qemu_spin_lock(&r->producer_lock);
+	ret = __ptr_ring_full(r);
+	qemu_spin_unlock(&r->producer_lock);
+
+	return ret;
+}
+
+/* Note: callers invoking this in a loop must use a compiler barrier,
+ * for example cpu_relax(). Callers must hold producer_lock.
+ * Callers are responsible for making sure the pointer that is being queued
+ * points to valid data.
+ */
+static inline int __ptr_ring_produce(struct ptr_ring *r, void *ptr)
+{
+	if (unlikely(!r->size) || r->queue[r->producer])
+		return -ENOSPC;
+
+	/* Make sure the pointer we are storing points to valid data. */
+	/* Pairs with smp_read_barrier_depends in __ptr_ring_consume. */
+	smp_wmb();
+
+	PTR_RING_WRITE_ONCE(r->queue[r->producer++], ptr);
+	if (unlikely(r->producer >= r->size))
+		r->producer = 0;
+	return 0;
+}
+
+/*
+ * Note: resize (below) nests producer lock within consumer lock, so if you
+ * consume in interrupt or BH context, you must disable interrupts/BH when
+ * calling this.
+ */
+static inline int ptr_ring_produce(struct ptr_ring *r, void *ptr)
+{
+	int ret;
+
+	qemu_spin_lock(&r->producer_lock);
+	ret = __ptr_ring_produce(r, ptr);
+	qemu_spin_unlock(&r->producer_lock);
+
+	return ret;
+}
+
+static inline void *__ptr_ring_peek(struct ptr_ring *r)
+{
+	if (likely(r->size))
+		return PTR_RING_READ_ONCE(r->queue[r->consumer_head]);
+	return NULL;
+}
+
+/*
+ * Test ring empty status without taking any locks.
+ *
+ * NB: This is only safe to call if ring is never resized.
+ *
+ * However, if some other CPU consumes ring entries at the same time, the value
+ * returned is not guaranteed to be correct.
+ *
+ * In this case - to avoid incorrectly detecting the ring
+ * as empty - the CPU consuming the ring entries is responsible
+ * for either consuming all ring entries until the ring is empty,
+ * or synchronizing with some other CPU and causing it to
+ * re-test __ptr_ring_empty and/or consume the ring entries
+ * after the synchronization point.
+ *
+ * Note: callers invoking this in a loop must use a compiler barrier,
+ * for example cpu_relax().
+ */
+static inline bool __ptr_ring_empty(struct ptr_ring *r)
+{
+	if (likely(r->size))
+		return !r->queue[PTR_RING_READ_ONCE(r->consumer_head)];
+	return true;
+}
+
+static inline bool ptr_ring_empty(struct ptr_ring *r)
+{
+	bool ret;
+
+	qemu_spin_lock(&r->consumer_lock);
+	ret = __ptr_ring_empty(r);
+	qemu_spin_unlock(&r->consumer_lock);
+
+	return ret;
+}
+
+/* Must only be called after __ptr_ring_peek returned !NULL */
+static inline void __ptr_ring_discard_one(struct ptr_ring *r)
+{
+	/* Fundamentally, what we want to do is update consumer
+	 * index and zero out the entry so producer can reuse it.
+	 * Doing it naively at each consume would be as simple as:
+	 *       consumer = r->consumer;
+	 *       r->queue[consumer++] = NULL;
+	 *       if (unlikely(consumer >= r->size))
+	 *               consumer = 0;
+	 *       r->consumer = consumer;
+	 * but that is suboptimal when the ring is full as producer is writing
+	 * out new entries in the same cache line.  Defer these updates until a
+	 * batch of entries has been consumed.
+	 */
+	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
+	 * to work correctly.
+	 */
+	int consumer_head = r->consumer_head;
+	int head = consumer_head++;
+
+	/* Once we have processed enough entries invalidate them in
+	 * the ring all at once so producer can reuse their space in the ring.
+	 * We also do this when we reach end of the ring - not mandatory
+	 * but helps keep the implementation simple.
+	 */
+	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
+		     consumer_head >= r->size)) {
+		/* Zero out entries in the reverse order: this way we touch the
+		 * cache line that producer might currently be reading the last;
+		 * producer won't make progress and touch other cache lines
+		 * besides the first one until we write out all entries.
+		 */
+		while (likely(head >= r->consumer_tail))
+			r->queue[head--] = NULL;
+		r->consumer_tail = consumer_head;
+	}
+	if (unlikely(consumer_head >= r->size)) {
+		consumer_head = 0;
+		r->consumer_tail = 0;
+	}
+	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
+	PTR_RING_WRITE_ONCE(r->consumer_head, consumer_head);
+}
+
+static inline void *__ptr_ring_consume(struct ptr_ring *r)
+{
+	void *ptr;
+
+	/* The READ_ONCE in __ptr_ring_peek guarantees that anyone
+	 * accessing data through the pointer is up to date. Pairs
+	 * with smp_wmb in __ptr_ring_produce.
+	 */
+	ptr = __ptr_ring_peek(r);
+	if (ptr)
+		__ptr_ring_discard_one(r);
+
+	return ptr;
+}
+
+static inline int __ptr_ring_consume_batched(struct ptr_ring *r,
+					     void **array, int n)
+{
+	void *ptr;
+	int i;
+
+	for (i = 0; i < n; i++) {
+		ptr = __ptr_ring_consume(r);
+		if (!ptr)
+			break;
+		array[i] = ptr;
+	}
+
+	return i;
+}
+
+/*
+ * Note: resize (below) nests producer lock within consumer lock, so if you
+ * call this in interrupt or BH context, you must disable interrupts/BH when
+ * producing.
+ */
+static inline void *ptr_ring_consume(struct ptr_ring *r)
+{
+	void *ptr;
+
+	qemu_spin_lock(&r->consumer_lock);
+	ptr = __ptr_ring_consume(r);
+	qemu_spin_unlock(&r->consumer_lock);
+
+	return ptr;
+}
+
+static inline int ptr_ring_consume_batched(struct ptr_ring *r,
+					   void **array, int n)
+{
+	int ret;
+
+	qemu_spin_lock(&r->consumer_lock);
+	ret = __ptr_ring_consume_batched(r, array, n);
+	qemu_spin_unlock(&r->consumer_lock);
+
+	return ret;
+}
+
+/* Cast to structure type and call a function without discarding from FIFO.
+ * Function must return a value.
+ * Callers must take consumer_lock.
+ */
+#define __PTR_RING_PEEK_CALL(r, f) ((f)(__ptr_ring_peek(r)))
+
+#define PTR_RING_PEEK_CALL(r, f) ({ \
+	typeof((f)(NULL)) __PTR_RING_PEEK_CALL_v; \
+	\
+	qemu_spin_lock(&(r)->consumer_lock); \
+	__PTR_RING_PEEK_CALL_v = __PTR_RING_PEEK_CALL(r, f); \
+	qemu_spin_unlock(&(r)->consumer_lock); \
+	__PTR_RING_PEEK_CALL_v; \
+})
+
+static inline void **__ptr_ring_init_queue_alloc(unsigned int size)
+{
+	/* entries must start out NULL: empty/full state is tracked via NULL slots */
+	return g_try_new0(void *, size);
+}
+
+static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)
+{
+	r->size = size;
+	r->batch = PTR_RING_CACHE_BYTES * 2 / sizeof(*(r->queue));
+	/* We need to set batch at least to 1 to make logic
+	 * in __ptr_ring_discard_one work correctly.
+	 * Batching too much (because ring is small) would cause a lot of
+	 * burstiness. Needs tuning, for now disable batching.
+	 */
+	if (r->batch > r->size / 2 || !r->batch)
+		r->batch = 1;
+}
+
+static inline int ptr_ring_init(struct ptr_ring *r, int size)
+{
+	r->queue = __ptr_ring_init_queue_alloc(size);
+	if (!r->queue)
+		return -ENOMEM;
+
+	__ptr_ring_set_size(r, size);
+	r->producer = r->consumer_head = r->consumer_tail = 0;
+	qemu_spin_init(&r->producer_lock);
+	qemu_spin_init(&r->consumer_lock);
+
+	return 0;
+}
+
+/*
+ * Return entries into ring. Destroy entries that don't fit.
+ *
+ * Note: this is expected to be a rare slow path operation.
+ *
+ * Note: producer lock is nested within consumer lock, so if you
+ * resize you must make sure all uses nest correctly.
+ * In particular if you consume ring in interrupt or BH context, you must
+ * disable interrupts/BH when doing so.
+ */
+static inline void ptr_ring_unconsume(struct ptr_ring *r, void **batch, int n,
+				      void (*destroy)(void *))
+{
+	int head;
+
+	qemu_spin_lock(&r->consumer_lock);
+	qemu_spin_lock(&r->producer_lock);
+
+	if (!r->size)
+		goto done;
+
+	/*
+	 * Clean out buffered entries (for simplicity). This way following code
+	 * can test entries for NULL and if not assume they are valid.
+	 */
+	head = r->consumer_head - 1;
+	while (likely(head >= r->consumer_tail))
+		r->queue[head--] = NULL;
+	r->consumer_tail = r->consumer_head;
+
+	/*
+	 * Go over entries in batch, start moving head back and copy entries.
+	 * Stop when we run into previously unconsumed entries.
+	 */
+	while (n) {
+		head = r->consumer_head - 1;
+		if (head < 0)
+			head = r->size - 1;
+		if (r->queue[head]) {
+			/* This batch entry will have to be destroyed. */
+			goto done;
+		}
+		r->queue[head] = batch[--n];
+		r->consumer_tail = head;
+		/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
+		PTR_RING_WRITE_ONCE(r->consumer_head, head);
+	}
+
+done:
+	/* Destroy all entries left in the batch. */
+	while (n)
+		destroy(batch[--n]);
+	qemu_spin_unlock(&r->producer_lock);
+	qemu_spin_unlock(&r->consumer_lock);
+}
+
+static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
+						    int size,
+						    void (*destroy)(void *))
+{
+	int producer = 0;
+	void **old;
+	void *ptr;
+
+	while ((ptr = __ptr_ring_consume(r)))
+		if (producer < size)
+			queue[producer++] = ptr;
+		else if (destroy)
+			destroy(ptr);
+
+	__ptr_ring_set_size(r, size);
+	r->producer = producer;
+	r->consumer_head = 0;
+	r->consumer_tail = 0;
+	old = r->queue;
+	r->queue = queue;
+
+	return old;
+}
+
+/*
+ * Note: producer lock is nested within consumer lock, so if you
+ * resize you must make sure all uses nest correctly.
+ * In particular if you consume ring in interrupt or BH context, you must
+ * disable interrupts/BH when doing so.
+ */
+static inline int ptr_ring_resize(struct ptr_ring *r, int size,
+				  void (*destroy)(void *))
+{
+	void **queue = __ptr_ring_init_queue_alloc(size);
+	void **old;
+
+	if (!queue)
+		return -ENOMEM;
+
+	qemu_spin_lock(&(r)->consumer_lock);
+	qemu_spin_lock(&(r)->producer_lock);
+
+	old = __ptr_ring_swap_queue(r, queue, size, destroy);
+
+	qemu_spin_unlock(&(r)->producer_lock);
+	qemu_spin_unlock(&(r)->consumer_lock);
+
+	g_free(old);
+
+	return 0;
+}
+
+/*
+ * Note: producer lock is nested within consumer lock, so if you
+ * resize you must make sure all uses nest correctly.
+ * In particular if you consume ring in interrupt or BH context, you must
+ * disable interrupts/BH when doing so.
+ */
+static inline int ptr_ring_resize_multiple(struct ptr_ring **rings,
+					   unsigned int nrings,
+					   int size,
+					   void (*destroy)(void *))
+{
+	void ***queues;
+	int i;
+
+	queues = g_try_new(void **, nrings);
+	if (!queues)
+		goto noqueues;
+
+	for (i = 0; i < nrings; ++i) {
+		queues[i] = __ptr_ring_init_queue_alloc(size);
+		if (!queues[i])
+			goto nomem;
+	}
+
+	for (i = 0; i < nrings; ++i) {
+		qemu_spin_lock(&(rings[i])->consumer_lock);
+		qemu_spin_lock(&(rings[i])->producer_lock);
+		queues[i] = __ptr_ring_swap_queue(rings[i], queues[i],
+						  size, destroy);
+		qemu_spin_unlock(&(rings[i])->producer_lock);
+		qemu_spin_unlock(&(rings[i])->consumer_lock);
+	}
+
+	for (i = 0; i < nrings; ++i)
+		g_free(queues[i]);
+
+	g_free(queues);
+
+	return 0;
+
+nomem:
+	while (--i >= 0)
+		g_free(queues[i]);
+
+	g_free(queues);
+
+noqueues:
+	return -ENOMEM;
+}
+
+static inline void ptr_ring_cleanup(struct ptr_ring *r, void (*destroy)(void *))
+{
+	void *ptr;
+
+	if (destroy)
+		while ((ptr = ptr_ring_consume(r)))
+			destroy(ptr);
+	g_free(r->queue);
+}
+
+#endif /* QEMU_PTR_RING_H */
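
To make the intended use a bit more concrete, here is a rough usage sketch
(not part of the patch; the batch size of 16, the handler callback and the
g_usleep() back-off are made up, and it assumes the ptr_ring header above
plus glib):

/* Producer side (e.g. the migration thread): hand one request to the
 * worker's ring, backing off briefly while the ring is full. */
static void submit_request(struct ptr_ring *ring, void *req)
{
    while (ptr_ring_produce(ring, req) != 0) {
        g_usleep(1);   /* simple back-off; a real caller may want smarter waiting */
    }
}

/* Consumer side (worker thread): drain up to 16 requests per lock
 * acquisition to amortize the spinlock cost, as suggested above. */
static void drain_requests(struct ptr_ring *ring, void (*handle)(void *))
{
    void *batch[16];
    int i, n;

    do {
        n = ptr_ring_consume_batched(ring, batch, 16);
        for (i = 0; i < n; i++) {
            handle(batch[i]);
        }
    } while (n > 0);
}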

^ permalink raw reply related	[flat|nested] 156+ messages in thread

* Re: [PATCH 06/12] migration: do not detect zero page for compression
  2018-06-29  9:42         ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-07-03  3:53           ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-07-03  3:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, Peter Xu,
	wei.w.wang, jiang.biao2, pbonzini



On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:
> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>
>> Hi Peter,
>>
>> Sorry for the delay as i was busy on other things.
>>
>> On 06/19/2018 03:30 PM, Peter Xu wrote:
>>> On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> Detecting zero pages is not light work; we can disable it
>>>> for compression, which can handle all-zero data very well
>>>
>>> Are there any numbers that show how the compression algo performs better
>>> than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
>>> be fast, depending on how init_accel() is done in util/bufferiszero.c.
>>
>> This is the comparison between zero-detection and compression (the target
>> buffer is all zero bits):
>>
>> Zero 810 ns Compression: 26905 ns.
>> Zero 417 ns Compression: 8022 ns.
>> Zero 408 ns Compression: 7189 ns.
>> Zero 400 ns Compression: 7255 ns.
>> Zero 412 ns Compression: 7016 ns.
>> Zero 411 ns Compression: 7035 ns.
>> Zero 413 ns Compression: 6994 ns.
>> Zero 399 ns Compression: 7024 ns.
>> Zero 416 ns Compression: 7053 ns.
>> Zero 405 ns Compression: 7041 ns.
>>
>> Indeed, zero-detection is faster than compression.
>>
>> However, during our profiling of the live_migration thread (after reverting this patch),
>> we noticed that zero-detection costs a lot of CPU:
>>
>>   12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
> 
> Interesting; what host are you running on?
> Some hosts have support for the faster buffer_zero_sse4/avx2 variants

The host is:

model name	: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
...
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
  mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
  rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
  ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
  tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
  cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
  hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
  clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
  cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke

I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe due to the too-old glibc/gcc
versions:
    gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
    glibc.x86_64                     2.12


> 
>>    7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
>>    6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
>>    5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
>>    5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
>>    4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
>>    4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
>>    3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
>>    2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
>>    2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
>>    2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
>>    2.25%  kqemu  qemu-system-x86_64            [.] ring_get
>>    1.96%  kqemu  libc-2.12.so                  [.] memcpy
>>
>> After this patch, the workload is moved to the worker thread; is that
>> acceptable?
>>
>>>
>>>   From a compression-rate POV, of course the zero-page algo wins since it
>>> contains no data (but only a flag).
>>>
>>
>> Yes, it is. The compressed zero page is 45 bytes, which is small enough, I think.
> 
> So the compression is ~20x slower and 10x the size; not a great
> improvement!
> 
> However, the tricky thing is that in the case of a guest which is mostly
> non-zero, this patch would save the time used by zero detection, so it
> would be faster.

Yes, indeed.

> 
>> Hmm, if you do not like it, how about moving zero-page detection to the worker thread?
> 
> That would be interesting to try.
> 

Okay, I will try it then. :)
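
For anyone who wants to reproduce numbers of the kind quoted above outside of
QEMU, a rough standalone microbenchmark could look like the following (a
sketch only, assuming 4 KiB pages: it uses a plain byte scan instead of
QEMU's accelerated buffer_is_zero, and zlib's compress2() for the compression
side; build with something like "gcc -std=gnu99 -O2 bench.c -lz", plus -lrt
on older glibc):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <zlib.h>

#define PAGE_SIZE 4096

static bool is_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i]) {
            return false;
        }
    }
    return true;
}

static uint64_t ns_now(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(int argc, char **argv)
{
    static uint8_t page[PAGE_SIZE];        /* all zero by default */
    static uint8_t out[PAGE_SIZE * 2];     /* larger than compressBound(PAGE_SIZE) */
    uLongf out_len;
    uint64_t t0, t1, t2;
    volatile bool z;

    /* Poke the page with a runtime value (0 when run without arguments)
     * so the compiler cannot constant-fold the zero check away. */
    page[PAGE_SIZE - 1] = (uint8_t)(argc - 1);

    for (int i = 0; i < 10; i++) {
        t0 = ns_now();
        z = is_zero(page, PAGE_SIZE);
        t1 = ns_now();
        out_len = sizeof(out);
        compress2(out, &out_len, page, PAGE_SIZE, 1);
        t2 = ns_now();
        printf("Zero %llu ns Compression: %llu ns (%lu bytes)\n",
               (unsigned long long)(t1 - t0),
               (unsigned long long)(t2 - t1),
               (unsigned long)out_len);
    }
    (void)z;
    return 0;
}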

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [Qemu-devel] [PATCH 06/12] migration: do not detect zero page for compression
@ 2018-07-03  3:53           ` Xiao Guangrong
  0 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-07-03  3:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, pbonzini, mst, mtosatti, qemu-devel, kvm, jiang.biao2,
	wei.w.wang, Xiao Guangrong



On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:
> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>
>> Hi Peter,
>>
>> Sorry for the delay as i was busy on other things.
>>
>> On 06/19/2018 03:30 PM, Peter Xu wrote:
>>> On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> Detecting zero page is not a light work, we can disable it
>>>> for compression that can handle all zero data very well
>>>
>>> Is there any number shows how the compression algo performs better
>>> than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
>>> be fast, depending on how init_accel() is done in util/bufferiszero.c.
>>
>> This is the comparison between zero-detection and compression (the target
>> buffer is all zero bit):
>>
>> Zero 810 ns Compression: 26905 ns.
>> Zero 417 ns Compression: 8022 ns.
>> Zero 408 ns Compression: 7189 ns.
>> Zero 400 ns Compression: 7255 ns.
>> Zero 412 ns Compression: 7016 ns.
>> Zero 411 ns Compression: 7035 ns.
>> Zero 413 ns Compression: 6994 ns.
>> Zero 399 ns Compression: 7024 ns.
>> Zero 416 ns Compression: 7053 ns.
>> Zero 405 ns Compression: 7041 ns.
>>
>> Indeed, zero-detection is faster than compression.
>>
>> However during our profiling for the live_migration thread (after reverted this patch),
>> we noticed zero-detection cost lots of CPU:
>>
>>   12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2                                                                                                                                                                           ◆
> 
> Interesting; what host are you running on?
> Some hosts have support for the faster buffer_zero_ss4/avx2

The host is:

model name	: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
...
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
  mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
  rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
  ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
  tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
  cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
  hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
  clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
  cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke

I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe is due to too old glib/gcc
version:
    gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
    glibc.x86_64                     2.12


> 
>>    7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
>>    6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
>>    5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
>>    5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
>>    4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
>>    4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
>>    3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
>>    2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
>>    2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
>>    2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
>>    2.25%  kqemu  qemu-system-x86_64            [.] ring_get
>>    1.96%  kqemu  libc-2.12.so                  [.] memcpy
>>
>> After this patch, the workload is moved to the worker thread, is it
>> acceptable?
>>
>>>
>>>   From compression rate POV of course zero page algo wins since it
>>> contains no data (but only a flag).
>>>
>>
>> Yes it is. The compressed zero page is 45 bytes, which is small enough I think.
> 
> So the compression is ~20x slower and 10x the size; not a great
> improvement!
> 
> However, the tricky thing is that in the case of a guest which is mostly
> non-zero, this patch would save that time used by zero detection, so it
> would be faster.

Yes, indeed.

> 
>> Hmm, if you do not like it, how about moving zero-page detection to the worker thread?
> 
> That would be interesting to try.
> 

Okay, I will try it then. :)
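
For reference, the comparison above can be reproduced with a standalone micro-benchmark
along the following lines. It is plain C with zlib and a naive byte loop standing in for
buffer_is_zero(), so it only illustrates the shape of the measurement, not QEMU's exact
code paths or numbers:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <zlib.h>   /* link with -lz (and -lrt for clock_gettime on old glibc) */

#define PAGE_SIZE 4096

/* naive zero check standing in for buffer_is_zero() */
static int page_is_zero(const uint8_t *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i]) {
            return 0;
        }
    }
    return 1;
}

static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000 + ts.tv_nsec;
}

int main(void)
{
    static uint8_t page[PAGE_SIZE];       /* all-zero source page */
    uint8_t out[PAGE_SIZE + 128];         /* large enough for a compressed zero page */
    uLongf out_len;
    int64_t t0, t1, t2;
    volatile int zero;
    int i;

    for (i = 0; i < 10; i++) {
        t0 = now_ns();
        zero = page_is_zero(page, PAGE_SIZE);
        t1 = now_ns();

        out_len = sizeof(out);
        compress2(out, &out_len, page, PAGE_SIZE, 1);
        t2 = now_ns();

        printf("Zero %lld ns Compression: %lld ns (to %lu bytes)\n",
               (long long)(t1 - t0), (long long)(t2 - t1),
               (unsigned long)out_len);
    }
    return zero ? 0 : 1;
}

Running enough iterations lets the first, cache-cold sample (like the 810 ns one above)
be discounted.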

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 07/12] migration: hold the lock only if it is really needed
  2018-06-29 11:22         ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-07-03  6:27           ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-07-03  6:27 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, Peter Xu,
	wei.w.wang, jiang.biao2, pbonzini



On 06/29/2018 07:22 PM, Dr. David Alan Gilbert wrote:
> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>
>>
>> On 06/19/2018 03:36 PM, Peter Xu wrote:
>>> On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> Try to hold src_page_req_mutex only if the queue is not
>>>> empty
>>>
>>> Pure question: how much this patch would help?  Basically if you are
>>> running compression tests then I think it means you are with precopy
>>> (since postcopy cannot work with compression yet), then here the lock
>>> has no contention at all.
>>
>> Yes, you are right, however we can observe it is in the top functions
>> (after revert this patch):
> 
> Can you show the matching trace with the patch in?

Sure, here it is:

+   8.38%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
+   8.03%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
+   6.62%  kqemu  qemu-system-x86_64       [.] qemu_event_set
+   6.02%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
+   5.81%  kqemu  qemu-system-x86_64       [.] __ring_put
+   5.04%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
+   4.48%  kqemu  qemu-system-x86_64       [.] ring_is_full
+   4.44%  kqemu  qemu-system-x86_64       [.] ring_mp_get
+   3.39%  kqemu  qemu-system-x86_64       [.] __ring_is_full
+   2.61%  kqemu  qemu-system-x86_64       [.] add_to_iovec
+   2.48%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
+   2.08%  kqemu  libc-2.12.so             [.] memcpy
+   2.07%  kqemu  qemu-system-x86_64       [.] ring_len
+   1.91%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
+   1.60%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
+   1.16%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
+   1.14%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
+   1.12%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
+   1.09%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
+   1.07%  kqemu  qemu-system-x86_64       [.] test_and_clear_bit
+   1.07%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
+   1.03%  kqemu  qemu-system-x86_64       [.] qemu_put_byte
+   0.80%  kqemu  qemu-system-x86_64       [.] threads_submit_request_commit
+   0.74%  kqemu  qemu-system-x86_64       [.] migration_bitmap_clear_dirty
+   0.70%  kqemu  qemu-system-x86_64       [.] control_save_page
+   0.69%  kqemu  qemu-system-x86_64       [.] test_bit
+   0.69%  kqemu  qemu-system-x86_64       [.] ram_save_iterate
+   0.63%  kqemu  qemu-system-x86_64       [.] migration_bitmap_find_dirty
+   0.63%  kqemu  qemu-system-x86_64       [.] ram_control_save_page
+   0.62%  kqemu  qemu-system-x86_64       [.] rcu_read_lock
+   0.56%  kqemu  qemu-system-x86_64       [.] qemu_file_get_error
+   0.55%  kqemu  [kernel.kallsyms]        [k] lock_acquire
+   0.55%  kqemu  qemu-system-x86_64       [.] find_dirty_block
+   0.54%  kqemu  qemu-system-x86_64       [.] ring_index
+   0.53%  kqemu  qemu-system-x86_64       [.] ring_put
+   0.51%  kqemu  qemu-system-x86_64       [.] unqueue_page
+   0.50%  kqemu  qemu-system-x86_64       [.] migrate_use_compression
+   0.48%  kqemu  qemu-system-x86_64       [.] get_queued_page
+   0.46%  kqemu  qemu-system-x86_64       [.] ring_get
+   0.46%  kqemu  [i40e]                   [k] i40e_clean_tx_irq
+   0.45%  kqemu  [kernel.kallsyms]        [k] lock_release
+   0.44%  kqemu  [kernel.kallsyms]        [k] native_sched_clock
+   0.38%  kqemu  qemu-system-x86_64       [.] migrate_get_current
+   0.38%  kqemu  [kernel.kallsyms]        [k] find_held_lock
+   0.34%  kqemu  [kernel.kallsyms]        [k] __lock_release
+   0.34%  kqemu  qemu-system-x86_64       [.] qemu_ram_pagesize
+   0.29%  kqemu  [kernel.kallsyms]        [k] lock_is_held_type
+   0.27%  kqemu  [kernel.kallsyms]        [k] update_load_avg
+   0.27%  kqemu  qemu-system-x86_64       [.] save_page_use_compression
+   0.24%  kqemu  qemu-system-x86_64       [.] qemu_file_rate_limit
+   0.23%  kqemu  [kernel.kallsyms]        [k] tcp_sendmsg
+   0.23%  kqemu  [kernel.kallsyms]        [k] match_held_lock
+   0.22%  kqemu  [kernel.kallsyms]        [k] do_raw_spin_trylock
+   0.22%  kqemu  [kernel.kallsyms]        [k] cyc2ns_read_begin
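
(For readers following along: the check-before-lock pattern this patch applies to
src_page_req_mutex boils down to the sketch below. The types and names are illustrative,
not the actual QEMU code; a real implementation would also make the unlocked peek an
atomic read.)

#include <pthread.h>
#include <stddef.h>

struct page_request {
    struct page_request *next;
};

struct request_queue {
    pthread_mutex_t lock;                /* stands in for src_page_req_mutex */
    struct page_request *head;           /* NULL when the queue is empty */
};

static struct page_request *try_unqueue(struct request_queue *q)
{
    struct page_request *req = NULL;

    /*
     * Cheap unlocked peek: in the common precopy case the queue is
     * empty, so the mutex (and its atomic operations) is skipped
     * entirely.  A request enqueued right after this check is simply
     * picked up on the next call.
     */
    if (q->head == NULL) {
        return NULL;
    }

    pthread_mutex_lock(&q->lock);
    if (q->head != NULL) {               /* re-check under the lock */
        req = q->head;
        q->head = req->next;
    }
    pthread_mutex_unlock(&q->lock);
    return req;
}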

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-29 13:08         ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-07-03  7:31           ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-07-03  7:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, mtosatti, Xiao Guangrong, qemu-devel, peterx, dgilbert,
	wei.w.wang, jiang.biao2, pbonzini



On 06/29/2018 09:08 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 29, 2018 at 03:30:44PM +0800, Xiao Guangrong wrote:
>>
>> Hi Michael,
>>
>> On 06/20/2018 08:38 PM, Michael S. Tsirkin wrote:
>>> On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>>>>
>>>>
>>>> (1) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/kfifo.h
>>>> (2) http://dpdk.org/doc/api/rte__ring_8h.html
>>>>
>>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>
>>> So instead of all this super-optimized trickiness, how about
>>> a simple port of ptr_ring from linux?
>>>
>>> That one isn't lockless but it's known to outperform
>>> most others for a single producer/single consumer case.
>>> And with a ton of networking going on,
>>> who said it's such a hot spot? OTOH this implementation
>>> has more barriers which slows down each individual thread.
>>> It's also a source of bugs.
>>>
>>
>> Thank you for pointing it out.
>>
>> I just quickly went through the code of ptr_ring that is very nice and
>> really impressive. I will consider to port it to QEMU.
> 
> The port is pretty trivial. See below. It's a SPSC structure though.  So
> you need to use it with lock.  Given the critical section is small, I

Why put these locks into this common struct? In our case, each thread
has its own ring which is SPSC, so no lock is needed at all. Atomic operations
still slow things down; see [PATCH 07/12] "migration: hold the lock only if
it is really needed". I'd move the inner locks to the user instead.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 09/12] ring: introduce lockless ring buffer
  2018-06-29  3:55           ` [Qemu-devel] " Xiao Guangrong
@ 2018-07-03 15:55             ` Paul E. McKenney
  -1 siblings, 0 replies; 156+ messages in thread
From: Paul E. McKenney @ 2018-07-03 15:55 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, peterz, Lai Jiangshan, stefani, mtosatti,
	Xiao Guangrong, qemu-devel, Peter Xu, dgilbert, Wei Wang,
	jiang.biao2, pbonzini

On Fri, Jun 29, 2018 at 11:55:08AM +0800, Xiao Guangrong wrote:
> 
> 
> On 06/28/2018 07:55 PM, Wei Wang wrote:
> >On 06/28/2018 06:02 PM, Xiao Guangrong wrote:
> >>
> >>CC: Paul, Peter Zijlstra, Stefani, Lai who are all good at memory barrier.
> >>
> >>
> >>On 06/20/2018 12:52 PM, Peter Xu wrote:
> >>>On Mon, Jun 04, 2018 at 05:55:17PM +0800, guangrong.xiao@gmail.com wrote:
> >>>>From: Xiao Guangrong <xiaoguangrong@tencent.com>
> >>>>
> >>>>It's the simple lockless ring buffer implement which supports both
> >>>>single producer vs. single consumer and multiple producers vs.
> >>>>single consumer.
> >>>>
> >>>>Many lessons were learned from Linux Kernel's kfifo (1) and DPDK's
> >>>>rte_ring (2) before i wrote this implement. It corrects some bugs of
> >>>>memory barriers in kfifo and it is the simpler lockless version of
> >>>>rte_ring as currently multiple access is only allowed for producer.
> >>>
> >>>Could you provide some more information about the kfifo bug? Any
> >>>pointer would be appreciated.
> >>>
> >>
> >>Sure, i reported one of the memory barrier issue to linux kernel:
> >>   https://lkml.org/lkml/2018/5/11/58
> >>
> >>Actually, beside that, there is another memory barrier issue in kfifo,
> >>please consider this case:
> >>
> >>   at the beginning
> >>   ring->size = 4
> >>   ring->out = 0
> >>   ring->in = 4
> >>
> >>     Consumer                            Producer
> >> ---------------                     --------------
> >>   index = ring->out; /* index == 0 */
> >>   ring->out++; /* ring->out == 1 */
> >>   < Re-Order >
> >>                                    out = ring->out;
> >>                                    if (ring->in - out >= ring->mask)
> >>                                        return -EFULL;
> >>                                    /* see the ring is not full */
> >>                                    index = ring->in & ring->mask; /* index == 0 */
> >>                                    ring->data[index] = new_data;
> >>                     ring->in++;
> >>
> >>   data = ring->data[index];
> >>   !!!!!! the old data is lost !!!!!!
> >>
> >>So we need to make sure:
> >>1) for the consumer, we should read the ring->data[] out before updating ring->out
> >>2) for the producer, we should read ring->out before updating ring->data[]
> >>
> >>as followings:
> >>      Producer                                       Consumer
> >>  ------------------------------------ ------------------------
> >>      Reading ring->out                            Reading ring->data[index]
> >>      smp_mb()                                     smp_mb()
> >>      Setting ring->data[index] = data ring->out++
> >>
> >>[ i used atomic_store_release() and atomic_load_acquire() instead of smp_mb() in the
> >>  patch. ]
> >>
> >>But i am not sure if we can use smp_acquire__after_ctrl_dep() in the producer?
> >
> >
> >I wonder if this could be solved by simply tweaking the above consumer implementation:
> >
> >[1] index = ring->out;
> >[2] data = ring->data[index];
> >[3] index++;
> >[4] ring->out = index;
> >
> >Now [2] and [3] forms a WAR dependency, which avoids the reordering.
> 
> It cannot. [2] and [4] still do not have any dependency; the CPU and the compiler can
> just eliminate 'index'.

One thing to try would be the Linux-kernel memory model tools in
tools/memory-model in current mainline.  There is a README file describing
how to install and set it up, with a number of files in Documentation
and litmus-tests that can help guide you.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 07/12] migration: hold the lock only if it is really needed
  2018-06-28  9:33       ` [Qemu-devel] " Xiao Guangrong
@ 2018-07-11  8:21         ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-07-11  8:21 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini

On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
> 
> 
> On 06/19/2018 03:36 PM, Peter Xu wrote:
> > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > Try to hold src_page_req_mutex only if the queue is not
> > > empty
> > 
> > Pure question: how much this patch would help?  Basically if you are
> > running compression tests then I think it means you are with precopy
> > (since postcopy cannot work with compression yet), then here the lock
> > has no contention at all.
> 
> Yes, you are right, however we can observe it is in the top functions
> (after revert this patch):
> 
> Samples: 29K of event 'cycles', Event count (approx.): 22263412260
> +   7.99%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
> +   6.95%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
> +   6.23%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
> +   6.20%  kqemu  qemu-system-x86_64       [.] qemu_event_set
> +   5.80%  kqemu  qemu-system-x86_64       [.] __ring_put
> +   4.82%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
> +   4.11%  kqemu  qemu-system-x86_64       [.] ring_is_full
> +   3.07%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
> +   2.83%  kqemu  qemu-system-x86_64       [.] ring_mp_get
> +   2.71%  kqemu  qemu-system-x86_64       [.] __ring_is_full
> +   2.46%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
> +   2.40%  kqemu  qemu-system-x86_64       [.] add_to_iovec
> +   2.21%  kqemu  qemu-system-x86_64       [.] ring_get
> +   1.96%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
> +   1.90%  kqemu  libc-2.12.so             [.] memcpy
> +   1.55%  kqemu  qemu-system-x86_64       [.] ring_len
> +   1.12%  kqemu  libpthread-2.12.so       [.] pthread_mutex_unlock
> +   1.11%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
> +   1.07%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
> +   1.04%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
> +   0.97%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
> +   0.96%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
> +   0.93%  kqemu  libpthread-2.12.so       [.] pthread_mutex_lock

(sorry to respond late; I was busy with other stuff for the
 release...)

I am trying to find out anything related to unqueue_page() but I
failed.  Did I miss anything obvious there?

> 
> I guess its atomic operations cost CPU resource and check-before-lock is
> a common tech, i think it shouldn't have side effect, right? :)

Yeah it makes sense to me. :)

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 07/12] migration: hold the lock only if it is really needed
  2018-07-11  8:21         ` [Qemu-devel] " Peter Xu
@ 2018-07-12  7:47           ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-07-12  7:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini



On 07/11/2018 04:21 PM, Peter Xu wrote:
> On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
>>
>>
>> On 06/19/2018 03:36 PM, Peter Xu wrote:
>>> On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> Try to hold src_page_req_mutex only if the queue is not
>>>> empty
>>>
>>> Pure question: how much this patch would help?  Basically if you are
>>> running compression tests then I think it means you are with precopy
>>> (since postcopy cannot work with compression yet), then here the lock
>>> has no contention at all.
>>
>> Yes, you are right, however we can observe it is in the top functions
>> (after revert this patch):
>>
>> Samples: 29K of event 'cycles', Event count (approx.): 22263412260
>> +   7.99%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
>> +   6.95%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
>> +   6.23%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
>> +   6.20%  kqemu  qemu-system-x86_64       [.] qemu_event_set
>> +   5.80%  kqemu  qemu-system-x86_64       [.] __ring_put
>> +   4.82%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
>> +   4.11%  kqemu  qemu-system-x86_64       [.] ring_is_full
>> +   3.07%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
>> +   2.83%  kqemu  qemu-system-x86_64       [.] ring_mp_get
>> +   2.71%  kqemu  qemu-system-x86_64       [.] __ring_is_full
>> +   2.46%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
>> +   2.40%  kqemu  qemu-system-x86_64       [.] add_to_iovec
>> +   2.21%  kqemu  qemu-system-x86_64       [.] ring_get
>> +   1.96%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
>> +   1.90%  kqemu  libc-2.12.so             [.] memcpy
>> +   1.55%  kqemu  qemu-system-x86_64       [.] ring_len
>> +   1.12%  kqemu  libpthread-2.12.so       [.] pthread_mutex_unlock
>> +   1.11%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
>> +   1.07%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
>> +   1.04%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
>> +   0.97%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
>> +   0.96%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
>> +   0.93%  kqemu  libpthread-2.12.so       [.] pthread_mutex_lock
> 
> (sorry to respond late; I was busy with other stuff for the
>   release...)
> 

You're welcome. :)

> I am trying to find out anything related to unqueue_page() but I
> failed.  Did I miss anything obvious there?
> 

Indeed, unqueue_page() was not listed here. I think the function
itself is light enough (a check, then an immediate return) that it
did not leave a trace here.

This perf data was taken after reverting this patch, i.e. it is
based on the lockless multithread model, so unqueue_page() is
the only place using a mutex in the main thread.

And you can see the mutex overhead is gone after applying
this patch, in the mail I replied to Dave.

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 07/12] migration: hold the lock only if it is really needed
  2018-07-12  7:47           ` [Qemu-devel] " Xiao Guangrong
@ 2018-07-12  8:26             ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-07-12  8:26 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini

On Thu, Jul 12, 2018 at 03:47:57PM +0800, Xiao Guangrong wrote:
> 
> 
> On 07/11/2018 04:21 PM, Peter Xu wrote:
> > On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
> > > 
> > > 
> > > On 06/19/2018 03:36 PM, Peter Xu wrote:
> > > > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > > 
> > > > > Try to hold src_page_req_mutex only if the queue is not
> > > > > empty
> > > > 
> > > > Pure question: how much this patch would help?  Basically if you are
> > > > running compression tests then I think it means you are with precopy
> > > > (since postcopy cannot work with compression yet), then here the lock
> > > > has no contention at all.
> > > 
> > > Yes, you are right, however we can observe it is in the top functions
> > > (after revert this patch):
> > > 
> > > Samples: 29K of event 'cycles', Event count (approx.): 22263412260
> > > +   7.99%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
> > > +   6.95%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
> > > +   6.23%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
> > > +   6.20%  kqemu  qemu-system-x86_64       [.] qemu_event_set
> > > +   5.80%  kqemu  qemu-system-x86_64       [.] __ring_put
> > > +   4.82%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
> > > +   4.11%  kqemu  qemu-system-x86_64       [.] ring_is_full
> > > +   3.07%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
> > > +   2.83%  kqemu  qemu-system-x86_64       [.] ring_mp_get
> > > +   2.71%  kqemu  qemu-system-x86_64       [.] __ring_is_full
> > > +   2.46%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
> > > +   2.40%  kqemu  qemu-system-x86_64       [.] add_to_iovec
> > > +   2.21%  kqemu  qemu-system-x86_64       [.] ring_get
> > > +   1.96%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
> > > +   1.90%  kqemu  libc-2.12.so             [.] memcpy
> > > +   1.55%  kqemu  qemu-system-x86_64       [.] ring_len
> > > +   1.12%  kqemu  libpthread-2.12.so       [.] pthread_mutex_unlock
> > > +   1.11%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
> > > +   1.07%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
> > > +   1.04%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
> > > +   0.97%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
> > > +   0.96%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
> > > +   0.93%  kqemu  libpthread-2.12.so       [.] pthread_mutex_lock
> > 
> > (sorry to respond late; I was busy with other stuff for the
> >   release...)
> > 
> 
> You're welcome. :)
> 
> > I am trying to find out anything related to unqueue_page() but I
> > failed.  Did I miss anything obvious there?
> > 
> 
> unqueue_page() was not listed here indeed, i think the function
> itself is light enough (a check then directly return) so it
> did not leave a trace here.
> 
> This perf data was got after reverting this patch, i.e, it's
> based on the lockless multithread model, then unqueue_page() is
> the only place using mutex in the main thread.
> 
> And you can see the overload of mutext was gone after applying
> this patch in the mail i replied to Dave.

I see.  It's not a big portion of CPU resource, though of course I
have no reason to object to this change either.

Actually what interests me more is why ram_bytes_total() is such a
hot spot.  AFAIU it's only called once per ram_find_and_save_block()
call, and it should be mostly a constant if we don't plug/unplug
memory.  Not sure whether that means it's a better spot to work on.
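
(As an illustration of the point above: ram_bytes_total() presumably walks the RAM block
list on every call, and since the sum only changes on memory plug/unplug it could in
principle be computed once and cached. This is a sketch with made-up types, not a
proposed patch.)

struct ram_block_sketch {
    struct ram_block_sketch *next;
    unsigned long used_length;
};

/* what the profile suggests: a full list walk per call */
static unsigned long ram_bytes_total_walk(struct ram_block_sketch *head)
{
    struct ram_block_sketch *b;
    unsigned long total = 0;

    for (b = head; b; b = b->next) {
        total += b->used_length;
    }
    return total;
}

/* the sum only changes on plug/unplug, so it could be cached and
 * invalidated from the hotplug path instead of recomputed each time */
static unsigned long ram_bytes_total_cached(struct ram_block_sketch *head)
{
    static unsigned long cached;

    if (!cached) {
        cached = ram_bytes_total_walk(head);
    }
    return cached;
}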

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 10/12] migration: introduce lockless multithreads model
  2018-06-20  6:52     ` [Qemu-devel] " Peter Xu
@ 2018-07-13 16:24       ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-13 16:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, wei.w.wang,
	guangrong.xiao, jiang.biao2, pbonzini

* Peter Xu (peterx@redhat.com) wrote:
> On Mon, Jun 04, 2018 at 05:55:18PM +0800, guangrong.xiao@gmail.com wrote:
> > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > 
> > Current implementation of compression and decompression are very
> > hard to be enabled on productions. We noticed that too many wait-wakes
> > go to kernel space and CPU usages are very low even if the system
> > is really free
> > 
> > The reasons are:
> > 1) there are two many locks used to do synchronous,there
> >   is a global lock and each single thread has its own lock,
> >   migration thread and work threads need to go to sleep if
> >   these locks are busy
> > 
> > 2) migration thread separately submits request to the thread
> >    however, only one request can be pended, that means, the
> >    thread has to go to sleep after finishing the request
> > 
> > To make it work better, we introduce a new multithread model,
> > the user, currently it is the migration thread, submits request
> > to each thread with round-robin manner, the thread has its own
> > ring whose capacity is 4 and puts the result to a global ring
> > which is lockless for multiple producers, the user fetches result
> > out from the global ring and do remaining operations for the
> > request, e.g, posting the compressed data out for migration on
> > the source QEMU
> > 
> > Performance Result:
> > The test was based on top of the patch:
> >    ring: introduce lockless ring buffer
> > that means, previous optimizations are used for both of original case
> > and applying the new multithread model
> > 
> > We tested live migration on two hosts:
> >    Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz * 64 + 256G memory
> > to migration a VM between each other, which has 16 vCPUs and 60G
> > memory, during the migration, multiple threads are repeatedly writing
> > the memory in the VM
> > 
> > We used 16 threads on the destination to decompress the data and on the
> > source, we tried 8 threads and 16 threads to compress the data
> > 
> > --- Before our work ---
> > migration can not be finished for both 8 threads and 16 threads. The data
> > is as followings:
> > 
> > Use 8 threads to compress:
> > - on the source:
> > 	    migration thread   compress-threads
> > CPU usage       70%          some use 36%, others are very low ~20%
> > - on the destination:
> >             main thread        decompress-threads
> > CPU usage       100%         some use ~40%, other are very low ~2%
> > 
> > Migration status (CAN NOT FINISH):
> > info migrate
> > globals:
> > store-global-state: on
> > only-migratable: off
> > send-configuration: on
> > send-section-footer: on
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> > Migration status: active
> > total time: 1019540 milliseconds
> > expected downtime: 2263 milliseconds
> > setup: 218 milliseconds
> > transferred ram: 252419995 kbytes
> > throughput: 2469.45 mbps
> > remaining ram: 15611332 kbytes
> > total ram: 62931784 kbytes
> > duplicate: 915323 pages
> > skipped: 0 pages
> > normal: 59673047 pages
> > normal bytes: 238692188 kbytes
> > dirty sync count: 28
> > page size: 4 kbytes
> > dirty pages rate: 170551 pages
> > compression pages: 121309323 pages
> > compression busy: 60588337
> > compression busy rate: 0.36
> > compression reduced size: 484281967178
> > compression rate: 0.97
> > 
> > Use 16 threads to compress:
> > - on the source:
> > 	    migration thread   compress-threads
> > CPU usage       96%          some use 45%, others are very low ~6%
> > - on the destination:
> >             main thread        decompress-threads
> > CPU usage       96%         some use 58%, other are very low ~10%
> > 
> > Migration status (CAN NOT FINISH):
> > info migrate
> > globals:
> > store-global-state: on
> > only-migratable: off
> > send-configuration: on
> > send-section-footer: on
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> > Migration status: active
> > total time: 1189221 milliseconds
> > expected downtime: 6824 milliseconds
> > setup: 220 milliseconds
> > transferred ram: 90620052 kbytes
> > throughput: 840.41 mbps
> > remaining ram: 3678760 kbytes
> > total ram: 62931784 kbytes
> > duplicate: 195893 pages
> > skipped: 0 pages
> > normal: 17290715 pages
> > normal bytes: 69162860 kbytes
> > dirty sync count: 33
> > page size: 4 kbytes
> > dirty pages rate: 175039 pages
> > compression pages: 186739419 pages
> > compression busy: 17486568
> > compression busy rate: 0.09
> > compression reduced size: 744546683892
> > compression rate: 0.97
> > 
> > --- After our work ---
> > Migration can be finished quickly for both 8 threads and 16 threads. The
> > data is as followings:
> > 
> > Use 8 threads to compress:
> > - on the source:
> > 	    migration thread   compress-threads
> > CPU usage       30%               30% (all threads have same CPU usage)
> > - on the destination:
> >             main thread        decompress-threads
> > CPU usage       100%              50% (all threads have same CPU usage)
> > 
> > Migration status (finished in 219467 ms):
> > info migrate
> > globals:
> > store-global-state: on
> > only-migratable: off
> > send-configuration: on
> > send-section-footer: on
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> > Migration status: completed
> > total time: 219467 milliseconds
> > downtime: 115 milliseconds
> > setup: 222 milliseconds
> > transferred ram: 88510173 kbytes
> > throughput: 3303.81 mbps
> > remaining ram: 0 kbytes
> > total ram: 62931784 kbytes
> > duplicate: 2211775 pages
> > skipped: 0 pages
> > normal: 21166222 pages
> > normal bytes: 84664888 kbytes
> > dirty sync count: 15
> > page size: 4 kbytes
> > compression pages: 32045857 pages
> > compression busy: 23377968
> > compression busy rate: 0.34
> > compression reduced size: 127767894329
> > compression rate: 0.97
> > 
> > Use 16 threads to compress:
> > - on the source:
> > 	    migration thread   compress-threads
> > CPU usage       60%               60% (all threads have same CPU usage)
> > - on the destination:
> >             main thread        decompress-threads
> > CPU usage       100%              75% (all threads have same CPU usage)
> > 
> > Migration status (finished in 64118 ms):
> > info migrate
> > globals:
> > store-global-state: on
> > only-migratable: off
> > send-configuration: on
> > send-section-footer: on
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off
> > Migration status: completed
> > total time: 64118 milliseconds
> > downtime: 29 milliseconds
> > setup: 223 milliseconds
> > transferred ram: 13345135 kbytes
> > throughput: 1705.10 mbps
> > remaining ram: 0 kbytes
> > total ram: 62931784 kbytes
> > duplicate: 574921 pages
> > skipped: 0 pages
> > normal: 2570281 pages
> > normal bytes: 10281124 kbytes
> > dirty sync count: 9
> > page size: 4 kbytes
> > compression pages: 28007024 pages
> > compression busy: 3145182
> > compression busy rate: 0.08
> > compression reduced size: 111829024985
> > compression rate: 0.97
> 
> Not sure how other people think; for me this information suits a cover
> letter better.  For the commit message, I would prefer to know
> something like: what this thread model can do; how the APIs are
> designed and used; what the limitations are, etc.  After all, until this
> patch nothing is using the new model yet, so these numbers are a bit
> misleading.

I think it's OK to justify the need for such a large change, but do
that in the main cover letter.
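
As a condensed illustration of the kind of description being asked for, the submission
side of the model in the quoted commit message (per-thread rings of capacity 4, filled
round-robin by the user) can be sketched as follows; the structures and names are
simplified and illustrative, and the real series adds the memory-ordering and wake-up
details omitted here:

#define NR_THREADS       8
#define THREAD_RING_LEN  4               /* per-thread request ring capacity */

struct request;                          /* e.g. a page to be compressed */

struct worker {
    struct request *ring[THREAD_RING_LEN];   /* SPSC: user -> worker */
    unsigned int in, out;
};

struct threads_sketch {
    struct worker workers[NR_THREADS];
    unsigned int next;                   /* round-robin cursor */
};

/* user side: hand a request to the next worker whose ring has room */
static int submit_request(struct threads_sketch *t, struct request *req)
{
    unsigned int i;

    for (i = 0; i < NR_THREADS; i++) {
        struct worker *w = &t->workers[t->next];

        t->next = (t->next + 1) % NR_THREADS;
        if (w->in - w->out < THREAD_RING_LEN) {      /* ring not full */
            w->ring[w->in % THREAD_RING_LEN] = req;
            w->in++;             /* a release store in the real code */
            return 0;
        }
    }
    return -1;   /* every worker is busy: the caller falls back or retries */
}

/* completions are pushed by the workers onto one global multi-producer
 * ring and drained by the user, which then posts the compressed data out */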

> > 
> > Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> > ---
> >  migration/Makefile.objs |   1 +
> >  migration/threads.c     | 265 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  migration/threads.h     | 116 +++++++++++++++++++++
> 
> Again, this model seems to be suitable for scenarios even outside
> migration.  So I'm not sure whether you'd like to generalize it (I
> still see e.g. constants and comments related to migration, but there
> aren't many) and put it into util/.

We've already got at least one thread pool; so take care to
differentiate this one from it (I don't know the details of the existing one)

> >  3 files changed, 382 insertions(+)
> >  create mode 100644 migration/threads.c
> >  create mode 100644 migration/threads.h
> > 
> > diff --git a/migration/Makefile.objs b/migration/Makefile.objs
> > index c83ec47ba8..bdb61a7983 100644
> > --- a/migration/Makefile.objs
> > +++ b/migration/Makefile.objs
> > @@ -7,6 +7,7 @@ common-obj-y += qemu-file-channel.o
> >  common-obj-y += xbzrle.o postcopy-ram.o
> >  common-obj-y += qjson.o
> >  common-obj-y += block-dirty-bitmap.o
> > +common-obj-y += threads.o
> >  
> >  common-obj-$(CONFIG_RDMA) += rdma.o
> >  
> > diff --git a/migration/threads.c b/migration/threads.c
> > new file mode 100644
> > index 0000000000..eecd3229b7
> > --- /dev/null
> > +++ b/migration/threads.c
> > @@ -0,0 +1,265 @@
> > +#include "threads.h"
> > +
> > +/* retry to see if there is an available request before actually going to wait. */
> > +#define BUSY_WAIT_COUNT 1000
> > +
> > +static void *thread_run(void *opaque)
> > +{
> > +    ThreadLocal *self_data = (ThreadLocal *)opaque;
> > +    Threads *threads = self_data->threads;
> > +    void (*handler)(ThreadRequest *data) = threads->thread_request_handler;
> > +    ThreadRequest *request;
> > +    int count, ret;
> > +
> > +    for ( ; !atomic_read(&self_data->quit); ) {
> > +        qemu_event_reset(&self_data->ev);
> > +
> > +        count = 0;
> > +        while ((request = ring_get(self_data->request_ring)) ||
> > +            count < BUSY_WAIT_COUNT) {
> > +             /*
> > +             * wait some while before go to sleep so that the user
> > +             * needn't go to kernel space to wake up the consumer
> > +             * threads.
> > +             *
> > +             * That will waste some CPU resource indeed however it
> > +             * can significantly improve the case that the request
> > +             * will be available soon.
> > +             */
> > +             if (!request) {
> > +                cpu_relax();
> > +                count++;
> > +                continue;
> > +            }
> > +            count = 0;

Things like busywait counts probably need isolating somewhere;
getting those counts right is quite hard.

Dave
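
One way to isolate that policy, sketched below with invented names (nothing
here is from the series), is to pull the spin-then-sleep decision into a small
helper that a later patch could tune or make adaptive; thread_run() would then
keep a per-Threads WaitPolicy (defaulting to today's 1000) instead of the
hard-coded BUSY_WAIT_COUNT:

    #include <stdbool.h>

    typedef struct {
        unsigned int spin_budget;   /* relaxed polls before giving up and sleeping */
    } WaitPolicy;

    /* returns true when the caller should block on its event, false to poll again */
    static bool wait_policy_should_sleep(const WaitPolicy *policy, unsigned int *spins)
    {
        if (*spins < policy->spin_budget) {
            (*spins)++;             /* caller does cpu_relax() and retries */
            return false;
        }
        *spins = 0;                 /* reset for the next wake-up */
        return true;
    }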

> > +            handler(request);
> > +
> > +            do {
> > +                ret = ring_put(threads->request_done_ring, request);
> > +                /*
> > +                 * request_done_ring has enough room to contain all
> > +                 * requests; however, theoretically, it can still
> > +                 * fail if the ring's indexes overflow, which would
> > +                 * happen if more than 2^32 requests are
> 
> Could you elaborate why this ring_put() could fail, and why failure is
> somehow related to 2^32 overflow?
> 
> Firstly, I don't understand why it will fail.
> 
> Meanwhile, AFAIU your ring can even live well with that 2^32 overflow.
> Or did I misunderstand?
> 
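For what it's worth, the way kfifo-style rings usually tolerate free-running
32-bit indexes is that occupancy and slot numbers are computed modulo 2^32 and
masked with a power-of-two size, so the wrap itself is harmless; whether
ring.h relies on exactly that is for the author to confirm, but the arithmetic
can be checked standalone:

    #include <assert.h>
    #include <stdint.h>

    enum { RING_SIZE = 4, RING_MASK = RING_SIZE - 1 };    /* size must be 2^n */

    int main(void)
    {
        uint32_t in  = UINT32_MAX;        /* producer index about to wrap */
        uint32_t out = UINT32_MAX - 2;    /* consumer index */

        assert((uint32_t)(in - out) == 2);     /* two entries in flight */

        in += 2;                               /* producer wraps past 2^32 */
        assert((uint32_t)(in - out) == 4);     /* length still right: ring is full */

        assert((in & RING_MASK) == ((out + 4) & RING_MASK));   /* slots line up */
        return 0;
    }
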
> > +                 * handled between two calls of threads_wait_done().
> > +                 * So we retry to make the code more robust.
> > +                 *
> > +                 * This is unlikely for migration, as a memory block is
> > +                 * unlikely to be larger than 16T (2^32 pages).
> 
> (some migration-related comments; maybe we can remove that)
> 
> > +                 */
> > +                if (ret) {
> > +                    fprintf(stderr,
> > +                            "Potential BUG if it is triggered by migration.\n");
> > +                }
> > +            } while (ret);
> > +        }
> > +
> > +        qemu_event_wait(&self_data->ev);
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +static void add_free_request(Threads *threads, ThreadRequest *request)
> > +{
> > +    QSLIST_INSERT_HEAD(&threads->free_requests, request, node);
> > +    threads->free_requests_nr++;
> > +}
> > +
> > +static ThreadRequest *get_and_remove_first_free_request(Threads *threads)
> > +{
> > +    ThreadRequest *request;
> > +
> > +    if (QSLIST_EMPTY(&threads->free_requests)) {
> > +        return NULL;
> > +    }
> > +
> > +    request = QSLIST_FIRST(&threads->free_requests);
> > +    QSLIST_REMOVE_HEAD(&threads->free_requests, node);
> > +    threads->free_requests_nr--;
> > +    return request;
> > +}
> > +
> > +static void uninit_requests(Threads *threads, int free_nr)
> > +{
> > +    ThreadRequest *request;
> > +
> > +    /*
> > +     * all requests should have been released to the list if the threads are
> > +     * being destroyed, i.e. threads_wait_done() should be called first.
> > +     */
> > +    assert(threads->free_requests_nr == free_nr);
> > +
> > +    while ((request = get_and_remove_first_free_request(threads))) {
> > +        threads->thread_request_uninit(request);
> > +    }
> > +
> > +    assert(ring_is_empty(threads->request_done_ring));
> > +    ring_free(threads->request_done_ring);
> > +}
> > +
> > +static int init_requests(Threads *threads)
> > +{
> > +    ThreadRequest *request;
> > +    unsigned int done_ring_size = pow2roundup32(threads->total_requests);
> > +    int i, free_nr = 0;
> > +
> > +    threads->request_done_ring = ring_alloc(done_ring_size,
> > +                                            RING_MULTI_PRODUCER);
> > +
> > +    QSLIST_INIT(&threads->free_requests);
> > +    for (i = 0; i < threads->total_requests; i++) {
> > +        request = threads->thread_request_init();
> > +        if (!request) {
> > +            goto cleanup;
> > +        }
> > +
> > +        free_nr++;
> > +        add_free_request(threads, request);
> > +    }
> > +    return 0;
> > +
> > +cleanup:
> > +    uninit_requests(threads, free_nr);
> > +    return -1;
> > +}
> > +
> > +static void uninit_thread_data(Threads *threads)
> > +{
> > +    ThreadLocal *thread_local = threads->per_thread_data;
> > +    int i;
> > +
> > +    for (i = 0; i < threads->threads_nr; i++) {
> > +        thread_local[i].quit = true;
> > +        qemu_event_set(&thread_local[i].ev);
> > +        qemu_thread_join(&thread_local[i].thread);
> > +        qemu_event_destroy(&thread_local[i].ev);
> > +        assert(ring_is_empty(thread_local[i].request_ring));
> > +        ring_free(thread_local[i].request_ring);
> > +    }
> > +}
> > +
> > +static void init_thread_data(Threads *threads)
> > +{
> > +    ThreadLocal *thread_local = threads->per_thread_data;
> > +    char *name;
> > +    int i;
> > +
> > +    for (i = 0; i < threads->threads_nr; i++) {
> > +        qemu_event_init(&thread_local[i].ev, false);
> > +
> > +        thread_local[i].threads = threads;
> > +        thread_local[i].self = i;
> > +        thread_local[i].request_ring = ring_alloc(threads->thread_ring_size, 0);
> > +        name = g_strdup_printf("%s/%d", threads->name, thread_local[i].self);
> > +        qemu_thread_create(&thread_local[i].thread, name,
> > +                           thread_run, &thread_local[i], QEMU_THREAD_JOINABLE);
> > +        g_free(name);
> > +    }
> > +}
> > +
> > +/* the size of thread local request ring */
> > +#define THREAD_REQ_RING_SIZE 4
> > +
> > +Threads *threads_create(unsigned int threads_nr, const char *name,
> > +                        ThreadRequest *(*thread_request_init)(void),
> > +                        void (*thread_request_uninit)(ThreadRequest *request),
> > +                        void (*thread_request_handler)(ThreadRequest *request),
> > +                        void (*thread_request_done)(ThreadRequest *request))
> > +{
> > +    Threads *threads;
> > +    int ret;
> > +
> > +    threads = g_malloc0(sizeof(*threads) + threads_nr * sizeof(ThreadLocal));
> > +    threads->threads_nr = threads_nr;
> > +    threads->thread_ring_size = THREAD_REQ_RING_SIZE;
> 
> (If we're going to generalize this thread model, maybe you'd consider
>  allowing this ring size to be specified as well?)
> 
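A sketch of that, keeping today's behaviour when the caller passes 0;
pow2roundup32() and THREAD_REQ_RING_SIZE are from this patch, the rest is
hypothetical:

    /* derive the per-thread ring geometry from a caller-supplied size */
    static unsigned int pick_thread_ring_size(unsigned int requested)
    {
        if (requested == 0) {
            requested = THREAD_REQ_RING_SIZE;   /* today's default of 4 */
        }
        return pow2roundup32(requested);        /* the ring wants a power of two */
    }

    /*
     * threads_create() would grow an extra 'requested' parameter and do:
     *     threads->thread_ring_size = pick_thread_ring_size(requested);
     *     threads->total_requests   = threads->thread_ring_size * threads_nr;
     */
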
> > +    threads->total_requests = threads->thread_ring_size * threads_nr;
> > +
> > +    threads->name = name;
> > +    threads->thread_request_init = thread_request_init;
> > +    threads->thread_request_uninit = thread_request_uninit;
> > +    threads->thread_request_handler = thread_request_handler;
> > +    threads->thread_request_done = thread_request_done;
> > +
> > +    ret = init_requests(threads);
> > +    if (ret) {
> > +        g_free(threads);
> > +        return NULL;
> > +    }
> > +
> > +    init_thread_data(threads);
> > +    return threads;
> > +}
> > +
> > +void threads_destroy(Threads *threads)
> > +{
> > +    uninit_thread_data(threads);
> > +    uninit_requests(threads, threads->total_requests);
> > +    g_free(threads);
> > +}
> > +
> > +ThreadRequest *threads_submit_request_prepare(Threads *threads)
> > +{
> > +    ThreadRequest *request;
> > +    unsigned int index;
> > +
> > +    index = threads->current_thread_index % threads->threads_nr;
> 
> Why round-robin rather than simply finding an idle thread (still with
> valid free requests) and putting the request onto that?
> 
> Asked since I don't see much difficulty in achieving that; meanwhile, with
> round-robin I'm not sure whether it can happen that one thread gets stuck
> for some reason (e.g., scheduling?) while the rest of the threads are
> idle; would threads_submit_request_prepare() then be stuck on that
> hanging thread?
> 
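For comparison, the alternative being suggested would scan for any thread with
room, starting from the last one used, and only fail when every per-thread
ring is full; roughly (an untested sketch using the structures from this
patch):

    static int find_thread_with_room(Threads *threads)
    {
        unsigned int i, index;

        for (i = 0; i < threads->threads_nr; i++) {
            index = (threads->current_thread_index + i) % threads->threads_nr;
            if (!ring_is_full(threads->per_thread_data[index].request_ring)) {
                threads->current_thread_index = index + 1;
                return index;
            }
        }
        return -1;    /* all workers are busy */
    }

    /*
     * threads_submit_request_prepare() would use this instead of the strict
     * round-robin index, so one stalled worker no longer blocks submission
     * while its siblings sit idle.
     */
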
> > +
> > +    /* the thread is busy */
> > +    if (ring_is_full(threads->per_thread_data[index].request_ring)) {
> > +        return NULL;
> > +    }
> > +
> > +    /* try to get the request from the list */
> > +    request = get_and_remove_first_free_request(threads);
> > +    if (request) {
> > +        goto got_request;
> > +    }
> > +
> > +    /* get a request that has already been handled by the threads */
> > +    request = ring_get(threads->request_done_ring);
> > +    if (request) {
> > +        threads->thread_request_done(request);
> > +        goto got_request;
> > +    }
> > +    return NULL;
> > +
> > +got_request:
> > +    threads->current_thread_index++;
> > +    request->thread_index = index;
> > +    return request;
> > +}
> > +
> > +void threads_submit_request_commit(Threads *threads, ThreadRequest *request)
> > +{
> > +    int ret, index = request->thread_index;
> > +    ThreadLocal *thread_local = &threads->per_thread_data[index];
> > +
> > +    ret = ring_put(thread_local->request_ring, request);
> > +
> > +    /*
> > +     * we have detected that the thread's ring is not full in
> > +     * threads_submit_request_prepare(), so there should be free
> > +     * room in the ring
> > +     */
> > +    assert(!ret);
> > +    /* new request arrived, notify the thread */
> > +    qemu_event_set(&thread_local->ev);
> > +}
> > +
> > +void threads_wait_done(Threads *threads)
> > +{
> > +    ThreadRequest *request;
> > +
> > +retry:
> > +    while ((request = ring_get(threads->request_done_ring))) {
> > +        threads->thread_request_done(request);
> > +        add_free_request(threads, request);
> > +    }
> > +
> > +    if (threads->free_requests_nr != threads->total_requests) {
> > +        cpu_relax();
> > +        goto retry;
> > +    }
> > +}
> > diff --git a/migration/threads.h b/migration/threads.h
> > new file mode 100644
> > index 0000000000..eced913065
> > --- /dev/null
> > +++ b/migration/threads.h
> > @@ -0,0 +1,116 @@
> > +#ifndef QEMU_MIGRATION_THREAD_H
> > +#define QEMU_MIGRATION_THREAD_H
> > +
> > +/*
> > + * Multithreads abstraction
> > + *
> > + * This is the abstraction layer for multithreads management which is
> > + * used to speed up migration.
> > + *
> > + * Note: currently only one producer is allowed.
> > + *
> > + * Copyright(C) 2018 Tencent Corporation.
> > + *
> > + * Author:
> > + *   Xiao Guangrong <xiaoguangrong@tencent.com>
> > + *
> > + * This work is licensed under the terms of the GNU LGPL, version 2.1 or later.
> > + * See the COPYING.LIB file in the top-level directory.
> > + */
> > +
> > +#include "qemu/osdep.h"
> 
> I was told (more than once) that we should not include "osdep.h" in
> headers. :) I'd suggest you include it in the source file instead.
> 
> > +#include "hw/boards.h"
> 
> Why do we need this header?
> 
> > +
> > +#include "ring.h"
> > +
> > +/*
> > + * the request representation, which contains the internally used meta data;
> > + * it can be embedded into the user's self-defined data struct, and the user
> > + * can use container_of() to get the self-defined data
> > + */
> > +struct ThreadRequest {
> > +    QSLIST_ENTRY(ThreadRequest) node;
> > +    unsigned int thread_index;
> > +};
> > +typedef struct ThreadRequest ThreadRequest;
> > +
> > +struct Threads;
> > +
> > +struct ThreadLocal {
> > +    QemuThread thread;
> > +
> > +    /* the event used to wake up the thread */
> > +    QemuEvent ev;
> > +
> > +    struct Threads *threads;
> > +
> > +    /* local request ring which is filled by the user */
> > +    Ring *request_ring;
> > +
> > +    /* the index of the thread */
> > +    int self;
> > +
> > +    /* the thread is no longer needed and should exit */
> > +    bool quit;
> > +};
> > +typedef struct ThreadLocal ThreadLocal;
> > +
> > +/*
> > + * the main data struct represents multithreads which is shared by
> > + * all threads
> > + */
> > +struct Threads {
> > +    const char *name;
> > +    unsigned int threads_nr;
> > +    /* requests are pushed to the threads in a round-robin manner */
> > +    unsigned int current_thread_index;
> > +
> > +    int thread_ring_size;
> > +    int total_requests;
> > +
> > +    /* requests are pre-allocated and linked in this list */
> > +    int free_requests_nr;
> > +    QSLIST_HEAD(, ThreadRequest) free_requests;
> > +
> > +    /* the constructor of request */
> > +    ThreadRequest *(*thread_request_init)(void);
> > +    /* the destructor of request */
> > +    void (*thread_request_uninit)(ThreadRequest *request);
> > +    /* the handler of the request which is called in the thread */
> > +    void (*thread_request_handler)(ThreadRequest *request);
> > +    /*
> > +     * the handler to process the result which is called in the
> > +     * user's context
> > +     */
> > +    void (*thread_request_done)(ThreadRequest *request);
> > +
> > +    /* the threads push results to this ring, so it has multiple producers */
> > +    Ring *request_done_ring;
> > +
> > +    ThreadLocal per_thread_data[0];
> > +};
> > +typedef struct Threads Threads;
> 
> Not sure whether we can move the Threads/ThreadLocal definitions into the
> source file, so that we only expose an opaque struct declaration, along
> with the APIs.
> 
> Regards,
> 
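Concretely, the slimmed-down header would keep ThreadRequest public (users
embed it and call container_of() on it) and hide everything else behind an
opaque type; a sketch:

    /* threads.h after the move (illustrative) */
    typedef struct Threads Threads;     /* the definition lives in threads.c */

    /* ThreadRequest stays public so it can be embedded in the user's struct */
    struct ThreadRequest {
        QSLIST_ENTRY(ThreadRequest) node;
        unsigned int thread_index;
    };
    typedef struct ThreadRequest ThreadRequest;

    /* ... followed by the same function prototypes as below ... */
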
> > +
> > +Threads *threads_create(unsigned int threads_nr, const char *name,
> > +                        ThreadRequest *(*thread_request_init)(void),
> > +                        void (*thread_request_uninit)(ThreadRequest *request),
> > +                        void (*thread_request_handler)(ThreadRequest *request),
> > +                        void (*thread_request_done)(ThreadRequest *request));
> > +void threads_destroy(Threads *threads);
> > +
> > +/*
> > + * find a free request and associate it with a free thread.
> > + * If no request or no thread is free, return NULL
> > + */
> > +ThreadRequest *threads_submit_request_prepare(Threads *threads);
> > +/*
> > + * push the request to its thread's local ring and notify the thread
> > + */
> > +void threads_submit_request_commit(Threads *threads, ThreadRequest *request);
> > +
> > +/*
> > + * wait for all threads to complete the requests filled in their local rings,
> > + * to make sure no previous request is still pending.
> > + */
> > +void threads_wait_done(Threads *threads);
> > +#endif
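
To make the intended calling convention concrete, a user of this API would
look roughly as follows; the compression-flavoured names (CompressData,
my_compress(), my_send()) are invented for illustration and are not part of
the series:

    typedef struct {
        ThreadRequest request;          /* embedded, recovered via container_of() */
        void *page;
        uint8_t out[4096 + 128];
        size_t out_len;
    } CompressData;

    static ThreadRequest *compress_request_init(void)
    {
        return &g_new0(CompressData, 1)->request;
    }

    static void compress_request_uninit(ThreadRequest *req)
    {
        g_free(container_of(req, CompressData, request));
    }

    static void compress_handler(ThreadRequest *req)    /* runs in a worker thread */
    {
        CompressData *cd = container_of(req, CompressData, request);
        cd->out_len = my_compress(cd->page, cd->out, sizeof(cd->out));
    }

    static void compress_done(ThreadRequest *req)       /* runs in the user's context */
    {
        CompressData *cd = container_of(req, CompressData, request);
        my_send(cd->out, cd->out_len);
    }

    /* submission path in the migration thread */
    static void queue_page(Threads *threads, void *page)
    {
        ThreadRequest *req = threads_submit_request_prepare(threads);

        if (!req) {
            return;      /* every worker is busy: handle the page synchronously */
        }
        container_of(req, CompressData, request)->page = page;
        threads_submit_request_commit(threads, req);
    }

    /*
     * Setup and teardown:
     *     threads = threads_create(n, "compress", compress_request_init,
     *                              compress_request_uninit, compress_handler,
     *                              compress_done);
     *     ...
     *     threads_wait_done(threads);    before any synchronization point
     *     threads_destroy(threads);
     */
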
> > -- 
> > 2.14.4
> > 
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 07/12] migration: hold the lock only if it is really needed
  2018-07-12  7:47           ` [Qemu-devel] " Xiao Guangrong
@ 2018-07-13 17:44             ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-13 17:44 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, Peter Xu,
	wei.w.wang, jiang.biao2, pbonzini

* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> 
> On 07/11/2018 04:21 PM, Peter Xu wrote:
> > On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
> > > 
> > > 
> > > On 06/19/2018 03:36 PM, Peter Xu wrote:
> > > > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > > 
> > > > > Try to hold src_page_req_mutex only if the queue is not
> > > > > empty
> > > > 
> > > > Pure question: how much this patch would help?  Basically if you are
> > > > running compression tests then I think it means you are with precopy
> > > > (since postcopy cannot work with compression yet), then here the lock
> > > > has no contention at all.
> > > 
> > > Yes, you are right, however we can observe it is in the top functions
> > > (after revert this patch):
> > > 
> > > Samples: 29K of event 'cycles', Event count (approx.): 22263412260
> > > +   7.99%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
> > > +   6.95%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
> > > +   6.23%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
> > > +   6.20%  kqemu  qemu-system-x86_64       [.] qemu_event_set
> > > +   5.80%  kqemu  qemu-system-x86_64       [.] __ring_put
> > > +   4.82%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
> > > +   4.11%  kqemu  qemu-system-x86_64       [.] ring_is_full
> > > +   3.07%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
> > > +   2.83%  kqemu  qemu-system-x86_64       [.] ring_mp_get
> > > +   2.71%  kqemu  qemu-system-x86_64       [.] __ring_is_full
> > > +   2.46%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
> > > +   2.40%  kqemu  qemu-system-x86_64       [.] add_to_iovec
> > > +   2.21%  kqemu  qemu-system-x86_64       [.] ring_get
> > > +   1.96%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
> > > +   1.90%  kqemu  libc-2.12.so             [.] memcpy
> > > +   1.55%  kqemu  qemu-system-x86_64       [.] ring_len
> > > +   1.12%  kqemu  libpthread-2.12.so       [.] pthread_mutex_unlock
> > > +   1.11%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
> > > +   1.07%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
> > > +   1.04%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
> > > +   0.97%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
> > > +   0.96%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
> > > +   0.93%  kqemu  libpthread-2.12.so       [.] pthread_mutex_lock
> > 
> > (sorry to respond late; I was busy with other stuff for the
> >   release...)
> > 
> 
> You're welcome. :)
> 
> > I am trying to find out anything related to unqueue_page() but I
> > failed.  Did I miss anything obvious there?
> > 
> 
> Indeed, unqueue_page() is not listed here; I think the function
> itself is light enough (a check, then a direct return) so it
> did not leave a trace here.
> 
> This perf data was gathered after reverting this patch, i.e., it's
> based on the lockless multithread model, so unqueue_page() is
> the only place using a mutex in the main thread.
> 
> And you can see the overhead of the mutex was gone after applying
> this patch, in the mail where I replied to Dave.

I got around to playing with this patch and using 'perf top'
to see what was going on.
What I noticed was that without this patch pthread_mutex_unlock and
qemu_mutex_lock_impl were both noticeable; with the patch they'd
pretty much vanished.  So I think it's worth it.

I couldn't honestly see a difference in total CPU usage or
bandwidth; but the migration code is so spiky in usage that
it's difficult to measure anyway.

So,


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 08/12] migration: do not flush_compressed_data at the end of each iteration
  2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
@ 2018-07-13 18:01     ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-13 18:01 UTC (permalink / raw)
  To: guangrong.xiao
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini

* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> > flush_compressed_data() needs to wait for all compression threads to
> > finish their work; after that, all threads are idle until the
> > migration feeds new requests to them.  Reducing its calls can improve
> > the throughput and use CPU resources more effectively.
> > 
> > We do not need to flush all threads at the end of each iteration; the
> > data can be kept locally until the memory block is changed.
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  migration/ram.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index f9a8646520..0a38c1c61e 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1994,6 +1994,7 @@ static void ram_save_cleanup(void *opaque)
>      }
>  
>      xbzrle_cleanup();
> +    flush_compressed_data(*rsp);
>      compress_threads_save_cleanup();
>      ram_state_cleanup(rsp);
>  }

I'm not sure why this change corresponds to the other removal.
We should already have sent all remaining data in ram_save_complete()'s
call to flush_compressed_data - so what is this one for?

> @@ -2690,7 +2691,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>          }
>          i++;
>      }
> -    flush_compressed_data(rs);
>      rcu_read_unlock();

Hmm - are we sure there are no other cases that depend on the ordering of
the compressed data being sent out between threads?
I think the one I'd most worry about is the case where:

  iteration one:
     thread 1: Save compressed page 'n'

  iteration two:
     thread 2: Save compressed page 'n'

What guarantees that the version of page 'n'
from thread 2 reaches the destination first without
this flush?

Dave
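
The hazard can be modelled in a few lines: each worker buffers its output and
flushes lazily, so without a forced flush between iterations the order in
which two compressed copies of the same page hit the stream depends only on
which worker drains first (everything below is an illustration, not QEMU
code):

    #include <stdio.h>

    typedef struct {
        int pending_page;        /* -1 means the buffer is empty */
        int pending_version;
    } Worker;

    static void worker_save(Worker *w, int page, int version)
    {
        w->pending_page = page;          /* data only sits in the worker's buffer */
        w->pending_version = version;
    }

    static void worker_flush(Worker *w)  /* only now does data hit the stream */
    {
        if (w->pending_page >= 0) {
            printf("stream <- page %d (version %d)\n",
                   w->pending_page, w->pending_version);
            w->pending_page = -1;
        }
    }

    int main(void)
    {
        Worker t1 = { -1, 0 }, t2 = { -1, 0 };

        worker_save(&t1, 7, 1);          /* iteration one */
        worker_save(&t2, 7, 2);          /* iteration two, same page, newer data */

        /* with no flush between iterations, nothing orders the two copies:
         * here the newer one is sent first and is then overwritten on the
         * destination by the older one */
        worker_flush(&t2);
        worker_flush(&t1);
        return 0;
    }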

>      /*
> -- 
> 2.14.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 06/12] migration: do not detect zero page for compression
  2018-07-03  3:53           ` [Qemu-devel] " Xiao Guangrong
@ 2018-07-16 18:58             ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-16 18:58 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, Peter Xu,
	wei.w.wang, jiang.biao2, pbonzini

* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> 
> On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:
> > * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> > > 
> > > Hi Peter,
> > > 
> > > Sorry for the delay as i was busy on other things.
> > > 
> > > On 06/19/2018 03:30 PM, Peter Xu wrote:
> > > > On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> > > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > > 
> > > > > Detecting a zero page is not light work; we can disable it
> > > > > for compression, which can handle all-zero data very well
> > > > 
> > > > Is there any number shows how the compression algo performs better
> > > > than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> > > > be fast, depending on how init_accel() is done in util/bufferiszero.c.
> > > 
> > > This is the comparison between zero-detection and compression (the target
> > > buffer is all zero bits):
> > > 
> > > Zero 810 ns Compression: 26905 ns.
> > > Zero 417 ns Compression: 8022 ns.
> > > Zero 408 ns Compression: 7189 ns.
> > > Zero 400 ns Compression: 7255 ns.
> > > Zero 412 ns Compression: 7016 ns.
> > > Zero 411 ns Compression: 7035 ns.
> > > Zero 413 ns Compression: 6994 ns.
> > > Zero 399 ns Compression: 7024 ns.
> > > Zero 416 ns Compression: 7053 ns.
> > > Zero 405 ns Compression: 7041 ns.
> > > 
> > > Indeed, zero-detection is faster than compression.
> > > 
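Numbers of this kind are easy to reproduce outside QEMU; a minimal standalone
comparison (a plain byte scan vs. zlib level 1, both over one all-zero 4K
page; this is not the buffer_is_zero() or migration code itself):

    /* build with: cc -O2 zero_vs_deflate.c -lz */
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <zlib.h>

    #define PAGE_SIZE 4096

    static int is_zero(const uint8_t *p, size_t len)
    {
        size_t i;

        for (i = 0; i < len; i++) {     /* deliberately the simplest detector */
            if (p[i]) {
                return 0;
            }
        }
        return 1;
    }

    static long ns_since(const struct timespec *start)
    {
        struct timespec now;

        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - start->tv_sec) * 1000000000L +
               (now.tv_nsec - start->tv_nsec);
    }

    int main(void)
    {
        static uint8_t page[PAGE_SIZE];         /* an all-zero guest page */
        static uint8_t out[PAGE_SIZE + 128];    /* plenty for deflate of 4K zeros */
        uLongf out_len = sizeof(out);
        struct timespec t;
        long zero_ns, comp_ns;
        int z;

        clock_gettime(CLOCK_MONOTONIC, &t);
        z = is_zero(page, sizeof(page));
        zero_ns = ns_since(&t);

        clock_gettime(CLOCK_MONOTONIC, &t);
        compress2(out, &out_len, page, sizeof(page), 1);
        comp_ns = ns_since(&t);

        printf("zero-detect %ld ns, compress %ld ns -> %lu bytes (zero=%d)\n",
               zero_ns, comp_ns, (unsigned long)out_len, z);
        return 0;
    }
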
> > > However, during our profiling of the live_migration thread (after reverting this patch),
> > > we noticed zero-detection costs a lot of CPU:
> > > 
> > >   12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
> > 
> > Interesting; what host are you running on?
> > Some hosts have support for the faster buffer_zero_sse4/avx2 variants
> 
> The host is:
> 
> model name	: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
> ...
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
>  mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
>  rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
>  ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
>  tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
>  cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
>  hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
>  clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
>  cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke
> 
> I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe it is due to a too old glibc/gcc
> version:
>    gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
>    glibc.x86_64                     2.12

Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7.

> 
> > 
> > >    7.60%  kqemu  qemu-system-x86_64            [.] ram_bytes_total
> > >    6.56%  kqemu  qemu-system-x86_64            [.] qemu_event_set
> > >    5.61%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
> > >    5.00%  kqemu  qemu-system-x86_64            [.] __ring_put
> > >    4.89%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
> > >    4.71%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done
> > >    3.63%  kqemu  qemu-system-x86_64            [.] ring_is_full
> > >    2.89%  kqemu  qemu-system-x86_64            [.] __ring_is_full
> > >    2.68%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
> > >    2.60%  kqemu  qemu-system-x86_64            [.] ring_mp_get
> > >    2.25%  kqemu  qemu-system-x86_64            [.] ring_get
> > >    1.96%  kqemu  libc-2.12.so                  [.] memcpy
> > > 
> > > After this patch, the workload is moved to the worker thread, is it
> > > acceptable?
> > > 
> > > > 
> > > >   From compression rate POV of course zero page algo wins since it
> > > > contains no data (but only a flag).
> > > > 
> > > 
> > > Yes it is. The compressed zero page is 45 bytes that is small enough i think.
> > 
> > So the compression is ~20x slow and 10x the size;  not a great
> > improvement!
> > 
> > However, the tricky thing is that in the case of a guest which is mostly
> > non-zero, this patch would save that time used by zero detection, so it
> > would be faster.
> 
> Yes, indeed.

It would be good to benchmark the performance difference for a guest
with mostly non-zero pages; you should see a useful improvement.
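
For illustration, a minimal standalone sketch of that kind of comparison (the
shape behind the "Zero ... ns Compression: ... ns" numbers quoted above); the
4KiB page size, the iteration count and zlib level 1 are assumptions, and a
plain byte loop stands in for buffer_is_zero() so the program has no QEMU
dependency (build with -lz):

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>
    #include <zlib.h>

    #define PAGE_SIZE  4096
    #define ITERATIONS 1000

    /* naive stand-in for buffer_is_zero() */
    static int is_zero(const uint8_t *buf, size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++) {
            if (buf[i]) {
                return 0;
            }
        }
        return 1;
    }

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
    }

    int main(void)
    {
        static uint8_t page[PAGE_SIZE];      /* all-zero source page */
        static uint8_t out[PAGE_SIZE * 2];   /* worst-case output buffer */
        uint64_t t0, t1, t2;
        int i, zero = 0;

        t0 = now_ns();
        for (i = 0; i < ITERATIONS; i++) {
            zero += is_zero(page, PAGE_SIZE);
        }
        t1 = now_ns();
        for (i = 0; i < ITERATIONS; i++) {
            uLongf dst_len = sizeof(out);
            compress2(out, &dst_len, page, PAGE_SIZE, 1);
        }
        t2 = now_ns();

        printf("Zero %lu ns Compression: %lu ns (avg of %d runs, zero=%d)\n",
               (unsigned long)((t1 - t0) / ITERATIONS),
               (unsigned long)((t2 - t1) / ITERATIONS), ITERATIONS, zero);
        return 0;
    }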

Dave

> > 
> > > Hmm, if you do not like, how about move detecting zero page to the work thread?
> > 
> > That would be interesting to try.
> > 
> 
> Okay, i will try it then. :)
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 05/12] migration: show the statistics of compression
  2018-06-14  6:48       ` [Qemu-devel] " Xiao Guangrong
@ 2018-07-16 19:01         ` Dr. David Alan Gilbert
  -1 siblings, 0 replies; 156+ messages in thread
From: Dr. David Alan Gilbert @ 2018-07-16 19:01 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini

* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> 
> On 06/14/2018 12:25 AM, Dr. David Alan Gilbert wrote:
>  }
> > >   static void migration_bitmap_sync(RAMState *rs)
> > > @@ -1412,6 +1441,9 @@ static void flush_compressed_data(RAMState *rs)
> > >           qemu_mutex_lock(&comp_param[idx].mutex);
> > >           if (!comp_param[idx].quit) {
> > >               len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
> > > +            /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
> > > +            compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;
> > 
> > I think I'd rather save just len+8 rather than than the subtraction.
> > 
> Hmmmmmm, is this what you want?
>       compression_counters.reduced_size += len - 8;
> 
> Then calculate the real reduced size in populate_ram_info() where we return this
> info to the user:
>       info->compression->reduced_size = compression_counters.pages * PAGE_SIZE - compression_counters.reduced_size;
> 
> Right?

I mean I'd rather see the actual size presented to the user rather than
the saving compared to uncompressed.

Dave

> > I think other than that, and Eric's comments, it's OK.
> > 
> 
> Thanks.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 10/12] migration: introduce lockless multithreads model
  2018-07-13 16:24       ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-07-18  7:12         ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-07-18  7:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, wei.w.wang,
	pbonzini, jiang.biao2



On 07/14/2018 12:24 AM, Dr. David Alan Gilbert wrote:

>>> +static void *thread_run(void *opaque)
>>> +{
>>> +    ThreadLocal *self_data = (ThreadLocal *)opaque;
>>> +    Threads *threads = self_data->threads;
>>> +    void (*handler)(ThreadRequest *data) = threads->thread_request_handler;
>>> +    ThreadRequest *request;
>>> +    int count, ret;
>>> +
>>> +    for ( ; !atomic_read(&self_data->quit); ) {
>>> +        qemu_event_reset(&self_data->ev);
>>> +
>>> +        count = 0;
>>> +        while ((request = ring_get(self_data->request_ring)) ||
>>> +            count < BUSY_WAIT_COUNT) {
>>> +             /*
>>> +             * wait some while before go to sleep so that the user
>>> +             * needn't go to kernel space to wake up the consumer
>>> +             * threads.
>>> +             *
>>> +             * That will waste some CPU resource indeed however it
>>> +             * can significantly improve the case that the request
>>> +             * will be available soon.
>>> +             */
>>> +             if (!request) {
>>> +                cpu_relax();
>>> +                count++;
>>> +                continue;
>>> +            }
>>> +            count = 0;
> 
> Things like busywait counts probably need isolating somewhere;
> getting those counts right is quite hard.

Okay, I will factor it out into a separate function.
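
One possible shape for that helper, reusing the names from the patch quoted
above (ThreadLocal, ThreadRequest, ring_get(), BUSY_WAIT_COUNT, cpu_relax());
this is only a sketch of the factoring, not the actual code:

    /*
     * Poll the request ring for a while before the caller goes to sleep,
     * so that a producer which is about to submit work does not have to
     * pay for a kernel-space wake-up.
     */
    static ThreadRequest *thread_busy_wait_for_request(ThreadLocal *self_data)
    {
        ThreadRequest *request;
        int count;

        for (count = 0; count < BUSY_WAIT_COUNT; count++) {
            request = ring_get(self_data->request_ring);
            if (request) {
                return request;
            }
            cpu_relax();
        }

        return NULL;    /* nothing arrived; the caller may now block */
    }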

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 08/12] migration: do not flush_compressed_data at the end of each iteration
  2018-07-13 18:01     ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-07-18  8:44       ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-07-18  8:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini



On 07/14/2018 02:01 AM, Dr. David Alan Gilbert wrote:
> * guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>> flush_compressed_data() needs to wait all compression threads to
>> finish their work, after that all threads are free until the
>> migration feed new request to them, reducing its call can improve
>> the throughput and use CPU resource more effectively
>>
>> We do not need to flush all threads at the end of iteration, the
>> data can be kept locally until the memory block is changed
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
>> ---
>>   migration/ram.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index f9a8646520..0a38c1c61e 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1994,6 +1994,7 @@ static void ram_save_cleanup(void *opaque)
>>       }
>>   
>>       xbzrle_cleanup();
>> +    flush_compressed_data(*rsp);
>>       compress_threads_save_cleanup();
>>       ram_state_cleanup(rsp);
>>   }
> 
> I'm not sure why this change corresponds to the other removal.
> We should already have sent all remaining data in ram_save_complete()'s
> call to flush_compressed_data - so what is this one for?
> 

This is for the error case: if any error occurs during live migration, there
is no chance to call ram_save_complete(). After switching to the lockless
multithreads model, we assert that all requests have been handled before the
worker threads are destroyed.

>> @@ -2690,7 +2691,6 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>           }
>>           i++;
>>       }
>> -    flush_compressed_data(rs);
>>       rcu_read_unlock();
> 
> Hmm - are we sure there's no other cases that depend on ordering of all
> of the compressed data being sent out between threads?

Err, I tried to think it over carefully but still missed the case you mentioned. :(
Anyway, calling flush_compressed_data() every 50ms hurts us too much.

> I think the one I'd most worry about is the case where:
> 
>    iteration one:
>       thread 1: Save compressed page 'n'
> 
>    iteration two:
>       thread 2: Save compressed page 'n'
> 
> What guarantees that the version of page 'n'
> from thread 2 reaches the destination first without
> this flush?
> 

Hmm... you are right, I missed this case. So how about avoiding it by doing this
check at the beginning of ram_save_iterate()? That way a flush happens once per
round of dirty-bitmap sync, so a page compressed in an earlier round can never
overtake the copy compressed in a later round:

    /* rs is the RAMState pointer; dirty_sync_count would be a new field. */
    if (ram_counters.dirty_sync_count != rs->dirty_sync_count) {
        flush_compressed_data(rs);
        rs->dirty_sync_count = ram_counters.dirty_sync_count;
    }

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 06/12] migration: do not detect zero page for compression
  2018-07-16 18:58             ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-07-18  8:46               ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-07-18  8:46 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, Peter Xu,
	wei.w.wang, jiang.biao2, pbonzini



On 07/17/2018 02:58 AM, Dr. David Alan Gilbert wrote:
> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>
>>
>> On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:
>>> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> Sorry for the delay as i was busy on other things.
>>>>
>>>> On 06/19/2018 03:30 PM, Peter Xu wrote:
>>>>> On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
>>>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>>>
>>>>>> Detecting zero page is not a light work, we can disable it
>>>>>> for compression that can handle all zero data very well
>>>>>
>>>>> Is there any number shows how the compression algo performs better
>>>>> than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
>>>>> be fast, depending on how init_accel() is done in util/bufferiszero.c.
>>>>
>>>> This is the comparison between zero-detection and compression (the target
>>>> buffer is all zero bit):
>>>>
>>>> Zero 810 ns Compression: 26905 ns.
>>>> Zero 417 ns Compression: 8022 ns.
>>>> Zero 408 ns Compression: 7189 ns.
>>>> Zero 400 ns Compression: 7255 ns.
>>>> Zero 412 ns Compression: 7016 ns.
>>>> Zero 411 ns Compression: 7035 ns.
>>>> Zero 413 ns Compression: 6994 ns.
>>>> Zero 399 ns Compression: 7024 ns.
>>>> Zero 416 ns Compression: 7053 ns.
>>>> Zero 405 ns Compression: 7041 ns.
>>>>
>>>> Indeed, zero-detection is faster than compression.
>>>>
>>>> However during our profiling for the live_migration thread (after reverted this patch),
>>>> we noticed zero-detection cost lots of CPU:
>>>>
>>>>    12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
>>>
>>> Interesting; what host are you running on?
>>> Some hosts have support for the faster buffer_zero_ss4/avx2
>>
>> The host is:
>>
>> model name	: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
>> ...
>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
>>   mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
>>   rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
>>   ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
>>   tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
>>   cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
>>   hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
>>   clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
>>   cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke
>>
>> I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe is due to too old glib/gcc
>> version:
>>     gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
>>     glibc.x86_64                     2.12
> 
> Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7.

Er, it is not easy to update glibc in the production environment... :(

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 05/12] migration: show the statistics of compression
  2018-07-16 19:01         ` [Qemu-devel] " Dr. David Alan Gilbert
@ 2018-07-18  8:51           ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-07-18  8:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, peterx,
	wei.w.wang, jiang.biao2, pbonzini



On 07/17/2018 03:01 AM, Dr. David Alan Gilbert wrote:
> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>
>>
>> On 06/14/2018 12:25 AM, Dr. David Alan Gilbert wrote:
>>   }
>>>>    static void migration_bitmap_sync(RAMState *rs)
>>>> @@ -1412,6 +1441,9 @@ static void flush_compressed_data(RAMState *rs)
>>>>            qemu_mutex_lock(&comp_param[idx].mutex);
>>>>            if (!comp_param[idx].quit) {
>>>>                len = qemu_put_qemu_file(rs->f, comp_param[idx].file);
>>>> +            /* 8 means a header with RAM_SAVE_FLAG_CONTINUE. */
>>>> +            compression_counters.reduced_size += TARGET_PAGE_SIZE - len + 8;
>>>
>>> I think I'd rather save just len+8 rather than than the subtraction.
>>>
>> Hmmmmmm, is this what you want?
>>        compression_counters.reduced_size += len - 8;
>>
>> Then calculate the real reduced size in populate_ram_info() where we return this
>> info to the user:
>>        info->compression->reduced_size = compression_counters.pages * PAGE_SIZE - compression_counters.reduced_size;
>>
>> Right?
> 
> I mean I'd rather see the actual size presented to the user rather than
> the saving compared to uncompressed.
> 

These statistics are meant to help people see whether compression is working
efficiently or not, so maybe reduced-size is more straightforward? :)
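
For illustration, the alternative Dave is describing could look roughly like
this; compressed_size is a hypothetical counter name, TARGET_PAGE_SIZE stands
in for the PAGE_SIZE used above, and the populate_ram_info() assignments are
a sketch rather than an actual patch:

    /* accumulate only the bytes actually put on the wire for each
     * compressed page: len bytes of compressed data plus the 8-byte
     * page header */
    compression_counters.compressed_size += len + 8;

    /* later, in populate_ram_info(), report that directly; the saving
     * versus uncompressed can still be derived by anyone who wants it */
    info->compression->compressed_size = compression_counters.compressed_size;
    info->compression->reduced_size =
        compression_counters.pages * TARGET_PAGE_SIZE -
        compression_counters.compressed_size;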

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 07/12] migration: hold the lock only if it is really needed
  2018-07-12  8:26             ` [Qemu-devel] " Peter Xu
@ 2018-07-18  8:56               ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-07-18  8:56 UTC (permalink / raw)
  To: Peter Xu
  Cc: kvm, mst, mtosatti, Xiao Guangrong, dgilbert, qemu-devel,
	wei.w.wang, jiang.biao2, pbonzini



On 07/12/2018 04:26 PM, Peter Xu wrote:
> On Thu, Jul 12, 2018 at 03:47:57PM +0800, Xiao Guangrong wrote:
>>
>>
>> On 07/11/2018 04:21 PM, Peter Xu wrote:
>>> On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
>>>>
>>>>
>>>> On 06/19/2018 03:36 PM, Peter Xu wrote:
>>>>> On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
>>>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>>>
>>>>>> Try to hold src_page_req_mutex only if the queue is not
>>>>>> empty
>>>>>
>>>>> Pure question: how much this patch would help?  Basically if you are
>>>>> running compression tests then I think it means you are with precopy
>>>>> (since postcopy cannot work with compression yet), then here the lock
>>>>> has no contention at all.
>>>>
>>>> Yes, you are right, however we can observe it is in the top functions
>>>> (after revert this patch):
>>>>
>>>> Samples: 29K of event 'cycles', Event count (approx.): 22263412260
>>>> +   7.99%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
>>>> +   6.95%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
>>>> +   6.23%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
>>>> +   6.20%  kqemu  qemu-system-x86_64       [.] qemu_event_set
>>>> +   5.80%  kqemu  qemu-system-x86_64       [.] __ring_put
>>>> +   4.82%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
>>>> +   4.11%  kqemu  qemu-system-x86_64       [.] ring_is_full
>>>> +   3.07%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
>>>> +   2.83%  kqemu  qemu-system-x86_64       [.] ring_mp_get
>>>> +   2.71%  kqemu  qemu-system-x86_64       [.] __ring_is_full
>>>> +   2.46%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
>>>> +   2.40%  kqemu  qemu-system-x86_64       [.] add_to_iovec
>>>> +   2.21%  kqemu  qemu-system-x86_64       [.] ring_get
>>>> +   1.96%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
>>>> +   1.90%  kqemu  libc-2.12.so             [.] memcpy
>>>> +   1.55%  kqemu  qemu-system-x86_64       [.] ring_len
>>>> +   1.12%  kqemu  libpthread-2.12.so       [.] pthread_mutex_unlock
>>>> +   1.11%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
>>>> +   1.07%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
>>>> +   1.04%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
>>>> +   0.97%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
>>>> +   0.96%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
>>>> +   0.93%  kqemu  libpthread-2.12.so       [.] pthread_mutex_lock
>>>
>>> (sorry to respond late; I was busy with other stuff for the
>>>    release...)
>>>
>>
>> You're welcome. :)
>>
>>> I am trying to find out anything related to unqueue_page() but I
>>> failed.  Did I miss anything obvious there?
>>>
>>
>> unqueue_page() was not listed here indeed, i think the function
>> itself is light enough (a check then directly return) so it
>> did not leave a trace here.
>>
>> This perf data was got after reverting this patch, i.e, it's
>> based on the lockless multithread model, then unqueue_page() is
>> the only place using mutex in the main thread.
>>
>> And you can see the overload of mutext was gone after applying
>> this patch in the mail i replied to Dave.
> 
> I see.  It's not a big portion of CPU resource, though of course I
> don't have reason to object to this change as well.
> 
> Actually what interested me more is why ram_bytes_total() is such a
> hot spot.  AFAIU it's only called in ram_find_and_save_block() per
> call, and it should be mostly a constant if we don't plug/unplug
> memories.  Not sure that means that's a better spot to work on.
> 

I noticed it too. That could be another piece of work for us to look at. :)
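
For reference, the check-before-lock change this thread is about boils down
to something like the sketch below; QSIMPLEQ_EMPTY_ATOMIC and the
src_page_requests field name are assumptions here, only src_page_req_mutex
and unqueue_page() come from the discussion above:

    static RAMBlock *unqueue_page(RAMState *rs, ram_addr_t *offset)
    {
        RAMBlock *block = NULL;

        /* the queue stays empty for the whole of a precopy-only migration,
         * so peek at it without taking src_page_req_mutex first */
        if (QSIMPLEQ_EMPTY_ATOMIC(&rs->src_page_requests)) {
            return NULL;
        }

        qemu_mutex_lock(&rs->src_page_req_mutex);
        /* ... pop the first queued request under the lock, as before ... */
        qemu_mutex_unlock(&rs->src_page_req_mutex);

        return block;
    }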

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 07/12] migration: hold the lock only if it is really needed
  2018-07-18  8:56               ` [Qemu-devel] " Xiao Guangrong
@ 2018-07-18 10:18                 ` Peter Xu
  -1 siblings, 0 replies; 156+ messages in thread
From: Peter Xu @ 2018-07-18 10:18 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mst, mtosatti, Xiao Guangrong, qemu-devel, dgilbert,
	wei.w.wang, pbonzini, jiang.biao2

On Wed, Jul 18, 2018 at 04:56:13PM +0800, Xiao Guangrong wrote:
> 
> 
> On 07/12/2018 04:26 PM, Peter Xu wrote:
> > On Thu, Jul 12, 2018 at 03:47:57PM +0800, Xiao Guangrong wrote:
> > > 
> > > 
> > > On 07/11/2018 04:21 PM, Peter Xu wrote:
> > > > On Thu, Jun 28, 2018 at 05:33:58PM +0800, Xiao Guangrong wrote:
> > > > > 
> > > > > 
> > > > > On 06/19/2018 03:36 PM, Peter Xu wrote:
> > > > > > On Mon, Jun 04, 2018 at 05:55:15PM +0800, guangrong.xiao@gmail.com wrote:
> > > > > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > > > > 
> > > > > > > Try to hold src_page_req_mutex only if the queue is not
> > > > > > > empty
> > > > > > 
> > > > > > Pure question: how much this patch would help?  Basically if you are
> > > > > > running compression tests then I think it means you are with precopy
> > > > > > (since postcopy cannot work with compression yet), then here the lock
> > > > > > has no contention at all.
> > > > > 
> > > > > Yes, you are right, however we can observe it is in the top functions
> > > > > (after revert this patch):
> > > > > 
> > > > > Samples: 29K of event 'cycles', Event count (approx.): 22263412260
> > > > > +   7.99%  kqemu  qemu-system-x86_64       [.] ram_bytes_total
> > > > > +   6.95%  kqemu  [kernel.kallsyms]        [k] copy_user_enhanced_fast_string
> > > > > +   6.23%  kqemu  qemu-system-x86_64       [.] qemu_put_qemu_file
> > > > > +   6.20%  kqemu  qemu-system-x86_64       [.] qemu_event_set
> > > > > +   5.80%  kqemu  qemu-system-x86_64       [.] __ring_put
> > > > > +   4.82%  kqemu  qemu-system-x86_64       [.] compress_thread_data_done
> > > > > +   4.11%  kqemu  qemu-system-x86_64       [.] ring_is_full
> > > > > +   3.07%  kqemu  qemu-system-x86_64       [.] threads_submit_request_prepare
> > > > > +   2.83%  kqemu  qemu-system-x86_64       [.] ring_mp_get
> > > > > +   2.71%  kqemu  qemu-system-x86_64       [.] __ring_is_full
> > > > > +   2.46%  kqemu  qemu-system-x86_64       [.] buffer_zero_sse2
> > > > > +   2.40%  kqemu  qemu-system-x86_64       [.] add_to_iovec
> > > > > +   2.21%  kqemu  qemu-system-x86_64       [.] ring_get
> > > > > +   1.96%  kqemu  [kernel.kallsyms]        [k] __lock_acquire
> > > > > +   1.90%  kqemu  libc-2.12.so             [.] memcpy
> > > > > +   1.55%  kqemu  qemu-system-x86_64       [.] ring_len
> > > > > +   1.12%  kqemu  libpthread-2.12.so       [.] pthread_mutex_unlock
> > > > > +   1.11%  kqemu  qemu-system-x86_64       [.] ram_find_and_save_block
> > > > > +   1.07%  kqemu  qemu-system-x86_64       [.] ram_save_host_page
> > > > > +   1.04%  kqemu  qemu-system-x86_64       [.] qemu_put_buffer
> > > > > +   0.97%  kqemu  qemu-system-x86_64       [.] compress_page_with_multi_thread
> > > > > +   0.96%  kqemu  qemu-system-x86_64       [.] ram_save_target_page
> > > > > +   0.93%  kqemu  libpthread-2.12.so       [.] pthread_mutex_lock
> > > > 
> > > > (sorry to respond late; I was busy with other stuff for the
> > > >    release...)
> > > > 
> > > 
> > > You're welcome. :)
> > > 
> > > > I am trying to find out anything related to unqueue_page() but I
> > > > failed.  Did I miss anything obvious there?
> > > > 
> > > 
> > > unqueue_page() was not listed here indeed, i think the function
> > > itself is light enough (a check then directly return) so it
> > > did not leave a trace here.
> > > 
> > > This perf data was got after reverting this patch, i.e, it's
> > > based on the lockless multithread model, then unqueue_page() is
> > > the only place using mutex in the main thread.
> > > 
> > > And you can see the overload of mutext was gone after applying
> > > this patch in the mail i replied to Dave.
> > 
> > I see.  It's not a big portion of CPU resource, though of course I
> > don't have reason to object to this change as well.
> > 
> > Actually what interested me more is why ram_bytes_total() is such a
> > hot spot.  AFAIU it's only called in ram_find_and_save_block() per
> > call, and it should be mostly a constant if we don't plug/unplug
> > memories.  Not sure that means that's a better spot to work on.
> > 
> 
> I noticed it too. That could be another work we will work on. :)

Yeah I'm looking forward to that. :)

Btw, please feel free to post the common code separately in your next
post if the series grows even bigger (it may help even for
no-compression migrations); I would bet it will have a better chance
of being reviewed and merged quickly (though it'll possibly be after QEMU
3.0).
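
For illustration, the kind of caching that would address the ram_bytes_total()
hot spot mentioned above might look like the sketch below; the helper and the
ram_bytes_total_cache field are hypothetical, not existing QEMU code, and
invalidation on memory hotplug/unplug is omitted:

    static uint64_t ram_bytes_total_cached(RAMState *rs)
    {
        /* ram_bytes_total() presumably walks the RAMBlock list on every
         * call; the total only changes when blocks are added or removed,
         * so the migration hot path could reuse a cached value. */
        if (!rs->ram_bytes_total_cache) {
            rs->ram_bytes_total_cache = ram_bytes_total();
        }
        return rs->ram_bytes_total_cache;
    }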

Regards,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 06/12] migration: do not detect zero page for compression
  2018-07-18  8:46               ` [Qemu-devel] " Xiao Guangrong
@ 2018-07-22 16:05                 ` Michael S. Tsirkin
  -1 siblings, 0 replies; 156+ messages in thread
From: Michael S. Tsirkin @ 2018-07-22 16:05 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: kvm, mtosatti, Xiao Guangrong, Dr. David Alan Gilbert, Peter Xu,
	qemu-devel, wei.w.wang, jiang.biao2, pbonzini

On Wed, Jul 18, 2018 at 04:46:21PM +0800, Xiao Guangrong wrote:
> 
> 
> On 07/17/2018 02:58 AM, Dr. David Alan Gilbert wrote:
> > * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> > > 
> > > 
> > > On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:
> > > > * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> > > > > 
> > > > > Hi Peter,
> > > > > 
> > > > > Sorry for the delay as i was busy on other things.
> > > > > 
> > > > > On 06/19/2018 03:30 PM, Peter Xu wrote:
> > > > > > On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
> > > > > > > From: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > > > > > 
> > > > > > > Detecting zero page is not a light work, we can disable it
> > > > > > > for compression that can handle all zero data very well
> > > > > > 
> > > > > > Is there any number shows how the compression algo performs better
> > > > > > than the zero-detect algo?  Asked since AFAIU buffer_is_zero() might
> > > > > > be fast, depending on how init_accel() is done in util/bufferiszero.c.
> > > > > 
> > > > > This is the comparison between zero-detection and compression (the target
> > > > > buffer is all zero bit):
> > > > > 
> > > > > Zero 810 ns Compression: 26905 ns.
> > > > > Zero 417 ns Compression: 8022 ns.
> > > > > Zero 408 ns Compression: 7189 ns.
> > > > > Zero 400 ns Compression: 7255 ns.
> > > > > Zero 412 ns Compression: 7016 ns.
> > > > > Zero 411 ns Compression: 7035 ns.
> > > > > Zero 413 ns Compression: 6994 ns.
> > > > > Zero 399 ns Compression: 7024 ns.
> > > > > Zero 416 ns Compression: 7053 ns.
> > > > > Zero 405 ns Compression: 7041 ns.
> > > > > 
> > > > > Indeed, zero-detection is faster than compression.
> > > > > 
> > > > > However during our profiling for the live_migration thread (after reverted this patch),
> > > > > we noticed zero-detection cost lots of CPU:
> > > > > 
> > > > >    12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
> > > > 
> > > > Interesting; what host are you running on?
> > > > Some hosts have support for the faster buffer_zero_ss4/avx2
> > > 
> > > The host is:
> > > 
> > > model name	: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
> > > ...
> > > flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
> > >   mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
> > >   rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
> > >   ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
> > >   tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
> > >   cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
> > >   hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
> > >   clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
> > >   cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke
> > > 
> > > I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe is due to too old glib/gcc
> > > version:
> > >     gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
> > >     glibc.x86_64                     2.12
> > 
> > Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7.
> 
> Er, it is not easy to update glibc in the production env.... :(

But neither is QEMU updated in production all that easily. While we do
want to support older hosts functionally, it does not make
much sense to develop complex optimizations that only benefit
older hosts.

-- 
MST

^ permalink raw reply	[flat|nested] 156+ messages in thread

* Re: [PATCH 06/12] migration: do not detect zero page for compression
  2018-07-22 16:05                 ` [Qemu-devel] " Michael S. Tsirkin
@ 2018-07-23  7:12                   ` Xiao Guangrong
  -1 siblings, 0 replies; 156+ messages in thread
From: Xiao Guangrong @ 2018-07-23  7:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, mtosatti, Xiao Guangrong, Dr. David Alan Gilbert, Peter Xu,
	qemu-devel, wei.w.wang, jiang.biao2, pbonzini



On 07/23/2018 12:05 AM, Michael S. Tsirkin wrote:
> On Wed, Jul 18, 2018 at 04:46:21PM +0800, Xiao Guangrong wrote:
>>
>>
>> On 07/17/2018 02:58 AM, Dr. David Alan Gilbert wrote:
>>> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>>>
>>>>
>>>> On 06/29/2018 05:42 PM, Dr. David Alan Gilbert wrote:
>>>>> * Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> Sorry for the delay as i was busy on other things.
>>>>>>
>>>>>> On 06/19/2018 03:30 PM, Peter Xu wrote:
>>>>>>> On Mon, Jun 04, 2018 at 05:55:14PM +0800, guangrong.xiao@gmail.com wrote:
>>>>>>>> From: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>>>>>
>>>>>>>> Detecting zero pages is not light work; we can disable it
>>>>>>>> for compression, which can handle all-zero data very well
>>>>>>>
>>>>>>> Are there any numbers showing how the compression algo performs better
>>>>>>> than the zero-detect algo?  Asking since AFAIU buffer_is_zero() might
>>>>>>> be fast, depending on how init_accel() is done in util/bufferiszero.c.
>>>>>>
>>>>>> This is the comparison between zero-detection and compression (the target
>>>>>> buffer is all zero bits):
>>>>>>
>>>>>> Zero 810 ns Compression: 26905 ns.
>>>>>> Zero 417 ns Compression: 8022 ns.
>>>>>> Zero 408 ns Compression: 7189 ns.
>>>>>> Zero 400 ns Compression: 7255 ns.
>>>>>> Zero 412 ns Compression: 7016 ns.
>>>>>> Zero 411 ns Compression: 7035 ns.
>>>>>> Zero 413 ns Compression: 6994 ns.
>>>>>> Zero 399 ns Compression: 7024 ns.
>>>>>> Zero 416 ns Compression: 7053 ns.
>>>>>> Zero 405 ns Compression: 7041 ns.
>>>>>>
>>>>>> Indeed, zero-detection is faster than compression.
>>>>>>
>>>>>> However, during our profiling of the live_migration thread (after reverting this patch),
>>>>>> we noticed that zero-detection costs a lot of CPU:
>>>>>>
>>>>>>     12.01%  kqemu  qemu-system-x86_64            [.] buffer_zero_sse2
>>>>>
>>>>> Interesting; what host are you running on?
>>>>> Some hosts have support for the faster buffer_zero_sse4/avx2
>>>>
>>>> The host is:
>>>>
>>>> model name	: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz
>>>> ...
>>>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
>>>>    mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
>>>>    rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor
>>>>    ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
>>>>    tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
>>>>    cdp_l3 intel_ppin intel_pt mba tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1
>>>>    hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt
>>>>    clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
>>>>    cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke
>>>>
>>>> I checked and noticed "CONFIG_AVX2_OPT" has not been enabled, maybe due to a too-old glibc/gcc
>>>> version:
>>>>      gcc version 4.4.6 20110731 (Red Hat 4.4.6-4) (GCC)
>>>>      glibc.x86_64                     2.12
>>>
>>> Yes, that's pretty old (RHEL6 ?) - I think you should get AVX2 in RHEL7.
>>
>> Er, it is not easy to update glibc in the production env.... :(
> 
> But neither is QEMU updated in production all that easily. While we do
> want to support older hosts functionally, it does not make
> much sense to develop complex optimizations that only benefit
> older hosts.
> 

Couldn't agree with you more. :)

So I benchmarked it on a production host with a newer distribution installed.
Here is the data:
  27.48%  kqemu  [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
  12.63%  kqemu  [kernel.kallsyms]             [k] copy_page_rep
  10.82%  kqemu  qemu-system-x86_64            [.] buffer_zero_avx2
   5.69%  kqemu  [kernel.kallsyms]             [k] native_queued_spin_lock_slowpath
   4.61%  kqemu  qemu-system-x86_64            [.] threads_submit_request_prepare
   4.39%  kqemu  qemu-system-x86_64            [.] qemu_event_set
   4.12%  kqemu  qemu-system-x86_64            [.] ram_find_and_save_block.part.24
   3.61%  kqemu  [kernel.kallsyms]             [k] tcp_sendmsg
   2.62%  kqemu  libc-2.17.so                  [.] __memcpy_ssse3_back
   1.89%  kqemu  qemu-system-x86_64            [.] qemu_put_qemu_file
   1.32%  kqemu  qemu-system-x86_64            [.] compress_thread_data_done

It does not help...

^ permalink raw reply	[flat|nested] 156+ messages in thread
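
The profile just above still shows buffer_zero_avx2 taking roughly 11% of the migration
thread even with the AVX2 variant selected, which is the core argument of this patch: when
a page is handed to a compression thread anyway, the zero check on the migration thread is
redundant work. The fragment below is only a sketch of that decision flow under invented
names (compress_enabled, queue_page_for_compression, save_zero_page, save_normal_page, and
a simplified buffer_is_zero); it is not the actual QEMU ram.c code.

/*
 * Sketch of the decision argued for in this patch: skip zero-page
 * detection on the migration thread when compression will handle the
 * page anyway.  All identifiers are invented or simplified for
 * illustration; this is not QEMU's ram.c.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096

static bool compress_enabled = true;   /* stand-in for the compress capability */

/* Simplified stand-in for util/bufferiszero.c (no SSE2/AVX2 dispatch). */
static bool buffer_is_zero(const void *buf, size_t len)
{
    const uint64_t *p = buf;

    for (size_t i = 0; i < len / sizeof(*p); i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Invented stubs standing in for the real send paths. */
static void queue_page_for_compression(const void *page)
{
    (void)page;
    puts("-> handed to a compression thread");
}

static void save_zero_page(const void *page)
{
    (void)page;
    puts("-> sent as a zero-page marker");
}

static void save_normal_page(const void *page)
{
    (void)page;
    puts("-> sent as a raw page");
}

static void save_page(const void *page)
{
    if (compress_enabled) {
        /*
         * A compression thread squeezes an all-zero page down to a few
         * bytes anyway, so do not spend migration-thread CPU on detection.
         */
        queue_page_for_compression(page);
        return;
    }

    /* Without compression, detecting zero pages is still worthwhile. */
    if (buffer_is_zero(page, PAGE_SIZE)) {
        save_zero_page(page);
    } else {
        save_normal_page(page);
    }
}

int main(void)
{
    static uint8_t zero_page[PAGE_SIZE];         /* all zeros */
    static uint8_t data_page[PAGE_SIZE] = { 1 }; /* first byte nonzero */

    save_page(zero_page);        /* compression on: no zero check */
    compress_enabled = false;
    save_page(zero_page);        /* zero check runs, zero-page path */
    save_page(data_page);        /* zero check runs, raw-page path */
    return 0;
}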

end of thread, other threads:[~2018-07-23  7:12 UTC | newest]

Thread overview: 156+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-04  9:55 [PATCH 00/12] migration: improve multithreads for compression and decompression guangrong.xiao
2018-06-04  9:55 ` [Qemu-devel] " guangrong.xiao
2018-06-04  9:55 ` [PATCH 01/12] migration: do not wait if no free thread guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-06-11  7:39   ` Peter Xu
2018-06-11  7:39     ` [Qemu-devel] " Peter Xu
2018-06-12  2:42     ` Xiao Guangrong
2018-06-12  2:42       ` [Qemu-devel] " Xiao Guangrong
2018-06-12  3:15       ` Peter Xu
2018-06-12  3:15         ` [Qemu-devel] " Peter Xu
2018-06-13 15:43         ` Dr. David Alan Gilbert
2018-06-13 15:43           ` [Qemu-devel] " Dr. David Alan Gilbert
2018-06-14  3:19           ` Xiao Guangrong
2018-06-14  3:19             ` [Qemu-devel] " Xiao Guangrong
2018-06-04  9:55 ` [PATCH 02/12] migration: fix counting normal page for compression guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-06-13 15:51   ` Dr. David Alan Gilbert
2018-06-13 15:51     ` [Qemu-devel] " Dr. David Alan Gilbert
2018-06-14  3:32     ` Xiao Guangrong
2018-06-14  3:32       ` [Qemu-devel] " Xiao Guangrong
2018-06-04  9:55 ` [PATCH 03/12] migration: fix counting xbzrle cache_miss_rate guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-06-13 16:09   ` Dr. David Alan Gilbert
2018-06-13 16:09     ` [Qemu-devel] " Dr. David Alan Gilbert
2018-06-15 11:30   ` Dr. David Alan Gilbert
2018-06-15 11:30     ` [Qemu-devel] " Dr. David Alan Gilbert
2018-06-04  9:55 ` [PATCH 04/12] migration: introduce migration_update_rates guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-06-13 16:17   ` Dr. David Alan Gilbert
2018-06-13 16:17     ` [Qemu-devel] " Dr. David Alan Gilbert
2018-06-14  3:35     ` Xiao Guangrong
2018-06-14  3:35       ` [Qemu-devel] " Xiao Guangrong
2018-06-15 11:32     ` Dr. David Alan Gilbert
2018-06-15 11:32       ` [Qemu-devel] " Dr. David Alan Gilbert
2018-06-04  9:55 ` [PATCH 05/12] migration: show the statistics of compression guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-06-04 22:31   ` Eric Blake
2018-06-04 22:31     ` [Qemu-devel] " Eric Blake
2018-06-06 12:44     ` Xiao Guangrong
2018-06-06 12:44       ` [Qemu-devel] " Xiao Guangrong
2018-06-13 16:25   ` Dr. David Alan Gilbert
2018-06-13 16:25     ` [Qemu-devel] " Dr. David Alan Gilbert
2018-06-14  6:48     ` Xiao Guangrong
2018-06-14  6:48       ` [Qemu-devel] " Xiao Guangrong
2018-07-16 19:01       ` Dr. David Alan Gilbert
2018-07-16 19:01         ` [Qemu-devel] " Dr. David Alan Gilbert
2018-07-18  8:51         ` Xiao Guangrong
2018-07-18  8:51           ` [Qemu-devel] " Xiao Guangrong
2018-06-04  9:55 ` [PATCH 06/12] migration: do not detect zero page for compression guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-06-19  7:30   ` Peter Xu
2018-06-19  7:30     ` [Qemu-devel] " Peter Xu
2018-06-28  9:12     ` Xiao Guangrong
2018-06-28  9:12       ` [Qemu-devel] " Xiao Guangrong
2018-06-28  9:36       ` Daniel P. Berrangé
2018-06-28  9:36         ` [Qemu-devel] " Daniel P. Berrangé
2018-06-29  3:50         ` Xiao Guangrong
2018-06-29  3:50           ` [Qemu-devel] " Xiao Guangrong
2018-06-29  9:54         ` Dr. David Alan Gilbert
2018-06-29  9:54           ` [Qemu-devel] " Dr. David Alan Gilbert
2018-06-29  9:42       ` Dr. David Alan Gilbert
2018-06-29  9:42         ` [Qemu-devel] " Dr. David Alan Gilbert
2018-07-03  3:53         ` Xiao Guangrong
2018-07-03  3:53           ` [Qemu-devel] " Xiao Guangrong
2018-07-16 18:58           ` Dr. David Alan Gilbert
2018-07-16 18:58             ` [Qemu-devel] " Dr. David Alan Gilbert
2018-07-18  8:46             ` Xiao Guangrong
2018-07-18  8:46               ` [Qemu-devel] " Xiao Guangrong
2018-07-22 16:05               ` Michael S. Tsirkin
2018-07-22 16:05                 ` [Qemu-devel] " Michael S. Tsirkin
2018-07-23  7:12                 ` Xiao Guangrong
2018-07-23  7:12                   ` [Qemu-devel] " Xiao Guangrong
2018-06-04  9:55 ` [PATCH 07/12] migration: hold the lock only if it is really needed guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-06-19  7:36   ` Peter Xu
2018-06-19  7:36     ` [Qemu-devel] " Peter Xu
2018-06-28  9:33     ` Xiao Guangrong
2018-06-28  9:33       ` [Qemu-devel] " Xiao Guangrong
2018-06-29 11:22       ` Dr. David Alan Gilbert
2018-06-29 11:22         ` [Qemu-devel] " Dr. David Alan Gilbert
2018-07-03  6:27         ` Xiao Guangrong
2018-07-03  6:27           ` [Qemu-devel] " Xiao Guangrong
2018-07-11  8:21       ` Peter Xu
2018-07-11  8:21         ` [Qemu-devel] " Peter Xu
2018-07-12  7:47         ` Xiao Guangrong
2018-07-12  7:47           ` [Qemu-devel] " Xiao Guangrong
2018-07-12  8:26           ` Peter Xu
2018-07-12  8:26             ` [Qemu-devel] " Peter Xu
2018-07-18  8:56             ` Xiao Guangrong
2018-07-18  8:56               ` [Qemu-devel] " Xiao Guangrong
2018-07-18 10:18               ` Peter Xu
2018-07-18 10:18                 ` [Qemu-devel] " Peter Xu
2018-07-13 17:44           ` Dr. David Alan Gilbert
2018-07-13 17:44             ` [Qemu-devel] " Dr. David Alan Gilbert
2018-06-04  9:55 ` [PATCH 08/12] migration: do not flush_compressed_data at the end of each iteration guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-07-13 18:01   ` Dr. David Alan Gilbert
2018-07-13 18:01     ` [Qemu-devel] " Dr. David Alan Gilbert
2018-07-18  8:44     ` Xiao Guangrong
2018-07-18  8:44       ` [Qemu-devel] " Xiao Guangrong
2018-06-04  9:55 ` [PATCH 09/12] ring: introduce lockless ring buffer guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-06-20  4:52   ` Peter Xu
2018-06-20  4:52     ` [Qemu-devel] " Peter Xu
2018-06-28 10:02     ` Xiao Guangrong
2018-06-28 10:02       ` [Qemu-devel] " Xiao Guangrong
2018-06-28 11:55       ` Wei Wang
2018-06-28 11:55         ` [Qemu-devel] " Wei Wang
2018-06-29  3:55         ` Xiao Guangrong
2018-06-29  3:55           ` [Qemu-devel] " Xiao Guangrong
2018-07-03 15:55           ` Paul E. McKenney
2018-07-03 15:55             ` [Qemu-devel] " Paul E. McKenney
2018-06-20  5:55   ` Peter Xu
2018-06-20  5:55     ` [Qemu-devel] " Peter Xu
2018-06-28 14:00     ` Xiao Guangrong
2018-06-28 14:00       ` [Qemu-devel] " Xiao Guangrong
2018-06-20 12:38   ` Michael S. Tsirkin
2018-06-20 12:38     ` [Qemu-devel] " Michael S. Tsirkin
2018-06-29  7:30     ` Xiao Guangrong
2018-06-29  7:30       ` [Qemu-devel] " Xiao Guangrong
2018-06-29 13:08       ` Michael S. Tsirkin
2018-06-29 13:08         ` [Qemu-devel] " Michael S. Tsirkin
2018-07-03  7:31         ` Xiao Guangrong
2018-07-03  7:31           ` [Qemu-devel] " Xiao Guangrong
2018-06-28 13:36   ` Jason Wang
2018-06-28 13:36     ` [Qemu-devel] " Jason Wang
2018-06-29  3:59     ` Xiao Guangrong
2018-06-29  3:59       ` [Qemu-devel] " Xiao Guangrong
2018-06-29  6:15       ` Jason Wang
2018-06-29  6:15         ` [Qemu-devel] " Jason Wang
2018-06-29  7:47         ` Xiao Guangrong
2018-06-29  7:47           ` [Qemu-devel] " Xiao Guangrong
2018-06-29  4:23     ` Michael S. Tsirkin
2018-06-29  4:23       ` [Qemu-devel] " Michael S. Tsirkin
2018-06-29  7:44       ` Xiao Guangrong
2018-06-29  7:44         ` [Qemu-devel] " Xiao Guangrong
2018-06-04  9:55 ` [PATCH 10/12] migration: introduce lockless multithreads model guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-06-20  6:52   ` Peter Xu
2018-06-20  6:52     ` [Qemu-devel] " Peter Xu
2018-06-28 14:25     ` Xiao Guangrong
2018-06-28 14:25       ` [Qemu-devel] " Xiao Guangrong
2018-07-13 16:24     ` Dr. David Alan Gilbert
2018-07-13 16:24       ` [Qemu-devel] " Dr. David Alan Gilbert
2018-07-18  7:12       ` Xiao Guangrong
2018-07-18  7:12         ` [Qemu-devel] " Xiao Guangrong
2018-06-04  9:55 ` [PATCH 11/12] migration: use lockless Multithread model for compression guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-06-04  9:55 ` [PATCH 12/12] migration: use lockless Multithread model for decompression guangrong.xiao
2018-06-04  9:55   ` [Qemu-devel] " guangrong.xiao
2018-06-11  8:00 ` [PATCH 00/12] migration: improve multithreads for compression and decompression Peter Xu
2018-06-11  8:00   ` [Qemu-devel] " Peter Xu
2018-06-12  3:19   ` Xiao Guangrong
2018-06-12  3:19     ` [Qemu-devel] " Xiao Guangrong
2018-06-12  5:36     ` Peter Xu
2018-06-12  5:36       ` [Qemu-devel] " Peter Xu
