* [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration.
@ 2023-11-14  5:40 Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 01/20] multifd: Add capability to enable/disable zero_page Hao Xiang
                   ` (20 more replies)
  0 siblings, 21 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

v2
* Rebase on top of 3e01f1147a16ca566694b97eafc941d62fa1e8d8.
* Leave Juan's changes in their original form instead of squashing them.
* Add a new commit to refactor the multifd_send_thread function to prepare for introducing the DSA offload functionality.
* Use page count to configure the multifd-packet-size option.
* Don't use the FLAKY flag in DSA tests.
* Check whether the DSA integration test is set up correctly and skip the test if not.
* Fix the broken link in the previous patch cover letter.

* Background:

I posted an RFC about DSA offloading in QEMU:
https://patchew.org/QEMU/20230529182001.2232069-1-hao.xiang@bytedance.com/

This patchset implements the DSA offloading on zero page checking in
multifd live migration code path.

* Overview:

Intel Data Streaming Accelerator (DSA) was introduced with Intel's 4th
generation Xeon server, aka Sapphire Rapids.
https://cdrdv2-public.intel.com/671116/341204-intel-data-streaming-accelerator-spec.pdf
https://www.intel.com/content/www/us/en/content-details/759709/intel-data-streaming-accelerator-user-guide.html
One of the things DSA can do is offload memory comparison workloads from the
CPU to the accelerator hardware. This patchset implements a solution that
offloads QEMU's zero page checking from the CPU to the DSA hardware. We gain
two benefits from this change:
1. Reduced CPU usage in the multifd live migration workflow across all use
cases.
2. Reduced total migration time in some use cases.

* Design:

These are the logical steps to perform DSA offloading (a minimal sketch of
steps 3 through 6 follows the list):
1. Configure the DSA accelerators and create DSA work queues that user space
can open, via the idxd driver.
2. Map a DSA work queue into the user space address space.
3. Fill an in-memory task descriptor describing the memory operation.
4. Use the dedicated CPU instruction _enqcmd to queue the task descriptor to
the work queue.
5. Poll the task descriptor's completion status field until the task
completes.
6. Check the return status.
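
For illustration, here is a minimal synchronous sketch of steps 3 through 6.
It uses the idxd descriptor definitions added later in this series, assumes
the work queue portal is already mapped (steps 1 and 2), must be built with
enqcmd support (e.g. GCC's target("enqcmd")), and elides batching and error
handling:

    #include <stdbool.h>
    #include <stdint.h>
    #include <linux/idxd.h>
    #include <x86intrin.h>

    static bool dsa_page_is_zero_sync(void *wq_portal, const void *page,
                                      uint32_t page_size)
    {
        struct dsa_hw_desc desc = { 0 };
        /* DSA requires the completion record to be 32-byte aligned. */
        struct dsa_completion_record comp
            __attribute__((aligned(32))) = { 0 };

        /* Step 3: fill a compare-with-value (COMPVAL) descriptor. */
        desc.opcode = DSA_OPCODE_COMPVAL;
        desc.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
        desc.comp_pattern = 0;              /* compare the page against 0 */
        desc.src_addr = (uintptr_t)page;
        desc.xfer_size = page_size;
        desc.completion_addr = (uintptr_t)&comp;

        /* Step 4: submit with enqcmd; a non-zero return means the
         * shared work queue was full, so retry. */
        while (_enqcmd(wq_portal, &desc)) {
            _mm_pause();
        }

        /* Step 5: poll the (volatile) status field until completion. */
        while (comp.status == DSA_COMP_NONE) {
            _mm_pause();
        }

        /* Step 6: check the return status; result == 0 means every byte
         * matched the pattern, i.e. the page is zero. */
        return (comp.status & DSA_COMP_STATUS_MASK) == DSA_COMP_SUCCESS &&
               comp.result == 0;
    }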

The memory operation itself is now done entirely by the accelerator hardware,
but the new workflow introduces overheads: the extra CPU cost of preparing
and submitting the task descriptors, and the extra CPU cost of polling for
completion. The design centers on minimizing these two overheads.

1. To reduce the overhead of task preparation and submission, we use batch
descriptors. A batch descriptor contains N individual zero page checking
tasks, where N defaults to 128 (default packet size / page size); N can be
increased by raising the packet size via a new migration option.
2. The multifd sender threads prepare and submit batch tasks to the DSA
hardware and wait on a synchronization object for task completion.
Whenever a DSA task is submitted, the task structure is added to a
thread-safe queue. It is safe for multiple multifd sender threads to
submit tasks concurrently.
3. Multiple DSA hardware devices can be used. During multifd initialization,
every sender thread is assigned a DSA device to work with. We use a
round-robin scheme to distribute the work evenly across all DSA devices
in use.
4. A dedicated thread, dsa_completion, busy-polls for all DSA task
completions (see the sketch after this list). It keeps dequeuing DSA tasks
from the thread-safe queue and blocks when there is no outstanding DSA
task. When polling for completion of a DSA task, the thread executes the
CPU instruction _mm_pause between the iterations of the busy loop to save
CPU power and to free core resources for the sibling hyperthread.
5. The DSA accelerator can encounter errors; the most common one is a
page fault. We have tested letting the device handle page faults, but the
performance is poor. Right now, if DSA hits a page fault, we fall back to
the CPU to complete the rest of the work. The CPU fallback is done in
the multifd sender thread.
6. Added a new migration option, multifd-dsa-accel, to set the DSA device
path. If set, the multifd workflow will use the DSA devices for
offloading.
7. Added a new migration option, multifd-normal-page-ratio, to make
multifd live migration easier to test. Setting a normal page ratio makes
live migration treat a zero page as a normal page and send the entire
payload over the network. This is useful when we want to send a large
network payload and analyze throughput.
8. Added a new migration option, multifd-packet-size. It increases the
number of pages that are zero page checked and sent over the network in one
batch. The extra synchronization between the sender threads and the
dsa_completion thread is an overhead; a larger packet size reduces it.
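
A rough sketch of the completion thread's per-task polling described in
items 4 and 5. The task structure and helper below are hypothetical
stand-ins for what util/dsa.c introduces later in this series, and the CPU
fallback is shown inline here for brevity (the series performs it in the
multifd sender thread):

    #include <stdbool.h>
    #include <linux/idxd.h>
    #include <x86intrin.h>
    #include "qemu/osdep.h"
    #include "qemu/cutils.h"    /* buffer_is_zero() */

    /* Hypothetical task shape; the real one lives in util/dsa.c. */
    struct zero_page_task {
        struct dsa_completion_record *comp;  /* 32-byte aligned */
        const uint8_t *page;
        uint32_t page_size;
        bool is_zero;
    };

    static void poll_task_completion(struct zero_page_task *task)
    {
        /* Item 4: busy-poll; _mm_pause() yields core resources to the
         * sibling hyperthread between iterations. */
        while (task->comp->status == DSA_COMP_NONE) {
            _mm_pause();
        }

        if ((task->comp->status & DSA_COMP_STATUS_MASK) == DSA_COMP_SUCCESS) {
            /* result == 0 means the page matched the zero pattern. */
            task->is_zero = (task->comp->result == 0);
        } else {
            /* Item 5: on errors such as a page fault, redo the check
             * on the CPU. */
            task->is_zero = buffer_is_zero(task->page, task->page_size);
        }
    }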

* Performance:

We use two Intel 4th generation Xeon servers for testing.

Architecture:        x86_64
CPU(s):              192
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Platinum 8457C
Stepping:            8
CPU MHz:             2538.624
CPU max MHz:         3800.0000
CPU min MHz:         800.0000

We perform multifd live migration with the below setup:
1. The VM has 100GB of memory.
2. Use the new migration option multifd-normal-page-ratio to control the
total size of the payload sent over the network.
3. Use 8 multifd channels.
4. Use TCP for live migration.
5. Use the CPU to perform zero page checking as the baseline.
6. Use one DSA device to offload zero page checking, to compare against the
baseline.
7. Use "perf sched record" and "perf sched timehist" to analyze CPU usage.

A) Scenario 1: 50% (50GB) normal pages on a 100GB VM.

	CPU usage

	|---------------|---------------|---------------|---------------|
	|		|comm		|runtime(msec)	|totaltime(msec)|
	|---------------|---------------|---------------|---------------|
	|Baseline	|live_migration	|5657.58	|		|
	|		|multifdsend_0	|3931.563	|		|
	|		|multifdsend_1	|4405.273	|		|
	|		|multifdsend_2	|3941.968	|		|
	|		|multifdsend_3	|5032.975	|		|
	|		|multifdsend_4	|4533.865	|		|
	|		|multifdsend_5	|4530.461	|		|
	|		|multifdsend_6	|5171.916	|		|
	|		|multifdsend_7	|4722.769	|41922		|
	|---------------|---------------|---------------|---------------|
	|DSA		|live_migration	|6129.168	|		|
	|		|multifdsend_0	|2954.717	|		|
	|		|multifdsend_1	|2766.359	|		|
	|		|multifdsend_2	|2853.519	|		|
	|		|multifdsend_3	|2740.717	|		|
	|		|multifdsend_4	|2824.169	|		|
	|		|multifdsend_5	|2966.908	|		|
	|		|multifdsend_6	|2611.137	|		|
	|		|multifdsend_7	|3114.732	|		|
	|		|dsa_completion	|3612.564	|32568		|
	|---------------|---------------|---------------|---------------|

Baseline total runtime is calculated by adding up the runtime of all
multifdsend_X threads and the live_migration thread. DSA offloading total
runtime additionally includes the dsa_completion thread. That is 41922 msec
vs. 32568 msec of runtime, i.e. (41922 - 32568) / 41922, roughly 22% total
CPU usage savings.

	Latency
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|		|total time	|down time	|throughput	|transferred-ram|total-ram	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|Baseline	|10343 ms	|161 ms		|41007.00 mbps	|51583797 kb	|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|DSA offload	|9535 ms	|135 ms		|46554.40 mbps	|53947545 kb	|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|

Total time is 8% faster and down time is 16% faster.

B) Scenario 2: 100% (100GB) zero pages on a 100GB VM.

	CPU usage
	|---------------|---------------|---------------|---------------|
	|		|comm		|runtime(msec)	|totaltime(msec)|
	|---------------|---------------|---------------|---------------|
	|Baseline	|live_migration	|4860.718	|		|
	|	 	|multifdsend_0	|748.875	|		|
	|		|multifdsend_1	|898.498	|		|
	|		|multifdsend_2	|787.456	|		|
	|		|multifdsend_3	|764.537	|		|
	|		|multifdsend_4	|785.687	|		|
	|		|multifdsend_5	|756.941	|		|
	|		|multifdsend_6	|774.084	|		|
	|		|multifdsend_7	|782.900	|11154		|
	|---------------|---------------|---------------|---------------|
	|DSA offloading	|live_migration	|3846.976	|		|
	|		|multifdsend_0	|191.880	|		|
	|		|multifdsend_1	|166.331	|		|
	|		|multifdsend_2	|168.528	|		|
	|		|multifdsend_3	|197.831	|		|
	|		|multifdsend_4	|169.580	|		|
	|		|multifdsend_5	|167.984	|		|
	|		|multifdsend_6	|198.042	|		|
	|		|multifdsend_7	|170.624	|		|
	|		|dsa_completion	|3428.669	|8700		|
	|---------------|---------------|---------------|---------------|

Baseline total runtime is 11154 msec and DSA offloading total runtime is
8700 msec. That is 22% CPU savings.

	Latency
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|		|total time	|down time	|throughput	|transferred-ram|total-ram	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|Baseline	|4867 ms	|20 ms		|1.51 mbps	|565 kb		|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|
	|DSA offload	|3888 ms	|18 ms		|1.89 mbps	|565 kb		|102400520 kb	|
	|---------------|---------------|---------------|---------------|---------------|---------------|

Total time is 20% faster and down time is 10% faster.

* Testing:

1. Added unit tests to cover the added code paths in dsa.c (a hypothetical
sketch of the test shape follows this list).
2. Added integration tests to cover multifd live migration using DSA
offloading.
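
For reference, a minimal unit test would take the usual glib test shape.
Everything below is a hypothetical sketch (test path, device path, and body
are placeholders); the real tests live in tests/unit/test-dsa.c:

    #include "qemu/osdep.h"
    #include "qemu/dsa.h"

    static void test_dsa_start_stop(void)
    {
        /* The path is an example; skip when no DSA work queue is set up. */
        if (dsa_init("/dev/dsa/wq0.0") != 0) {
            g_test_skip("DSA device not available");
            return;
        }
        dsa_start();
        g_assert(dsa_is_running());
        dsa_stop();
        dsa_cleanup();
    }

    int main(int argc, char **argv)
    {
        g_test_init(&argc, &argv, NULL);
        g_test_add_func("/dsa/start_stop", test_dsa_start_stop);
        return g_test_run();
    }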

* Patchset

Apply this patchset on top of commit
f78ea7ddb0e18766ece9fdfe02061744a7afc41b

Hao Xiang (16):
  meson: Introduce new instruction set enqcmd to the build system.
  util/dsa: Add dependency idxd.
  util/dsa: Implement DSA device start and stop logic.
  util/dsa: Implement DSA task enqueue and dequeue.
  util/dsa: Implement DSA task asynchronous completion thread model.
  util/dsa: Implement zero page checking in DSA task.
  util/dsa: Implement DSA task asynchronous submission and wait for
    completion.
  migration/multifd: Add new migration option for multifd DSA
    offloading.
  migration/multifd: Prepare to introduce DSA acceleration on the
    multifd path.
  migration/multifd: Enable DSA offloading in multifd sender path.
  migration/multifd: Add test hook to set normal page ratio.
  migration/multifd: Enable set normal page ratio test hook in multifd.
  migration/multifd: Add migration option set packet size.
  migration/multifd: Enable set packet size migration option.
  util/dsa: Add unit test coverage for Intel DSA task submission and
    completion.
  migration/multifd: Add integration tests for multifd with Intel DSA
    offloading.

Juan Quintela (4):
  multifd: Add capability to enable/disable zero_page
  multifd: Support for zero pages transmission
  multifd: Zero pages transmission
  So we use multifd to transmit zero pages.

 include/qemu/dsa.h             |  119 ++++
 linux-headers/linux/idxd.h     |  356 ++++++++++
 meson.build                    |    2 +
 meson_options.txt              |    2 +
 migration/migration-hmp-cmds.c |   22 +
 migration/multifd-zlib.c       |    8 +-
 migration/multifd-zstd.c       |    8 +-
 migration/multifd.c            |  203 +++++-
 migration/multifd.h            |   28 +-
 migration/options.c            |  107 +++
 migration/options.h            |    4 +
 migration/ram.c                |   45 +-
 migration/trace-events         |    8 +-
 qapi/migration.json            |   53 +-
 scripts/meson-buildoptions.sh  |    3 +
 tests/qtest/migration-test.c   |   77 ++-
 tests/unit/meson.build         |    6 +
 tests/unit/test-dsa.c          |  466 +++++++++++++
 util/dsa.c                     | 1132 ++++++++++++++++++++++++++++++++
 util/meson.build               |    1 +
 20 files changed, 2612 insertions(+), 38 deletions(-)
 create mode 100644 include/qemu/dsa.h
 create mode 100644 linux-headers/linux/idxd.h
 create mode 100644 tests/unit/test-dsa.c
 create mode 100644 util/dsa.c

-- 
2.30.2




* [PATCH v2 01/20] multifd: Add capability to enable/disable zero_page
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-11-16 15:15   ` Fabiano Rosas
  2023-11-14  5:40 ` [PATCH v2 02/20] multifd: Support for zero pages transmission Hao Xiang
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel

From: Juan Quintela <quintela@redhat.com>

We have to enable it by default until we introduce the new code.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 migration/options.c | 13 +++++++++++++
 migration/options.h |  1 +
 qapi/migration.json |  8 +++++++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/migration/options.c b/migration/options.c
index 8d8ec73ad9..00c0c4a0d6 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -204,6 +204,8 @@ Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-switchover-ack",
                         MIGRATION_CAPABILITY_SWITCHOVER_ACK),
     DEFINE_PROP_MIG_CAP("x-dirty-limit", MIGRATION_CAPABILITY_DIRTY_LIMIT),
+    DEFINE_PROP_MIG_CAP("main-zero-page",
+            MIGRATION_CAPABILITY_MAIN_ZERO_PAGE),
     DEFINE_PROP_END_OF_LIST(),
 };
 
@@ -284,6 +286,17 @@ bool migrate_multifd(void)
     return s->capabilities[MIGRATION_CAPABILITY_MULTIFD];
 }
 
+bool migrate_use_main_zero_page(void)
+{
+    //MigrationState *s;
+
+    //s = migrate_get_current();
+
+    // We will enable this when we add the right code.
+    // return s->enabled_capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
+    return true;
+}
+
 bool migrate_pause_before_switchover(void)
 {
     MigrationState *s = migrate_get_current();
diff --git a/migration/options.h b/migration/options.h
index 246c160aee..c901eb57c6 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -88,6 +88,7 @@ int migrate_multifd_channels(void);
 MultiFDCompression migrate_multifd_compression(void);
 int migrate_multifd_zlib_level(void);
 int migrate_multifd_zstd_level(void);
+bool migrate_use_main_zero_page(void);
 uint8_t migrate_throttle_trigger_threshold(void);
 const char *migrate_tls_authz(void);
 const char *migrate_tls_creds(void);
diff --git a/qapi/migration.json b/qapi/migration.json
index 975761eebd..09e4393591 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -531,6 +531,12 @@
 #     and can result in more stable read performance.  Requires KVM
 #     with accelerator property "dirty-ring-size" set.  (Since 8.1)
 #
+#
+# @main-zero-page: If enabled, the detection of zero pages will be
+#                  done on the main thread.  Otherwise it is done on
+#                  the multifd threads.
+#                  (since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block is deprecated.  Use blockdev-mirror with
@@ -555,7 +561,7 @@
            { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
            'validate-uuid', 'background-snapshot',
            'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
-           'dirty-limit'] }
+           'dirty-limit', 'main-zero-page'] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.30.2




* [PATCH v2 02/20] multifd: Support for zero pages transmission
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 01/20] multifd: Add capability to enable/disable zero_page Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 03/20] multifd: Zero " Hao Xiang
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel

From: Juan Quintela <quintela@redhat.com>

This patch adds the counters and the packet fields for zero pages. The
logic will be added in the following patch.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 migration/multifd.c    | 37 ++++++++++++++++++++++++++++++-------
 migration/multifd.h    | 17 ++++++++++++++++-
 migration/trace-events |  8 ++++----
 3 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index ec58c58082..d28ef0028b 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -267,6 +267,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
     packet->normal_pages = cpu_to_be32(p->normal_num);
     packet->next_packet_size = cpu_to_be32(p->next_packet_size);
     packet->packet_num = cpu_to_be64(p->packet_num);
+    packet->zero_pages = cpu_to_be32(p->zero_num);
 
     if (p->pages->block) {
         strncpy(packet->ramblock, p->pages->block->idstr, 256);
@@ -326,7 +327,15 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
     p->next_packet_size = be32_to_cpu(packet->next_packet_size);
     p->packet_num = be64_to_cpu(packet->packet_num);
 
-    if (p->normal_num == 0) {
+    p->zero_num = be32_to_cpu(packet->zero_pages);
+    if (p->zero_num > packet->pages_alloc - p->normal_num) {
+        error_setg(errp, "multifd: received packet "
+                   "with %u zero pages and expected maximum pages are %u",
+                   p->zero_num, packet->pages_alloc - p->normal_num) ;
+        return -1;
+    }
+
+    if (p->normal_num == 0 && p->zero_num == 0) {
         return 0;
     }
 
@@ -431,6 +440,7 @@ static int multifd_send_pages(QEMUFile *f)
     p->packet_num = multifd_send_state->packet_num++;
     multifd_send_state->pages = p->pages;
     p->pages = pages;
+
     qemu_mutex_unlock(&p->mutex);
     qemu_sem_post(&p->sem);
 
@@ -552,6 +562,8 @@ void multifd_save_cleanup(void)
         p->iov = NULL;
         g_free(p->normal);
         p->normal = NULL;
+        g_free(p->zero);
+        p->zero = NULL;
         multifd_send_state->ops->send_cleanup(p, &local_err);
         if (local_err) {
             migrate_set_error(migrate_get_current(), local_err);
@@ -680,6 +692,7 @@ static void *multifd_send_thread(void *opaque)
             uint64_t packet_num = p->packet_num;
             uint32_t flags;
             p->normal_num = 0;
+            p->zero_num = 0;
 
             if (use_zero_copy_send) {
                 p->iovs_num = 0;
@@ -704,12 +717,13 @@ static void *multifd_send_thread(void *opaque)
             p->flags = 0;
             p->num_packets++;
             p->total_normal_pages += p->normal_num;
+            p->total_zero_pages += p->zero_num;
             p->pages->num = 0;
             p->pages->block = NULL;
             qemu_mutex_unlock(&p->mutex);
 
-            trace_multifd_send(p->id, packet_num, p->normal_num, flags,
-                               p->next_packet_size);
+            trace_multifd_send(p->id, packet_num, p->normal_num, p->zero_num,
+                               flags, p->next_packet_size);
 
             if (use_zero_copy_send) {
                 /* Send header first, without zerocopy */
@@ -732,6 +746,8 @@ static void *multifd_send_thread(void *opaque)
 
             stat64_add(&mig_stats.multifd_bytes,
                        p->next_packet_size + p->packet_len);
+            stat64_add(&mig_stats.normal_pages, p->normal_num);
+            stat64_add(&mig_stats.zero_pages, p->zero_num);
             p->next_packet_size = 0;
             qemu_mutex_lock(&p->mutex);
             p->pending_job--;
@@ -762,7 +778,8 @@ out:
 
     rcu_unregister_thread();
     migration_threads_remove(thread);
-    trace_multifd_send_thread_end(p->id, p->num_packets, p->total_normal_pages);
+    trace_multifd_send_thread_end(p->id, p->num_packets, p->total_normal_pages,
+                                  p->total_zero_pages);
 
     return NULL;
 }
@@ -939,6 +956,7 @@ int multifd_save_setup(Error **errp)
         p->normal = g_new0(ram_addr_t, page_count);
         p->page_size = qemu_target_page_size();
         p->page_count = page_count;
+        p->zero = g_new0(ram_addr_t, page_count);
 
         if (migrate_zero_copy_send()) {
             p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
@@ -1054,6 +1072,8 @@ void multifd_load_cleanup(void)
         p->iov = NULL;
         g_free(p->normal);
         p->normal = NULL;
+        g_free(p->zero);
+        p->zero = NULL;
         multifd_recv_state->ops->recv_cleanup(p);
     }
     qemu_sem_destroy(&multifd_recv_state->sem_sync);
@@ -1122,10 +1142,11 @@ static void *multifd_recv_thread(void *opaque)
         flags = p->flags;
         /* recv methods don't know how to handle the SYNC flag */
         p->flags &= ~MULTIFD_FLAG_SYNC;
-        trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
-                           p->next_packet_size);
+        trace_multifd_recv(p->id, p->packet_num, p->normal_num, p->zero_num,
+                           flags, p->next_packet_size);
         p->num_packets++;
         p->total_normal_pages += p->normal_num;
+        p->total_zero_pages += p->zero_num;
         qemu_mutex_unlock(&p->mutex);
 
         if (p->normal_num) {
@@ -1150,7 +1171,8 @@ static void *multifd_recv_thread(void *opaque)
     qemu_mutex_unlock(&p->mutex);
 
     rcu_unregister_thread();
-    trace_multifd_recv_thread_end(p->id, p->num_packets, p->total_normal_pages);
+    trace_multifd_recv_thread_end(p->id, p->num_packets, p->total_normal_pages,
+                                  p->total_zero_pages);
 
     return NULL;
 }
@@ -1191,6 +1213,7 @@ int multifd_load_setup(Error **errp)
         p->normal = g_new0(ram_addr_t, page_count);
         p->page_count = page_count;
         p->page_size = qemu_target_page_size();
+        p->zero = g_new0(ram_addr_t, page_count);
     }
 
     for (i = 0; i < thread_count; i++) {
diff --git a/migration/multifd.h b/migration/multifd.h
index a835643b48..d587b0e19c 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -48,7 +48,10 @@ typedef struct {
     /* size of the next packet that contains pages */
     uint32_t next_packet_size;
     uint64_t packet_num;
-    uint64_t unused[4];    /* Reserved for future use */
+    /* zero pages */
+    uint32_t zero_pages;
+    uint32_t unused32[1];    /* Reserved for future use */
+    uint64_t unused64[3];    /* Reserved for future use */
     char ramblock[256];
     uint64_t offset[];
 } __attribute__((packed)) MultiFDPacket_t;
@@ -122,6 +125,8 @@ typedef struct {
     uint64_t num_packets;
     /* non zero pages sent through this channel */
     uint64_t total_normal_pages;
+    /* zero pages sent through this channel */
+    uint64_t total_zero_pages;
     /* buffers to send */
     struct iovec *iov;
     /* number of iovs used */
@@ -130,6 +135,10 @@ typedef struct {
     ram_addr_t *normal;
     /* num of non zero pages */
     uint32_t normal_num;
+    /* Pages that are zero */
+    ram_addr_t *zero;
+    /* num of zero pages */
+    uint32_t zero_num;
     /* used for compression methods */
     void *data;
 }  MultiFDSendParams;
@@ -181,12 +190,18 @@ typedef struct {
     uint8_t *host;
     /* non zero pages recv through this channel */
     uint64_t total_normal_pages;
+    /* zero pages recv through this channel */
+    uint64_t total_zero_pages;
     /* buffers to recv */
     struct iovec *iov;
     /* Pages that are not zero */
     ram_addr_t *normal;
     /* num of non zero pages */
     uint32_t normal_num;
+    /* Pages that are zero */
+    ram_addr_t *zero;
+    /* num of zero pages */
+    uint32_t zero_num;
     /* used for de-compression methods */
     void *data;
 } MultiFDRecvParams;
diff --git a/migration/trace-events b/migration/trace-events
index de4a743c8a..c0a758db9d 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -128,21 +128,21 @@ postcopy_preempt_reset_channel(void) ""
 # multifd.c
 multifd_new_send_channel_async(uint8_t id) "channel %u"
 multifd_new_send_channel_async_error(uint8_t id, void *err) "channel=%u err=%p"
-multifd_recv(uint8_t id, uint64_t packet_num, uint32_t used, uint32_t flags, uint32_t next_packet_size) "channel %u packet_num %" PRIu64 " pages %u flags 0x%x next packet size %u"
+multifd_recv(uint8_t id, uint64_t packet_num, uint32_t normal, uint32_t zero, uint32_t flags, uint32_t next_packet_size) "channel %u packet_num %" PRIu64 " normal pages %u zero pages %u flags 0x%x next packet size %u"
 multifd_recv_new_channel(uint8_t id) "channel %u"
 multifd_recv_sync_main(long packet_num) "packet num %ld"
 multifd_recv_sync_main_signal(uint8_t id) "channel %u"
 multifd_recv_sync_main_wait(uint8_t id) "channel %u"
 multifd_recv_terminate_threads(bool error) "error %d"
-multifd_recv_thread_end(uint8_t id, uint64_t packets, uint64_t pages) "channel %u packets %" PRIu64 " pages %" PRIu64
+multifd_recv_thread_end(uint8_t id, uint64_t packets, uint64_t normal_pages, uint64_t zero_pages) "channel %u packets %" PRIu64 " normal pages %" PRIu64 " zero pages %" PRIu64
 multifd_recv_thread_start(uint8_t id) "%u"
-multifd_send(uint8_t id, uint64_t packet_num, uint32_t normal, uint32_t flags, uint32_t next_packet_size) "channel %u packet_num %" PRIu64 " normal pages %u flags 0x%x next packet size %u"
+multifd_send(uint8_t id, uint64_t packet_num, uint32_t normalpages, uint32_t zero_pages, uint32_t flags, uint32_t next_packet_size) "channel %u packet_num %" PRIu64 " normal pages %u zero pages %u flags 0x%x next packet size %u"
 multifd_send_error(uint8_t id) "channel %u"
 multifd_send_sync_main(long packet_num) "packet num %ld"
 multifd_send_sync_main_signal(uint8_t id) "channel %u"
 multifd_send_sync_main_wait(uint8_t id) "channel %u"
 multifd_send_terminate_threads(bool error) "error %d"
-multifd_send_thread_end(uint8_t id, uint64_t packets, uint64_t normal_pages) "channel %u packets %" PRIu64 " normal pages %"  PRIu64
+multifd_send_thread_end(uint8_t id, uint64_t packets, uint64_t normal_pages, uint64_t zero_pages) "channel %u packets %" PRIu64 " normal pages %"  PRIu64 " zero pages %"  PRIu64
 multifd_send_thread_start(uint8_t id) "%u"
 multifd_tls_outgoing_handshake_start(void *ioc, void *tioc, const char *hostname) "ioc=%p tioc=%p hostname=%s"
 multifd_tls_outgoing_handshake_error(void *ioc, const char *err) "ioc=%p err=%s"
-- 
2.30.2




* [PATCH v2 03/20] multifd: Zero pages transmission
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 01/20] multifd: Add capability to enable/disable zero_page Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 02/20] multifd: Support for zero pages transmission Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-12-18  2:43   ` Wang, Lei
  2023-11-14  5:40 ` [PATCH v2 04/20] So we use multifd to transmit zero pages Hao Xiang
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel

From: Juan Quintela <quintela@redhat.com>

This implements the zero page detection and handling.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 migration/multifd.c | 41 +++++++++++++++++++++++++++++++++++++++--
 migration/multifd.h |  5 +++++
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index d28ef0028b..1b994790d5 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -11,6 +11,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/cutils.h"
 #include "qemu/rcu.h"
 #include "exec/target_page.h"
 #include "sysemu/sysemu.h"
@@ -279,6 +280,12 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
 
         packet->offset[i] = cpu_to_be64(temp);
     }
+    for (i = 0; i < p->zero_num; i++) {
+        /* there are architectures where ram_addr_t is 32 bit */
+        uint64_t temp = p->zero[i];
+
+        packet->offset[p->normal_num + i] = cpu_to_be64(temp);
+    }
 }
 
 static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
@@ -361,6 +368,18 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
         p->normal[i] = offset;
     }
 
+    for (i = 0; i < p->zero_num; i++) {
+        uint64_t offset = be64_to_cpu(packet->offset[p->normal_num + i]);
+
+        if (offset > (p->block->used_length - p->page_size)) {
+            error_setg(errp, "multifd: offset too long %" PRIu64
+                       " (max " RAM_ADDR_FMT ")",
+                       offset, p->block->used_length);
+            return -1;
+        }
+        p->zero[i] = offset;
+    }
+
     return 0;
 }
 
@@ -664,6 +683,8 @@ static void *multifd_send_thread(void *opaque)
     MultiFDSendParams *p = opaque;
     MigrationThread *thread = NULL;
     Error *local_err = NULL;
+    /* QEMU older than 8.2 doesn't understand zero page on multifd channel */
+    bool use_zero_page = !migrate_use_main_zero_page();
     int ret = 0;
     bool use_zero_copy_send = migrate_zero_copy_send();
 
@@ -689,6 +710,7 @@ static void *multifd_send_thread(void *opaque)
         qemu_mutex_lock(&p->mutex);
 
         if (p->pending_job) {
+            RAMBlock *rb = p->pages->block;
             uint64_t packet_num = p->packet_num;
             uint32_t flags;
             p->normal_num = 0;
@@ -701,8 +723,16 @@ static void *multifd_send_thread(void *opaque)
             }
 
             for (int i = 0; i < p->pages->num; i++) {
-                p->normal[p->normal_num] = p->pages->offset[i];
-                p->normal_num++;
+                uint64_t offset = p->pages->offset[i];
+                if (use_zero_page &&
+                    buffer_is_zero(rb->host + offset, p->page_size)) {
+                    p->zero[p->zero_num] = offset;
+                    p->zero_num++;
+                    ram_release_page(rb->idstr, offset);
+                } else {
+                    p->normal[p->normal_num] = offset;
+                    p->normal_num++;
+                }
             }
 
             if (p->normal_num) {
@@ -1156,6 +1186,13 @@ static void *multifd_recv_thread(void *opaque)
             }
         }
 
+        for (int i = 0; i < p->zero_num; i++) {
+            void *page = p->host + p->zero[i];
+            if (!buffer_is_zero(page, p->page_size)) {
+                memset(page, 0, p->page_size);
+            }
+        }
+
         if (flags & MULTIFD_FLAG_SYNC) {
             qemu_sem_post(&multifd_recv_state->sem_sync);
             qemu_sem_wait(&p->sem_sync);
diff --git a/migration/multifd.h b/migration/multifd.h
index d587b0e19c..13762900d4 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -53,6 +53,11 @@ typedef struct {
     uint32_t unused32[1];    /* Reserved for future use */
     uint64_t unused64[3];    /* Reserved for future use */
     char ramblock[256];
+    /*
+     * This array contains the pointers to:
+     *  - normal pages (initial normal_pages entries)
+     *  - zero pages (following zero_pages entries)
+     */
     uint64_t offset[];
 } __attribute__((packed)) MultiFDPacket_t;
 
-- 
2.30.2




* [PATCH v2 04/20] So we use multifd to transmit zero pages.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (2 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 03/20] multifd: Zero " Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-11-16 15:14   ` Fabiano Rosas
  2023-11-14  5:40 ` [PATCH v2 05/20] meson: Introduce new instruction set enqcmd to the build system Hao Xiang
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Leonardo Bras

From: Juan Quintela <quintela@redhat.com>
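
With the capability and the counters from the previous patches in place,
multifd channels can now detect and send zero pages themselves. When multifd
is enabled and the main-zero-page capability is off, RAM saving switches to
the new ram_save_target_page_multifd() and skips the main-thread zero page
detection.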

Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.c |  7 ++++---
 migration/options.c | 13 +++++++------
 migration/ram.c     | 45 ++++++++++++++++++++++++++++++++++++++-------
 qapi/migration.json |  1 -
 4 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 1b994790d5..1198ffde9c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -13,6 +13,7 @@
 #include "qemu/osdep.h"
 #include "qemu/cutils.h"
 #include "qemu/rcu.h"
+#include "qemu/cutils.h"
 #include "exec/target_page.h"
 #include "sysemu/sysemu.h"
 #include "exec/ramblock.h"
@@ -459,7 +460,6 @@ static int multifd_send_pages(QEMUFile *f)
     p->packet_num = multifd_send_state->packet_num++;
     multifd_send_state->pages = p->pages;
     p->pages = pages;
-
     qemu_mutex_unlock(&p->mutex);
     qemu_sem_post(&p->sem);
 
@@ -684,7 +684,7 @@ static void *multifd_send_thread(void *opaque)
     MigrationThread *thread = NULL;
     Error *local_err = NULL;
     /* QEMU older than 8.2 doesn't understand zero page on multifd channel */
-    bool use_zero_page = !migrate_use_main_zero_page();
+    bool use_multifd_zero_page = !migrate_use_main_zero_page();
     int ret = 0;
     bool use_zero_copy_send = migrate_zero_copy_send();
 
@@ -713,6 +713,7 @@ static void *multifd_send_thread(void *opaque)
             RAMBlock *rb = p->pages->block;
             uint64_t packet_num = p->packet_num;
             uint32_t flags;
+
             p->normal_num = 0;
             p->zero_num = 0;
 
@@ -724,7 +725,7 @@ static void *multifd_send_thread(void *opaque)
 
             for (int i = 0; i < p->pages->num; i++) {
                 uint64_t offset = p->pages->offset[i];
-                if (use_zero_page &&
+                if (use_multifd_zero_page &&
                     buffer_is_zero(rb->host + offset, p->page_size)) {
                     p->zero[p->zero_num] = offset;
                     p->zero_num++;
diff --git a/migration/options.c b/migration/options.c
index 00c0c4a0d6..97d121d4d7 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -195,6 +195,7 @@ Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
     DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
     DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
+    DEFINE_PROP_MIG_CAP("x-main-zero-page", MIGRATION_CAPABILITY_MAIN_ZERO_PAGE),
     DEFINE_PROP_MIG_CAP("x-background-snapshot",
             MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
 #ifdef CONFIG_LINUX
@@ -288,13 +289,9 @@ bool migrate_multifd(void)
 
 bool migrate_use_main_zero_page(void)
 {
-    //MigrationState *s;
-
-    //s = migrate_get_current();
+    MigrationState *s = migrate_get_current();
 
-    // We will enable this when we add the right code.
-    // return s->enabled_capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
-    return true;
+    return s->capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
 }
 
 bool migrate_pause_before_switchover(void)
@@ -457,6 +454,7 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
     MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE,
     MIGRATION_CAPABILITY_RETURN_PATH,
     MIGRATION_CAPABILITY_MULTIFD,
+    MIGRATION_CAPABILITY_MAIN_ZERO_PAGE,
     MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER,
     MIGRATION_CAPABILITY_AUTO_CONVERGE,
     MIGRATION_CAPABILITY_RELEASE_RAM,
@@ -534,6 +532,9 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
             error_setg(errp, "Postcopy is not yet compatible with multifd");
             return false;
         }
+        if (new_caps[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE]) {
+            error_setg(errp, "Postcopy is not yet compatible with main zero page");
+        }
     }
 
     if (new_caps[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
diff --git a/migration/ram.c b/migration/ram.c
index 8c7886ab79..f7a42feff2 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2059,17 +2059,42 @@ static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss)
     if (save_zero_page(rs, pss, offset)) {
         return 1;
     }
-
     /*
-     * Do not use multifd in postcopy as one whole host page should be
-     * placed.  Meanwhile postcopy requires atomic update of pages, so even
-     * if host page size == guest page size the dest guest during run may
-     * still see partially copied pages which is data corruption.
+     * Do not use multifd for:
+     * 1. Compression as the first page in the new block should be posted out
+     *    before sending the compressed page
+     * 2. In postcopy as one whole host page should be placed
      */
-    if (migrate_multifd() && !migration_in_postcopy()) {
+    if (!migrate_compress() && migrate_multifd() && !migration_in_postcopy()) {
+        return ram_save_multifd_page(pss->pss_channel, block, offset);
+    }
+
+    return ram_save_page(rs, pss);
+}
+
+/**
+ * ram_save_target_page_multifd: save one target page
+ *
+ * Returns the number of pages written
+ *
+ * @rs: current RAM state
+ * @pss: data about the page we want to send
+ */
+static int ram_save_target_page_multifd(RAMState *rs, PageSearchStatus *pss)
+{
+    RAMBlock *block = pss->block;
+    ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
+    int res;
+
+    if (!migration_in_postcopy()) {
         return ram_save_multifd_page(pss->pss_channel, block, offset);
     }
 
+    res = save_zero_page(rs, pss, offset);
+    if (res > 0) {
+        return res;
+    }
+
     return ram_save_page(rs, pss);
 }
 
@@ -2982,9 +3007,15 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     }
 
     migration_ops = g_malloc0(sizeof(MigrationOps));
-    migration_ops->ram_save_target_page = ram_save_target_page_legacy;
+
+    if (migrate_multifd() && !migrate_use_main_zero_page()) {
+        migration_ops->ram_save_target_page = ram_save_target_page_multifd;
+    } else {
+        migration_ops->ram_save_target_page = ram_save_target_page_legacy;
+    }
 
     qemu_mutex_unlock_iothread();
+
     ret = multifd_send_sync_main(f);
     qemu_mutex_lock_iothread();
     if (ret < 0) {
diff --git a/qapi/migration.json b/qapi/migration.json
index 09e4393591..9783289bfc 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -531,7 +531,6 @@
 #     and can result in more stable read performance.  Requires KVM
 #     with accelerator property "dirty-ring-size" set.  (Since 8.1)
 #
-#
 # @main-zero-page: If enabled, the detection of zero pages will be
 #                  done on the main thread.  Otherwise it is done on
 #                  the multifd threads.
-- 
2.30.2




* [PATCH v2 05/20] meson: Introduce new instruction set enqcmd to the build system.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (3 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 04/20] So we use multifd to transmit zero pages Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-12-11 15:41   ` Fabiano Rosas
  2023-11-14  5:40 ` [PATCH v2 06/20] util/dsa: Add dependency idxd Hao Xiang
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

Enable the enqcmd instruction set in the build system.

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
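
As a usage note, the intrinsic enabled here is expected to sit behind the
new CONFIG_DSA_OPT guard together with GCC's target pragma; this sketch
mirrors the pattern util/dsa.c adopts later in this series:

    #ifdef CONFIG_DSA_OPT
    #pragma GCC push_options
    #pragma GCC target("enqcmd")
    #include <x86intrin.h>
    /* ... code using the _enqcmd() intrinsic ... */
    #pragma GCC pop_options
    #endif
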
 meson.build                   | 2 ++
 meson_options.txt             | 2 ++
 scripts/meson-buildoptions.sh | 3 +++
 3 files changed, 7 insertions(+)

diff --git a/meson.build b/meson.build
index ec01f8b138..1292ab78a3 100644
--- a/meson.build
+++ b/meson.build
@@ -2708,6 +2708,8 @@ config_host_data.set('CONFIG_AVX512BW_OPT', get_option('avx512bw') \
     int main(int argc, char *argv[]) { return bar(argv[0]); }
   '''), error_message: 'AVX512BW not available').allowed())
 
+config_host_data.set('CONFIG_DSA_OPT', get_option('enqcmd'))
+
 # For both AArch64 and AArch32, detect if builtins are available.
 config_host_data.set('CONFIG_ARM_AES_BUILTIN', cc.compiles('''
     #include <arm_neon.h>
diff --git a/meson_options.txt b/meson_options.txt
index c9baeda639..6fe8aca181 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -121,6 +121,8 @@ option('avx512f', type: 'feature', value: 'disabled',
        description: 'AVX512F optimizations')
 option('avx512bw', type: 'feature', value: 'auto',
        description: 'AVX512BW optimizations')
+option('enqcmd', type: 'boolean', value: false,
+       description: 'ENQCMD optimizations')
 option('keyring', type: 'feature', value: 'auto',
        description: 'Linux keyring support')
 option('libkeyutils', type: 'feature', value: 'auto',
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 680fa3f581..bf139e3fb4 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -93,6 +93,7 @@ meson_options_help() {
   printf "%s\n" '  avx2            AVX2 optimizations'
   printf "%s\n" '  avx512bw        AVX512BW optimizations'
   printf "%s\n" '  avx512f         AVX512F optimizations'
+  printf "%s\n" '  enqcmd          ENQCMD optimizations'
   printf "%s\n" '  blkio           libblkio block device driver'
   printf "%s\n" '  bochs           bochs image format support'
   printf "%s\n" '  bpf             eBPF support'
@@ -240,6 +241,8 @@ _meson_option_parse() {
     --disable-avx512bw) printf "%s" -Davx512bw=disabled ;;
     --enable-avx512f) printf "%s" -Davx512f=enabled ;;
     --disable-avx512f) printf "%s" -Davx512f=disabled ;;
+    --enable-enqcmd) printf "%s" -Denqcmd=true ;;
+    --disable-enqcmd) printf "%s" -Denqcmd=false ;;
     --enable-gcov) printf "%s" -Db_coverage=true ;;
     --disable-gcov) printf "%s" -Db_coverage=false ;;
     --enable-lto) printf "%s" -Db_lto=true ;;
-- 
2.30.2




* [PATCH v2 06/20] util/dsa: Add dependency idxd.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (4 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 05/20] meson: Introduce new instruction set enqcmd to the build system Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 07/20] util/dsa: Implement DSA device start and stop logic Hao Xiang
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

idxd is the device driver for DSA (Intel Data Streaming
Accelerator). The driver has been fully functional since Linux
kernel 5.19. This change adds the driver's header file, which is
used for user space development.

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 linux-headers/linux/idxd.h | 356 +++++++++++++++++++++++++++++++++++++
 1 file changed, 356 insertions(+)
 create mode 100644 linux-headers/linux/idxd.h

diff --git a/linux-headers/linux/idxd.h b/linux-headers/linux/idxd.h
new file mode 100644
index 0000000000..1d553bedbd
--- /dev/null
+++ b/linux-headers/linux/idxd.h
@@ -0,0 +1,356 @@
+/* SPDX-License-Identifier: LGPL-2.1 WITH Linux-syscall-note */
+/* Copyright(c) 2019 Intel Corporation. All rights rsvd. */
+#ifndef _USR_IDXD_H_
+#define _USR_IDXD_H_
+
+#ifdef __KERNEL__
+#include <linux/types.h>
+#else
+#include <stdint.h>
+#endif
+
+/* Driver command error status */
+enum idxd_scmd_stat {
+	IDXD_SCMD_DEV_ENABLED = 0x80000010,
+	IDXD_SCMD_DEV_NOT_ENABLED = 0x80000020,
+	IDXD_SCMD_WQ_ENABLED = 0x80000021,
+	IDXD_SCMD_DEV_DMA_ERR = 0x80020000,
+	IDXD_SCMD_WQ_NO_GRP = 0x80030000,
+	IDXD_SCMD_WQ_NO_NAME = 0x80040000,
+	IDXD_SCMD_WQ_NO_SVM = 0x80050000,
+	IDXD_SCMD_WQ_NO_THRESH = 0x80060000,
+	IDXD_SCMD_WQ_PORTAL_ERR = 0x80070000,
+	IDXD_SCMD_WQ_RES_ALLOC_ERR = 0x80080000,
+	IDXD_SCMD_PERCPU_ERR = 0x80090000,
+	IDXD_SCMD_DMA_CHAN_ERR = 0x800a0000,
+	IDXD_SCMD_CDEV_ERR = 0x800b0000,
+	IDXD_SCMD_WQ_NO_SWQ_SUPPORT = 0x800c0000,
+	IDXD_SCMD_WQ_NONE_CONFIGURED = 0x800d0000,
+	IDXD_SCMD_WQ_NO_SIZE = 0x800e0000,
+	IDXD_SCMD_WQ_NO_PRIV = 0x800f0000,
+	IDXD_SCMD_WQ_IRQ_ERR = 0x80100000,
+	IDXD_SCMD_WQ_USER_NO_IOMMU = 0x80110000,
+};
+
+#define IDXD_SCMD_SOFTERR_MASK	0x80000000
+#define IDXD_SCMD_SOFTERR_SHIFT	16
+
+/* Descriptor flags */
+#define IDXD_OP_FLAG_FENCE	0x0001
+#define IDXD_OP_FLAG_BOF	0x0002
+#define IDXD_OP_FLAG_CRAV	0x0004
+#define IDXD_OP_FLAG_RCR	0x0008
+#define IDXD_OP_FLAG_RCI	0x0010
+#define IDXD_OP_FLAG_CRSTS	0x0020
+#define IDXD_OP_FLAG_CR		0x0080
+#define IDXD_OP_FLAG_CC		0x0100
+#define IDXD_OP_FLAG_ADDR1_TCS	0x0200
+#define IDXD_OP_FLAG_ADDR2_TCS	0x0400
+#define IDXD_OP_FLAG_ADDR3_TCS	0x0800
+#define IDXD_OP_FLAG_CR_TCS	0x1000
+#define IDXD_OP_FLAG_STORD	0x2000
+#define IDXD_OP_FLAG_DRDBK	0x4000
+#define IDXD_OP_FLAG_DSTS	0x8000
+
+/* IAX */
+#define IDXD_OP_FLAG_RD_SRC2_AECS	0x010000
+#define IDXD_OP_FLAG_RD_SRC2_2ND	0x020000
+#define IDXD_OP_FLAG_WR_SRC2_AECS_COMP	0x040000
+#define IDXD_OP_FLAG_WR_SRC2_AECS_OVFL	0x080000
+#define IDXD_OP_FLAG_SRC2_STS		0x100000
+#define IDXD_OP_FLAG_CRC_RFC3720	0x200000
+
+/* Opcode */
+enum dsa_opcode {
+	DSA_OPCODE_NOOP = 0,
+	DSA_OPCODE_BATCH,
+	DSA_OPCODE_DRAIN,
+	DSA_OPCODE_MEMMOVE,
+	DSA_OPCODE_MEMFILL,
+	DSA_OPCODE_COMPARE,
+	DSA_OPCODE_COMPVAL,
+	DSA_OPCODE_CR_DELTA,
+	DSA_OPCODE_AP_DELTA,
+	DSA_OPCODE_DUALCAST,
+	DSA_OPCODE_CRCGEN = 0x10,
+	DSA_OPCODE_COPY_CRC,
+	DSA_OPCODE_DIF_CHECK,
+	DSA_OPCODE_DIF_INS,
+	DSA_OPCODE_DIF_STRP,
+	DSA_OPCODE_DIF_UPDT,
+	DSA_OPCODE_CFLUSH = 0x20,
+};
+
+enum iax_opcode {
+	IAX_OPCODE_NOOP = 0,
+	IAX_OPCODE_DRAIN = 2,
+	IAX_OPCODE_MEMMOVE,
+	IAX_OPCODE_DECOMPRESS = 0x42,
+	IAX_OPCODE_COMPRESS,
+	IAX_OPCODE_CRC64,
+	IAX_OPCODE_ZERO_DECOMP_32 = 0x48,
+	IAX_OPCODE_ZERO_DECOMP_16,
+	IAX_OPCODE_ZERO_COMP_32 = 0x4c,
+	IAX_OPCODE_ZERO_COMP_16,
+	IAX_OPCODE_SCAN = 0x50,
+	IAX_OPCODE_SET_MEMBER,
+	IAX_OPCODE_EXTRACT,
+	IAX_OPCODE_SELECT,
+	IAX_OPCODE_RLE_BURST,
+	IAX_OPCODE_FIND_UNIQUE,
+	IAX_OPCODE_EXPAND,
+};
+
+/* Completion record status */
+enum dsa_completion_status {
+	DSA_COMP_NONE = 0,
+	DSA_COMP_SUCCESS,
+	DSA_COMP_SUCCESS_PRED,
+	DSA_COMP_PAGE_FAULT_NOBOF,
+	DSA_COMP_PAGE_FAULT_IR,
+	DSA_COMP_BATCH_FAIL,
+	DSA_COMP_BATCH_PAGE_FAULT,
+	DSA_COMP_DR_OFFSET_NOINC,
+	DSA_COMP_DR_OFFSET_ERANGE,
+	DSA_COMP_DIF_ERR,
+	DSA_COMP_BAD_OPCODE = 0x10,
+	DSA_COMP_INVALID_FLAGS,
+	DSA_COMP_NOZERO_RESERVE,
+	DSA_COMP_XFER_ERANGE,
+	DSA_COMP_DESC_CNT_ERANGE,
+	DSA_COMP_DR_ERANGE,
+	DSA_COMP_OVERLAP_BUFFERS,
+	DSA_COMP_DCAST_ERR,
+	DSA_COMP_DESCLIST_ALIGN,
+	DSA_COMP_INT_HANDLE_INVAL,
+	DSA_COMP_CRA_XLAT,
+	DSA_COMP_CRA_ALIGN,
+	DSA_COMP_ADDR_ALIGN,
+	DSA_COMP_PRIV_BAD,
+	DSA_COMP_TRAFFIC_CLASS_CONF,
+	DSA_COMP_PFAULT_RDBA,
+	DSA_COMP_HW_ERR1,
+	DSA_COMP_HW_ERR_DRB,
+	DSA_COMP_TRANSLATION_FAIL,
+};
+
+enum iax_completion_status {
+	IAX_COMP_NONE = 0,
+	IAX_COMP_SUCCESS,
+	IAX_COMP_PAGE_FAULT_IR = 0x04,
+	IAX_COMP_ANALYTICS_ERROR = 0x0a,
+	IAX_COMP_OUTBUF_OVERFLOW,
+	IAX_COMP_BAD_OPCODE = 0x10,
+	IAX_COMP_INVALID_FLAGS,
+	IAX_COMP_NOZERO_RESERVE,
+	IAX_COMP_INVALID_SIZE,
+	IAX_COMP_OVERLAP_BUFFERS = 0x16,
+	IAX_COMP_INT_HANDLE_INVAL = 0x19,
+	IAX_COMP_CRA_XLAT,
+	IAX_COMP_CRA_ALIGN,
+	IAX_COMP_ADDR_ALIGN,
+	IAX_COMP_PRIV_BAD,
+	IAX_COMP_TRAFFIC_CLASS_CONF,
+	IAX_COMP_PFAULT_RDBA,
+	IAX_COMP_HW_ERR1,
+	IAX_COMP_HW_ERR_DRB,
+	IAX_COMP_TRANSLATION_FAIL,
+	IAX_COMP_PRS_TIMEOUT,
+	IAX_COMP_WATCHDOG,
+	IAX_COMP_INVALID_COMP_FLAG = 0x30,
+	IAX_COMP_INVALID_FILTER_FLAG,
+	IAX_COMP_INVALID_INPUT_SIZE,
+	IAX_COMP_INVALID_NUM_ELEMS,
+	IAX_COMP_INVALID_SRC1_WIDTH,
+	IAX_COMP_INVALID_INVERT_OUT,
+};
+
+#define DSA_COMP_STATUS_MASK		0x7f
+#define DSA_COMP_STATUS_WRITE		0x80
+
+struct dsa_hw_desc {
+	uint32_t	pasid:20;
+	uint32_t	rsvd:11;
+	uint32_t	priv:1;
+	uint32_t	flags:24;
+	uint32_t	opcode:8;
+	uint64_t	completion_addr;
+	union {
+		uint64_t	src_addr;
+		uint64_t	rdback_addr;
+		uint64_t	pattern;
+		uint64_t	desc_list_addr;
+	};
+	union {
+		uint64_t	dst_addr;
+		uint64_t	rdback_addr2;
+		uint64_t	src2_addr;
+		uint64_t	comp_pattern;
+	};
+	union {
+		uint32_t	xfer_size;
+		uint32_t	desc_count;
+	};
+	uint16_t	int_handle;
+	uint16_t	rsvd1;
+	union {
+		uint8_t		expected_res;
+		/* create delta record */
+		struct {
+			uint64_t	delta_addr;
+			uint32_t	max_delta_size;
+			uint32_t 	delt_rsvd;
+			uint8_t 	expected_res_mask;
+		};
+		uint32_t	delta_rec_size;
+		uint64_t	dest2;
+		/* CRC */
+		struct {
+			uint32_t	crc_seed;
+			uint32_t	crc_rsvd;
+			uint64_t	seed_addr;
+		};
+		/* DIF check or strip */
+		struct {
+			uint8_t		src_dif_flags;
+			uint8_t		dif_chk_res;
+			uint8_t		dif_chk_flags;
+			uint8_t		dif_chk_res2[5];
+			uint32_t	chk_ref_tag_seed;
+			uint16_t	chk_app_tag_mask;
+			uint16_t	chk_app_tag_seed;
+		};
+		/* DIF insert */
+		struct {
+			uint8_t		dif_ins_res;
+			uint8_t		dest_dif_flag;
+			uint8_t		dif_ins_flags;
+			uint8_t		dif_ins_res2[13];
+			uint32_t	ins_ref_tag_seed;
+			uint16_t	ins_app_tag_mask;
+			uint16_t	ins_app_tag_seed;
+		};
+		/* DIF update */
+		struct {
+			uint8_t		src_upd_flags;
+			uint8_t		upd_dest_flags;
+			uint8_t		dif_upd_flags;
+			uint8_t		dif_upd_res[5];
+			uint32_t	src_ref_tag_seed;
+			uint16_t	src_app_tag_mask;
+			uint16_t	src_app_tag_seed;
+			uint32_t	dest_ref_tag_seed;
+			uint16_t	dest_app_tag_mask;
+			uint16_t	dest_app_tag_seed;
+		};
+
+		uint8_t		op_specific[24];
+	};
+} __attribute__((packed));
+
+struct iax_hw_desc {
+	uint32_t        pasid:20;
+	uint32_t        rsvd:11;
+	uint32_t        priv:1;
+	uint32_t        flags:24;
+	uint32_t        opcode:8;
+	uint64_t        completion_addr;
+	uint64_t        src1_addr;
+	uint64_t        dst_addr;
+	uint32_t        src1_size;
+	uint16_t        int_handle;
+	union {
+		uint16_t        compr_flags;
+		uint16_t        decompr_flags;
+	};
+	uint64_t        src2_addr;
+	uint32_t        max_dst_size;
+	uint32_t        src2_size;
+	uint32_t	filter_flags;
+	uint32_t	num_inputs;
+} __attribute__((packed));
+
+struct dsa_raw_desc {
+	uint64_t	field[8];
+} __attribute__((packed));
+
+/*
+ * The status field will be modified by hardware, therefore it should be
+ * volatile and prevent the compiler from optimize the read.
+ */
+struct dsa_completion_record {
+	volatile uint8_t	status;
+	union {
+		uint8_t		result;
+		uint8_t		dif_status;
+	};
+	uint16_t		rsvd;
+	uint32_t		bytes_completed;
+	uint64_t		fault_addr;
+	union {
+		/* common record */
+		struct {
+			uint32_t	invalid_flags:24;
+			uint32_t	rsvd2:8;
+		};
+
+		uint32_t	delta_rec_size;
+		uint64_t	crc_val;
+
+		/* DIF check & strip */
+		struct {
+			uint32_t	dif_chk_ref_tag;
+			uint16_t	dif_chk_app_tag_mask;
+			uint16_t	dif_chk_app_tag;
+		};
+
+		/* DIF insert */
+		struct {
+			uint64_t	dif_ins_res;
+			uint32_t	dif_ins_ref_tag;
+			uint16_t	dif_ins_app_tag_mask;
+			uint16_t	dif_ins_app_tag;
+		};
+
+		/* DIF update */
+		struct {
+			uint32_t	dif_upd_src_ref_tag;
+			uint16_t	dif_upd_src_app_tag_mask;
+			uint16_t	dif_upd_src_app_tag;
+			uint32_t	dif_upd_dest_ref_tag;
+			uint16_t	dif_upd_dest_app_tag_mask;
+			uint16_t	dif_upd_dest_app_tag;
+		};
+
+		uint8_t		op_specific[16];
+	};
+} __attribute__((packed));
+
+struct dsa_raw_completion_record {
+	uint64_t	field[4];
+} __attribute__((packed));
+
+struct iax_completion_record {
+	volatile uint8_t        status;
+	uint8_t                 error_code;
+	uint16_t                rsvd;
+	uint32_t                bytes_completed;
+	uint64_t                fault_addr;
+	uint32_t                invalid_flags;
+	uint32_t                rsvd2;
+	uint32_t                output_size;
+	uint8_t                 output_bits;
+	uint8_t                 rsvd3;
+	uint16_t                xor_csum;
+	uint32_t                crc;
+	uint32_t                min;
+	uint32_t                max;
+	uint32_t                sum;
+	uint64_t                rsvd4[2];
+} __attribute__((packed));
+
+struct iax_raw_completion_record {
+	uint64_t	field[8];
+} __attribute__((packed));
+
+#endif
-- 
2.30.2




* [PATCH v2 07/20] util/dsa: Implement DSA device start and stop logic.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (5 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 06/20] util/dsa: Add dependency idxd Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-12-11 21:28   ` Fabiano Rosas
  2023-11-14  5:40 ` [PATCH v2 08/20] util/dsa: Implement DSA task enqueue and dequeue Hao Xiang
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

* DSA device open and close.
* DSA group contains multiple DSA devices.
* DSA group configure/start/stop/clean.

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
---
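
A usage sketch of the lifecycle API added here. The caller and work queue
paths are hypothetical; dsa_init() takes a space-separated list of work
queue paths, and error handling is elided:

    #include "qemu/dsa.h"

    /* Hypothetical caller, e.g. migration setup and teardown. */
    static void example_dsa_lifecycle(void)
    {
        if (dsa_init("/dev/dsa/wq0.0 /dev/dsa/wq1.0") == 0) {
            dsa_start();
        }
        /* ... offload zero page checks while dsa_is_running() ... */
        dsa_stop();
        dsa_cleanup();
    }
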
 include/qemu/dsa.h |  49 +++++++
 util/dsa.c         | 338 +++++++++++++++++++++++++++++++++++++++++++++
 util/meson.build   |   1 +
 3 files changed, 388 insertions(+)
 create mode 100644 include/qemu/dsa.h
 create mode 100644 util/dsa.c

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
new file mode 100644
index 0000000000..30246b507e
--- /dev/null
+++ b/include/qemu/dsa.h
@@ -0,0 +1,49 @@
+#ifndef QEMU_DSA_H
+#define QEMU_DSA_H
+
+#include "qemu/thread.h"
+#include "qemu/queue.h"
+
+#ifdef CONFIG_DSA_OPT
+
+#pragma GCC push_options
+#pragma GCC target("enqcmd")
+
+#include <linux/idxd.h>
+#include "x86intrin.h"
+
+#endif
+
+/**
+ * @brief Initializes DSA devices.
+ *
+ * @param dsa_parameter A list of DSA device path from migration parameter.
+ * @return int Zero if successful, otherwise non zero.
+ */
+int dsa_init(const char *dsa_parameter);
+
+/**
+ * @brief Start logic to enable using DSA.
+ */
+void dsa_start(void);
+
+/**
+ * @brief Stop logic to clean up DSA by halting the device group and cleaning up
+ * the completion thread.
+ */
+void dsa_stop(void);
+
+/**
+ * @brief Clean up system resources created for DSA offloading.
+ *        This function is called during QEMU process teardown.
+ */
+void dsa_cleanup(void);
+
+/**
+ * @brief Check if DSA is running.
+ *
+ * @return True if DSA is running, otherwise false.
+ */
+bool dsa_is_running(void);
+
+#endif
\ No newline at end of file
diff --git a/util/dsa.c b/util/dsa.c
new file mode 100644
index 0000000000..8edaa892ec
--- /dev/null
+++ b/util/dsa.c
@@ -0,0 +1,338 @@
+/*
+ * Use Intel Data Streaming Accelerator to offload certain background
+ * operations.
+ *
+ * Copyright (c) 2023 Hao Xiang <hao.xiang@bytedance.com>
+ *                    Bryan Zhang <bryan.zhang@bytedance.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/queue.h"
+#include "qemu/memalign.h"
+#include "qemu/lockable.h"
+#include "qemu/cutils.h"
+#include "qemu/dsa.h"
+#include "qemu/bswap.h"
+#include "qemu/error-report.h"
+#include "qemu/rcu.h"
+
+#ifdef CONFIG_DSA_OPT
+
+#pragma GCC push_options
+#pragma GCC target("enqcmd")
+
+#include <linux/idxd.h>
+#include "x86intrin.h"
+
+#define DSA_WQ_SIZE 4096
+#define MAX_DSA_DEVICES 16
+
+typedef QSIMPLEQ_HEAD(dsa_task_queue, buffer_zero_batch_task) dsa_task_queue;
+
+struct dsa_device {
+    void *work_queue;
+};
+
+struct dsa_device_group {
+    struct dsa_device *dsa_devices;
+    int num_dsa_devices;
+    uint32_t index;
+    bool running;
+    QemuMutex task_queue_lock;
+    QemuCond task_queue_cond;
+    dsa_task_queue task_queue;
+};
+
+uint64_t max_retry_count;
+static struct dsa_device_group dsa_group;
+
+
+/**
+ * @brief This function opens a DSA device's work queue and
+ *        maps the DSA device memory into the current process.
+ *
+ * @param dsa_wq_path A pointer to the DSA device work queue's file path.
+ * @return A pointer to the mapped memory.
+ */
+static void *
+map_dsa_device(const char *dsa_wq_path)
+{
+    void *dsa_device;
+    int fd;
+
+    fd = open(dsa_wq_path, O_RDWR);
+    if (fd < 0) {
+        fprintf(stderr, "open %s failed with errno = %d.\n",
+                dsa_wq_path, errno);
+        return MAP_FAILED;
+    }
+    dsa_device = mmap(NULL, DSA_WQ_SIZE, PROT_WRITE,
+                      MAP_SHARED | MAP_POPULATE, fd, 0);
+    close(fd);
+    if (dsa_device == MAP_FAILED) {
+        fprintf(stderr, "mmap failed with errno = %d.\n", errno);
+        return MAP_FAILED;
+    }
+    return dsa_device;
+}
+
+/**
+ * @brief Initializes a DSA device structure.
+ *
+ * @param instance A pointer to the DSA device.
+ * @param work_queue  A pointer to the DSA work queue.
+ */
+static void
+dsa_device_init(struct dsa_device *instance,
+                void *dsa_work_queue)
+{
+    instance->work_queue = dsa_work_queue;
+}
+
+/**
+ * @brief Cleans up a DSA device structure.
+ *
+ * @param instance A pointer to the DSA device to cleanup.
+ */
+static void
+dsa_device_cleanup(struct dsa_device *instance)
+{
+    if (instance->work_queue != MAP_FAILED) {
+        munmap(instance->work_queue, DSA_WQ_SIZE);
+    }
+}
+
+/**
+ * @brief Initializes a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ * @param dsa_parameter A space-separated list of DSA work queue paths.
+ *
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int
+dsa_device_group_init(struct dsa_device_group *group,
+                      const char *dsa_parameter)
+{
+    if (dsa_parameter == NULL || strlen(dsa_parameter) == 0) {
+        return 0;
+    }
+
+    int ret = 0;
+    char *local_dsa_parameter = g_strdup(dsa_parameter);
+    const char *dsa_path[MAX_DSA_DEVICES];
+    int num_dsa_devices = 0;
+    char delim[2] = " ";
+
+    char *current_dsa_path = strtok(local_dsa_parameter, delim);
+
+    while (current_dsa_path != NULL) {
+        dsa_path[num_dsa_devices++] = current_dsa_path;
+        if (num_dsa_devices == MAX_DSA_DEVICES) {
+            break;
+        }
+        current_dsa_path = strtok(NULL, delim);
+    }
+
+    group->dsa_devices = g_new0(struct dsa_device, num_dsa_devices);
+    group->num_dsa_devices = num_dsa_devices;
+    group->index = 0;
+
+    group->running = false;
+    qemu_mutex_init(&group->task_queue_lock);
+    qemu_cond_init(&group->task_queue_cond);
+    QSIMPLEQ_INIT(&group->task_queue);
+
+    void *dsa_wq = MAP_FAILED;
+    for (int i = 0; i < num_dsa_devices; i++) {
+        dsa_wq = map_dsa_device(dsa_path[i]);
+        if (dsa_wq == MAP_FAILED) {
+            fprintf(stderr, "map_dsa_device failed MAP_FAILED, "
+                    "using simulation.\n");
+            ret = -1;
+            goto exit;
+        }
+        dsa_device_init(&group->dsa_devices[i], dsa_wq);
+    }
+
+exit:
+    g_free(local_dsa_parameter);
+    return ret;
+}
+
+/**
+ * @brief Starts a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_device_group_start(struct dsa_device_group *group)
+{
+    group->running = true;
+}
+
+/**
+ * @brief Stops a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+__attribute__((unused))
+static void
+dsa_device_group_stop(struct dsa_device_group *group)
+{
+    group->running = false;
+}
+
+/**
+ * @brief Cleans up a DSA device group.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_device_group_cleanup(struct dsa_device_group *group)
+{
+    if (!group->dsa_devices) {
+        return;
+    }
+    for (int i = 0; i < group->num_dsa_devices; i++) {
+        dsa_device_cleanup(&group->dsa_devices[i]);
+    }
+    g_free(group->dsa_devices);
+    group->dsa_devices = NULL;
+
+    qemu_mutex_destroy(&group->task_queue_lock);
+    qemu_cond_destroy(&group->task_queue_cond);
+}
+
+/**
+ * @brief Returns the next available DSA device in the group.
+ *
+ * @param group A pointer to the DSA device group.
+ *
+ * @return struct dsa_device* A pointer to the next available DSA device
+ *         in the group.
+ */
+__attribute__((unused))
+static struct dsa_device *
+dsa_device_group_get_next_device(struct dsa_device_group *group)
+{
+    if (group->num_dsa_devices == 0) {
+        return NULL;
+    }
+    uint32_t current = qatomic_fetch_inc(&group->index);
+    current %= group->num_dsa_devices;
+    return &group->dsa_devices[current];
+}
+
+/**
+ * @brief Check if DSA is running.
+ *
+ * @return True if DSA is running, otherwise false.
+ */
+bool dsa_is_running(void)
+{
+    return false;
+}
+
+static void
+dsa_globals_init(void)
+{
+    max_retry_count = UINT64_MAX;
+}
+
+/**
+ * @brief Initializes DSA devices.
+ *
+ * @param dsa_parameter A list of DSA device paths from the migration
+ *                      parameter.
+ * @return int Zero if successful, otherwise non-zero.
+ */
+int dsa_init(const char *dsa_parameter)
+{
+    dsa_globals_init();
+
+    return dsa_device_group_init(&dsa_group, dsa_parameter);
+}
+
+/**
+ * @brief Start logic to enable using DSA.
+ *
+ */
+void dsa_start(void)
+{
+    if (dsa_group.num_dsa_devices == 0) {
+        return;
+    }
+    if (dsa_group.running) {
+        return;
+    }
+    dsa_device_group_start(&dsa_group);
+}
+
+/**
+ * @brief Stop logic to clean up DSA by halting the device group and cleaning up
+ * the completion thread.
+ *
+ */
+void dsa_stop(void)
+{
+    struct dsa_device_group *group = &dsa_group;
+
+    if (!group->running) {
+        return;
+    }
+}
+
+/**
+ * @brief Clean up system resources created for DSA offloading.
+ *        This function is called during QEMU process teardown.
+ *
+ */
+void dsa_cleanup(void)
+{
+    dsa_stop();
+    dsa_device_group_cleanup(&dsa_group);
+}
+
+#else
+
+bool dsa_is_running(void)
+{
+    return false;
+}
+
+int dsa_init(const char *dsa_parameter)
+{
+    fprintf(stderr, "Intel Data Streaming Accelerator is not supported "
+                    "on this platform.\n");
+    return -1;
+}
+
+void dsa_start(void) {}
+
+void dsa_stop(void) {}
+
+void dsa_cleanup(void) {}
+
+#endif
+
diff --git a/util/meson.build b/util/meson.build
index c2322ef6e7..f7277c5e9b 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -85,6 +85,7 @@ if have_block or have_ga
 endif
 if have_block
   util_ss.add(files('aio-wait.c'))
+  util_ss.add(files('dsa.c'))
   util_ss.add(files('buffer.c'))
   util_ss.add(files('bufferiszero.c'))
   util_ss.add(files('hbitmap.c'))
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 08/20] util/dsa: Implement DSA task enqueue and dequeue.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (6 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 07/20] util/dsa: Implement DSA device start and stop logic Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-12-12 16:10   ` Fabiano Rosas
  2023-11-14  5:40 ` [PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model Hao Xiang
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

* Use a thread-safe queue for DSA task enqueue/dequeue.
* Implement DSA task submission.
* Implement DSA batch task submission.
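
The enqueue/dequeue pair follows a standard mutex + condition variable
pattern. Below is a condensed restatement of the producer side (names
match the functions added in this patch; this introduces no new
behavior):

    qemu_mutex_lock(&group->task_queue_lock);
    /* A 0->1 transition means the consumer may be sleeping. */
    bool notify = QSIMPLEQ_EMPTY(&group->task_queue);
    QSIMPLEQ_INSERT_TAIL(&group->task_queue, task, entry);
    if (notify) {
        qemu_cond_signal(&group->task_queue_cond);
    }
    qemu_mutex_unlock(&group->task_queue_lock);

The consumer side blocks in qemu_cond_wait() until the queue becomes
non-empty or the group stops.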

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 include/qemu/dsa.h |  35 ++++++++
 util/dsa.c         | 196 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 231 insertions(+)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 30246b507e..23f55185be 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -12,6 +12,41 @@
 #include <linux/idxd.h>
 #include "x86intrin.h"
 
+enum dsa_task_type {
+    DSA_TASK = 0,
+    DSA_BATCH_TASK
+};
+
+enum dsa_task_status {
+    DSA_TASK_READY = 0,
+    DSA_TASK_PROCESSING,
+    DSA_TASK_COMPLETION
+};
+
+typedef void (*buffer_zero_dsa_completion_fn)(void *);
+
+typedef struct buffer_zero_batch_task {
+    struct dsa_hw_desc batch_descriptor;
+    struct dsa_hw_desc *descriptors;
+    struct dsa_completion_record batch_completion __attribute__((aligned(32)));
+    struct dsa_completion_record *completions;
+    struct dsa_device_group *group;
+    struct dsa_device *device;
+    buffer_zero_dsa_completion_fn completion_callback;
+    QemuSemaphore sem_task_complete;
+    enum dsa_task_type task_type;
+    enum dsa_task_status status;
+    bool *results;
+    int batch_size;
+    QSIMPLEQ_ENTRY(buffer_zero_batch_task) entry;
+} buffer_zero_batch_task;
+
+#else
+
+struct buffer_zero_batch_task {
+    bool *results;
+};
+
 #endif
 
 /**
diff --git a/util/dsa.c b/util/dsa.c
index 8edaa892ec..f82282ce99 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -245,6 +245,200 @@ dsa_device_group_get_next_device(struct dsa_device_group *group)
     return &group->dsa_devices[current];
 }
 
+/**
+ * @brief Empties out the DSA task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_empty_task_queue(struct dsa_device_group *group)
+{
+    qemu_mutex_lock(&group->task_queue_lock);
+    dsa_task_queue *task_queue = &group->task_queue;
+    while (!QSIMPLEQ_EMPTY(task_queue)) {
+        QSIMPLEQ_REMOVE_HEAD(task_queue, entry);
+    }
+    qemu_mutex_unlock(&group->task_queue_lock);
+}
+
+/**
+ * @brief Adds a task to the DSA task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ * @param context A pointer to the DSA task to enqueue.
+ *
+ * @return int Zero if successful, otherwise a proper error code.
+ */
+static int
+dsa_task_enqueue(struct dsa_device_group *group,
+                 struct buffer_zero_batch_task *task)
+{
+    dsa_task_queue *task_queue = &group->task_queue;
+    QemuMutex *task_queue_lock = &group->task_queue_lock;
+    QemuCond *task_queue_cond = &group->task_queue_cond;
+
+    bool notify = false;
+
+    qemu_mutex_lock(task_queue_lock);
+
+    if (!group->running) {
+        fprintf(stderr, "DSA: Tried to queue task to stopped device queue\n");
+        qemu_mutex_unlock(task_queue_lock);
+        return -1;
+    }
+
+    /* The queue is empty. This enqueue operation is a 0->1 transition. */
+    if (QSIMPLEQ_EMPTY(task_queue)) {
+        notify = true;
+    }
+
+    QSIMPLEQ_INSERT_TAIL(task_queue, task, entry);
+
+    /* We need to notify the waiter for 0->1 transitions. */
+    if (notify) {
+        qemu_cond_signal(task_queue_cond);
+    }
+
+    qemu_mutex_unlock(task_queue_lock);
+
+    return 0;
+}
+
+/**
+ * @brief Takes a DSA task out of the task queue.
+ *
+ * @param group A pointer to the DSA device group.
+ * @return buffer_zero_batch_task* The DSA task being dequeued.
+ */
+__attribute__((unused))
+static struct buffer_zero_batch_task *
+dsa_task_dequeue(struct dsa_device_group *group)
+{
+    struct buffer_zero_batch_task *task = NULL;
+    dsa_task_queue *task_queue = &group->task_queue;
+    QemuMutex *task_queue_lock = &group->task_queue_lock;
+    QemuCond *task_queue_cond = &group->task_queue_cond;
+
+    qemu_mutex_lock(task_queue_lock);
+
+    while (true) {
+        if (!group->running) {
+            goto exit;
+        }
+        task = QSIMPLEQ_FIRST(task_queue);
+        if (task != NULL) {
+            break;
+        }
+        qemu_cond_wait(task_queue_cond, task_queue_lock);
+    }
+
+    QSIMPLEQ_REMOVE_HEAD(task_queue, entry);
+
+exit:
+    qemu_mutex_unlock(task_queue_lock);
+    return task;
+}
+
+/**
+ * @brief Submits a DSA work item to the device work queue.
+ *
+ * @param wq A pointer to the DSA work queue's device memory.
+ * @param descriptor A pointer to the DSA work item descriptor.
+ *
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int
+submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
+{
+    uint64_t retry = 0;
+
+    _mm_sfence();
+
+    while (true) {
+        if (_enqcmd(wq, descriptor) == 0) {
+            break;
+        }
+        retry++;
+        if (retry > max_retry_count) {
+            fprintf(stderr, "Submit work retry %lu times.\n", retry);
+            exit(1);
+        }
+    }
+
+    return 0;
+}
+
+/**
+ * @brief Synchronously submits a DSA work item to the
+ *        device work queue.
+ *
+ * @param wq A pointer to the DSA work queue's device memory.
+ * @param descriptor A pointer to the DSA work item descriptor.
+ *
+ * @return int Zero if successful, non-zero otherwise.
+ */
+__attribute__((unused))
+static int
+submit_wi(void *wq, struct dsa_hw_desc *descriptor)
+{
+    return submit_wi_int(wq, descriptor);
+}
+
+/**
+ * @brief Asynchronously submits a DSA work item to the
+ *        device work queue.
+ *
+ * @param task A pointer to the buffer zero task.
+ *
+ * @return int Zero if successful, non-zero otherwise.
+ */
+__attribute__((unused))
+static int
+submit_wi_async(struct buffer_zero_batch_task *task)
+{
+    struct dsa_device_group *device_group = task->group;
+    struct dsa_device *device_instance = task->device;
+    int ret;
+
+    assert(task->task_type == DSA_TASK);
+
+    task->status = DSA_TASK_PROCESSING;
+
+    ret = submit_wi_int(device_instance->work_queue,
+                        &task->descriptors[0]);
+    if (ret != 0) {
+        return ret;
+    }
+
+    return dsa_task_enqueue(device_group, task);
+}
+
+/**
+ * @brief Asynchronously submits a DSA batch work item to the
+ *        device work queue.
+ *
+ * @param batch_task A pointer to the batch buffer zero task.
+ *
+ * @return int Zero if successful, non-zero otherwise.
+ */
+__attribute__((unused))
+static int
+submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
+{
+    struct dsa_device_group *device_group = batch_task->group;
+    struct dsa_device *device_instance = batch_task->device;
+    int ret;
+
+    assert(batch_task->task_type == DSA_BATCH_TASK);
+    assert(batch_task->batch_descriptor.desc_count <= batch_task->batch_size);
+    assert(batch_task->status == DSA_TASK_READY);
+
+    batch_task->status = DSA_TASK_PROCESSING;
+
+    ret = submit_wi_int(device_instance->work_queue,
+                        &batch_task->batch_descriptor);
+    if (ret != 0) {
+        return ret;
+    }
+
+    return dsa_task_enqueue(device_group, batch_task);
+}
+
 /**
  * @brief Check if DSA is running.
  *
@@ -301,6 +495,8 @@ void dsa_stop(void)
     if (!group->running) {
         return;
     }
+
+    dsa_empty_task_queue(group);
 }
 
 /**
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (7 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 08/20] util/dsa: Implement DSA task enqueue and dequeue Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-12-12 19:36   ` Fabiano Rosas
  2023-12-18  3:11   ` Wang, Lei
  2023-11-14  5:40 ` [PATCH v2 10/20] util/dsa: Implement zero page checking in DSA task Hao Xiang
                   ` (11 subsequent siblings)
  20 siblings, 2 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

* Create a dedicated thread for DSA task completion.
* The DSA completion thread runs a loop and polls for completed tasks.
* Start and stop the DSA completion thread during DSA device start/stop.

A user space application can submit tasks directly to the Intel DSA
accelerator by writing to DSA's device memory (mapped into user space).
Once a task is submitted, the device starts processing it and writes
the completion status back to the task. A user space application can
poll the task's completion status to check for completion. This change
uses a dedicated thread to perform DSA task completion checking.
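
The core of that polling model, reduced to its essence (a simplified
sketch of what poll_completion() below does; it omits the retry bound
and the error statuses the real code handles):

    /* The device writes completion->status when the work item is done. */
    while (completion->status == DSA_COMP_NONE) {
        _mm_pause(); /* be polite to the sibling hyper-thread */
    }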

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 util/dsa.c | 243 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 242 insertions(+), 1 deletion(-)

diff --git a/util/dsa.c b/util/dsa.c
index f82282ce99..0e68013ffb 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -44,6 +44,7 @@
 
 #define DSA_WQ_SIZE 4096
 #define MAX_DSA_DEVICES 16
+#define DSA_COMPLETION_THREAD "dsa_completion"
 
 typedef QSIMPLEQ_HEAD(dsa_task_queue, buffer_zero_batch_task) dsa_task_queue;
 
@@ -61,8 +62,18 @@ struct dsa_device_group {
     dsa_task_queue task_queue;
 };
 
+struct dsa_completion_thread {
+    bool stopping;
+    bool running;
+    QemuThread thread;
+    int thread_id;
+    QemuSemaphore sem_init_done;
+    struct dsa_device_group *group;
+};
+
 uint64_t max_retry_count;
 static struct dsa_device_group dsa_group;
+static struct dsa_completion_thread completion_thread;
 
 
 /**
@@ -439,6 +450,234 @@ submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
     return dsa_task_enqueue(device_group, batch_task);
 }
 
+/**
+ * @brief Poll for the DSA work item completion.
+ *
+ * @param completion A pointer to the DSA work item completion record.
+ * @param opcode The DSA opcode.
+ *
+ * @return Zero if successful, non-zero otherwise.
+ */
+static int
+poll_completion(struct dsa_completion_record *completion,
+                enum dsa_opcode opcode)
+{
+    uint8_t status;
+    uint64_t retry = 0;
+
+    while (true) {
+        /* The DSA operation completes successfully or fails. */
+        status = completion->status;
+        if (status == DSA_COMP_SUCCESS ||
+            status == DSA_COMP_PAGE_FAULT_NOBOF ||
+            status == DSA_COMP_BATCH_PAGE_FAULT ||
+            status == DSA_COMP_BATCH_FAIL) {
+            break;
+        } else if (status != DSA_COMP_NONE) {
+            /* TODO: Error handling here on unexpected failure. */
+            fprintf(stderr, "DSA opcode %d failed with status = %d.\n",
+                    opcode, status);
+            exit(1);
+        }
+        retry++;
+        if (retry > max_retry_count) {
+            fprintf(stderr, "Wait for completion retry %lu times.\n", retry);
+            exit(1);
+        }
+        _mm_pause();
+    }
+
+    return 0;
+}
+
+/**
+ * @brief Complete a single DSA task in the batch task.
+ *
+ * @param task A pointer to the batch task structure.
+ */
+static void
+poll_task_completion(struct buffer_zero_batch_task *task)
+{
+    assert(task->task_type == DSA_TASK);
+
+    struct dsa_completion_record *completion = &task->completions[0];
+    uint8_t status;
+
+    poll_completion(completion, task->descriptors[0].opcode);
+
+    status = completion->status;
+    if (status == DSA_COMP_SUCCESS) {
+        task->results[0] = (completion->result == 0);
+        return;
+    }
+
+    assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
+}
+
+/**
+ * @brief Poll a batch task status until it completes. If DSA task doesn't
+ *        complete properly, use CPU to complete the task.
+ *
+ * @param batch_task A pointer to the DSA batch task.
+ */
+static void
+poll_batch_task_completion(struct buffer_zero_batch_task *batch_task)
+{
+    struct dsa_completion_record *batch_completion =
+        &batch_task->batch_completion;
+    struct dsa_completion_record *completion;
+    uint8_t batch_status;
+    uint8_t status;
+    bool *results = batch_task->results;
+    uint32_t count = batch_task->batch_descriptor.desc_count;
+
+    poll_completion(batch_completion,
+                    batch_task->batch_descriptor.opcode);
+
+    batch_status = batch_completion->status;
+
+    if (batch_status == DSA_COMP_SUCCESS) {
+        if (batch_completion->bytes_completed == count) {
+            /*
+             * Skip checking each descriptor's completion status if the
+             * batch descriptor says all of them succeeded.
+             */
+            for (int i = 0; i < count; i++) {
+                assert(batch_task->completions[i].status == DSA_COMP_SUCCESS);
+                results[i] = (batch_task->completions[i].result == 0);
+            }
+            return;
+        }
+    } else {
+        assert(batch_status == DSA_COMP_BATCH_FAIL ||
+            batch_status == DSA_COMP_BATCH_PAGE_FAULT);
+    }
+
+    for (int i = 0; i < count; i++) {
+        completion = &batch_task->completions[i];
+        status = completion->status;
+
+        if (status == DSA_COMP_SUCCESS) {
+            results[i] = (completion->result == 0);
+            continue;
+        }
+
+        if (status != DSA_COMP_PAGE_FAULT_NOBOF) {
+            fprintf(stderr,
+                    "Unexpected completion status = %u.\n", status);
+            assert(false);
+        }
+    }
+}
+
+/**
+ * @brief Handles an asynchronous DSA batch task completion.
+ *
+ * @param task A pointer to the batch buffer zero task structure.
+ */
+static void
+dsa_batch_task_complete(struct buffer_zero_batch_task *batch_task)
+{
+    batch_task->status = DSA_TASK_COMPLETION;
+    batch_task->completion_callback(batch_task);
+}
+
+/**
+ * @brief The function entry point called by a dedicated DSA
+ *        work item completion thread.
+ *
+ * @param opaque A pointer to the thread context.
+ *
+ * @return void* Not used.
+ */
+static void *
+dsa_completion_loop(void *opaque)
+{
+    struct dsa_completion_thread *thread_context =
+        (struct dsa_completion_thread *)opaque;
+    struct buffer_zero_batch_task *batch_task;
+    struct dsa_device_group *group = thread_context->group;
+
+    rcu_register_thread();
+
+    thread_context->thread_id = qemu_get_thread_id();
+    qemu_sem_post(&thread_context->sem_init_done);
+
+    while (thread_context->running) {
+        batch_task = dsa_task_dequeue(group);
+        assert(batch_task != NULL || !group->running);
+        if (!group->running) {
+            assert(!thread_context->running);
+            break;
+        }
+        if (batch_task->task_type == DSA_TASK) {
+            poll_task_completion(batch_task);
+        } else {
+            assert(batch_task->task_type == DSA_BATCH_TASK);
+            poll_batch_task_completion(batch_task);
+        }
+
+        dsa_batch_task_complete(batch_task);
+    }
+
+    rcu_unregister_thread();
+    return NULL;
+}
+
+/**
+ * @brief Initializes a DSA completion thread.
+ *
+ * @param completion_thread A pointer to the completion thread context.
+ * @param group A pointer to the DSA device group.
+ */
+static void
+dsa_completion_thread_init(
+    struct dsa_completion_thread *completion_thread,
+    struct dsa_device_group *group)
+{
+    completion_thread->stopping = false;
+    completion_thread->running = true;
+    completion_thread->thread_id = -1;
+    qemu_sem_init(&completion_thread->sem_init_done, 0);
+    completion_thread->group = group;
+
+    qemu_thread_create(&completion_thread->thread,
+                       DSA_COMPLETION_THREAD,
+                       dsa_completion_loop,
+                       completion_thread,
+                       QEMU_THREAD_JOINABLE);
+
+    /* Wait for initialization to complete */
+    while (completion_thread->thread_id == -1) {
+        qemu_sem_wait(&completion_thread->sem_init_done);
+    }
+}
+
+/**
+ * @brief Stops the completion thread (and implicitly, the device group).
+ *
+ * @param opaque A pointer to the completion thread.
+ */
+static void dsa_completion_thread_stop(void *opaque)
+{
+    struct dsa_completion_thread *thread_context =
+        (struct dsa_completion_thread *)opaque;
+
+    struct dsa_device_group *group = thread_context->group;
+
+    qemu_mutex_lock(&group->task_queue_lock);
+
+    thread_context->stopping = true;
+    thread_context->running = false;
+
+    dsa_device_group_stop(group);
+
+    qemu_cond_signal(&group->task_queue_cond);
+    qemu_mutex_unlock(&group->task_queue_lock);
+
+    qemu_thread_join(&thread_context->thread);
+
+    qemu_sem_destroy(&thread_context->sem_init_done);
+}
+
 /**
  * @brief Check if DSA is running.
  *
@@ -446,7 +685,7 @@ submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
  */
 bool dsa_is_running(void)
 {
-    return false;
+    return completion_thread.running;
 }
 
 static void
@@ -481,6 +720,7 @@ void dsa_start(void)
         return;
     }
     dsa_device_group_start(&dsa_group);
+    dsa_completion_thread_init(&completion_thread, &dsa_group);
 }
 
 /**
@@ -496,6 +736,7 @@ void dsa_stop(void)
         return;
     }
 
+    dsa_completion_thread_stop(&completion_thread);
     dsa_empty_task_queue(group);
 }
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 10/20] util/dsa: Implement zero page checking in DSA task.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (8 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 11/20] util/dsa: Implement DSA task asynchronous submission and wait for completion Hao Xiang
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

Create DSA tasks with operation code DSA_OPCODE_COMPVAL.
Here we create two types of DSA tasks: a single DSA task and
a batch DSA task. A batch DSA task reduces task submission overhead
and hence should be the default option. However, due to the way the
DSA hardware works, a DSA batch task must contain at least two
individual tasks. There are times we need to submit a single task,
and hence single DSA task submission is also required.
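
The dispatch rule this implies for a caller (an illustrative sketch;
the actual caller is wired up in a later patch):

    if (count == 1) {
        /* DSA rejects a batch with a single descriptor. */
        buffer_zero_dsa_async(task, buf[0], len);
    } else {
        buffer_zero_dsa_batch_async(task, buf, count, len);
    }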

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
---
 include/qemu/dsa.h |  16 +++
 util/dsa.c         | 252 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 247 insertions(+), 21 deletions(-)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 23f55185be..b10e7b8fb7 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -49,6 +49,22 @@ struct buffer_zero_batch_task {
 
 #endif
 
+/**
+ * @brief Initializes a buffer zero batch task.
+ *
+ * @param task A pointer to the batch task to initialize.
+ * @param batch_size The number of DSA tasks in the batch.
+ */
+void buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
+                                 int batch_size);
+
+/**
+ * @brief Performs the proper cleanup on a DSA batch task.
+ *
+ * @param task A pointer to the batch task to cleanup.
+ */
+void buffer_zero_batch_task_destroy(struct buffer_zero_batch_task *task);
+
 /**
  * @brief Initializes DSA devices.
  *
diff --git a/util/dsa.c b/util/dsa.c
index 0e68013ffb..3cc017b8a0 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -75,6 +75,7 @@ uint64_t max_retry_count;
 static struct dsa_device_group dsa_group;
 static struct dsa_completion_thread completion_thread;
 
+static void buffer_zero_dsa_completion(void *context);
 
 /**
  * @brief This function opens a DSA device's work queue and
@@ -208,7 +209,6 @@ dsa_device_group_start(struct dsa_device_group *group)
  *
  * @param group A pointer to the DSA device group.
  */
-__attribute__((unused))
 static void
 dsa_device_group_stop(struct dsa_device_group *group)
 {
@@ -244,7 +244,6 @@ dsa_device_group_cleanup(struct dsa_device_group *group)
  * @return struct dsa_device* A pointer to the next available DSA device
  *         in the group.
  */
-__attribute__((unused))
 static struct dsa_device *
 dsa_device_group_get_next_device(struct dsa_device_group *group)
 {
@@ -319,7 +318,6 @@ dsa_task_enqueue(struct dsa_device_group *group,
  * @param group A pointer to the DSA device group.
  * @return buffer_zero_batch_task* The DSA task being dequeued.
  */
-__attribute__((unused))
 static struct buffer_zero_batch_task *
 dsa_task_dequeue(struct dsa_device_group *group)
 {
@@ -376,22 +374,6 @@ submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
     return 0;
 }
 
-/**
- * @brief Synchronously submits a DSA work item to the
- *        device work queue.
- *
- * @param wq A pointer to the DSA work queue's device memory.
- * @param descriptor A pointer to the DSA work item descriptor.
- *
- * @return int Zero if successful, non-zero otherwise.
- */
-__attribute__((unused))
-static int
-submit_wi(void *wq, struct dsa_hw_desc *descriptor)
-{
-    return submit_wi_int(wq, descriptor);
-}
-
 /**
  * @brief Asynchronously submits a DSA work item to the
  *        device work queue.
@@ -400,7 +382,6 @@ submit_wi(void *wq, struct dsa_hw_desc *descriptor)
  *
  * @return int Zero if successful, non-zero otherwise.
  */
-__attribute__((unused))
 static int
 submit_wi_async(struct buffer_zero_batch_task *task)
 {
@@ -428,7 +409,6 @@ submit_wi_async(struct buffer_zero_batch_task *task)
  *
  * @return int Zero if successful, non-zero otherwise.
  */
-__attribute__((unused))
 static int
 submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
 {
@@ -678,6 +658,231 @@ static void dsa_completion_thread_stop(void *opaque)
     qemu_sem_destroy(&thread_context->sem_init_done);
 }
 
+/**
+ * @brief Initializes a buffer zero comparison DSA task.
+ *
+ * @param descriptor A pointer to the DSA task descriptor.
+ * @param completion A pointer to the DSA task completion record.
+ */
+static void
+buffer_zero_task_init_int(struct dsa_hw_desc *descriptor,
+                          struct dsa_completion_record *completion)
+{
+    descriptor->opcode = DSA_OPCODE_COMPVAL;
+    descriptor->flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
+    descriptor->comp_pattern = (uint64_t)0;
+    descriptor->completion_addr = (uint64_t)completion;
+}
+
+/**
+ * @brief Initializes a buffer zero batch task.
+ *
+ * @param task A pointer to the batch task to initialize.
+ * @param batch_size The number of DSA tasks in the batch.
+ */
+void
+buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
+                            int batch_size)
+{
+    int descriptors_size = sizeof(*task->descriptors) * batch_size;
+    memset(task, 0, sizeof(*task));
+
+    task->descriptors =
+        (struct dsa_hw_desc *)qemu_memalign(64, descriptors_size);
+    memset(task->descriptors, 0, descriptors_size);
+    task->completions = (struct dsa_completion_record *)qemu_memalign(
+        32, sizeof(*task->completions) * batch_size);
+    task->results = g_new0(bool, batch_size);
+    task->batch_size = batch_size;
+
+    task->batch_completion.status = DSA_COMP_NONE;
+    task->batch_descriptor.completion_addr = (uint64_t)&task->batch_completion;
+    /* TODO: Ensure that we never send a batch with count <= 1. */
+    task->batch_descriptor.desc_count = 0;
+    task->batch_descriptor.opcode = DSA_OPCODE_BATCH;
+    task->batch_descriptor.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV;
+    task->batch_descriptor.desc_list_addr = (uintptr_t)task->descriptors;
+    task->status = DSA_TASK_READY;
+    task->group = &dsa_group;
+    task->device = dsa_device_group_get_next_device(&dsa_group);
+
+    for (int i = 0; i < task->batch_size; i++) {
+        buffer_zero_task_init_int(&task->descriptors[i],
+                                  &task->completions[i]);
+    }
+
+    qemu_sem_init(&task->sem_task_complete, 0);
+    task->completion_callback = buffer_zero_dsa_completion;
+}
+
+/**
+ * @brief Performs the proper cleanup on a DSA batch task.
+ *
+ * @param task A pointer to the batch task to cleanup.
+ */
+void
+buffer_zero_batch_task_destroy(struct buffer_zero_batch_task *task)
+{
+    qemu_vfree(task->descriptors);
+    qemu_vfree(task->completions);
+    g_free(task->results);
+
+    qemu_sem_destroy(&task->sem_task_complete);
+}
+
+/**
+ * @brief Resets a buffer zero comparison DSA batch task.
+ *
+ * @param task A pointer to the batch task.
+ * @param count The number of DSA tasks this batch task will contain.
+ */
+static void
+buffer_zero_batch_task_reset(struct buffer_zero_batch_task *task, size_t count)
+{
+    task->batch_completion.status = DSA_COMP_NONE;
+    task->batch_descriptor.desc_count = count;
+    task->task_type = DSA_BATCH_TASK;
+    task->status = DSA_TASK_READY;
+}
+
+/**
+ * @brief Sets a buffer zero comparison DSA task.
+ *
+ * @param descriptor A pointer to the DSA task descriptor.
+ * @param buf A pointer to the memory buffer.
+ * @param len The length of the buffer.
+ */
+static void
+buffer_zero_task_set_int(struct dsa_hw_desc *descriptor,
+                         const void *buf,
+                         size_t len)
+{
+    struct dsa_completion_record *completion =
+        (struct dsa_completion_record *)descriptor->completion_addr;
+
+    descriptor->xfer_size = len;
+    descriptor->src_addr = (uintptr_t)buf;
+    completion->status = 0;
+    completion->result = 0;
+}
+
+/**
+ * @brief Resets a buffer zero comparison DSA batch task.
+ *
+ * @param task A pointer to the DSA batch task.
+ */
+static void
+buffer_zero_task_reset(struct buffer_zero_batch_task *task)
+{
+    task->completions[0].status = DSA_COMP_NONE;
+    task->task_type = DSA_TASK;
+    task->status = DSA_TASK_READY;
+}
+
+/**
+ * @brief Sets a buffer zero comparison DSA task.
+ *
+ * @param task A pointer to the DSA task.
+ * @param buf A pointer to the memory buffer.
+ * @param len The buffer length.
+ */
+static void
+buffer_zero_task_set(struct buffer_zero_batch_task *task,
+                     const void *buf,
+                     size_t len)
+{
+    buffer_zero_task_reset(task);
+    buffer_zero_task_set_int(&task->descriptors[0], buf, len);
+}
+
+/**
+ * @brief Sets a buffer zero comparison batch task.
+ *
+ * @param batch_task A pointer to the batch task.
+ * @param buf An array of memory buffers.
+ * @param count The number of buffers in the array.
+ * @param len The length of the buffers.
+ */
+static void
+buffer_zero_batch_task_set(struct buffer_zero_batch_task *batch_task,
+                           const void **buf, size_t count, size_t len)
+{
+    assert(count > 0);
+    assert(count <= batch_task->batch_size);
+
+    buffer_zero_batch_task_reset(batch_task, count);
+    for (int i = 0; i < count; i++) {
+        buffer_zero_task_set_int(&batch_task->descriptors[i], buf[i], len);
+    }
+}
+
+/**
+ * @brief Asynchronously performs a buffer zero DSA operation.
+ *
+ * @param task A pointer to the batch task structure.
+ * @param buf A pointer to the memory buffer.
+ * @param len The length of the memory buffer.
+ *
+ * @return int Zero if successful, otherwise an appropriate error code.
+ */
+__attribute__((unused))
+static int
+buffer_zero_dsa_async(struct buffer_zero_batch_task *task,
+                      const void *buf, size_t len)
+{
+    buffer_zero_task_set(task, buf, len);
+
+    return submit_wi_async(task);
+}
+
+/**
+ * @brief Sends a memory comparison batch task to a DSA device and waits
+ *        for completion.
+ *
+ * @param batch_task The batch task to be submitted to DSA device.
+ * @param buf An array of memory buffers to check for zero.
+ * @param count The number of buffers.
+ * @param len The buffer length.
+ */
+__attribute__((unused))
+static int
+buffer_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
+                            const void **buf, size_t count, size_t len)
+{
+    assert(count <= batch_task->batch_size);
+    buffer_zero_batch_task_set(batch_task, buf, count, len);
+
+    return submit_batch_wi_async(batch_task);
+}
+
+/**
+ * @brief The completion callback function for buffer zero
+ *        comparison DSA task completion.
+ *
+ * @param context A pointer to the callback context.
+ */
+static void
+buffer_zero_dsa_completion(void *context)
+{
+    assert(context != NULL);
+
+    struct buffer_zero_batch_task *task =
+        (struct buffer_zero_batch_task *)context;
+    qemu_sem_post(&task->sem_task_complete);
+}
+
+/**
+ * @brief Wait for the asynchronous DSA task to complete.
+ *
+ * @param batch_task A pointer to the buffer zero comparison batch task.
+ */
+__attribute__((unused))
+static void
+buffer_zero_dsa_wait(struct buffer_zero_batch_task *batch_task)
+{
+    qemu_sem_wait(&batch_task->sem_task_complete);
+}
+
 /**
  * @brief Check if DSA is running.
  *
@@ -753,6 +958,11 @@ void dsa_cleanup(void)
 
 #else
 
+void buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
+                                 int batch_size) {}
+
+void buffer_zero_batch_task_destroy(struct buffer_zero_batch_task *task) {}
+
 bool dsa_is_running(void)
 {
     return false;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 11/20] util/dsa: Implement DSA task asynchronous submission and wait for completion.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (9 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 10/20] util/dsa: Implement zero page checking in DSA task Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-12-13 14:01   ` Fabiano Rosas
  2023-11-14  5:40 ` [PATCH v2 12/20] migration/multifd: Add new migration option for multifd DSA offloading Hao Xiang
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

* Add a DSA task completion callback.
* The DSA completion thread will call the task's completion callback
on every task/batch task completion.
* The DSA submission path waits for task completion.
* Implement CPU fallback if DSA is not able to complete the task.
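
A hypothetical caller of the new public API, assuming a task that was
initialized with buffer_zero_batch_task_init() (sketch only):

    if (buffer_is_zero_dsa_batch_async(task, bufs, count, page_size) == 0) {
        for (size_t i = 0; i < count; i++) {
            if (task->results[i]) {
                /* page i is all zeroes */
            }
        }
    }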

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
---
 include/qemu/dsa.h |  14 +++++
 util/dsa.c         | 153 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 164 insertions(+), 3 deletions(-)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index b10e7b8fb7..3f8ee07004 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -65,6 +65,20 @@ void buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
  */
 void buffer_zero_batch_task_destroy(struct buffer_zero_batch_task *task);
 
+/**
+ * @brief Performs buffer zero comparison on a DSA batch task asynchronously.
+ *
+ * @param batch_task A pointer to the batch task.
+ * @param buf An array of memory buffers.
+ * @param count The number of buffers in the array.
+ * @param len The buffer length.
+ *
+ * @return Zero if successful, otherwise non-zero.
+ */
+int
+buffer_is_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
+                               const void **buf, size_t count, size_t len);
+
 /**
  * @brief Initializes DSA devices.
  *
diff --git a/util/dsa.c b/util/dsa.c
index 3cc017b8a0..06c6fbf2ca 100644
--- a/util/dsa.c
+++ b/util/dsa.c
@@ -470,6 +470,41 @@ poll_completion(struct dsa_completion_record *completion,
     return 0;
 }
 
+/**
+ * @brief Use CPU to complete a single zero page checking task.
+ *
+ * @param task A pointer to the task.
+ */
+static void
+task_cpu_fallback(struct buffer_zero_batch_task *task)
+{
+    assert(task->task_type == DSA_TASK);
+
+    struct dsa_completion_record *completion = &task->completions[0];
+    const uint8_t *buf;
+    size_t len;
+
+    if (completion->status == DSA_COMP_SUCCESS) {
+        return;
+    }
+
+    /*
+     * DSA was able to partially complete the operation. Check the
+     * result. If we already know this is not a zero page, we can
+     * return now.
+     */
+    if (completion->bytes_completed != 0 && completion->result != 0) {
+        task->results[0] = false;
+        return;
+    }
+
+    /* Let's fallback to use CPU to complete it. */
+    buf = (const uint8_t *)task->descriptors[0].src_addr;
+    len = task->descriptors[0].xfer_size;
+    task->results[0] = buffer_is_zero(buf + completion->bytes_completed,
+                                      len - completion->bytes_completed);
+}
+
 /**
  * @brief Complete a single DSA task in the batch task.
  *
@@ -548,6 +583,62 @@ poll_batch_task_completion(struct buffer_zero_batch_task *batch_task)
     }
 }
 
+/**
+ * @brief Use CPU to complete the zero page checking batch task.
+ *
+ * @param batch_task A pointer to the batch task.
+ */
+static void
+batch_task_cpu_fallback(struct buffer_zero_batch_task *batch_task)
+{
+    assert(batch_task->task_type == DSA_BATCH_TASK);
+
+    struct dsa_completion_record *batch_completion =
+        &batch_task->batch_completion;
+    struct dsa_completion_record *completion;
+    uint8_t status;
+    const uint8_t *buf;
+    size_t len;
+    bool *results = batch_task->results;
+    uint32_t count = batch_task->batch_descriptor.desc_count;
+
+    /* DSA was able to complete the entire batch task. */
+    if (batch_completion->status == DSA_COMP_SUCCESS) {
+        assert(count == batch_completion->bytes_completed);
+        return;
+    }
+
+    /*
+     * DSA encounters some error and is not able to complete
+     * the entire batch task. Use CPU fallback.
+     */
+    for (int i = 0; i < count; i++) {
+        completion = &batch_task->completions[i];
+        status = completion->status;
+        if (status == DSA_COMP_SUCCESS) {
+            continue;
+        }
+        assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
+
+        /*
+         * DSA was able to partially complete the operation. Check the
+         * result. If we already know this is not a zero page, we can
+         * return now.
+         */
+        if (completion->bytes_completed != 0 && completion->result != 0) {
+            results[i] = false;
+            continue;
+        }
+
+        /* Let's fallback to use CPU to complete it. */
+        buf = (uint8_t *)batch_task->descriptors[i].src_addr;
+        len = batch_task->descriptors[i].xfer_size;
+        results[i] =
+            buffer_is_zero(buf + completion->bytes_completed,
+                           len - completion->bytes_completed);
+    }
+}
+
 /**
  * @brief Handles an asynchronous DSA batch task completion.
  *
@@ -825,7 +916,6 @@ buffer_zero_batch_task_set(struct buffer_zero_batch_task *batch_task,
  *
  * @return int Zero if successful, otherwise an appropriate error code.
  */
-__attribute__((unused))
 static int
 buffer_zero_dsa_async(struct buffer_zero_batch_task *task,
                       const void *buf, size_t len)
@@ -844,7 +934,6 @@ buffer_zero_dsa_async(struct buffer_zero_batch_task *task,
  * @param count The number of buffers.
  * @param len The buffer length.
  */
-__attribute__((unused))
 static int
 buffer_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
                             const void **buf, size_t count, size_t len)
@@ -876,13 +965,29 @@ buffer_zero_dsa_completion(void *context)
  *
  * @param batch_task A pointer to the buffer zero comparison batch task.
  */
-__attribute__((unused))
 static void
 buffer_zero_dsa_wait(struct buffer_zero_batch_task *batch_task)
 {
     qemu_sem_wait(&batch_task->sem_task_complete);
 }
 
+/**
+ * @brief Use CPU to complete the zero page checking task if DSA
+ *        is not able to complete it.
+ *
+ * @param batch_task A pointer to the batch task.
+ */
+static void
+buffer_zero_cpu_fallback(struct buffer_zero_batch_task *batch_task)
+{
+    if (batch_task->task_type == DSA_TASK) {
+        task_cpu_fallback(batch_task);
+    } else {
+        assert(batch_task->task_type == DSA_BATCH_TASK);
+        batch_task_cpu_fallback(batch_task);
+    }
+}
+
 /**
  * @brief Check if DSA is running.
  *
@@ -956,6 +1061,41 @@ void dsa_cleanup(void)
     dsa_device_group_cleanup(&dsa_group);
 }
 
+/**
+ * @brief Performs buffer zero comparison on a DSA batch task asynchronously.
+ *
+ * @param batch_task A pointer to the batch task.
+ * @param buf An array of memory buffers.
+ * @param count The number of buffers in the array.
+ * @param len The buffer length.
+ *
+ * @return Zero if successful, otherwise non-zero.
+ */
+int
+buffer_is_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
+                               const void **buf, size_t count, size_t len)
+{
+    assert(batch_task != NULL);
+    assert(len != 0);
+    assert(buf != NULL);
+
+    if (count == 0 || count > batch_task->batch_size) {
+        return -1;
+    }
+
+    if (count == 1) {
+        /* DSA doesn't take a batch operation with only 1 task. */
+        buffer_zero_dsa_async(batch_task, buf[0], len);
+    } else {
+        buffer_zero_dsa_batch_async(batch_task, buf, count, len);
+    }
+
+    buffer_zero_dsa_wait(batch_task);
+    buffer_zero_cpu_fallback(batch_task);
+
+    return 0;
+}
+
 #else
 
 void buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
@@ -981,5 +1121,12 @@ void dsa_stop(void) {}
 
 void dsa_cleanup(void) {}
 
+int
+buffer_is_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
+                               const void **buf, size_t count, size_t len)
+{
+    exit(1);
+}
+
 #endif
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 12/20] migration/multifd: Add new migration option for multifd DSA offloading.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (10 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 11/20] util/dsa: Implement DSA task asynchronous submission and wait for completion Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-12-11 19:44   ` Fabiano Rosas
  2023-12-18  3:12   ` Wang, Lei
  2023-11-14  5:40 ` [PATCH v2 13/20] migration/multifd: Prepare to introduce DSA acceleration on the multifd path Hao Xiang
                   ` (8 subsequent siblings)
  20 siblings, 2 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

Intel DSA offloading is an optional feature that turns on if the
proper hardware and software stack is available. To turn on
DSA offloading in multifd live migration:

multifd-dsa-accel="[dsa_dev_path1] ] [dsa_dev_path2] ... [dsa_dev_pathX]"

This feature is turned off by default.
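
For example, via HMP (the work queue paths below are illustrative):

    (qemu) migrate_set_parameter multifd-dsa-accel "/dev/dsa/wq0.0 /dev/dsa/wq1.0"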

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 migration/migration-hmp-cmds.c |  8 ++++++++
 migration/options.c            | 28 ++++++++++++++++++++++++++++
 migration/options.h            |  1 +
 qapi/migration.json            | 17 ++++++++++++++---
 scripts/meson-buildoptions.sh  |  6 +++---
 5 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 86ae832176..d9451744dd 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -353,6 +353,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
         monitor_printf(mon, "%s: '%s'\n",
             MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
             params->tls_authz);
+        monitor_printf(mon, "%s: %s\n",
+            MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_DSA_ACCEL),
+            params->multifd_dsa_accel);
 
         if (params->has_block_bitmap_mapping) {
             const BitmapMigrationNodeAliasList *bmnal;
@@ -615,6 +618,11 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_block_incremental = true;
         visit_type_bool(v, param, &p->block_incremental, &err);
         break;
+    case MIGRATION_PARAMETER_MULTIFD_DSA_ACCEL:
+        p->multifd_dsa_accel = g_new0(StrOrNull, 1);
+        p->multifd_dsa_accel->type = QTYPE_QSTRING;
+        visit_type_str(v, param, &p->multifd_dsa_accel->u.s, &err);
+        break;
     case MIGRATION_PARAMETER_MULTIFD_CHANNELS:
         p->has_multifd_channels = true;
         visit_type_uint8(v, param, &p->multifd_channels, &err);
diff --git a/migration/options.c b/migration/options.c
index 97d121d4d7..6e424b5d63 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -179,6 +179,8 @@ Property migration_properties[] = {
     DEFINE_PROP_MIG_MODE("mode", MigrationState,
                       parameters.mode,
                       MIG_MODE_NORMAL),
+    DEFINE_PROP_STRING("multifd-dsa-accel", MigrationState,
+                       parameters.multifd_dsa_accel),
 
     /* Migration capabilities */
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
@@ -901,6 +903,13 @@ const char *migrate_tls_creds(void)
     return s->parameters.tls_creds;
 }
 
+const char *migrate_multifd_dsa_accel(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->parameters.multifd_dsa_accel;
+}
+
 const char *migrate_tls_hostname(void)
 {
     MigrationState *s = migrate_get_current();
@@ -1025,6 +1034,7 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->vcpu_dirty_limit = s->parameters.vcpu_dirty_limit;
     params->has_mode = true;
     params->mode = s->parameters.mode;
+    params->multifd_dsa_accel = g_strdup(s->parameters.multifd_dsa_accel);
 
     return params;
 }
@@ -1033,6 +1043,7 @@ void migrate_params_init(MigrationParameters *params)
 {
     params->tls_hostname = g_strdup("");
     params->tls_creds = g_strdup("");
+    params->multifd_dsa_accel = g_strdup("");
 
     /* Set has_* up only for parameter checks */
     params->has_compress_level = true;
@@ -1362,6 +1373,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_mode) {
         dest->mode = params->mode;
     }
+
+    if (params->multifd_dsa_accel) {
+        assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
+        dest->multifd_dsa_accel = params->multifd_dsa_accel->u.s;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1506,6 +1522,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_mode) {
         s->parameters.mode = params->mode;
     }
+
+    if (params->multifd_dsa_accel) {
+        g_free(s->parameters.multifd_dsa_accel);
+        assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
+        s->parameters.multifd_dsa_accel =
+            g_strdup(params->multifd_dsa_accel->u.s);
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
@@ -1531,6 +1553,12 @@ void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
         params->tls_authz->type = QTYPE_QSTRING;
         params->tls_authz->u.s = strdup("");
     }
+    if (params->multifd_dsa_accel
+        && params->multifd_dsa_accel->type == QTYPE_QNULL) {
+        qobject_unref(params->multifd_dsa_accel->u.n);
+        params->multifd_dsa_accel->type = QTYPE_QSTRING;
+        params->multifd_dsa_accel->u.s = strdup("");
+    }
 
     migrate_params_test_apply(params, &tmp);
 
diff --git a/migration/options.h b/migration/options.h
index c901eb57c6..56100961a9 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -94,6 +94,7 @@ const char *migrate_tls_authz(void);
 const char *migrate_tls_creds(void);
 const char *migrate_tls_hostname(void);
 uint64_t migrate_xbzrle_cache_size(void);
+const char *migrate_multifd_dsa_accel(void);
 
 /* parameters setters */
 
diff --git a/qapi/migration.json b/qapi/migration.json
index 9783289bfc..a8e3b66d6f 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -879,6 +879,9 @@
 # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
 #        (Since 8.2)
 #
+# @multifd-dsa-accel: If set, use the specified DSA accelerator devices
+#                     to offload certain memory operations. (Since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -902,7 +905,7 @@
            'cpu-throttle-initial', 'cpu-throttle-increment',
            'cpu-throttle-tailslow',
            'tls-creds', 'tls-hostname', 'tls-authz', 'max-bandwidth',
-           'avail-switchover-bandwidth', 'downtime-limit',
+           'avail-switchover-bandwidth', 'downtime-limit', 'multifd-dsa-accel',
            { 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
            { 'name': 'block-incremental', 'features': [ 'deprecated' ] },
            'multifd-channels',
@@ -1067,6 +1070,9 @@
 # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
 #        (Since 8.2)
 #
+# @multifd-dsa-accel: If set, use the specified DSA accelerator devices
+#                     to offload certain memory operations. (Since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -1120,7 +1126,8 @@
             '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
                                             'features': [ 'unstable' ] },
             '*vcpu-dirty-limit': 'uint64',
-            '*mode': 'MigMode'} }
+            '*mode': 'MigMode',
+            '*multifd-dsa-accel': 'StrOrNull'} }
 
 ##
 # @migrate-set-parameters:
@@ -1295,6 +1302,9 @@
 # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
 #        (Since 8.2)
 #
+# @multifd-dsa-accel: If set, use the specified DSA accelerator devices
+#                     to offload certain memory operations. (Since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -1345,7 +1355,8 @@
             '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
                                             'features': [ 'unstable' ] },
             '*vcpu-dirty-limit': 'uint64',
-            '*mode': 'MigMode'} }
+            '*mode': 'MigMode',
+            '*multifd-dsa-accel': 'str'} }
 
 ##
 # @query-migrate-parameters:
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index bf139e3fb4..35222ab63e 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -32,6 +32,7 @@ meson_options_help() {
   printf "%s\n" '  --enable-debug-stack-usage'
   printf "%s\n" '                           measure coroutine stack usage'
   printf "%s\n" '  --enable-debug-tcg       TCG debugging'
+  printf "%s\n" '  --enable-enqcmd          MENQCMD optimizations'
   printf "%s\n" '  --enable-fdt[=CHOICE]    Whether and how to find the libfdt library'
   printf "%s\n" '                           (choices: auto/disabled/enabled/internal/system)'
   printf "%s\n" '  --enable-fuzzing         build fuzzing targets'
@@ -93,7 +94,6 @@ meson_options_help() {
   printf "%s\n" '  avx2            AVX2 optimizations'
   printf "%s\n" '  avx512bw        AVX512BW optimizations'
   printf "%s\n" '  avx512f         AVX512F optimizations'
-  printf "%s\n" '  enqcmd          ENQCMD optimizations'
   printf "%s\n" '  blkio           libblkio block device driver'
   printf "%s\n" '  bochs           bochs image format support'
   printf "%s\n" '  bpf             eBPF support'
@@ -241,8 +241,6 @@ _meson_option_parse() {
     --disable-avx512bw) printf "%s" -Davx512bw=disabled ;;
     --enable-avx512f) printf "%s" -Davx512f=enabled ;;
     --disable-avx512f) printf "%s" -Davx512f=disabled ;;
-    --enable-enqcmd) printf "%s" -Denqcmd=true ;;
-    --disable-enqcmd) printf "%s" -Denqcmd=false ;;
     --enable-gcov) printf "%s" -Db_coverage=true ;;
     --disable-gcov) printf "%s" -Db_coverage=false ;;
     --enable-lto) printf "%s" -Db_lto=true ;;
@@ -309,6 +307,8 @@ _meson_option_parse() {
     --disable-docs) printf "%s" -Ddocs=disabled ;;
     --enable-dsound) printf "%s" -Ddsound=enabled ;;
     --disable-dsound) printf "%s" -Ddsound=disabled ;;
+    --enable-enqcmd) printf "%s" -Denqcmd=true ;;
+    --disable-enqcmd) printf "%s" -Denqcmd=false ;;
     --enable-fdt) printf "%s" -Dfdt=enabled ;;
     --disable-fdt) printf "%s" -Dfdt=disabled ;;
     --enable-fdt=*) quote_sh "-Dfdt=$2" ;;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 13/20] migration/multifd: Prepare to introduce DSA acceleration on the multifd path.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (11 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 12/20] migration/multifd: Add new migration option for multifd DSA offloading Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-12-18  3:20   ` Wang, Lei
  2023-11-14  5:40 ` [PATCH v2 14/20] migration/multifd: Enable DSA offloading in multifd sender path Hao Xiang
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

1. Refactor the multifd_send_thread function.
2. Implement buffer_is_zero_use_cpu to handle CPU-based zero page
checking.
3. Introduce the batch task structure in MultiFDSendParams.

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 migration/multifd.c | 82 ++++++++++++++++++++++++++++++++++++---------
 migration/multifd.h |  3 ++
 2 files changed, 70 insertions(+), 15 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 1198ffde9c..68ab97f918 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -14,6 +14,8 @@
 #include "qemu/cutils.h"
 #include "qemu/rcu.h"
 #include "qemu/cutils.h"
+#include "qemu/dsa.h"
+#include "qemu/memalign.h"
 #include "exec/target_page.h"
 #include "sysemu/sysemu.h"
 #include "exec/ramblock.h"
@@ -574,6 +576,11 @@ void multifd_save_cleanup(void)
         p->name = NULL;
         multifd_pages_clear(p->pages);
         p->pages = NULL;
+        g_free(p->addr);
+        p->addr = NULL;
+        buffer_zero_batch_task_destroy(p->batch_task);
+        qemu_vfree(p->batch_task);
+        p->batch_task = NULL;
         p->packet_len = 0;
         g_free(p->packet);
         p->packet = NULL;
@@ -678,13 +685,66 @@ int multifd_send_sync_main(QEMUFile *f)
     return 0;
 }
 
+static void set_page(MultiFDSendParams *p, bool zero_page, uint64_t offset)
+{
+    RAMBlock *rb = p->pages->block;
+    if (zero_page) {
+        p->zero[p->zero_num] = offset;
+        p->zero_num++;
+        ram_release_page(rb->idstr, offset);
+    } else {
+        p->normal[p->normal_num] = offset;
+        p->normal_num++;
+    }
+}
+
+static void buffer_is_zero_use_cpu(MultiFDSendParams *p)
+{
+    const void **buf = (const void **)p->addr;
+    assert(!migrate_use_main_zero_page());
+
+    for (int i = 0; i < p->pages->num; i++) {
+        p->batch_task->results[i] = buffer_is_zero(buf[i], p->page_size);
+    }
+}
+
+static void set_normal_pages(MultiFDSendParams *p)
+{
+    for (int i = 0; i < p->pages->num; i++) {
+        p->batch_task->results[i] = false;
+    }
+}
+
+static void multifd_zero_page_check(MultiFDSendParams *p)
+{
+    /* QEMU older than 8.2 doesn't understand zero pages on multifd channel */
+    bool use_multifd_zero_page = !migrate_use_main_zero_page();
+
+    RAMBlock *rb = p->pages->block;
+
+    for (int i = 0; i < p->pages->num; i++) {
+        p->addr[i] = (ram_addr_t)(rb->host + p->pages->offset[i]);
+    }
+
+    if (use_multifd_zero_page) {
+        buffer_is_zero_use_cpu(p);
+    } else {
+        // No zero page checking. All pages are normal pages.
+        set_normal_pages(p);
+    }
+
+    for (int i = 0; i < p->pages->num; i++) {
+        uint64_t offset = p->pages->offset[i];
+        bool zero_page = p->batch_task->results[i];
+        set_page(p, zero_page, offset);
+    }
+}
+
 static void *multifd_send_thread(void *opaque)
 {
     MultiFDSendParams *p = opaque;
     MigrationThread *thread = NULL;
     Error *local_err = NULL;
-    /* qemu older than 8.2 don't understand zero page on multifd channel */
-    bool use_multifd_zero_page = !migrate_use_main_zero_page();
     int ret = 0;
     bool use_zero_copy_send = migrate_zero_copy_send();
 
@@ -710,7 +770,6 @@ static void *multifd_send_thread(void *opaque)
         qemu_mutex_lock(&p->mutex);
 
         if (p->pending_job) {
-            RAMBlock *rb = p->pages->block;
             uint64_t packet_num = p->packet_num;
             uint32_t flags;
 
@@ -723,18 +782,7 @@ static void *multifd_send_thread(void *opaque)
                 p->iovs_num = 1;
             }
 
-            for (int i = 0; i < p->pages->num; i++) {
-                uint64_t offset = p->pages->offset[i];
-                if (use_multifd_zero_page &&
-                    buffer_is_zero(rb->host + offset, p->page_size)) {
-                    p->zero[p->zero_num] = offset;
-                    p->zero_num++;
-                    ram_release_page(rb->idstr, offset);
-                } else {
-                    p->normal[p->normal_num] = offset;
-                    p->normal_num++;
-                }
-            }
+            multifd_zero_page_check(p);
 
             if (p->normal_num) {
                 ret = multifd_send_state->ops->send_prepare(p, &local_err);
@@ -976,6 +1024,10 @@ int multifd_save_setup(Error **errp)
         p->pending_job = 0;
         p->id = i;
         p->pages = multifd_pages_init(page_count);
+        p->addr = g_new0(ram_addr_t, page_count);
+        p->batch_task =
+            (struct buffer_zero_batch_task *)qemu_memalign(64, sizeof(*p->batch_task));
+        buffer_zero_batch_task_init(p->batch_task, page_count);
         p->packet_len = sizeof(MultiFDPacket_t)
                       + sizeof(uint64_t) * page_count;
         p->packet = g_malloc0(p->packet_len);
diff --git a/migration/multifd.h b/migration/multifd.h
index 13762900d4..62f31b03c0 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -119,6 +119,9 @@ typedef struct {
      * pending_job != 0 -> multifd_channel can use it.
      */
     MultiFDPages_t *pages;
+    /* Address of each page in pages */
+    ram_addr_t *addr;
+    struct buffer_zero_batch_task *batch_task;
 
     /* thread local variables. No locking required */
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 14/20] migration/multifd: Enable DSA offloading in multifd sender path.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (12 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 13/20] migration/multifd: Prepare to introduce DSA acceleration on the multifd path Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 15/20] migration/multifd: Add test hook to set normal page ratio Hao Xiang
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

The multifd sender path gets an array of pages queued by the migration
thread and performs zero page checking on every page in the array.
Each page is classified as either a zero page or a normal page. This
change uses Intel DSA to offload the zero page checking from the CPU
to the DSA accelerator. The sender thread submits a batch of pages to
the DSA hardware and waits for the DSA completion thread to signal
that the work is done.
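
For example, assuming a user-accessible DSA work queue has been set up
at /dev/dsa/wq4.0 (the path the tests later in this series use), the
offload can be enabled on the source with:

migrate_set_parameter multifd-dsa-accel /dev/dsa/wq4.0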

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 migration/multifd.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 68ab97f918..2f635898ed 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -560,6 +560,8 @@ void multifd_save_cleanup(void)
             qemu_thread_join(&p->thread);
         }
     }
+    dsa_stop();
+    dsa_cleanup();
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
         Error *local_err = NULL;
@@ -702,6 +704,7 @@ static void buffer_is_zero_use_cpu(MultiFDSendParams *p)
 {
     const void **buf = (const void **)p->addr;
     assert(!migrate_use_main_zero_page());
+    assert(!dsa_is_running());
 
     for (int i = 0; i < p->pages->num; i++) {
         p->batch_task->results[i] = buffer_is_zero(buf[i], p->page_size);
@@ -710,15 +713,29 @@ static void buffer_is_zero_use_cpu(MultiFDSendParams *p)
 
 static void set_normal_pages(MultiFDSendParams *p)
 {
+    assert(migrate_use_main_zero_page());
+
     for (int i = 0; i < p->pages->num; i++) {
         p->batch_task->results[i] = false;
     }
 }
 
+static void buffer_is_zero_use_dsa(MultiFDSendParams *p)
+{
+    assert(!migrate_use_main_zero_page());
+    assert(dsa_is_running());
+
+    buffer_is_zero_dsa_batch_async(p->batch_task,
+                                   (const void **)p->addr,
+                                   p->pages->num,
+                                   p->page_size);
+}
+
 static void multifd_zero_page_check(MultiFDSendParams *p)
 {
     /* QEMU older than 8.2 doesn't understand zero pages on multifd channel */
     bool use_multifd_zero_page = !migrate_use_main_zero_page();
+    bool use_multifd_dsa_accel = dsa_is_running();
 
     RAMBlock *rb = p->pages->block;
 
@@ -726,7 +743,9 @@ static void multifd_zero_page_check(MultiFDSendParams *p)
         p->addr[i] = (ram_addr_t)(rb->host + p->pages->offset[i]);
     }
 
-    if (use_multifd_zero_page) {
+    if (use_multifd_dsa_accel && use_multifd_zero_page) {
+        buffer_is_zero_use_dsa(p);
+    } else if (use_multifd_zero_page) {
         buffer_is_zero_use_cpu(p);
     } else {
         // No zero page checking. All pages are normal pages.
@@ -1001,11 +1020,15 @@ int multifd_save_setup(Error **errp)
     int thread_count;
     uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
     uint8_t i;
+    const char *dsa_parameter = migrate_multifd_dsa_accel();
 
     if (!migrate_multifd()) {
         return 0;
     }
 
+    dsa_init(dsa_parameter);
+    dsa_start();
+
     thread_count = migrate_multifd_channels();
     multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
     multifd_send_state->params = g_new0(MultiFDSendParams, thread_count);
@@ -1061,6 +1084,7 @@ int multifd_save_setup(Error **errp)
             return ret;
         }
     }
+
     return 0;
 }
 
@@ -1138,6 +1162,8 @@ void multifd_load_cleanup(void)
 
         qemu_thread_join(&p->thread);
     }
+    dsa_stop();
+    dsa_cleanup();
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDRecvParams *p = &multifd_recv_state->params[i];
 
@@ -1272,6 +1298,7 @@ int multifd_load_setup(Error **errp)
     int thread_count;
     uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
     uint8_t i;
+    const char *dsa_parameter = migrate_multifd_dsa_accel();
 
     /*
      * Return successfully if multiFD recv state is already initialised
@@ -1281,6 +1308,9 @@ int multifd_load_setup(Error **errp)
         return 0;
     }
 
+    dsa_init(dsa_parameter);
+    dsa_start();
+
     thread_count = migrate_multifd_channels();
     multifd_recv_state = g_malloc0(sizeof(*multifd_recv_state));
     multifd_recv_state->params = g_new0(MultiFDRecvParams, thread_count);
@@ -1317,6 +1347,7 @@ int multifd_load_setup(Error **errp)
             return ret;
         }
     }
+
     return 0;
 }
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 15/20] migration/multifd: Add test hook to set normal page ratio.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (13 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 14/20] migration/multifd: Enable DSA offloading in multifd sender path Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 16/20] migration/multifd: Enable set normal page ratio test hook in multifd Hao Xiang
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

The multifd sender thread performs zero page checking. If a page is
a zero page, only the page's metadata is sent to the receiver. If a
page is a normal page, the entire page content is sent. This change
adds a test hook to set the normal page ratio: zero pages are forced
to be sent as normal pages until the configured ratio is met. This
is useful for live migration performance analysis and optimization.
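
For example, the following QMP command (an illustration based on the
schema added below) requests that at least half of the pages be sent
as normal pages:

{ "execute": "migrate-set-parameters",
  "arguments": { "multifd-normal-page-ratio": 50 } }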

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 migration/options.c | 31 +++++++++++++++++++++++++++++++
 migration/options.h |  1 +
 qapi/migration.json | 18 +++++++++++++++---
 3 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/migration/options.c b/migration/options.c
index 6e424b5d63..e7f1e2df24 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -79,6 +79,11 @@
 #define DEFAULT_MIGRATE_ANNOUNCE_ROUNDS    5
 #define DEFAULT_MIGRATE_ANNOUNCE_STEP    100
 
+/*
+ * Parameter for multifd normal page test hook.
+ */
+#define DEFAULT_MIGRATE_MULTIFD_NORMAL_PAGE_RATIO 101
+
 #define DEFINE_PROP_MIG_CAP(name, x)             \
     DEFINE_PROP_BOOL(name, MigrationState, capabilities[x], false)
 
@@ -181,6 +186,9 @@ Property migration_properties[] = {
                       MIG_MODE_NORMAL),
     DEFINE_PROP_STRING("multifd-dsa-accel", MigrationState,
                        parameters.multifd_dsa_accel),
+    DEFINE_PROP_UINT8("multifd-normal-page-ratio", MigrationState,
+                      parameters.multifd_normal_page_ratio,
+                      DEFAULT_MIGRATE_MULTIFD_NORMAL_PAGE_RATIO),
 
     /* Migration capabilities */
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
@@ -860,6 +868,12 @@ int migrate_multifd_channels(void)
     return s->parameters.multifd_channels;
 }
 
+uint8_t migrate_multifd_normal_page_ratio(void)
+{
+    MigrationState *s = migrate_get_current();
+    return s->parameters.multifd_normal_page_ratio;
+}
+
 MultiFDCompression migrate_multifd_compression(void)
 {
     MigrationState *s = migrate_get_current();
@@ -1258,6 +1272,14 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
         return false;
     }
 
+    if (params->has_multifd_normal_page_ratio &&
+        params->multifd_normal_page_ratio > 100) {
+        error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
+                   "multifd_normal_page_ratio",
+                   "a value between 0 and 100");
+        return false;
+    }
+
     return true;
 }
 
@@ -1378,6 +1400,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
         assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
         dest->multifd_dsa_accel = params->multifd_dsa_accel->u.s;
     }
+
+    if (params->has_multifd_normal_page_ratio) {
+        dest->has_multifd_normal_page_ratio = true;
+        dest->multifd_normal_page_ratio = params->multifd_normal_page_ratio;
+    }
 }
 
 static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
@@ -1528,6 +1555,10 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
         assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
         s->parameters.multifd_dsa_accel = g_strdup(params->multifd_dsa_accel->u.s);
     }
+
+    if (params->has_multifd_normal_page_ratio) {
+        s->parameters.multifd_normal_page_ratio = params->multifd_normal_page_ratio;
+    }
 }
 
 void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
diff --git a/migration/options.h b/migration/options.h
index 56100961a9..21e3e7b0cf 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -95,6 +95,7 @@ const char *migrate_tls_creds(void);
 const char *migrate_tls_hostname(void);
 uint64_t migrate_xbzrle_cache_size(void);
 const char *migrate_multifd_dsa_accel(void);
+uint8_t migrate_multifd_normal_page_ratio(void);
 
 /* parameters setters */
 
diff --git a/qapi/migration.json b/qapi/migration.json
index a8e3b66d6f..bb876c8325 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -882,6 +882,9 @@
 # @multifd-dsa-accel: If set, use DSA accelerator offloading for
 #                     certain memory operations. (Since 8.2)
 #
+# @multifd-normal-page-ratio: Test hook setting the normal page ratio.
+#     (Since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -915,7 +918,8 @@
            'block-bitmap-mapping',
            { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
            'vcpu-dirty-limit',
-           'mode'] }
+           'mode',
+           'multifd-normal-page-ratio'] }
 
 ##
 # @MigrateSetParameters:
@@ -1073,6 +1077,9 @@
 # @multifd-dsa-accel: If set, use DSA accelerator offloading for
 #                     certain memory operations. (Since 8.2)
 #
+# @multifd-normal-page-ratio: Test hook setting the normal page ratio.
+#     (Since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -1127,7 +1134,8 @@
                                             'features': [ 'unstable' ] },
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
-            '*multifd-dsa-accel': 'StrOrNull'} }
+            '*multifd-dsa-accel': 'StrOrNull',
+            '*multifd-normal-page-ratio': 'uint8'} }
 
 ##
 # @migrate-set-parameters:
@@ -1305,6 +1313,9 @@
 # @multifd-dsa-accel: If set, use DSA accelerator offloading for
 #                     certain memory operations. (Since 8.2)
 #
+# @multifd-normal-page-ratio: Test hook setting the normal page ratio.
+#     (Since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -1356,7 +1367,8 @@
                                             'features': [ 'unstable' ] },
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
-            '*multifd-dsa-accel': 'str'} }
+            '*multifd-dsa-accel': 'str',
+            '*multifd-normal-page-ratio': 'uint8'} }
 
 ##
 # @query-migrate-parameters:
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 16/20] migration/multifd: Enable set normal page ratio test hook in multifd.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (14 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 15/20] migration/multifd: Add test hook to set normal page ratio Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 17/20] migration/multifd: Add migration option set packet size Hao Xiang
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

The test hook is disabled by default. To enable it, set a normal page
ratio between 0 and 100. If the ratio is set to 50, at least 50% of
all pages are sent as normal pages: the first 50 pages in every
100-page window are sent as normal pages even if their contents are
zero.

Set the option:
migrate_set_parameter multifd-normal-page-ratio 60

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 include/qemu/dsa.h             |  7 ++++++-
 migration/migration-hmp-cmds.c |  7 +++++++
 migration/multifd.c            | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
index 3f8ee07004..bc7f652e0b 100644
--- a/include/qemu/dsa.h
+++ b/include/qemu/dsa.h
@@ -37,7 +37,10 @@ typedef struct buffer_zero_batch_task {
     enum dsa_task_type task_type;
     enum dsa_task_status status;
     bool *results;
-    int batch_size;
+    uint32_t batch_size;
+    // Set normal page ratio test hook.
+    uint32_t normal_page_index;
+    uint32_t normal_page_counter;
     QSIMPLEQ_ENTRY(buffer_zero_batch_task) entry;
 } buffer_zero_batch_task;
 
@@ -45,6 +48,8 @@ typedef struct buffer_zero_batch_task {
 
 struct buffer_zero_batch_task {
     bool *results;
+    uint32_t normal_page_index;
+    uint32_t normal_page_counter;
 };
 
 #endif
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index d9451744dd..788ce699ac 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -356,6 +356,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
         monitor_printf(mon, "%s: %s\n",
             MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_DSA_ACCEL),
             params->multifd_dsa_accel);
+        monitor_printf(mon, "%s: %u\n",
+            MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_NORMAL_PAGE_RATIO),
+            params->multifd_normal_page_ratio);
 
         if (params->has_block_bitmap_mapping) {
             const BitmapMigrationNodeAliasList *bmnal;
@@ -675,6 +678,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         error_setg(&err, "The block-bitmap-mapping parameter can only be set "
                    "through QMP");
         break;
+    case MIGRATION_PARAMETER_MULTIFD_NORMAL_PAGE_RATIO:
+        p->has_multifd_normal_page_ratio = true;
+        visit_type_uint8(v, param, &p->multifd_normal_page_ratio, &err);
+        break;
     case MIGRATION_PARAMETER_X_VCPU_DIRTY_LIMIT_PERIOD:
         p->has_x_vcpu_dirty_limit_period = true;
         visit_type_size(v, param, &p->x_vcpu_dirty_limit_period, &err);
diff --git a/migration/multifd.c b/migration/multifd.c
index 2f635898ed..c9f9eef5b1 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -687,6 +687,37 @@ int multifd_send_sync_main(QEMUFile *f)
     return 0;
 }
 
+static void multifd_normal_page_test_hook(MultiFDSendParams *p)
+{
+    /*
+     * The value is between 0 and 100. If the value is 10, it means at
+     * least 10% of the pages are normal pages. A zero page can be made
+     * a normal page but not the other way around.
+     */
+    uint8_t multifd_normal_page_ratio =
+        migrate_multifd_normal_page_ratio();
+    struct buffer_zero_batch_task *batch_task = p->batch_task;
+
+    // The normal page ratio test hook is disabled.
+    if (multifd_normal_page_ratio > 100) {
+        return;
+    }
+
+    for (int i = 0; i < p->pages->num; i++) {
+        if (batch_task->normal_page_counter < multifd_normal_page_ratio) {
+            // Turn a zero page into a normal page.
+            batch_task->results[i] = false;
+        }
+        batch_task->normal_page_index++;
+        batch_task->normal_page_counter++;
+
+        if (batch_task->normal_page_index >= 100) {
+            batch_task->normal_page_index = 0;
+            batch_task->normal_page_counter = 0;
+        }
+    }
+}
+
 static void set_page(MultiFDSendParams *p, bool zero_page, uint64_t offset)
 {
     RAMBlock *rb = p->pages->block;
@@ -752,6 +783,8 @@ static void multifd_zero_page_check(MultiFDSendParams *p)
         set_normal_pages(p);
     }
 
+    multifd_normal_page_test_hook(p);
+
     for (int i = 0; i < p->pages->num; i++) {
         uint64_t offset = p->pages->offset[i];
         bool zero_page = p->batch_task->results[i];
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 17/20] migration/multifd: Add migration option set packet size.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (15 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 16/20] migration/multifd: Enable set normal page ratio test hook in multifd Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 18/20] migration/multifd: Enable set packet size migration option Hao Xiang
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

The current multifd packet size is 128 * 4 KB. This change adds an
option to set the packet size, expressed as a page count. Both the
sender and the receiver need to set the same packet size for
migration to work.
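
As an illustration of the new option (using the QAPI schema added
below), a packet size of 512 pages can be set via QMP:

{ "execute": "migrate-set-parameters",
  "arguments": { "multifd-packet-size": 512 } }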

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 migration/options.c | 34 ++++++++++++++++++++++++++++++++++
 migration/options.h |  1 +
 qapi/migration.json | 21 ++++++++++++++++++---
 3 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/migration/options.c b/migration/options.c
index e7f1e2df24..81f1bf25d4 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -84,6 +84,12 @@
  */
 #define DEFAULT_MIGRATE_MULTIFD_NORMAL_PAGE_RATIO 101
 
+/*
+ * Parameter for multifd packet size.
+ */
+#define DEFAULT_MIGRATE_MULTIFD_PACKET_SIZE 128
+#define MAX_MIGRATE_MULTIFD_PACKET_SIZE 1024
+
 #define DEFINE_PROP_MIG_CAP(name, x)             \
     DEFINE_PROP_BOOL(name, MigrationState, capabilities[x], false)
 
@@ -189,6 +195,9 @@ Property migration_properties[] = {
     DEFINE_PROP_UINT8("multifd-normal-page-ratio", MigrationState,
                       parameters.multifd_normal_page_ratio,
                       DEFAULT_MIGRATE_MULTIFD_NORMAL_PAGE_RATIO),
+    DEFINE_PROP_SIZE("multifd-packet-size", MigrationState,
+                     parameters.multifd_packet_size,
+                     DEFAULT_MIGRATE_MULTIFD_PACKET_SIZE),
 
     /* Migration capabilities */
     DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
@@ -874,6 +883,13 @@ uint8_t migrate_multifd_normal_page_ratio(void)
     return s->parameters.multifd_normal_page_ratio;
 }
 
+uint64_t migrate_multifd_packet_size(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->parameters.multifd_packet_size;
+}
+
 MultiFDCompression migrate_multifd_compression(void)
 {
     MigrationState *s = migrate_get_current();
@@ -1012,6 +1028,8 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->x_checkpoint_delay = s->parameters.x_checkpoint_delay;
     params->has_block_incremental = true;
     params->block_incremental = s->parameters.block_incremental;
+    params->has_multifd_packet_size = true;
+    params->multifd_packet_size = s->parameters.multifd_packet_size;
     params->has_multifd_channels = true;
     params->multifd_channels = s->parameters.multifd_channels;
     params->has_multifd_compression = true;
@@ -1072,6 +1090,7 @@ void migrate_params_init(MigrationParameters *params)
     params->has_downtime_limit = true;
     params->has_x_checkpoint_delay = true;
     params->has_block_incremental = true;
+    params->has_multifd_packet_size = true;
     params->has_multifd_channels = true;
     params->has_multifd_compression = true;
     params->has_multifd_zlib_level = true;
@@ -1170,6 +1189,15 @@ bool migrate_params_check(MigrationParameters *params, Error **errp)
 
     /* x_checkpoint_delay is now always positive */
 
+    if (params->has_multifd_packet_size &&
+        ((params->multifd_packet_size < DEFAULT_MIGRATE_MULTIFD_PACKET_SIZE) ||
+            (params->multifd_packet_size > MAX_MIGRATE_MULTIFD_PACKET_SIZE))) {
+        error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
+                    "multifd_packet_size",
+                    "a value between 128 and 1024");
+        return false;
+    }
+
     if (params->has_multifd_channels && (params->multifd_channels < 1)) {
         error_setg(errp, QERR_INVALID_PARAMETER_VALUE,
                    "multifd_channels",
@@ -1351,6 +1379,9 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_block_incremental) {
         dest->block_incremental = params->block_incremental;
     }
+    if (params->has_multifd_packet_size) {
+        dest->multifd_packet_size = params->multifd_packet_size;
+    }
     if (params->has_multifd_channels) {
         dest->multifd_channels = params->multifd_channels;
     }
@@ -1496,6 +1527,9 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
                     " use blockdev-mirror with NBD instead");
         s->parameters.block_incremental = params->block_incremental;
     }
+    if (params->has_multifd_packet_size) {
+        s->parameters.multifd_packet_size = params->multifd_packet_size;
+    }
     if (params->has_multifd_channels) {
         s->parameters.multifd_channels = params->multifd_channels;
     }
diff --git a/migration/options.h b/migration/options.h
index 21e3e7b0cf..5816f6dac2 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -96,6 +96,7 @@ const char *migrate_tls_hostname(void);
 uint64_t migrate_xbzrle_cache_size(void);
 const char *migrate_multifd_dsa_accel(void);
 uint8_t migrate_multifd_normal_page_ratio(void);
+uint64_t migrate_multifd_packet_size(void);
 
 /* parameters setters */
 
diff --git a/qapi/migration.json b/qapi/migration.json
index bb876c8325..f87daddf33 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -885,6 +885,10 @@
 # @multifd-normal-page-ratio: Test hook setting the normal page ratio.
 #     (Since 8.2)
 #
+# @multifd-packet-size: Packet size used to migrate data. This value
+#     indicates the number of pages in a packet. The default value
+#     is 128 and max value is 1024. (Since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -919,7 +923,8 @@
            { 'name': 'x-vcpu-dirty-limit-period', 'features': ['unstable'] },
            'vcpu-dirty-limit',
            'mode',
-           'multifd-normal-page-ratio'] }
+           'multifd-normal-page-ratio',
+           'multifd-packet-size'] }
 
 ##
 # @MigrateSetParameters:
@@ -1080,6 +1085,10 @@
 # @multifd-normal-page-ratio: Test hook setting the normal page ratio.
 #     (Since 8.2)
 #
+# @multifd-packet-size: Packet size used to migrate data. This value
+#     indicates the number of pages in a packet. The default value
+#     is 128 and max value is 1024. (Since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -1135,7 +1144,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*multifd-dsa-accel': 'StrOrNull',
-            '*multifd-normal-page-ratio': 'uint8'} }
+            '*multifd-normal-page-ratio': 'uint8',
+            '*multifd-packet-size' : 'uint64'} }
 
 ##
 # @migrate-set-parameters:
@@ -1316,6 +1326,10 @@
 # @multifd-normal-page-ratio: Test hook setting the normal page ratio.
 #     (Since 8.2)
 #
+# @multifd-packet-size: Packet size used to migrate data. This value
+#     indicates the number of pages in a packet. The default value
+#     is 128 and max value is 1024. (Since 8.2)
+#
 # Features:
 #
 # @deprecated: Member @block-incremental is deprecated.  Use
@@ -1368,7 +1382,8 @@
             '*vcpu-dirty-limit': 'uint64',
             '*mode': 'MigMode',
             '*multifd-dsa-accel': 'str',
-            '*multifd-normal-page-ratio': 'uint8'} }
+            '*multifd-normal-page-ratio': 'uint8',
+            '*multifd-packet-size': 'uint64'} }
 
 ##
 # @query-migrate-parameters:
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 18/20] migration/multifd: Enable set packet size migration option.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (16 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 17/20] migration/multifd: Add migration option set packet size Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-12-13 17:33   ` Fabiano Rosas
  2023-11-14  5:40 ` [PATCH v2 19/20] util/dsa: Add unit test coverage for Intel DSA task submission and completion Hao Xiang
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

During live migration, if the latency between sender and receiver
is high and the bandwidth is also high (a long fat pipe), using a
bigger packet size can help reduce the total migration time. In
addition, Intel DSA offloading performs better with large batch
tasks. Providing an option to set the packet size is useful for
performance tuning.

Set the option:
migrate_set_parameter multifd-packet-size 512
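
With a 4 KB target page size, a packet size of 512 pages corresponds
to 2 MB of page data per packet, four times the previous fixed
512 KB packet size.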

Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 migration/migration-hmp-cmds.c | 7 +++++++
 migration/multifd-zlib.c       | 8 ++++++--
 migration/multifd-zstd.c       | 8 ++++++--
 migration/multifd.c            | 4 ++--
 migration/multifd.h            | 3 ---
 5 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 788ce699ac..2d0c71294c 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -338,6 +338,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
         monitor_printf(mon, "%s: %s\n",
             MigrationParameter_str(MIGRATION_PARAMETER_BLOCK_INCREMENTAL),
             params->block_incremental ? "on" : "off");
+        monitor_printf(mon, "%s: %" PRIu64 "\n",
+            MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_PACKET_SIZE),
+            params->multifd_packet_size);
         monitor_printf(mon, "%s: %u\n",
             MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_CHANNELS),
             params->multifd_channels);
@@ -626,6 +629,10 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->multifd_dsa_accel->type = QTYPE_QSTRING;
         visit_type_str(v, param, &p->multifd_dsa_accel->u.s, &err);
         break;
+    case MIGRATION_PARAMETER_MULTIFD_PACKET_SIZE:
+        p->has_multifd_packet_size = true;
+        visit_type_size(v, param, &p->multifd_packet_size, &err);
+        break;
     case MIGRATION_PARAMETER_MULTIFD_CHANNELS:
         p->has_multifd_channels = true;
         visit_type_uint8(v, param, &p->multifd_channels, &err);
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 37ce48621e..453c85d725 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -49,6 +49,8 @@ static int zlib_send_setup(MultiFDSendParams *p, Error **errp)
     struct zlib_data *z = g_new0(struct zlib_data, 1);
     z_stream *zs = &z->zs;
     const char *err_msg;
+    uint64_t multifd_packet_size =
+        migrate_multifd_packet_size() * qemu_target_page_size();
 
     zs->zalloc = Z_NULL;
     zs->zfree = Z_NULL;
@@ -58,7 +60,7 @@ static int zlib_send_setup(MultiFDSendParams *p, Error **errp)
         goto err_free_z;
     }
     /* This is the maximum size of the compressed buffer */
-    z->zbuff_len = compressBound(MULTIFD_PACKET_SIZE);
+    z->zbuff_len = compressBound(multifd_packet_size);
     z->zbuff = g_try_malloc(z->zbuff_len);
     if (!z->zbuff) {
         err_msg = "out of memory for zbuff";
@@ -186,6 +188,8 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
  */
 static int zlib_recv_setup(MultiFDRecvParams *p, Error **errp)
 {
+    uint64_t multifd_packet_size =
+        migrate_multifd_packet_size() * qemu_target_page_size();
     struct zlib_data *z = g_new0(struct zlib_data, 1);
     z_stream *zs = &z->zs;
 
@@ -200,7 +204,7 @@ static int zlib_recv_setup(MultiFDRecvParams *p, Error **errp)
         return -1;
     }
     /* To be safe, we reserve twice the size of the packet */
-    z->zbuff_len = MULTIFD_PACKET_SIZE * 2;
+    z->zbuff_len = multifd_packet_size * 2;
     z->zbuff = g_try_malloc(z->zbuff_len);
     if (!z->zbuff) {
         inflateEnd(zs);
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index b471daadcd..60298861d6 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -49,6 +49,8 @@ struct zstd_data {
  */
 static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
 {
+    uint64_t multifd_packet_size =
+        migrate_multifd_packet_size() * qemu_target_page_size();
     struct zstd_data *z = g_new0(struct zstd_data, 1);
     int res;
 
@@ -69,7 +71,7 @@ static int zstd_send_setup(MultiFDSendParams *p, Error **errp)
         return -1;
     }
     /* This is the maximum size of the compressed buffer */
-    z->zbuff_len = ZSTD_compressBound(MULTIFD_PACKET_SIZE);
+    z->zbuff_len = ZSTD_compressBound(multifd_packet_size);
     z->zbuff = g_try_malloc(z->zbuff_len);
     if (!z->zbuff) {
         ZSTD_freeCStream(z->zcs);
@@ -175,6 +177,8 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
  */
 static int zstd_recv_setup(MultiFDRecvParams *p, Error **errp)
 {
+    uint64_t multifd_packet_size =
+        migrate_multifd_packet_size() * qemu_target_page_size();
     struct zstd_data *z = g_new0(struct zstd_data, 1);
     int ret;
 
@@ -196,7 +200,7 @@ static int zstd_recv_setup(MultiFDRecvParams *p, Error **errp)
     }
 
     /* To be safe, we reserve twice the size of the packet */
-    z->zbuff_len = MULTIFD_PACKET_SIZE * 2;
+    z->zbuff_len = multifd_packet_size * 2;
     z->zbuff = g_try_malloc(z->zbuff_len);
     if (!z->zbuff) {
         ZSTD_freeDStream(z->zds);
diff --git a/migration/multifd.c b/migration/multifd.c
index c9f9eef5b1..fbe8bbcc5c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1051,7 +1051,7 @@ static void multifd_new_send_channel_create(gpointer opaque)
 int multifd_save_setup(Error **errp)
 {
     int thread_count;
-    uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
+    uint32_t page_count = migrate_multifd_packet_size();
     uint8_t i;
     const char *dsa_parameter = migrate_multifd_dsa_accel();
 
@@ -1329,7 +1329,7 @@ static void *multifd_recv_thread(void *opaque)
 int multifd_load_setup(Error **errp)
 {
     int thread_count;
-    uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
+    uint32_t page_count = migrate_multifd_packet_size();
     uint8_t i;
     const char *dsa_parameter = migrate_multifd_dsa_accel();
 
diff --git a/migration/multifd.h b/migration/multifd.h
index 62f31b03c0..173c3f4171 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -34,9 +34,6 @@ int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
 #define MULTIFD_FLAG_ZLIB (1 << 1)
 #define MULTIFD_FLAG_ZSTD (2 << 1)
 
-/* This value needs to be a multiple of qemu_target_page_size() */
-#define MULTIFD_PACKET_SIZE (512 * 1024)
-
 typedef struct {
     uint32_t magic;
     uint32_t version;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 19/20] util/dsa: Add unit test coverage for Intel DSA task submission and completion.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (17 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 18/20] migration/multifd: Enable set packet size migration option Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-11-14  5:40 ` [PATCH v2 20/20] migration/multifd: Add integration tests for multifd with Intel DSA offloading Hao Xiang
  2023-11-15 17:43 ` [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Elena Ufimtseva
  20 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

* Test DSA start and stop path.
* Test DSA configure and cleanup path.
* Test DSA task submission and completion path.
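
These tests assume user-openable DSA work queues already exist at the
hardcoded paths below (/dev/dsa/wq4.0 and /dev/dsa/wq4.1). A sketch of
one possible setup using the accel-config tool follows; the device and
queue names are placeholders and the exact options depend on the
platform configuration:

  accel-config config-engine dsa4/engine4.0 --group-id=0
  accel-config config-wq dsa4/wq4.0 --group-id=0 --type=user \
      --mode=shared --name=wq4.0 --priority=10 --wq-size=128
  accel-config enable-device dsa4
  accel-config enable-wq dsa4/wq4.0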

Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 tests/unit/meson.build |   6 +
 tests/unit/test-dsa.c  | 466 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 472 insertions(+)
 create mode 100644 tests/unit/test-dsa.c

diff --git a/tests/unit/meson.build b/tests/unit/meson.build
index a05d471090..72e22063dc 100644
--- a/tests/unit/meson.build
+++ b/tests/unit/meson.build
@@ -54,6 +54,12 @@ tests = {
   'test-virtio-dmabuf': [meson.project_source_root() / 'hw/display/virtio-dmabuf.c'],
 }
 
+if config_host_data.get('CONFIG_DSA_OPT')
+  tests += {
+    'test-dsa': [],
+  }
+endif
+
 if have_system or have_tools
   tests += {
     'test-qmp-event': [testqapi],
diff --git a/tests/unit/test-dsa.c b/tests/unit/test-dsa.c
new file mode 100644
index 0000000000..d2f23c3dba
--- /dev/null
+++ b/tests/unit/test-dsa.c
@@ -0,0 +1,466 @@
+/*
+ * Test DSA functions.
+ *
+ * Copyright (c) 2023 Hao Xiang <hao.xiang@bytedance.com>
+ * Copyright (c) 2023 Bryan Zhang <bryan.zhang@bytedance.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include "qemu/osdep.h"
+#include "qemu/host-utils.h"
+
+#include "qemu/cutils.h"
+#include "qemu/memalign.h"
+#include "qemu/dsa.h"
+
+// TODO Make these paths configurable instead of hardcoded.
+static const char *path1 = "/dev/dsa/wq4.0";
+static const char *path2 = "/dev/dsa/wq4.0 /dev/dsa/wq4.1";
+static const int num_devices = 2;
+
+static struct buffer_zero_batch_task batch_task __attribute__((aligned(64)));
+
+// TODO Communicate that DSA must be configured to support this batch size.
+// TODO Alternatively, poke the DSA device to figure out batch size.
+static int batch_size = 128;
+static int page_size = 4096;
+
+// A helper for running a single task and checking for correctness.
+static void do_single_task(void)
+{
+    buffer_zero_batch_task_init(&batch_task, batch_size);
+    char buf[page_size];
+    char* ptr = buf;
+
+    buffer_is_zero_dsa_batch_async(&batch_task,
+                                   (const void**) &ptr,
+                                   1,
+                                   page_size);
+    g_assert(batch_task.results[0] == buffer_is_zero(buf, page_size));
+}
+
+static void test_single_zero(void)
+{
+    g_assert(!dsa_init(path1));
+    dsa_start();
+
+    buffer_zero_batch_task_init(&batch_task, batch_size);
+
+    char buf[page_size];
+    char* ptr = buf;
+
+    memset(buf, 0x0, page_size);
+    buffer_is_zero_dsa_batch_async(&batch_task,
+                                   (const void**) &ptr,
+                                   1, page_size);
+    g_assert(batch_task.results[0]);
+
+    dsa_cleanup();
+}
+
+static void test_single_zero_async(void)
+{
+    test_single_zero();
+}
+
+static void test_single_nonzero(void)
+{
+    g_assert(!dsa_init(path1));
+    dsa_start();
+
+    buffer_zero_batch_task_init(&batch_task, batch_size);
+
+    char buf[page_size];
+    char* ptr = buf;
+
+    memset(buf, 0x1, page_size);
+    buffer_is_zero_dsa_batch_async(&batch_task,
+                                   (const void**) &ptr,
+                                   1, page_size);
+    g_assert(!batch_task.results[0]);
+
+    dsa_cleanup();
+}
+
+static void test_single_nonzero_async(void)
+{
+    test_single_nonzero();
+}
+
+// count == 0 should return quickly without calling into DSA.
+static void test_zero_count_async(void)
+{
+    char buf[page_size];
+    buffer_is_zero_dsa_batch_async(&batch_task,
+                             (const void **) &buf,
+                             0,
+                             page_size);
+}
+
+static void test_null_task_async(void)
+{
+    if (g_test_subprocess()) {
+        g_assert(!dsa_init(path1));
+
+        char buf[page_size * batch_size];
+        char *addrs[batch_size];
+        for (int i = 0; i < batch_size; i++) {
+            addrs[i] = buf + (page_size * i);
+        }
+
+        buffer_is_zero_dsa_batch_async(NULL, (const void**) addrs, batch_size,
+                                 page_size);
+    } else {
+        g_test_trap_subprocess(NULL, 0, 0);
+        g_test_trap_assert_failed();
+    }
+}
+
+static void test_oversized_batch(void)
+{
+    g_assert(!dsa_init(path1));
+    dsa_start();
+
+    buffer_zero_batch_task_init(&batch_task, batch_size);
+
+    int oversized_batch_size = batch_size + 1;
+    char buf[page_size * oversized_batch_size];
+    char *addrs[batch_size];
+    for (int i = 0; i < oversized_batch_size; i++) {
+        addrs[i] = buf + (page_size * i);
+    }
+
+    int ret = buffer_is_zero_dsa_batch_async(&batch_task,
+                                            (const void**) addrs,
+                                            oversized_batch_size,
+                                            page_size);
+    g_assert(ret != 0);
+
+    dsa_cleanup();
+}
+
+static void test_oversized_batch_async(void)
+{
+    test_oversized_batch();
+}
+
+static void test_zero_len_async(void)
+{
+    if (g_test_subprocess()) {
+        g_assert(!dsa_init(path1));
+
+        buffer_zero_batch_task_init(&batch_task, batch_size);
+
+        char buf[page_size];
+
+        buffer_is_zero_dsa_batch_async(&batch_task,
+                                       (const void**) &buf,
+                                       1,
+                                       0);
+    } else {
+        g_test_trap_subprocess(NULL, 0, 0);
+        g_test_trap_assert_failed();
+    }
+}
+
+static void test_null_buf_async(void)
+{
+    if (g_test_subprocess()) {
+        g_assert(!dsa_init(path1));
+
+        buffer_zero_batch_task_init(&batch_task, batch_size);
+
+        buffer_is_zero_dsa_batch_async(&batch_task, NULL, 1, page_size);
+    } else {
+        g_test_trap_subprocess(NULL, 0, 0);
+        g_test_trap_assert_failed();
+    }
+}
+
+static void test_batch(void)
+{
+    g_assert(!dsa_init(path1));
+    dsa_start();
+
+    buffer_zero_batch_task_init(&batch_task, batch_size);
+
+    char buf[page_size * batch_size];
+    char *addrs[batch_size];
+    for (int i = 0; i < batch_size; i++) {
+        addrs[i] = buf + (page_size * i);
+    }
+
+    // Using whatever is on the stack is somewhat random.
+    // Manually set some pages to zero and some to nonzero.
+    memset(buf + 0, 0, page_size * 10);
+    memset(buf + (10 * page_size), 0xff, page_size * 10);
+
+    buffer_is_zero_dsa_batch_async(&batch_task,
+                                   (const void**) addrs,
+                                   batch_size,
+                                   page_size);
+
+    bool is_zero;
+    for (int i = 0; i < batch_size; i++) {
+        is_zero = buffer_is_zero((const void*) &buf[page_size * i], page_size);
+        g_assert(batch_task.results[i] == is_zero);
+    }
+    dsa_cleanup();
+}
+
+static void test_batch_async(void)
+{
+    test_batch();
+}
+
+static void test_page_fault(void)
+{
+    g_assert(!dsa_init(path1));
+    dsa_start();
+
+    char* buf[2];
+    int prot = PROT_READ | PROT_WRITE;
+    int flags = MAP_SHARED | MAP_ANON;
+    buf[0] = (char*) mmap(NULL, page_size * batch_size, prot, flags, -1, 0);
+    assert(buf[0] != MAP_FAILED);
+    buf[1] = (char*) malloc(page_size * batch_size);
+    assert(buf[1] != NULL);
+
+    for (int j = 0; j < 2; j++) {
+        buffer_zero_batch_task_init(&batch_task, batch_size);
+
+        char *addrs[batch_size];
+        for (int i = 0; i < batch_size; i++) {
+            addrs[i] = buf[j] + (page_size * i);
+        }
+
+        buffer_is_zero_dsa_batch_async(&batch_task,
+                                       (const void**) addrs,
+                                       batch_size,
+                                       page_size);
+
+        bool is_zero;
+        for (int i = 0; i < batch_size; i++) {
+            is_zero = buffer_is_zero((const void*) &buf[j][page_size * i], page_size);
+            g_assert(batch_task.results[i] == is_zero);
+        }
+    }
+
+    assert(!munmap(buf[0], page_size * batch_size));
+    free(buf[1]);
+    dsa_cleanup();
+}
+
+static void test_various_buffer_sizes(void)
+{
+    g_assert(!dsa_init(path1));
+    dsa_start();
+
+    int len = 1 << 4;
+    for (int count = 12; count > 0; count--, len <<= 1) {
+        buffer_zero_batch_task_init(&batch_task, batch_size);
+
+        char buf[len * batch_size];
+        char *addrs[batch_size];
+        for (int i = 0; i < batch_size; i++) {
+            addrs[i] = buf + (len * i);
+        }
+
+        buffer_is_zero_dsa_batch_async(&batch_task,
+                                       (const void**) addrs,
+                                       batch_size,
+                                       len);
+
+        bool is_zero;
+        for (int j = 0; j < batch_size; j++) {
+            is_zero = buffer_is_zero((const void*) &buf[len * j], len);
+            g_assert(batch_task.results[j] == is_zero);
+        }
+    }
+
+    dsa_cleanup();
+}
+
+static void test_various_buffer_sizes_async(void)
+{
+    test_various_buffer_sizes();
+}
+
+static void test_double_start_stop(void)
+{
+    g_assert(!dsa_init(path1));
+    // Double start
+    dsa_start();
+    dsa_start();
+    g_assert(dsa_is_running());
+    do_single_task();
+
+    // Double stop
+    dsa_stop();
+    g_assert(!dsa_is_running());
+    dsa_stop();
+    g_assert(!dsa_is_running());
+
+    // Restart
+    dsa_start();
+    g_assert(dsa_is_running());
+    do_single_task();
+    dsa_cleanup();
+}
+
+static void test_is_running(void)
+{
+    g_assert(!dsa_init(path1));
+
+    g_assert(!dsa_is_running());
+    dsa_start();
+    g_assert(dsa_is_running());
+    dsa_stop();
+    g_assert(!dsa_is_running());
+    dsa_cleanup();
+}
+
+static void test_multiple_engines(void)
+{
+    g_assert(!dsa_init(path2));
+    dsa_start();
+
+    struct buffer_zero_batch_task tasks[num_devices]
+        __attribute__((aligned(64)));
+    char bufs[num_devices][page_size * batch_size];
+    char *addrs[num_devices][batch_size];
+
+    // This is a somewhat implementation-specific way of testing that the tasks
+    // have unique engines assigned to them.
+    buffer_zero_batch_task_init(&tasks[0], batch_size);
+    buffer_zero_batch_task_init(&tasks[1], batch_size);
+    g_assert(tasks[0].device != tasks[1].device);
+
+    for (int i = 0; i < num_devices; i++) {
+        for (int j = 0; j < batch_size; j++) {
+            addrs[i][j] = bufs[i] + (page_size * j);
+        }
+
+        buffer_is_zero_dsa_batch_async(&tasks[i],
+                                       (const void**) addrs[i],
+                                       batch_size, page_size);
+
+        bool is_zero;
+        for (int j = 0; j < batch_size; j++) {
+            is_zero = buffer_is_zero((const void*) &bufs[i][page_size * j],
+                                     page_size);
+            g_assert(tasks[i].results[j] == is_zero);
+        }
+    }
+
+    dsa_cleanup();
+}
+
+static void test_configure_dsa_twice(void)
+{
+    g_assert(!dsa_init(path2));
+    g_assert(!dsa_init(path2));
+    dsa_start();
+    do_single_task();
+    dsa_cleanup();
+}
+
+static void test_configure_dsa_bad_path(void)
+{
+    const char* bad_path = "/not/a/real/path";
+    g_assert(dsa_init(bad_path));
+}
+
+static void test_cleanup_before_configure(void)
+{
+    dsa_cleanup();
+    g_assert(!dsa_init(path2));
+}
+
+static void test_configure_dsa_num_devices(void)
+{
+    g_assert(!dsa_init(path1));
+    dsa_start();
+
+    do_single_task();
+    dsa_stop();
+    dsa_cleanup();
+}
+
+static void test_cleanup_twice(void)
+{
+    g_assert(!dsa_init(path2));
+    dsa_cleanup();
+    dsa_cleanup();
+
+    g_assert(!dsa_init(path2));
+    dsa_start();
+    do_single_task();
+    dsa_cleanup();
+}
+
+static int check_test_setup(void)
+{
+    const char *path[2] = {path1, path2};
+    for (int i = 0; i < sizeof(path) / sizeof(char *); i++) {
+        if (dsa_init(path[i])) {
+            return -1;
+        }
+        dsa_cleanup();
+    }
+    return 0;
+}
+
+int main(int argc, char **argv)
+{
+    g_test_init(&argc, &argv, NULL);
+
+    if (check_test_setup() != 0) {
+        /*
+         * This test requires DSA work queues to be set up
+         * beforehand. If that setup is not present, skip
+         * the test instead of failing it.
+         */
+        exit(0);
+    }
+
+    if (num_devices > 1) {
+        g_test_add_func("/dsa/multiple_engines", test_multiple_engines);
+    }
+
+    g_test_add_func("/dsa/async/batch", test_batch_async);
+    g_test_add_func("/dsa/async/various_buffer_sizes",
+                    test_various_buffer_sizes_async);
+    g_test_add_func("/dsa/async/null_buf", test_null_buf_async);
+    g_test_add_func("/dsa/async/zero_len", test_zero_len_async);
+    g_test_add_func("/dsa/async/oversized_batch", test_oversized_batch_async);
+    g_test_add_func("/dsa/async/zero_count", test_zero_count_async);
+    g_test_add_func("/dsa/async/single_zero", test_single_zero_async);
+    g_test_add_func("/dsa/async/single_nonzero", test_single_nonzero_async);
+    g_test_add_func("/dsa/async/null_task", test_null_task_async);
+    g_test_add_func("/dsa/async/page_fault", test_page_fault);
+
+    g_test_add_func("/dsa/double_start_stop", test_double_start_stop);
+    g_test_add_func("/dsa/is_running", test_is_running);
+
+    g_test_add_func("/dsa/configure_dsa_twice", test_configure_dsa_twice);
+    g_test_add_func("/dsa/configure_dsa_bad_path", test_configure_dsa_bad_path);
+    g_test_add_func("/dsa/cleanup_before_configure",
+                    test_cleanup_before_configure);
+    g_test_add_func("/dsa/configure_dsa_num_devices",
+                    test_configure_dsa_num_devices);
+    g_test_add_func("/dsa/cleanup_twice", test_cleanup_twice);
+
+    return g_test_run();
+}
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v2 20/20] migration/multifd: Add integration tests for multifd with Intel DSA offloading.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (18 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 19/20] util/dsa: Add unit test coverage for Intel DSA task submission and completion Hao Xiang
@ 2023-11-14  5:40 ` Hao Xiang
  2023-11-15 17:43 ` [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Elena Ufimtseva
  20 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-14  5:40 UTC (permalink / raw)
  To: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

* Add a test case to start and complete multifd live migration with
DSA offloading enabled.
* Add a test case to start and cancel multifd live migration with
DSA offloading enabled.
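
Assuming a DSA work queue is available at /dev/dsa/wq4.0 (the path
hardcoded below), the new tests can be run with an invocation along
these lines:

  QTEST_QEMU_BINARY=./qemu-system-x86_64 \
      ./tests/qtest/migration-test -p /x86_64/migration/multifd/tcp/plain/none/dsa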

Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 tests/qtest/migration-test.c | 77 +++++++++++++++++++++++++++++++++++-
 1 file changed, 76 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 5752412b64..3ffbdd5a65 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -639,6 +639,12 @@ typedef struct {
     const char *opts_target;
 } MigrateStart;
 
+/*
+ * Configuring and enabling a DSA device requires separate steps.
+ * This test assumes that the configuration has already been done.
+ */
+static const char* dsa_dev_path = "/dev/dsa/wq4.0";
+
 /*
  * A hook that runs after the src and dst QEMUs have been
  * created, but before the migration is started. This can
@@ -2775,7 +2781,7 @@ static void test_multifd_tcp_tls_x509_reject_anon_client(void)
  *
  *  And see that it works
  */
-static void test_multifd_tcp_cancel(void)
+static void test_multifd_tcp_cancel_common(bool use_dsa)
 {
     MigrateStart args = {
         .hide_stderr = true,
@@ -2796,6 +2802,10 @@ static void test_multifd_tcp_cancel(void)
     migrate_set_capability(from, "multifd", true);
     migrate_set_capability(to, "multifd", true);
 
+    if (use_dsa) {
+        migrate_set_parameter_str(from, "multifd-dsa-accel", dsa_dev_path);
+    }
+
     /* Start incoming migration from the 1st socket */
     migrate_incoming_qmp(to, "tcp:127.0.0.1:0", "{}");
 
@@ -2852,6 +2862,48 @@ static void test_multifd_tcp_cancel(void)
     test_migrate_end(from, to2, true);
 }
 
+/*
+ * This test does:
+ *  source               target
+ *                       migrate_incoming
+ *     migrate
+ *     migrate_cancel
+ *                       launch another target
+ *     migrate
+ *
+ *  And see that it works
+ */
+static void test_multifd_tcp_cancel(void)
+{
+    test_multifd_tcp_cancel_common(false);
+}
+
+#ifdef CONFIG_DSA_OPT
+
+static void *test_migrate_precopy_tcp_multifd_start_dsa(QTestState *from,
+                                                        QTestState *to)
+{
+    migrate_set_parameter_str(from, "multifd-dsa-accel", dsa_dev_path);
+    return test_migrate_precopy_tcp_multifd_start_common(from, to, "none");
+}
+
+static void test_multifd_tcp_none_dsa(void)
+{
+    MigrateCommon args = {
+        .listen_uri = "defer",
+        .start_hook = test_migrate_precopy_tcp_multifd_start_dsa,
+    };
+
+    test_precopy_common(&args);
+}
+
+static void test_multifd_tcp_cancel_dsa(void)
+{
+    test_multifd_tcp_cancel_common(true);
+}
+
+#endif
+
 static void calc_dirty_rate(QTestState *who, uint64_t calc_time)
 {
     qtest_qmp_assert_success(who,
@@ -3274,6 +3326,19 @@ static bool kvm_dirty_ring_supported(void)
 #endif
 }
 
+#ifdef CONFIG_DSA_OPT
+static int test_dsa_setup(void)
+{
+    int fd;
+    fd = open(dsa_dev_path, O_RDWR);
+    if (fd < 0) {
+        return -1;
+    }
+    close(fd);
+    return 0;
+}
+#endif
+
 int main(int argc, char **argv)
 {
     bool has_kvm, has_tcg;
@@ -3468,6 +3533,16 @@ int main(int argc, char **argv)
     }
     qtest_add_func("/migration/multifd/tcp/plain/none",
                    test_multifd_tcp_none);
+
+#ifdef CONFIG_DSA_OPT
+    if (g_str_equal(arch, "x86_64") && test_dsa_setup() == 0) {
+        qtest_add_func("/migration/multifd/tcp/plain/none/dsa",
+                       test_multifd_tcp_none_dsa);
+        qtest_add_func("/migration/multifd/tcp/plain/cancel/dsa",
+                       test_multifd_tcp_cancel_dsa);
+    }
+#endif
+
     /*
      * This test is flaky and sometimes fails in CI and otherwise:
      * don't run unless user opts in via environment variable.
-- 
2.30.2




* Re: [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration.
  2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
                   ` (19 preceding siblings ...)
  2023-11-14  5:40 ` [PATCH v2 20/20] migration/multifd: Add integration tests for multifd with Intel DSA offloading Hao Xiang
@ 2023-11-15 17:43 ` Elena Ufimtseva
  2023-11-15 19:37   ` [External] " Hao Xiang
  20 siblings, 1 reply; 51+ messages in thread
From: Elena Ufimtseva @ 2023-11-15 17:43 UTC (permalink / raw)
  To: Hao Xiang
  Cc: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel

Hello Hao,

On Mon, Nov 13, 2023 at 9:42 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
>
> v2
> * Rebase on top of 3e01f1147a16ca566694b97eafc941d62fa1e8d8.
> * Leave Juan's changes in their original form instead of squashing them.
> * Add a new commit to refactor the multifd_send_thread function to prepare for introducing the DSA offload functionality.
> * Use page count to configure multifd-packet-size option.
> * Don't use the FLAKY flag in DSA tests.
> * Test if DSA integration test is setup correctly and skip the test if
> * not.
> * Fixed broken link in the previous patch cover.
>
> * Background:
>
> I posted an RFC about DSA offloading in QEMU:
> https://patchew.org/QEMU/20230529182001.2232069-1-hao.xiang@bytedance.com/
>
> This patchset implements the DSA offloading on zero page checking in
> multifd live migration code path.
>


Do you have performance numbers with different packet sizes for DSA
and non-DSA cases?
What did you find to be the optimal size for DSA offloading?

Thank you!
> [...]


-- 
Elena



* Re: [External] Re: [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration.
  2023-11-15 17:43 ` [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Elena Ufimtseva
@ 2023-11-15 19:37   ` Hao Xiang
  0 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-11-15 19:37 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel

On Wed, Nov 15, 2023 at 9:43 AM Elena Ufimtseva <ufimtseva@gmail.com> wrote:
>
> Hello Hao,
>
> On Mon, Nov 13, 2023 at 9:42 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
> >
> > [...]
>
>
> Do you have performance numbers with different packet sizes for DSA
> and non-DSA cases?
> What did you find to be the optimal size for DSA offloading?
>
> Thank you!

Hi Elena,

The performance numbers currently in the cover letter are based on
using 1024 pages per packet. I just realized that I didn't clarify
this in the description, so I will update that. Basically, for DSA
offloading, the bigger the packet size, the better the CPU savings.
DSA does what the buffer_is_zero() function call does, but it also
adds a few overheads, all of which are paid by the CPU:
1. Preparing the DSA task descriptors and submitting them to hardware.
2. Polling for DSA task completion in a busy loop.
3. Thread synchronization between the sender threads and the DSA
completion thread.
To reduce overhead 1, I prepare a DSA batch task containing 1024
sub-tasks. 1024 is the biggest batch size the DSA hardware can handle,
which is why I made the max packet size 1024 (there is also a
constraint on the kernel network side, where a tcp send can only
handle a limited number of iovs in the network packet descriptor, and
that limit varies from kernel version to kernel version). To reduce
overhead 2, I use a dedicated completion thread to poll for completion
of tasks submitted by all sender threads, so we don't end up with
multiple busy loops wasting CPU cycles. To reduce overhead 3, the
bigger the packet size, the fewer times we need to synchronize between
the sender threads and the DSA completion thread.
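
(For illustration, a minimal sketch of the polling step in overhead 2
above, written against the idxd UAPI header that this patchset imports;
the poll_completion() helper itself is hypothetical and not part of the
patchset:)

#include <immintrin.h>     /* _mm_pause */
#include <linux/idxd.h>    /* struct dsa_completion_record, DSA_COMP_SUCCESS */

/*
 * Spin until the accelerator writes the completion record, pausing
 * between iterations to free up core resources for the sibling
 * hyperthread.
 */
static int poll_completion(volatile struct dsa_completion_record *comp)
{
    while (comp->status == 0) {    /* 0 means the task has not completed */
        _mm_pause();
    }
    return comp->status == DSA_COMP_SUCCESS ? 0 : -1;
}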

> > [...]



* Re: [PATCH v2 04/20] So we use multifd to transmit zero pages.
  2023-11-14  5:40 ` [PATCH v2 04/20] So we use multifd to transmit zero pages Hao Xiang
@ 2023-11-16 15:14   ` Fabiano Rosas
  2024-01-23  4:28     ` [External] " Hao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Fabiano Rosas @ 2023-11-16 15:14 UTC (permalink / raw)
  To: Hao Xiang, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Leonardo Bras

Hao Xiang <hao.xiang@bytedance.com> writes:

> From: Juan Quintela <quintela@redhat.com>
>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> Reviewed-by: Leonardo Bras <leobras@redhat.com>
> ---
>  migration/multifd.c |  7 ++++---
>  migration/options.c | 13 +++++++------
>  migration/ram.c     | 45 ++++++++++++++++++++++++++++++++++++++-------
>  qapi/migration.json |  1 -
>  4 files changed, 49 insertions(+), 17 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 1b994790d5..1198ffde9c 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -13,6 +13,7 @@
>  #include "qemu/osdep.h"
>  #include "qemu/cutils.h"
>  #include "qemu/rcu.h"
> +#include "qemu/cutils.h"
>  #include "exec/target_page.h"
>  #include "sysemu/sysemu.h"
>  #include "exec/ramblock.h"
> @@ -459,7 +460,6 @@ static int multifd_send_pages(QEMUFile *f)
>      p->packet_num = multifd_send_state->packet_num++;
>      multifd_send_state->pages = p->pages;
>      p->pages = pages;
> -
>      qemu_mutex_unlock(&p->mutex);
>      qemu_sem_post(&p->sem);
>  
> @@ -684,7 +684,7 @@ static void *multifd_send_thread(void *opaque)
>      MigrationThread *thread = NULL;
>      Error *local_err = NULL;
>      /* qemu older than 8.2 don't understand zero page on multifd channel */
> -    bool use_zero_page = !migrate_use_main_zero_page();
> +    bool use_multifd_zero_page = !migrate_use_main_zero_page();
>      int ret = 0;
>      bool use_zero_copy_send = migrate_zero_copy_send();
>  
> @@ -713,6 +713,7 @@ static void *multifd_send_thread(void *opaque)
>              RAMBlock *rb = p->pages->block;
>              uint64_t packet_num = p->packet_num;
>              uint32_t flags;
> +
>              p->normal_num = 0;
>              p->zero_num = 0;
>  
> @@ -724,7 +725,7 @@ static void *multifd_send_thread(void *opaque)
>  
>              for (int i = 0; i < p->pages->num; i++) {
>                  uint64_t offset = p->pages->offset[i];
> -                if (use_zero_page &&
> +                if (use_multifd_zero_page &&

We could have a new function in multifd_ops for zero page
handling. We're already considering an accelerator for the compression
method in the other series[1], and in this series we're adding an
accelerator for zero page checking. It's about time we made
multifd_ops generic instead of only compression/no compression.

1- [PATCH v2 0/4] Live Migration Acceleration with IAA Compression
https://lore.kernel.org/r/20231109154638.488213-1-yuan1.liu@intel.com
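
(To make that concrete, a rough sketch in the context of this patch;
MultiFDMethods already exists in migration/multifd.h, but the
zero_page_detect member and the helper below are hypothetical:)

typedef struct {
    /* ... existing hooks such as send_setup/send_prepare ... */
    void (*zero_page_detect)(MultiFDSendParams *p);    /* hypothetical hook */
} MultiFDMethods;

/* Default CPU implementation, equivalent to the loop in this patch. */
static void zero_page_detect_cpu(MultiFDSendParams *p)
{
    RAMBlock *rb = p->pages->block;

    for (int i = 0; i < p->pages->num; i++) {
        uint64_t offset = p->pages->offset[i];

        if (buffer_is_zero(rb->host + offset, p->page_size)) {
            p->zero[p->zero_num++] = offset;
        } else {
            p->normal[p->normal_num++] = offset;
        }
    }
}

A DSA (or IAA) backend would then only need to supply its own
zero_page_detect callback instead of touching multifd_send_thread().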

>                      buffer_is_zero(rb->host + offset, p->page_size)) {
>                      p->zero[p->zero_num] = offset;
>                      p->zero_num++;
> diff --git a/migration/options.c b/migration/options.c
> index 00c0c4a0d6..97d121d4d7 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -195,6 +195,7 @@ Property migration_properties[] = {
>      DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
>      DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
>      DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
> +    DEFINE_PROP_MIG_CAP("x-main-zero-page", MIGRATION_CAPABILITY_MAIN_ZERO_PAGE),
>      DEFINE_PROP_MIG_CAP("x-background-snapshot",
>              MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
>  #ifdef CONFIG_LINUX
> @@ -288,13 +289,9 @@ bool migrate_multifd(void)
>  
>  bool migrate_use_main_zero_page(void)
>  {
> -    //MigrationState *s;
> -
> -    //s = migrate_get_current();
> +    MigrationState *s = migrate_get_current();
>  
> -    // We will enable this when we add the right code.
> -    // return s->enabled_capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
> -    return true;
> +    return s->capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];

What happens if we disable main-zero-page while multifd is not enabled?

>  }
>  
>  bool migrate_pause_before_switchover(void)
> @@ -457,6 +454,7 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
>      MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE,
>      MIGRATION_CAPABILITY_RETURN_PATH,
>      MIGRATION_CAPABILITY_MULTIFD,
> +    MIGRATION_CAPABILITY_MAIN_ZERO_PAGE,
>      MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER,
>      MIGRATION_CAPABILITY_AUTO_CONVERGE,
>      MIGRATION_CAPABILITY_RELEASE_RAM,
> @@ -534,6 +532,9 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
>              error_setg(errp, "Postcopy is not yet compatible with multifd");
>              return false;
>          }
> +        if (new_caps[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE]) {
> +            error_setg(errp, "Postcopy is not yet compatible with main zero copy");
> +        }

Won't this break compatibility for postcopy? A command that used to
work will now have to disable main-zero-page first.

>      }
>  
>      if (new_caps[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
> diff --git a/migration/ram.c b/migration/ram.c
> index 8c7886ab79..f7a42feff2 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2059,17 +2059,42 @@ static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss)
>      if (save_zero_page(rs, pss, offset)) {
>          return 1;
>      }
> -
>      /*
> -     * Do not use multifd in postcopy as one whole host page should be
> -     * placed.  Meanwhile postcopy requires atomic update of pages, so even
> -     * if host page size == guest page size the dest guest during run may
> -     * still see partially copied pages which is data corruption.
> +     * Do not use multifd for:
> +     * 1. Compression as the first page in the new block should be posted out
> +     *    before sending the compressed page
> +     * 2. In postcopy as one whole host page should be placed
>       */
> -    if (migrate_multifd() && !migration_in_postcopy()) {
> +    if (!migrate_compress() && migrate_multifd() && !migration_in_postcopy()) {
> +        return ram_save_multifd_page(pss->pss_channel, block, offset);
> +    }

This could go into ram_save_target_page_multifd like so:

if (!migrate_compress() && !migration_in_postcopy() &&
    !migration_main_zero_page()) {
    return ram_save_multifd_page(pss->pss_channel, block, offset);
} else {
    return ram_save_target_page_legacy(rs, pss);
}

> +
> +    return ram_save_page(rs, pss);
> +}
> +
> +/**
> + * ram_save_target_page_multifd: save one target page
> + *
> + * Returns the number of pages written
> + *
> + * @rs: current RAM state
> + * @pss: data about the page we want to send
> + */
> +static int ram_save_target_page_multifd(RAMState *rs, PageSearchStatus *pss)
> +{
> +    RAMBlock *block = pss->block;
> +    ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
> +    int res;
> +
> +    if (!migration_in_postcopy()) {
>          return ram_save_multifd_page(pss->pss_channel, block, offset);
>      }
>  
> +    res = save_zero_page(rs, pss, offset);
> +    if (res > 0) {
> +        return res;
> +    }
> +
>      return ram_save_page(rs, pss);
>  }
>  
> @@ -2982,9 +3007,15 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>      }
>  
>      migration_ops = g_malloc0(sizeof(MigrationOps));
> -    migration_ops->ram_save_target_page = ram_save_target_page_legacy;
> +
> +    if (migrate_multifd() && !migrate_use_main_zero_page()) {
> +        migration_ops->ram_save_target_page = ram_save_target_page_multifd;
> +    } else {
> +        migration_ops->ram_save_target_page = ram_save_target_page_legacy;
> +    }

This should not check main-zero-page. Just have multifd vs. legacy and
have the multifd function defer to _legacy if main-zero-page or
in_postcopy.
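
(A minimal sketch of that shape, reusing the identifiers from the
patch above; the fallback inside the multifd variant is the suggested
change, not code from the patch:)

/* In ram_save_setup(), select on multifd alone: */
if (migrate_multifd()) {
    migration_ops->ram_save_target_page = ram_save_target_page_multifd;
} else {
    migration_ops->ram_save_target_page = ram_save_target_page_legacy;
}

/* ... and let the multifd variant defer to the legacy path itself: */
static int ram_save_target_page_multifd(RAMState *rs, PageSearchStatus *pss)
{
    if (migrate_use_main_zero_page() || migration_in_postcopy()) {
        return ram_save_target_page_legacy(rs, pss);
    }

    return ram_save_multifd_page(pss->pss_channel, pss->block,
                                 ((ram_addr_t)pss->page) << TARGET_PAGE_BITS);
}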

>  
>      qemu_mutex_unlock_iothread();
> +
>      ret = multifd_send_sync_main(f);
>      qemu_mutex_lock_iothread();
>      if (ret < 0) {
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 09e4393591..9783289bfc 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -531,7 +531,6 @@
>  #     and can result in more stable read performance.  Requires KVM
>  #     with accelerator property "dirty-ring-size" set.  (Since 8.1)
>  #
> -#
>  # @main-zero-page: If enabled, the detection of zero pages will be
>  #                  done on the main thread.  Otherwise it is done on
>  #                  the multifd threads.



* Re: [PATCH v2 01/20] multifd: Add capability to enable/disable zero_page
  2023-11-14  5:40 ` [PATCH v2 01/20] multifd: Add capability to enable/disable zero_page Hao Xiang
@ 2023-11-16 15:15   ` Fabiano Rosas
  0 siblings, 0 replies; 51+ messages in thread
From: Fabiano Rosas @ 2023-11-16 15:15 UTC (permalink / raw)
  To: Hao Xiang, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel

Hao Xiang <hao.xiang@bytedance.com> writes:

> From: Juan Quintela <quintela@redhat.com>
>
> We have to enable it by default until we introduce the new code.
>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  migration/options.c | 13 +++++++++++++
>  migration/options.h |  1 +
>  qapi/migration.json |  8 +++++++-
>  3 files changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/migration/options.c b/migration/options.c
> index 8d8ec73ad9..00c0c4a0d6 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -204,6 +204,8 @@ Property migration_properties[] = {
>      DEFINE_PROP_MIG_CAP("x-switchover-ack",
>                          MIGRATION_CAPABILITY_SWITCHOVER_ACK),
>      DEFINE_PROP_MIG_CAP("x-dirty-limit", MIGRATION_CAPABILITY_DIRTY_LIMIT),
> +    DEFINE_PROP_MIG_CAP("main-zero-page",
> +            MIGRATION_CAPABILITY_MAIN_ZERO_PAGE),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> @@ -284,6 +286,17 @@ bool migrate_multifd(void)
>      return s->capabilities[MIGRATION_CAPABILITY_MULTIFD];
>  }
>  
> +bool migrate_use_main_zero_page(void)

We dropped the 'use' from these a while back. Let's not bring it back.

> +{
> +    //MigrationState *s;
> +
> +    //s = migrate_get_current();
> +
> +    // We will enable this when we add the right code.
> +    // return s->enabled_capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];

Could use /* */ so checkpatch won't complain.

> +    return true;
> +}
> +
>  bool migrate_pause_before_switchover(void)
>  {
>      MigrationState *s = migrate_get_current();
> diff --git a/migration/options.h b/migration/options.h
> index 246c160aee..c901eb57c6 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -88,6 +88,7 @@ int migrate_multifd_channels(void);
>  MultiFDCompression migrate_multifd_compression(void);
>  int migrate_multifd_zlib_level(void);
>  int migrate_multifd_zstd_level(void);
> +bool migrate_use_main_zero_page(void);
>  uint8_t migrate_throttle_trigger_threshold(void);
>  const char *migrate_tls_authz(void);
>  const char *migrate_tls_creds(void);
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 975761eebd..09e4393591 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -531,6 +531,12 @@
>  #     and can result in more stable read performance.  Requires KVM
>  #     with accelerator property "dirty-ring-size" set.  (Since 8.1)
>  #
> +#
> +# @main-zero-page: If enabled, the detection of zero pages will be
> +#                  done on the main thread.  Otherwise it is done on
> +#                  the multifd threads.
> +#                  (since 8.2)
> +#
>  # Features:
>  #
>  # @deprecated: Member @block is deprecated.  Use blockdev-mirror with
> @@ -555,7 +561,7 @@
>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>             'validate-uuid', 'background-snapshot',
>             'zero-copy-send', 'postcopy-preempt', 'switchover-ack',
> -           'dirty-limit'] }
> +           'dirty-limit', 'main-zero-page'] }
>  
>  ##
>  # @MigrationCapabilityStatus:



* Re: [PATCH v2 05/20] meson: Introduce new instruction set enqcmd to the build system.
  2023-11-14  5:40 ` [PATCH v2 05/20] meson: Introduce new instruction set enqcmd to the build system Hao Xiang
@ 2023-12-11 15:41   ` Fabiano Rosas
  2023-12-16  0:26     ` [External] " Hao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Fabiano Rosas @ 2023-12-11 15:41 UTC (permalink / raw)
  To: Hao Xiang, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

Hao Xiang <hao.xiang@bytedance.com> writes:

> Enable instruction set enqcmd in build.
>
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> ---
>  meson.build                   | 2 ++
>  meson_options.txt             | 2 ++
>  scripts/meson-buildoptions.sh | 3 +++
>  3 files changed, 7 insertions(+)
>
> diff --git a/meson.build b/meson.build
> index ec01f8b138..1292ab78a3 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -2708,6 +2708,8 @@ config_host_data.set('CONFIG_AVX512BW_OPT', get_option('avx512bw') \
>      int main(int argc, char *argv[]) { return bar(argv[0]); }
>    '''), error_message: 'AVX512BW not available').allowed())
>  
> +config_host_data.set('CONFIG_DSA_OPT', get_option('enqcmd'))

We need some sort of detection at configure time of whether the
feature is available. There are different compilers and compiler
versions, different Intel CPU versions, different CPU vendors,
different architectures, etc. Not all combinations will support DSA.
See how avx512 is handled above.
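
(As a rough illustration, the configure-time check could try to
compile and link a trivial user of the intrinsic, in the same spirit
as the avx512 probes; this sketch assumes a GCC/Clang toolchain built
with -menqcmd, and is only meant to build, never to run:)

#include <immintrin.h>

int main(void)
{
    char portal[64] __attribute__((aligned(64))) = { 0 };  /* stand-in for a wq portal */
    char desc[64] = { 0 };                                 /* zeroed 64-byte descriptor */

    /* _enqcmd() returns non-zero if the device did not accept the descriptor. */
    return _enqcmd(portal, desc);
}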




* Re: [PATCH v2 12/20] migration/multifd: Add new migration option for multifd DSA offloading.
  2023-11-14  5:40 ` [PATCH v2 12/20] migration/multifd: Add new migration option for multifd DSA offloading Hao Xiang
@ 2023-12-11 19:44   ` Fabiano Rosas
  2023-12-18 18:34     ` [External] " Hao Xiang
  2023-12-18  3:12   ` Wang, Lei
  1 sibling, 1 reply; 51+ messages in thread
From: Fabiano Rosas @ 2023-12-11 19:44 UTC (permalink / raw)
  To: Hao Xiang, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

Hao Xiang <hao.xiang@bytedance.com> writes:

> Intel DSA offloading is an optional feature that turns on if the
> proper hardware and software stack is available. To turn on
> DSA offloading in multifd live migration:
>
> multifd-dsa-accel="[dsa_dev_path1] [dsa_dev_path2] ... [dsa_dev_pathX]"
>
> This feature is turned off by default.

This patch breaks make check:

 43/357 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.52s
 79/357 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test         ERROR           3.59s
167/357 qemu:qtest+qtest-x86_64 / qtest-x86_64/qmp-cmd-test           ERROR           3.68s

Make sure you run make check before posting. Ideally also run the series
through the Gitlab CI on your personal fork.

> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> ---
>  migration/migration-hmp-cmds.c |  8 ++++++++
>  migration/options.c            | 28 ++++++++++++++++++++++++++++
>  migration/options.h            |  1 +
>  qapi/migration.json            | 17 ++++++++++++++---
>  scripts/meson-buildoptions.sh  |  6 +++---
>  5 files changed, 54 insertions(+), 6 deletions(-)
>
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index 86ae832176..d9451744dd 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -353,6 +353,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
>          monitor_printf(mon, "%s: '%s'\n",
>              MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
>              params->tls_authz);
> +        monitor_printf(mon, "%s: %s\n",

Use '%s' here.
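i.e., matching the tls_authz line above:

        monitor_printf(mon, "%s: '%s'\n",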

> +            MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_DSA_ACCEL),
> +            params->multifd_dsa_accel);
>  
>          if (params->has_block_bitmap_mapping) {
>              const BitmapMigrationNodeAliasList *bmnal;
> @@ -615,6 +618,11 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>          p->has_block_incremental = true;
>          visit_type_bool(v, param, &p->block_incremental, &err);
>          break;
> +    case MIGRATION_PARAMETER_MULTIFD_DSA_ACCEL:
> +        p->multifd_dsa_accel = g_new0(StrOrNull, 1);
> +        p->multifd_dsa_accel->type = QTYPE_QSTRING;
> +        visit_type_str(v, param, &p->multifd_dsa_accel->u.s, &err);
> +        break;
>      case MIGRATION_PARAMETER_MULTIFD_CHANNELS:
>          p->has_multifd_channels = true;
>          visit_type_uint8(v, param, &p->multifd_channels, &err);
> diff --git a/migration/options.c b/migration/options.c
> index 97d121d4d7..6e424b5d63 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -179,6 +179,8 @@ Property migration_properties[] = {
>      DEFINE_PROP_MIG_MODE("mode", MigrationState,
>                        parameters.mode,
>                        MIG_MODE_NORMAL),
> +    DEFINE_PROP_STRING("multifd-dsa-accel", MigrationState,
> +                       parameters.multifd_dsa_accel),
>  
>      /* Migration capabilities */
>      DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
> @@ -901,6 +903,13 @@ const char *migrate_tls_creds(void)
>      return s->parameters.tls_creds;
>  }
>  
> +const char *migrate_multifd_dsa_accel(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    return s->parameters.multifd_dsa_accel;
> +}
> +
>  const char *migrate_tls_hostname(void)
>  {
>      MigrationState *s = migrate_get_current();
> @@ -1025,6 +1034,7 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
>      params->vcpu_dirty_limit = s->parameters.vcpu_dirty_limit;
>      params->has_mode = true;
>      params->mode = s->parameters.mode;
> +    params->multifd_dsa_accel = s->parameters.multifd_dsa_accel;
>  
>      return params;
>  }
> @@ -1033,6 +1043,7 @@ void migrate_params_init(MigrationParameters *params)
>  {
>      params->tls_hostname = g_strdup("");
>      params->tls_creds = g_strdup("");
> +    params->multifd_dsa_accel = g_strdup("");
>  
>      /* Set has_* up only for parameter checks */
>      params->has_compress_level = true;
> @@ -1362,6 +1373,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
>      if (params->has_mode) {
>          dest->mode = params->mode;
>      }
> +
> +    if (params->multifd_dsa_accel) {
> +        assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
> +        dest->multifd_dsa_accel = params->multifd_dsa_accel->u.s;
> +    }
>  }
>  
>  static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
> @@ -1506,6 +1522,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
>      if (params->has_mode) {
>          s->parameters.mode = params->mode;
>      }
> +
> +    if (params->multifd_dsa_accel) {
> +        g_free(s->parameters.multifd_dsa_accel);
> +        assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
> +        s->parameters.multifd_dsa_accel = g_strdup(params->multifd_dsa_accel->u.s);
> +    }
>  }
>  
>  void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
> @@ -1531,6 +1553,12 @@ void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
>          params->tls_authz->type = QTYPE_QSTRING;
>          params->tls_authz->u.s = strdup("");
>      }
> +    if (params->multifd_dsa_accel
> +        && params->multifd_dsa_accel->type == QTYPE_QNULL) {
> +        qobject_unref(params->multifd_dsa_accel->u.n);
> +        params->multifd_dsa_accel->type = QTYPE_QSTRING;
> +        params->multifd_dsa_accel->u.s = strdup("");
> +    }
>  
>      migrate_params_test_apply(params, &tmp);
>  
> diff --git a/migration/options.h b/migration/options.h
> index c901eb57c6..56100961a9 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -94,6 +94,7 @@ const char *migrate_tls_authz(void);
>  const char *migrate_tls_creds(void);
>  const char *migrate_tls_hostname(void);
>  uint64_t migrate_xbzrle_cache_size(void);
> +const char *migrate_multifd_dsa_accel(void);
>  
>  /* parameters setters */
>  
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 9783289bfc..a8e3b66d6f 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -879,6 +879,9 @@
>  # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
>  #        (Since 8.2)
>  #
> +# @multifd-dsa-accel: If enabled, use DSA accelerator offloading for
> +#                     certain memory operations. (since 8.2)
> +#
>  # Features:
>  #
>  # @deprecated: Member @block-incremental is deprecated.  Use
> @@ -902,7 +905,7 @@
>             'cpu-throttle-initial', 'cpu-throttle-increment',
>             'cpu-throttle-tailslow',
>             'tls-creds', 'tls-hostname', 'tls-authz', 'max-bandwidth',
> -           'avail-switchover-bandwidth', 'downtime-limit',
> +           'avail-switchover-bandwidth', 'downtime-limit', 'multifd-dsa-accel',
>             { 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
>             { 'name': 'block-incremental', 'features': [ 'deprecated' ] },
>             'multifd-channels',
> @@ -1067,6 +1070,9 @@
>  # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
>  #        (Since 8.2)
>  #
> +# @multifd-dsa-accel: If enabled, use DSA accelerator offloading for
> +#                     certain memory operations. (since 8.2)
> +#
>  # Features:
>  #
>  # @deprecated: Member @block-incremental is deprecated.  Use
> @@ -1120,7 +1126,8 @@
>              '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
>                                              'features': [ 'unstable' ] },
>              '*vcpu-dirty-limit': 'uint64',
> -            '*mode': 'MigMode'} }
> +            '*mode': 'MigMode',
> +            '*multifd-dsa-accel': 'StrOrNull'} }
>  
>  ##
>  # @migrate-set-parameters:
> @@ -1295,6 +1302,9 @@
>  # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
>  #        (Since 8.2)
>  #
> +# @multifd-dsa-accel: If enabled, use DSA accelerator offloading for
> +#                     certain memory operations. (since 8.2)
> +#
>  # Features:
>  #
>  # @deprecated: Member @block-incremental is deprecated.  Use
> @@ -1345,7 +1355,8 @@
>              '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
>                                              'features': [ 'unstable' ] },
>              '*vcpu-dirty-limit': 'uint64',
> -            '*mode': 'MigMode'} }
> +            '*mode': 'MigMode',
> +            '*multifd-dsa-accel': 'str'} }
>  
>  ##
>  # @query-migrate-parameters:
> diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
> index bf139e3fb4..35222ab63e 100644
> --- a/scripts/meson-buildoptions.sh
> +++ b/scripts/meson-buildoptions.sh
> @@ -32,6 +32,7 @@ meson_options_help() {
>    printf "%s\n" '  --enable-debug-stack-usage'
>    printf "%s\n" '                           measure coroutine stack usage'
>    printf "%s\n" '  --enable-debug-tcg       TCG debugging'
> +  printf "%s\n" '  --enable-enqcmd          ENQCMD optimizations'
>    printf "%s\n" '  --enable-fdt[=CHOICE]    Whether and how to find the libfdt library'
>    printf "%s\n" '                           (choices: auto/disabled/enabled/internal/system)'
>    printf "%s\n" '  --enable-fuzzing         build fuzzing targets'
> @@ -93,7 +94,6 @@ meson_options_help() {
>    printf "%s\n" '  avx2            AVX2 optimizations'
>    printf "%s\n" '  avx512bw        AVX512BW optimizations'
>    printf "%s\n" '  avx512f         AVX512F optimizations'
> -  printf "%s\n" '  enqcmd          ENQCMD optimizations'
>    printf "%s\n" '  blkio           libblkio block device driver'
>    printf "%s\n" '  bochs           bochs image format support'
>    printf "%s\n" '  bpf             eBPF support'
> @@ -241,8 +241,6 @@ _meson_option_parse() {
>      --disable-avx512bw) printf "%s" -Davx512bw=disabled ;;
>      --enable-avx512f) printf "%s" -Davx512f=enabled ;;
>      --disable-avx512f) printf "%s" -Davx512f=disabled ;;
> -    --enable-enqcmd) printf "%s" -Denqcmd=true ;;
> -    --disable-enqcmd) printf "%s" -Denqcmd=false ;;
>      --enable-gcov) printf "%s" -Db_coverage=true ;;
>      --disable-gcov) printf "%s" -Db_coverage=false ;;
>      --enable-lto) printf "%s" -Db_lto=true ;;
> @@ -309,6 +307,8 @@ _meson_option_parse() {
>      --disable-docs) printf "%s" -Ddocs=disabled ;;
>      --enable-dsound) printf "%s" -Ddsound=enabled ;;
>      --disable-dsound) printf "%s" -Ddsound=disabled ;;
> +    --enable-enqcmd) printf "%s" -Denqcmd=true ;;
> +    --disable-enqcmd) printf "%s" -Denqcmd=false ;;
>      --enable-fdt) printf "%s" -Dfdt=enabled ;;
>      --disable-fdt) printf "%s" -Dfdt=disabled ;;
>      --enable-fdt=*) quote_sh "-Dfdt=$2" ;;



* Re: [PATCH v2 07/20] util/dsa: Implement DSA device start and stop logic.
  2023-11-14  5:40 ` [PATCH v2 07/20] util/dsa: Implement DSA device start and stop logic Hao Xiang
@ 2023-12-11 21:28   ` Fabiano Rosas
  2023-12-19  6:41     ` [External] " Hao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Fabiano Rosas @ 2023-12-11 21:28 UTC (permalink / raw)
  To: Hao Xiang, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

Hao Xiang <hao.xiang@bytedance.com> writes:

> * DSA device open and close.
> * DSA group contains multiple DSA devices.
> * DSA group configure/start/stop/clean.
>
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
> ---
>  include/qemu/dsa.h |  49 +++++++
>  util/dsa.c         | 338 +++++++++++++++++++++++++++++++++++++++++++++
>  util/meson.build   |   1 +
>  3 files changed, 388 insertions(+)
>  create mode 100644 include/qemu/dsa.h
>  create mode 100644 util/dsa.c
>
> diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> new file mode 100644
> index 0000000000..30246b507e
> --- /dev/null
> +++ b/include/qemu/dsa.h
> @@ -0,0 +1,49 @@
> +#ifndef QEMU_DSA_H
> +#define QEMU_DSA_H
> +
> +#include "qemu/thread.h"
> +#include "qemu/queue.h"
> +
> +#ifdef CONFIG_DSA_OPT
> +
> +#pragma GCC push_options
> +#pragma GCC target("enqcmd")
> +
> +#include <linux/idxd.h>
> +#include "x86intrin.h"
> +
> +#endif
> +
> +/**
> + * @brief Initializes DSA devices.
> + *
> + * @param dsa_parameter A list of DSA device path from migration parameter.

This code seems pretty generic, let's decouple this doc from migration.

> + * @return int Zero if successful, otherwise non zero.
> + */
> +int dsa_init(const char *dsa_parameter);
> +
> +/**
> + * @brief Start logic to enable using DSA.
> + */
> +void dsa_start(void);
> +
> +/**
> + * @brief Stop logic to clean up DSA by halting the device group and cleaning up
> + * the completion thread.

"Stop the device group and the completion thread"

The mention of "clean/cleaning up" makes this confusing because of
dsa_cleanup() below.

> + */
> +void dsa_stop(void);
> +
> +/**
> + * @brief Clean up system resources created for DSA offloading.
> + *        This function is called during QEMU process teardown.

This is not called during QEMU process teardown. It's called at the end
of migration AFAICS. Maybe just leave this sentence out.

> + */
> +void dsa_cleanup(void);
> +
> +/**
> + * @brief Check if DSA is running.
> + *
> + * @return True if DSA is running, otherwise false.
> + */
> +bool dsa_is_running(void);
> +
> +#endif
> \ No newline at end of file
> diff --git a/util/dsa.c b/util/dsa.c
> new file mode 100644
> index 0000000000..8edaa892ec
> --- /dev/null
> +++ b/util/dsa.c
> @@ -0,0 +1,338 @@
> +/*
> + * Use Intel Data Streaming Accelerator to offload certain background
> + * operations.
> + *
> + * Copyright (c) 2023 Hao Xiang <hao.xiang@bytedance.com>
> + *                    Bryan Zhang <bryan.zhang@bytedance.com>
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/queue.h"
> +#include "qemu/memalign.h"
> +#include "qemu/lockable.h"
> +#include "qemu/cutils.h"
> +#include "qemu/dsa.h"
> +#include "qemu/bswap.h"
> +#include "qemu/error-report.h"
> +#include "qemu/rcu.h"
> +
> +#ifdef CONFIG_DSA_OPT
> +
> +#pragma GCC push_options
> +#pragma GCC target("enqcmd")
> +
> +#include <linux/idxd.h>
> +#include "x86intrin.h"
> +
> +#define DSA_WQ_SIZE 4096
> +#define MAX_DSA_DEVICES 16
> +
> +typedef QSIMPLEQ_HEAD(dsa_task_queue, buffer_zero_batch_task) dsa_task_queue;
> +
> +struct dsa_device {
> +    void *work_queue;
> +};
> +
> +struct dsa_device_group {
> +    struct dsa_device *dsa_devices;
> +    int num_dsa_devices;
> +    uint32_t index;
> +    bool running;
> +    QemuMutex task_queue_lock;
> +    QemuCond task_queue_cond;
> +    dsa_task_queue task_queue;
> +};
> +
> +uint64_t max_retry_count;
> +static struct dsa_device_group dsa_group;
> +
> +
> +/**
> + * @brief This function opens a DSA device's work queue and
> + *        maps the DSA device memory into the current process.
> + *
> + * @param dsa_wq_path A pointer to the DSA device work queue's file path.
> + * @return A pointer to the mapped memory.
> + */
> +static void *
> +map_dsa_device(const char *dsa_wq_path)
> +{
> +    void *dsa_device;
> +    int fd;
> +
> +    fd = open(dsa_wq_path, O_RDWR);
> +    if (fd < 0) {
> +        fprintf(stderr, "open %s failed with errno = %d.\n",
> +                dsa_wq_path, errno);

Use error_report and error_setg* for these. Throughout the series.
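For instance (drop-in sketch, untested):

    error_report("open %s failed: %s", dsa_wq_path, strerror(errno));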

> +        return MAP_FAILED;
> +    }
> +    dsa_device = mmap(NULL, DSA_WQ_SIZE, PROT_WRITE,
> +                      MAP_SHARED | MAP_POPULATE, fd, 0);
> +    close(fd);
> +    if (dsa_device == MAP_FAILED) {
> +        fprintf(stderr, "mmap failed with errno = %d.\n", errno);
> +        return MAP_FAILED;
> +    }
> +    return dsa_device;
> +}
> +
> +/**
> + * @brief Initializes a DSA device structure.
> + *
> + * @param instance A pointer to the DSA device.
> + * @param work_queue  A pointer to the DSA work queue.
> + */
> +static void
> +dsa_device_init(struct dsa_device *instance,
> +                void *dsa_work_queue)
> +{
> +    instance->work_queue = dsa_work_queue;
> +}
> +
> +/**
> + * @brief Cleans up a DSA device structure.
> + *
> + * @param instance A pointer to the DSA device to cleanup.
> + */
> +static void
> +dsa_device_cleanup(struct dsa_device *instance)
> +{
> +    if (instance->work_queue != MAP_FAILED) {
> +        munmap(instance->work_queue, DSA_WQ_SIZE);
> +    }
> +}
> +
> +/**
> + * @brief Initializes a DSA device group.
> + *
> + * @param group A pointer to the DSA device group.
> + * @param num_dsa_devices The number of DSA devices this group will have.
> + *
> + * @return Zero if successful, non-zero otherwise.
> + */
> +static int
> +dsa_device_group_init(struct dsa_device_group *group,
> +                      const char *dsa_parameter)

The documentation doesn't match the signature. This happens in other
places as well, please review all of them.

> +{
> +    if (dsa_parameter == NULL || strlen(dsa_parameter) == 0) {
> +        return 0;
> +    }
> +
> +    int ret = 0;
> +    char *local_dsa_parameter = g_strdup(dsa_parameter);
> +    const char *dsa_path[MAX_DSA_DEVICES];
> +    int num_dsa_devices = 0;
> +    char delim[2] = " ";

So we're using space-separated strings. Let's document this in this file
and also in the migration parameter documentation.
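Something along these lines for the QAPI doc, perhaps (wording is only a
suggestion):

    @multifd-dsa-accel: A space-separated list of DSA device work queue
        paths, e.g. "/dev/dsa/wq0.0 /dev/dsa/wq1.0".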

> +
> +    char *current_dsa_path = strtok(local_dsa_parameter, delim);
> +
> +    while (current_dsa_path != NULL) {
> +        dsa_path[num_dsa_devices++] = current_dsa_path;
> +        if (num_dsa_devices == MAX_DSA_DEVICES) {
> +            break;
> +        }
> +        current_dsa_path = strtok(NULL, delim);
> +    }
> +
> +    group->dsa_devices =
> +        malloc(sizeof(struct dsa_device) * num_dsa_devices);

Use g_new0() here.
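i.e.:

    group->dsa_devices = g_new0(struct dsa_device, num_dsa_devices);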

> +    group->num_dsa_devices = num_dsa_devices;
> +    group->index = 0;
> +
> +    group->running = false;
> +    qemu_mutex_init(&group->task_queue_lock);
> +    qemu_cond_init(&group->task_queue_cond);
> +    QSIMPLEQ_INIT(&group->task_queue);
> +
> +    void *dsa_wq = MAP_FAILED;
> +    for (int i = 0; i < num_dsa_devices; i++) {
> +        dsa_wq = map_dsa_device(dsa_path[i]);
> +        if (dsa_wq == MAP_FAILED) {
> +            fprintf(stderr, "map_dsa_device failed MAP_FAILED, "
> +                    "using simulation.\n");

What does "using simulation" mean? And how are we doing that by returning -1
from this function?

> +            ret = -1;

What about the memory for group->dsa_devices in the failure case? We
should either free it here or make sure the client code calls the
cleanup routines.
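e.g. (untested, assuming the g_new0() change suggested above):

    if (dsa_wq == MAP_FAILED) {
        g_free(group->dsa_devices);
        group->dsa_devices = NULL;
        ret = -1;
        goto exit;
    }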

> +            goto exit;
> +        }
> +        dsa_device_init(&dsa_group.dsa_devices[i], dsa_wq);
> +    }
> +
> +exit:
> +    g_free(local_dsa_parameter);
> +    return ret;
> +}
> +
> +/**
> + * @brief Starts a DSA device group.
> + *
> + * @param group A pointer to the DSA device group.
> + * @param dsa_path An array of DSA device path.
> + * @param num_dsa_devices The number of DSA devices in the device group.
> + */
> +static void
> +dsa_device_group_start(struct dsa_device_group *group)
> +{
> +    group->running = true;
> +}
> +
> +/**
> + * @brief Stops a DSA device group.
> + *
> + * @param group A pointer to the DSA device group.
> + */
> +__attribute__((unused))
> +static void
> +dsa_device_group_stop(struct dsa_device_group *group)
> +{
> +    group->running = false;
> +}
> +
> +/**
> + * @brief Cleans up a DSA device group.
> + *
> + * @param group A pointer to the DSA device group.
> + */
> +static void
> +dsa_device_group_cleanup(struct dsa_device_group *group)
> +{
> +    if (!group->dsa_devices) {
> +        return;
> +    }
> +    for (int i = 0; i < group->num_dsa_devices; i++) {
> +        dsa_device_cleanup(&group->dsa_devices[i]);
> +    }
> +    free(group->dsa_devices);
> +    group->dsa_devices = NULL;
> +
> +    qemu_mutex_destroy(&group->task_queue_lock);
> +    qemu_cond_destroy(&group->task_queue_cond);
> +}
> +
> +/**
> + * @brief Returns the next available DSA device in the group.
> + *
> + * @param group A pointer to the DSA device group.
> + *
> + * @return struct dsa_device* A pointer to the next available DSA device
> + *         in the group.
> + */
> +__attribute__((unused))
> +static struct dsa_device *
> +dsa_device_group_get_next_device(struct dsa_device_group *group)
> +{
> +    if (group->num_dsa_devices == 0) {
> +        return NULL;
> +    }
> +    uint32_t current = qatomic_fetch_inc(&group->index);

The name "index" alone feels a bit opaque. Is there a more
representative name we could give it?

> +    current %= group->num_dsa_devices;
> +    return &group->dsa_devices[current];
> +}
> +
> +/**
> + * @brief Check if DSA is running.
> + *
> + * @return True if DSA is running, otherwise false.
> + */
> +bool dsa_is_running(void)
> +{
> +    return false;
> +}
> +
> +static void
> +dsa_globals_init(void)
> +{
> +    max_retry_count = UINT64_MAX;
> +}
> +
> +/**
> + * @brief Initializes DSA devices.
> + *
> + * @param dsa_parameter A list of DSA device path from migration parameter.
> + * @return int Zero if successful, otherwise non zero.
> + */
> +int dsa_init(const char *dsa_parameter)
> +{
> +    dsa_globals_init();
> +
> +    return dsa_device_group_init(&dsa_group, dsa_parameter);
> +}
> +
> +/**
> + * @brief Start logic to enable using DSA.
> + *
> + */
> +void dsa_start(void)
> +{
> +    if (dsa_group.num_dsa_devices == 0) {
> +        return;
> +    }
> +    if (dsa_group.running) {
> +        return;
> +    }
> +    dsa_device_group_start(&dsa_group);
> +}
> +
> +/**
> + * @brief Stop logic to clean up DSA by halting the device group and cleaning up
> + * the completion thread.
> + *
> + */
> +void dsa_stop(void)
> +{
> +    struct dsa_device_group *group = &dsa_group;
> +
> +    if (!group->running) {
> +        return;
> +    }
> +}
> +
> +/**
> + * @brief Clean up system resources created for DSA offloading.
> + *        This function is called during QEMU process teardown.
> + *
> + */
> +void dsa_cleanup(void)
> +{
> +    dsa_stop();
> +    dsa_device_group_cleanup(&dsa_group);
> +}
> +
> +#else
> +
> +bool dsa_is_running(void)
> +{
> +    return false;
> +}
> +
> +int dsa_init(const char *dsa_parameter)
> +{
> +    fprintf(stderr, "Intel Data Streaming Accelerator is not supported "
> +                    "on this platform.\n");
> +    return -1;

Nothing checks this later in the series and we end up trying to start a
migration when we shouldn't. Fixing the configure step would already
stop this happening, but make sure you check this anyway and abort the
migration.
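e.g. at the call site that starts DSA (hypothetical placement;
migrate_multifd_dsa_accel() is the getter added in this series):

    if (dsa_init(migrate_multifd_dsa_accel()) != 0) {
        error_setg(errp, "multifd: failed to initialize DSA devices");
        return false;
    }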

> +}
> +
> +void dsa_start(void) {}
> +
> +void dsa_stop(void) {}
> +
> +void dsa_cleanup(void) {}
> +
> +#endif

These could all be in the header.
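i.e. something like this in dsa.h (untested):

    #else

    static inline bool dsa_is_running(void)
    {
        return false;
    }

    static inline int dsa_init(const char *dsa_parameter)
    {
        error_report("Intel DSA is not supported on this platform.");
        return -1;
    }

    static inline void dsa_start(void) {}
    static inline void dsa_stop(void) {}
    static inline void dsa_cleanup(void) {}

    #endif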

> +
> diff --git a/util/meson.build b/util/meson.build
> index c2322ef6e7..f7277c5e9b 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -85,6 +85,7 @@ if have_block or have_ga
>  endif
>  if have_block
>    util_ss.add(files('aio-wait.c'))
> +  util_ss.add(files('dsa.c'))

I find it clearer to add the file conditionally under CONFIG_DSA_OPT
here and remove the ifdef from the C file. I'm not sure if we have any
guidelines for this, so up to you.

>    util_ss.add(files('buffer.c'))
>    util_ss.add(files('bufferiszero.c'))
>    util_ss.add(files('hbitmap.c'))



* Re: [PATCH v2 08/20] util/dsa: Implement DSA task enqueue and dequeue.
  2023-11-14  5:40 ` [PATCH v2 08/20] util/dsa: Implement DSA task enqueue and dequeue Hao Xiang
@ 2023-12-12 16:10   ` Fabiano Rosas
  2023-12-27  0:07     ` [External] " Hao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Fabiano Rosas @ 2023-12-12 16:10 UTC (permalink / raw)
  To: Hao Xiang, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

Hao Xiang <hao.xiang@bytedance.com> writes:

> * Use a safe thread queue for DSA task enqueue/dequeue.
> * Implement DSA task submission.
> * Implement DSA batch task submission.
>
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> ---
>  include/qemu/dsa.h |  35 ++++++++
>  util/dsa.c         | 196 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 231 insertions(+)
>
> diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> index 30246b507e..23f55185be 100644
> --- a/include/qemu/dsa.h
> +++ b/include/qemu/dsa.h
> @@ -12,6 +12,41 @@
>  #include <linux/idxd.h>
>  #include "x86intrin.h"
>  
> +enum dsa_task_type {

Our coding style requires CamelCase for enums and typedef'ed structures.

> +    DSA_TASK = 0,
> +    DSA_BATCH_TASK
> +};
> +
> +enum dsa_task_status {
> +    DSA_TASK_READY = 0,
> +    DSA_TASK_PROCESSING,
> +    DSA_TASK_COMPLETION
> +};
> +
> +typedef void (*buffer_zero_dsa_completion_fn)(void *);

We don't really need the "buffer_zero" mention in any of this
code. Simply dsa_batch_task or batch_task would suffice.

> +
> +typedef struct buffer_zero_batch_task {
> +    struct dsa_hw_desc batch_descriptor;
> +    struct dsa_hw_desc *descriptors;
> +    struct dsa_completion_record batch_completion __attribute__((aligned(32)));
> +    struct dsa_completion_record *completions;
> +    struct dsa_device_group *group;
> +    struct dsa_device *device;
> +    buffer_zero_dsa_completion_fn completion_callback;
> +    QemuSemaphore sem_task_complete;
> +    enum dsa_task_type task_type;
> +    enum dsa_task_status status;
> +    bool *results;
> +    int batch_size;
> +    QSIMPLEQ_ENTRY(buffer_zero_batch_task) entry;
> +} buffer_zero_batch_task;

I see data specific to this implementation and data coming from the
library, maybe these would be better organized in two separate
structures with the qemu-specific having a pointer to the generic
one. Looking ahead in the series, there seems to be migration data
coming into this as well.

> +
> +#else
> +
> +struct buffer_zero_batch_task {
> +    bool *results;
> +};
> +
>  #endif
>  
>  /**
> diff --git a/util/dsa.c b/util/dsa.c
> index 8edaa892ec..f82282ce99 100644
> --- a/util/dsa.c
> +++ b/util/dsa.c
> @@ -245,6 +245,200 @@ dsa_device_group_get_next_device(struct dsa_device_group *group)
>      return &group->dsa_devices[current];
>  }
>  
> +/**
> + * @brief Empties out the DSA task queue.
> + *
> + * @param group A pointer to the DSA device group.
> + */
> +static void
> +dsa_empty_task_queue(struct dsa_device_group *group)
> +{
> +    qemu_mutex_lock(&group->task_queue_lock);
> +    dsa_task_queue *task_queue = &group->task_queue;
> +    while (!QSIMPLEQ_EMPTY(task_queue)) {
> +        QSIMPLEQ_REMOVE_HEAD(task_queue, entry);
> +    }
> +    qemu_mutex_unlock(&group->task_queue_lock);
> +}
> +
> +/**
> + * @brief Adds a task to the DSA task queue.
> + *
> + * @param group A pointer to the DSA device group.
> + * @param context A pointer to the DSA task to enqueue.
> + *
> + * @return int Zero if successful, otherwise a proper error code.
> + */
> +static int
> +dsa_task_enqueue(struct dsa_device_group *group,
> +                 struct buffer_zero_batch_task *task)
> +{
> +    dsa_task_queue *task_queue = &group->task_queue;
> +    QemuMutex *task_queue_lock = &group->task_queue_lock;
> +    QemuCond *task_queue_cond = &group->task_queue_cond;
> +
> +    bool notify = false;
> +
> +    qemu_mutex_lock(task_queue_lock);
> +
> +    if (!group->running) {
> +        fprintf(stderr, "DSA: Tried to queue task to stopped device queue\n");
> +        qemu_mutex_unlock(task_queue_lock);
> +        return -1;
> +    }
> +
> +    // The queue is empty. This enqueue operation is a 0->1 transition.
> +    if (QSIMPLEQ_EMPTY(task_queue))
> +        notify = true;
> +
> +    QSIMPLEQ_INSERT_TAIL(task_queue, task, entry);
> +
> +    // We need to notify the waiter for 0->1 transitions.
> +    if (notify)
> +        qemu_cond_signal(task_queue_cond);
> +
> +    qemu_mutex_unlock(task_queue_lock);
> +
> +    return 0;
> +}
> +
> +/**
> + * @brief Takes a DSA task out of the task queue.
> + *
> + * @param group A pointer to the DSA device group.
> + * @return buffer_zero_batch_task* The DSA task being dequeued.
> + */
> +__attribute__((unused))
> +static struct buffer_zero_batch_task *
> +dsa_task_dequeue(struct dsa_device_group *group)
> +{
> +    struct buffer_zero_batch_task *task = NULL;
> +    dsa_task_queue *task_queue = &group->task_queue;
> +    QemuMutex *task_queue_lock = &group->task_queue_lock;
> +    QemuCond *task_queue_cond = &group->task_queue_cond;
> +
> +    qemu_mutex_lock(task_queue_lock);
> +
> +    while (true) {
> +        if (!group->running)
> +            goto exit;
> +        task = QSIMPLEQ_FIRST(task_queue);
> +        if (task != NULL) {
> +            break;
> +        }
> +        qemu_cond_wait(task_queue_cond, task_queue_lock);
> +    }
> +
> +    QSIMPLEQ_REMOVE_HEAD(task_queue, entry);
> +
> +exit:
> +    qemu_mutex_unlock(task_queue_lock);
> +    return task;
> +}
> +
> +/**
> + * @brief Submits a DSA work item to the device work queue.
> + *
> + * @param wq A pointer to the DSA work queue's device memory.
> + * @param descriptor A pointer to the DSA work item descriptor.
> + *
> + * @return Zero if successful, non-zero otherwise.
> + */
> +static int
> +submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
> +{
> +    uint64_t retry = 0;
> +
> +    _mm_sfence();
> +
> +    while (true) {
> +        if (_enqcmd(wq, descriptor) == 0) {
> +            break;
> +        }
> +        retry++;
> +        if (retry > max_retry_count) {

'max_retry_count' is UINT64_MAX so 'retry' will wrap around.

> +            fprintf(stderr, "Submit work retry %lu times.\n", retry);
> +            exit(1);

Is this not the case where we'd fallback to the CPU?

You should not exit() here, but return non-zero as the documentation
mentions and the callers expect.
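e.g. (rough sketch, untested; max_retry_count also needs a finite default
for this to ever trigger):

        retry++;
        if (retry > max_retry_count) {
            error_report("DSA submit retried %" PRIu64 " times", retry);
            return -1;
        }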

> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * @brief Synchronously submits a DSA work item to the
> + *        device work queue.
> + *
> + * @param wq A pointer to the DSA work queue's device memory.
> + * @param descriptor A pointer to the DSA work item descriptor.
> + *
> + * @return int Zero if successful, non-zero otherwise.
> + */
> +__attribute__((unused))
> +static int
> +submit_wi(void *wq, struct dsa_hw_desc *descriptor)
> +{
> +    return submit_wi_int(wq, descriptor);
> +}
> +
> +/**
> + * @brief Asynchronously submits a DSA work item to the
> + *        device work queue.
> + *
> + * @param task A pointer to the buffer zero task.
> + *
> + * @return int Zero if successful, non-zero otherwise.
> + */
> +__attribute__((unused))
> +static int
> +submit_wi_async(struct buffer_zero_batch_task *task)
> +{
> +    struct dsa_device_group *device_group = task->group;
> +    struct dsa_device *device_instance = task->device;
> +    int ret;
> +
> +    assert(task->task_type == DSA_TASK);
> +
> +    task->status = DSA_TASK_PROCESSING;
> +
> +    ret = submit_wi_int(device_instance->work_queue,
> +                        &task->descriptors[0]);
> +    if (ret != 0)
> +        return ret;
> +
> +    return dsa_task_enqueue(device_group, task);
> +}
> +
> +/**
> + * @brief Asynchronously submits a DSA batch work item to the
> + *        device work queue.
> + *
> + * @param batch_task A pointer to the batch buffer zero task.
> + *
> + * @return int Zero if successful, non-zero otherwise.
> + */
> +__attribute__((unused))
> +static int
> +submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
> +{
> +    struct dsa_device_group *device_group = batch_task->group;
> +    struct dsa_device *device_instance = batch_task->device;
> +    int ret;
> +
> +    assert(batch_task->task_type == DSA_BATCH_TASK);
> +    assert(batch_task->batch_descriptor.desc_count <= batch_task->batch_size);
> +    assert(batch_task->status == DSA_TASK_READY);
> +
> +    batch_task->status = DSA_TASK_PROCESSING;
> +
> +    ret = submit_wi_int(device_instance->work_queue,
> +                        &batch_task->batch_descriptor);
> +    if (ret != 0)
> +        return ret;
> +
> +    return dsa_task_enqueue(device_group, batch_task);
> +}

At this point in the series submit_wi_async() and
submit_batch_wi_async() look the same to me without the asserts. Can't
we consolidate them?

There's also the fact that both functions receive a _batch_ task but one
is supposed to work in batches and the other is not. That could be
solved by renaming the structure I guess.
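Rough consolidation sketch (untested; the name is made up):

    static int
    submit_task_async(struct buffer_zero_batch_task *task)
    {
        struct dsa_hw_desc *descriptor;
        int ret;

        assert(task->status == DSA_TASK_READY);

        if (task->task_type == DSA_BATCH_TASK) {
            assert(task->batch_descriptor.desc_count <= task->batch_size);
            descriptor = &task->batch_descriptor;
        } else {
            assert(task->task_type == DSA_TASK);
            descriptor = &task->descriptors[0];
        }

        task->status = DSA_TASK_PROCESSING;

        ret = submit_wi_int(task->device->work_queue, descriptor);
        if (ret != 0) {
            return ret;
        }
        return dsa_task_enqueue(task->group, task);
    }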

> +
>  /**
>   * @brief Check if DSA is running.
>   *
> @@ -301,6 +495,8 @@ void dsa_stop(void)
>      if (!group->running) {
>          return;
>      }
> +
> +    dsa_empty_task_queue(group);
>  }
>  
>  /**



* Re: [PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model.
  2023-11-14  5:40 ` [PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model Hao Xiang
@ 2023-12-12 19:36   ` Fabiano Rosas
  2023-12-18  3:11   ` Wang, Lei
  1 sibling, 0 replies; 51+ messages in thread
From: Fabiano Rosas @ 2023-12-12 19:36 UTC (permalink / raw)
  To: Hao Xiang, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

Hao Xiang <hao.xiang@bytedance.com> writes:

> * Create a dedicated thread for DSA task completion.
> * DSA completion thread runs a loop and poll for completed tasks.
> * Start and stop DSA completion thread during DSA device start stop.
>
> User space application can directly submit task to Intel DSA
> accelerator by writing to DSA's device memory (mapped in user space).
> Once a task is submitted, the device starts processing it and write
> the completion status back to the task. A user space application can
> poll the task's completion status to check for completion. This change
> uses a dedicated thread to perform DSA task completion checking.
>
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> ---
>  util/dsa.c | 243 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 242 insertions(+), 1 deletion(-)
>
> diff --git a/util/dsa.c b/util/dsa.c
> index f82282ce99..0e68013ffb 100644
> --- a/util/dsa.c
> +++ b/util/dsa.c
> @@ -44,6 +44,7 @@
>  
>  #define DSA_WQ_SIZE 4096
>  #define MAX_DSA_DEVICES 16
> +#define DSA_COMPLETION_THREAD "dsa_completion"
>  
>  typedef QSIMPLEQ_HEAD(dsa_task_queue, buffer_zero_batch_task) dsa_task_queue;
>  
> @@ -61,8 +62,18 @@ struct dsa_device_group {
>      dsa_task_queue task_queue;
>  };
>  
> +struct dsa_completion_thread {
> +    bool stopping;
> +    bool running;
> +    QemuThread thread;
> +    int thread_id;
> +    QemuSemaphore sem_init_done;
> +    struct dsa_device_group *group;
> +};
> +
>  uint64_t max_retry_count;
>  static struct dsa_device_group dsa_group;
> +static struct dsa_completion_thread completion_thread;
>  
>  
>  /**
> @@ -439,6 +450,234 @@ submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
>      return dsa_task_enqueue(device_group, batch_task);
>  }
>  
> +/**
> + * @brief Poll for the DSA work item completion.
> + *
> + * @param completion A pointer to the DSA work item completion record.
> + * @param opcode The DSA opcode.
> + *
> + * @return Zero if successful, non-zero otherwise.
> + */
> +static int
> +poll_completion(struct dsa_completion_record *completion,
> +                enum dsa_opcode opcode)
> +{
> +    uint8_t status;
> +    uint64_t retry = 0;
> +
> +    while (true) {
> +        // The DSA operation completes successfully or fails.
> +        status = completion->status;
> +        if (status == DSA_COMP_SUCCESS ||

Should we read directly from completion->status or is the compiler smart
enough to not optimize 'status' out?

> +            status == DSA_COMP_PAGE_FAULT_NOBOF ||
> +            status == DSA_COMP_BATCH_PAGE_FAULT ||
> +            status == DSA_COMP_BATCH_FAIL) {
> +            break;
> +        } else if (status != DSA_COMP_NONE) {
> +            /* TODO: Error handling here on unexpected failure. */

Let's make sure this is dealt with before merging.

> +            fprintf(stderr, "DSA opcode %d failed with status = %d.\n",
> +                    opcode, status);
> +            exit(1);

return instead of exiting.

> +        }
> +        retry++;
> +        if (retry > max_retry_count) {
> +            fprintf(stderr, "Wait for completion retry %lu times.\n", retry);
> +            exit(1);

same here

> +        }
> +        _mm_pause();
> +    }
> +
> +    return 0;
> +}
> +
> +/**
> + * @brief Complete a single DSA task in the batch task.
> + *
> + * @param task A pointer to the batch task structure.
> + */
> +static void
> +poll_task_completion(struct buffer_zero_batch_task *task)
> +{
> +    assert(task->task_type == DSA_TASK);
> +
> +    struct dsa_completion_record *completion = &task->completions[0];
> +    uint8_t status;
> +
> +    poll_completion(completion, task->descriptors[0].opcode);
> +
> +    status = completion->status;
> +    if (status == DSA_COMP_SUCCESS) {
> +        task->results[0] = (completion->result == 0);
> +        return;
> +    }
> +
> +    assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
> +}
> +
> +/**
> + * @brief Poll a batch task status until it completes. If DSA task doesn't
> + *        complete properly, use CPU to complete the task.
> + *
> + * @param batch_task A pointer to the DSA batch task.
> + */
> +static void
> +poll_batch_task_completion(struct buffer_zero_batch_task *batch_task)
> +{
> +    struct dsa_completion_record *batch_completion = &batch_task->batch_completion;
> +    struct dsa_completion_record *completion;
> +    uint8_t batch_status;
> +    uint8_t status;
> +    bool *results = batch_task->results;
> +    uint32_t count = batch_task->batch_descriptor.desc_count;
> +
> +    poll_completion(batch_completion,
> +                    batch_task->batch_descriptor.opcode);
> +
> +    batch_status = batch_completion->status;
> +
> +    if (batch_status == DSA_COMP_SUCCESS) {
> +        if (batch_completion->bytes_completed == count) {
> +            // Let's skip checking for each descriptors' completion status
> +            // if the batch descriptor says all succedded.
> +            for (int i = 0; i < count; i++) {
> +                assert(batch_task->completions[i].status == DSA_COMP_SUCCESS);
> +                results[i] = (batch_task->completions[i].result == 0);
> +            }
> +            return;
> +        }
> +    } else {
> +        assert(batch_status == DSA_COMP_BATCH_FAIL ||
> +            batch_status == DSA_COMP_BATCH_PAGE_FAULT);
> +    }
> +
> +    for (int i = 0; i < count; i++) {
> +

extra whitespace

> +        completion = &batch_task->completions[i];
> +        status = completion->status;
> +
> +        if (status == DSA_COMP_SUCCESS) {
> +            results[i] = (completion->result == 0);
> +            continue;
> +        }
> +
> +        if (status != DSA_COMP_PAGE_FAULT_NOBOF) {
> +            fprintf(stderr,
> +                    "Unexpected completion status = %u.\n", status);
> +            assert(false);

return here

> +        }
> +    }
> +}
> +
> +/**
> + * @brief Handles an asynchronous DSA batch task completion.
> + *
> + * @param task A pointer to the batch buffer zero task structure.
> + */
> +static void
> +dsa_batch_task_complete(struct buffer_zero_batch_task *batch_task)
> +{
> +    batch_task->status = DSA_TASK_COMPLETION;
> +    batch_task->completion_callback(batch_task);
> +}
> +
> +/**
> + * @brief The function entry point called by a dedicated DSA
> + *        work item completion thread.
> + *
> + * @param opaque A pointer to the thread context.
> + *
> + * @return void* Not used.
> + */
> +static void *
> +dsa_completion_loop(void *opaque)
> +{
> +    struct dsa_completion_thread *thread_context =
> +        (struct dsa_completion_thread *)opaque;
> +    struct buffer_zero_batch_task *batch_task;
> +    struct dsa_device_group *group = thread_context->group;
> +
> +    rcu_register_thread();
> +
> +    thread_context->thread_id = qemu_get_thread_id();
> +    qemu_sem_post(&thread_context->sem_init_done);
> +
> +    while (thread_context->running) {
> +        batch_task = dsa_task_dequeue(group);
> +        assert(batch_task != NULL || !group->running);
> +        if (!group->running) {
> +            assert(!thread_context->running);

This is racy if the compiler reorders "thread_context->running = false"
and "group->running = false". I'd put this under the task_queue_lock or
add a compiler barrier at dsa_completion_thread_stop().
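e.g. taking the lock for the check (untested):

        batch_task = dsa_task_dequeue(group);
        if (!batch_task) {
            /* dequeue only returns NULL once group->running is false */
            qemu_mutex_lock(&group->task_queue_lock);
            assert(!thread_context->running);
            qemu_mutex_unlock(&group->task_queue_lock);
            break;
        }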

> +            break;
> +        }
> +        if (batch_task->task_type == DSA_TASK) {
> +            poll_task_completion(batch_task);
> +        } else {
> +            assert(batch_task->task_type == DSA_BATCH_TASK);
> +            poll_batch_task_completion(batch_task);
> +        }
> +
> +        dsa_batch_task_complete(batch_task);
> +    }
> +
> +    rcu_unregister_thread();
> +    return NULL;
> +}
> +
> +/**
> + * @brief Initializes a DSA completion thread.
> + *
> + * @param completion_thread A pointer to the completion thread context.
> + * @param group A pointer to the DSA device group.
> + */
> +static void
> +dsa_completion_thread_init(
> +    struct dsa_completion_thread *completion_thread,
> +    struct dsa_device_group *group)
> +{
> +    completion_thread->stopping = false;
> +    completion_thread->running = true;
> +    completion_thread->thread_id = -1;
> +    qemu_sem_init(&completion_thread->sem_init_done, 0);
> +    completion_thread->group = group;
> +
> +    qemu_thread_create(&completion_thread->thread,
> +                       DSA_COMPLETION_THREAD,
> +                       dsa_completion_loop,
> +                       completion_thread,
> +                       QEMU_THREAD_JOINABLE);
> +
> +    /* Wait for initialization to complete */
> +    while (completion_thread->thread_id == -1) {
> +        qemu_sem_wait(&completion_thread->sem_init_done);
> +    }

This is racy: the thread can set 'thread_id' before this enters the loop,
and the semaphore will be left unmatched. Not a huge deal, but it might
cause confusion when debugging the initialization.
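Since there is exactly one post, a single unconditional wait would do:

    /* Wait for initialization to complete */
    qemu_sem_wait(&completion_thread->sem_init_done);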

> +}
> +
> +/**
> + * @brief Stops the completion thread (and implicitly, the device group).
> + *
> + * @param opaque A pointer to the completion thread.
> + */
> +static void dsa_completion_thread_stop(void *opaque)
> +{
> +    struct dsa_completion_thread *thread_context =
> +        (struct dsa_completion_thread *)opaque;
> +
> +    struct dsa_device_group *group = thread_context->group;
> +
> +    qemu_mutex_lock(&group->task_queue_lock);
> +
> +    thread_context->stopping = true;
> +    thread_context->running = false;
> +
> +    dsa_device_group_stop(group);
> +
> +    qemu_cond_signal(&group->task_queue_cond);
> +    qemu_mutex_unlock(&group->task_queue_lock);
> +
> +    qemu_thread_join(&thread_context->thread);
> +
> +    qemu_sem_destroy(&thread_context->sem_init_done);
> +}
> +
>  /**
>   * @brief Check if DSA is running.
>   *
> @@ -446,7 +685,7 @@ submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
>   */
>  bool dsa_is_running(void)
>  {
> -    return false;
> +    return completion_thread.running;
>  }
>  
>  static void
> @@ -481,6 +720,7 @@ void dsa_start(void)
>          return;
>      }
>      dsa_device_group_start(&dsa_group);
> +    dsa_completion_thread_init(&completion_thread, &dsa_group);
>  }
>  
>  /**
> @@ -496,6 +736,7 @@ void dsa_stop(void)
>          return;
>      }
>  
> +    dsa_completion_thread_stop(&completion_thread);
>      dsa_empty_task_queue(group);
>  }



* Re: [PATCH v2 11/20] util/dsa: Implement DSA task asynchronous submission and wait for completion.
  2023-11-14  5:40 ` [PATCH v2 11/20] util/dsa: Implement DSA task asynchronous submission and wait for completion Hao Xiang
@ 2023-12-13 14:01   ` Fabiano Rosas
  2023-12-27  6:26     ` [External] " Hao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Fabiano Rosas @ 2023-12-13 14:01 UTC (permalink / raw)
  To: Hao Xiang, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

Hao Xiang <hao.xiang@bytedance.com> writes:

> * Add a DSA task completion callback.
> * DSA completion thread will call the tasks's completion callback
> on every task/batch task completion.
> * DSA submission path to wait for completion.
> * Implement CPU fallback if DSA is not able to complete the task.
>
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
> ---
>  include/qemu/dsa.h |  14 +++++
>  util/dsa.c         | 153 ++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 164 insertions(+), 3 deletions(-)
>
> diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> index b10e7b8fb7..3f8ee07004 100644
> --- a/include/qemu/dsa.h
> +++ b/include/qemu/dsa.h
> @@ -65,6 +65,20 @@ void buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
>   */
>  void buffer_zero_batch_task_destroy(struct buffer_zero_batch_task *task);
>  
> +/**
> + * @brief Performs buffer zero comparison on a DSA batch task asynchronously.
> + *
> + * @param batch_task A pointer to the batch task.
> + * @param buf An array of memory buffers.
> + * @param count The number of buffers in the array.
> + * @param len The buffer length.
> + *
> + * @return Zero if successful, otherwise non-zero.
> + */
> +int
> +buffer_is_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
> +                               const void **buf, size_t count, size_t len);
> +
>  /**
>   * @brief Initializes DSA devices.
>   *
> diff --git a/util/dsa.c b/util/dsa.c
> index 3cc017b8a0..06c6fbf2ca 100644
> --- a/util/dsa.c
> +++ b/util/dsa.c
> @@ -470,6 +470,41 @@ poll_completion(struct dsa_completion_record *completion,
>      return 0;
>  }
>  
> +/**
> + * @brief Use CPU to complete a single zero page checking task.
> + *
> + * @param task A pointer to the task.
> + */
> +static void
> +task_cpu_fallback(struct buffer_zero_batch_task *task)
> +{
> +    assert(task->task_type == DSA_TASK);
> +
> +    struct dsa_completion_record *completion = &task->completions[0];
> +    const uint8_t *buf;
> +    size_t len;
> +
> +    if (completion->status == DSA_COMP_SUCCESS) {
> +        return;
> +    }
> +
> +    /*
> +     * DSA was able to partially complete the operation. Check the
> +     * result. If we already know this is not a zero page, we can
> +     * return now.
> +     */
> +    if (completion->bytes_completed != 0 && completion->result != 0) {
> +        task->results[0] = false;
> +        return;
> +    }
> +
> +    /* Let's fallback to use CPU to complete it. */
> +    buf = (const uint8_t *)task->descriptors[0].src_addr;
> +    len = task->descriptors[0].xfer_size;
> +    task->results[0] = buffer_is_zero(buf + completion->bytes_completed,
> +                                      len - completion->bytes_completed);
> +}
> +
>  /**
>   * @brief Complete a single DSA task in the batch task.
>   *
> @@ -548,6 +583,62 @@ poll_batch_task_completion(struct buffer_zero_batch_task *batch_task)
>      }
>  }
>  
> +/**
> + * @brief Use CPU to complete the zero page checking batch task.
> + *
> + * @param batch_task A pointer to the batch task.
> + */
> +static void
> +batch_task_cpu_fallback(struct buffer_zero_batch_task *batch_task)
> +{
> +    assert(batch_task->task_type == DSA_BATCH_TASK);
> +
> +    struct dsa_completion_record *batch_completion =
> +        &batch_task->batch_completion;
> +    struct dsa_completion_record *completion;
> +    uint8_t status;
> +    const uint8_t *buf;
> +    size_t len;
> +    bool *results = batch_task->results;
> +    uint32_t count = batch_task->batch_descriptor.desc_count;
> +
> +    // DSA is able to complete the entire batch task.
> +    if (batch_completion->status == DSA_COMP_SUCCESS) {
> +        assert(count == batch_completion->bytes_completed);
> +        return;
> +    }
> +
> +    /*
> +     * DSA encounters some error and is not able to complete
> +     * the entire batch task. Use CPU fallback.
> +     */
> +    for (int i = 0; i < count; i++) {
> +        completion = &batch_task->completions[i];
> +        status = completion->status;
> +        if (status == DSA_COMP_SUCCESS) {
> +            continue;
> +        }
> +        assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
> +
> +        /*
> +         * DSA was able to partially complete the operation. Check the
> +         * result. If we already know this is not a zero page, we can
> +         * return now.
> +         */
> +        if (completion->bytes_completed != 0 && completion->result != 0) {
> +            results[i] = false;
> +            continue;
> +        }
> +
> +        /* Let's fallback to use CPU to complete it. */
> +        buf = (uint8_t *)batch_task->descriptors[i].src_addr;
> +        len = batch_task->descriptors[i].xfer_size;
> +        results[i] =
> +            buffer_is_zero(buf + completion->bytes_completed,
> +                           len - completion->bytes_completed);

Here the same thing is happening as in other patches, the batch task
operation is just a repeat of the task operation n times. So this whole
inner code here could be nicely replaced by task_cpu_fallback() with
some adjustment of the function arguments. That makes intuitive sense
and removes code duplication.
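e.g. a shared per-descriptor helper, roughly (untested):

    static void
    completion_cpu_fallback(struct dsa_completion_record *completion,
                            struct dsa_hw_desc *descriptor,
                            bool *result)
    {
        const uint8_t *buf;
        size_t len;

        if (completion->status == DSA_COMP_SUCCESS) {
            return;
        }

        /* A partial non-zero result already proves a non-zero page. */
        if (completion->bytes_completed != 0 && completion->result != 0) {
            *result = false;
            return;
        }

        buf = (const uint8_t *)descriptor->src_addr;
        len = descriptor->xfer_size;
        *result = buffer_is_zero(buf + completion->bytes_completed,
                                 len - completion->bytes_completed);
    }

Then task_cpu_fallback() calls it once with index 0, and the batch
version just loops over the descriptors.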

> +    }
> +}
> +
>  /**
>   * @brief Handles an asynchronous DSA batch task completion.
>   *
> @@ -825,7 +916,6 @@ buffer_zero_batch_task_set(struct buffer_zero_batch_task *batch_task,
>   *
>   * @return int Zero if successful, otherwise an appropriate error code.
>   */
> -__attribute__((unused))
>  static int
>  buffer_zero_dsa_async(struct buffer_zero_batch_task *task,
>                        const void *buf, size_t len)
> @@ -844,7 +934,6 @@ buffer_zero_dsa_async(struct buffer_zero_batch_task *task,
>   * @param count The number of buffers.
>   * @param len The buffer length.
>   */
> -__attribute__((unused))
>  static int
>  buffer_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
>                              const void **buf, size_t count, size_t len)
> @@ -876,13 +965,29 @@ buffer_zero_dsa_completion(void *context)
>   *
>   * @param batch_task A pointer to the buffer zero comparison batch task.
>   */
> -__attribute__((unused))
>  static void
>  buffer_zero_dsa_wait(struct buffer_zero_batch_task *batch_task)
>  {
>      qemu_sem_wait(&batch_task->sem_task_complete);
>  }
>  
> +/**
> + * @brief Use CPU to complete the zero page checking task if DSA
> + *        is not able to complete it.
> + *
> + * @param batch_task A pointer to the batch task.
> + */
> +static void
> +buffer_zero_cpu_fallback(struct buffer_zero_batch_task *batch_task)
> +{
> +    if (batch_task->task_type == DSA_TASK) {
> +        task_cpu_fallback(batch_task);
> +    } else {
> +        assert(batch_task->task_type == DSA_BATCH_TASK);
> +        batch_task_cpu_fallback(batch_task);
> +    }
> +}
> +
>  /**
>   * @brief Check if DSA is running.
>   *
> @@ -956,6 +1061,41 @@ void dsa_cleanup(void)
>      dsa_device_group_cleanup(&dsa_group);
>  }
>  
> +/**
> + * @brief Performs buffer zero comparison on a DSA batch task asynchronously.
> + *
> + * @param batch_task A pointer to the batch task.
> + * @param buf An array of memory buffers.
> + * @param count The number of buffers in the array.
> + * @param len The buffer length.
> + *
> + * @return Zero if successful, otherwise non-zero.
> + */
> +int
> +buffer_is_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
> +                               const void **buf, size_t count, size_t len)
> +{
> +    if (count <= 0 || count > batch_task->batch_size) {
> +        return -1;
> +    }
> +
> +    assert(batch_task != NULL);
> +    assert(len != 0);
> +    assert(buf != NULL);
> +
> +    if (count == 1) {
> +        // DSA doesn't take batch operation with only 1 task.
> +        buffer_zero_dsa_async(batch_task, buf[0], len);
> +    } else {
> +        buffer_zero_dsa_batch_async(batch_task, buf, count, len);
> +    }
> +
> +    buffer_zero_dsa_wait(batch_task);
> +    buffer_zero_cpu_fallback(batch_task);
> +
> +    return 0;
> +}
> +
>  #else
>  
>  void buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
> @@ -981,5 +1121,12 @@ void dsa_stop(void) {}
>  
>  void dsa_cleanup(void) {}
>  
> +int
> +buffer_is_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
> +                               const void **buf, size_t count, size_t len)
> +{
> +    exit(1);
> +}
> +
>  #endif



* Re: [PATCH v2 18/20] migration/multifd: Enable set packet size migration option.
  2023-11-14  5:40 ` [PATCH v2 18/20] migration/multifd: Enable set packet size migration option Hao Xiang
@ 2023-12-13 17:33   ` Fabiano Rosas
  2024-01-03 20:04     ` [External] " Hao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Fabiano Rosas @ 2023-12-13 17:33 UTC (permalink / raw)
  To: Hao Xiang, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel
  Cc: Hao Xiang

Hao Xiang <hao.xiang@bytedance.com> writes:

> During live migration, if the latency between sender and receiver
> is high but bandwidth is high (a long and fat pipe), using a bigger
> packet size can help reduce migration total time. In addition, Intel
> DSA offloading performs better with a large batch task. Providing an
> option to set the packet size is useful for performance tuning.
>
> Set the option:
> migrate_set_parameter multifd-packet-size 512

This should continue being bytes; we just need code enforcing it to be a
multiple of the page size in migrate_params_check().
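e.g. in migrate_params_check() (untested; assuming the field names this
patch adds):

    if (params->has_multifd_packet_size &&
        (params->multifd_packet_size % qemu_target_page_size() != 0)) {
        error_setg(errp, "Parameter 'multifd-packet-size' must be a "
                   "multiple of the target page size");
        return false;
    }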




* Re: [External] Re: [PATCH v2 05/20] meson: Introduce new instruction set enqcmd to the build system.
  2023-12-11 15:41   ` Fabiano Rosas
@ 2023-12-16  0:26     ` Hao Xiang
  0 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-12-16  0:26 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: peter.maydell, quintela, peterx, marcandre.lureau, bryan.zhang,
	qemu-devel

On Mon, Dec 11, 2023 at 7:41 AM Fabiano Rosas <farosas@suse.de> wrote:
>
> Hao Xiang <hao.xiang@bytedance.com> writes:
>
> > Enable instruction set enqcmd in build.
> >
> > Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> > ---
> >  meson.build                   | 2 ++
> >  meson_options.txt             | 2 ++
> >  scripts/meson-buildoptions.sh | 3 +++
> >  3 files changed, 7 insertions(+)
> >
> > diff --git a/meson.build b/meson.build
> > index ec01f8b138..1292ab78a3 100644
> > --- a/meson.build
> > +++ b/meson.build
> > @@ -2708,6 +2708,8 @@ config_host_data.set('CONFIG_AVX512BW_OPT', get_option('avx512bw') \
> >      int main(int argc, char *argv[]) { return bar(argv[0]); }
> >    '''), error_message: 'AVX512BW not available').allowed())
> >
> > +config_host_data.set('CONFIG_DSA_OPT', get_option('enqcmd'))
>
> We need some sort of detection at configure time whether the feature is
> available. There are different compilers and compiler versions,
> different Intel CPU versions, different CPU vendors, different
> architectures, etc. Not all combinations will support DSA. Check avx512
> above.
>

Will fix it in the next version.
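Something like this compile/link test, modeled on the avx512bw check
above, should work (untested):

    #include <stdint.h>
    #include <x86intrin.h>
    static int __attribute__((target("enqcmd"))) bar(void *a)
    {
        uint64_t desc[8] = { 0 };
        return _enqcmd(a, desc);
    }
    int main(int argc, char *argv[])
    {
        return bar(argv[0]);
    }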



* Re: [PATCH v2 03/20] multifd: Zero pages transmission
  2023-11-14  5:40 ` [PATCH v2 03/20] multifd: Zero " Hao Xiang
@ 2023-12-18  2:43   ` Wang, Lei
  0 siblings, 0 replies; 51+ messages in thread
From: Wang, Lei @ 2023-12-18  2:43 UTC (permalink / raw)
  To: Hao Xiang, farosas, peter.maydell, quintela, peterx,
	marcandre.lureau, bryan.zhang, qemu-devel

On 11/14/2023 13:40, Hao Xiang wrote:
> From: Juan Quintela <quintela@redhat.com>
> 
> This implements the zero page dection and handling.

s/dection/detection

> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  migration/multifd.c | 41 +++++++++++++++++++++++++++++++++++++++--
>  migration/multifd.h |  5 +++++

[...]

> +    /*
> +     * This array contains the pointers to:

It contains offsets into the RAMBlock, not real pointers.

> +     *  - normal pages (initial normal_pages entries)
> +     *  - zero pages (following zero_pages entries)
> +     */
>      uint64_t offset[];
>  } __attribute__((packed)) MultiFDPacket_t;
>  



* Re: [PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model.
  2023-11-14  5:40 ` [PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model Hao Xiang
  2023-12-12 19:36   ` Fabiano Rosas
@ 2023-12-18  3:11   ` Wang, Lei
  2023-12-18 18:57     ` [External] " Hao Xiang
  1 sibling, 1 reply; 51+ messages in thread
From: Wang, Lei @ 2023-12-18  3:11 UTC (permalink / raw)
  To: Hao Xiang, farosas, peter.maydell, quintela, peterx,
	marcandre.lureau, bryan.zhang, qemu-devel

On 11/14/2023 13:40, Hao Xiang wrote:
> * Create a dedicated thread for DSA task completion.
> * DSA completion thread runs a loop and poll for completed tasks.
> * Start and stop DSA completion thread during DSA device start stop.
> 
> User space application can directly submit task to Intel DSA
> accelerator by writing to DSA's device memory (mapped in user space).

> +            }
> +            return;
> +        }
> +    } else {
> +        assert(batch_status == DSA_COMP_BATCH_FAIL ||
> +            batch_status == DSA_COMP_BATCH_PAGE_FAULT);

Nit: indentation is broken here.

> +    }
> +
> +    for (int i = 0; i < count; i++) {
> +
> +        completion = &batch_task->completions[i];
> +        status = completion->status;
> +
> +        if (status == DSA_COMP_SUCCESS) {
> +            results[i] = (completion->result == 0);
> +            continue;
> +        }
> +
> +        if (status != DSA_COMP_PAGE_FAULT_NOBOF) {
> +            fprintf(stderr,
> +                    "Unexpected completion status = %u.\n", status);
> +            assert(false);
> +        }
> +    }
> +}
> +
> +/**
> + * @brief Handles an asynchronous DSA batch task completion.
> + *
> + * @param task A pointer to the batch buffer zero task structure.
> + */
> +static void
> +dsa_batch_task_complete(struct buffer_zero_batch_task *batch_task)
> +{
> +    batch_task->status = DSA_TASK_COMPLETION;
> +    batch_task->completion_callback(batch_task);
> +}
> +
> +/**
> + * @brief The function entry point called by a dedicated DSA
> + *        work item completion thread.
> + *
> + * @param opaque A pointer to the thread context.
> + *
> + * @return void* Not used.
> + */
> +static void *
> +dsa_completion_loop(void *opaque)

Per my understanding, if each multifd sending thread corresponds to a DSA
device, the batch tasks execute in parallel, which means a task enqueued
earlier may still complete later than another. If we poll on the slower
task first, it blocks the handling of the faster one: even though that
thread's zero-page checking is finished and it could go ahead and send
its data to the wire, it has to wait. This may lower network resource
utilization.

> +{
> +    struct dsa_completion_thread *thread_context =
> +        (struct dsa_completion_thread *)opaque;
> +    struct buffer_zero_batch_task *batch_task;
> +    struct dsa_device_group *group = thread_context->group;
> +
> +    rcu_register_thread();
> +
> +    thread_context->thread_id = qemu_get_thread_id();
> +    qemu_sem_post(&thread_context->sem_init_done);
> +
> +    while (thread_context->running) {
> +        batch_task = dsa_task_dequeue(group);
> +        assert(batch_task != NULL || !group->running);
> +        if (!group->running) {
> +            assert(!thread_context->running);
> +            break;
> +        }
> +        if (batch_task->task_type == DSA_TASK) {
> +            poll_task_completion(batch_task);
> +        } else {
> +            assert(batch_task->task_type == DSA_BATCH_TASK);
> +            poll_batch_task_completion(batch_task);
> +        }
> +
> +        dsa_batch_task_complete(batch_task);
> +    }
> +
> +    rcu_unregister_thread();
> +    return NULL;
> +}
> +
> +/**
> + * @brief Initializes a DSA completion thread.
> + *
> + * @param completion_thread A pointer to the completion thread context.
> + * @param group A pointer to the DSA device group.
> + */
> +static void
> +dsa_completion_thread_init(
> +    struct dsa_completion_thread *completion_thread,
> +    struct dsa_device_group *group)
> +{
> +    completion_thread->stopping = false;
> +    completion_thread->running = true;
> +    completion_thread->thread_id = -1;
> +    qemu_sem_init(&completion_thread->sem_init_done, 0);
> +    completion_thread->group = group;
> +
> +    qemu_thread_create(&completion_thread->thread,
> +                       DSA_COMPLETION_THREAD,
> +                       dsa_completion_loop,
> +                       completion_thread,
> +                       QEMU_THREAD_JOINABLE);
> +
> +    /* Wait for initialization to complete */
> +    while (completion_thread->thread_id == -1) {
> +        qemu_sem_wait(&completion_thread->sem_init_done);
> +    }
> +}
> +
> +/**
> + * @brief Stops the completion thread (and implicitly, the device group).
> + *
> + * @param opaque A pointer to the completion thread.
> + */
> +static void dsa_completion_thread_stop(void *opaque)
> +{
> +    struct dsa_completion_thread *thread_context =
> +        (struct dsa_completion_thread *)opaque;
> +
> +    struct dsa_device_group *group = thread_context->group;
> +
> +    qemu_mutex_lock(&group->task_queue_lock);
> +
> +    thread_context->stopping = true;
> +    thread_context->running = false;
> +
> +    dsa_device_group_stop(group);
> +
> +    qemu_cond_signal(&group->task_queue_cond);
> +    qemu_mutex_unlock(&group->task_queue_lock);
> +
> +    qemu_thread_join(&thread_context->thread);
> +
> +    qemu_sem_destroy(&thread_context->sem_init_done);
> +}
> +
>  /**
>   * @brief Check if DSA is running.
>   *
> @@ -446,7 +685,7 @@ submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
>   */
>  bool dsa_is_running(void)
>  {
> -    return false;
> +    return completion_thread.running;
>  }
>  
>  static void
> @@ -481,6 +720,7 @@ void dsa_start(void)
>          return;
>      }
>      dsa_device_group_start(&dsa_group);
> +    dsa_completion_thread_init(&completion_thread, &dsa_group);
>  }
>  
>  /**
> @@ -496,6 +736,7 @@ void dsa_stop(void)
>          return;
>      }
>  
> +    dsa_completion_thread_stop(&completion_thread);
>      dsa_empty_task_queue(group);
>  }
>  



* Re: [PATCH v2 12/20] migration/multifd: Add new migration option for multifd DSA offloading.
  2023-11-14  5:40 ` [PATCH v2 12/20] migration/multifd: Add new migration option for multifd DSA offloading Hao Xiang
  2023-12-11 19:44   ` Fabiano Rosas
@ 2023-12-18  3:12   ` Wang, Lei
  1 sibling, 0 replies; 51+ messages in thread
From: Wang, Lei @ 2023-12-18  3:12 UTC (permalink / raw)
  To: Hao Xiang, farosas, peter.maydell, quintela, peterx,
	marcandre.lureau, bryan.zhang, qemu-devel

On 11/14/2023 13:40, Hao Xiang wrote:
> Intel DSA offloading is an optional feature that turns on if
> proper hardware and software stack is available. To turn on
> DSA offloading in multifd live migration:
> 
> multifd-dsa-accel="[dsa_dev_path1] ] [dsa_dev_path2] ... [dsa_dev_pathX]"

Nit: a redundant bracket ]

> 
> This feature is turned off by default.
> 
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>

[...]
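
For concreteness, a hypothetical HMP invocation of this parameter with two
work queues might look like the following (the device paths are
illustrative):

    (qemu) migrate_set_parameter multifd-dsa-accel "/dev/dsa/wq0.0 /dev/dsa/wq2.0"

The paths are space separated, matching the parsing convention discussed
later in the thread.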



* Re: [PATCH v2 13/20] migration/multifd: Prepare to introduce DSA acceleration on the multifd path.
  2023-11-14  5:40 ` [PATCH v2 13/20] migration/multifd: Prepare to introduce DSA acceleration on the multifd path Hao Xiang
@ 2023-12-18  3:20   ` Wang, Lei
  0 siblings, 0 replies; 51+ messages in thread
From: Wang, Lei @ 2023-12-18  3:20 UTC (permalink / raw)
  To: Hao Xiang, farosas, peter.maydell, quintela, peterx,
	marcandre.lureau, bryan.zhang, qemu-devel

On 11/14/2023 13:40, Hao Xiang wrote:
> 1. Refactor multifd_send_thread function.
> 2. Implement buffer_is_zero_use_cpu to handle CPU based zero page
> checking.
> 3. Introduce the batch task structure in MultiFDSendParams.
> 
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> ---
>  migration/multifd.c | 82 ++++++++++++++++++++++++++++++++++++---------
>  migration/multifd.h |  3 ++
>  2 files changed, 70 insertions(+), 15 deletions(-)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 1198ffde9c..68ab97f918 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -14,6 +14,8 @@
>  #include "qemu/cutils.h"
>  #include "qemu/rcu.h"
>  #include "qemu/cutils.h"
> +#include "qemu/dsa.h"
> +#include "qemu/memalign.h"
>  #include "exec/target_page.h"
>  #include "sysemu/sysemu.h"
>  #include "exec/ramblock.h"
> @@ -574,6 +576,11 @@ void multifd_save_cleanup(void)
>          p->name = NULL;
>          multifd_pages_clear(p->pages);
>          p->pages = NULL;
> +        g_free(p->addr);
> +        p->addr = NULL;
> +        buffer_zero_batch_task_destroy(p->batch_task);
> +        qemu_vfree(p->batch_task);
> +        p->batch_task = NULL;
>          p->packet_len = 0;
>          g_free(p->packet);
>          p->packet = NULL;
> @@ -678,13 +685,66 @@ int multifd_send_sync_main(QEMUFile *f)
>      return 0;
>  }
>  
> +static void set_page(MultiFDSendParams *p, bool zero_page, uint64_t offset)
> +{
> +    RAMBlock *rb = p->pages->block;
> +    if (zero_page) {
> +        p->zero[p->zero_num] = offset;
> +        p->zero_num++;
> +        ram_release_page(rb->idstr, offset);
> +    } else {
> +        p->normal[p->normal_num] = offset;
> +        p->normal_num++;
> +    }
> +}
> +
> +static void buffer_is_zero_use_cpu(MultiFDSendParams *p)
> +{
> +    const void **buf = (const void **)p->addr;
> +    assert(!migrate_use_main_zero_page());
> +
> +    for (int i = 0; i < p->pages->num; i++) {
> +        p->batch_task->results[i] = buffer_is_zero(buf[i], p->page_size);
> +    }
> +}
> +
> +static void set_normal_pages(MultiFDSendParams *p)
> +{
> +    for (int i = 0; i < p->pages->num; i++) {
> +        p->batch_task->results[i] = false;
> +    }
> +}
> +
> +static void multifd_zero_page_check(MultiFDSendParams *p)
> +{
> +    /* older qemu don't understand zero page on multifd channel */
> +    bool use_multifd_zero_page = !migrate_use_main_zero_page();
> +
> +    RAMBlock *rb = p->pages->block;
> +
> +    for (int i = 0; i < p->pages->num; i++) {
> +        p->addr[i] = (ram_addr_t)(rb->host + p->pages->offset[i]);
> +    }
> +
> +    if (use_multifd_zero_page) {
> +        buffer_is_zero_use_cpu(p);
> +    } else {
> +        // No zero page checking. All pages are normal pages.

Please pay attention to the comment style here and in other patches.

> +        set_normal_pages(p);
> +    }
> +
> +    for (int i = 0; i < p->pages->num; i++) {
> +        uint64_t offset = p->pages->offset[i];
> +        bool zero_page = p->batch_task->results[i];
> +        set_page(p, zero_page, offset);
> +    }
> +}
> +
>  static void *multifd_send_thread(void *opaque)
>  {
>      MultiFDSendParams *p = opaque;
>      MigrationThread *thread = NULL;
>      Error *local_err = NULL;
> -    /* qemu older than 8.2 don't understand zero page on multifd channel */
> -    bool use_multifd_zero_page = !migrate_use_main_zero_page();
>      int ret = 0;
>      bool use_zero_copy_send = migrate_zero_copy_send();
>  
> @@ -710,7 +770,6 @@ static void *multifd_send_thread(void *opaque)
>          qemu_mutex_lock(&p->mutex);
>  
>          if (p->pending_job) {
> -            RAMBlock *rb = p->pages->block;
>              uint64_t packet_num = p->packet_num;
>              uint32_t flags;
>  
> @@ -723,18 +782,7 @@ static void *multifd_send_thread(void *opaque)
>                  p->iovs_num = 1;
>              }
>  
> -            for (int i = 0; i < p->pages->num; i++) {
> -                uint64_t offset = p->pages->offset[i];
> -                if (use_multifd_zero_page &&
> -                    buffer_is_zero(rb->host + offset, p->page_size)) {
> -                    p->zero[p->zero_num] = offset;
> -                    p->zero_num++;
> -                    ram_release_page(rb->idstr, offset);
> -                } else {
> -                    p->normal[p->normal_num] = offset;
> -                    p->normal_num++;
> -                }
> -            }
> +            multifd_zero_page_check(p);
>  
>              if (p->normal_num) {
>                  ret = multifd_send_state->ops->send_prepare(p, &local_err);
> @@ -976,6 +1024,10 @@ int multifd_save_setup(Error **errp)
>          p->pending_job = 0;
>          p->id = i;
>          p->pages = multifd_pages_init(page_count);
> +        p->addr = g_new0(ram_addr_t, page_count);
> +        p->batch_task =
> +            (struct buffer_zero_batch_task *)qemu_memalign(64, sizeof(*p->batch_task));
> +        buffer_zero_batch_task_init(p->batch_task, page_count);
>          p->packet_len = sizeof(MultiFDPacket_t)
>                        + sizeof(uint64_t) * page_count;
>          p->packet = g_malloc0(p->packet_len);
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 13762900d4..62f31b03c0 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -119,6 +119,9 @@ typedef struct {
>       * pending_job != 0 -> multifd_channel can use it.
>       */
>      MultiFDPages_t *pages;
> +    /* Address of each pages in pages */

s/pages/page

> +    ram_addr_t *addr;

I think there is no need to introduce this variable since it can simply be
derived by:

	p->addr[i] = (ram_addr_t)(rb->host + p->pages->offset[i]);

and it is useless when we check zero pages in the main thread (main-zero-page=y).

> +    struct buffer_zero_batch_task *batch_task;
>  
>      /* thread local variables. No locking required */
>  



* Re: [External] Re: [PATCH v2 12/20] migration/multifd: Add new migration option for multifd DSA offloading.
  2023-12-11 19:44   ` Fabiano Rosas
@ 2023-12-18 18:34     ` Hao Xiang
  0 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-12-18 18:34 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: peter.maydell, quintela, peterx, marcandre.lureau, bryan.zhang,
	qemu-devel

On Mon, Dec 11, 2023 at 11:44 AM Fabiano Rosas <farosas@suse.de> wrote:
>
> Hao Xiang <hao.xiang@bytedance.com> writes:
>
> > Intel DSA offloading is an optional feature that turns on if
> > proper hardware and software stack is available. To turn on
> > DSA offloading in multifd live migration:
> >
> > multifd-dsa-accel="[dsa_dev_path1] ] [dsa_dev_path2] ... [dsa_dev_pathX]"
> >
> > This feature is turned off by default.
>
> This patch breaks make check:
>
>  43/357 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.52s
>  79/357 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test         ERROR           3.59s
> 167/357 qemu:qtest+qtest-x86_64 / qtest-x86_64/qmp-cmd-test ERROR           3.68s
>
> Make sure you run make check before posting. Ideally also run the series
> through the Gitlab CI on your personal fork.

* I think I accidentally deleted some code in meson-buildoptions.sh.
Reverted that now.
* I also found a bug in how I handle the string in migration options. Fixed now.
* make check is passing now. Fix will be in the next patchset.

69/818 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp           OK   4.22s    9 subtests passed
37/818 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test     OK   60.16s   16 subtests passed
607/818 qemu:qtest+qtest-x86_64 / qtest-x86_64/qmp-cmd-test      OK   8.23s    65 subtests passed

Ok:                 747
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            71
Timeout:            0

>
> > Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> > ---
> >  migration/migration-hmp-cmds.c |  8 ++++++++
> >  migration/options.c            | 28 ++++++++++++++++++++++++++++
> >  migration/options.h            |  1 +
> >  qapi/migration.json            | 17 ++++++++++++++---
> >  scripts/meson-buildoptions.sh  |  6 +++---
> >  5 files changed, 54 insertions(+), 6 deletions(-)
> >
> > diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> > index 86ae832176..d9451744dd 100644
> > --- a/migration/migration-hmp-cmds.c
> > +++ b/migration/migration-hmp-cmds.c
> > @@ -353,6 +353,9 @@ void hmp_info_migrate_parameters(Monitor *mon, const QDict *qdict)
> >          monitor_printf(mon, "%s: '%s'\n",
> >              MigrationParameter_str(MIGRATION_PARAMETER_TLS_AUTHZ),
> >              params->tls_authz);
> > +        monitor_printf(mon, "%s: %s\n",
>
> Use '%s' here.

Fixed. Will be in the next version.

>
> > +            MigrationParameter_str(MIGRATION_PARAMETER_MULTIFD_DSA_ACCEL),
> > +            params->multifd_dsa_accel);
> >
> >          if (params->has_block_bitmap_mapping) {
> >              const BitmapMigrationNodeAliasList *bmnal;
> > @@ -615,6 +618,11 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
> >          p->has_block_incremental = true;
> >          visit_type_bool(v, param, &p->block_incremental, &err);
> >          break;
> > +    case MIGRATION_PARAMETER_MULTIFD_DSA_ACCEL:
> > +        p->multifd_dsa_accel = g_new0(StrOrNull, 1);
> > +        p->multifd_dsa_accel->type = QTYPE_QSTRING;
> > +        visit_type_str(v, param, &p->multifd_dsa_accel->u.s, &err);
> > +        break;
> >      case MIGRATION_PARAMETER_MULTIFD_CHANNELS:
> >          p->has_multifd_channels = true;
> >          visit_type_uint8(v, param, &p->multifd_channels, &err);
> > diff --git a/migration/options.c b/migration/options.c
> > index 97d121d4d7..6e424b5d63 100644
> > --- a/migration/options.c
> > +++ b/migration/options.c
> > @@ -179,6 +179,8 @@ Property migration_properties[] = {
> >      DEFINE_PROP_MIG_MODE("mode", MigrationState,
> >                        parameters.mode,
> >                        MIG_MODE_NORMAL),
> > +    DEFINE_PROP_STRING("multifd-dsa-accel", MigrationState,
> > +                       parameters.multifd_dsa_accel),
> >
> >      /* Migration capabilities */
> >      DEFINE_PROP_MIG_CAP("x-xbzrle", MIGRATION_CAPABILITY_XBZRLE),
> > @@ -901,6 +903,13 @@ const char *migrate_tls_creds(void)
> >      return s->parameters.tls_creds;
> >  }
> >
> > +const char *migrate_multifd_dsa_accel(void)
> > +{
> > +    MigrationState *s = migrate_get_current();
> > +
> > +    return s->parameters.multifd_dsa_accel;
> > +}
> > +
> >  const char *migrate_tls_hostname(void)
> >  {
> >      MigrationState *s = migrate_get_current();
> > @@ -1025,6 +1034,7 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
> >      params->vcpu_dirty_limit = s->parameters.vcpu_dirty_limit;
> >      params->has_mode = true;
> >      params->mode = s->parameters.mode;
> > +    params->multifd_dsa_accel = s->parameters.multifd_dsa_accel;
> >
> >      return params;
> >  }
> > @@ -1033,6 +1043,7 @@ void migrate_params_init(MigrationParameters *params)
> >  {
> >      params->tls_hostname = g_strdup("");
> >      params->tls_creds = g_strdup("");
> > +    params->multifd_dsa_accel = g_strdup("");
> >
> >      /* Set has_* up only for parameter checks */
> >      params->has_compress_level = true;
> > @@ -1362,6 +1373,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
> >      if (params->has_mode) {
> >          dest->mode = params->mode;
> >      }
> > +
> > +    if (params->multifd_dsa_accel) {
> > +        assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
> > +        dest->multifd_dsa_accel = params->multifd_dsa_accel->u.s;
> > +    }
> >  }
> >
> >  static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
> > @@ -1506,6 +1522,12 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
> >      if (params->has_mode) {
> >          s->parameters.mode = params->mode;
> >      }
> > +
> > +    if (params->multifd_dsa_accel) {
> > +        g_free(s->parameters.multifd_dsa_accel);
> > +        assert(params->multifd_dsa_accel->type == QTYPE_QSTRING);
> > +        s->parameters.multifd_dsa_accel = g_strdup(params->multifd_dsa_accel->u.s);
> > +    }
> >  }
> >
> >  void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
> > @@ -1531,6 +1553,12 @@ void qmp_migrate_set_parameters(MigrateSetParameters *params, Error **errp)
> >          params->tls_authz->type = QTYPE_QSTRING;
> >          params->tls_authz->u.s = strdup("");
> >      }
> > +    if (params->multifd_dsa_accel
> > +        && params->multifd_dsa_accel->type == QTYPE_QNULL) {
> > +        qobject_unref(params->multifd_dsa_accel->u.n);
> > +        params->multifd_dsa_accel->type = QTYPE_QSTRING;
> > +        params->multifd_dsa_accel->u.s = strdup("");
> > +    }
> >
> >      migrate_params_test_apply(params, &tmp);
> >
> > diff --git a/migration/options.h b/migration/options.h
> > index c901eb57c6..56100961a9 100644
> > --- a/migration/options.h
> > +++ b/migration/options.h
> > @@ -94,6 +94,7 @@ const char *migrate_tls_authz(void);
> >  const char *migrate_tls_creds(void);
> >  const char *migrate_tls_hostname(void);
> >  uint64_t migrate_xbzrle_cache_size(void);
> > +const char *migrate_multifd_dsa_accel(void);
> >
> >  /* parameters setters */
> >
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 9783289bfc..a8e3b66d6f 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -879,6 +879,9 @@
> >  # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
> >  #        (Since 8.2)
> >  #
> > +# @multifd-dsa-accel: If enabled, use DSA accelerator offloading for
> > +#                     certain memory operations. (since 8.2)
> > +#
> >  # Features:
> >  #
> >  # @deprecated: Member @block-incremental is deprecated.  Use
> > @@ -902,7 +905,7 @@
> >             'cpu-throttle-initial', 'cpu-throttle-increment',
> >             'cpu-throttle-tailslow',
> >             'tls-creds', 'tls-hostname', 'tls-authz', 'max-bandwidth',
> > -           'avail-switchover-bandwidth', 'downtime-limit',
> > +           'avail-switchover-bandwidth', 'downtime-limit', 'multifd-dsa-accel',
> >             { 'name': 'x-checkpoint-delay', 'features': [ 'unstable' ] },
> >             { 'name': 'block-incremental', 'features': [ 'deprecated' ] },
> >             'multifd-channels',
> > @@ -1067,6 +1070,9 @@
> >  # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
> >  #        (Since 8.2)
> >  #
> > +# @multifd-dsa-accel: If enabled, use DSA accelerator offloading for
> > +#                     certain memory operations. (since 8.2)
> > +#
> >  # Features:
> >  #
> >  # @deprecated: Member @block-incremental is deprecated.  Use
> > @@ -1120,7 +1126,8 @@
> >              '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
> >                                              'features': [ 'unstable' ] },
> >              '*vcpu-dirty-limit': 'uint64',
> > -            '*mode': 'MigMode'} }
> > +            '*mode': 'MigMode',
> > +            '*multifd-dsa-accel': 'StrOrNull'} }
> >
> >  ##
> >  # @migrate-set-parameters:
> > @@ -1295,6 +1302,9 @@
> >  # @mode: Migration mode. See description in @MigMode. Default is 'normal'.
> >  #        (Since 8.2)
> >  #
> > +# @multifd-dsa-accel: If enabled, use DSA accelerator offloading for
> > +#                     certain memory operations. (since 8.2)
> > +#
> >  # Features:
> >  #
> >  # @deprecated: Member @block-incremental is deprecated.  Use
> > @@ -1345,7 +1355,8 @@
> >              '*x-vcpu-dirty-limit-period': { 'type': 'uint64',
> >                                              'features': [ 'unstable' ] },
> >              '*vcpu-dirty-limit': 'uint64',
> > -            '*mode': 'MigMode'} }
> > +            '*mode': 'MigMode',
> > +            '*multifd-dsa-accel': 'str'} }
> >
> >  ##
> >  # @query-migrate-parameters:
> > diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
> > index bf139e3fb4..35222ab63e 100644
> > --- a/scripts/meson-buildoptions.sh
> > +++ b/scripts/meson-buildoptions.sh
> > @@ -32,6 +32,7 @@ meson_options_help() {
> >    printf "%s\n" '  --enable-debug-stack-usage'
> >    printf "%s\n" '                           measure coroutine stack usage'
> >    printf "%s\n" '  --enable-debug-tcg       TCG debugging'
> >    printf "%s\n" '  --enable-enqcmd          ENQCMD optimizations'
> >    printf "%s\n" '  --enable-fdt[=CHOICE]    Whether and how to find the libfdt library'
> >    printf "%s\n" '                           (choices: auto/disabled/enabled/internal/system)'
> >    printf "%s\n" '  --enable-fuzzing         build fuzzing targets'
> > @@ -93,7 +94,6 @@ meson_options_help() {
> >    printf "%s\n" '  avx2            AVX2 optimizations'
> >    printf "%s\n" '  avx512bw        AVX512BW optimizations'
> >    printf "%s\n" '  avx512f         AVX512F optimizations'
> > -  printf "%s\n" '  enqcmd          ENQCMD optimizations'
> >    printf "%s\n" '  blkio           libblkio block device driver'
> >    printf "%s\n" '  bochs           bochs image format support'
> >    printf "%s\n" '  bpf             eBPF support'
> > @@ -241,8 +241,6 @@ _meson_option_parse() {
> >      --disable-avx512bw) printf "%s" -Davx512bw=disabled ;;
> >      --enable-avx512f) printf "%s" -Davx512f=enabled ;;
> >      --disable-avx512f) printf "%s" -Davx512f=disabled ;;
> > -    --enable-enqcmd) printf "%s" -Denqcmd=true ;;
> > -    --disable-enqcmd) printf "%s" -Denqcmd=false ;;
> >      --enable-gcov) printf "%s" -Db_coverage=true ;;
> >      --disable-gcov) printf "%s" -Db_coverage=false ;;
> >      --enable-lto) printf "%s" -Db_lto=true ;;
> > @@ -309,6 +307,8 @@ _meson_option_parse() {
> >      --disable-docs) printf "%s" -Ddocs=disabled ;;
> >      --enable-dsound) printf "%s" -Ddsound=enabled ;;
> >      --disable-dsound) printf "%s" -Ddsound=disabled ;;
> > +    --enable-enqcmd) printf "%s" -Denqcmd=true ;;
> > +    --disable-enqcmd) printf "%s" -Denqcmd=false ;;
> >      --enable-fdt) printf "%s" -Dfdt=enabled ;;
> >      --disable-fdt) printf "%s" -Dfdt=disabled ;;
> >      --enable-fdt=*) quote_sh "-Dfdt=$2" ;;



* Re: [External] Re: [PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model.
  2023-12-18  3:11   ` Wang, Lei
@ 2023-12-18 18:57     ` Hao Xiang
  2023-12-19  1:33       ` Wang, Lei
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2023-12-18 18:57 UTC (permalink / raw)
  To: Wang, Lei
  Cc: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel

On Sun, Dec 17, 2023 at 7:11 PM Wang, Lei <lei4.wang@intel.com> wrote:
>
> On 11/14/2023 13:40, Hao Xiang wrote:
> > * Create a dedicated thread for DSA task completion.
> > * DSA completion thread runs a loop and poll for completed tasks.
> > * Start and stop DSA completion thread during DSA device start stop.
> >
> > User space application can directly submit task to Intel DSA
> > accelerator by writing to DSA's device memory (mapped in user space).
>
> > +            }
> > +            return;
> > +        }
> > +    } else {
> > +        assert(batch_status == DSA_COMP_BATCH_FAIL ||
> > +            batch_status == DSA_COMP_BATCH_PAGE_FAULT);
>
> Nit: indentation is broken here.
>
> > +    }
> > +
> > +    for (int i = 0; i < count; i++) {
> > +
> > +        completion = &batch_task->completions[i];
> > +        status = completion->status;
> > +
> > +        if (status == DSA_COMP_SUCCESS) {
> > +            results[i] = (completion->result == 0);
> > +            continue;
> > +        }
> > +
> > +        if (status != DSA_COMP_PAGE_FAULT_NOBOF) {
> > +            fprintf(stderr,
> > +                    "Unexpected completion status = %u.\n", status);
> > +            assert(false);
> > +        }
> > +    }
> > +}
> > +
> > +/**
> > + * @brief Handles an asynchronous DSA batch task completion.
> > + *
> > + * @param task A pointer to the batch buffer zero task structure.
> > + */
> > +static void
> > +dsa_batch_task_complete(struct buffer_zero_batch_task *batch_task)
> > +{
> > +    batch_task->status = DSA_TASK_COMPLETION;
> > +    batch_task->completion_callback(batch_task);
> > +}
> > +
> > +/**
> > + * @brief The function entry point called by a dedicated DSA
> > + *        work item completion thread.
> > + *
> > + * @param opaque A pointer to the thread context.
> > + *
> > + * @return void* Not used.
> > + */
> > +static void *
> > +dsa_completion_loop(void *opaque)
>
> Per my understanding, if a multifd sending thread corresponds to a DSA device,
> then the batch tasks are executed in parallel which means a task may be
> completed slower than another even if this task is enqueued earlier than it. If
> we poll on the slower task first it will block the handling of the faster one,
> even if the zero checking task for that thread is finished and it can go ahead
> and send the data to the wire, this may lower the network resource utilization.
>

Hi Lei, thanks for reviewing. You are correct that we can keep polling
a task enqueued first while others in the queue have already been
completed. In fact, only one DSA completion thread (polling thread) is
used here even when multiple DSA devices are used. The polling loop is
the most CPU intensive activity in the DSA workflow and acts directly
against the goal of saving CPU usage. The trade-off I want to take
here is a slightly higher latency on DSA task completion in exchange
for more CPU savings. A single DSA engine can reach 30 GB/s throughput
on memory comparison operations. We use the kernel TCP stack for
network transfer, where the best I see is around 10 GB/s throughput.
RDMA can potentially go higher, but I am not sure it can exceed 30
GB/s throughput anytime soon.
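
As a rough sanity check on these figures: the default 128-page batch of
4 KiB pages is 512 KiB, which one DSA engine compares in about
512 KiB / 30 GB/s ≈ 17 µs, while sending the same data over a 10 GB/s
link takes about 52 µs, so a modest extra completion-polling latency
stays hidden behind the network transfer.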

> > +{
> > +    struct dsa_completion_thread *thread_context =
> > +        (struct dsa_completion_thread *)opaque;
> > +    struct buffer_zero_batch_task *batch_task;
> > +    struct dsa_device_group *group = thread_context->group;
> > +
> > +    rcu_register_thread();
> > +
> > +    thread_context->thread_id = qemu_get_thread_id();
> > +    qemu_sem_post(&thread_context->sem_init_done);
> > +
> > +    while (thread_context->running) {
> > +        batch_task = dsa_task_dequeue(group);
> > +        assert(batch_task != NULL || !group->running);
> > +        if (!group->running) {
> > +            assert(!thread_context->running);
> > +            break;
> > +        }
> > +        if (batch_task->task_type == DSA_TASK) {
> > +            poll_task_completion(batch_task);
> > +        } else {
> > +            assert(batch_task->task_type == DSA_BATCH_TASK);
> > +            poll_batch_task_completion(batch_task);
> > +        }
> > +
> > +        dsa_batch_task_complete(batch_task);
> > +    }
> > +
> > +    rcu_unregister_thread();
> > +    return NULL;
> > +}
> > +
> > +/**
> > + * @brief Initializes a DSA completion thread.
> > + *
> > + * @param completion_thread A pointer to the completion thread context.
> > + * @param group A pointer to the DSA device group.
> > + */
> > +static void
> > +dsa_completion_thread_init(
> > +    struct dsa_completion_thread *completion_thread,
> > +    struct dsa_device_group *group)
> > +{
> > +    completion_thread->stopping = false;
> > +    completion_thread->running = true;
> > +    completion_thread->thread_id = -1;
> > +    qemu_sem_init(&completion_thread->sem_init_done, 0);
> > +    completion_thread->group = group;
> > +
> > +    qemu_thread_create(&completion_thread->thread,
> > +                       DSA_COMPLETION_THREAD,
> > +                       dsa_completion_loop,
> > +                       completion_thread,
> > +                       QEMU_THREAD_JOINABLE);
> > +
> > +    /* Wait for initialization to complete */
> > +    while (completion_thread->thread_id == -1) {
> > +        qemu_sem_wait(&completion_thread->sem_init_done);
> > +    }
> > +}
> > +
> > +/**
> > + * @brief Stops the completion thread (and implicitly, the device group).
> > + *
> > + * @param opaque A pointer to the completion thread.
> > + */
> > +static void dsa_completion_thread_stop(void *opaque)
> > +{
> > +    struct dsa_completion_thread *thread_context =
> > +        (struct dsa_completion_thread *)opaque;
> > +
> > +    struct dsa_device_group *group = thread_context->group;
> > +
> > +    qemu_mutex_lock(&group->task_queue_lock);
> > +
> > +    thread_context->stopping = true;
> > +    thread_context->running = false;
> > +
> > +    dsa_device_group_stop(group);
> > +
> > +    qemu_cond_signal(&group->task_queue_cond);
> > +    qemu_mutex_unlock(&group->task_queue_lock);
> > +
> > +    qemu_thread_join(&thread_context->thread);
> > +
> > +    qemu_sem_destroy(&thread_context->sem_init_done);
> > +}
> > +
> >  /**
> >   * @brief Check if DSA is running.
> >   *
> > @@ -446,7 +685,7 @@ submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
> >   */
> >  bool dsa_is_running(void)
> >  {
> > -    return false;
> > +    return completion_thread.running;
> >  }
> >
> >  static void
> > @@ -481,6 +720,7 @@ void dsa_start(void)
> >          return;
> >      }
> >      dsa_device_group_start(&dsa_group);
> > +    dsa_completion_thread_init(&completion_thread, &dsa_group);
> >  }
> >
> >  /**
> > @@ -496,6 +736,7 @@ void dsa_stop(void)
> >          return;
> >      }
> >
> > +    dsa_completion_thread_stop(&completion_thread);
> >      dsa_empty_task_queue(group);
> >  }
> >



* Re: [External] Re: [PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model.
  2023-12-18 18:57     ` [External] " Hao Xiang
@ 2023-12-19  1:33       ` Wang, Lei
  2023-12-19  5:12         ` Hao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Wang, Lei @ 2023-12-19  1:33 UTC (permalink / raw)
  To: Hao Xiang
  Cc: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel

On 12/19/2023 2:57, Hao Xiang wrote:
> On Sun, Dec 17, 2023 at 7:11 PM Wang, Lei <lei4.wang@intel.com> wrote:
>>
>> On 11/14/2023 13:40, Hao Xiang wrote:
>>> * Create a dedicated thread for DSA task completion.
>>> * DSA completion thread runs a loop and poll for completed tasks.
>>> * Start and stop DSA completion thread during DSA device start stop.
>>>
>>> User space application can directly submit task to Intel DSA
>>> accelerator by writing to DSA's device memory (mapped in user space).
>>
>>> +            }
>>> +            return;
>>> +        }
>>> +    } else {
>>> +        assert(batch_status == DSA_COMP_BATCH_FAIL ||
>>> +            batch_status == DSA_COMP_BATCH_PAGE_FAULT);
>>
>> Nit: indentation is broken here.
>>
>>> +    }
>>> +
>>> +    for (int i = 0; i < count; i++) {
>>> +
>>> +        completion = &batch_task->completions[i];
>>> +        status = completion->status;
>>> +
>>> +        if (status == DSA_COMP_SUCCESS) {
>>> +            results[i] = (completion->result == 0);
>>> +            continue;
>>> +        }
>>> +
>>> +        if (status != DSA_COMP_PAGE_FAULT_NOBOF) {
>>> +            fprintf(stderr,
>>> +                    "Unexpected completion status = %u.\n", status);
>>> +            assert(false);
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +/**
>>> + * @brief Handles an asynchronous DSA batch task completion.
>>> + *
>>> + * @param task A pointer to the batch buffer zero task structure.
>>> + */
>>> +static void
>>> +dsa_batch_task_complete(struct buffer_zero_batch_task *batch_task)
>>> +{
>>> +    batch_task->status = DSA_TASK_COMPLETION;
>>> +    batch_task->completion_callback(batch_task);
>>> +}
>>> +
>>> +/**
>>> + * @brief The function entry point called by a dedicated DSA
>>> + *        work item completion thread.
>>> + *
>>> + * @param opaque A pointer to the thread context.
>>> + *
>>> + * @return void* Not used.
>>> + */
>>> +static void *
>>> +dsa_completion_loop(void *opaque)
>>
>> Per my understanding, if a multifd sending thread corresponds to a DSA device,
>> then the batch tasks are executed in parallel which means a task may be
>> completed slower than another even if this task is enqueued earlier than it. If
>> we poll on the slower task first it will block the handling of the faster one,
>> even if the zero checking task for that thread is finished and it can go ahead
>> and send the data to the wire, this may lower the network resource utilization.
>>
> 
> Hi Lei, thanks for reviewing. You are correct that we can keep polling
> a task enqueued first while others in the queue have already been
> completed. In fact, only one DSA completion thread (polling thread) is
> used here even when multiple DSA devices are used. The polling loop is
> the most CPU intensive activity in the DSA workflow and acts directly
> against the goal of saving CPU usage. The trade-off I want to take
> here is a slightly higher latency on DSA task completion in exchange
> for more CPU savings. A single DSA engine can reach 30 GB/s throughput
> on memory comparison operations. We use the kernel TCP stack for
> network transfer, where the best I see is around 10 GB/s throughput.
> RDMA can potentially go higher, but I am not sure it can exceed 30
> GB/s throughput anytime soon.

Hi Hao, that makes sense: if DSA is faster than the network, then a little
bit of latency in DSA completion checking is tolerable. In the long term, I
think the best form of the DSA task checking thread is to use an fd or
something similar that can multiplex the checking of different DSA devices,
so we can serve DSA tasks in the order they complete rather than FCFS.
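
For illustration, an fd-based scheme of that kind might look like the
sketch below. This is purely hypothetical: it assumes a pollable
per-device completion fd, which the idxd driver does not expose today,
so it shows the shape of the idea rather than code that runs against the
current kernel interface.

    #include <poll.h>

    /*
     * fds[i].fd would be the hypothetical completion fd of DSA device i,
     * registered with events = POLLIN.
     */
    static void dsa_serve_completions(struct pollfd *fds, int ndev)
    {
        /* Block until at least one device has finished descriptors. */
        int ready = poll(fds, ndev, -1);

        for (int i = 0; ready > 0 && i < ndev; i++) {
            if (fds[i].revents & POLLIN) {
                /*
                 * Drain this device's completed tasks, so tasks are
                 * served in completion order rather than FCFS.
                 */
                ready--;
            }
        }
    }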

> 
>>> +{
>>> +    struct dsa_completion_thread *thread_context =
>>> +        (struct dsa_completion_thread *)opaque;
>>> +    struct buffer_zero_batch_task *batch_task;
>>> +    struct dsa_device_group *group = thread_context->group;
>>> +
>>> +    rcu_register_thread();
>>> +
>>> +    thread_context->thread_id = qemu_get_thread_id();
>>> +    qemu_sem_post(&thread_context->sem_init_done);
>>> +
>>> +    while (thread_context->running) {
>>> +        batch_task = dsa_task_dequeue(group);
>>> +        assert(batch_task != NULL || !group->running);
>>> +        if (!group->running) {
>>> +            assert(!thread_context->running);
>>> +            break;
>>> +        }
>>> +        if (batch_task->task_type == DSA_TASK) {
>>> +            poll_task_completion(batch_task);
>>> +        } else {
>>> +            assert(batch_task->task_type == DSA_BATCH_TASK);
>>> +            poll_batch_task_completion(batch_task);
>>> +        }
>>> +
>>> +        dsa_batch_task_complete(batch_task);
>>> +    }
>>> +
>>> +    rcu_unregister_thread();
>>> +    return NULL;
>>> +}
>>> +
>>> +/**
>>> + * @brief Initializes a DSA completion thread.
>>> + *
>>> + * @param completion_thread A pointer to the completion thread context.
>>> + * @param group A pointer to the DSA device group.
>>> + */
>>> +static void
>>> +dsa_completion_thread_init(
>>> +    struct dsa_completion_thread *completion_thread,
>>> +    struct dsa_device_group *group)
>>> +{
>>> +    completion_thread->stopping = false;
>>> +    completion_thread->running = true;
>>> +    completion_thread->thread_id = -1;
>>> +    qemu_sem_init(&completion_thread->sem_init_done, 0);
>>> +    completion_thread->group = group;
>>> +
>>> +    qemu_thread_create(&completion_thread->thread,
>>> +                       DSA_COMPLETION_THREAD,
>>> +                       dsa_completion_loop,
>>> +                       completion_thread,
>>> +                       QEMU_THREAD_JOINABLE);
>>> +
>>> +    /* Wait for initialization to complete */
>>> +    while (completion_thread->thread_id == -1) {
>>> +        qemu_sem_wait(&completion_thread->sem_init_done);
>>> +    }
>>> +}
>>> +
>>> +/**
>>> + * @brief Stops the completion thread (and implicitly, the device group).
>>> + *
>>> + * @param opaque A pointer to the completion thread.
>>> + */
>>> +static void dsa_completion_thread_stop(void *opaque)
>>> +{
>>> +    struct dsa_completion_thread *thread_context =
>>> +        (struct dsa_completion_thread *)opaque;
>>> +
>>> +    struct dsa_device_group *group = thread_context->group;
>>> +
>>> +    qemu_mutex_lock(&group->task_queue_lock);
>>> +
>>> +    thread_context->stopping = true;
>>> +    thread_context->running = false;
>>> +
>>> +    dsa_device_group_stop(group);
>>> +
>>> +    qemu_cond_signal(&group->task_queue_cond);
>>> +    qemu_mutex_unlock(&group->task_queue_lock);
>>> +
>>> +    qemu_thread_join(&thread_context->thread);
>>> +
>>> +    qemu_sem_destroy(&thread_context->sem_init_done);
>>> +}
>>> +
>>>  /**
>>>   * @brief Check if DSA is running.
>>>   *
>>> @@ -446,7 +685,7 @@ submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
>>>   */
>>>  bool dsa_is_running(void)
>>>  {
>>> -    return false;
>>> +    return completion_thread.running;
>>>  }
>>>
>>>  static void
>>> @@ -481,6 +720,7 @@ void dsa_start(void)
>>>          return;
>>>      }
>>>      dsa_device_group_start(&dsa_group);
>>> +    dsa_completion_thread_init(&completion_thread, &dsa_group);
>>>  }
>>>
>>>  /**
>>> @@ -496,6 +736,7 @@ void dsa_stop(void)
>>>          return;
>>>      }
>>>
>>> +    dsa_completion_thread_stop(&completion_thread);
>>>      dsa_empty_task_queue(group);
>>>  }
>>>



* Re: [External] Re: [PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model.
  2023-12-19  1:33       ` Wang, Lei
@ 2023-12-19  5:12         ` Hao Xiang
  0 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-12-19  5:12 UTC (permalink / raw)
  To: Wang, Lei
  Cc: farosas, peter.maydell, quintela, peterx, marcandre.lureau,
	bryan.zhang, qemu-devel

On Mon, Dec 18, 2023 at 5:34 PM Wang, Lei <lei4.wang@intel.com> wrote:
>
> On 12/19/2023 2:57, Hao Xiang wrote:
> > On Sun, Dec 17, 2023 at 7:11 PM Wang, Lei <lei4.wang@intel.com> wrote:
> >>
> >> On 11/14/2023 13:40, Hao Xiang wrote:
> >>> * Create a dedicated thread for DSA task completion.
> >>> * DSA completion thread runs a loop and poll for completed tasks.
> >>> * Start and stop DSA completion thread during DSA device start stop.
> >>>
> >>> User space application can directly submit task to Intel DSA
> >>> accelerator by writing to DSA's device memory (mapped in user space).
> >>
> >>> +            }
> >>> +            return;
> >>> +        }
> >>> +    } else {
> >>> +        assert(batch_status == DSA_COMP_BATCH_FAIL ||
> >>> +            batch_status == DSA_COMP_BATCH_PAGE_FAULT);
> >>
> >> Nit: indentation is broken here.
> >>
> >>> +    }
> >>> +
> >>> +    for (int i = 0; i < count; i++) {
> >>> +
> >>> +        completion = &batch_task->completions[i];
> >>> +        status = completion->status;
> >>> +
> >>> +        if (status == DSA_COMP_SUCCESS) {
> >>> +            results[i] = (completion->result == 0);
> >>> +            continue;
> >>> +        }
> >>> +
> >>> +        if (status != DSA_COMP_PAGE_FAULT_NOBOF) {
> >>> +            fprintf(stderr,
> >>> +                    "Unexpected completion status = %u.\n", status);
> >>> +            assert(false);
> >>> +        }
> >>> +    }
> >>> +}
> >>> +
> >>> +/**
> >>> + * @brief Handles an asynchronous DSA batch task completion.
> >>> + *
> >>> + * @param task A pointer to the batch buffer zero task structure.
> >>> + */
> >>> +static void
> >>> +dsa_batch_task_complete(struct buffer_zero_batch_task *batch_task)
> >>> +{
> >>> +    batch_task->status = DSA_TASK_COMPLETION;
> >>> +    batch_task->completion_callback(batch_task);
> >>> +}
> >>> +
> >>> +/**
> >>> + * @brief The function entry point called by a dedicated DSA
> >>> + *        work item completion thread.
> >>> + *
> >>> + * @param opaque A pointer to the thread context.
> >>> + *
> >>> + * @return void* Not used.
> >>> + */
> >>> +static void *
> >>> +dsa_completion_loop(void *opaque)
> >>
> >> Per my understanding, if a multifd sending thread corresponds to a DSA device,
> >> then the batch tasks are executed in parallel which means a task may be
> >> completed slower than another even if this task is enqueued earlier than it. If
> >> we poll on the slower task first it will block the handling of the faster one,
> >> even if the zero checking task for that thread is finished and it can go ahead
> >> and send the data to the wire, this may lower the network resource utilization.
> >>
> >
> > Hi Lei, thanks for reviewing. You are correct that we can keep polling
> > a task enqueued first while others in the queue have already been
> > completed. In fact, only one DSA completion thread (polling thread) is
> > used here even when multiple DSA devices are used. The polling loop is
> > the most CPU intensive activity in the DSA workflow and acts directly
> > against the goal of saving CPU usage. The trade-off I want to take
> > here is a slightly higher latency on DSA task completion in exchange
> > for more CPU savings. A single DSA engine can reach 30 GB/s throughput
> > on memory comparison operations. We use the kernel TCP stack for
> > network transfer, where the best I see is around 10 GB/s throughput.
> > RDMA can potentially go higher, but I am not sure it can exceed 30
> > GB/s throughput anytime soon.
>
> Hi Hao, that makes sense: if DSA is faster than the network, then a little
> bit of latency in DSA completion checking is tolerable. In the long term, I
> think the best form of the DSA task checking thread is to use an fd or
> something similar that can multiplex the checking of different DSA devices,
> so we can serve DSA tasks in the order they complete rather than FCFS.
>
I have experimented with using N completion threads, each polling
tasks submitted to a particular DSA device. That approach uses too
many CPU cycles. If Intel can come up with a better workflow for DSA
completion, there is definitely room for improvement here.
> >
> >>> +{
> >>> +    struct dsa_completion_thread *thread_context =
> >>> +        (struct dsa_completion_thread *)opaque;
> >>> +    struct buffer_zero_batch_task *batch_task;
> >>> +    struct dsa_device_group *group = thread_context->group;
> >>> +
> >>> +    rcu_register_thread();
> >>> +
> >>> +    thread_context->thread_id = qemu_get_thread_id();
> >>> +    qemu_sem_post(&thread_context->sem_init_done);
> >>> +
> >>> +    while (thread_context->running) {
> >>> +        batch_task = dsa_task_dequeue(group);
> >>> +        assert(batch_task != NULL || !group->running);
> >>> +        if (!group->running) {
> >>> +            assert(!thread_context->running);
> >>> +            break;
> >>> +        }
> >>> +        if (batch_task->task_type == DSA_TASK) {
> >>> +            poll_task_completion(batch_task);
> >>> +        } else {
> >>> +            assert(batch_task->task_type == DSA_BATCH_TASK);
> >>> +            poll_batch_task_completion(batch_task);
> >>> +        }
> >>> +
> >>> +        dsa_batch_task_complete(batch_task);
> >>> +    }
> >>> +
> >>> +    rcu_unregister_thread();
> >>> +    return NULL;
> >>> +}
> >>> +
> >>> +/**
> >>> + * @brief Initializes a DSA completion thread.
> >>> + *
> >>> + * @param completion_thread A pointer to the completion thread context.
> >>> + * @param group A pointer to the DSA device group.
> >>> + */
> >>> +static void
> >>> +dsa_completion_thread_init(
> >>> +    struct dsa_completion_thread *completion_thread,
> >>> +    struct dsa_device_group *group)
> >>> +{
> >>> +    completion_thread->stopping = false;
> >>> +    completion_thread->running = true;
> >>> +    completion_thread->thread_id = -1;
> >>> +    qemu_sem_init(&completion_thread->sem_init_done, 0);
> >>> +    completion_thread->group = group;
> >>> +
> >>> +    qemu_thread_create(&completion_thread->thread,
> >>> +                       DSA_COMPLETION_THREAD,
> >>> +                       dsa_completion_loop,
> >>> +                       completion_thread,
> >>> +                       QEMU_THREAD_JOINABLE);
> >>> +
> >>> +    /* Wait for initialization to complete */
> >>> +    while (completion_thread->thread_id == -1) {
> >>> +        qemu_sem_wait(&completion_thread->sem_init_done);
> >>> +    }
> >>> +}
> >>> +
> >>> +/**
> >>> + * @brief Stops the completion thread (and implicitly, the device group).
> >>> + *
> >>> + * @param opaque A pointer to the completion thread.
> >>> + */
> >>> +static void dsa_completion_thread_stop(void *opaque)
> >>> +{
> >>> +    struct dsa_completion_thread *thread_context =
> >>> +        (struct dsa_completion_thread *)opaque;
> >>> +
> >>> +    struct dsa_device_group *group = thread_context->group;
> >>> +
> >>> +    qemu_mutex_lock(&group->task_queue_lock);
> >>> +
> >>> +    thread_context->stopping = true;
> >>> +    thread_context->running = false;
> >>> +
> >>> +    dsa_device_group_stop(group);
> >>> +
> >>> +    qemu_cond_signal(&group->task_queue_cond);
> >>> +    qemu_mutex_unlock(&group->task_queue_lock);
> >>> +
> >>> +    qemu_thread_join(&thread_context->thread);
> >>> +
> >>> +    qemu_sem_destroy(&thread_context->sem_init_done);
> >>> +}
> >>> +
> >>>  /**
> >>>   * @brief Check if DSA is running.
> >>>   *
> >>> @@ -446,7 +685,7 @@ submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
> >>>   */
> >>>  bool dsa_is_running(void)
> >>>  {
> >>> -    return false;
> >>> +    return completion_thread.running;
> >>>  }
> >>>
> >>>  static void
> >>> @@ -481,6 +720,7 @@ void dsa_start(void)
> >>>          return;
> >>>      }
> >>>      dsa_device_group_start(&dsa_group);
> >>> +    dsa_completion_thread_init(&completion_thread, &dsa_group);
> >>>  }
> >>>
> >>>  /**
> >>> @@ -496,6 +736,7 @@ void dsa_stop(void)
> >>>          return;
> >>>      }
> >>>
> >>> +    dsa_completion_thread_stop(&completion_thread);
> >>>      dsa_empty_task_queue(group);
> >>>  }
> >>>



* Re: [External] Re: [PATCH v2 07/20] util/dsa: Implement DSA device start and stop logic.
  2023-12-11 21:28   ` Fabiano Rosas
@ 2023-12-19  6:41     ` Hao Xiang
  2023-12-19 13:18       ` Fabiano Rosas
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2023-12-19  6:41 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: peter.maydell, quintela, peterx, marcandre.lureau, bryan.zhang,
	qemu-devel

On Mon, Dec 11, 2023 at 1:28 PM Fabiano Rosas <farosas@suse.de> wrote:
>
> Hao Xiang <hao.xiang@bytedance.com> writes:
>
> > * DSA device open and close.
> > * DSA group contains multiple DSA devices.
> > * DSA group configure/start/stop/clean.
> >
> > Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> > Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
> > ---
> >  include/qemu/dsa.h |  49 +++++++
> >  util/dsa.c         | 338 +++++++++++++++++++++++++++++++++++++++++++++
> >  util/meson.build   |   1 +
> >  3 files changed, 388 insertions(+)
> >  create mode 100644 include/qemu/dsa.h
> >  create mode 100644 util/dsa.c
> >
> > diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> > new file mode 100644
> > index 0000000000..30246b507e
> > --- /dev/null
> > +++ b/include/qemu/dsa.h
> > @@ -0,0 +1,49 @@
> > +#ifndef QEMU_DSA_H
> > +#define QEMU_DSA_H
> > +
> > +#include "qemu/thread.h"
> > +#include "qemu/queue.h"
> > +
> > +#ifdef CONFIG_DSA_OPT
> > +
> > +#pragma GCC push_options
> > +#pragma GCC target("enqcmd")
> > +
> > +#include <linux/idxd.h>
> > +#include "x86intrin.h"
> > +
> > +#endif
> > +
> > +/**
> > + * @brief Initializes DSA devices.
> > + *
> > + * @param dsa_parameter A list of DSA device path from migration parameter.
>
> This code seems pretty generic, let's decouple this doc from migration.
>
> > + * @return int Zero if successful, otherwise non zero.
> > + */
> > +int dsa_init(const char *dsa_parameter);
> > +
> > +/**
> > + * @brief Start logic to enable using DSA.
> > + */
> > +void dsa_start(void);
> > +
> > +/**
> > + * @brief Stop logic to clean up DSA by halting the device group and cleaning up
> > + * the completion thread.
>
> "Stop the device group and the completion thread"
>
> The mention of "clean/cleaning up" makes this confusing because of
> dsa_cleanup() below.

Fixed.

>
> > + */
> > +void dsa_stop(void);
> > +
> > +/**
> > + * @brief Clean up system resources created for DSA offloading.
> > + *        This function is called during QEMU process teardown.
>
> This is not called during QEMU process teardown. It's called at the end
> of migration AFAICS. Maybe just leave this sentence out.

Fixed.

>
> > + */
> > +void dsa_cleanup(void);
> > +
> > +/**
> > + * @brief Check if DSA is running.
> > + *
> > + * @return True if DSA is running, otherwise false.
> > + */
> > +bool dsa_is_running(void);
> > +
> > +#endif
> > \ No newline at end of file
> > diff --git a/util/dsa.c b/util/dsa.c
> > new file mode 100644
> > index 0000000000..8edaa892ec
> > --- /dev/null
> > +++ b/util/dsa.c
> > @@ -0,0 +1,338 @@
> > +/*
> > + * Use Intel Data Streaming Accelerator to offload certain background
> > + * operations.
> > + *
> > + * Copyright (c) 2023 Hao Xiang <hao.xiang@bytedance.com>
> > + *                    Bryan Zhang <bryan.zhang@bytedance.com>
> > + *
> > + * Permission is hereby granted, free of charge, to any person obtaining a copy
> > + * of this software and associated documentation files (the "Software"), to deal
> > + * in the Software without restriction, including without limitation the rights
> > + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> > + * copies of the Software, and to permit persons to whom the Software is
> > + * furnished to do so, subject to the following conditions:
> > + *
> > + * The above copyright notice and this permission notice shall be included in
> > + * all copies or substantial portions of the Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> > + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> > + * THE SOFTWARE.
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "qemu/queue.h"
> > +#include "qemu/memalign.h"
> > +#include "qemu/lockable.h"
> > +#include "qemu/cutils.h"
> > +#include "qemu/dsa.h"
> > +#include "qemu/bswap.h"
> > +#include "qemu/error-report.h"
> > +#include "qemu/rcu.h"
> > +
> > +#ifdef CONFIG_DSA_OPT
> > +
> > +#pragma GCC push_options
> > +#pragma GCC target("enqcmd")
> > +
> > +#include <linux/idxd.h>
> > +#include "x86intrin.h"
> > +
> > +#define DSA_WQ_SIZE 4096
> > +#define MAX_DSA_DEVICES 16
> > +
> > +typedef QSIMPLEQ_HEAD(dsa_task_queue, buffer_zero_batch_task) dsa_task_queue;
> > +
> > +struct dsa_device {
> > +    void *work_queue;
> > +};
> > +
> > +struct dsa_device_group {
> > +    struct dsa_device *dsa_devices;
> > +    int num_dsa_devices;
> > +    uint32_t index;
> > +    bool running;
> > +    QemuMutex task_queue_lock;
> > +    QemuCond task_queue_cond;
> > +    dsa_task_queue task_queue;
> > +};
> > +
> > +uint64_t max_retry_count;
> > +static struct dsa_device_group dsa_group;
> > +
> > +
> > +/**
> > + * @brief This function opens a DSA device's work queue and
> > + *        maps the DSA device memory into the current process.
> > + *
> > + * @param dsa_wq_path A pointer to the DSA device work queue's file path.
> > + * @return A pointer to the mapped memory.
> > + */
> > +static void *
> > +map_dsa_device(const char *dsa_wq_path)
> > +{
> > +    void *dsa_device;
> > +    int fd;
> > +
> > +    fd = open(dsa_wq_path, O_RDWR);
> > +    if (fd < 0) {
> > +        fprintf(stderr, "open %s failed with errno = %d.\n",
> > +                dsa_wq_path, errno);
>
> Use error_report and error_setg* for these. Throughout the series.

All converted to using error_report.
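
Presumably the converted call for the open() failure above now reads
something like:

    error_report("open %s failed with errno = %d.", dsa_wq_path, errno);

(error_report() adds its own trailing newline, so the explicit \n goes
away.)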

>
> > +        return MAP_FAILED;
> > +    }
> > +    dsa_device = mmap(NULL, DSA_WQ_SIZE, PROT_WRITE,
> > +                      MAP_SHARED | MAP_POPULATE, fd, 0);
> > +    close(fd);
> > +    if (dsa_device == MAP_FAILED) {
> > +        fprintf(stderr, "mmap failed with errno = %d.\n", errno);
> > +        return MAP_FAILED;
> > +    }
> > +    return dsa_device;
> > +}
> > +
> > +/**
> > + * @brief Initializes a DSA device structure.
> > + *
> > + * @param instance A pointer to the DSA device.
> > + * @param work_queue  A pointer to the DSA work queue.
> > + */
> > +static void
> > +dsa_device_init(struct dsa_device *instance,
> > +                void *dsa_work_queue)
> > +{
> > +    instance->work_queue = dsa_work_queue;
> > +}
> > +
> > +/**
> > + * @brief Cleans up a DSA device structure.
> > + *
> > + * @param instance A pointer to the DSA device to cleanup.
> > + */
> > +static void
> > +dsa_device_cleanup(struct dsa_device *instance)
> > +{
> > +    if (instance->work_queue != MAP_FAILED) {
> > +        munmap(instance->work_queue, DSA_WQ_SIZE);
> > +    }
> > +}
> > +
> > +/**
> > + * @brief Initializes a DSA device group.
> > + *
> > + * @param group A pointer to the DSA device group.
> > + * @param num_dsa_devices The number of DSA devices this group will have.
> > + *
> > + * @return Zero if successful, non-zero otherwise.
> > + */
> > +static int
> > +dsa_device_group_init(struct dsa_device_group *group,
> > +                      const char *dsa_parameter)
>
> The documentation doesn't match the signature. This happens in other
> places as well, please review all of them.
>
Fixed all cases.

> > +{
> > +    if (dsa_parameter == NULL || strlen(dsa_parameter) == 0) {
> > +        return 0;
> > +    }
> > +
> > +    int ret = 0;
> > +    char *local_dsa_parameter = g_strdup(dsa_parameter);
> > +    const char *dsa_path[MAX_DSA_DEVICES];
> > +    int num_dsa_devices = 0;
> > +    char delim[2] = " ";
>
> So we're using space separated strings. Let's document this in this file
> and also on the migration parameter documentation.

Fixed.
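
As an aside, a GLib-flavored parse of the same space-separated format
could look like this sketch (illustrative only; the series keeps the
strtok() loop above):

    g_auto(GStrv) paths = g_strsplit(dsa_parameter, " ", MAX_DSA_DEVICES);

    for (int i = 0; paths[i] != NULL; i++) {
        /* paths[i] is one DSA work queue path, e.g. "/dev/dsa/wq0.0" */
    }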

>
> > +
> > +    char *current_dsa_path = strtok(local_dsa_parameter, delim);
> > +
> > +    while (current_dsa_path != NULL) {
> > +        dsa_path[num_dsa_devices++] = current_dsa_path;
> > +        if (num_dsa_devices == MAX_DSA_DEVICES) {
> > +            break;
> > +        }
> > +        current_dsa_path = strtok(NULL, delim);
> > +    }
> > +
> > +    group->dsa_devices =
> > +        malloc(sizeof(struct dsa_device) * num_dsa_devices);
>
> Use g_new0() here.

Converted to use g_new0 and g_free accordingly.

>
> > +    group->num_dsa_devices = num_dsa_devices;
> > +    group->index = 0;
> > +
> > +    group->running = false;
> > +    qemu_mutex_init(&group->task_queue_lock);
> > +    qemu_cond_init(&group->task_queue_cond);
> > +    QSIMPLEQ_INIT(&group->task_queue);
> > +
> > +    void *dsa_wq = MAP_FAILED;
> > +    for (int i = 0; i < num_dsa_devices; i++) {
> > +        dsa_wq = map_dsa_device(dsa_path[i]);
> > +        if (dsa_wq == MAP_FAILED) {
> > +            fprintf(stderr, "map_dsa_device failed MAP_FAILED, "
> > +                    "using simulation.\n");
>
> What does "using simulation" means? And how are doing it by returning -1
> from this function?

* "using simulation" was a copy and paste mistake. Removed that.
* -1 is an error code and will be propagated from
dsa_device_group_init to dsa_init and eventually to
multifd_load_setup/multifd_save_setup.
multifd_load_setup/multifd_save_setup now checks the return code from
dsa_init and aborts the migration if dsa_init fails.
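
A minimal sketch of what that check could look like in the setup path
(assumed shape; the actual v3 code may differ):

    if (dsa_init(migrate_multifd_dsa_accel()) != 0) {
        error_setg(errp, "multifd: failed to initialize DSA devices");
        return -1;
    }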

>
> > +            ret = -1;
>
> What about the memory for group->dsa_devices in the failure case? We
> should either free it here or make sure the client code calls the
> cleanup routines.

In the failure case, dsa_device_group_cleanup will free the
group->dsa_devices memory allocation. dsa_device_group_cleanup is
called by dsa_cleanup. multifd_load_cleanup/multifd_save_cleanup will
call the cleanup routines.

>
> > +            goto exit;
> > +        }
> > +        dsa_device_init(&dsa_group.dsa_devices[i], dsa_wq);
> > +    }
> > +
> > +exit:
> > +    g_free(local_dsa_parameter);
> > +    return ret;
> > +}
> > +
> > +/**
> > + * @brief Starts a DSA device group.
> > + *
> > + * @param group A pointer to the DSA device group.
> > + * @param dsa_path An array of DSA device path.
> > + * @param num_dsa_devices The number of DSA devices in the device group.
> > + */
> > +static void
> > +dsa_device_group_start(struct dsa_device_group *group)
> > +{
> > +    group->running = true;
> > +}
> > +
> > +/**
> > + * @brief Stops a DSA device group.
> > + *
> > + * @param group A pointer to the DSA device group.
> > + */
> > +__attribute__((unused))
> > +static void
> > +dsa_device_group_stop(struct dsa_device_group *group)
> > +{
> > +    group->running = false;
> > +}
> > +
> > +/**
> > + * @brief Cleans up a DSA device group.
> > + *
> > + * @param group A pointer to the DSA device group.
> > + */
> > +static void
> > +dsa_device_group_cleanup(struct dsa_device_group *group)
> > +{
> > +    if (!group->dsa_devices) {
> > +        return;
> > +    }
> > +    for (int i = 0; i < group->num_dsa_devices; i++) {
> > +        dsa_device_cleanup(&group->dsa_devices[i]);
> > +    }
> > +    free(group->dsa_devices);
> > +    group->dsa_devices = NULL;
> > +
> > +    qemu_mutex_destroy(&group->task_queue_lock);
> > +    qemu_cond_destroy(&group->task_queue_cond);
> > +}
> > +
> > +/**
> > + * @brief Returns the next available DSA device in the group.
> > + *
> > + * @param group A pointer to the DSA device group.
> > + *
> > + * @return struct dsa_device* A pointer to the next available DSA device
> > + *         in the group.
> > + */
> > +__attribute__((unused))
> > +static struct dsa_device *
> > +dsa_device_group_get_next_device(struct dsa_device_group *group)
> > +{
> > +    if (group->num_dsa_devices == 0) {
> > +        return NULL;
> > +    }
> > +    uint32_t current = qatomic_fetch_inc(&group->index);
>
> The name "index" alone feels a bit opaque. Is there a more
> representative name we could give it?

I renamed it to device_allocator_index and added a comment to explain the field.
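
The renamed field would then presumably read along these lines (assumed
shape of the v3 change):

    struct dsa_device_group {
        struct dsa_device *dsa_devices;
        int num_dsa_devices;
        /*
         * Atomically incremented round-robin counter used by
         * dsa_device_group_get_next_device() to pick the next device;
         * readers take it modulo num_dsa_devices.
         */
        uint32_t device_allocator_index;
        /* remaining fields (running, lock, cond, task queue) unchanged */
    };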

>
> > +    current %= group->num_dsa_devices;
> > +    return &group->dsa_devices[current];
> > +}
> > +
> > +/**
> > + * @brief Check if DSA is running.
> > + *
> > + * @return True if DSA is running, otherwise false.
> > + */
> > +bool dsa_is_running(void)
> > +{
> > +    return false;
> > +}
> > +
> > +static void
> > +dsa_globals_init(void)
> > +{
> > +    max_retry_count = UINT64_MAX;
> > +}
> > +
> > +/**
> > + * @brief Initializes DSA devices.
> > + *
> > + * @param dsa_parameter A list of DSA device path from migration parameter.
> > + * @return int Zero if successful, otherwise non zero.
> > + */
> > +int dsa_init(const char *dsa_parameter)
> > +{
> > +    dsa_globals_init();
> > +
> > +    return dsa_device_group_init(&dsa_group, dsa_parameter);
> > +}
> > +
> > +/**
> > + * @brief Start logic to enable using DSA.
> > + *
> > + */
> > +void dsa_start(void)
> > +{
> > +    if (dsa_group.num_dsa_devices == 0) {
> > +        return;
> > +    }
> > +    if (dsa_group.running) {
> > +        return;
> > +    }
> > +    dsa_device_group_start(&dsa_group);
> > +}
> > +
> > +/**
> > + * @brief Stop logic to clean up DSA by halting the device group and cleaning up
> > + * the completion thread.
> > + *
> > + */
> > +void dsa_stop(void)
> > +{
> > +    struct dsa_device_group *group = &dsa_group;
> > +
> > +    if (!group->running) {
> > +        return;
> > +    }
> > +}
> > +
> > +/**
> > + * @brief Clean up system resources created for DSA offloading.
> > + *        This function is called during QEMU process teardown.
> > + *
> > + */
> > +void dsa_cleanup(void)
> > +{
> > +    dsa_stop();
> > +    dsa_device_group_cleanup(&dsa_group);
> > +}
> > +
> > +#else
> > +
> > +bool dsa_is_running(void)
> > +{
> > +    return false;
> > +}
> > +
> > +int dsa_init(const char *dsa_parameter)
> > +{
> > +    fprintf(stderr, "Intel Data Streaming Accelerator is not supported "
> > +                    "on this platform.\n");
> > +    return -1;
>
> Nothing checks this later in the series and we end up trying to start a
> migration when we shouldn't. Fixing the configure step would already
> stop this happening, but make sure you check this anyway and abort the
> migration.

multifd_load_setup/multifd_save_setup now checks the return code from
dsa_init and aborts the migration if dsa_init fails. The
non-CONFIG_DSA_OPT version of dsa_init should really just be a no-op.
Changed that.

>
> > +}
> > +
> > +void dsa_start(void) {}
> > +
> > +void dsa_stop(void) {}
> > +
> > +void dsa_cleanup(void) {}
> > +
> > +#endif
>
> These could all be in the header.

The function declarations are already in dsa.h. Do you mean moving the
function implementations to the header as well?

>
> > +
> > diff --git a/util/meson.build b/util/meson.build
> > index c2322ef6e7..f7277c5e9b 100644
> > --- a/util/meson.build
> > +++ b/util/meson.build
> > @@ -85,6 +85,7 @@ if have_block or have_ga
> >  endif
> >  if have_block
> >    util_ss.add(files('aio-wait.c'))
> > +  util_ss.add(files('dsa.c'))
>
> I find it clearer to add the file conditionally under CONFIG_DSA_OPT
> here and remove the ifdef from the C file. I'm not sure if we have any
> guidelines for this, so up to you.
>
> >    util_ss.add(files('buffer.c'))
> >    util_ss.add(files('bufferiszero.c'))
> >    util_ss.add(files('hbitmap.c'))



* Re: [External] Re: [PATCH v2 07/20] util/dsa: Implement DSA device start and stop logic.
  2023-12-19  6:41     ` [External] " Hao Xiang
@ 2023-12-19 13:18       ` Fabiano Rosas
  2023-12-27  6:00         ` Hao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Fabiano Rosas @ 2023-12-19 13:18 UTC (permalink / raw)
  To: Hao Xiang
  Cc: peter.maydell, quintela, peterx, marcandre.lureau, bryan.zhang,
	qemu-devel

Hao Xiang <hao.xiang@bytedance.com> writes:

>>
>> > +}
>> > +
>> > +void dsa_start(void) {}
>> > +
>> > +void dsa_stop(void) {}
>> > +
>> > +void dsa_cleanup(void) {}
>> > +
>> > +#endif
>>
>> These could all be in the header.
>
> The function declarations are already in dsa.h. Do you mean moving the
> function implementations to the header as well?
>

I mean the empty !CONFIG_DSA_OPT variants could be in the header as
static inline.
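
Something along these lines (a sketch; dsa_init returning 0 assumes
the no-op behaviour you mentioned above):

#ifdef CONFIG_DSA_OPT
/* real declarations here */
#else
static inline bool dsa_is_running(void) { return false; }
static inline int dsa_init(const char *dsa_parameter) { return 0; }
static inline void dsa_start(void) {}
static inline void dsa_stop(void) {}
static inline void dsa_cleanup(void) {}
#endif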




* Re: [External] Re: [PATCH v2 08/20] util/dsa: Implement DSA task enqueue and dequeue.
  2023-12-12 16:10   ` Fabiano Rosas
@ 2023-12-27  0:07     ` Hao Xiang
  0 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-12-27  0:07 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: peter.maydell, quintela, peterx, marcandre.lureau, bryan.zhang,
	qemu-devel

On Tue, Dec 12, 2023 at 8:10 AM Fabiano Rosas <farosas@suse.de> wrote:
>
> Hao Xiang <hao.xiang@bytedance.com> writes:
>
> > * Use a safe thread queue for DSA task enqueue/dequeue.
> > * Implement DSA task submission.
> > * Implement DSA batch task submission.
> >
> > Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> > ---
> >  include/qemu/dsa.h |  35 ++++++++
> >  util/dsa.c         | 196 +++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 231 insertions(+)
> >
> > diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> > index 30246b507e..23f55185be 100644
> > --- a/include/qemu/dsa.h
> > +++ b/include/qemu/dsa.h
> > @@ -12,6 +12,41 @@
> >  #include <linux/idxd.h>
> >  #include "x86intrin.h"
> >
> > +enum dsa_task_type {
>
> Our coding style requires CamelCase for enums and typedef'ed structures.

When I wrote this, I found numerous instances where snake_case enums
without typedefs are used. But I do see the CamelCase, typedef'ed
instances now. Converted to that.
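
E.g., roughly (a sketch of the converted form):

typedef enum {
    DSA_TASK = 0,
    DSA_BATCH_TASK
} DsaTaskType;

typedef enum {
    DSA_TASK_READY = 0,
    DSA_TASK_PROCESSING,
    DSA_TASK_COMPLETION
} DsaTaskStatus;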

>
> > +    DSA_TASK = 0,
> > +    DSA_BATCH_TASK
> > +};
> > +
> > +enum dsa_task_status {
> > +    DSA_TASK_READY = 0,
> > +    DSA_TASK_PROCESSING,
> > +    DSA_TASK_COMPLETION
> > +};
> > +
> > +typedef void (*buffer_zero_dsa_completion_fn)(void *);
>
> We don't really need the "buffer_zero" mention in any of this
> code. Simply dsa_batch_task or batch_task would suffice.

I removed the "buffer_zero" prefix in some of the places.

>
> > +
> > +typedef struct buffer_zero_batch_task {
> > +    struct dsa_hw_desc batch_descriptor;
> > +    struct dsa_hw_desc *descriptors;
> > +    struct dsa_completion_record batch_completion __attribute__((aligned(32)));
> > +    struct dsa_completion_record *completions;
> > +    struct dsa_device_group *group;
> > +    struct dsa_device *device;
> > +    buffer_zero_dsa_completion_fn completion_callback;
> > +    QemuSemaphore sem_task_complete;
> > +    enum dsa_task_type task_type;
> > +    enum dsa_task_status status;
> > +    bool *results;
> > +    int batch_size;
> > +    QSIMPLEQ_ENTRY(buffer_zero_batch_task) entry;
> > +} buffer_zero_batch_task;
>
> I see data specific to this implementation and data coming from the
> library, maybe these would be better organized in two separate
> structures with the qemu-specific having a pointer to the generic
> one. Looking ahead in the series, there seems to be migration data
> coming into this as well.

I refactored to create a generic structure batch_task and a DSA
specific version dsa_batch_task. batch_task has a pointer to
dsa_batch_task if DSA compilation option is enabled.
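
Roughly (a sketch of the split; the exact field layout differs in the
real patch):

/* Generic batch task handed around by migration code. */
typedef struct {
    bool *results;
    int batch_size;
#ifdef CONFIG_DSA_OPT
    /* DSA-specific state, only present when DSA support is built. */
    struct dsa_batch_task *dsa_batch;
#endif
} BatchTask;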

>
> > +
> > +#else
> > +
> > +struct buffer_zero_batch_task {
> > +    bool *results;
> > +};
> > +
> >  #endif
> >
> >  /**
> > diff --git a/util/dsa.c b/util/dsa.c
> > index 8edaa892ec..f82282ce99 100644
> > --- a/util/dsa.c
> > +++ b/util/dsa.c
> > @@ -245,6 +245,200 @@ dsa_device_group_get_next_device(struct dsa_device_group *group)
> >      return &group->dsa_devices[current];
> >  }
> >
> > +/**
> > + * @brief Empties out the DSA task queue.
> > + *
> > + * @param group A pointer to the DSA device group.
> > + */
> > +static void
> > +dsa_empty_task_queue(struct dsa_device_group *group)
> > +{
> > +    qemu_mutex_lock(&group->task_queue_lock);
> > +    dsa_task_queue *task_queue = &group->task_queue;
> > +    while (!QSIMPLEQ_EMPTY(task_queue)) {
> > +        QSIMPLEQ_REMOVE_HEAD(task_queue, entry);
> > +    }
> > +    qemu_mutex_unlock(&group->task_queue_lock);
> > +}
> > +
> > +/**
> > + * @brief Adds a task to the DSA task queue.
> > + *
> > + * @param group A pointer to the DSA device group.
> > + * @param context A pointer to the DSA task to enqueue.
> > + *
> > + * @return int Zero if successful, otherwise a proper error code.
> > + */
> > +static int
> > +dsa_task_enqueue(struct dsa_device_group *group,
> > +                 struct buffer_zero_batch_task *task)
> > +{
> > +    dsa_task_queue *task_queue = &group->task_queue;
> > +    QemuMutex *task_queue_lock = &group->task_queue_lock;
> > +    QemuCond *task_queue_cond = &group->task_queue_cond;
> > +
> > +    bool notify = false;
> > +
> > +    qemu_mutex_lock(task_queue_lock);
> > +
> > +    if (!group->running) {
> > +        fprintf(stderr, "DSA: Tried to queue task to stopped device queue\n");
> > +        qemu_mutex_unlock(task_queue_lock);
> > +        return -1;
> > +    }
> > +
> > +    // The queue is empty. This enqueue operation is a 0->1 transition.
> > +    if (QSIMPLEQ_EMPTY(task_queue))
> > +        notify = true;
> > +
> > +    QSIMPLEQ_INSERT_TAIL(task_queue, task, entry);
> > +
> > +    // We need to notify the waiter for 0->1 transitions.
> > +    if (notify)
> > +        qemu_cond_signal(task_queue_cond);
> > +
> > +    qemu_mutex_unlock(task_queue_lock);
> > +
> > +    return 0;
> > +}
> > +
> > +/**
> > + * @brief Takes a DSA task out of the task queue.
> > + *
> > + * @param group A pointer to the DSA device group.
> > + * @return buffer_zero_batch_task* The DSA task being dequeued.
> > + */
> > +__attribute__((unused))
> > +static struct buffer_zero_batch_task *
> > +dsa_task_dequeue(struct dsa_device_group *group)
> > +{
> > +    struct buffer_zero_batch_task *task = NULL;
> > +    dsa_task_queue *task_queue = &group->task_queue;
> > +    QemuMutex *task_queue_lock = &group->task_queue_lock;
> > +    QemuCond *task_queue_cond = &group->task_queue_cond;
> > +
> > +    qemu_mutex_lock(task_queue_lock);
> > +
> > +    while (true) {
> > +        if (!group->running)
> > +            goto exit;
> > +        task = QSIMPLEQ_FIRST(task_queue);
> > +        if (task != NULL) {
> > +            break;
> > +        }
> > +        qemu_cond_wait(task_queue_cond, task_queue_lock);
> > +    }
> > +
> > +    QSIMPLEQ_REMOVE_HEAD(task_queue, entry);
> > +
> > +exit:
> > +    qemu_mutex_unlock(task_queue_lock);
> > +    return task;
> > +}
> > +
> > +/**
> > + * @brief Submits a DSA work item to the device work queue.
> > + *
> > + * @param wq A pointer to the DSA work queue's device memory.
> > + * @param descriptor A pointer to the DSA work item descriptor.
> > + *
> > + * @return Zero if successful, non-zero otherwise.
> > + */
> > +static int
> > +submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
> > +{
> > +    uint64_t retry = 0;
> > +
> > +    _mm_sfence();
> > +
> > +    while (true) {
> > +        if (_enqcmd(wq, descriptor) == 0) {
> > +            break;
> > +        }
> > +        retry++;
> > +        if (retry > max_retry_count) {
>
> 'max_retry_count' is UINT64_MAX so 'retry' will wrap around.
>
> > +            fprintf(stderr, "Submit work retry %lu times.\n", retry);
> > +            exit(1);
>
> Is this not the case where we'd fallback to the CPU?

"retry" here means _enqcmd returned a failure because the shared DSA
queue is full. When we run out of retry counts, we definitely have a
bug that prevents any DSA task from completing. So this situation is
really not expected and we don't want to fallback to use CPU.

>
> You should not exit() here, but return non-zero as the documentation
> mentions and the callers expect.

 I will propagate this error all the way up to multifd_send_thread.
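
I.e., roughly (a sketch of the intended change, not the final diff;
this also assumes max_retry_count gets a finite bound):

static int
submit_wi_int(void *wq, struct dsa_hw_desc *descriptor)
{
    uint64_t retry = 0;

    _mm_sfence();

    while (_enqcmd(wq, descriptor) != 0) {
        retry++;
        if (retry > max_retry_count) {
            fprintf(stderr, "Submit work retry %lu times.\n", retry);
            /* Fail the submission; callers propagate this up to
             * multifd_send_thread, which aborts the migration. */
            return -1;
        }
    }

    return 0;
}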

>
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +/**
> > + * @brief Synchronously submits a DSA work item to the
> > + *        device work queue.
> > + *
> > + * @param wq A pointer to the DSA work queue's device memory.
> > + * @param descriptor A pointer to the DSA work item descriptor.
> > + *
> > + * @return int Zero if successful, non-zero otherwise.
> > + */
> > +__attribute__((unused))
> > +static int
> > +submit_wi(void *wq, struct dsa_hw_desc *descriptor)
> > +{
> > +    return submit_wi_int(wq, descriptor);
> > +}
> > +
> > +/**
> > + * @brief Asynchronously submits a DSA work item to the
> > + *        device work queue.
> > + *
> > + * @param task A pointer to the buffer zero task.
> > + *
> > + * @return int Zero if successful, non-zero otherwise.
> > + */
> > +__attribute__((unused))
> > +static int
> > +submit_wi_async(struct buffer_zero_batch_task *task)
> > +{
> > +    struct dsa_device_group *device_group = task->group;
> > +    struct dsa_device *device_instance = task->device;
> > +    int ret;
> > +
> > +    assert(task->task_type == DSA_TASK);
> > +
> > +    task->status = DSA_TASK_PROCESSING;
> > +
> > +    ret = submit_wi_int(device_instance->work_queue,
> > +                        &task->descriptors[0]);
> > +    if (ret != 0)
> > +        return ret;
> > +
> > +    return dsa_task_enqueue(device_group, task);
> > +}
> > +
> > +/**
> > + * @brief Asynchronously submits a DSA batch work item to the
> > + *        device work queue.
> > + *
> > + * @param batch_task A pointer to the batch buffer zero task.
> > + *
> > + * @return int Zero if successful, non-zero otherwise.
> > + */
> > +__attribute__((unused))
> > +static int
> > +submit_batch_wi_async(struct buffer_zero_batch_task *batch_task)
> > +{
> > +    struct dsa_device_group *device_group = batch_task->group;
> > +    struct dsa_device *device_instance = batch_task->device;
> > +    int ret;
> > +
> > +    assert(batch_task->task_type == DSA_BATCH_TASK);
> > +    assert(batch_task->batch_descriptor.desc_count <= batch_task->batch_size);
> > +    assert(batch_task->status == DSA_TASK_READY);
> > +
> > +    batch_task->status = DSA_TASK_PROCESSING;
> > +
> > +    ret = submit_wi_int(device_instance->work_queue,
> > +                        &batch_task->batch_descriptor);
> > +    if (ret != 0)
> > +        return ret;
> > +
> > +    return dsa_task_enqueue(device_group, batch_task);
> > +}
>
> At this point in the series submit_wi_async() and
> submit_batch_wi_async() look the same to me without the asserts. Can't
> we consolidate them?
>
> There's also the fact that both functions receive a _batch_ task but one
> is supposed to work in batches and the other is not. That could be
> solved by renaming the structure I guess.

So we do need two functions to handle a single task and a batch task
respectively. This is due to how DSA is designed at the lower level.
When we submit a task to DSA hardware, the task descriptor can be an
individual task or a batch task containing a pointer to an array of
individual tasks. The workflow tries to aggregate many individual
tasks into a batch task. However, there are times when only one task
is available, and DSA doesn't accept a batch task descriptor with
only one individual task in it, so we always need a path to submit an
individual task. I used to have two data structures representing an
individual task and a batch task, but I have converged them into the
batch task now. The two functions just use different fields of the
same structure to process an individual task vs. a batch task:
submit_wi_async and submit_batch_wi_async differ only in the actual
descriptor passed into the submit_wi_int call. Yes, the two functions
look similar, but they are not completely the same, and because the
implementation is so simple it isn't worth adding a unified helper
layer that both of them call. I went back and forth between the
current implementation and the solution you suggested but ended up
keeping the current implementation. Let me know if you still prefer a
converged helper function, though. (The caller-side dispatch is
sketched below.)
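
For illustration, the caller-side dispatch ends up roughly like this
(a sketch; ret and count come from the surrounding caller):

if (count == 1) {
    /* DSA rejects a batch descriptor containing a single task. */
    ret = submit_wi_async(batch_task);
} else {
    ret = submit_batch_wi_async(batch_task);
}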

>
> > +
> >  /**
> >   * @brief Check if DSA is running.
> >   *
> > @@ -301,6 +495,8 @@ void dsa_stop(void)
> >      if (!group->running) {
> >          return;
> >      }
> > +
> > +    dsa_empty_task_queue(group);
> >  }
> >
> >  /**



* Re: [External] Re: [PATCH v2 07/20] util/dsa: Implement DSA device start and stop logic.
  2023-12-19 13:18       ` Fabiano Rosas
@ 2023-12-27  6:00         ` Hao Xiang
  0 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-12-27  6:00 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: peter.maydell, quintela, peterx, marcandre.lureau, bryan.zhang,
	qemu-devel

On Tue, Dec 19, 2023 at 5:19 AM Fabiano Rosas <farosas@suse.de> wrote:
>
> Hao Xiang <hao.xiang@bytedance.com> writes:
>
> >>
> >> > +}
> >> > +
> >> > +void dsa_start(void) {}
> >> > +
> >> > +void dsa_stop(void) {}
> >> > +
> >> > +void dsa_cleanup(void) {}
> >> > +
> >> > +#endif
> >>
> >> These could all be in the header.
> >
> > The function declarations are already in dsa.h. Do you mean moving the
> > function implementations to the header as well?
> >
>
> I mean the empty !CONFIG_DSA_OPT variants could be in the header as
> static inline.
>

Fixed.



* Re: [External] Re: [PATCH v2 11/20] util/dsa: Implement DSA task asynchronous submission and wait for completion.
  2023-12-13 14:01   ` Fabiano Rosas
@ 2023-12-27  6:26     ` Hao Xiang
  0 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2023-12-27  6:26 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: peter.maydell, quintela, peterx, marcandre.lureau, bryan.zhang,
	qemu-devel

On Wed, Dec 13, 2023 at 6:01 AM Fabiano Rosas <farosas@suse.de> wrote:
>
> Hao Xiang <hao.xiang@bytedance.com> writes:
>
> > * Add a DSA task completion callback.
> > * DSA completion thread will call the tasks's completion callback
> > on every task/batch task completion.
> > * DSA submission path to wait for completion.
> > * Implement CPU fallback if DSA is not able to complete the task.
> >
> > Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> > Signed-off-by: Bryan Zhang <bryan.zhang@bytedance.com>
> > ---
> >  include/qemu/dsa.h |  14 +++++
> >  util/dsa.c         | 153 ++++++++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 164 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/qemu/dsa.h b/include/qemu/dsa.h
> > index b10e7b8fb7..3f8ee07004 100644
> > --- a/include/qemu/dsa.h
> > +++ b/include/qemu/dsa.h
> > @@ -65,6 +65,20 @@ void buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
> >   */
> >  void buffer_zero_batch_task_destroy(struct buffer_zero_batch_task *task);
> >
> > +/**
> > + * @brief Performs buffer zero comparison on a DSA batch task asynchronously.
> > + *
> > + * @param batch_task A pointer to the batch task.
> > + * @param buf An array of memory buffers.
> > + * @param count The number of buffers in the array.
> > + * @param len The buffer length.
> > + *
> > + * @return Zero if successful, otherwise non-zero.
> > + */
> > +int
> > +buffer_is_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
> > +                               const void **buf, size_t count, size_t len);
> > +
> >  /**
> >   * @brief Initializes DSA devices.
> >   *
> > diff --git a/util/dsa.c b/util/dsa.c
> > index 3cc017b8a0..06c6fbf2ca 100644
> > --- a/util/dsa.c
> > +++ b/util/dsa.c
> > @@ -470,6 +470,41 @@ poll_completion(struct dsa_completion_record *completion,
> >      return 0;
> >  }
> >
> > +/**
> > + * @brief Use CPU to complete a single zero page checking task.
> > + *
> > + * @param task A pointer to the task.
> > + */
> > +static void
> > +task_cpu_fallback(struct buffer_zero_batch_task *task)
> > +{
> > +    assert(task->task_type == DSA_TASK);
> > +
> > +    struct dsa_completion_record *completion = &task->completions[0];
> > +    const uint8_t *buf;
> > +    size_t len;
> > +
> > +    if (completion->status == DSA_COMP_SUCCESS) {
> > +        return;
> > +    }
> > +
> > +    /*
> > +     * DSA was able to partially complete the operation. Check the
> > +     * result. If we already know this is not a zero page, we can
> > +     * return now.
> > +     */
> > +    if (completion->bytes_completed != 0 && completion->result != 0) {
> > +        task->results[0] = false;
> > +        return;
> > +    }
> > +
> > +    /* Let's fallback to use CPU to complete it. */
> > +    buf = (const uint8_t *)task->descriptors[0].src_addr;
> > +    len = task->descriptors[0].xfer_size;
> > +    task->results[0] = buffer_is_zero(buf + completion->bytes_completed,
> > +                                      len - completion->bytes_completed);
> > +}
> > +
> >  /**
> >   * @brief Complete a single DSA task in the batch task.
> >   *
> > @@ -548,6 +583,62 @@ poll_batch_task_completion(struct buffer_zero_batch_task *batch_task)
> >      }
> >  }
> >
> > +/**
> > + * @brief Use CPU to complete the zero page checking batch task.
> > + *
> > + * @param batch_task A pointer to the batch task.
> > + */
> > +static void
> > +batch_task_cpu_fallback(struct buffer_zero_batch_task *batch_task)
> > +{
> > +    assert(batch_task->task_type == DSA_BATCH_TASK);
> > +
> > +    struct dsa_completion_record *batch_completion =
> > +        &batch_task->batch_completion;
> > +    struct dsa_completion_record *completion;
> > +    uint8_t status;
> > +    const uint8_t *buf;
> > +    size_t len;
> > +    bool *results = batch_task->results;
> > +    uint32_t count = batch_task->batch_descriptor.desc_count;
> > +
> > +    // DSA is able to complete the entire batch task.
> > +    if (batch_completion->status == DSA_COMP_SUCCESS) {
> > +        assert(count == batch_completion->bytes_completed);
> > +        return;
> > +    }
> > +
> > +    /*
> > +     * DSA encounters some error and is not able to complete
> > +     * the entire batch task. Use CPU fallback.
> > +     */
> > +    for (int i = 0; i < count; i++) {
> > +        completion = &batch_task->completions[i];
> > +        status = completion->status;
> > +        if (status == DSA_COMP_SUCCESS) {
> > +            continue;
> > +        }
> > +        assert(status == DSA_COMP_PAGE_FAULT_NOBOF);
> > +
> > +        /*
> > +         * DSA was able to partially complete the operation. Check the
> > +         * result. If we already know this is not a zero page, we can
> > +         * return now.
> > +         */
> > +        if (completion->bytes_completed != 0 && completion->result != 0) {
> > +            results[i] = false;
> > +            continue;
> > +        }
> > +
> > +        /* Let's fallback to use CPU to complete it. */
> > +        buf = (uint8_t *)batch_task->descriptors[i].src_addr;
> > +        len = batch_task->descriptors[i].xfer_size;
> > +        results[i] =
> > +            buffer_is_zero(buf + completion->bytes_completed,
> > +                           len - completion->bytes_completed);
>
> Here the same thing is happening as in other patches, the batch task
> operation is just a repeat of the task operation n times. So this whole
> inner code here could be nicely replaced by task_cpu_fallback() with
> some adjustment of the function arguments. That makes intuitive sense
> and removes code duplication.

Added a helper function task_cpu_fallback_int() to remove the duplicated code.
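
Roughly (a sketch; the helper name is from the reply above, the exact
signature may differ):

static void
task_cpu_fallback_int(struct dsa_completion_record *completion,
                      struct dsa_hw_desc *descriptor, bool *result)
{
    const uint8_t *buf;
    size_t len;

    if (completion->status == DSA_COMP_SUCCESS) {
        return;
    }

    /* DSA already saw a non-zero byte; no CPU work is needed. */
    if (completion->bytes_completed != 0 && completion->result != 0) {
        *result = false;
        return;
    }

    /* Fall back to the CPU for the rest of the buffer. */
    buf = (const uint8_t *)descriptor->src_addr;
    len = descriptor->xfer_size;
    *result = buffer_is_zero(buf + completion->bytes_completed,
                             len - completion->bytes_completed);
}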

>
> > +    }
> > +}
> > +
> >  /**
> >   * @brief Handles an asynchronous DSA batch task completion.
> >   *
> > @@ -825,7 +916,6 @@ buffer_zero_batch_task_set(struct buffer_zero_batch_task *batch_task,
> >   *
> >   * @return int Zero if successful, otherwise an appropriate error code.
> >   */
> > -__attribute__((unused))
> >  static int
> >  buffer_zero_dsa_async(struct buffer_zero_batch_task *task,
> >                        const void *buf, size_t len)
> > @@ -844,7 +934,6 @@ buffer_zero_dsa_async(struct buffer_zero_batch_task *task,
> >   * @param count The number of buffers.
> >   * @param len The buffer length.
> >   */
> > -__attribute__((unused))
> >  static int
> >  buffer_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
> >                              const void **buf, size_t count, size_t len)
> > @@ -876,13 +965,29 @@ buffer_zero_dsa_completion(void *context)
> >   *
> >   * @param batch_task A pointer to the buffer zero comparison batch task.
> >   */
> > -__attribute__((unused))
> >  static void
> >  buffer_zero_dsa_wait(struct buffer_zero_batch_task *batch_task)
> >  {
> >      qemu_sem_wait(&batch_task->sem_task_complete);
> >  }
> >
> > +/**
> > + * @brief Use CPU to complete the zero page checking task if DSA
> > + *        is not able to complete it.
> > + *
> > + * @param batch_task A pointer to the batch task.
> > + */
> > +static void
> > +buffer_zero_cpu_fallback(struct buffer_zero_batch_task *batch_task)
> > +{
> > +    if (batch_task->task_type == DSA_TASK) {
> > +        task_cpu_fallback(batch_task);
> > +    } else {
> > +        assert(batch_task->task_type == DSA_BATCH_TASK);
> > +        batch_task_cpu_fallback(batch_task);
> > +    }
> > +}
> > +
> >  /**
> >   * @brief Check if DSA is running.
> >   *
> > @@ -956,6 +1061,41 @@ void dsa_cleanup(void)
> >      dsa_device_group_cleanup(&dsa_group);
> >  }
> >
> > +/**
> > + * @brief Performs buffer zero comparison on a DSA batch task asynchronously.
> > + *
> > + * @param batch_task A pointer to the batch task.
> > + * @param buf An array of memory buffers.
> > + * @param count The number of buffers in the array.
> > + * @param len The buffer length.
> > + *
> > + * @return Zero if successful, otherwise non-zero.
> > + */
> > +int
> > +buffer_is_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
> > +                               const void **buf, size_t count, size_t len)
> > +{
> > +    if (count <= 0 || count > batch_task->batch_size) {
> > +        return -1;
> > +    }
> > +
> > +    assert(batch_task != NULL);
> > +    assert(len != 0);
> > +    assert(buf != NULL);
> > +
> > +    if (count == 1) {
> > +        // DSA doesn't take batch operation with only 1 task.
> > +        buffer_zero_dsa_async(batch_task, buf[0], len);
> > +    } else {
> > +        buffer_zero_dsa_batch_async(batch_task, buf, count, len);
> > +    }
> > +
> > +    buffer_zero_dsa_wait(batch_task);
> > +    buffer_zero_cpu_fallback(batch_task);
> > +
> > +    return 0;
> > +}
> > +
> >  #else
> >
> >  void buffer_zero_batch_task_init(struct buffer_zero_batch_task *task,
> > @@ -981,5 +1121,12 @@ void dsa_stop(void) {}
> >
> >  void dsa_cleanup(void) {}
> >
> > +int
> > +buffer_is_zero_dsa_batch_async(struct buffer_zero_batch_task *batch_task,
> > +                               const void **buf, size_t count, size_t len)
> > +{
> > +    exit(1);
> > +}
> > +
> >  #endif



* Re: [External] Re: [PATCH v2 18/20] migration/multifd: Enable set packet size migration option.
  2023-12-13 17:33   ` Fabiano Rosas
@ 2024-01-03 20:04     ` Hao Xiang
  0 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2024-01-03 20:04 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: peter.maydell, quintela, peterx, marcandre.lureau, bryan.zhang,
	qemu-devel

On Wed, Dec 13, 2023 at 9:33 AM Fabiano Rosas <farosas@suse.de> wrote:
>
> Hao Xiang <hao.xiang@bytedance.com> writes:
>
> > During live migration, if the latency between sender and receiver
> > is high but bandwidth is high (a long and fat pipe), using a bigger
> > packet size can help reduce migration total time. In addition, Intel
> > DSA offloading performs better with a large batch task. Providing an
> > option to set the packet size is useful for performance tuning.
> >
> > Set the option:
> > migrate_set_parameter multifd-packet-size 512
>
> This should continue being bytes; we just needed code enforcing it to
> be a multiple of the page size at migrate_params_check().
>

OK. I switched back to using bytes and enforced a multiple of the page
size in migrate_params_check().
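
Roughly (a sketch of the check; the field names follow the new
multifd-packet-size option and may differ in the final patch):

if (params->has_multifd_packet_size &&
    (params->multifd_packet_size % qemu_target_page_size() != 0)) {
    error_setg(errp, "Parameter 'multifd-packet-size' must be a "
               "multiple of the target page size");
    return false;
}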



* Re: [External] Re: [PATCH v2 04/20] So we use multifd to transmit zero pages.
  2023-11-16 15:14   ` Fabiano Rosas
@ 2024-01-23  4:28     ` Hao Xiang
  2024-01-25 21:55       ` Hao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2024-01-23  4:28 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: peter.maydell, quintela, peterx, marcandre.lureau, bryan.zhang,
	qemu-devel, Leonardo Bras

On Thu, Nov 16, 2023 at 7:14 AM Fabiano Rosas <farosas@suse.de> wrote:
>
> Hao Xiang <hao.xiang@bytedance.com> writes:
>
> > From: Juan Quintela <quintela@redhat.com>
> >
> > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > Reviewed-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >  migration/multifd.c |  7 ++++---
> >  migration/options.c | 13 +++++++------
> >  migration/ram.c     | 45 ++++++++++++++++++++++++++++++++++++++-------
> >  qapi/migration.json |  1 -
> >  4 files changed, 49 insertions(+), 17 deletions(-)
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 1b994790d5..1198ffde9c 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -13,6 +13,7 @@
> >  #include "qemu/osdep.h"
> >  #include "qemu/cutils.h"
> >  #include "qemu/rcu.h"
> > +#include "qemu/cutils.h"
> >  #include "exec/target_page.h"
> >  #include "sysemu/sysemu.h"
> >  #include "exec/ramblock.h"
> > @@ -459,7 +460,6 @@ static int multifd_send_pages(QEMUFile *f)
> >      p->packet_num = multifd_send_state->packet_num++;
> >      multifd_send_state->pages = p->pages;
> >      p->pages = pages;
> > -
> >      qemu_mutex_unlock(&p->mutex);
> >      qemu_sem_post(&p->sem);
> >
> > @@ -684,7 +684,7 @@ static void *multifd_send_thread(void *opaque)
> >      MigrationThread *thread = NULL;
> >      Error *local_err = NULL;
> >      /* qemu older than 8.2 don't understand zero page on multifd channel */
> > -    bool use_zero_page = !migrate_use_main_zero_page();
> > +    bool use_multifd_zero_page = !migrate_use_main_zero_page();
> >      int ret = 0;
> >      bool use_zero_copy_send = migrate_zero_copy_send();
> >
> > @@ -713,6 +713,7 @@ static void *multifd_send_thread(void *opaque)
> >              RAMBlock *rb = p->pages->block;
> >              uint64_t packet_num = p->packet_num;
> >              uint32_t flags;
> > +
> >              p->normal_num = 0;
> >              p->zero_num = 0;
> >
> > @@ -724,7 +725,7 @@ static void *multifd_send_thread(void *opaque)
> >
> >              for (int i = 0; i < p->pages->num; i++) {
> >                  uint64_t offset = p->pages->offset[i];
> > -                if (use_zero_page &&
> > +                if (use_multifd_zero_page &&
>
> We could have a new function in multifd_ops for zero page
> handling. We're already considering an accelerator for the compression
> method in the other series[1] and in this series we're adding an
> accelerator for zero page checking. It's about time we make the
> multifd_ops generic instead of only compression/no compression.

Sorry I overlooked this email earlier.
I will extend the multifd_ops interface and add a new API for zero
page checking.
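
Roughly, the ops table would grow a hook along these lines (a sketch;
the hook name is illustrative):

typedef struct {
    /* existing send/recv hooks elided */

    /* Scan p->pages and fill in p->zero[] / p->normal[]. */
    void (*zero_page_detect)(MultiFDSendParams *p);
} MultiFDMethods;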

>
> 1- [PATCH v2 0/4] Live Migration Acceleration with IAA Compression
> https://lore.kernel.org/r/20231109154638.488213-1-yuan1.liu@intel.com
>
> >                      buffer_is_zero(rb->host + offset, p->page_size)) {
> >                      p->zero[p->zero_num] = offset;
> >                      p->zero_num++;
> > diff --git a/migration/options.c b/migration/options.c
> > index 00c0c4a0d6..97d121d4d7 100644
> > --- a/migration/options.c
> > +++ b/migration/options.c
> > @@ -195,6 +195,7 @@ Property migration_properties[] = {
> >      DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
> >      DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
> >      DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
> > +    DEFINE_PROP_MIG_CAP("x-main-zero-page", MIGRATION_CAPABILITY_MAIN_ZERO_PAGE),
> >      DEFINE_PROP_MIG_CAP("x-background-snapshot",
> >              MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
> >  #ifdef CONFIG_LINUX
> > @@ -288,13 +289,9 @@ bool migrate_multifd(void)
> >
> >  bool migrate_use_main_zero_page(void)
> >  {
> > -    //MigrationState *s;
> > -
> > -    //s = migrate_get_current();
> > +    MigrationState *s = migrate_get_current();
> >
> > -    // We will enable this when we add the right code.
> > -    // return s->enabled_capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
> > -    return true;
> > +    return s->capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
>
> What happens if we disable main-zero-page while multifd is not enabled?

In ram.c
...
if (migrate_multifd() && !migrate_use_main_zero_page()) {
    migration_ops->ram_save_target_page = ram_save_target_page_multifd;
} else {
    migration_ops->ram_save_target_page = ram_save_target_page_legacy;
}
...

So if main-zero-page is disabled and multifd is also disabled, it will
go with the "else" path, which is the legacy path
ram_save_target_page_legacy() and do zero page checking from the main
thread.

>
> >  }
> >
> >  bool migrate_pause_before_switchover(void)
> > @@ -457,6 +454,7 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
> >      MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE,
> >      MIGRATION_CAPABILITY_RETURN_PATH,
> >      MIGRATION_CAPABILITY_MULTIFD,
> > +    MIGRATION_CAPABILITY_MAIN_ZERO_PAGE,
> >      MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER,
> >      MIGRATION_CAPABILITY_AUTO_CONVERGE,
> >      MIGRATION_CAPABILITY_RELEASE_RAM,
> > @@ -534,6 +532,9 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
> >              error_setg(errp, "Postcopy is not yet compatible with multifd");
> >              return false;
> >          }
> > +        if (new_caps[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE]) {
> > +            error_setg(errp, "Postcopy is not yet compatible with main zero copy");
> > +        }
>
> Won't this break compatibility for postcopy? A command that used
> to work will now have to disable main-zero-page first.

main-zero-page is disabled by default.

>
> >      }
> >
> >      if (new_caps[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 8c7886ab79..f7a42feff2 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -2059,17 +2059,42 @@ static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss)
> >      if (save_zero_page(rs, pss, offset)) {
> >          return 1;
> >      }
> > -
> >      /*
> > -     * Do not use multifd in postcopy as one whole host page should be
> > -     * placed.  Meanwhile postcopy requires atomic update of pages, so even
> > -     * if host page size == guest page size the dest guest during run may
> > -     * still see partially copied pages which is data corruption.
> > +     * Do not use multifd for:
> > +     * 1. Compression as the first page in the new block should be posted out
> > +     *    before sending the compressed page
> > +     * 2. In postcopy as one whole host page should be placed
> >       */
> > -    if (migrate_multifd() && !migration_in_postcopy()) {
> > +    if (!migrate_compress() && migrate_multifd() && !migration_in_postcopy()) {
> > +        return ram_save_multifd_page(pss->pss_channel, block, offset);
> > +    }
>
> This could go into ram_save_target_page_multifd like so:
>
> if (!migrate_compress() && !migration_in_postcopy() && !migration_main_zero_page()) {
>     return ram_save_multifd_page(pss->pss_channel, block, offset);
> } else {
>   return ram_save_target_page_legacy();
> }
>
> > +
> > +    return ram_save_page(rs, pss);
> > +}
> > +
> > +/**
> > + * ram_save_target_page_multifd: save one target page
> > + *
> > + * Returns the number of pages written
> > + *
> > + * @rs: current RAM state
> > + * @pss: data about the page we want to send
> > + */
> > +static int ram_save_target_page_multifd(RAMState *rs, PageSearchStatus *pss)
> > +{
> > +    RAMBlock *block = pss->block;
> > +    ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
> > +    int res;
> > +
> > +    if (!migration_in_postcopy()) {
> >          return ram_save_multifd_page(pss->pss_channel, block, offset);
> >      }
> >
> > +    res = save_zero_page(rs, pss, offset);
> > +    if (res > 0) {
> > +        return res;
> > +    }
> > +
> >      return ram_save_page(rs, pss);
> >  }
> >
> > @@ -2982,9 +3007,15 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> >      }
> >
> >      migration_ops = g_malloc0(sizeof(MigrationOps));
> > -    migration_ops->ram_save_target_page = ram_save_target_page_legacy;
> > +
> > +    if (migrate_multifd() && !migrate_use_main_zero_page()) {
> > +        migration_ops->ram_save_target_page = ram_save_target_page_multifd;
> > +    } else {
> > +        migration_ops->ram_save_target_page = ram_save_target_page_legacy;
> > +    }
>
> This should not check main-zero-page. Just have multifd vs. legacy and
> have the multifd function defer to _legacy if main-zero-page or
> in_postcopy.

I noticed that ram_save_target_page_legacy and
ram_save_target_page_multifd have a lot of overlap and are quite
confusing. I can refactor this path and take in your comments here.

1) Remove ram_save_multifd_page() call from
ram_save_target_page_legacy(). ram_save_multifd_page() will only be
called in ram_save_target_page_multifd().
2) Remove save_zero_page() and ram_save_page() from
ram_save_target_page_multifd().
3) Postcopy will always go with the ram_save_target_page_legacy() path.
4) Legacy compression will always go with the
ram_save_target_page_legacy() path.
5) Call ram_save_target_page_legacy() from within
ram_save_target_page_multifd() if postcopy or legacy compression.
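
In other words, roughly (sketch):

static int ram_save_target_page_multifd(RAMState *rs, PageSearchStatus *pss)
{
    if (migrate_compress() || migration_in_postcopy()) {
        /* Legacy compression and postcopy stay on the legacy path. */
        return ram_save_target_page_legacy(rs, pss);
    }
    return ram_save_multifd_page(pss->pss_channel, pss->block,
                                 ((ram_addr_t)pss->page) << TARGET_PAGE_BITS);
}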

>
> >
> >      qemu_mutex_unlock_iothread();
> > +
> >      ret = multifd_send_sync_main(f);
> >      qemu_mutex_lock_iothread();
> >      if (ret < 0) {
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 09e4393591..9783289bfc 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -531,7 +531,6 @@
> >  #     and can result in more stable read performance.  Requires KVM
> >  #     with accelerator property "dirty-ring-size" set.  (Since 8.1)
> >  #
> > -#
> >  # @main-zero-page: If enabled, the detection of zero pages will be
> >  #                  done on the main thread.  Otherwise it is done on
> >  #                  the multifd threads.



* Re: [External] Re: [PATCH v2 04/20] So we use multifd to transmit zero pages.
  2024-01-23  4:28     ` [External] " Hao Xiang
@ 2024-01-25 21:55       ` Hao Xiang
  2024-01-25 23:14         ` Fabiano Rosas
  0 siblings, 1 reply; 51+ messages in thread
From: Hao Xiang @ 2024-01-25 21:55 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: peter.maydell, peterx, marcandre.lureau, bryan.zhang, qemu-devel,
	Leonardo Bras

On Mon, Jan 22, 2024 at 8:28 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
>
> On Thu, Nov 16, 2023 at 7:14 AM Fabiano Rosas <farosas@suse.de> wrote:
> >
> > Hao Xiang <hao.xiang@bytedance.com> writes:
> >
> > > From: Juan Quintela <quintela@redhat.com>
> > >
> > > Signed-off-by: Juan Quintela <quintela@redhat.com>
> > > Reviewed-by: Leonardo Bras <leobras@redhat.com>
> > > ---
> > >  migration/multifd.c |  7 ++++---
> > >  migration/options.c | 13 +++++++------
> > >  migration/ram.c     | 45 ++++++++++++++++++++++++++++++++++++++-------
> > >  qapi/migration.json |  1 -
> > >  4 files changed, 49 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/migration/multifd.c b/migration/multifd.c
> > > index 1b994790d5..1198ffde9c 100644
> > > --- a/migration/multifd.c
> > > +++ b/migration/multifd.c
> > > @@ -13,6 +13,7 @@
> > >  #include "qemu/osdep.h"
> > >  #include "qemu/cutils.h"
> > >  #include "qemu/rcu.h"
> > > +#include "qemu/cutils.h"
> > >  #include "exec/target_page.h"
> > >  #include "sysemu/sysemu.h"
> > >  #include "exec/ramblock.h"
> > > @@ -459,7 +460,6 @@ static int multifd_send_pages(QEMUFile *f)
> > >      p->packet_num = multifd_send_state->packet_num++;
> > >      multifd_send_state->pages = p->pages;
> > >      p->pages = pages;
> > > -
> > >      qemu_mutex_unlock(&p->mutex);
> > >      qemu_sem_post(&p->sem);
> > >
> > > @@ -684,7 +684,7 @@ static void *multifd_send_thread(void *opaque)
> > >      MigrationThread *thread = NULL;
> > >      Error *local_err = NULL;
> > >      /* qemu older than 8.2 don't understand zero page on multifd channel */
> > > -    bool use_zero_page = !migrate_use_main_zero_page();
> > > +    bool use_multifd_zero_page = !migrate_use_main_zero_page();
> > >      int ret = 0;
> > >      bool use_zero_copy_send = migrate_zero_copy_send();
> > >
> > > @@ -713,6 +713,7 @@ static void *multifd_send_thread(void *opaque)
> > >              RAMBlock *rb = p->pages->block;
> > >              uint64_t packet_num = p->packet_num;
> > >              uint32_t flags;
> > > +
> > >              p->normal_num = 0;
> > >              p->zero_num = 0;
> > >
> > > @@ -724,7 +725,7 @@ static void *multifd_send_thread(void *opaque)
> > >
> > >              for (int i = 0; i < p->pages->num; i++) {
> > >                  uint64_t offset = p->pages->offset[i];
> > > -                if (use_zero_page &&
> > > +                if (use_multifd_zero_page &&
> >
> > We could have a new function in multifd_ops for zero page
> > handling. We're already considering an accelerator for the compression
> > method in the other series[1] and in this series we're adding an
> > accelerator for zero page checking. It's about time we make the
> > multifd_ops generic instead of only compression/no compression.
>
> Sorry I overlooked this email earlier.
> I will extend the multifd_ops interface and add a new API for zero
> page checking.
>
> >
> > 1- [PATCH v2 0/4] Live Migration Acceleration with IAA Compression
> > https://lore.kernel.org/r/20231109154638.488213-1-yuan1.liu@intel.com
> >
> > >                      buffer_is_zero(rb->host + offset, p->page_size)) {
> > >                      p->zero[p->zero_num] = offset;
> > >                      p->zero_num++;
> > > diff --git a/migration/options.c b/migration/options.c
> > > index 00c0c4a0d6..97d121d4d7 100644
> > > --- a/migration/options.c
> > > +++ b/migration/options.c
> > > @@ -195,6 +195,7 @@ Property migration_properties[] = {
> > >      DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
> > >      DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
> > >      DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
> > > +    DEFINE_PROP_MIG_CAP("x-main-zero-page", MIGRATION_CAPABILITY_MAIN_ZERO_PAGE),
> > >      DEFINE_PROP_MIG_CAP("x-background-snapshot",
> > >              MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
> > >  #ifdef CONFIG_LINUX
> > > @@ -288,13 +289,9 @@ bool migrate_multifd(void)
> > >
> > >  bool migrate_use_main_zero_page(void)
> > >  {
> > > -    //MigrationState *s;
> > > -
> > > -    //s = migrate_get_current();
> > > +    MigrationState *s = migrate_get_current();
> > >
> > > -    // We will enable this when we add the right code.
> > > -    // return s->enabled_capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
> > > -    return true;
> > > +    return s->capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
> >
> > What happens if we disable main-zero-page while multifd is not enabled?
>
> In ram.c
> ...
> if (migrate_multifd() && !migrate_use_main_zero_page()) {
>     migration_ops->ram_save_target_page = ram_save_target_page_multifd;
> } else {
>     migration_ops->ram_save_target_page = ram_save_target_page_legacy;
> }
> ...
>
> So if main-zero-page is disabled and multifd is also disabled, it will
> go with the "else" path, which is the legacy path
> ram_save_target_page_legacy() and do zero page checking from the main
> thread.
>
> >
> > >  }
> > >
> > >  bool migrate_pause_before_switchover(void)
> > > @@ -457,6 +454,7 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
> > >      MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE,
> > >      MIGRATION_CAPABILITY_RETURN_PATH,
> > >      MIGRATION_CAPABILITY_MULTIFD,
> > > +    MIGRATION_CAPABILITY_MAIN_ZERO_PAGE,
> > >      MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER,
> > >      MIGRATION_CAPABILITY_AUTO_CONVERGE,
> > >      MIGRATION_CAPABILITY_RELEASE_RAM,
> > > @@ -534,6 +532,9 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
> > >              error_setg(errp, "Postcopy is not yet compatible with multifd");
> > >              return false;
> > >          }
> > > +        if (new_caps[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE]) {
> > > +            error_setg(errp, "Postcopy is not yet compatible with main zero copy");
> > > +        }
> >
> > Won't this break compatibility for postcopy? A command that used
> > to work will now have to disable main-zero-page first.
>
> main-zero-page is disabled by default.
>
> >
> > >      }
> > >
> > >      if (new_caps[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index 8c7886ab79..f7a42feff2 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -2059,17 +2059,42 @@ static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss)
> > >      if (save_zero_page(rs, pss, offset)) {
> > >          return 1;
> > >      }
> > > -
> > >      /*
> > > -     * Do not use multifd in postcopy as one whole host page should be
> > > -     * placed.  Meanwhile postcopy requires atomic update of pages, so even
> > > -     * if host page size == guest page size the dest guest during run may
> > > -     * still see partially copied pages which is data corruption.
> > > +     * Do not use multifd for:
> > > +     * 1. Compression as the first page in the new block should be posted out
> > > +     *    before sending the compressed page
> > > +     * 2. In postcopy as one whole host page should be placed
> > >       */
> > > -    if (migrate_multifd() && !migration_in_postcopy()) {
> > > +    if (!migrate_compress() && migrate_multifd() && !migration_in_postcopy()) {
> > > +        return ram_save_multifd_page(pss->pss_channel, block, offset);
> > > +    }
> >
> > This could go into ram_save_target_page_multifd like so:
> >
> > if (!migrate_compress() && !migration_in_postcopy() && !migration_main_zero_page()) {
> >     return ram_save_multifd_page(pss->pss_channel, block, offset);
> > } else {
> >   return ram_save_target_page_legacy();
> > }
> >
> > > +
> > > +    return ram_save_page(rs, pss);
> > > +}
> > > +
> > > +/**
> > > + * ram_save_target_page_multifd: save one target page
> > > + *
> > > + * Returns the number of pages written
> > > + *
> > > + * @rs: current RAM state
> > > + * @pss: data about the page we want to send
> > > + */
> > > +static int ram_save_target_page_multifd(RAMState *rs, PageSearchStatus *pss)
> > > +{
> > > +    RAMBlock *block = pss->block;
> > > +    ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
> > > +    int res;
> > > +
> > > +    if (!migration_in_postcopy()) {
> > >          return ram_save_multifd_page(pss->pss_channel, block, offset);
> > >      }
> > >
> > > +    res = save_zero_page(rs, pss, offset);
> > > +    if (res > 0) {
> > > +        return res;
> > > +    }
> > > +
> > >      return ram_save_page(rs, pss);
> > >  }
> > >
> > > @@ -2982,9 +3007,15 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> > >      }
> > >
> > >      migration_ops = g_malloc0(sizeof(MigrationOps));
> > > -    migration_ops->ram_save_target_page = ram_save_target_page_legacy;
> > > +
> > > +    if (migrate_multifd() && !migrate_use_main_zero_page()) {
> > > +        migration_ops->ram_save_target_page = ram_save_target_page_multifd;
> > > +    } else {
> > > +        migration_ops->ram_save_target_page = ram_save_target_page_legacy;
> > > +    }
> >
> > This should not check main-zero-page. Just have multifd vs. legacy and
> > have the multifd function defer to _legacy if main-zero-page or
> > in_postcopy.
>
> I noticed that ram_save_target_page_legacy and
> ram_save_target_page_multifd have a lot of overlap and are quite
> confusing. I can refactor this path and take in your comments here.
>
> 1) Remove ram_save_multifd_page() call from
> ram_save_target_page_legacy(). ram_save_multifd_page() will only be
> called in ram_save_target_page_multifd().
> 2) Remove save_zero_page() and ram_save_page() from
> ram_save_target_page_multifd().
> 3) Postcopy will always go with the ram_save_target_page_legacy() path.
> 4) Legacy compression will always go with the
> ram_save_target_page_legacy() path.
> 5) Call ram_save_target_page_legacy() from within
> ram_save_target_page_multifd() if postcopy or legacy compression.
>

Hi Fabiano,
So I spent some time reading the
ram_save_target_page_legacy/ram_save_target_page_multifd code path
Juan wrote and here is my current understanding:
1) Multifd and legacy compression are not compatible.
2) Multifd and postcopy are not compatible.
The compatibility checks are implemented in migrate_caps_check(). So
there is really no need to handle a lot of the complexity in Juan's
code.

I think what we can do is:
1) If multifd is enabled, use ram_save_target_page_multifd().
Otherwise, use ram_save_target_page_legacy().
2) In ram_save_target_page_legacy(), we don't need the special path to
call ram_save_multifd_page(). That can be handled by
ram_save_target_page_multifd() alone.
3) In ram_save_target_page_multifd(), we assert that legacy
compression is not enabled, and we also assert that postcopy is
not enabled.
4) We do need backward compatibility support for the main zero page
checking case in multifd. So in ram_save_target_page_multifd(), we
call save_zero_page() if migrate_multifd_zero_page() is false.

> >
> > >
> > >      qemu_mutex_unlock_iothread();
> > > +
> > >      ret = multifd_send_sync_main(f);
> > >      qemu_mutex_lock_iothread();
> > >      if (ret < 0) {
> > > diff --git a/qapi/migration.json b/qapi/migration.json
> > > index 09e4393591..9783289bfc 100644
> > > --- a/qapi/migration.json
> > > +++ b/qapi/migration.json
> > > @@ -531,7 +531,6 @@
> > >  #     and can result in more stable read performance.  Requires KVM
> > >  #     with accelerator property "dirty-ring-size" set.  (Since 8.1)
> > >  #
> > > -#
> > >  # @main-zero-page: If enabled, the detection of zero pages will be
> > >  #                  done on the main thread.  Otherwise it is done on
> > >  #                  the multifd threads.



* Re: [External] Re: [PATCH v2 04/20] So we use multifd to transmit zero pages.
  2024-01-25 21:55       ` Hao Xiang
@ 2024-01-25 23:14         ` Fabiano Rosas
  2024-01-25 23:46           ` Hao Xiang
  0 siblings, 1 reply; 51+ messages in thread
From: Fabiano Rosas @ 2024-01-25 23:14 UTC (permalink / raw)
  To: Hao Xiang
  Cc: peter.maydell, peterx, marcandre.lureau, bryan.zhang, qemu-devel,
	Leonardo Bras

Hao Xiang <hao.xiang@bytedance.com> writes:

> On Mon, Jan 22, 2024 at 8:28 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
>>
>> On Thu, Nov 16, 2023 at 7:14 AM Fabiano Rosas <farosas@suse.de> wrote:
>> >
>> > Hao Xiang <hao.xiang@bytedance.com> writes:
>> >
>> > > From: Juan Quintela <quintela@redhat.com>
>> > >
>> > > Signed-off-by: Juan Quintela <quintela@redhat.com>
>> > > Reviewed-by: Leonardo Bras <leobras@redhat.com>
>> > > ---
>> > >  migration/multifd.c |  7 ++++---
>> > >  migration/options.c | 13 +++++++------
>> > >  migration/ram.c     | 45 ++++++++++++++++++++++++++++++++++++++-------
>> > >  qapi/migration.json |  1 -
>> > >  4 files changed, 49 insertions(+), 17 deletions(-)
>> > >
>> > > diff --git a/migration/multifd.c b/migration/multifd.c
>> > > index 1b994790d5..1198ffde9c 100644
>> > > --- a/migration/multifd.c
>> > > +++ b/migration/multifd.c
>> > > @@ -13,6 +13,7 @@
>> > >  #include "qemu/osdep.h"
>> > >  #include "qemu/cutils.h"
>> > >  #include "qemu/rcu.h"
>> > > +#include "qemu/cutils.h"
>> > >  #include "exec/target_page.h"
>> > >  #include "sysemu/sysemu.h"
>> > >  #include "exec/ramblock.h"
>> > > @@ -459,7 +460,6 @@ static int multifd_send_pages(QEMUFile *f)
>> > >      p->packet_num = multifd_send_state->packet_num++;
>> > >      multifd_send_state->pages = p->pages;
>> > >      p->pages = pages;
>> > > -
>> > >      qemu_mutex_unlock(&p->mutex);
>> > >      qemu_sem_post(&p->sem);
>> > >
>> > > @@ -684,7 +684,7 @@ static void *multifd_send_thread(void *opaque)
>> > >      MigrationThread *thread = NULL;
>> > >      Error *local_err = NULL;
>> > >      /* qemu older than 8.2 don't understand zero page on multifd channel */
>> > > -    bool use_zero_page = !migrate_use_main_zero_page();
>> > > +    bool use_multifd_zero_page = !migrate_use_main_zero_page();
>> > >      int ret = 0;
>> > >      bool use_zero_copy_send = migrate_zero_copy_send();
>> > >
>> > > @@ -713,6 +713,7 @@ static void *multifd_send_thread(void *opaque)
>> > >              RAMBlock *rb = p->pages->block;
>> > >              uint64_t packet_num = p->packet_num;
>> > >              uint32_t flags;
>> > > +
>> > >              p->normal_num = 0;
>> > >              p->zero_num = 0;
>> > >
>> > > @@ -724,7 +725,7 @@ static void *multifd_send_thread(void *opaque)
>> > >
>> > >              for (int i = 0; i < p->pages->num; i++) {
>> > >                  uint64_t offset = p->pages->offset[i];
>> > > -                if (use_zero_page &&
>> > > +                if (use_multifd_zero_page &&
>> >
>> > We could have a new function in multifd_ops for zero page
>> > handling. We're already considering an accelerator for the compression
>> > method in the other series[1] and in this series we're adding an
>> > accelerator for zero page checking. It's about time we make the
>> > multifd_ops generic instead of only compression/no compression.
>>
>> Sorry I overlooked this email earlier.
>> I will extend the multifd_ops interface and add a new API for zero
>> page checking.
>>
>> >
>> > 1- [PATCH v2 0/4] Live Migration Acceleration with IAA Compression
>> > https://lore.kernel.org/r/20231109154638.488213-1-yuan1.liu@intel.com
>> >
>> > >                      buffer_is_zero(rb->host + offset, p->page_size)) {
>> > >                      p->zero[p->zero_num] = offset;
>> > >                      p->zero_num++;
>> > > diff --git a/migration/options.c b/migration/options.c
>> > > index 00c0c4a0d6..97d121d4d7 100644
>> > > --- a/migration/options.c
>> > > +++ b/migration/options.c
>> > > @@ -195,6 +195,7 @@ Property migration_properties[] = {
>> > >      DEFINE_PROP_MIG_CAP("x-block", MIGRATION_CAPABILITY_BLOCK),
>> > >      DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
>> > >      DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
>> > > +    DEFINE_PROP_MIG_CAP("x-main-zero-page", MIGRATION_CAPABILITY_MAIN_ZERO_PAGE),
>> > >      DEFINE_PROP_MIG_CAP("x-background-snapshot",
>> > >              MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
>> > >  #ifdef CONFIG_LINUX
>> > > @@ -288,13 +289,9 @@ bool migrate_multifd(void)
>> > >
>> > >  bool migrate_use_main_zero_page(void)
>> > >  {
>> > > -    //MigrationState *s;
>> > > -
>> > > -    //s = migrate_get_current();
>> > > +    MigrationState *s = migrate_get_current();
>> > >
>> > > -    // We will enable this when we add the right code.
>> > > -    // return s->enabled_capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
>> > > -    return true;
>> > > +    return s->capabilities[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE];
>> >
>> > What happens if we disable main-zero-page while multifd is not enabled?
>>
>> In ram.c
>> ...
>> if (migrate_multifd() && !migrate_use_main_zero_page()) {
>>     migration_ops->ram_save_target_page = ram_save_target_page_multifd;
>> } else {
>>     migration_ops->ram_save_target_page = ram_save_target_page_legacy;
>> }
>> ...
>>
>> So if main-zero-page is disabled and multifd is also disabled, it will
>> go with the "else" path, which is the legacy path
>> ram_save_target_page_legacy() and do zero page checking from the main
>> thread.
>>
>> >
>> > >  }
>> > >
>> > >  bool migrate_pause_before_switchover(void)
>> > > @@ -457,6 +454,7 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
>> > >      MIGRATION_CAPABILITY_LATE_BLOCK_ACTIVATE,
>> > >      MIGRATION_CAPABILITY_RETURN_PATH,
>> > >      MIGRATION_CAPABILITY_MULTIFD,
>> > > +    MIGRATION_CAPABILITY_MAIN_ZERO_PAGE,
>> > >      MIGRATION_CAPABILITY_PAUSE_BEFORE_SWITCHOVER,
>> > >      MIGRATION_CAPABILITY_AUTO_CONVERGE,
>> > >      MIGRATION_CAPABILITY_RELEASE_RAM,
>> > > @@ -534,6 +532,9 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
>> > >              error_setg(errp, "Postcopy is not yet compatible with multifd");
>> > >              return false;
>> > >          }
>> > > +        if (new_caps[MIGRATION_CAPABILITY_MAIN_ZERO_PAGE]) {
>> > > +            error_setg(errp, "Postcopy is not yet compatible with main zero copy");
>> > > +        }
>> >
>> > Won't this break compatibility for postcopy? A command that used
>> > to work will now have to disable main-zero-page first.
>>
>> main-zero-page is disabled by default.
>>
>> >
>> > >      }
>> > >
>> > >      if (new_caps[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
>> > > diff --git a/migration/ram.c b/migration/ram.c
>> > > index 8c7886ab79..f7a42feff2 100644
>> > > --- a/migration/ram.c
>> > > +++ b/migration/ram.c
>> > > @@ -2059,17 +2059,42 @@ static int ram_save_target_page_legacy(RAMState *rs, PageSearchStatus *pss)
>> > >      if (save_zero_page(rs, pss, offset)) {
>> > >          return 1;
>> > >      }
>> > > -
>> > >      /*
>> > > -     * Do not use multifd in postcopy as one whole host page should be
>> > > -     * placed.  Meanwhile postcopy requires atomic update of pages, so even
>> > > -     * if host page size == guest page size the dest guest during run may
>> > > -     * still see partially copied pages which is data corruption.
>> > > +     * Do not use multifd for:
>> > > +     * 1. Compression as the first page in the new block should be posted out
>> > > +     *    before sending the compressed page
>> > > +     * 2. In postcopy as one whole host page should be placed
>> > >       */
>> > > -    if (migrate_multifd() && !migration_in_postcopy()) {
>> > > +    if (!migrate_compress() && migrate_multifd() && !migration_in_postcopy()) {
>> > > +        return ram_save_multifd_page(pss->pss_channel, block, offset);
>> > > +    }
>> >
>> > This could go into ram_save_target_page_multifd like so:
>> >
>> > if (!migrate_compress() && !migration_in_postcopy() && !migrate_use_main_zero_page()) {
>> >     return ram_save_multifd_page(pss->pss_channel, block, offset);
>> > } else {
>> >     return ram_save_target_page_legacy(rs, pss);
>> > }
>> >
>> > > +
>> > > +    return ram_save_page(rs, pss);
>> > > +}
>> > > +
>> > > +/**
>> > > + * ram_save_target_page_multifd: save one target page
>> > > + *
>> > > + * Returns the number of pages written
>> > > + *
>> > > + * @rs: current RAM state
>> > > + * @pss: data about the page we want to send
>> > > + */
>> > > +static int ram_save_target_page_multifd(RAMState *rs, PageSearchStatus *pss)
>> > > +{
>> > > +    RAMBlock *block = pss->block;
>> > > +    ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
>> > > +    int res;
>> > > +
>> > > +    if (!migration_in_postcopy()) {
>> > >          return ram_save_multifd_page(pss->pss_channel, block, offset);
>> > >      }
>> > >
>> > > +    res = save_zero_page(rs, pss, offset);
>> > > +    if (res > 0) {
>> > > +        return res;
>> > > +    }
>> > > +
>> > >      return ram_save_page(rs, pss);
>> > >  }
>> > >
>> > > @@ -2982,9 +3007,15 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>> > >      }
>> > >
>> > >      migration_ops = g_malloc0(sizeof(MigrationOps));
>> > > -    migration_ops->ram_save_target_page = ram_save_target_page_legacy;
>> > > +
>> > > +    if (migrate_multifd() && !migrate_use_main_zero_page()) {
>> > > +        migration_ops->ram_save_target_page = ram_save_target_page_multifd;
>> > > +    } else {
>> > > +        migration_ops->ram_save_target_page = ram_save_target_page_legacy;
>> > > +    }
>> >
>> > This should not check main-zero-page. Just have multifd vs. legacy and
>> > have the multifd function defer to _legacy if main-zero-page or
>> > in_postcopy.
>>
>> I noticed that ram_save_target_page_legacy and
>> ram_save_target_page_multifd have a lot of overlap and are quite
>> confusing. I can refactor this path and take in your comments here.
>>
>> 1) Remove ram_save_multifd_page() call from
>> ram_save_target_page_legacy(). ram_save_multifd_page() will only be
>> called in ram_save_target_page_multifd().
>> 2) Remove save_zero_page() and ram_save_page() from
>> ram_save_target_page_multifd().
>> 3) Postcopy will always go with the ram_save_target_page_legacy() path.
>> 4) Legacy compression will always go with the
>> ram_save_target_page_legacy() path.
>> 5) Call ram_save_target_page_legacy() from within
>> ram_save_target_page_multifd() if postcopy or legacy compression.
>>
>
> Hi Fabiano,
> So I spent some time reading the
> ram_save_target_page_legacy/ram_save_target_page_multifd code path
> Juan wrote and here is my current understanding:
> 1) Multifd and legacy compression are not compatible.
> 2) Multifd and postcopy are not compatible.
> The compatibility checks are implemented in migrate_caps_check(). So
> there is really no need to handle a lot of the complexity in Juan's
> code.
>
> I think what we can do is:
> 1) If multifd is enabled, use ram_save_target_page_multifd().
> Otherwise, use ram_save_target_page_legacy().
> 2) In ram_save_target_page_legacy(), we don't need the special path to
> call ram_save_multifd_page(). That can be handled by
> ram_save_target_page_multifd() alone.
> > 3) In ram_save_target_page_multifd(), we assert that legacy
> > compression is not enabled, and that postcopy is not enabled either.
> 4) We do need backward compatibility support for the main zero page
> checking case in multifd. So in ram_save_target_page_multifd(), we
> call save_zero_page() if migrate_multifd_zero_page() is false.
>

Sounds good. Could you apply those changes, add the capability we
discussed in the other message, and send a separate series?  I haven't
found the time to work on this yet.
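
For reference, the plan above maps onto roughly the following shape.
This is a sketch only, built from the function signatures quoted in
this thread and assuming the capability check lands under the
migrate_multifd_zero_page() name proposed in point 4; it is not the
committed QEMU code:

static int ram_save_target_page_multifd(RAMState *rs, PageSearchStatus *pss)
{
    RAMBlock *block = pss->block;
    ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;

    /* Point 3: migrate_caps_check() already rejects these combinations. */
    assert(!migrate_compress());
    assert(!migration_in_postcopy());

    /*
     * Point 4: backward compatibility -- with multifd zero page checking
     * disabled, zero pages are still detected on the main migration
     * thread before anything is handed to a multifd channel.
     */
    if (!migrate_multifd_zero_page() && save_zero_page(rs, pss, offset)) {
        return 1;
    }

    return ram_save_multifd_page(pss->pss_channel, block, offset);
}

With that, the dispatch in ram_save_setup() (points 1 and 2) no longer
needs to look at the zero page capability at all:

if (migrate_multifd()) {
    migration_ops->ram_save_target_page = ram_save_target_page_multifd;
} else {
    migration_ops->ram_save_target_page = ram_save_target_page_legacy;
}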



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [External] Re: [PATCH v2 04/20] So we use multifd to transmit zero pages.
  2024-01-25 23:14         ` Fabiano Rosas
@ 2024-01-25 23:46           ` Hao Xiang
  0 siblings, 0 replies; 51+ messages in thread
From: Hao Xiang @ 2024-01-25 23:46 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: peter.maydell, peterx, marcandre.lureau, bryan.zhang, qemu-devel,
	Leonardo Bras

On Thu, Jan 25, 2024 at 3:14 PM Fabiano Rosas <farosas@suse.de> wrote:
>
> [...]
>
> Sounds good. Could you apply those changes, add the capability we
> discussed in the other message, and send a separate series?  I haven't
> found the time to work on this yet.
>

Sure, I will send out a separate series for multifd zero page checking.


^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2024-01-25 23:47 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-14  5:40 [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Hao Xiang
2023-11-14  5:40 ` [PATCH v2 01/20] multifd: Add capability to enable/disable zero_page Hao Xiang
2023-11-16 15:15   ` Fabiano Rosas
2023-11-14  5:40 ` [PATCH v2 02/20] multifd: Support for zero pages transmission Hao Xiang
2023-11-14  5:40 ` [PATCH v2 03/20] multifd: Zero " Hao Xiang
2023-12-18  2:43   ` Wang, Lei
2023-11-14  5:40 ` [PATCH v2 04/20] So we use multifd to transmit zero pages Hao Xiang
2023-11-16 15:14   ` Fabiano Rosas
2024-01-23  4:28     ` [External] " Hao Xiang
2024-01-25 21:55       ` Hao Xiang
2024-01-25 23:14         ` Fabiano Rosas
2024-01-25 23:46           ` Hao Xiang
2023-11-14  5:40 ` [PATCH v2 05/20] meson: Introduce new instruction set enqcmd to the build system Hao Xiang
2023-12-11 15:41   ` Fabiano Rosas
2023-12-16  0:26     ` [External] " Hao Xiang
2023-11-14  5:40 ` [PATCH v2 06/20] util/dsa: Add dependency idxd Hao Xiang
2023-11-14  5:40 ` [PATCH v2 07/20] util/dsa: Implement DSA device start and stop logic Hao Xiang
2023-12-11 21:28   ` Fabiano Rosas
2023-12-19  6:41     ` [External] " Hao Xiang
2023-12-19 13:18       ` Fabiano Rosas
2023-12-27  6:00         ` Hao Xiang
2023-11-14  5:40 ` [PATCH v2 08/20] util/dsa: Implement DSA task enqueue and dequeue Hao Xiang
2023-12-12 16:10   ` Fabiano Rosas
2023-12-27  0:07     ` [External] " Hao Xiang
2023-11-14  5:40 ` [PATCH v2 09/20] util/dsa: Implement DSA task asynchronous completion thread model Hao Xiang
2023-12-12 19:36   ` Fabiano Rosas
2023-12-18  3:11   ` Wang, Lei
2023-12-18 18:57     ` [External] " Hao Xiang
2023-12-19  1:33       ` Wang, Lei
2023-12-19  5:12         ` Hao Xiang
2023-11-14  5:40 ` [PATCH v2 10/20] util/dsa: Implement zero page checking in DSA task Hao Xiang
2023-11-14  5:40 ` [PATCH v2 11/20] util/dsa: Implement DSA task asynchronous submission and wait for completion Hao Xiang
2023-12-13 14:01   ` Fabiano Rosas
2023-12-27  6:26     ` [External] " Hao Xiang
2023-11-14  5:40 ` [PATCH v2 12/20] migration/multifd: Add new migration option for multifd DSA offloading Hao Xiang
2023-12-11 19:44   ` Fabiano Rosas
2023-12-18 18:34     ` [External] " Hao Xiang
2023-12-18  3:12   ` Wang, Lei
2023-11-14  5:40 ` [PATCH v2 13/20] migration/multifd: Prepare to introduce DSA acceleration on the multifd path Hao Xiang
2023-12-18  3:20   ` Wang, Lei
2023-11-14  5:40 ` [PATCH v2 14/20] migration/multifd: Enable DSA offloading in multifd sender path Hao Xiang
2023-11-14  5:40 ` [PATCH v2 15/20] migration/multifd: Add test hook to set normal page ratio Hao Xiang
2023-11-14  5:40 ` [PATCH v2 16/20] migration/multifd: Enable set normal page ratio test hook in multifd Hao Xiang
2023-11-14  5:40 ` [PATCH v2 17/20] migration/multifd: Add migration option set packet size Hao Xiang
2023-11-14  5:40 ` [PATCH v2 18/20] migration/multifd: Enable set packet size migration option Hao Xiang
2023-12-13 17:33   ` Fabiano Rosas
2024-01-03 20:04     ` [External] " Hao Xiang
2023-11-14  5:40 ` [PATCH v2 19/20] util/dsa: Add unit test coverage for Intel DSA task submission and completion Hao Xiang
2023-11-14  5:40 ` [PATCH v2 20/20] migration/multifd: Add integration tests for multifd with Intel DSA offloading Hao Xiang
2023-11-15 17:43 ` [PATCH v2 00/20] Use Intel DSA accelerator to offload zero page checking in multifd live migration Elena Ufimtseva
2023-11-15 19:37   ` [External] " Hao Xiang
