* [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups
@ 2024-01-31 10:30 peterx
  2024-01-31 10:30 ` [PATCH 01/14] migration/multifd: Drop stale comment for multifd zero copy peterx
                   ` (14 more replies)
  0 siblings, 15 replies; 45+ messages in thread
From: peterx @ 2024-01-31 10:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

This patchset contains quite a few refactorings to current multifd:

  - It picks up some patches from an old series of mine [0] (the last few
    patches were dropped, though; I did the cleanup slightly differently
    here).  I still managed to include one patch to split pending_job, but
    I rewrote that patch for this series.

  - It tries to clean up multiple multifd paths here and there; the
    ultimate goal is to redefine send_prepare() to be something like:

      p->pages ----------->  send_prepare() -------------> IOVs

    There is no other obvious change to multifd_ops yet besides the
    redefined interface for send_prepare().  We may want a separate set of
    ops for file migration later.

For the second point, one benefit was already presented by Fabiano in his
other series [1] on cleaning up zero copy, but this patchset addresses it
quite differently, and hopefully also more gradually.  The other benefit is
that, if we can settle on a more concrete API for send_prepare() and reach
an initial consensus, the recent compression accelerator work can be
rebased on top of this series.

This also prepares for the case where the input can be extended beyond
p->pages to arbitrary data (like VFIO's potential use case in the
future?).  But that is left for later, even if it turns out to be
reasonable.
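
To make the intended contract concrete, here is a rough sketch, modeled on
the no-compression hook as reworked in patch 13 (the helpers
multifd_send_prepare_header() and the exported multifd_send_fill_packet()
are only introduced by patches 12/13, so take this as illustration rather
than final code):

  static int example_send_prepare(MultiFDSendParams *p, Error **errp)
  {
      MultiFDPages_t *pages = p->pages;

      /* Put the packet header into iov[0] (with zero-copy, the real
       * code instead sends the header separately). */
      multifd_send_prepare_header(p);

      /* Turn p->pages into IOVs pointing at guest memory */
      for (int i = 0; i < pages->num; i++) {
          p->iov[p->iovs_num].iov_base = pages->block->host + pages->offset[i];
          p->iov[p->iovs_num].iov_len = p->page_size;
          p->iovs_num++;
      }

      p->next_packet_size = pages->num * p->page_size;

      /* Fill in the packet header now that the sizes are known */
      multifd_send_fill_packet(p);
      return 0;
  }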

Please have a look.  Thanks,

[0] https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com
[1] https://lore.kernel.org/qemu-devel/20240126221943.26628-1-farosas@suse.de

Peter Xu (14):
  migration/multifd: Drop stale comment for multifd zero copy
  migration/multifd: multifd_send_kick_main()
  migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
  migration/multifd: Postpone reset of MultiFDPages_t
  migration/multifd: Drop MultiFDSendParams.normal[] array
  migration/multifd: Separate SYNC request with normal jobs
  migration/multifd: Simplify locking in sender thread
  migration/multifd: Drop pages->num check in sender thread
  migration/multifd: Rename p->num_packets and clean it up
  migration/multifd: Move total_normal_pages accounting
  migration/multifd: Move trace_multifd_send|recv()
  migration/multifd: multifd_send_prepare_header()
  migration/multifd: Move header prepare/fill into send_prepare()
  migration/multifd: Forbid spurious wakeups

 migration/multifd.h      |  34 +++--
 migration/multifd-zlib.c |  11 +-
 migration/multifd-zstd.c |  11 +-
 migration/multifd.c      | 291 +++++++++++++++++++--------------------
 4 files changed, 182 insertions(+), 165 deletions(-)

-- 
2.43.0




* [PATCH 01/14] migration/multifd: Drop stale comment for multifd zero copy
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
@ 2024-01-31 10:30 ` peterx
  2024-01-31 10:30 ` [PATCH 02/14] migration/multifd: multifd_send_kick_main() peterx
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: peterx @ 2024-01-31 10:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

We have already done what this comment suggests via
multifd_flush_after_each_section, for multifd in general.  Drop the stale,
TODO-like comment.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 25cbc6dc6b..eee2586770 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -598,17 +598,6 @@ int multifd_send_sync_main(void)
         }
     }
 
-    /*
-     * When using zero-copy, it's necessary to flush the pages before any of
-     * the pages can be sent again, so we'll make sure the new version of the
-     * pages will always arrive _later_ than the old pages.
-     *
-     * Currently we achieve this by flushing the zero-page requested writes
-     * per ram iteration, but in the future we could potentially optimize it
-     * to be less frequent, e.g. only after we finished one whole scanning of
-     * all the dirty bitmaps.
-     */
-
     flush_zero_copy = migrate_zero_copy_send();
 
     for (i = 0; i < migrate_multifd_channels(); i++) {
-- 
2.43.0




* [PATCH 02/14] migration/multifd: multifd_send_kick_main()
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
  2024-01-31 10:30 ` [PATCH 01/14] migration/multifd: Drop stale comment for multifd zero copy peterx
@ 2024-01-31 10:30 ` peterx
  2024-01-31 10:31 ` [PATCH 03/14] migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths peterx
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: peterx @ 2024-01-31 10:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

When a multifd sender thread hits an error, it always needs to kick the
main thread by posting all the semaphores that the main thread may be
waiting on.

Provide a helper for that and deduplicate the code.

Reviewed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index eee2586770..b8d2c96533 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -372,6 +372,18 @@ struct {
     MultiFDMethods *ops;
 } *multifd_send_state;
 
+/*
+ * The migration thread can wait on either of the two semaphores.  This
+ * function can be used to kick the main thread out of waiting on either of
+ * them.  Should mostly only be called when something wrong happened with
+ * the current multifd send thread.
+ */
+static void multifd_send_kick_main(MultiFDSendParams *p)
+{
+    qemu_sem_post(&p->sem_sync);
+    qemu_sem_post(&multifd_send_state->channels_ready);
+}
+
 /*
  * How we use multifd_send_state->pages and channel->pages?
  *
@@ -739,8 +751,7 @@ out:
         assert(local_err);
         trace_multifd_send_error(p->id);
         multifd_send_terminate_threads(local_err);
-        qemu_sem_post(&p->sem_sync);
-        qemu_sem_post(&multifd_send_state->channels_ready);
+        multifd_send_kick_main(p);
         error_free(local_err);
     }
 
@@ -781,8 +792,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
      * is not created, and then tell who pay attention to me.
      */
     p->quit = true;
-    qemu_sem_post(&multifd_send_state->channels_ready);
-    qemu_sem_post(&p->sem_sync);
+    multifd_send_kick_main(p);
     error_free(err);
 }
 
@@ -852,8 +862,7 @@ static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
 {
      migrate_set_error(migrate_get_current(), err);
      /* Error happen, we need to tell who pay attention to me */
-     qemu_sem_post(&multifd_send_state->channels_ready);
-     qemu_sem_post(&p->sem_sync);
+     multifd_send_kick_main(p);
      /*
       * Although multifd_send_thread is not created, but main migration
       * thread need to judge whether it is running, so we need to mark
-- 
2.43.0




* [PATCH 03/14] migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
  2024-01-31 10:30 ` [PATCH 01/14] migration/multifd: Drop stale comment for multifd zero copy peterx
  2024-01-31 10:30 ` [PATCH 02/14] migration/multifd: multifd_send_kick_main() peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 15:05   ` Fabiano Rosas
  2024-01-31 10:31 ` [PATCH 04/14] migration/multifd: Postpone reset of MultiFDPages_t peterx
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

The multifd send side has two fields that indicate an error-triggered
quit:

  - MultiFDSendParams.quit
  - &multifd_send_state->exiting

Merge them into the global one.  The replacement is done by changing all
p->quit checks into checks of the global variable.  The global check
doesn't need any lock.
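
For reference, the resulting pattern (condensed from the hunks below) is a
set-once flag checked with plain atomic reads:

  static bool multifd_send_should_exit(void)
  {
      /* Lockless: the flag only ever transitions 0 -> 1 */
      return qatomic_read(&multifd_send_state->exiting);
  }

  static void multifd_send_terminate_threads(Error *err)
  {
      /* Set-once: only the first caller proceeds with the teardown */
      if (qatomic_xchg(&multifd_send_state->exiting, 1)) {
          return;
      }
      /* ... tracepoint, migrate_set_error(), kick the sender threads ... */
  }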

A few more things done on top of this altogether:

  - multifd_send_terminate_threads()

    Move the xchg() of &multifd_send_state->exiting earlier, so that it
    covers the tracepoint, migrate_set_error() and migrate_set_state().

  - multifd_send_sync_main()

    In the 2nd loop, add one more check of the global variable to make sure
    we don't keep looping if QEMU has already decided to quit.

  - multifd_tls_outgoing_handshake()

    Use multifd_send_terminate_threads() to set the error state.  That has
    the benefit of also updating MigrationState.error, so we persist the
    first error hit on that specific channel.

  - multifd_new_send_channel_async()

    Take a similar approach to the above: drop the migrate_set_error()
    because multifd_send_terminate_threads() already covers it.  Unwrap the
    helper multifd_new_send_channel_cleanup() along the way; it is not
    really needed.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.h |  2 --
 migration/multifd.c | 85 ++++++++++++++++++---------------------------
 2 files changed, 33 insertions(+), 54 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 35d11f103c..7c040cb85a 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -95,8 +95,6 @@ typedef struct {
     QemuMutex mutex;
     /* is this channel thread running */
     bool running;
-    /* should this thread finish */
-    bool quit;
     /* multifd flags for each packet */
     uint32_t flags;
     /* global number of generated multifd packets */
diff --git a/migration/multifd.c b/migration/multifd.c
index b8d2c96533..2c98023d67 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -372,6 +372,11 @@ struct {
     MultiFDMethods *ops;
 } *multifd_send_state;
 
+static bool multifd_send_should_exit(void)
+{
+    return qatomic_read(&multifd_send_state->exiting);
+}
+
 /*
  * The migration thread can wait on either of the two semaphores.  This
  * function can be used to kick the main thread out of waiting on either of
@@ -409,7 +414,7 @@ static int multifd_send_pages(void)
     MultiFDSendParams *p = NULL; /* make happy gcc */
     MultiFDPages_t *pages = multifd_send_state->pages;
 
-    if (qatomic_read(&multifd_send_state->exiting)) {
+    if (multifd_send_should_exit()) {
         return -1;
     }
 
@@ -421,14 +426,11 @@ static int multifd_send_pages(void)
      */
     next_channel %= migrate_multifd_channels();
     for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
-        p = &multifd_send_state->params[i];
-
-        qemu_mutex_lock(&p->mutex);
-        if (p->quit) {
-            error_report("%s: channel %d has already quit!", __func__, i);
-            qemu_mutex_unlock(&p->mutex);
+        if (multifd_send_should_exit()) {
             return -1;
         }
+        p = &multifd_send_state->params[i];
+        qemu_mutex_lock(&p->mutex);
         if (!p->pending_job) {
             p->pending_job++;
             next_channel = (i + 1) % migrate_multifd_channels();
@@ -483,6 +485,16 @@ static void multifd_send_terminate_threads(Error *err)
 {
     int i;
 
+    /*
+     * We don't want to exit each threads twice.  Depending on where
+     * we get the error, or if there are two independent errors in two
+     * threads at the same time, we can end calling this function
+     * twice.
+     */
+    if (qatomic_xchg(&multifd_send_state->exiting, 1)) {
+        return;
+    }
+
     trace_multifd_send_terminate_threads(err != NULL);
 
     if (err) {
@@ -497,26 +509,13 @@ static void multifd_send_terminate_threads(Error *err)
         }
     }
 
-    /*
-     * We don't want to exit each threads twice.  Depending on where
-     * we get the error, or if there are two independent errors in two
-     * threads at the same time, we can end calling this function
-     * twice.
-     */
-    if (qatomic_xchg(&multifd_send_state->exiting, 1)) {
-        return;
-    }
-
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
-        qemu_mutex_lock(&p->mutex);
-        p->quit = true;
         qemu_sem_post(&p->sem);
         if (p->c) {
             qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
         }
-        qemu_mutex_unlock(&p->mutex);
     }
 }
 
@@ -615,16 +614,13 @@ int multifd_send_sync_main(void)
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
-        trace_multifd_send_sync_main_signal(p->id);
-
-        qemu_mutex_lock(&p->mutex);
-
-        if (p->quit) {
-            error_report("%s: channel %d has already quit", __func__, i);
-            qemu_mutex_unlock(&p->mutex);
+        if (multifd_send_should_exit()) {
             return -1;
         }
 
+        trace_multifd_send_sync_main_signal(p->id);
+
+        qemu_mutex_lock(&p->mutex);
         p->packet_num = multifd_send_state->packet_num++;
         p->flags |= MULTIFD_FLAG_SYNC;
         p->pending_job++;
@@ -634,6 +630,10 @@ int multifd_send_sync_main(void)
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
+        if (multifd_send_should_exit()) {
+            return -1;
+        }
+
         qemu_sem_wait(&multifd_send_state->channels_ready);
         trace_multifd_send_sync_main_wait(p->id);
         qemu_sem_wait(&p->sem_sync);
@@ -671,7 +671,7 @@ static void *multifd_send_thread(void *opaque)
         qemu_sem_post(&multifd_send_state->channels_ready);
         qemu_sem_wait(&p->sem);
 
-        if (qatomic_read(&multifd_send_state->exiting)) {
+        if (multifd_send_should_exit()) {
             break;
         }
         qemu_mutex_lock(&p->mutex);
@@ -786,12 +786,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
 
     trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
 
-    migrate_set_error(migrate_get_current(), err);
-    /*
-     * Error happen, mark multifd_send_thread status as 'quit' although it
-     * is not created, and then tell who pay attention to me.
-     */
-    p->quit = true;
+    multifd_send_terminate_threads(err);
     multifd_send_kick_main(p);
     error_free(err);
 }
@@ -857,22 +852,6 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
     return true;
 }
 
-static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
-                                             QIOChannel *ioc, Error *err)
-{
-     migrate_set_error(migrate_get_current(), err);
-     /* Error happen, we need to tell who pay attention to me */
-     multifd_send_kick_main(p);
-     /*
-      * Although multifd_send_thread is not created, but main migration
-      * thread need to judge whether it is running, so we need to mark
-      * its status.
-      */
-     p->quit = true;
-     object_unref(OBJECT(ioc));
-     error_free(err);
-}
-
 static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
 {
     MultiFDSendParams *p = opaque;
@@ -889,7 +868,10 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
     }
 
     trace_multifd_new_send_channel_async_error(p->id, local_err);
-    multifd_new_send_channel_cleanup(p, ioc, local_err);
+    multifd_send_terminate_threads(local_err);
+    multifd_send_kick_main(p);
+    object_unref(OBJECT(ioc));
+    error_free(local_err);
 }
 
 static void multifd_new_send_channel_create(gpointer opaque)
@@ -921,7 +903,6 @@ int multifd_save_setup(Error **errp)
         qemu_mutex_init(&p->mutex);
         qemu_sem_init(&p->sem, 0);
         qemu_sem_init(&p->sem_sync, 0);
-        p->quit = false;
         p->pending_job = 0;
         p->id = i;
         p->pages = multifd_pages_init(page_count);
-- 
2.43.0




* [PATCH 04/14] migration/multifd: Postpone reset of MultiFDPages_t
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (2 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 03/14] migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 15:27   ` Fabiano Rosas
  2024-01-31 10:31 ` [PATCH 05/14] migration/multifd: Drop MultiFDSendParams.normal[] array peterx
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

Currently we reset the MultiFDPages_t object in the multifd sender thread
in the middle of the sending job.  That's not necessary, because the
"*pages" struct will not be reused anyway until pending_job is cleared.

Move the reset to the end, after the job is completed, and provide a helper
to reset a "*pages" object.  Use that same helper when freeing the object
too.

This prepares us to keep using p->pages in the follow-up patches, where we
may drop p->normal[].

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 2c98023d67..5633ac245a 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -172,6 +172,17 @@ void multifd_register_ops(int method, MultiFDMethods *ops)
     multifd_ops[method] = ops;
 }
 
+/* Reset a MultiFDPages_t* object for the next use */
+static void multifd_pages_reset(MultiFDPages_t *pages)
+{
+    /*
+     * We don't need to touch offset[] array, because it will be
+     * overwritten later when reused.
+     */
+    pages->num = 0;
+    pages->block = NULL;
+}
+
 static int multifd_send_initial_packet(MultiFDSendParams *p, Error **errp)
 {
     MultiFDInit_t msg = {};
@@ -248,9 +259,8 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
 
 static void multifd_pages_clear(MultiFDPages_t *pages)
 {
-    pages->num = 0;
+    multifd_pages_reset(pages);
     pages->allocated = 0;
-    pages->block = NULL;
     g_free(pages->offset);
     pages->offset = NULL;
     g_free(pages);
@@ -704,8 +714,6 @@ static void *multifd_send_thread(void *opaque)
             p->flags = 0;
             p->num_packets++;
             p->total_normal_pages += p->normal_num;
-            p->pages->num = 0;
-            p->pages->block = NULL;
             qemu_mutex_unlock(&p->mutex);
 
             trace_multifd_send(p->id, packet_num, p->normal_num, flags,
@@ -732,6 +740,8 @@ static void *multifd_send_thread(void *opaque)
 
             stat64_add(&mig_stats.multifd_bytes,
                        p->next_packet_size + p->packet_len);
+
+            multifd_pages_reset(p->pages);
             p->next_packet_size = 0;
             qemu_mutex_lock(&p->mutex);
             p->pending_job--;
-- 
2.43.0




* [PATCH 05/14] migration/multifd: Drop MultiFDSendParams.normal[] array
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (3 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 04/14] migration/multifd: Postpone reset of MultiFDPages_t peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 16:02   ` Fabiano Rosas
  2024-01-31 10:31 ` [PATCH 06/14] migration/multifd: Separate SYNC request with normal jobs peterx
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

This array is redundant when p->pages exists.  Now that we have extended
the life of p->pages to the whole period where pending_job is set, it
should be safe to always use p->pages->offset[] rather than p->normal[].
Drop the array.

Alongside, normal_num is also redundant, as it is always the same as
p->pages->num.

This doesn't apply to the recv side, because there's no extra buffering
there, so the p->normal[] array is still needed.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.h      |  4 ----
 migration/multifd-zlib.c |  7 ++++---
 migration/multifd-zstd.c |  7 ++++---
 migration/multifd.c      | 33 +++++++++++++--------------------
 4 files changed, 21 insertions(+), 30 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 7c040cb85a..3920bdbcf1 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -122,10 +122,6 @@ typedef struct {
     struct iovec *iov;
     /* number of iovs used */
     uint32_t iovs_num;
-    /* Pages that are not zero */
-    ram_addr_t *normal;
-    /* num of non zero pages */
-    uint32_t normal_num;
     /* used for compression methods */
     void *data;
 }  MultiFDSendParams;
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 37ce48621e..100809abc1 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -116,17 +116,18 @@ static void zlib_send_cleanup(MultiFDSendParams *p, Error **errp)
  */
 static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
 {
+    MultiFDPages_t *pages = p->pages;
     struct zlib_data *z = p->data;
     z_stream *zs = &z->zs;
     uint32_t out_size = 0;
     int ret;
     uint32_t i;
 
-    for (i = 0; i < p->normal_num; i++) {
+    for (i = 0; i < pages->num; i++) {
         uint32_t available = z->zbuff_len - out_size;
         int flush = Z_NO_FLUSH;
 
-        if (i == p->normal_num - 1) {
+        if (i == pages->num - 1) {
             flush = Z_SYNC_FLUSH;
         }
 
@@ -135,7 +136,7 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
          * with compression. zlib does not guarantee that this is safe,
          * therefore copy the page before calling deflate().
          */
-        memcpy(z->buf, p->pages->block->host + p->normal[i], p->page_size);
+        memcpy(z->buf, p->pages->block->host + pages->offset[i], p->page_size);
         zs->avail_in = p->page_size;
         zs->next_in = z->buf;
 
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index b471daadcd..2023edd8cc 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -113,6 +113,7 @@ static void zstd_send_cleanup(MultiFDSendParams *p, Error **errp)
  */
 static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
 {
+    MultiFDPages_t *pages = p->pages;
     struct zstd_data *z = p->data;
     int ret;
     uint32_t i;
@@ -121,13 +122,13 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
     z->out.size = z->zbuff_len;
     z->out.pos = 0;
 
-    for (i = 0; i < p->normal_num; i++) {
+    for (i = 0; i < pages->num; i++) {
         ZSTD_EndDirective flush = ZSTD_e_continue;
 
-        if (i == p->normal_num - 1) {
+        if (i == pages->num - 1) {
             flush = ZSTD_e_flush;
         }
-        z->in.src = p->pages->block->host + p->normal[i];
+        z->in.src = p->pages->block->host + pages->offset[i];
         z->in.size = p->page_size;
         z->in.pos = 0;
 
diff --git a/migration/multifd.c b/migration/multifd.c
index 5633ac245a..8bb1fd95cf 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -90,13 +90,13 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
 {
     MultiFDPages_t *pages = p->pages;
 
-    for (int i = 0; i < p->normal_num; i++) {
-        p->iov[p->iovs_num].iov_base = pages->block->host + p->normal[i];
+    for (int i = 0; i < pages->num; i++) {
+        p->iov[p->iovs_num].iov_base = pages->block->host + pages->offset[i];
         p->iov[p->iovs_num].iov_len = p->page_size;
         p->iovs_num++;
     }
 
-    p->next_packet_size = p->normal_num * p->page_size;
+    p->next_packet_size = pages->num * p->page_size;
     p->flags |= MULTIFD_FLAG_NOCOMP;
     return 0;
 }
@@ -269,21 +269,22 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
 static void multifd_send_fill_packet(MultiFDSendParams *p)
 {
     MultiFDPacket_t *packet = p->packet;
+    MultiFDPages_t *pages = p->pages;
     int i;
 
     packet->flags = cpu_to_be32(p->flags);
     packet->pages_alloc = cpu_to_be32(p->pages->allocated);
-    packet->normal_pages = cpu_to_be32(p->normal_num);
+    packet->normal_pages = cpu_to_be32(pages->num);
     packet->next_packet_size = cpu_to_be32(p->next_packet_size);
     packet->packet_num = cpu_to_be64(p->packet_num);
 
-    if (p->pages->block) {
-        strncpy(packet->ramblock, p->pages->block->idstr, 256);
+    if (pages->block) {
+        strncpy(packet->ramblock, pages->block->idstr, 256);
     }
 
-    for (i = 0; i < p->normal_num; i++) {
+    for (i = 0; i < pages->num; i++) {
         /* there are architectures where ram_addr_t is 32 bit */
-        uint64_t temp = p->normal[i];
+        uint64_t temp = pages->offset[i];
 
         packet->offset[i] = cpu_to_be64(temp);
     }
@@ -570,8 +571,6 @@ void multifd_save_cleanup(void)
         p->packet = NULL;
         g_free(p->iov);
         p->iov = NULL;
-        g_free(p->normal);
-        p->normal = NULL;
         multifd_send_state->ops->send_cleanup(p, &local_err);
         if (local_err) {
             migrate_set_error(migrate_get_current(), local_err);
@@ -688,8 +687,8 @@ static void *multifd_send_thread(void *opaque)
 
         if (p->pending_job) {
             uint64_t packet_num = p->packet_num;
+            MultiFDPages_t *pages = p->pages;
             uint32_t flags;
-            p->normal_num = 0;
 
             if (use_zero_copy_send) {
                 p->iovs_num = 0;
@@ -697,12 +696,7 @@ static void *multifd_send_thread(void *opaque)
                 p->iovs_num = 1;
             }
 
-            for (int i = 0; i < p->pages->num; i++) {
-                p->normal[p->normal_num] = p->pages->offset[i];
-                p->normal_num++;
-            }
-
-            if (p->normal_num) {
+            if (pages->num) {
                 ret = multifd_send_state->ops->send_prepare(p, &local_err);
                 if (ret != 0) {
                     qemu_mutex_unlock(&p->mutex);
@@ -713,10 +707,10 @@ static void *multifd_send_thread(void *opaque)
             flags = p->flags;
             p->flags = 0;
             p->num_packets++;
-            p->total_normal_pages += p->normal_num;
+            p->total_normal_pages += pages->num;
             qemu_mutex_unlock(&p->mutex);
 
-            trace_multifd_send(p->id, packet_num, p->normal_num, flags,
+            trace_multifd_send(p->id, packet_num, pages->num, flags,
                                p->next_packet_size);
 
             if (use_zero_copy_send) {
@@ -924,7 +918,6 @@ int multifd_save_setup(Error **errp)
         p->name = g_strdup_printf("multifdsend_%d", i);
         /* We need one extra place for the packet header */
         p->iov = g_new0(struct iovec, page_count + 1);
-        p->normal = g_new0(ram_addr_t, page_count);
         p->page_size = qemu_target_page_size();
         p->page_count = page_count;
 
-- 
2.43.0




* [PATCH 06/14] migration/multifd: Separate SYNC request with normal jobs
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (4 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 05/14] migration/multifd: Drop MultiFDSendParams.normal[] array peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 18:45   ` Fabiano Rosas
  2024-01-31 10:31 ` [PATCH 07/14] migration/multifd: Simplify locking in sender thread peterx
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

Multifd provides a threaded model for processing jobs.  On the sender
side, there can be two kinds of job: (1) a list of pages to send, or (2) a
sync request.

The sync request is a very special kind of job.  It never contains a page
array, but only a multifd packet telling the dest side to synchronize with
the pages sent so far.

Before this patch, both kinds of request use the pending_job field: no
matter what the request is, it increments pending_job, while the multifd
sender thread decrements it after it finishes one job.

However this is racy, because SYNC is special in that it needs to set
MULTIFD_FLAG_SYNC in p->flags to mark the packet as a sync request.
Consider a sequence of operations where:

  - the migration thread enqueues a job to send some pages, pending_job++ (0->1)

  - [...before the selected multifd sender thread wakes up...]

  - the migration thread enqueues another job to sync, pending_job++ (1->2),
    and sets p->flags=MULTIFD_FLAG_SYNC

  - the multifd sender thread wakes up and finds pending_job==2
    - it sends the 1st packet with MULTIFD_FLAG_SYNC and the list of pages
    - it sends the 2nd packet with flags==0 and no pages

This is not what we expect: MULTIFD_FLAG_SYNC should only take effect
after all the pages have been received.  Meanwhile, the 2nd packet is
completely useless, as it contains no information.

I didn't verify the above, but I think this issue is still benign, in that
at least on the recv side we always receive the pages before handling
MULTIFD_FLAG_SYNC.  However that ordering is not really guaranteed, and
relying on it is tricky.

Another reason to separate the two is that using p->flags to communicate
between the two threads is not clearly defined; it's very hard to read and
understand why accessing p->flags is always safe.  See the current
implementation of multifd_send_thread(), where we tried to cache only
p->flags; it doesn't need to be that complicated.

This patch introduces pending_sync, a separate flag just to show that the
requester needs a sync.  Alongside, we remove the tricky caching of
p->flags, because after this patch p->flags is only used by the multifd
sender thread, which makes it crystal clear that accessing p->flags is
always thread safe.

With that, we can also safely convert pending_job into a boolean, because
we don't support more than one pending job anyway.
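
For clarity, the sender thread dispatch after this patch becomes roughly
the following (a condensed sketch of the hunks below, with locking and IO
details elided):

  if (p->pending_job) {
      /* prepare IOVs + packet, write out header and pages, then: */
      multifd_pages_reset(p->pages);
      p->pending_job = false;
  } else if (p->pending_sync) {
      p->flags = MULTIFD_FLAG_SYNC;
      multifd_send_fill_packet(p);
      /* write out the (page-less) sync packet, then: */
      p->flags = 0;
      p->pending_sync = false;
      qemu_sem_post(&p->sem_sync);
  } else {
      /* spurious wakeup */
  }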

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.h | 13 +++++++++++--
 migration/multifd.c | 29 ++++++++++++++++++-----------
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 3920bdbcf1..08f26ef3fe 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -99,8 +99,17 @@ typedef struct {
     uint32_t flags;
     /* global number of generated multifd packets */
     uint64_t packet_num;
-    /* thread has work to do */
-    int pending_job;
+    /*
+     * The sender thread has work to do if either of below boolean is set.
+     *
+     * @pending_job:  a job is pending
+     * @pending_sync: a sync request is pending
+     *
+     * For both of these fields, they're only set by the requesters, and
+     * cleared by the multifd sender threads.
+     */
+    bool pending_job;
+    bool pending_sync;
     /* array of pages to sent.
      * The owner of 'pages' depends of 'pending_job' value:
      * pending_job == 0 -> migration_thread can use it.
diff --git a/migration/multifd.c b/migration/multifd.c
index 8bb1fd95cf..6a4863edd2 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -443,7 +443,7 @@ static int multifd_send_pages(void)
         p = &multifd_send_state->params[i];
         qemu_mutex_lock(&p->mutex);
         if (!p->pending_job) {
-            p->pending_job++;
+            p->pending_job = true;
             next_channel = (i + 1) % migrate_multifd_channels();
             break;
         }
@@ -631,8 +631,7 @@ int multifd_send_sync_main(void)
 
         qemu_mutex_lock(&p->mutex);
         p->packet_num = multifd_send_state->packet_num++;
-        p->flags |= MULTIFD_FLAG_SYNC;
-        p->pending_job++;
+        p->pending_sync = true;
         qemu_mutex_unlock(&p->mutex);
         qemu_sem_post(&p->sem);
     }
@@ -688,7 +687,6 @@ static void *multifd_send_thread(void *opaque)
         if (p->pending_job) {
             uint64_t packet_num = p->packet_num;
             MultiFDPages_t *pages = p->pages;
-            uint32_t flags;
 
             if (use_zero_copy_send) {
                 p->iovs_num = 0;
@@ -704,13 +702,11 @@ static void *multifd_send_thread(void *opaque)
                 }
             }
             multifd_send_fill_packet(p);
-            flags = p->flags;
-            p->flags = 0;
             p->num_packets++;
             p->total_normal_pages += pages->num;
             qemu_mutex_unlock(&p->mutex);
 
-            trace_multifd_send(p->id, packet_num, pages->num, flags,
+            trace_multifd_send(p->id, packet_num, pages->num, p->flags,
                                p->next_packet_size);
 
             if (use_zero_copy_send) {
@@ -738,12 +734,23 @@ static void *multifd_send_thread(void *opaque)
             multifd_pages_reset(p->pages);
             p->next_packet_size = 0;
             qemu_mutex_lock(&p->mutex);
-            p->pending_job--;
+            p->pending_job = false;
             qemu_mutex_unlock(&p->mutex);
-
-            if (flags & MULTIFD_FLAG_SYNC) {
-                qemu_sem_post(&p->sem_sync);
+        } else if (p->pending_sync) {
+            p->flags = MULTIFD_FLAG_SYNC;
+            multifd_send_fill_packet(p);
+            ret = qio_channel_write_all(p->c, (void *)p->packet,
+                                        p->packet_len, &local_err);
+            if (ret != 0) {
+                qemu_mutex_unlock(&p->mutex);
+                break;
             }
+            /* p->next_packet_size will always be zero for a SYNC packet */
+            stat64_add(&mig_stats.multifd_bytes, p->packet_len);
+            p->flags = 0;
+            p->pending_sync = false;
+            qemu_mutex_unlock(&p->mutex);
+            qemu_sem_post(&p->sem_sync);
         } else {
             qemu_mutex_unlock(&p->mutex);
             /* sometimes there are spurious wakeups */
-- 
2.43.0




* [PATCH 07/14] migration/multifd: Simplify locking in sender thread
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (5 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 06/14] migration/multifd: Separate SYNC request with normal jobs peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 20:21   ` Fabiano Rosas
  2024-01-31 10:31 ` [PATCH 08/14] migration/multifd: Drop pages->num check " peterx
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

The sender thread currently yields p->mutex before the IO starts, trying
not to block the requester thread.  This may be an unnecessary lock
optimization, because the requester can already read pending_job safely
even without the lock: the requester is currently the only one who can
assign a task.

Drop that lock complication on both sides:

  (1) in the sender thread, hold the mutex until the job is done
  (2) in the requester thread, check that pending_job is clear without
      taking the lock
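
A condensed view of the resulting handover protocol (see the hunk below):

  /* Requester: lockless scan; only the sender thread ever clears
   * pending_job, so reading it without the lock is safe. */
  for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
      p = &multifd_send_state->params[i];
      if (!p->pending_job) {
          break;
      }
  }

  /* Take the lock only to hand over the job */
  qemu_mutex_lock(&p->mutex);
  assert(p->pending_job == false);
  p->pending_job = true;
  /* ... swap in the pages to send ... */
  qemu_mutex_unlock(&p->mutex);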

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 6a4863edd2..4dc5af0a15 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -429,7 +429,9 @@ static int multifd_send_pages(void)
         return -1;
     }
 
+    /* We wait here, until at least one channel is ready */
     qemu_sem_wait(&multifd_send_state->channels_ready);
+
     /*
      * next_channel can remain from a previous migration that was
      * using more channels, so ensure it doesn't overflow if the
@@ -441,17 +443,26 @@ static int multifd_send_pages(void)
             return -1;
         }
         p = &multifd_send_state->params[i];
-        qemu_mutex_lock(&p->mutex);
+        /*
+         * Lockless read to p->pending_job is safe, because only multifd
+         * sender thread can clear it.
+         */
         if (!p->pending_job) {
-            p->pending_job = true;
             next_channel = (i + 1) % migrate_multifd_channels();
             break;
         }
-        qemu_mutex_unlock(&p->mutex);
     }
+
+    qemu_mutex_lock(&p->mutex);
+    /*
+     * Double check on pending_job==false with the lock.  In the future if
+     * we can have >1 requester thread, we can replace this with a "goto
+     * retry", but that is for later.
+     */
+    assert(p->pending_job == false);
+    p->pending_job = true;
     assert(!p->pages->num);
     assert(!p->pages->block);
-
     p->packet_num = multifd_send_state->packet_num++;
     multifd_send_state->pages = p->pages;
     p->pages = pages;
@@ -704,8 +715,6 @@ static void *multifd_send_thread(void *opaque)
             multifd_send_fill_packet(p);
             p->num_packets++;
             p->total_normal_pages += pages->num;
-            qemu_mutex_unlock(&p->mutex);
-
             trace_multifd_send(p->id, packet_num, pages->num, p->flags,
                                p->next_packet_size);
 
@@ -725,6 +734,7 @@ static void *multifd_send_thread(void *opaque)
             ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
                                               0, p->write_flags, &local_err);
             if (ret != 0) {
+                qemu_mutex_unlock(&p->mutex);
                 break;
             }
 
@@ -733,7 +743,6 @@ static void *multifd_send_thread(void *opaque)
 
             multifd_pages_reset(p->pages);
             p->next_packet_size = 0;
-            qemu_mutex_lock(&p->mutex);
             p->pending_job = false;
             qemu_mutex_unlock(&p->mutex);
         } else if (p->pending_sync) {
-- 
2.43.0




* [PATCH 08/14] migration/multifd: Drop pages->num check in sender thread
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (6 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 07/14] migration/multifd: Simplify locking in sender thread peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 21:19   ` Fabiano Rosas
  2024-01-31 10:31 ` [PATCH 09/14] migration/multifd: Rename p->num_packets and clean it up peterx
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

Now with a split SYNC handler, we always have pages->num set for
pending_job==true.  Assert it instead.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 4dc5af0a15..2d12de01a1 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -705,13 +705,14 @@ static void *multifd_send_thread(void *opaque)
                 p->iovs_num = 1;
             }
 
-            if (pages->num) {
-                ret = multifd_send_state->ops->send_prepare(p, &local_err);
-                if (ret != 0) {
-                    qemu_mutex_unlock(&p->mutex);
-                    break;
-                }
+            assert(pages->num);
+
+            ret = multifd_send_state->ops->send_prepare(p, &local_err);
+            if (ret != 0) {
+                qemu_mutex_unlock(&p->mutex);
+                break;
             }
+
             multifd_send_fill_packet(p);
             p->num_packets++;
             p->total_normal_pages += pages->num;
-- 
2.43.0




* [PATCH 09/14] migration/multifd: Rename p->num_packets and clean it up
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (7 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 08/14] migration/multifd: Drop pages->num check " peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 21:24   ` Fabiano Rosas
  2024-01-31 10:31 ` [PATCH 10/14] migration/multifd: Move total_normal_pages accounting peterx
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

This field (present on both the src and dest sides) is only used for
debugging purposes.

The two counters could even be removed already, except that they still
more or less provide some accounting of "how many packets were sent/recved
by this thread".  The other, more important counter is packet_num, which
is embedded in the multifd packet headers (MultiFDPacket_t).

So let's keep them for now, but make them much easier to understand by
doing the following:

  - Rename both of them to packets_sent / packets_recved; the old
  name (num_packets) is way too confusing when we already have
  MultiFDPacket_t.packet_num.

  - Avoid worrying about the "initial packet": we know we will send it, and
  that's good enough.  Whether the accounting starts at 0 or at 1 doesn't
  matter a great deal.

  - Move them to where we send/recv the packets.  They're:

    - multifd_send_fill_packet() for senders.
    - multifd_recv_unfill_packet() for receivers.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.h |  6 +++---
 migration/multifd.c | 13 +++++--------
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 08f26ef3fe..2e4ad0dc56 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -124,7 +124,7 @@ typedef struct {
     /* size of the next packet that contains pages */
     uint32_t next_packet_size;
     /* packets sent through this channel */
-    uint64_t num_packets;
+    uint64_t packets_sent;
     /* non zero pages sent through this channel */
     uint64_t total_normal_pages;
     /* buffers to send */
@@ -174,8 +174,8 @@ typedef struct {
     MultiFDPacket_t *packet;
     /* size of the next packet that contains pages */
     uint32_t next_packet_size;
-    /* packets sent through this channel */
-    uint64_t num_packets;
+    /* packets received through this channel */
+    uint64_t packets_recved;
     /* ramblock */
     RAMBlock *block;
     /* ramblock host address */
diff --git a/migration/multifd.c b/migration/multifd.c
index 2d12de01a1..abc2746b6e 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -288,6 +288,8 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
 
         packet->offset[i] = cpu_to_be64(temp);
     }
+
+    p->packets_sent++;
 }
 
 static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
@@ -335,6 +337,7 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
 
     p->next_packet_size = be32_to_cpu(packet->next_packet_size);
     p->packet_num = be64_to_cpu(packet->packet_num);
+    p->packets_recved++;
 
     if (p->normal_num == 0) {
         return 0;
@@ -683,8 +686,6 @@ static void *multifd_send_thread(void *opaque)
         ret = -1;
         goto out;
     }
-    /* initial packet */
-    p->num_packets = 1;
 
     while (true) {
         qemu_sem_post(&multifd_send_state->channels_ready);
@@ -714,7 +715,6 @@ static void *multifd_send_thread(void *opaque)
             }
 
             multifd_send_fill_packet(p);
-            p->num_packets++;
             p->total_normal_pages += pages->num;
             trace_multifd_send(p->id, packet_num, pages->num, p->flags,
                                p->next_packet_size);
@@ -782,7 +782,7 @@ out:
 
     rcu_unregister_thread();
     migration_threads_remove(thread);
-    trace_multifd_send_thread_end(p->id, p->num_packets, p->total_normal_pages);
+    trace_multifd_send_thread_end(p->id, p->packets_sent, p->total_normal_pages);
 
     return NULL;
 }
@@ -1120,7 +1120,6 @@ static void *multifd_recv_thread(void *opaque)
         p->flags &= ~MULTIFD_FLAG_SYNC;
         trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
                            p->next_packet_size);
-        p->num_packets++;
         p->total_normal_pages += p->normal_num;
         qemu_mutex_unlock(&p->mutex);
 
@@ -1146,7 +1145,7 @@ static void *multifd_recv_thread(void *opaque)
     qemu_mutex_unlock(&p->mutex);
 
     rcu_unregister_thread();
-    trace_multifd_recv_thread_end(p->id, p->num_packets, p->total_normal_pages);
+    trace_multifd_recv_thread_end(p->id, p->packets_recved, p->total_normal_pages);
 
     return NULL;
 }
@@ -1248,8 +1247,6 @@ void multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
     }
     p->c = ioc;
     object_ref(OBJECT(ioc));
-    /* initial packet */
-    p->num_packets = 1;
 
     p->running = true;
     qemu_thread_create(&p->thread, p->name, multifd_recv_thread, p,
-- 
2.43.0




* [PATCH 10/14] migration/multifd: Move total_normal_pages accounting
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (8 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 09/14] migration/multifd: Rename p->num_packets and clean it up peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 21:26   ` Fabiano Rosas
  2024-01-31 10:31 ` [PATCH 11/14] migration/multifd: Move trace_multifd_send|recv() peterx
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

Just like the previous patch, move the accounting for total_normal_pages on
both src/dst sides into the packet fill/unfill procedures.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index abc2746b6e..2224dc9833 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -290,6 +290,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
     }
 
     p->packets_sent++;
+    p->total_normal_pages += pages->num;
 }
 
 static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
@@ -338,6 +339,7 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
     p->next_packet_size = be32_to_cpu(packet->next_packet_size);
     p->packet_num = be64_to_cpu(packet->packet_num);
     p->packets_recved++;
+    p->total_normal_pages += p->normal_num;
 
     if (p->normal_num == 0) {
         return 0;
@@ -715,7 +717,6 @@ static void *multifd_send_thread(void *opaque)
             }
 
             multifd_send_fill_packet(p);
-            p->total_normal_pages += pages->num;
             trace_multifd_send(p->id, packet_num, pages->num, p->flags,
                                p->next_packet_size);
 
@@ -1120,7 +1121,6 @@ static void *multifd_recv_thread(void *opaque)
         p->flags &= ~MULTIFD_FLAG_SYNC;
         trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
                            p->next_packet_size);
-        p->total_normal_pages += p->normal_num;
         qemu_mutex_unlock(&p->mutex);
 
         if (p->normal_num) {
-- 
2.43.0




* [PATCH 11/14] migration/multifd: Move trace_multifd_send|recv()
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (9 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 10/14] migration/multifd: Move total_normal_pages accounting peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 21:26   ` Fabiano Rosas
  2024-01-31 10:31 ` [PATCH 12/14] migration/multifd: multifd_send_prepare_header() peterx
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

Move them into the fill/unfill of packets.  With that, we can further
clean up the send/recv thread procedures, and remove one more temporary
variable.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 2224dc9833..8d4b80f365 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -291,6 +291,9 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
 
     p->packets_sent++;
     p->total_normal_pages += pages->num;
+
+    trace_multifd_send(p->id, p->packet_num, pages->num, p->flags,
+                       p->next_packet_size);
 }
 
 static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
@@ -341,6 +344,9 @@ static int multifd_recv_unfill_packet(MultiFDRecvParams *p, Error **errp)
     p->packets_recved++;
     p->total_normal_pages += p->normal_num;
 
+    trace_multifd_recv(p->id, p->packet_num, p->normal_num, p->flags,
+                       p->next_packet_size);
+
     if (p->normal_num == 0) {
         return 0;
     }
@@ -699,7 +705,6 @@ static void *multifd_send_thread(void *opaque)
         qemu_mutex_lock(&p->mutex);
 
         if (p->pending_job) {
-            uint64_t packet_num = p->packet_num;
             MultiFDPages_t *pages = p->pages;
 
             if (use_zero_copy_send) {
@@ -717,8 +722,6 @@ static void *multifd_send_thread(void *opaque)
             }
 
             multifd_send_fill_packet(p);
-            trace_multifd_send(p->id, packet_num, pages->num, p->flags,
-                               p->next_packet_size);
 
             if (use_zero_copy_send) {
                 /* Send header first, without zerocopy */
@@ -1119,8 +1122,6 @@ static void *multifd_recv_thread(void *opaque)
         flags = p->flags;
         /* recv methods don't know how to handle the SYNC flag */
         p->flags &= ~MULTIFD_FLAG_SYNC;
-        trace_multifd_recv(p->id, p->packet_num, p->normal_num, flags,
-                           p->next_packet_size);
         qemu_mutex_unlock(&p->mutex);
 
         if (p->normal_num) {
-- 
2.43.0




* [PATCH 12/14] migration/multifd: multifd_send_prepare_header()
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (10 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 11/14] migration/multifd: Move trace_multifd_send|recv() peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 21:32   ` Fabiano Rosas
  2024-01-31 10:31 ` [PATCH 13/14] migration/multifd: Move header prepare/fill into send_prepare() peterx
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

Introduce a helper, multifd_send_prepare_header(), to set up the packet
header (IOV[0]) for the multifd sender.

It's fine to set up IOV[0] _before_ send_prepare(), because the packet
buffer is already allocated even if its content is still to be filled in.

With this helper, we can already slightly clean up the zero copy path.

Note that I explicitly put it into multifd.h, because I want it inlined
directly into multifd*.c where necessary later.
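
As a usage sketch, a compression hook would then start with the helper and
append its own IOVs afterwards (this is what the next patch does for
zlib/zstd; the body here is only indicative):

  static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
  {
      multifd_send_prepare_header(p);   /* iov[0] = packet header */
      /* ... compress the pages, append the result as the next IOV,
       * and set p->next_packet_size ... */
      multifd_send_fill_packet(p);
      return 0;
  }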

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.h |  8 ++++++++
 migration/multifd.c | 16 ++++++++--------
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 2e4ad0dc56..4ec005f53f 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -209,5 +209,13 @@ typedef struct {
 
 void multifd_register_ops(int method, MultiFDMethods *ops);
 
+static inline void multifd_send_prepare_header(MultiFDSendParams *p)
+{
+    p->iov[0].iov_len = p->packet_len;
+    p->iov[0].iov_base = p->packet;
+    p->iovs_num++;
+}
+
+
 #endif
 
diff --git a/migration/multifd.c b/migration/multifd.c
index 8d4b80f365..1b0035787e 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -707,10 +707,14 @@ static void *multifd_send_thread(void *opaque)
         if (p->pending_job) {
             MultiFDPages_t *pages = p->pages;
 
-            if (use_zero_copy_send) {
-                p->iovs_num = 0;
-            } else {
-                p->iovs_num = 1;
+            p->iovs_num = 0;
+
+            if (!use_zero_copy_send) {
+                /*
+                 * Only !zero_copy needs the header in IOV; zerocopy will
+                 * send it separately.
+                 */
+                multifd_send_prepare_header(p);
             }
 
             assert(pages->num);
@@ -730,10 +734,6 @@ static void *multifd_send_thread(void *opaque)
                 if (ret != 0) {
                     break;
                 }
-            } else {
-                /* Send header using the same writev call */
-                p->iov[0].iov_len = p->packet_len;
-                p->iov[0].iov_base = p->packet;
             }
 
             ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
-- 
2.43.0




* [PATCH 13/14] migration/multifd: Move header prepare/fill into send_prepare()
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (11 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 12/14] migration/multifd: multifd_send_prepare_header() peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 21:42   ` Fabiano Rosas
  2024-02-02  3:57   ` Peter Xu
  2024-01-31 10:31 ` [PATCH 14/14] migration/multifd: Forbid spurious wakeups peterx
  2024-01-31 22:49 ` [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups Fabiano Rosas
  14 siblings, 2 replies; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

This patch redefines the interface of ->send_prepare().  It further
simplifies multifd_send_thread(), especially on the zero-copy path.

With the new interface, we require the hook to do all the work of
preparing the IOVs to send.  After it completes, the IOVs should be ready
to be dumped into the specific multifd QIOChannel later.

So now the API looks like:

  p->pages ----------->  send_prepare() -------------> IOVs

This also prepares for the case where the input can be extended beyond
p->pages.  But that's for later.

This patch achieves a similar goal to what Fabiano proposed here:

https://lore.kernel.org/r/20240126221943.26628-1-farosas@suse.de

However the send() interface may not be necessary.  I'm boldly attaching a
"Co-developed-by" for Fabiano.

Co-developed-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.h      |  1 +
 migration/multifd-zlib.c |  4 ++++
 migration/multifd-zstd.c |  4 ++++
 migration/multifd.c      | 45 ++++++++++++++++++++--------------------
 4 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 4ec005f53f..34a2ecb9f4 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -208,6 +208,7 @@ typedef struct {
 } MultiFDMethods;
 
 void multifd_register_ops(int method, MultiFDMethods *ops);
+void multifd_send_fill_packet(MultiFDSendParams *p);
 
 static inline void multifd_send_prepare_header(MultiFDSendParams *p)
 {
diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index 100809abc1..012e3bdea1 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -123,6 +123,8 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
     int ret;
     uint32_t i;
 
+    multifd_send_prepare_header(p);
+
     for (i = 0; i < pages->num; i++) {
         uint32_t available = z->zbuff_len - out_size;
         int flush = Z_NO_FLUSH;
@@ -172,6 +174,8 @@ static int zlib_send_prepare(MultiFDSendParams *p, Error **errp)
     p->next_packet_size = out_size;
     p->flags |= MULTIFD_FLAG_ZLIB;
 
+    multifd_send_fill_packet(p);
+
     return 0;
 }
 
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 2023edd8cc..dc8fe43e94 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -118,6 +118,8 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
     int ret;
     uint32_t i;
 
+    multifd_send_prepare_header(p);
+
     z->out.dst = z->zbuff;
     z->out.size = z->zbuff_len;
     z->out.pos = 0;
@@ -161,6 +163,8 @@ static int zstd_send_prepare(MultiFDSendParams *p, Error **errp)
     p->next_packet_size = z->out.pos;
     p->flags |= MULTIFD_FLAG_ZSTD;
 
+    multifd_send_fill_packet(p);
+
     return 0;
 }
 
diff --git a/migration/multifd.c b/migration/multifd.c
index 1b0035787e..0f22646f95 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -88,7 +88,17 @@ static void nocomp_send_cleanup(MultiFDSendParams *p, Error **errp)
  */
 static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
 {
+    bool use_zero_copy_send = migrate_zero_copy_send();
     MultiFDPages_t *pages = p->pages;
+    int ret;
+
+    if (!use_zero_copy_send) {
+        /*
+         * Only !zero_copy needs the header in IOV; zerocopy will
+         * send it separately.
+         */
+        multifd_send_prepare_header(p);
+    }
 
     for (int i = 0; i < pages->num; i++) {
         p->iov[p->iovs_num].iov_base = pages->block->host + pages->offset[i];
@@ -98,6 +108,18 @@ static int nocomp_send_prepare(MultiFDSendParams *p, Error **errp)
 
     p->next_packet_size = pages->num * p->page_size;
     p->flags |= MULTIFD_FLAG_NOCOMP;
+
+    multifd_send_fill_packet(p);
+
+    if (use_zero_copy_send) {
+        /* Send header first, without zerocopy */
+        ret = qio_channel_write_all(p->c, (void *)p->packet,
+                                    p->packet_len, errp);
+        if (ret != 0) {
+            return -1;
+        }
+    }
+
     return 0;
 }
 
@@ -266,7 +288,7 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
     g_free(pages);
 }
 
-static void multifd_send_fill_packet(MultiFDSendParams *p)
+void multifd_send_fill_packet(MultiFDSendParams *p)
 {
     MultiFDPacket_t *packet = p->packet;
     MultiFDPages_t *pages = p->pages;
@@ -683,7 +705,6 @@ static void *multifd_send_thread(void *opaque)
     MigrationThread *thread = NULL;
     Error *local_err = NULL;
     int ret = 0;
-    bool use_zero_copy_send = migrate_zero_copy_send();
 
     thread = migration_threads_add(p->name, qemu_get_thread_id());
 
@@ -708,15 +729,6 @@ static void *multifd_send_thread(void *opaque)
             MultiFDPages_t *pages = p->pages;
 
             p->iovs_num = 0;
-
-            if (!use_zero_copy_send) {
-                /*
-                 * Only !zero_copy needs the header in IOV; zerocopy will
-                 * send it separately.
-                 */
-                multifd_send_prepare_header(p);
-            }
-
             assert(pages->num);
 
             ret = multifd_send_state->ops->send_prepare(p, &local_err);
@@ -725,17 +737,6 @@ static void *multifd_send_thread(void *opaque)
                 break;
             }
 
-            multifd_send_fill_packet(p);
-
-            if (use_zero_copy_send) {
-                /* Send header first, without zerocopy */
-                ret = qio_channel_write_all(p->c, (void *)p->packet,
-                                            p->packet_len, &local_err);
-                if (ret != 0) {
-                    break;
-                }
-            }
-
             ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
                                               0, p->write_flags, &local_err);
             if (ret != 0) {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 14/14] migration/multifd: Forbid spurious wakeups
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (12 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 13/14] migration/multifd: Move header prepare/fill into send_prepare() peterx
@ 2024-01-31 10:31 ` peterx
  2024-01-31 21:43   ` Fabiano Rosas
  2024-02-01  6:01   ` Peter Xu
  2024-01-31 22:49 ` [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups Fabiano Rosas
  14 siblings, 2 replies; 45+ messages in thread
From: peterx @ 2024-01-31 10:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, peterx, Yuan Liu,
	Avihai Horon, Hao Xiang

From: Peter Xu <peterx@redhat.com>

Multifd's logic is now designed to have no spurious wakeups.  I still
remember a talk with Juan where he seemed to agree we should drop this; if
my memory serves, the handling was only there because multifd used to hit
spurious wakeups while still being debugged.

Let's drop it and see what explodes, as long as we are not too close to
soft-freeze.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 0f22646f95..bd0e3ea1a5 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -766,9 +766,6 @@ static void *multifd_send_thread(void *opaque)
             p->pending_sync = false;
             qemu_mutex_unlock(&p->mutex);
             qemu_sem_post(&p->sem_sync);
-        } else {
-            qemu_mutex_unlock(&p->mutex);
-            /* sometimes there are spurious wakeups */
         }
     }
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/14] migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
  2024-01-31 10:31 ` [PATCH 03/14] migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths peterx
@ 2024-01-31 15:05   ` Fabiano Rosas
  2024-02-01  9:28     ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 15:05 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> Multifd send side has two fields to indicate error quits:
>
>   - MultiFDSendParams.quit
>   - &multifd_send_state->exiting
>
> Merge them into the global one.  The replacement is done by changing all
> p->quit checks into the global var check.  The global check doesn't need
> any lock.
>
> A few more things done on top of this altogether:
>
>   - multifd_send_terminate_threads()
>
>     Moving the xchg() of &multifd_send_state->exiting upper, so as to cover
>     the tracepoint, migrate_set_error() and migrate_set_state().

Good.

>
>   - multifd_send_sync_main()
>
>     In the 2nd loop, add one more check over the global var to make sure we
>     don't keep the looping if QEMU already decided to quit.

Yes, also because we don't necessarily enter at multifd_send_page()
every time.

>
>   - multifd_tls_outgoing_handshake()
>
>     Use multifd_send_terminate_threads() to set the error state.  That has
>     a benefit of updating MigrationState.error to that error too, so we can
>     persist that 1st error we hit in that specific channel.

Makes sense.

>
>   - multifd_new_send_channel_async()
>
>     Take similar approach like above, drop the migrate_set_error() because
>     multifd_send_terminate_threads() already covers that.  Unwrap the helper
>     multifd_new_send_channel_cleanup() along the way; not really needed.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/multifd.h |  2 --
>  migration/multifd.c | 85 ++++++++++++++++++---------------------------
>  2 files changed, 33 insertions(+), 54 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 35d11f103c..7c040cb85a 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -95,8 +95,6 @@ typedef struct {
>      QemuMutex mutex;
>      /* is this channel thread running */
>      bool running;
> -    /* should this thread finish */
> -    bool quit;
>      /* multifd flags for each packet */
>      uint32_t flags;
>      /* global number of generated multifd packets */
> diff --git a/migration/multifd.c b/migration/multifd.c
> index b8d2c96533..2c98023d67 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -372,6 +372,11 @@ struct {
>      MultiFDMethods *ops;
>  } *multifd_send_state;
>  
> +static bool multifd_send_should_exit(void)
> +{
> +    return qatomic_read(&multifd_send_state->exiting);
> +}
> +
>  /*
>   * The migration thread can wait on either of the two semaphores.  This
>   * function can be used to kick the main thread out of waiting on either of
> @@ -409,7 +414,7 @@ static int multifd_send_pages(void)
>      MultiFDSendParams *p = NULL; /* make happy gcc */
>      MultiFDPages_t *pages = multifd_send_state->pages;
>  
> -    if (qatomic_read(&multifd_send_state->exiting)) {
> +    if (multifd_send_should_exit()) {
>          return -1;
>      }
>  
> @@ -421,14 +426,11 @@ static int multifd_send_pages(void)
>       */
>      next_channel %= migrate_multifd_channels();
>      for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
> -        p = &multifd_send_state->params[i];
> -
> -        qemu_mutex_lock(&p->mutex);
> -        if (p->quit) {
> -            error_report("%s: channel %d has already quit!", __func__, i);
> -            qemu_mutex_unlock(&p->mutex);
> +        if (multifd_send_should_exit()) {
>              return -1;
>          }
> +        p = &multifd_send_state->params[i];
> +        qemu_mutex_lock(&p->mutex);
>          if (!p->pending_job) {
>              p->pending_job++;
>              next_channel = (i + 1) % migrate_multifd_channels();

Hm, I'm not sure it's correct to check 'exiting' outside of the
lock. While it is an atomic operation, it is not atomic in relation to
pending_job...

... looking closer, it seems that we can do what you suggest because
p->pending_job is not touched by the multifd_send_thread in case of
error, which means this function will indeed miss the 'exiting' flag,
but pending_job > 0 means it will loop to the next channel and _then_ it
will see the 'exiting' flag.

> @@ -483,6 +485,16 @@ static void multifd_send_terminate_threads(Error *err)
>  {
>      int i;
>  
> +    /*
> +     * We don't want to exit each threads twice.  Depending on where
> +     * we get the error, or if there are two independent errors in two
> +     * threads at the same time, we can end calling this function
> +     * twice.
> +     */
> +    if (qatomic_xchg(&multifd_send_state->exiting, 1)) {
> +        return;
> +    }
> +
>      trace_multifd_send_terminate_threads(err != NULL);
>  
>      if (err) {
> @@ -497,26 +509,13 @@ static void multifd_send_terminate_threads(Error *err)
>          }
>      }
>  
> -    /*
> -     * We don't want to exit each threads twice.  Depending on where
> -     * we get the error, or if there are two independent errors in two
> -     * threads at the same time, we can end calling this function
> -     * twice.
> -     */
> -    if (qatomic_xchg(&multifd_send_state->exiting, 1)) {
> -        return;
> -    }
> -
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
> -        qemu_mutex_lock(&p->mutex);
> -        p->quit = true;

Now that you removed this, we decoupled kicking the threads from setting
the exit/error, so this function could be split in two.

We could set the exiting flag at the places the error occurred (multifd
threads, thread creation, etc) and "terminate the threads" at
multifd_save_cleanup(). That second part we already do actually:

void multifd_save_cleanup(void) {
...
    multifd_send_terminate_threads(NULL);
                                   ^see?
    for (i = 0; i < migrate_multifd_channels(); i++) {
        MultiFDSendParams *p = &multifd_send_state->params[i];

        if (p->running) {
            qemu_thread_join(&p->thread);
        }
    }
...
}

I think there's no reason anymore for the channels to kick each
other. They would all be waiting at p->sem and multifd_send_cleanup()
would kick + join them.
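
Roughly what I have in mind (a hypothetical sketch of the split only, with
the migrate_set_state() handling omitted):

static void multifd_send_set_error(Error *err)
{
    /*
     * Only the first caller records the error; later callers bail out so
     * we don't tear things down twice.
     */
    if (qatomic_xchg(&multifd_send_state->exiting, 1)) {
        return;
    }

    if (err) {
        migrate_set_error(migrate_get_current(), err);
    }
}

/*
 * Called only from multifd_save_cleanup(): kick every sender thread out of
 * its sleep on p->sem so that the join right after it can complete.
 */
static void multifd_send_terminate_threads(void)
{
    int i;

    for (i = 0; i < migrate_multifd_channels(); i++) {
        MultiFDSendParams *p = &multifd_send_state->params[i];

        qemu_sem_post(&p->sem);
        if (p->c) {
            qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
        }
    }
}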

>          qemu_sem_post(&p->sem);
>          if (p->c) {
>              qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
>          }
> -        qemu_mutex_unlock(&p->mutex);
>      }
>  }
>  
> @@ -615,16 +614,13 @@ int multifd_send_sync_main(void)
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
> -        trace_multifd_send_sync_main_signal(p->id);
> -
> -        qemu_mutex_lock(&p->mutex);
> -
> -        if (p->quit) {
> -            error_report("%s: channel %d has already quit", __func__, i);
> -            qemu_mutex_unlock(&p->mutex);
> +        if (multifd_send_should_exit()) {
>              return -1;
>          }
>  
> +        trace_multifd_send_sync_main_signal(p->id);
> +
> +        qemu_mutex_lock(&p->mutex);
>          p->packet_num = multifd_send_state->packet_num++;
>          p->flags |= MULTIFD_FLAG_SYNC;
>          p->pending_job++;
> @@ -634,6 +630,10 @@ int multifd_send_sync_main(void)
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
> +        if (multifd_send_should_exit()) {
> +            return -1;
> +        }
> +
>          qemu_sem_wait(&multifd_send_state->channels_ready);
>          trace_multifd_send_sync_main_wait(p->id);
>          qemu_sem_wait(&p->sem_sync);
> @@ -671,7 +671,7 @@ static void *multifd_send_thread(void *opaque)
>          qemu_sem_post(&multifd_send_state->channels_ready);
>          qemu_sem_wait(&p->sem);
>  
> -        if (qatomic_read(&multifd_send_state->exiting)) {
> +        if (multifd_send_should_exit()) {
>              break;
>          }
>          qemu_mutex_lock(&p->mutex);
> @@ -786,12 +786,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>  
>      trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>  
> -    migrate_set_error(migrate_get_current(), err);
> -    /*
> -     * Error happen, mark multifd_send_thread status as 'quit' although it
> -     * is not created, and then tell who pay attention to me.
> -     */
> -    p->quit = true;
> +    multifd_send_terminate_threads(err);
>      multifd_send_kick_main(p);
>      error_free(err);
>  }
> @@ -857,22 +852,6 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>      return true;
>  }
>  
> -static void multifd_new_send_channel_cleanup(MultiFDSendParams *p,
> -                                             QIOChannel *ioc, Error *err)
> -{
> -     migrate_set_error(migrate_get_current(), err);
> -     /* Error happen, we need to tell who pay attention to me */
> -     multifd_send_kick_main(p);
> -     /*
> -      * Although multifd_send_thread is not created, but main migration
> -      * thread need to judge whether it is running, so we need to mark
> -      * its status.
> -      */
> -     p->quit = true;
> -     object_unref(OBJECT(ioc));
> -     error_free(err);
> -}
> -
>  static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
>  {
>      MultiFDSendParams *p = opaque;
> @@ -889,7 +868,10 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
>      }
>  
>      trace_multifd_new_send_channel_async_error(p->id, local_err);
> -    multifd_new_send_channel_cleanup(p, ioc, local_err);
> +    multifd_send_terminate_threads(local_err);
> +    multifd_send_kick_main(p);
> +    object_unref(OBJECT(ioc));
> +    error_free(local_err);
>  }
>  
>  static void multifd_new_send_channel_create(gpointer opaque)
> @@ -921,7 +903,6 @@ int multifd_save_setup(Error **errp)
>          qemu_mutex_init(&p->mutex);
>          qemu_sem_init(&p->sem, 0);
>          qemu_sem_init(&p->sem_sync, 0);
> -        p->quit = false;
>          p->pending_job = 0;
>          p->id = i;
>          p->pages = multifd_pages_init(page_count);


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] migration/multifd: Postpone reset of MultiFDPages_t
  2024-01-31 10:31 ` [PATCH 04/14] migration/multifd: Postpone reset of MultiFDPages_t peterx
@ 2024-01-31 15:27   ` Fabiano Rosas
  2024-02-01 10:01     ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 15:27 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> Now we reset MultiFDPages_t object in the multifd sender thread in the
> middle of the sending job.  That's not necessary, because the "*pages"
> struct will not be reused anyway until pending_job is cleared.
>
> Move that to the end after the job is completed, and provide a helper to reset
> a "*pages" object.  Use that same helper when freeing the object too.
>
> This prepares us to keep using p->pages in the follow up patches, where we
> may drop p->normal[].
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>

Just an observation about the code below.

> ---
>  migration/multifd.c | 18 ++++++++++++++----
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 2c98023d67..5633ac245a 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -172,6 +172,17 @@ void multifd_register_ops(int method, MultiFDMethods *ops)
>      multifd_ops[method] = ops;
>  }
>  
> +/* Reset a MultiFDPages_t* object for the next use */
> +static void multifd_pages_reset(MultiFDPages_t *pages)
> +{
> +    /*
> +     * We don't need to touch offset[] array, because it will be
> +     * overwritten later when reused.
> +     */
> +    pages->num = 0;
> +    pages->block = NULL;

Having to do this at all is a huge overloading of this field. This not
only resets it, but it also communicates to multifd_queue_page() that
the previous payload has been sent. Otherwise, multifd_queue_page()
wouldn't know whether the previous call to multifd_queue_page() has
called multifd_send_pages() or if it has exited early. So this basically
means "the block that was previously here has been sent".

That's why we need the changed=true logic. A
multifd_send_state->pages->block still has a few pages left to send, but
because it's less than pages->allocated, it skips
multifd_send_pages(). The next call to multifd_queue_page() already has
the next ramblock. So we set changed=true, call multifd_send_pages() to
send the remaining pages of that block and recurse into
multifd_queue_page() once more to send the new block.
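
For reference, the flow being described looks roughly like this (paraphrased
from multifd_queue_page(), not the exact code):

int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
{
    MultiFDPages_t *pages = multifd_send_state->pages;
    bool changed = false;

    if (!pages->block) {
        pages->block = block;               /* first page after a reset */
    }

    if (pages->block == block) {
        pages->offset[pages->num++] = offset;
        if (pages->num < pages->allocated) {
            return 1;                       /* keep batching, nothing sent yet */
        }
    } else {
        changed = true;                     /* ramblock changed, flush the old batch */
    }

    if (multifd_send_pages() < 0) {
        return -1;
    }

    if (changed) {
        /* recurse once to queue this page into the fresh batch */
        return multifd_queue_page(block, offset);
    }

    return 1;
}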

> +}
> +
>  static int multifd_send_initial_packet(MultiFDSendParams *p, Error **errp)
>  {
>      MultiFDInit_t msg = {};
> @@ -248,9 +259,8 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
>  
>  static void multifd_pages_clear(MultiFDPages_t *pages)
>  {
> -    pages->num = 0;
> +    multifd_pages_reset(pages);
>      pages->allocated = 0;
> -    pages->block = NULL;
>      g_free(pages->offset);
>      pages->offset = NULL;
>      g_free(pages);
> @@ -704,8 +714,6 @@ static void *multifd_send_thread(void *opaque)
>              p->flags = 0;
>              p->num_packets++;
>              p->total_normal_pages += p->normal_num;
> -            p->pages->num = 0;
> -            p->pages->block = NULL;
>              qemu_mutex_unlock(&p->mutex);
>  
>              trace_multifd_send(p->id, packet_num, p->normal_num, flags,
> @@ -732,6 +740,8 @@ static void *multifd_send_thread(void *opaque)
>  
>              stat64_add(&mig_stats.multifd_bytes,
>                         p->next_packet_size + p->packet_len);
> +
> +            multifd_pages_reset(p->pages);
>              p->next_packet_size = 0;
>              qemu_mutex_lock(&p->mutex);
>              p->pending_job--;


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/14] migration/multifd: Drop MultiFDSendParams.normal[] array
  2024-01-31 10:31 ` [PATCH 05/14] migration/multifd: Drop MultiFDSendParams.normal[] array peterx
@ 2024-01-31 16:02   ` Fabiano Rosas
  0 siblings, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 16:02 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> This array is redundant when p->pages exists.  Now we extended the life of
> p->pages to the whole period where pending_job is set, it should be safe to
> always use p->pages->offset[] rather than p->normal[].  Drop the array.
>
> Alongside, the normal_num is also redundant, which is the same to
> p->pages->num.
>
> This doesn't apply to recv side, because there's no extra buffering on recv
> side, so p->normal[] array is still needed.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/14] migration/multifd: Separate SYNC request with normal jobs
  2024-01-31 10:31 ` [PATCH 06/14] migration/multifd: Separate SYNC request with normal jobs peterx
@ 2024-01-31 18:45   ` Fabiano Rosas
  0 siblings, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 18:45 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> Multifd provides a threaded model for processing jobs.  On the sender side,
> there can be two kinds of job: (1) a list of pages to send, or (2) a sync
> request.
>
> The sync request is a very special kind of job.  It never contains a page
> array, but only a multifd packet telling the dest side to synchronize with
> sent pages.
>
> Before this patch, both requests use the pending_job field, no matter what
> the request is, it will boost pending_job, while multifd sender thread will
> decrement it after it finishes one job.
>
> However this should be racy, because SYNC is special in that it needs to
> set p->flags with MULTIFD_FLAG_SYNC, showing that this is a sync request.
> Consider a sequence of operations where:
>
>   - migration thread enqueue a job to send some pages, pending_job++ (0->1)
>
>   - [...before the selected multifd sender thread wakes up...]
>
>   - migration thread enqueue another job to sync, pending_job++ (1->2),
>     setup p->flags=MULTIFD_FLAG_SYNC
>
>   - multifd sender thread wakes up, found pending_job==2
>     - send the 1st packet with MULTIFD_FLAG_SYNC and list of pages
>     - send the 2nd packet with flags==0 and no pages
>
> This is not expected, because MULTIFD_FLAG_SYNC should hopefully be done
> after all the pages are received.  Meanwhile, the 2nd packet will be
> completely useless, which contains zero information.
>
> I didn't verify above, but I think this issue is still benign in that at
> least on the recv side we always receive pages before handling
> MULTIFD_FLAG_SYNC.  However that's not always guaranteed and just tricky.
>
> One other reason I want to separate it is using p->flags to communicate
> between the two threads is also not clearly defined, it's very hard to read
> and understand why accessing p->flags is always safe; see the current impl
> of multifd_send_thread() where we tried to cache only p->flags.  It doesn't
> need to be that complicated.
>
> This patch introduces pending_sync, a separate flag just to show that the
> requester needs a sync.  Alongside, we remove the tricky caching of
> p->flags now because after this patch p->flags should only be used by
> multifd sender thread now, which will be crystal clear.  So it is always
> thread safe to access p->flags.
>
> With that, we can also safely convert the pending_job into a boolean,
> because we don't support >1 pending jobs anyway.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/14] migration/multifd: Simplify locking in sender thread
  2024-01-31 10:31 ` [PATCH 07/14] migration/multifd: Simplify locking in sender thread peterx
@ 2024-01-31 20:21   ` Fabiano Rosas
  2024-02-01 10:37     ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 20:21 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> The sender thread will yield the p->mutex before IO starts, trying to not
> block the requester thread.  This may be an unnecessary lock optimization,
> because the requester can already read pending_job safely even without the
> lock, because the requester is currently the only one who can assign a
> task.

What about the coroutine yield at qio_channel_writev_full_all()? Is it
safe to yield while holding a lock? Could the main loop dispatch the
cleanup function, which then calls join on the multifd thread and deadlocks?

>
> Drop that lock complication on both sides:
>
>   (1) in the sender thread, always take the mutex until job done
>   (2) in the requester thread, check pending_job clear lockless
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/multifd.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 6a4863edd2..4dc5af0a15 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -429,7 +429,9 @@ static int multifd_send_pages(void)
>          return -1;
>      }
>  
> +    /* We wait here, until at least one channel is ready */
>      qemu_sem_wait(&multifd_send_state->channels_ready);
> +
>      /*
>       * next_channel can remain from a previous migration that was
>       * using more channels, so ensure it doesn't overflow if the
> @@ -441,17 +443,26 @@ static int multifd_send_pages(void)
>              return -1;
>          }
>          p = &multifd_send_state->params[i];
> -        qemu_mutex_lock(&p->mutex);
> +        /*
> +         * Lockless read to p->pending_job is safe, because only multifd
> +         * sender thread can clear it.
> +         */
>          if (!p->pending_job) {

The worst that could happen is we read it at the same time the thread is
clearing it and we loop to the next channel. So it doesn't need to be
atomic either.

> -            p->pending_job = true;
>              next_channel = (i + 1) % migrate_multifd_channels();
>              break;
>          }
> -        qemu_mutex_unlock(&p->mutex);
>      }
> +
> +    qemu_mutex_lock(&p->mutex);

What data does this lock protect now? Everything below here only happens
after this thread sees pending_job==false. It seems we would only need a
barrier on the multifd thread to make sure p->pending_job=false is
ordered after everything.

Even for the "sync" case, it appears the lock is not needed either?

We might need to remove p->running first and move the kick from
multifd_send_terminate_threads() into multifd_save_cleanup() like I
suggested, but it seems like we could remove this lock.

Which would make sense, because there's nothing another thread would
want to do with a channel's MultiFDSendParams unless the channel is idle
waiting for work.
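
If we did drop the lock, I imagine the hand-off could be expressed with an
acquire/release pair on pending_job instead, something like this (a rough,
untested sketch of the idea; multifd_try_assign_job is a made-up name):

/* Requester side (migration thread), while picking a channel: */
static bool multifd_try_assign_job(MultiFDSendParams *p, MultiFDPages_t *pages)
{
    /* Pairs with the store_release of pending_job=false in the sender */
    if (qatomic_load_acquire(&p->pending_job)) {
        return false;                  /* channel still busy, try the next one */
    }

    /* Hand over the job: swap the pages batch into this channel */
    multifd_send_state->pages = p->pages;
    p->pages = pages;

    /* Publish the job only after everything above is visible */
    qatomic_store_release(&p->pending_job, true);
    qemu_sem_post(&p->sem);
    return true;
}

/* ...and on the sender thread, once the writev has completed: */
    multifd_pages_reset(p->pages);
    p->next_packet_size = 0;
    /* Pairs with the load_acquire above */
    qatomic_store_release(&p->pending_job, false);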

> +    /*
> +     * Double check on pending_job==false with the lock.  In the future if
> +     * we can have >1 requester thread, we can replace this with a "goto
> +     * retry", but that is for later.
> +     */
> +    assert(p->pending_job == false);
> +    p->pending_job = true;
>      assert(!p->pages->num);
>      assert(!p->pages->block);
> -
>      p->packet_num = multifd_send_state->packet_num++;

I noticed this line cannot be here. If the channel thread takes long to
wake up, the "sync" code will increment once more and overwrite this
field. This and the identical line at multifd_send_sync_main() should go
into multifd_send_fill_packet(), I think.
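
Something like the below, perhaps (an untested sketch; it assumes
multifd_send_state->packet_num can be bumped atomically, so that both the
normal and the sync paths claim their packet number in one place):

void multifd_send_fill_packet(MultiFDSendParams *p)
{
    MultiFDPacket_t *packet = p->packet;
    uint64_t packet_num;

    /*
     * flags, pages_alloc, normal_pages, the ramblock name and the offsets
     * are filled exactly as before; omitted here for brevity.
     */

    /*
     * Claim the next global packet number here, atomically, so neither
     * the requester nor the sync request path needs to touch it anymore.
     */
    packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
    packet->packet_num = cpu_to_be64(packet_num);
    p->packet_num = packet_num;
}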

>      multifd_send_state->pages = p->pages;
>      p->pages = pages;
> @@ -704,8 +715,6 @@ static void *multifd_send_thread(void *opaque)
>              multifd_send_fill_packet(p);
>              p->num_packets++;
>              p->total_normal_pages += pages->num;
> -            qemu_mutex_unlock(&p->mutex);
> -
>              trace_multifd_send(p->id, packet_num, pages->num, p->flags,
>                                 p->next_packet_size);
>  
> @@ -725,6 +734,7 @@ static void *multifd_send_thread(void *opaque)
>              ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
>                                                0, p->write_flags, &local_err);
>              if (ret != 0) {
> +                qemu_mutex_unlock(&p->mutex);
>                  break;
>              }
>  
> @@ -733,7 +743,6 @@ static void *multifd_send_thread(void *opaque)
>  
>              multifd_pages_reset(p->pages);
>              p->next_packet_size = 0;
> -            qemu_mutex_lock(&p->mutex);
>              p->pending_job = false;
>              qemu_mutex_unlock(&p->mutex);
>          } else if (p->pending_sync) {


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/14] migration/multifd: Drop pages->num check in sender thread
  2024-01-31 10:31 ` [PATCH 08/14] migration/multifd: Drop pages->num check " peterx
@ 2024-01-31 21:19   ` Fabiano Rosas
  0 siblings, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 21:19 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> Now with a split SYNC handler, we always have pages->num set for
> pending_job==true.  Assert it instead.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 09/14] migration/multifd: Rename p->num_packets and clean it up
  2024-01-31 10:31 ` [PATCH 09/14] migration/multifd: Rename p->num_packets and clean it up peterx
@ 2024-01-31 21:24   ` Fabiano Rosas
  0 siblings, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 21:24 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> This field, no matter whether on src or dest, is only used for debugging
> purpose.
>
> They can even be removed already, unless they still more or less provide some
> accounting on "how many packets are sent/recved for this thread".  The
> other, more important one is called packet_num, which is embedded in the
> multifd packet headers (MultiFDPacket_t).
>
> So let's keep them for now, but make them much easier to understand, by
> doing below:
>
>   - Rename both of them to packets_sent / packets_recved; the old
>   name (num_packets) is waaay too confusing when we already have
>   MultiFDPacket_t.packets_num.
>
>   - Avoid worrying about the "initial packet": we know we will send it, that's
>   good enough.  The accounting won't matter a great deal to start with 0 or
>   with 1.
>
>   - Move them to where we send/recv the packets.  They're:
>
>     - multifd_send_fill_packet() for senders.
>     - multifd_recv_unfill_packet() for receivers.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 10/14] migration/multifd: Move total_normal_pages accounting
  2024-01-31 10:31 ` [PATCH 10/14] migration/multifd: Move total_normal_pages accounting peterx
@ 2024-01-31 21:26   ` Fabiano Rosas
  0 siblings, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 21:26 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> Just like the previous patch, move the accounting for total_normal_pages on
> both src/dst sides into the packet fill/unfill procedures.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/14] migration/multifd: Move trace_multifd_send|recv()
  2024-01-31 10:31 ` [PATCH 11/14] migration/multifd: Move trace_multifd_send|recv() peterx
@ 2024-01-31 21:26   ` Fabiano Rosas
  0 siblings, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 21:26 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> Move them into fill/unfill of packets.  With that, we can further clean up
> the send/recv thread procedure, and remove one more temp var.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 12/14] migration/multifd: multifd_send_prepare_header()
  2024-01-31 10:31 ` [PATCH 12/14] migration/multifd: multifd_send_prepare_header() peterx
@ 2024-01-31 21:32   ` Fabiano Rosas
  2024-02-01 10:02     ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 21:32 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> Introduce a helper multifd_send_prepare_header() to set up the header packet
> for the multifd sender.
>
> It's fine to set up the IOV[0] _before_ send_prepare() because the packet
> buffer is already ready, even if the content is to be filled in.
>
> With this helper, we can already slightly clean up the zero copy path.
>
> Note that I explicitly put it into multifd.h, because I want it inlined
> directly into multifd*.c where necessary later.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>

nit below:

> ---
>  migration/multifd.h |  8 ++++++++
>  migration/multifd.c | 16 ++++++++--------
>  2 files changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 2e4ad0dc56..4ec005f53f 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -209,5 +209,13 @@ typedef struct {
>  
>  void multifd_register_ops(int method, MultiFDMethods *ops);
>  
> +static inline void multifd_send_prepare_header(MultiFDSendParams *p)
> +{
> +    p->iov[0].iov_len = p->packet_len;
> +    p->iov[0].iov_base = p->packet;
> +    p->iovs_num++;
> +}
> +
> +
>  #endif
>  
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 8d4b80f365..1b0035787e 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -707,10 +707,14 @@ static void *multifd_send_thread(void *opaque)
>          if (p->pending_job) {
>              MultiFDPages_t *pages = p->pages;
>  
> -            if (use_zero_copy_send) {
> -                p->iovs_num = 0;
> -            } else {
> -                p->iovs_num = 1;
> +            p->iovs_num = 0;
> +
> +            if (!use_zero_copy_send) {
> +                /*
> +                 * Only !zero_copy needs the header in IOV; zerocopy will
> +                 * send it separately.

Could use the same spelling for both mentions of zero copy.

> +                 */
> +                multifd_send_prepare_header(p);
>              }
>  
>              assert(pages->num);
> @@ -730,10 +734,6 @@ static void *multifd_send_thread(void *opaque)
>                  if (ret != 0) {
>                      break;
>                  }
> -            } else {
> -                /* Send header using the same writev call */
> -                p->iov[0].iov_len = p->packet_len;
> -                p->iov[0].iov_base = p->packet;
>              }
>  
>              ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 13/14] migration/multifd: Move header prepare/fill into send_prepare()
  2024-01-31 10:31 ` [PATCH 13/14] migration/multifd: Move header prepare/fill into send_prepare() peterx
@ 2024-01-31 21:42   ` Fabiano Rosas
  2024-02-01 10:15     ` Peter Xu
  2024-02-02  3:57   ` Peter Xu
  1 sibling, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 21:42 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> This patch redefines the interfacing of ->send_prepare().  It further
> simplifies multifd_send_thread() especially on zero copy.
>
> Now with the new interface, we require the hook to do all the work for
> preparing the IOVs to send.  After it's completed, the IOVs should be ready
> to be dumped into the specific multifd QIOChannel later.
>
> So now the API looks like:
>
>   p->pages ----------->  send_prepare() -------------> IOVs
>
> This also prepares for the case where the input can be extended to even not
> any p->pages.  But that's for later.
>
> This patch will achieve similar goal of what Fabiano used to propose here:
>
> https://lore.kernel.org/r/20240126221943.26628-1-farosas@suse.de
>
> However the send() interface may not be necessary.  I'm boldly attaching a

So should I drop send() for fixed-ram as well? Or do you still want a
separate layer just for send()?

> "Co-developed-by" for Fabiano.
>
> Co-developed-by: Fabiano Rosas <farosas@suse.de>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 14/14] migration/multifd: Forbid spurious wakeups
  2024-01-31 10:31 ` [PATCH 14/14] migration/multifd: Forbid spurious wakeups peterx
@ 2024-01-31 21:43   ` Fabiano Rosas
  2024-02-01  6:01   ` Peter Xu
  1 sibling, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 21:43 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> Now multifd's logic is designed to have no spurious wakeup.  I still
> remember a talk to Juan and he seems to agree we should drop it now, and if
> my memory was right it was there because multifd used to hit that when
> still debugging.
>
> Let's drop it and see what can explode; as long as it's not reaching
> soft-freeze.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups
  2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
                   ` (13 preceding siblings ...)
  2024-01-31 10:31 ` [PATCH 14/14] migration/multifd: Forbid spurious wakeups peterx
@ 2024-01-31 22:49 ` Fabiano Rosas
  2024-02-01  5:47   ` Peter Xu
  14 siblings, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2024-01-31 22:49 UTC (permalink / raw)
  To: peterx, qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, peterx, Yuan Liu, Avihai Horon, Hao Xiang

peterx@redhat.com writes:

> From: Peter Xu <peterx@redhat.com>
>
> This patchset contains quite a few refactorings to current multifd:
>
>   - It picked up some patches from an old series of mine [0] (the last
>     patches were dropped, though; I did the cleanup slightly differently):
>
>     I still managed to include one patch to split pending_job, but I
>     rewrote the patch here.
>
>   - It tries to cleanup multiple multifd paths here and there, the ultimate
>     goal is to redefine send_prepare() to be something like:
>
>       p->pages ----------->  send_prepare() -------------> IOVs
>
>     So that there's no obvious change yet on multifd_ops besides redefined
>     interface for send_prepare().  We may want a separate OPs for file
>     later.
>
> For 2), one benefit is already presented by Fabiano in his other series [1]
> on cleaning up zero copy, but this patchset addressed it quite differently,
> and hopefully also more gradually.  The other benefit is for sure if we
> have a more concrete API for send_prepare() and if we can reach an initial
> consensus, then we can have the recent compression accelerators rebased on
> top of this one.
>
> This also prepares for the case where the input can be extended to even not
> any p->pages, but arbitrary data (like VFIO's potential use case in the
> future?).  But that will also for later even if reasonable.
>
> Please have a look.  Thanks,
>
> [0] https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com
> [1] https://lore.kernel.org/qemu-devel/20240126221943.26628-1-farosas@suse.de
>
> Peter Xu (14):
>   migration/multifd: Drop stale comment for multifd zero copy
>   migration/multifd: multifd_send_kick_main()
>   migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
>   migration/multifd: Postpone reset of MultiFDPages_t
>   migration/multifd: Drop MultiFDSendParams.normal[] array
>   migration/multifd: Separate SYNC request with normal jobs
>   migration/multifd: Simplify locking in sender thread
>   migration/multifd: Drop pages->num check in sender thread
>   migration/multifd: Rename p->num_packets and clean it up
>   migration/multifd: Move total_normal_pages accounting
>   migration/multifd: Move trace_multifd_send|recv()
>   migration/multifd: multifd_send_prepare_header()
>   migration/multifd: Move header prepare/fill into send_prepare()
>   migration/multifd: Forbid spurious wakeups
>
>  migration/multifd.h      |  34 +++--
>  migration/multifd-zlib.c |  11 +-
>  migration/multifd-zstd.c |  11 +-
>  migration/multifd.c      | 291 +++++++++++++++++++--------------------
>  4 files changed, 182 insertions(+), 165 deletions(-)

This series didn't survive my 9999 iterations test on the opensuse
machine.

# Running /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
...
kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)


#0  0x00005575dda06399 in qemu_mutex_lock_impl (mutex=0x18, file=0x5575ddce9cc3 "../util/qemu-thread-posix.c", line=275) at ../util/qemu-thread-posix.c:92
#1  0x00005575dda06a94 in qemu_sem_post (sem=0x18) at ../util/qemu-thread-posix.c:275
#2  0x00005575dd56a512 in multifd_send_thread (opaque=0x5575df054ef8) at ../migration/multifd.c:720
#3  0x00005575dda0709b in qemu_thread_start (args=0x7fd404001d50) at ../util/qemu-thread-posix.c:541
#4  0x00007fd45e8a26ea in start_thread (arg=0x7fd3faffd700) at pthread_create.c:477
#5  0x00007fd45cd2150f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The multifd thread is posting channels_ready with an already freed
multifd_send_state.

This is the bug Avihai has hit. We're going into multifd_save_cleanup()
so early that multifd_new_send_channel_async() hasn't even had the
chance to set p->running. So it misses the join and frees everything up
while a second multifd thread is just starting.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups
  2024-01-31 22:49 ` [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups Fabiano Rosas
@ 2024-02-01  5:47   ` Peter Xu
  2024-02-01 12:51     ` Avihai Horon
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-02-01  5:47 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

On Wed, Jan 31, 2024 at 07:49:51PM -0300, Fabiano Rosas wrote:
> peterx@redhat.com writes:
> 
> > From: Peter Xu <peterx@redhat.com>
> >
> > This patchset contains quite a few refactorings to current multifd:
> >
> >   - It picked up some patches from an old series of mine [0] (the last
> >     patches were dropped, though; I did the cleanup slightly differently):
> >
> >     I still managed to include one patch to split pending_job, but I
> >     rewrote the patch here.
> >
> >   - It tries to cleanup multiple multifd paths here and there, the ultimate
> >     goal is to redefine send_prepare() to be something like:
> >
> >       p->pages ----------->  send_prepare() -------------> IOVs
> >
> >     So that there's no obvious change yet on multifd_ops besides redefined
> >     interface for send_prepare().  We may want a separate OPs for file
> >     later.
> >
> > For 2), one benefit is already presented by Fabiano in his other series [1]
> > on cleaning up zero copy, but this patchset addressed it quite differently,
> > and hopefully also more gradually.  The other benefit is for sure if we
> > have a more concrete API for send_prepare() and if we can reach an initial
> > consensus, then we can have the recent compression accelerators rebased on
> > top of this one.
> >
> > This also prepares for the case where the input can be extended to even not
> > any p->pages, but arbitrary data (like VFIO's potential use case in the
> > future?).  But that will also for later even if reasonable.
> >
> > Please have a look.  Thanks,
> >
> > [0] https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com
> > [1] https://lore.kernel.org/qemu-devel/20240126221943.26628-1-farosas@suse.de
> >
> > Peter Xu (14):
> >   migration/multifd: Drop stale comment for multifd zero copy
> >   migration/multifd: multifd_send_kick_main()
> >   migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
> >   migration/multifd: Postpone reset of MultiFDPages_t
> >   migration/multifd: Drop MultiFDSendParams.normal[] array
> >   migration/multifd: Separate SYNC request with normal jobs
> >   migration/multifd: Simplify locking in sender thread
> >   migration/multifd: Drop pages->num check in sender thread
> >   migration/multifd: Rename p->num_packets and clean it up
> >   migration/multifd: Move total_normal_pages accounting
> >   migration/multifd: Move trace_multifd_send|recv()
> >   migration/multifd: multifd_send_prepare_header()
> >   migration/multifd: Move header prepare/fill into send_prepare()
> >   migration/multifd: Forbid spurious wakeups
> >
> >  migration/multifd.h      |  34 +++--
> >  migration/multifd-zlib.c |  11 +-
> >  migration/multifd-zstd.c |  11 +-
> >  migration/multifd.c      | 291 +++++++++++++++++++--------------------
> >  4 files changed, 182 insertions(+), 165 deletions(-)
> 
> This series didn't survive my 9999 iterations test on the opensuse
> machine.
> 
> # Running /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
> ...
> kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> 
> 
> #0  0x00005575dda06399 in qemu_mutex_lock_impl (mutex=0x18, file=0x5575ddce9cc3 "../util/qemu-thread-posix.c", line=275) at ../util/qemu-thread-posix.c:92
> #1  0x00005575dda06a94 in qemu_sem_post (sem=0x18) at ../util/qemu-thread-posix.c:275
> #2  0x00005575dd56a512 in multifd_send_thread (opaque=0x5575df054ef8) at ../migration/multifd.c:720
> #3  0x00005575dda0709b in qemu_thread_start (args=0x7fd404001d50) at ../util/qemu-thread-posix.c:541
> #4  0x00007fd45e8a26ea in start_thread (arg=0x7fd3faffd700) at pthread_create.c:477
> #5  0x00007fd45cd2150f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> 
> The multifd thread is posting channels_ready with an already freed
> multifd_send_state.
> 
> This is the bug Avihai has hit. We're going into multifd_save_cleanup()
> so early that multifd_new_send_channel_async() hasn't even had the
> chance to set p->running. So it misses the join and frees everything up
> while a second multifd thread is just starting.

Thanks for doing that.

Would this series make that bug easier to hit?  I didn't do a lot of
testing on it; it only survived the smoke test and the CI job I kicked off.
I think we can still decide to fix that issue separately; but if this series
makes it easier to hit then that's definitely bad.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 14/14] migration/multifd: Forbid spurious wakeups
  2024-01-31 10:31 ` [PATCH 14/14] migration/multifd: Forbid spurious wakeups peterx
  2024-01-31 21:43   ` Fabiano Rosas
@ 2024-02-01  6:01   ` Peter Xu
  1 sibling, 0 replies; 45+ messages in thread
From: Peter Xu @ 2024-02-01  6:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, Yuan Liu,
	Avihai Horon, Hao Xiang

On Wed, Jan 31, 2024 at 06:31:11PM +0800, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> Now multifd's logic is designed to have no spurious wakeup.  I still
> remember a talk to Juan and he seems to agree we should drop it now, and if
> my memory was right it was there because multifd used to hit that when
> still debugging.
> 
> Let's drop it and see what can explode; as long as it's not reaching
> soft-freeze.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/multifd.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 0f22646f95..bd0e3ea1a5 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -766,9 +766,6 @@ static void *multifd_send_thread(void *opaque)
>              p->pending_sync = false;
>              qemu_mutex_unlock(&p->mutex);
>              qemu_sem_post(&p->sem_sync);
> -        } else {
> -            qemu_mutex_unlock(&p->mutex);
> -            /* sometimes there are spurious wakeups */
>          }
>      }
>  
> -- 
> 2.43.0
> 

While removing this is still the goal, I just noticed that _if_ a spurious
wakeup happens then this will not crash qemu; instead it'll leave the mutex
locked forever and deadlock.

A deadlock is less desirable than a crash in this case, so when I repost, I'll
make sure it crashes, and crashes hard, e.g. by squashing this in:

====
diff --git a/migration/multifd.c b/migration/multifd.c
index bd0e3ea1a5..89011f75d9 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -751,7 +751,9 @@ static void *multifd_send_thread(void *opaque)
             p->next_packet_size = 0;
             p->pending_job = false;
             qemu_mutex_unlock(&p->mutex);
-        } else if (p->pending_sync) {
+        } else {
+            /* If not a normal job, must be a sync request */
+            assert(p->pending_sync);
             p->flags = MULTIFD_FLAG_SYNC;
             multifd_send_fill_packet(p);
             ret = qio_channel_write_all(p->c, (void *)p->packet,
====

Fabiano, I'll keep your ACK, but let me know if you think otherwise.

-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/14] migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
  2024-01-31 15:05   ` Fabiano Rosas
@ 2024-02-01  9:28     ` Peter Xu
  2024-02-01 13:30       ` Fabiano Rosas
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-02-01  9:28 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

On Wed, Jan 31, 2024 at 12:05:08PM -0300, Fabiano Rosas wrote:
> peterx@redhat.com writes:
> 
> > From: Peter Xu <peterx@redhat.com>
> >
> > Multifd send side has two fields to indicate error quits:
> >
> >   - MultiFDSendParams.quit
> >   - &multifd_send_state->exiting
> >
> > Merge them into the global one.  The replacement is done by changing all
> > p->quit checks into the global var check.  The global check doesn't need
> > any lock.
> >
> > A few more things done on top of this altogether:
> >
> >   - multifd_send_terminate_threads()
> >
> >     Moving the xchg() of &multifd_send_state->exiting upper, so as to cover
> >     the tracepoint, migrate_set_error() and migrate_set_state().
> 
> Good.
> 
> >
> >   - multifd_send_sync_main()
> >
> >     In the 2nd loop, add one more check over the global var to make sure we
> >     don't keep the looping if QEMU already decided to quit.
> 
> Yes, also because we don't necessarily enter at multifd_send_page()
> every time.
> 
> >
> >   - multifd_tls_outgoing_handshake()
> >
> >     Use multifd_send_terminate_threads() to set the error state.  That has
> >     a benefit of updating MigrationState.error to that error too, so we can
> >     persist that 1st error we hit in that specific channel.
> 
> Makes sense.
> 
> >
> >   - multifd_new_send_channel_async()
> >
> >     Take similar approach like above, drop the migrate_set_error() because
> >     multifd_send_terminate_threads() already covers that.  Unwrap the helper
> >     multifd_new_send_channel_cleanup() along the way; not really needed.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  migration/multifd.h |  2 --
> >  migration/multifd.c | 85 ++++++++++++++++++---------------------------
> >  2 files changed, 33 insertions(+), 54 deletions(-)
> >
> > diff --git a/migration/multifd.h b/migration/multifd.h
> > index 35d11f103c..7c040cb85a 100644
> > --- a/migration/multifd.h
> > +++ b/migration/multifd.h
> > @@ -95,8 +95,6 @@ typedef struct {
> >      QemuMutex mutex;
> >      /* is this channel thread running */
> >      bool running;
> > -    /* should this thread finish */
> > -    bool quit;
> >      /* multifd flags for each packet */
> >      uint32_t flags;
> >      /* global number of generated multifd packets */
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index b8d2c96533..2c98023d67 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -372,6 +372,11 @@ struct {
> >      MultiFDMethods *ops;
> >  } *multifd_send_state;
> >  
> > +static bool multifd_send_should_exit(void)
> > +{
> > +    return qatomic_read(&multifd_send_state->exiting);
> > +}
> > +
> >  /*
> >   * The migration thread can wait on either of the two semaphores.  This
> >   * function can be used to kick the main thread out of waiting on either of
> > @@ -409,7 +414,7 @@ static int multifd_send_pages(void)
> >      MultiFDSendParams *p = NULL; /* make happy gcc */
> >      MultiFDPages_t *pages = multifd_send_state->pages;
> >  
> > -    if (qatomic_read(&multifd_send_state->exiting)) {
> > +    if (multifd_send_should_exit()) {
> >          return -1;
> >      }
> >  
> > @@ -421,14 +426,11 @@ static int multifd_send_pages(void)
> >       */
> >      next_channel %= migrate_multifd_channels();
> >      for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
> > -        p = &multifd_send_state->params[i];
> > -
> > -        qemu_mutex_lock(&p->mutex);
> > -        if (p->quit) {
> > -            error_report("%s: channel %d has already quit!", __func__, i);
> > -            qemu_mutex_unlock(&p->mutex);
> > +        if (multifd_send_should_exit()) {
> >              return -1;
> >          }
> > +        p = &multifd_send_state->params[i];
> > +        qemu_mutex_lock(&p->mutex);
> >          if (!p->pending_job) {
> >              p->pending_job++;
> >              next_channel = (i + 1) % migrate_multifd_channels();
> 
> Hm, I'm not sure it's correct to check 'exiting' outside of the
> lock. While it is an atomic operation, it is not atomic in relation to
> pending_job...
> 
> ... looking closer, it seems that we can do what you suggest because
> p->pending_job is not touched by the multifd_send_thread in case of
> error, which means this function will indeed miss the 'exiting' flag,
> but pending_job > 0 means it will loop to the next channel and _then_ it
> will see the 'exiting' flag.

It could still be the last channel we iterate over; then IIUC we can still
try to assign a job to a thread even if a concurrent error is set there.

However IMHO it's okay: the error in the sender thread should ultimately
set migrate_set_error(), the main thread should detect that in the
migration loop, and then we'll still quit.  The extra queued job shouldn't
matter.
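
To illustrate why, roughly (just a sketch of the sender loop shape, not the
real code; multifd_send_should_exit() is the helper added in this patch):

    while (true) {
        qemu_sem_wait(&p->sem);

        if (multifd_send_should_exit()) {
            /* error already recorded via migrate_set_error(); bail out */
            break;
        }

        if (p->pending_job) {
            /* prepare and send the queued pages ... */
        }
    }

When the thread wakes up it re-checks the exit flag first, so a job queued
after a concurrent error is either dropped here or sent into a channel that
is failing anyway.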

> 
> > @@ -483,6 +485,16 @@ static void multifd_send_terminate_threads(Error *err)
> >  {
> >      int i;
> >  
> > +    /*
> > +     * We don't want to exit each threads twice.  Depending on where
> > +     * we get the error, or if there are two independent errors in two
> > +     * threads at the same time, we can end calling this function
> > +     * twice.
> > +     */
> > +    if (qatomic_xchg(&multifd_send_state->exiting, 1)) {
> > +        return;
> > +    }
> > +
> >      trace_multifd_send_terminate_threads(err != NULL);
> >  
> >      if (err) {
> > @@ -497,26 +509,13 @@ static void multifd_send_terminate_threads(Error *err)
> >          }
> >      }
> >  
> > -    /*
> > -     * We don't want to exit each threads twice.  Depending on where
> > -     * we get the error, or if there are two independent errors in two
> > -     * threads at the same time, we can end calling this function
> > -     * twice.
> > -     */
> > -    if (qatomic_xchg(&multifd_send_state->exiting, 1)) {
> > -        return;
> > -    }
> > -
> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> >  
> > -        qemu_mutex_lock(&p->mutex);
> > -        p->quit = true;
> 
> Now that you removed this, we decoupled kicking the threads from setting
> the exit/error, so this function could be split in two.
> 
> We could set the exiting flag at the places the error occurred (multifd
> threads, thread creation, etc) and "terminate the threads" at
> multifd_save_cleanup(). That second part we already do actually:
> 
> void multifd_save_cleanup(void) {
> ...
>     multifd_send_terminate_threads(NULL);
>                                    ^see?
>     for (i = 0; i < migrate_multifd_channels(); i++) {
>         MultiFDSendParams *p = &multifd_send_state->params[i];
> 
>         if (p->running) {
>             qemu_thread_join(&p->thread);
>         }
>     }
> ...
> }
> 
> I think there's no reason anymore for the channels to kick each
> other. They would all be waiting at p->sem and multifd_send_cleanup()
> would kick + join them.

Sounds good here.

I'll attach one patch like this, feel free to have an early look:

=====

From f9a3d63d5cca0068daaea4c72392803f4b29dcb5 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 1 Feb 2024 17:01:54 +0800
Subject: [PATCH] migration/multifd: Split multifd_send_terminate_threads()

Split multifd_send_terminate_threads() into two functions:

  - multifd_send_set_error(): used when an error happened on the sender
    side, set error and quit state only

  - multifd_send_terminate_threads(): used only by the main thread to kick
    all multifd send threads out of sleep, for the last recycling.

Use multifd_send_set_error() in the three old call sites where only the
error will be set.

Use multifd_send_terminate_threads() in the last one where the main thread
will kick the multifd threads at last in multifd_save_cleanup().

Both helpers will need to set exiting=1.

Suggested-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.c    | 27 ++++++++++++++++++---------
 migration/trace-events |  2 +-
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index c71e74b101..95dc29c8c7 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -536,10 +536,9 @@ int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
     return 1;
 }
 
-static void multifd_send_terminate_threads(Error *err)
+/* Multifd send side hit an error; remember it and prepare to quit */
+static void multifd_send_set_error(Error *err)
 {
-    int i;
-
     /*
      * We don't want to exit each threads twice.  Depending on where
      * we get the error, or if there are two independent errors in two
@@ -550,8 +549,6 @@ static void multifd_send_terminate_threads(Error *err)
         return;
     }
 
-    trace_multifd_send_terminate_threads(err != NULL);
-
     if (err) {
         MigrationState *s = migrate_get_current();
         migrate_set_error(s, err);
@@ -563,7 +560,19 @@ static void multifd_send_terminate_threads(Error *err)
                               MIGRATION_STATUS_FAILED);
         }
     }
+}
+
+static void multifd_send_terminate_threads(void)
+{
+    int i;
+
+    trace_multifd_send_terminate_threads();
 
+    /*
+     * Tell everyone we're quitting.  No xchg() needed here; we simply
+     * always set it.
+     */
+    qatomic_set(&multifd_send_state->exiting, 1);
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
@@ -586,7 +595,7 @@ void multifd_save_cleanup(void)
     if (!migrate_multifd()) {
         return;
     }
-    multifd_send_terminate_threads(NULL);
+    multifd_send_terminate_threads();
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
@@ -778,7 +787,7 @@ out:
     if (ret) {
         assert(local_err);
         trace_multifd_send_error(p->id);
-        multifd_send_terminate_threads(local_err);
+        multifd_send_set_error(local_err);
         multifd_send_kick_main(p);
         error_free(local_err);
     }
@@ -814,7 +823,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
 
     trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
 
-    multifd_send_terminate_threads(err);
+    multifd_send_set_error(err);
     multifd_send_kick_main(p);
     error_free(err);
 }
@@ -896,7 +905,7 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
     }
 
     trace_multifd_new_send_channel_async_error(p->id, local_err);
-    multifd_send_terminate_threads(local_err);
+    multifd_send_set_error(local_err);
     multifd_send_kick_main(p);
     object_unref(OBJECT(ioc));
     error_free(local_err);
diff --git a/migration/trace-events b/migration/trace-events
index de4a743c8a..298ad2b0dd 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -141,7 +141,7 @@ multifd_send_error(uint8_t id) "channel %u"
 multifd_send_sync_main(long packet_num) "packet num %ld"
 multifd_send_sync_main_signal(uint8_t id) "channel %u"
 multifd_send_sync_main_wait(uint8_t id) "channel %u"
-multifd_send_terminate_threads(bool error) "error %d"
+multifd_send_terminate_threads(void) ""
 multifd_send_thread_end(uint8_t id, uint64_t packets, uint64_t normal_pages) "channel %u packets %" PRIu64 " normal pages %"  PRIu64
 multifd_send_thread_start(uint8_t id) "%u"
 multifd_tls_outgoing_handshake_start(void *ioc, void *tioc, const char *hostname) "ioc=%p tioc=%p hostname=%s"
-- 
2.43.0


-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] migration/multifd: Postpone reset of MultiFDPages_t
  2024-01-31 15:27   ` Fabiano Rosas
@ 2024-02-01 10:01     ` Peter Xu
  2024-02-01 15:21       ` Fabiano Rosas
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-02-01 10:01 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

[-- Attachment #1: Type: text/plain, Size: 2394 bytes --]

On Wed, Jan 31, 2024 at 12:27:51PM -0300, Fabiano Rosas wrote:
> > +/* Reset a MultiFDPages_t* object for the next use */
> > +static void multifd_pages_reset(MultiFDPages_t *pages)
> > +{
> > +    /*
> > +     * We don't need to touch offset[] array, because it will be
> > +     * overwritten later when reused.
> > +     */
> > +    pages->num = 0;
> > +    pages->block = NULL;
> 
> Having to do this at all is a huge overloading of this field. This not
> only resets it, but it also communicates to multifd_queue_page() that
> the previous payload has been sent. Otherwise, multifd_queue_page()
> wouldn't know whether the previous call to multifd_queue_page() has
> called multifd_send_pages() or if it has exited early. So this basically
> means "the block that was previously here has been sent".
> 
> That's why we need the changed=true logic. A
> multifd_send_state->pages->block still has a few pages left to send, but
> because it's less than pages->allocated, it skips
> multifd_send_pages(). The next call to multifd_queue_page() already has
> the next ramblock. So we set changed=true, call multifd_send_pages() to
> send the remaining pages of that block and recurse into
> multifd_queue_page() once more to send the new block.

I agree, the queue page routines are not easy to follow either.

How about a rewrite of the queue logic, like this?

=====
bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
{
    MultiFDPages_t *pages;

retry:
    pages = multifd_send_state->pages;

    /* If the queue is empty, we can already enqueue now */
    if (multifd_queue_empty(pages)) {
        pages->block = block;
        multifd_enqueue(pages, offset);
        return true;
    }

    /*
     * Not empty, meanwhile we need a flush.  It can be because of either:
     *
     * (1) The page is not on the same ramblock of previous ones, or,
     * (2) The queue is full.
     *
     * After flush, always retry.
     */
    if (pages->block != block || multifd_queue_full(pages)) {
        if (!multifd_send_pages()) {
            return false;
        }
        goto retry;
    }

    /* Not empty, and we still have space, do it! */
    multifd_enqueue(pages, offset);
    return true;
}
=====

Would this be clearer?  With above, we can drop the ->ramblock reset,
afaict.

I attached three patches if you agree it's better, then I'll include them
in v2.

-- 
Peter Xu

[-- Attachment #2: 0001-migration-multifd-Change-retval-of-multifd_queue_pag.patch --]
[-- Type: text/plain, Size: 2520 bytes --]

From c5dc2052794efd6da6a1e6f4b49f25d5b32879f7 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 1 Feb 2024 17:50:21 +0800
Subject: [PATCH 1/3] migration/multifd: Change retval of multifd_queue_page()

Using an int is overkill when there are only two options.  Change it to a
boolean.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.h | 2 +-
 migration/multifd.c | 9 +++++----
 migration/ram.c     | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 34a2ecb9f4..a320c53a6f 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -22,7 +22,7 @@ bool multifd_recv_all_channels_created(void);
 void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
 int multifd_send_sync_main(void);
-int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
+bool multifd_queue_page(RAMBlock *block, ram_addr_t offset);
 
 /* Multifd Compression flags */
 #define MULTIFD_FLAG_SYNC (1 << 0)
diff --git a/migration/multifd.c b/migration/multifd.c
index 91be6d2fc4..d0a3b4e062 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -505,7 +505,8 @@ static int multifd_send_pages(void)
     return 1;
 }
 
-int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
+/* Returns true if enqueue successful, false otherwise */
+bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
 {
     MultiFDPages_t *pages = multifd_send_state->pages;
     bool changed = false;
@@ -519,21 +520,21 @@ int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
         pages->num++;
 
         if (pages->num < pages->allocated) {
-            return 1;
+            return true;
         }
     } else {
         changed = true;
     }
 
     if (multifd_send_pages() < 0) {
-        return -1;
+        return false;
     }
 
     if (changed) {
         return multifd_queue_page(block, offset);
     }
 
-    return 1;
+    return true;
 }
 
 /* Multifd send side hit an error; remember it and prepare to quit */
diff --git a/migration/ram.c b/migration/ram.c
index d5b7cd5ac2..4649a81204 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1252,7 +1252,7 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss)
 
 static int ram_save_multifd_page(RAMBlock *block, ram_addr_t offset)
 {
-    if (multifd_queue_page(block, offset) < 0) {
+    if (!multifd_queue_page(block, offset)) {
         return -1;
     }
     stat64_add(&mig_stats.normal_pages, 1);
-- 
2.43.0


[-- Attachment #3: 0002-migration-multifd-Change-retval-of-multifd_send_page.patch --]
[-- Type: text/plain, Size: 2352 bytes --]

From f393f1cfe95d79bed72e6043903ee4c4cb298c21 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 1 Feb 2024 17:51:38 +0800
Subject: [PATCH 2/3] migration/multifd: Change retval of multifd_send_pages()

Using an int is overkill when there are only two options.  Change it to a
boolean.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index d0a3b4e062..d2b0f0eda9 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -449,9 +449,10 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
  * thread is using the channel mutex when changing it, and the channel
  * have to had finish with its own, otherwise pending_job can't be
  * false.
+ *
+ * Returns true if it succeeds, false otherwise.
  */
-
-static int multifd_send_pages(void)
+static bool multifd_send_pages(void)
 {
     int i;
     static int next_channel;
@@ -459,7 +460,7 @@ static int multifd_send_pages(void)
     MultiFDPages_t *pages = multifd_send_state->pages;
 
     if (multifd_send_should_exit()) {
-        return -1;
+        return false;
     }
 
     /* We wait here, until at least one channel is ready */
@@ -473,7 +474,7 @@ static int multifd_send_pages(void)
     next_channel %= migrate_multifd_channels();
     for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
         if (multifd_send_should_exit()) {
-            return -1;
+            return false;
         }
         p = &multifd_send_state->params[i];
         /*
@@ -502,7 +503,7 @@ static int multifd_send_pages(void)
     qemu_mutex_unlock(&p->mutex);
     qemu_sem_post(&p->sem);
 
-    return 1;
+    return true;
 }
 
 /* Returns true if enqueue successful, false otherwise */
@@ -526,7 +527,7 @@ bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
         changed = true;
     }
 
-    if (multifd_send_pages() < 0) {
+    if (!multifd_send_pages()) {
         return false;
     }
 
@@ -666,7 +667,7 @@ int multifd_send_sync_main(void)
         return 0;
     }
     if (multifd_send_state->pages->num) {
-        if (multifd_send_pages() < 0) {
+        if (!multifd_send_pages()) {
             error_report("%s: multifd_send_pages fail", __func__);
             return -1;
         }
-- 
2.43.0


[-- Attachment #4: 0003-migration-multifd-Rewrite-multifd_queue_page.patch --]
[-- Type: text/plain, Size: 2926 bytes --]

From fcddc942cb31bc9d395d67a555d9a2281da452b1 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 1 Feb 2024 17:55:42 +0800
Subject: [PATCH 3/3] migration/multifd: Rewrite multifd_queue_page()

The current multifd_queue_page() is not easy to read and follow.  It is not
good for a few reasons:

  - No helper at all to show what exactly a condition means; in short,
  readability is low.

  - Relies on pages->ramblock being cleared to detect an empty queue.  It's
  slightly an overload of the ramblock pointer, per Fabiano [1], which I
  also agree with.

  - Contains a self-recursion, even if not necessary.

Rewrite this function.  We add some comments to make it even clearer what
it does.

[1] https://lore.kernel.org/r/87wmrpjzew.fsf@suse.de

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.c | 56 ++++++++++++++++++++++++++++++---------------
 1 file changed, 37 insertions(+), 19 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index d2b0f0eda9..5a64a9c2e2 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -506,35 +506,53 @@ static bool multifd_send_pages(void)
     return true;
 }
 
+static inline bool multifd_queue_empty(MultiFDPages_t *pages)
+{
+    return pages->num == 0;
+}
+
+static inline bool multifd_queue_full(MultiFDPages_t *pages)
+{
+    return pages->num == pages->allocated;
+}
+
+static inline void multifd_enqueue(MultiFDPages_t *pages, ram_addr_t offset)
+{
+    pages->offset[pages->num++] = offset;
+}
+
 /* Returns true if enqueue successful, false otherwise */
 bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
 {
-    MultiFDPages_t *pages = multifd_send_state->pages;
-    bool changed = false;
+    MultiFDPages_t *pages;
+
+retry:
+    pages = multifd_send_state->pages;
 
-    if (!pages->block) {
+    /* If the queue is empty, we can already enqueue now */
+    if (multifd_queue_empty(pages)) {
         pages->block = block;
+        multifd_enqueue(pages, offset);
+        return true;
     }
 
-    if (pages->block == block) {
-        pages->offset[pages->num] = offset;
-        pages->num++;
-
-        if (pages->num < pages->allocated) {
-            return true;
+    /*
+     * Not empty, meanwhile we need a flush.  It can be because of either:
+     *
+     * (1) The page is not on the same ramblock of previous ones, or,
+     * (2) The queue is full.
+     *
+     * After flush, always retry.
+     */
+    if (pages->block != block || multifd_queue_full(pages)) {
+        if (!multifd_send_pages()) {
+            return false;
         }
-    } else {
-        changed = true;
-    }
-
-    if (!multifd_send_pages()) {
-        return false;
-    }
-
-    if (changed) {
-        return multifd_queue_page(block, offset);
+        goto retry;
     }
 
+    /* Not empty, and we still have space, do it! */
+    multifd_enqueue(pages, offset);
     return true;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 12/14] migration/multifd: multifd_send_prepare_header()
  2024-01-31 21:32   ` Fabiano Rosas
@ 2024-02-01 10:02     ` Peter Xu
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2024-02-01 10:02 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

On Wed, Jan 31, 2024 at 06:32:54PM -0300, Fabiano Rosas wrote:
> > +            if (!use_zero_copy_send) {
> > +                /*
> > +                 * Only !zero_copy needs the header in IOV; zerocopy will
> > +                 * send it separately.
> 
> Could use the same spelling for both mentions of zero copy.

Will do.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 13/14] migration/multifd: Move header prepare/fill into send_prepare()
  2024-01-31 21:42   ` Fabiano Rosas
@ 2024-02-01 10:15     ` Peter Xu
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2024-02-01 10:15 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

On Wed, Jan 31, 2024 at 06:42:57PM -0300, Fabiano Rosas wrote:
> peterx@redhat.com writes:
> 
> > From: Peter Xu <peterx@redhat.com>
> >
> > This patch redefines the interfacing of ->send_prepare().  It further
> > simplifies multifd_send_thread() especially on zero copy.
> >
> > Now with the new interface, we require the hook to do all the work for
> > preparing the IOVs to send.  After it's completed, the IOVs should be ready
> > to be dumped into the specific multifd QIOChannel later.
> >
> > So now the API looks like:
> >
> >   p->pages ----------->  send_prepare() -------------> IOVs
> >
> > This also prepares for the case where the input can be extended to even not
> > any p->pages.  But that's for later.
> >
> > This patch will achieve similar goal of what Fabiano used to propose here:
> >
> > https://lore.kernel.org/r/20240126221943.26628-1-farosas@suse.de
> >
> > However the send() interface may not be necessary.  I'm boldly attaching a
> 
> So should I drop send() for fixed-ram as well? Or do you still want a
> separate layer just for send()?

Currently, after the whole set is applied, the IO side is pretty much like
before, and IMHO straightforward enough:

            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
                                              0, p->write_flags, &local_err);
            if (ret != 0) {
                qemu_mutex_unlock(&p->mutex);
                break;
            }

IIRC with your file interface added, it could logically become something
like:

            if (use_socket()) {
                ret = qio_channel_writev_full_all(IOV, ...);
            } else {
                /*
                 * Assert "file:".  I forgot what we used to discuss here for
                 * the interface as name.. but I remember we need ramblock,
                 * so perhaps passing over "p" would work?  As then it is
                 * p->pages->ramblock, along with IOVs, etc.
                 */
                ret = qio_channel_XXX(p, ...);
            }

            if (ret != 0) {
                qemu_mutex_unlock(&p->mutex);
                break;
            }

So it's only one way or the other.  We could even add one helper to wrap
these two.
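
E.g. (a sketch only; the helper name is made up, and qio_channel_XXX() is
the same placeholder as above):

    static int multifd_send_dump_iov(MultiFDSendParams *p, Error **errp)
    {
        if (use_socket()) {
            return qio_channel_writev_full_all(p->c, p->iov, p->iovs_num,
                                               NULL, 0, p->write_flags, errp);
        }

        /* the "file:" path */
        return qio_channel_XXX(p, ...);
    }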

IMHO a hook would be more helpful if there were a bunch of "if, else if,
... else" things, so at least three options perhaps?  But if you prefer a
hook, that'll also work for me.  So... your call. :)

But I hope that if send() exists, it's a separate OPs, so that the
compression accelerators don't need to worry at all about how the data will
be dumped when they prepare their new MultiFDMethods (even though your
"file:" will need to block them all as of now, since it only supports no
compression, iiuc).

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/14] migration/multifd: Simplify locking in sender thread
  2024-01-31 20:21   ` Fabiano Rosas
@ 2024-02-01 10:37     ` Peter Xu
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2024-02-01 10:37 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

On Wed, Jan 31, 2024 at 05:21:06PM -0300, Fabiano Rosas wrote:
> peterx@redhat.com writes:
> 
> > From: Peter Xu <peterx@redhat.com>
> >
> > The sender thread will yield the p->mutex before IO starts, trying not to
> > block the requester thread.  This may be an unnecessary lock optimization,
> > because the requester can already read pending_job safely even without the
> > lock, because the requester is currently the only one who can assign a
> > task.
> 
> What about the coroutine yield at qio_channel_writev_full_all()? Is it
> safe to yield while holding a lock? Could the main loop dispatch the
> cleanup function, which calls join on the multifd thread, and it deadlocks?

This should be fine, IMHO, as sender threads are never in a coroutine?
IOW, it should be qemu_in_coroutine()==false always.
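
If we want to make that explicit, a one-liner in the sender thread could
document the assumption (just a thought, not part of this series):

    static void *multifd_send_thread(void *opaque)
    {
        ...
        /* sender threads are plain threads, never run inside a coroutine */
        assert(!qemu_in_coroutine());
        ...
    }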

> 
> >
> > Drop that lock complication on both sides:
> >
> >   (1) in the sender thread, always take the mutex until job done
> >   (2) in the requester thread, check pending_job clear lockless
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  migration/multifd.c | 23 ++++++++++++++++-------
> >  1 file changed, 16 insertions(+), 7 deletions(-)
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 6a4863edd2..4dc5af0a15 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -429,7 +429,9 @@ static int multifd_send_pages(void)
> >          return -1;
> >      }
> >  
> > +    /* We wait here, until at least one channel is ready */
> >      qemu_sem_wait(&multifd_send_state->channels_ready);
> > +
> >      /*
> >       * next_channel can remain from a previous migration that was
> >       * using more channels, so ensure it doesn't overflow if the
> > @@ -441,17 +443,26 @@ static int multifd_send_pages(void)
> >              return -1;
> >          }
> >          p = &multifd_send_state->params[i];
> > -        qemu_mutex_lock(&p->mutex);
> > +        /*
> > +         * Lockless read to p->pending_job is safe, because only multifd
> > +         * sender thread can clear it.
> > +         */
> >          if (!p->pending_job) {
> 
> The worst that could happen is we read at the same time the thread is
> clearing it and we loop to the next channel. So it doesn't need to be
> atomic either.

Yep.  Actually the worst case is when the other N-1 channels are all
busy: then we loop N more times until this pending_job finally becomes
false, but only in case of a race, so it should be fine.

I'll switch to qatomic_read|set() in v2, btw, which I forgot yesterday, in
case some compiler does register-caching tricks here, to avoid a dead loop.
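
I.e. something like this (sketch):

    /* requester side: lockless peek; the sender thread is the only writer */
    if (!qatomic_read(&p->pending_job)) {
        /* channel is idle, pick it */
    }

    /* sender thread side, only once the whole job is done */
    qatomic_set(&p->pending_job, false);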

> 
> > -            p->pending_job = true;
> >              next_channel = (i + 1) % migrate_multifd_channels();
> >              break;
> >          }
> > -        qemu_mutex_unlock(&p->mutex);
> >      }
> > +
> > +    qemu_mutex_lock(&p->mutex);
> 
> What data this lock protects now? Everything below here only happens
> after this thread sees pending_job==false. It seems we would only need a
> barrier on the multifd thread to make sure p->pending_job=false is
> ordered after everything.
> 
> Even for the "sync" case, it appears the lock is not needed as well?

Great question. :)

Let's see whether we can remove the lock.  Since I'll need to run for
today, I'll have a closer look tomorrow.
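
If it turns out the lock can go away, maybe a release/acquire pair would be
enough for the ordering concern (just a first thought, needs that closer
look):

    /* sender thread: order all job fields before clearing the flag */
    qatomic_store_release(&p->pending_job, false);

    /* requester: pairs with the store-release above */
    if (!qatomic_load_acquire(&p->pending_job)) {
        /* safe to hand the channel a new job */
    }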

Hopefully we can still keep this patch untouched? The original goal of this
patch was only to stop the sender thread from releasing the lock one extra
time.  The current change should already achieve that.

> 
> We might need to remove p->running first and move the kick from
> multifd_send_terminate_threads() into multifd_save_cleanup() like I
> suggested, but it seems like we could remove this lock.
> 
> Which would make sense, because there's nothing another thread would
> want to do with a channel's MultiFDSendParams unless the channel is idle
> waiting for work.
> 
> > +    /*
> > +     * Double check on pending_job==false with the lock.  In the future if
> > +     * we can have >1 requester thread, we can replace this with a "goto
> > +     * retry", but that is for later.
> > +     */
> > +    assert(p->pending_job == false);
> > +    p->pending_job = true;
> >      assert(!p->pages->num);
> >      assert(!p->pages->block);
> > -
> >      p->packet_num = multifd_send_state->packet_num++;
> 
> I noticed this line cannot be here. If the channel thread takes long to
> wake up, the "sync" code will increment once more and overwrite this
> field. This and the identical line at multifd_send_sync_main() should go
> into multifd_send_fill_packet() I think.

Another good one.

This is similarly relevant to my effort to split pending_job into two:
these two things (job/sync) are potentially racy when using *p.

Moving it into the threads will require an atomic op, but I'll do it,
because otherwise it's racy, as you correctly pointed out.
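
E.g. (sketch; assuming the assignment moves into packet filling, done by
the sender thread):

    /* in multifd_send_fill_packet(), run by the sender thread */
    p->packet_num = qatomic_fetch_inc(&multifd_send_state->packet_num);
    packet->packet_num = cpu_to_be64(p->packet_num);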

Another work for me tomorrow; I'll prepare something.

> 
> >      multifd_send_state->pages = p->pages;
> >      p->pages = pages;
> > @@ -704,8 +715,6 @@ static void *multifd_send_thread(void *opaque)
> >              multifd_send_fill_packet(p);
> >              p->num_packets++;
> >              p->total_normal_pages += pages->num;
> > -            qemu_mutex_unlock(&p->mutex);
> > -
> >              trace_multifd_send(p->id, packet_num, pages->num, p->flags,
> >                                 p->next_packet_size);
> >  
> > @@ -725,6 +734,7 @@ static void *multifd_send_thread(void *opaque)
> >              ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
> >                                                0, p->write_flags, &local_err);
> >              if (ret != 0) {
> > +                qemu_mutex_unlock(&p->mutex);
> >                  break;
> >              }
> >  
> > @@ -733,7 +743,6 @@ static void *multifd_send_thread(void *opaque)
> >  
> >              multifd_pages_reset(p->pages);
> >              p->next_packet_size = 0;
> > -            qemu_mutex_lock(&p->mutex);
> >              p->pending_job = false;
> >              qemu_mutex_unlock(&p->mutex);
> >          } else if (p->pending_sync) {
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups
  2024-02-01  5:47   ` Peter Xu
@ 2024-02-01 12:51     ` Avihai Horon
  2024-02-01 21:46       ` Fabiano Rosas
  0 siblings, 1 reply; 45+ messages in thread
From: Avihai Horon @ 2024-02-01 12:51 UTC (permalink / raw)
  To: Peter Xu, Fabiano Rosas
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Hao Xiang


On 01/02/2024 7:47, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Wed, Jan 31, 2024 at 07:49:51PM -0300, Fabiano Rosas wrote:
>> peterx@redhat.com writes:
>>
>>> From: Peter Xu <peterx@redhat.com>
>>>
>>> This patchset contains quite a few refactorings to current multifd:
>>>
>>>    - It picked up some patches from an old series of mine [0] (the last
>>>      patches were dropped, though; I did the cleanup slightly differently):
>>>
>>>      I still managed to include one patch to split pending_job, but I
>>>      rewrote the patch here.
>>>
>>>    - It tries to cleanup multiple multifd paths here and there, the ultimate
>>>      goal is to redefine send_prepare() to be something like:
>>>
>>>        p->pages ----------->  send_prepare() -------------> IOVs
>>>
>>>      So that there's no obvious change yet on multifd_ops besides redefined
>>>      interface for send_prepare().  We may want a separate OPs for file
>>>      later.
>>>
>>> For 2), one benefit is already presented by Fabiano in his other series [1]
>>> on cleaning up zero copy, but this patchset addressed it quite differently,
>>> and hopefully also more gradually.  The other benefit is for sure if we
>>> have a more concrete API for send_prepare() and if we can reach an initial
>>> consensus, then we can have the recent compression accelerators rebased on
>>> top of this one.
>>>
>>> This also prepares for the case where the input can be extended to even not
>>> any p->pages, but arbitrary data (like VFIO's potential use case in the
>>> future?).  But that will also for later even if reasonable.
>>>
>>> Please have a look.  Thanks,
>>>
>>> [0] https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com
>>> [1] https://lore.kernel.org/qemu-devel/20240126221943.26628-1-farosas@suse.de
>>>
>>> Peter Xu (14):
>>>    migration/multifd: Drop stale comment for multifd zero copy
>>>    migration/multifd: multifd_send_kick_main()
>>>    migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
>>>    migration/multifd: Postpone reset of MultiFDPages_t
>>>    migration/multifd: Drop MultiFDSendParams.normal[] array
>>>    migration/multifd: Separate SYNC request with normal jobs
>>>    migration/multifd: Simplify locking in sender thread
>>>    migration/multifd: Drop pages->num check in sender thread
>>>    migration/multifd: Rename p->num_packets and clean it up
>>>    migration/multifd: Move total_normal_pages accounting
>>>    migration/multifd: Move trace_multifd_send|recv()
>>>    migration/multifd: multifd_send_prepare_header()
>>>    migration/multifd: Move header prepare/fill into send_prepare()
>>>    migration/multifd: Forbid spurious wakeups
>>>
>>>   migration/multifd.h      |  34 +++--
>>>   migration/multifd-zlib.c |  11 +-
>>>   migration/multifd-zstd.c |  11 +-
>>>   migration/multifd.c      | 291 +++++++++++++++++++--------------------
>>>   4 files changed, 182 insertions(+), 165 deletions(-)
>> This series didn't survive my 9999 iterations test on the opensuse
>> machine.
>>
>> # Running /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
>> ...
>> kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>>
>>
>> #0  0x00005575dda06399 in qemu_mutex_lock_impl (mutex=0x18, file=0x5575ddce9cc3 "../util/qemu-thread-posix.c", line=275) at ../util/qemu-thread-posix.c:92
>> #1  0x00005575dda06a94 in qemu_sem_post (sem=0x18) at ../util/qemu-thread-posix.c:275
>> #2  0x00005575dd56a512 in multifd_send_thread (opaque=0x5575df054ef8) at ../migration/multifd.c:720
>> #3  0x00005575dda0709b in qemu_thread_start (args=0x7fd404001d50) at ../util/qemu-thread-posix.c:541
>> #4  0x00007fd45e8a26ea in start_thread (arg=0x7fd3faffd700) at pthread_create.c:477
>> #5  0x00007fd45cd2150f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>
>> The multifd thread is posting channels_ready with an already freed
>> multifd_send_state.
>>
>> This is the bug Avihai has hit. We're going into multifd_save_cleanup()
>> so early that multifd_new_send_channel_async() hasn't even had the
>> chance to set p->running. So it misses the join and frees everything up
>> while a second multifd thread is just starting.
> Thanks for doing that.
>
> Would this series make that bug easier to happen?

I think so.
Patch #3 added an extra multifd_send_should_exit() check in
multifd_send_sync_main(), so now it can exit early if the first channel
fails.
Plus, now the migration state is set to FAILED early by:
multifd_new_send_channel_async()->multifd_send_terminate_threads() and
multifd_tls_outgoing_handshake()->multifd_send_terminate_threads()
so migration_iteration_run() is completely skipped, because the
migration_is_active() check before it will return false.

I *think* this is what makes the main migration thread finish earlier and
call multifd_save_cleanup() earlier, at least for me.

> I didn't do a lot of
> testing on it; it only survived the smoke test and the kicked CI job.  I think
> we can still decide to fix that issue separately; but if this series makes
> that easier to happen then that's definitely bad...
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/14] migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
  2024-02-01  9:28     ` Peter Xu
@ 2024-02-01 13:30       ` Fabiano Rosas
  2024-02-02  0:21         ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2024-02-01 13:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

Peter Xu <peterx@redhat.com> writes:

> On Wed, Jan 31, 2024 at 12:05:08PM -0300, Fabiano Rosas wrote:
>> peterx@redhat.com writes:
>> 
>> > From: Peter Xu <peterx@redhat.com>
>> >
>> > Multifd send side has two fields to indicate error quits:
>> >
>> >   - MultiFDSendParams.quit
>> >   - &multifd_send_state->exiting
>> >
>> > Merge them into the global one.  The replacement is done by changing all
>> > p->quit checks into the global var check.  The global check doesn't need
>> > any lock.
>> >
>> > A few more things done on top of this altogether:
>> >
>> >   - multifd_send_terminate_threads()
>> >
>> >     Moving the xchg() of &multifd_send_state->exiting upper, so as to cover
>> >     the tracepoint, migrate_set_error() and migrate_set_state().
>> 
>> Good.
>> 
>> >
>> >   - multifd_send_sync_main()
>> >
>> >     In the 2nd loop, add one more check over the global var to make sure we
>> >     don't keep the looping if QEMU already decided to quit.
>> 
>> Yes, also because we don't necessarily enter at multifd_send_page()
>> every time.
>> 
>> >
>> >   - multifd_tls_outgoing_handshake()
>> >
>> >     Use multifd_send_terminate_threads() to set the error state.  That has
>> >     a benefit of updating MigrationState.error to that error too, so we can
>> >     persist that 1st error we hit in that specific channel.
>> 
>> Makes sense.
>> 
>> >
>> >   - multifd_new_send_channel_async()
>> >
>> >     Take similar approach like above, drop the migrate_set_error() because
>> >     multifd_send_terminate_threads() already covers that.  Unwrap the helper
>> >     multifd_new_send_channel_cleanup() along the way; not really needed.
>> >
>> > Signed-off-by: Peter Xu <peterx@redhat.com>
>> > ---
>> >  migration/multifd.h |  2 --
>> >  migration/multifd.c | 85 ++++++++++++++++++---------------------------
>> >  2 files changed, 33 insertions(+), 54 deletions(-)
>> >
>> > diff --git a/migration/multifd.h b/migration/multifd.h
>> > index 35d11f103c..7c040cb85a 100644
>> > --- a/migration/multifd.h
>> > +++ b/migration/multifd.h
>> > @@ -95,8 +95,6 @@ typedef struct {
>> >      QemuMutex mutex;
>> >      /* is this channel thread running */
>> >      bool running;
>> > -    /* should this thread finish */
>> > -    bool quit;
>> >      /* multifd flags for each packet */
>> >      uint32_t flags;
>> >      /* global number of generated multifd packets */
>> > diff --git a/migration/multifd.c b/migration/multifd.c
>> > index b8d2c96533..2c98023d67 100644
>> > --- a/migration/multifd.c
>> > +++ b/migration/multifd.c
>> > @@ -372,6 +372,11 @@ struct {
>> >      MultiFDMethods *ops;
>> >  } *multifd_send_state;
>> >  
>> > +static bool multifd_send_should_exit(void)
>> > +{
>> > +    return qatomic_read(&multifd_send_state->exiting);
>> > +}
>> > +
>> >  /*
>> >   * The migration thread can wait on either of the two semaphores.  This
>> >   * function can be used to kick the main thread out of waiting on either of
>> > @@ -409,7 +414,7 @@ static int multifd_send_pages(void)
>> >      MultiFDSendParams *p = NULL; /* make happy gcc */
>> >      MultiFDPages_t *pages = multifd_send_state->pages;
>> >  
>> > -    if (qatomic_read(&multifd_send_state->exiting)) {
>> > +    if (multifd_send_should_exit()) {
>> >          return -1;
>> >      }
>> >  
>> > @@ -421,14 +426,11 @@ static int multifd_send_pages(void)
>> >       */
>> >      next_channel %= migrate_multifd_channels();
>> >      for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
>> > -        p = &multifd_send_state->params[i];
>> > -
>> > -        qemu_mutex_lock(&p->mutex);
>> > -        if (p->quit) {
>> > -            error_report("%s: channel %d has already quit!", __func__, i);
>> > -            qemu_mutex_unlock(&p->mutex);
>> > +        if (multifd_send_should_exit()) {
>> >              return -1;
>> >          }
>> > +        p = &multifd_send_state->params[i];
>> > +        qemu_mutex_lock(&p->mutex);
>> >          if (!p->pending_job) {
>> >              p->pending_job++;
>> >              next_channel = (i + 1) % migrate_multifd_channels();
>> 
>> Hm, I'm not sure it's correct to check 'exiting' outside of the
>> lock. While it is an atomic operation, it is not atomic in relation to
>> pending_job...
>> 
>> ... looking closer, it seems that we can do what you suggest because
>> p->pending_job is not touched by the multifd_send_thread in case of
>> error, which means this function will indeed miss the 'exiting' flag,
>> but pending_job > 0 means it will loop to the next channel and _then_ it
>> will see the 'exiting' flag.
>
> It could still be the last channel we iterate over; then IIUC we can still
> try to assign a job to a thread even if a concurrent error is set there.
>
> However IMHO it's okay: the error in the sender thread should ultimately
> set migrate_set_error(), the main thread should detect that in the
> migration loop, and then we'll still quit.  The extra queued job shouldn't
> matter.
>
>> 
>> > @@ -483,6 +485,16 @@ static void multifd_send_terminate_threads(Error *err)
>> >  {
>> >      int i;
>> >  
>> > +    /*
>> > +     * We don't want to exit each threads twice.  Depending on where
>> > +     * we get the error, or if there are two independent errors in two
>> > +     * threads at the same time, we can end calling this function
>> > +     * twice.
>> > +     */
>> > +    if (qatomic_xchg(&multifd_send_state->exiting, 1)) {
>> > +        return;
>> > +    }
>> > +
>> >      trace_multifd_send_terminate_threads(err != NULL);
>> >  
>> >      if (err) {
>> > @@ -497,26 +509,13 @@ static void multifd_send_terminate_threads(Error *err)
>> >          }
>> >      }
>> >  
>> > -    /*
>> > -     * We don't want to exit each threads twice.  Depending on where
>> > -     * we get the error, or if there are two independent errors in two
>> > -     * threads at the same time, we can end calling this function
>> > -     * twice.
>> > -     */
>> > -    if (qatomic_xchg(&multifd_send_state->exiting, 1)) {
>> > -        return;
>> > -    }
>> > -
>> >      for (i = 0; i < migrate_multifd_channels(); i++) {
>> >          MultiFDSendParams *p = &multifd_send_state->params[i];
>> >  
>> > -        qemu_mutex_lock(&p->mutex);
>> > -        p->quit = true;
>> 
>> Now that you removed this, we decoupled kicking the threads from setting
>> the exit/error, so this function could be split in two.
>> 
>> We could set the exiting flag at the places the error occurred (multifd
>> threads, thread creation, etc) and "terminate the threads" at
>> multifd_save_cleanup(). That second part we already do actually:
>> 
>> void multifd_save_cleanup(void) {
>> ...
>>     multifd_send_terminate_threads(NULL);
>>                                    ^see?
>>     for (i = 0; i < migrate_multifd_channels(); i++) {
>>         MultiFDSendParams *p = &multifd_send_state->params[i];
>> 
>>         if (p->running) {
>>             qemu_thread_join(&p->thread);
>>         }
>>     }
>> ...
>> }
>> 
>> I think there's no reason anymore for the channels to kick each
>> other. They would all be waiting at p->sem and multifd_send_cleanup()
>> would kick + join them.
>
> Sounds good here.
>
> I'll attach one patch like this, feel free to have an early look:
>
> =====
>
> From f9a3d63d5cca0068daaea4c72392803f4b29dcb5 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Thu, 1 Feb 2024 17:01:54 +0800
> Subject: [PATCH] migration/multifd: Split multifd_send_terminate_threads()
>
> Split multifd_send_terminate_threads() into two functions:
>
>   - multifd_send_set_error(): used when an error happened on the sender
>     side, set error and quit state only
>
>   - multifd_send_terminate_threads(): used only by the main thread to kick
>     all multifd send threads out of sleep, for the last recycling.
>
> Use multifd_send_set_error() in the three old call sites where only the
> error will be set.
>
> Use multifd_send_terminate_threads() in the last one where the main thread
> will kick the multifd threads at last in multifd_save_cleanup().
>
> Both helpers will need to set exiting=1.
>
> Suggested-by: Fabiano Rosas <farosas@suse.de>
> Signed-off-by: Peter Xu <peterx@redhat.com>

New patch looks good.

> ---
>  migration/multifd.c    | 27 ++++++++++++++++++---------
>  migration/trace-events |  2 +-
>  2 files changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index c71e74b101..95dc29c8c7 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -536,10 +536,9 @@ int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
>      return 1;
>  }
>  
> -static void multifd_send_terminate_threads(Error *err)
> +/* Multifd send side hit an error; remember it and prepare to quit */
> +static void multifd_send_set_error(Error *err)
>  {
> -    int i;
> -
>      /*
>       * We don't want to exit each threads twice.  Depending on where
>       * we get the error, or if there are two independent errors in two
> @@ -550,8 +549,6 @@ static void multifd_send_terminate_threads(Error *err)
>          return;
>      }
>  
> -    trace_multifd_send_terminate_threads(err != NULL);
> -
>      if (err) {
>          MigrationState *s = migrate_get_current();
>          migrate_set_error(s, err);
> @@ -563,7 +560,19 @@ static void multifd_send_terminate_threads(Error *err)
>                                MIGRATION_STATUS_FAILED);
>          }
>      }
> +}
> +
> +static void multifd_send_terminate_threads(void)
> +{
> +    int i;
> +
> +    trace_multifd_send_terminate_threads();
>  
> +    /*
> +     * Tell everyone we're quitting.  No xchg() needed here; we simply
> +     * always set it.
> +     */
> +    qatomic_set(&multifd_send_state->exiting, 1);
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
> @@ -586,7 +595,7 @@ void multifd_save_cleanup(void)
>      if (!migrate_multifd()) {
>          return;
>      }
> -    multifd_send_terminate_threads(NULL);
> +    multifd_send_terminate_threads();
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];

We could then move the qemu_thread_join loop into
multifd_send_terminate_threads().

(and fix all the bugs we have so that we only progress past
multifd_send_terminate_threads() once all threads have exited and no
more threads are going to spawn)
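
Something along these lines, roughly (sketch only, still ignoring the
p->running/thread-spawn races mentioned above):

    static void multifd_send_terminate_threads(void)
    {
        int i;

        qatomic_set(&multifd_send_state->exiting, 1);

        for (i = 0; i < migrate_multifd_channels(); i++) {
            MultiFDSendParams *p = &multifd_send_state->params[i];

            qemu_sem_post(&p->sem);
            if (p->running) {
                qemu_thread_join(&p->thread);
            }
        }
    }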

>  
> @@ -778,7 +787,7 @@ out:
>      if (ret) {
>          assert(local_err);
>          trace_multifd_send_error(p->id);
> -        multifd_send_terminate_threads(local_err);
> +        multifd_send_set_error(local_err);
>          multifd_send_kick_main(p);
>          error_free(local_err);
>      }
> @@ -814,7 +823,7 @@ static void multifd_tls_outgoing_handshake(QIOTask *task,
>  
>      trace_multifd_tls_outgoing_handshake_error(ioc, error_get_pretty(err));
>  
> -    multifd_send_terminate_threads(err);
> +    multifd_send_set_error(err);
>      multifd_send_kick_main(p);
>      error_free(err);
>  }
> @@ -896,7 +905,7 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
>      }
>  
>      trace_multifd_new_send_channel_async_error(p->id, local_err);
> -    multifd_send_terminate_threads(local_err);
> +    multifd_send_set_error(local_err);
>      multifd_send_kick_main(p);
>      object_unref(OBJECT(ioc));
>      error_free(local_err);
> diff --git a/migration/trace-events b/migration/trace-events
> index de4a743c8a..298ad2b0dd 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -141,7 +141,7 @@ multifd_send_error(uint8_t id) "channel %u"
>  multifd_send_sync_main(long packet_num) "packet num %ld"
>  multifd_send_sync_main_signal(uint8_t id) "channel %u"
>  multifd_send_sync_main_wait(uint8_t id) "channel %u"
> -multifd_send_terminate_threads(bool error) "error %d"
> +multifd_send_terminate_threads(void) ""
>  multifd_send_thread_end(uint8_t id, uint64_t packets, uint64_t normal_pages) "channel %u packets %" PRIu64 " normal pages %"  PRIu64
>  multifd_send_thread_start(uint8_t id) "%u"
>  multifd_tls_outgoing_handshake_start(void *ioc, void *tioc, const char *hostname) "ioc=%p tioc=%p hostname=%s"
> -- 
> 2.43.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] migration/multifd: Postpone reset of MultiFDPages_t
  2024-02-01 10:01     ` Peter Xu
@ 2024-02-01 15:21       ` Fabiano Rosas
  2024-02-02  0:28         ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2024-02-01 15:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

Peter Xu <peterx@redhat.com> writes:

> On Wed, Jan 31, 2024 at 12:27:51PM -0300, Fabiano Rosas wrote:
>> > +/* Reset a MultiFDPages_t* object for the next use */
>> > +static void multifd_pages_reset(MultiFDPages_t *pages)
>> > +{
>> > +    /*
>> > +     * We don't need to touch offset[] array, because it will be
>> > +     * overwritten later when reused.
>> > +     */
>> > +    pages->num = 0;
>> > +    pages->block = NULL;
>> 
>> Having to do this at all is a huge overloading of this field. This not
>> only resets it, but it also communicates to multifd_queue_page() that
>> the previous payload has been sent. Otherwise, multifd_queue_page()
>> wouldn't know whether the previous call to multifd_queue_page() has
>> called multifd_send_pages() or if it has exited early. So this basically
>> means "the block that was previously here has been sent".
>> 
>> That's why we need the changed=true logic. A
>> multifd_send_state->pages->block still has a few pages left to send, but
>> because it's less than pages->allocated, it skips
>> multifd_send_pages(). The next call to multifd_queue_page() already has
>> the next ramblock. So we set changed=true, call multifd_send_pages() to
>> send the remaining pages of that block and recurse into
>> multifd_queue_page() once more to send the new block.
>
> I agree, the queue page routines are not easy to follow as well.
>
> How do you like a rewrite of the queue logic, like this?
>
> =====
> bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
> {
>     MultiFDPages_t *pages;
>
> retry:
>     pages = multifd_send_state->pages;
>
>     /* If the queue is empty, we can already enqueue now */
>     if (multifd_queue_empty(pages)) {
>         pages->block = block;
>         multifd_enqueue(pages, offset);
>         return true;
>     }
>
>     /*
> >      * Not empty, meanwhile we need a flush.  It can be because of either:
>      *
>      * (1) The page is not on the same ramblock of previous ones, or,
>      * (2) The queue is full.
>      *
>      * After flush, always retry.
>      */
>     if (pages->block != block || multifd_queue_full(pages)) {
>         if (!multifd_send_pages()) {
>             return false;
>         }
>         goto retry;
>     }
>
>     /* Not empty, and we still have space, do it! */
>     multifd_enqueue(pages, offset);
>     return true;
> }
> =====
>
> Would this be clearer?  With above, we can drop the ->ramblock reset,
> afaict.
>
> I attached three patches if you agree it's better, then I'll include them
> in v2.

Yes, let's do it.

>
> -- 
> Peter Xu
> From c5dc2052794efd6da6a1e6f4b49f25d5b32879f7 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Thu, 1 Feb 2024 17:50:21 +0800
> Subject: [PATCH 1/3] migration/multifd: Change retval of multifd_queue_page()
>
> Using an int is overkill when there are only two options.  Change it to a
> boolean.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/multifd.h | 2 +-
>  migration/multifd.c | 9 +++++----
>  migration/ram.c     | 2 +-
>  3 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 34a2ecb9f4..a320c53a6f 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -22,7 +22,7 @@ bool multifd_recv_all_channels_created(void);
>  void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
>  void multifd_recv_sync_main(void);
>  int multifd_send_sync_main(void);
> -int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
> +bool multifd_queue_page(RAMBlock *block, ram_addr_t offset);
>  
>  /* Multifd Compression flags */
>  #define MULTIFD_FLAG_SYNC (1 << 0)
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 91be6d2fc4..d0a3b4e062 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -505,7 +505,8 @@ static int multifd_send_pages(void)
>      return 1;
>  }
>  
> -int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
> +/* Returns true if enqueue successful, false otherwise */
> +bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
>  {
>      MultiFDPages_t *pages = multifd_send_state->pages;
>      bool changed = false;
> @@ -519,21 +520,21 @@ int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
>          pages->num++;
>  
>          if (pages->num < pages->allocated) {
> -            return 1;
> +            return true;
>          }
>      } else {
>          changed = true;
>      }
>  
>      if (multifd_send_pages() < 0) {
> -        return -1;
> +        return false;
>      }
>  
>      if (changed) {
>          return multifd_queue_page(block, offset);
>      }
>  
> -    return 1;
> +    return true;
>  }
>  
>  /* Multifd send side hit an error; remember it and prepare to quit */
> diff --git a/migration/ram.c b/migration/ram.c
> index d5b7cd5ac2..4649a81204 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1252,7 +1252,7 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss)
>  
>  static int ram_save_multifd_page(RAMBlock *block, ram_addr_t offset)
>  {
> -    if (multifd_queue_page(block, offset) < 0) {
> +    if (!multifd_queue_page(block, offset)) {
>          return -1;
>      }
>      stat64_add(&mig_stats.normal_pages, 1);
> -- 
> 2.43.0
>
> From f393f1cfe95d79bed72e6043903ee4c4cb298c21 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Thu, 1 Feb 2024 17:51:38 +0800
> Subject: [PATCH 2/3] migration/multifd: Change retval of multifd_send_pages()
>
> Using an int is overkill when there are only two options.  Change it to a
> boolean.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/multifd.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index d0a3b4e062..d2b0f0eda9 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -449,9 +449,10 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
>   * thread is using the channel mutex when changing it, and the channel
>   * have to had finish with its own, otherwise pending_job can't be
>   * false.
> + *
> + * Returns true if it succeeds, false otherwise.
>   */
> -
> -static int multifd_send_pages(void)
> +static bool multifd_send_pages(void)
>  {
>      int i;
>      static int next_channel;
> @@ -459,7 +460,7 @@ static int multifd_send_pages(void)
>      MultiFDPages_t *pages = multifd_send_state->pages;
>  
>      if (multifd_send_should_exit()) {
> -        return -1;
> +        return false;
>      }
>  
>      /* We wait here, until at least one channel is ready */
> @@ -473,7 +474,7 @@ static int multifd_send_pages(void)
>      next_channel %= migrate_multifd_channels();
>      for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
>          if (multifd_send_should_exit()) {
> -            return -1;
> +            return false;
>          }
>          p = &multifd_send_state->params[i];
>          /*
> @@ -502,7 +503,7 @@ static int multifd_send_pages(void)
>      qemu_mutex_unlock(&p->mutex);
>      qemu_sem_post(&p->sem);
>  
> -    return 1;
> +    return true;
>  }
>  
>  /* Returns true if enqueue successful, false otherwise */
> @@ -526,7 +527,7 @@ bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
>          changed = true;
>      }
>  
> -    if (multifd_send_pages() < 0) {
> +    if (!multifd_send_pages()) {
>          return false;
>      }
>  
> @@ -666,7 +667,7 @@ int multifd_send_sync_main(void)
>          return 0;
>      }
>      if (multifd_send_state->pages->num) {
> -        if (multifd_send_pages() < 0) {
> +        if (!multifd_send_pages()) {
>              error_report("%s: multifd_send_pages fail", __func__);
>              return -1;
>          }
> -- 
> 2.43.0
>
> From fcddc942cb31bc9d395d67a555d9a2281da452b1 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Thu, 1 Feb 2024 17:55:42 +0800
> Subject: [PATCH 3/3] migration/multifd: Rewrite multifd_queue_page()
>
> The current multifd_queue_page() is not easy to read and follow.  It is not
> good for a few reasons:
>
>   - No helper at all to show what exactly a condition means; in short,
>   readability is low.
>
>   - Relies on pages->ramblock being cleared to detect an empty queue.  It's
>   slightly an overload of the ramblock pointer, per Fabiano [1], which I
>   also agree with.
>
>   - Contains a self-recursion, even if not necessary.
>
> Rewrite this function.  We add some comments to make it even clearer what
> it does.
>
> [1] https://lore.kernel.org/r/87wmrpjzew.fsf@suse.de
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/multifd.c | 56 ++++++++++++++++++++++++++++++---------------
>  1 file changed, 37 insertions(+), 19 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index d2b0f0eda9..5a64a9c2e2 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -506,35 +506,53 @@ static bool multifd_send_pages(void)
>      return true;
>  }
>  
> +static inline bool multifd_queue_empty(MultiFDPages_t *pages)
> +{
> +    return pages->num == 0;
> +}

Good, because we can later switch from pages to something else entirely.

> +
> +static inline bool multifd_queue_full(MultiFDPages_t *pages)
> +{
> +    return pages->num == pages->allocated;
> +}

Pages allocated is nonsense. See if you agree with its removal:
https://gitlab.com/farosas/qemu/-/commit/7cfff1a3e31b271e901a6c08d8b5d8c01b680e4d

---
From 7cfff1a3e31b271e901a6c08d8b5d8c01b680e4d Mon Sep 17 00:00:00 2001
From: Fabiano Rosas <farosas@suse.de>
Date: Tue, 24 Oct 2023 19:03:41 -0300
Subject: [PATCH] multifd: Remove MultiFDPage_t:allocated

When dealing with RAM, having a field called 'allocated' is
confusing. This field simply holds number of pages that fit in a
multifd packet.

Since it is a constant dependent on the size of the multifd packet,
remove it and instead use the page size and MULTIFD_PACKET_SIZE
directly.

This is another step in the direction of having no mentions of 'page'
in the multifd send thread.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/multifd.c | 6 ++----
 migration/multifd.h | 2 --
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index bdefce27706..83fb2caab04 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -241,7 +241,6 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
 {
     MultiFDPages_t *pages = g_new0(MultiFDPages_t, 1);
 
-    pages->allocated = n;
     pages->offset = g_new0(ram_addr_t, n);
     pages->page_size = qemu_target_page_size();
 
@@ -251,7 +250,6 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
 static void multifd_pages_clear(MultiFDPages_t *pages)
 {
     pages->num = 0;
-    pages->allocated = 0;
     pages->block = NULL;
     g_free(pages->offset);
     pages->offset = NULL;
@@ -264,7 +262,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
     int i;
 
     packet->flags = cpu_to_be32(p->flags);
-    packet->pages_alloc = cpu_to_be32(p->pages->allocated);
+    packet->pages_alloc = cpu_to_be32(MULTIFD_PACKET_SIZE / p->pages->page_size);
     packet->normal_pages = cpu_to_be32(p->pages->num);
     packet->next_packet_size = cpu_to_be32(p->next_packet_size);
     packet->packet_num = cpu_to_be64(p->packet_num);
@@ -451,7 +449,7 @@ int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
         pages->offset[pages->num] = offset;
         pages->num++;
 
-        if (pages->num < pages->allocated) {
+        if (pages->num * pages->page_size < MULTIFD_PACKET_SIZE) {
             return 1;
         }
     } else {
diff --git a/migration/multifd.h b/migration/multifd.h
index 655f8d5eeb4..d1342296d63 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -56,8 +56,6 @@ typedef struct {
 typedef struct {
     /* number of used pages */
     uint32_t num;
-    /* number of allocated pages */
-    uint32_t allocated;
     /* guest page size */
     uint32_t page_size;
     /* offset of each page */
-- 

> +
> +static inline void multifd_enqueue(MultiFDPages_t *pages, ram_addr_t offset)
> +{
> +    pages->offset[pages->num++] = offset;
> +}
> +
>  /* Returns true if enqueue successful, false otherwise */
>  bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
>  {
> -    MultiFDPages_t *pages = multifd_send_state->pages;
> -    bool changed = false;
> +    MultiFDPages_t *pages;
> +
> +retry:
> +    pages = multifd_send_state->pages;
>  
> -    if (!pages->block) {
> +    /* If the queue is empty, we can already enqueue now */
> +    if (multifd_queue_empty(pages)) {
>          pages->block = block;
> +        multifd_enqueue(pages, offset);
> +        return true;
>      }
>  
> -    if (pages->block == block) {
> -        pages->offset[pages->num] = offset;
> -        pages->num++;
> -
> -        if (pages->num < pages->allocated) {
> -            return true;
> +    /*
> +     * Not empty, meanwhile we need a flush.  It can because of either:
> +     *
> +     * (1) The page is not on the same ramblock of previous ones, or,
> +     * (2) The queue is full.
> +     *
> +     * After flush, always retry.
> +     */
> +    if (pages->block != block || multifd_queue_full(pages)) {
> +        if (!multifd_send_pages()) {
> +            return false;
>          }
> -    } else {
> -        changed = true;
> -    }
> -
> -    if (!multifd_send_pages()) {
> -        return false;
> -    }
> -
> -    if (changed) {
> -        return multifd_queue_page(block, offset);
> +        goto retry;
>      }
>  
> +    /* Not empty, and we still have space, do it! */
> +    multifd_enqueue(pages, offset);
>      return true;
>  }


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups
  2024-02-01 12:51     ` Avihai Horon
@ 2024-02-01 21:46       ` Fabiano Rosas
  2024-02-02  2:12         ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Fabiano Rosas @ 2024-02-01 21:46 UTC (permalink / raw)
  To: Avihai Horon, Peter Xu
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Hao Xiang

Avihai Horon <avihaih@nvidia.com> writes:

> On 01/02/2024 7:47, Peter Xu wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On Wed, Jan 31, 2024 at 07:49:51PM -0300, Fabiano Rosas wrote:
>>> peterx@redhat.com writes:
>>>
>>>> From: Peter Xu <peterx@redhat.com>
>>>>
>>>> This patchset contains quite a few refactorings to current multifd:
>>>>
>>>>    - It picked up some patches from an old series of mine [0] (the last
>>>>      patches were dropped, though; I did the cleanup slightly differently):
>>>>
>>>>      I still managed to include one patch to split pending_job, but I
>>>>      rewrote the patch here.
>>>>
>>>>    - It tries to cleanup multiple multifd paths here and there, the ultimate
>>>>      goal is to redefine send_prepare() to be something like:
>>>>
>>>>        p->pages ----------->  send_prepare() -------------> IOVs
>>>>
>>>>      So that there's no obvious change yet on multifd_ops besides redefined
>>>>      interface for send_prepare().  We may want a separate OPs for file
>>>>      later.
>>>>
>>>> For 2), one benefit is already presented by Fabiano in his other series [1]
>>>> on cleaning up zero copy, but this patchset addressed it quite differently,
>>>> and hopefully also more gradually.  The other benefit is for sure if we
>>>> have a more concrete API for send_prepare() and if we can reach an initial
>>>> consensus, then we can have the recent compression accelerators rebased on
>>>> top of this one.
>>>>
>>>> This also prepares for the case where the input can be extended to even not
>>>> any p->pages, but arbitrary data (like VFIO's potential use case in the
>>>> future?).  But that will also for later even if reasonable.
>>>>
>>>> Please have a look.  Thanks,
>>>>
>>>> [0] https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com
>>>> [1] https://lore.kernel.org/qemu-devel/20240126221943.26628-1-farosas@suse.de
>>>>
>>>> Peter Xu (14):
>>>>    migration/multifd: Drop stale comment for multifd zero copy
>>>>    migration/multifd: multifd_send_kick_main()
>>>>    migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
>>>>    migration/multifd: Postpone reset of MultiFDPages_t
>>>>    migration/multifd: Drop MultiFDSendParams.normal[] array
>>>>    migration/multifd: Separate SYNC request with normal jobs
>>>>    migration/multifd: Simplify locking in sender thread
>>>>    migration/multifd: Drop pages->num check in sender thread
>>>>    migration/multifd: Rename p->num_packets and clean it up
>>>>    migration/multifd: Move total_normal_pages accounting
>>>>    migration/multifd: Move trace_multifd_send|recv()
>>>>    migration/multifd: multifd_send_prepare_header()
>>>>    migration/multifd: Move header prepare/fill into send_prepare()
>>>>    migration/multifd: Forbid spurious wakeups
>>>>
>>>>   migration/multifd.h      |  34 +++--
>>>>   migration/multifd-zlib.c |  11 +-
>>>>   migration/multifd-zstd.c |  11 +-
>>>>   migration/multifd.c      | 291 +++++++++++++++++++--------------------
>>>>   4 files changed, 182 insertions(+), 165 deletions(-)
>>> This series didn't survive my 9999 iterations test on the opensuse
>>> machine.
>>>
>>> # Running /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
>>> ...
>>> kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
>>>
>>>
>>> #0  0x00005575dda06399 in qemu_mutex_lock_impl (mutex=0x18, file=0x5575ddce9cc3 "../util/qemu-thread-posix.c", line=275) at ../util/qemu-thread-posix.c:92
>>> #1  0x00005575dda06a94 in qemu_sem_post (sem=0x18) at ../util/qemu-thread-posix.c:275
>>> #2  0x00005575dd56a512 in multifd_send_thread (opaque=0x5575df054ef8) at ../migration/multifd.c:720
>>> #3  0x00005575dda0709b in qemu_thread_start (args=0x7fd404001d50) at ../util/qemu-thread-posix.c:541
>>> #4  0x00007fd45e8a26ea in start_thread (arg=0x7fd3faffd700) at pthread_create.c:477
>>> #5  0x00007fd45cd2150f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>>
>>> The multifd thread is posting channels_ready with an already freed
>>> multifd_send_state.
>>>
>>> This is the bug Avihai has hit. We're going into multifd_save_cleanup()
>>> so early that multifd_new_send_channel_async() hasn't even had the
>>> chance to set p->running. So it misses the join and frees everything up
>>> while a second multifd thread is just starting.
>> Thanks for doing that.
>>
>> Would this series makes that bug easier to happen?
>
> I think so.
> Patch #3 added an extra multifd_send_should_exit() check in 
> multifd_send_sync_main(), so now it can exit early if the first channel 
> fails.
> Plus, now migration state is set to FAILED early by:
> multifd_new_send_channel_async()->multifd_send_terminate_threads() and 
> multifd_tls_outgoing_handshake()->multifd_send_terminate_threads()
> so migration_iteration_run() is completely skipped because 
> migration_is_active() check before it will return false.
>
> I *think* this is what makes main migration thread finish earlier and 
> call multifd_save_cleanup() earlier, at least for me.
>

I'm doing some experiments with a global semaphore like channels_ready
instead of a per-channel structure like you suggested. I think we only
need to have a point past which we're assured no more channels will be
created. With that we'd only need one post at
multifd_new_send_channel_async.
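
Roughly what I'm playing with, just as an illustration (the semaphore and
helper names are made up here, this isn't the final code):

/* qemu_sem_init()ed during multifd_save_setup(), like channels_ready */
static QemuSemaphore channels_created;

static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
{
    /* ... existing channel / TLS handshake setup ... */

    /* Exactly one post per channel, whether it succeeded or failed */
    qemu_sem_post(&channels_created);
}

static void multifd_wait_for_channels_created(void)
{
    int i;

    for (i = 0; i < migrate_multifd_channels(); i++) {
        qemu_sem_wait(&channels_created);
    }
    /*
     * Past this point no new sender thread can appear, so it is safe to
     * join the threads and free multifd_send_state.
     */
}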


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/14] migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
  2024-02-01 13:30       ` Fabiano Rosas
@ 2024-02-02  0:21         ` Peter Xu
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2024-02-02  0:21 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

On Thu, Feb 01, 2024 at 10:30:19AM -0300, Fabiano Rosas wrote:
> > @@ -586,7 +595,7 @@ void multifd_save_cleanup(void)
> >      if (!migrate_multifd()) {
> >          return;
> >      }
> > -    multifd_send_terminate_threads(NULL);
> > +    multifd_send_terminate_threads();
> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> 
> We could then move the qemu_thread_join loop into
> multifd_send_terminate_threads().

Sure, I can do that.

While at it, I found that maybe I should clean up more things in this
function and provide small helpers.

I think I'll keep this one alone, and append one more patch to do that.
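
For the qemu_thread_join() move itself, this is roughly the shape I have in
mind (only a sketch, not the actual repost; the error/exit flagging is
omitted):

static void multifd_send_terminate_threads(void)
{
    int i;

    /* ... mark the send side as exiting, as today ... */

    for (i = 0; i < migrate_multifd_channels(); i++) {
        MultiFDSendParams *p = &multifd_send_state->params[i];

        /* Kick the thread in case it is blocked on its semaphore */
        qemu_sem_post(&p->sem);
        if (p->running) {
            qemu_thread_join(&p->thread);
            p->running = false;
        }
    }
}

multifd_save_cleanup() would then only free the per-channel state after
this returns.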

> 
> (and fix all the bugs we have so that we only progress past
> multifd_send_terminate_threads() once all threads have exited and no
> more thread is going to spawn)

I guess this will still take some effort.  I hope we can avoid creating
some of the threads at all, for both the async and the TLS paths.

For now I'll add a TODO for this while doing the cleanup.

I'll repost a new version for the whole set today.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] migration/multifd: Postpone reset of MultiFDPages_t
  2024-02-01 15:21       ` Fabiano Rosas
@ 2024-02-02  0:28         ` Peter Xu
  2024-02-02  0:37           ` Peter Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-02-02  0:28 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

On Thu, Feb 01, 2024 at 12:21:27PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Wed, Jan 31, 2024 at 12:27:51PM -0300, Fabiano Rosas wrote:
> >> > +/* Reset a MultiFDPages_t* object for the next use */
> >> > +static void multifd_pages_reset(MultiFDPages_t *pages)
> >> > +{
> >> > +    /*
> >> > +     * We don't need to touch offset[] array, because it will be
> >> > +     * overwritten later when reused.
> >> > +     */
> >> > +    pages->num = 0;
> >> > +    pages->block = NULL;
> >> 
> >> Having to do this at all is a huge overloading of this field. This not
> >> only resets it, but it also communicates to multifd_queue_page() that
> >> the previous payload has been sent. Otherwise, multifd_queue_page()
> >> wouldn't know whether the previous call to multifd_queue_page() has
> >> called multifd_send_pages() or if it has exited early. So this basically
> >> means "the block that was previously here has been sent".
> >> 
> >> That's why we need the changed=true logic. A
> >> multifd_send_state->pages->block still has a few pages left to send, but
> >> because it's less than pages->allocated, it skips
> >> multifd_send_pages(). The next call to multifd_queue_page() already has
> >> the next ramblock. So we set changed=true, call multifd_send_pages() to
> >> send the remaining pages of that block and recurse into
> >> multifd_queue_page() once more to send the new block.
> >
> > I agree, the queue page routines are not easy to follow as well.
> >
> > How do you like a rewrite of the queue logic, like this?
> >
> > =====
> > bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
> > {
> >     MultiFDPages_t *pages;
> >
> > retry:
> >     pages = multifd_send_state->pages;
> >
> >     /* If the queue is empty, we can already enqueue now */
> >     if (multifd_queue_empty(pages)) {
> >         pages->block = block;
> >         multifd_enqueue(pages, offset);
> >         return true;
> >     }
> >
> >     /*
> >      * Not empty, meanwhile we need a flush.  It can because of either:
> >      *
> >      * (1) The page is not on the same ramblock of previous ones, or,
> >      * (2) The queue is full.
> >      *
> >      * After flush, always retry.
> >      */
> >     if (pages->block != block || multifd_queue_full(pages)) {
> >         if (!multifd_send_pages()) {
> >             return false;
> >         }
> >         goto retry;
> >     }
> >
> >     /* Not empty, and we still have space, do it! */
> >     multifd_enqueue(pages, offset);
> >     return true;
> > }
> > =====
> >
> > Would this be clearer?  With above, we can drop the ->ramblock reset,
> > afaict.
> >
> > I attached three patches if you agree it's better, then I'll include them
> > in v2.
> 
> Yes, let's do it.
> 
> >
> > -- 
> > Peter Xu
> > From c5dc2052794efd6da6a1e6f4b49f25d5b32879f7 Mon Sep 17 00:00:00 2001
> > From: Peter Xu <peterx@redhat.com>
> > Date: Thu, 1 Feb 2024 17:50:21 +0800
> > Subject: [PATCH 1/3] migration/multifd: Change retval of multifd_queue_page()
> >
> > Using int is an overkill when there're only two options.  Change it to a
> > boolean.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  migration/multifd.h | 2 +-
> >  migration/multifd.c | 9 +++++----
> >  migration/ram.c     | 2 +-
> >  3 files changed, 7 insertions(+), 6 deletions(-)
> >
> > diff --git a/migration/multifd.h b/migration/multifd.h
> > index 34a2ecb9f4..a320c53a6f 100644
> > --- a/migration/multifd.h
> > +++ b/migration/multifd.h
> > @@ -22,7 +22,7 @@ bool multifd_recv_all_channels_created(void);
> >  void multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
> >  void multifd_recv_sync_main(void);
> >  int multifd_send_sync_main(void);
> > -int multifd_queue_page(RAMBlock *block, ram_addr_t offset);
> > +bool multifd_queue_page(RAMBlock *block, ram_addr_t offset);
> >  
> >  /* Multifd Compression flags */
> >  #define MULTIFD_FLAG_SYNC (1 << 0)
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index 91be6d2fc4..d0a3b4e062 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -505,7 +505,8 @@ static int multifd_send_pages(void)
> >      return 1;
> >  }
> >  
> > -int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
> > +/* Returns true if enqueue successful, false otherwise */
> > +bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
> >  {
> >      MultiFDPages_t *pages = multifd_send_state->pages;
> >      bool changed = false;
> > @@ -519,21 +520,21 @@ int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
> >          pages->num++;
> >  
> >          if (pages->num < pages->allocated) {
> > -            return 1;
> > +            return true;
> >          }
> >      } else {
> >          changed = true;
> >      }
> >  
> >      if (multifd_send_pages() < 0) {
> > -        return -1;
> > +        return false;
> >      }
> >  
> >      if (changed) {
> >          return multifd_queue_page(block, offset);
> >      }
> >  
> > -    return 1;
> > +    return true;
> >  }
> >  
> >  /* Multifd send side hit an error; remember it and prepare to quit */
> > diff --git a/migration/ram.c b/migration/ram.c
> > index d5b7cd5ac2..4649a81204 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1252,7 +1252,7 @@ static int ram_save_page(RAMState *rs, PageSearchStatus *pss)
> >  
> >  static int ram_save_multifd_page(RAMBlock *block, ram_addr_t offset)
> >  {
> > -    if (multifd_queue_page(block, offset) < 0) {
> > +    if (!multifd_queue_page(block, offset)) {
> >          return -1;
> >      }
> >      stat64_add(&mig_stats.normal_pages, 1);
> > -- 
> > 2.43.0
> >
> > From f393f1cfe95d79bed72e6043903ee4c4cb298c21 Mon Sep 17 00:00:00 2001
> > From: Peter Xu <peterx@redhat.com>
> > Date: Thu, 1 Feb 2024 17:51:38 +0800
> > Subject: [PATCH 2/3] migration/multifd: Change retval of multifd_send_pages()
> >
> > Using int is an overkill when there're only two options.  Change it to a
> > boolean.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  migration/multifd.c | 15 ++++++++-------
> >  1 file changed, 8 insertions(+), 7 deletions(-)
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index d0a3b4e062..d2b0f0eda9 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -449,9 +449,10 @@ static void multifd_send_kick_main(MultiFDSendParams *p)
> >   * thread is using the channel mutex when changing it, and the channel
> >   * have to had finish with its own, otherwise pending_job can't be
> >   * false.
> > + *
> > + * Returns true on success, false otherwise.
> >   */
> > -
> > -static int multifd_send_pages(void)
> > +static bool multifd_send_pages(void)
> >  {
> >      int i;
> >      static int next_channel;
> > @@ -459,7 +460,7 @@ static int multifd_send_pages(void)
> >      MultiFDPages_t *pages = multifd_send_state->pages;
> >  
> >      if (multifd_send_should_exit()) {
> > -        return -1;
> > +        return false;
> >      }
> >  
> >      /* We wait here, until at least one channel is ready */
> > @@ -473,7 +474,7 @@ static int multifd_send_pages(void)
> >      next_channel %= migrate_multifd_channels();
> >      for (i = next_channel;; i = (i + 1) % migrate_multifd_channels()) {
> >          if (multifd_send_should_exit()) {
> > -            return -1;
> > +            return false;
> >          }
> >          p = &multifd_send_state->params[i];
> >          /*
> > @@ -502,7 +503,7 @@ static int multifd_send_pages(void)
> >      qemu_mutex_unlock(&p->mutex);
> >      qemu_sem_post(&p->sem);
> >  
> > -    return 1;
> > +    return true;
> >  }
> >  
> >  /* Returns true if enqueue successful, false otherwise */
> > @@ -526,7 +527,7 @@ bool multifd_queue_page(RAMBlock *block, ram_addr_t offset)
> >          changed = true;
> >      }
> >  
> > -    if (multifd_send_pages() < 0) {
> > +    if (!multifd_send_pages()) {
> >          return false;
> >      }
> >  
> > @@ -666,7 +667,7 @@ int multifd_send_sync_main(void)
> >          return 0;
> >      }
> >      if (multifd_send_state->pages->num) {
> > -        if (multifd_send_pages() < 0) {
> > +        if (!multifd_send_pages()) {
> >              error_report("%s: multifd_send_pages fail", __func__);
> >              return -1;
> >          }
> > -- 
> > 2.43.0
> >
> > From fcddc942cb31bc9d395d67a555d9a2281da452b1 Mon Sep 17 00:00:00 2001
> > From: Peter Xu <peterx@redhat.com>
> > Date: Thu, 1 Feb 2024 17:55:42 +0800
> > Subject: [PATCH 3/3] migration/multifd: Rewrite multifd_queue_page()
> >
> > The current multifd_queue_page() is not easy to read and follow.  It is not
> > good for a few reasons:
> >
> >   - There is no helper to show what exactly a condition means; in short,
> >   readability is low.
> >
> >   - It relies on pages->ramblock being cleared to detect an empty queue.
> >   That slightly overloads the ramblock pointer, per Fabiano [1], with which
> >   I also agree.
> >
> >   - It contains a self recursion, even though that is not necessary.
> >
> > Rewrite this function.  We add some comments to make it even clearer what
> > it does.
> >
> > [1] https://lore.kernel.org/r/87wmrpjzew.fsf@suse.de
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  migration/multifd.c | 56 ++++++++++++++++++++++++++++++---------------
> >  1 file changed, 37 insertions(+), 19 deletions(-)
> >
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index d2b0f0eda9..5a64a9c2e2 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -506,35 +506,53 @@ static bool multifd_send_pages(void)
> >      return true;
> >  }
> >  
> > +static inline bool multifd_queue_empty(MultiFDPages_t *pages)
> > +{
> > +    return pages->num == 0;
> > +}
> 
> Good, because we can later switch from pages to something else entirely.
> 
> > +
> > +static inline bool multifd_queue_full(MultiFDPages_t *pages)
> > +{
> > +    return pages->num == pages->allocated;
> > +}
> 
> Pages allocated is nonsense. See if you agree with its removal:
> https://gitlab.com/farosas/qemu/-/commit/7cfff1a3e31b271e901a6c08d8b5d8c01b680e4d
> 
> ---
> From 7cfff1a3e31b271e901a6c08d8b5d8c01b680e4d Mon Sep 17 00:00:00 2001
> From: Fabiano Rosas <farosas@suse.de>
> Date: Tue, 24 Oct 2023 19:03:41 -0300
> Subject: [PATCH] multifd: Remove MultiFDPage_t:allocated
> 
> When dealing with RAM, having a field called 'allocated' is
> confusing. This field simply holds number of pages that fit in a
> multifd packet.
> 
> Since it is a constant dependent on the size of the multifd packet,
> remove it and instead use the page size and MULTIFD_PACKET_SIZE
> directly.
> 
> This is another step in the direction of having no mentions of 'page'
> in the multifd send thread.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/multifd.c | 6 ++----
>  migration/multifd.h | 2 --
>  2 files changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index bdefce27706..83fb2caab04 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -241,7 +241,6 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
>  {
>      MultiFDPages_t *pages = g_new0(MultiFDPages_t, 1);
>  
> -    pages->allocated = n;
>      pages->offset = g_new0(ram_addr_t, n);
>      pages->page_size = qemu_target_page_size();
>  
> @@ -251,7 +250,6 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
>  static void multifd_pages_clear(MultiFDPages_t *pages)
>  {
>      pages->num = 0;
> -    pages->allocated = 0;
>      pages->block = NULL;
>      g_free(pages->offset);
>      pages->offset = NULL;
> @@ -264,7 +262,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
>      int i;
>  
>      packet->flags = cpu_to_be32(p->flags);
> -    packet->pages_alloc = cpu_to_be32(p->pages->allocated);
> +    packet->pages_alloc = cpu_to_be32(MULTIFD_PACKET_SIZE / p->pages->page_size);
>      packet->normal_pages = cpu_to_be32(p->pages->num);
>      packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>      packet->packet_num = cpu_to_be64(p->packet_num);
> @@ -451,7 +449,7 @@ int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
>          pages->offset[pages->num] = offset;
>          pages->num++;
>  
> -        if (pages->num < pages->allocated) {
> +        if (pages->num * pages->page_size < MULTIFD_PACKET_SIZE) {
>              return 1;
>          }
>      } else {
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 655f8d5eeb4..d1342296d63 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -56,8 +56,6 @@ typedef struct {
>  typedef struct {
>      /* number of used pages */
>      uint32_t num;
> -    /* number of allocated pages */
> -    uint32_t allocated;
>      /* guest page size */
>      uint32_t page_size;
>      /* offset of each page */
> -- 

I agree.

Even if we would like to add a parameter to set up the allocated size (I
remember one of the accelerator series has it), it'll still be a global
variable rather than a per-pages thing.

I can cherry-pick this and post it together; it will need a rebase, but I
can do that.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] migration/multifd: Postpone reset of MultiFDPages_t
  2024-02-02  0:28         ` Peter Xu
@ 2024-02-02  0:37           ` Peter Xu
  2024-02-02 12:15             ` Fabiano Rosas
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Xu @ 2024-02-02  0:37 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

On Fri, Feb 02, 2024 at 08:28:47AM +0800, Peter Xu wrote:
> > Pages allocated is nonsense. See if you agree with its removal:
> > https://gitlab.com/farosas/qemu/-/commit/7cfff1a3e31b271e901a6c08d8b5d8c01b680e4d
> > 
> > ---
> > From 7cfff1a3e31b271e901a6c08d8b5d8c01b680e4d Mon Sep 17 00:00:00 2001
> > From: Fabiano Rosas <farosas@suse.de>
> > Date: Tue, 24 Oct 2023 19:03:41 -0300
> > Subject: [PATCH] multifd: Remove MultiFDPage_t:allocated
> > 
> > When dealing with RAM, having a field called 'allocated' is
> > confusing. This field simply holds number of pages that fit in a
> > multifd packet.
> > 
> > Since it is a constant dependent on the size of the multifd packet,
> > remove it and instead use the page size and MULTIFD_PACKET_SIZE
> > directly.
> > 
> > This is another step in the direction of having no mentions of 'page'
> > in the multifd send thread.
> > 
> > Signed-off-by: Fabiano Rosas <farosas@suse.de>
> > ---
> >  migration/multifd.c | 6 ++----
> >  migration/multifd.h | 2 --
> >  2 files changed, 2 insertions(+), 6 deletions(-)
> > 
> > diff --git a/migration/multifd.c b/migration/multifd.c
> > index bdefce27706..83fb2caab04 100644
> > --- a/migration/multifd.c
> > +++ b/migration/multifd.c
> > @@ -241,7 +241,6 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
> >  {
> >      MultiFDPages_t *pages = g_new0(MultiFDPages_t, 1);
> >  
> > -    pages->allocated = n;
> >      pages->offset = g_new0(ram_addr_t, n);
> >      pages->page_size = qemu_target_page_size();
> >  
> > @@ -251,7 +250,6 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
> >  static void multifd_pages_clear(MultiFDPages_t *pages)
> >  {
> >      pages->num = 0;
> > -    pages->allocated = 0;
> >      pages->block = NULL;
> >      g_free(pages->offset);
> >      pages->offset = NULL;
> > @@ -264,7 +262,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
> >      int i;
> >  
> >      packet->flags = cpu_to_be32(p->flags);
> > -    packet->pages_alloc = cpu_to_be32(p->pages->allocated);
> > +    packet->pages_alloc = cpu_to_be32(MULTIFD_PACKET_SIZE / p->pages->page_size);
> >      packet->normal_pages = cpu_to_be32(p->pages->num);
> >      packet->next_packet_size = cpu_to_be32(p->next_packet_size);
> >      packet->packet_num = cpu_to_be64(p->packet_num);
> > @@ -451,7 +449,7 @@ int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
> >          pages->offset[pages->num] = offset;
> >          pages->num++;
> >  
> > -        if (pages->num < pages->allocated) {
> > +        if (pages->num * pages->page_size < MULTIFD_PACKET_SIZE) {
> >              return 1;
> >          }
> >      } else {
> > diff --git a/migration/multifd.h b/migration/multifd.h
> > index 655f8d5eeb4..d1342296d63 100644
> > --- a/migration/multifd.h
> > +++ b/migration/multifd.h
> > @@ -56,8 +56,6 @@ typedef struct {
> >  typedef struct {
> >      /* number of used pages */
> >      uint32_t num;
> > -    /* number of allocated pages */
> > -    uint32_t allocated;
> >      /* guest page size */
> >      uint32_t page_size;
> >      /* offset of each page */
> > -- 
> 
> I agree.
> 
> Even if we would like to add a parameter to set up the allocated size (I
> remember one of the accelerator series has it), it'll still be a global
> variable rather than a per-pages thing.
> 
> I can cherry-pick this and post it together; it will need a rebase, but I
> can do that.

I see a slight step back here when rebasing, since we'll calculate n_pages
every time we enqueue a page:

static inline bool multifd_queue_full(MultiFDPages_t *pages)
{
    return pages->num == (MULTIFD_PACKET_SIZE / pages->page_size);
}

The "allocated" is still good to cache the value.  Fabiano, would it make
sense we still use a global var (perhaps in multifd_save_state?) to cache
this?

I'll leave this alone for now, I think, but again I agree we should have
something similar.
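
Roughly what I have in mind for later (field and helper names are invented
here, just a sketch):

typedef struct {
    /* ... existing multifd_send_state fields ... */
    /* Cached MULTIFD_PACKET_SIZE / page_size, computed once at setup time */
    uint32_t pages_per_packet;
} MultiFDSendState;

static MultiFDSendState *multifd_send_state;

static void multifd_cache_pages_per_packet(void)
{
    /* Called once from multifd_save_setup() */
    multifd_send_state->pages_per_packet =
        MULTIFD_PACKET_SIZE / qemu_target_page_size();
}

static inline bool multifd_queue_full(MultiFDPages_t *pages)
{
    return pages->num == multifd_send_state->pages_per_packet;
}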

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups
  2024-02-01 21:46       ` Fabiano Rosas
@ 2024-02-02  2:12         ` Peter Xu
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Xu @ 2024-02-02  2:12 UTC (permalink / raw)
  To: Fabiano Rosas
  Cc: Avihai Horon, qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu,
	Hao Xiang

On Thu, Feb 01, 2024 at 06:46:35PM -0300, Fabiano Rosas wrote:
> Avihai Horon <avihaih@nvidia.com> writes:
> 
> > On 01/02/2024 7:47, Peter Xu wrote:
> >> External email: Use caution opening links or attachments
> >>
> >>
> >> On Wed, Jan 31, 2024 at 07:49:51PM -0300, Fabiano Rosas wrote:
> >>> peterx@redhat.com writes:
> >>>
> >>>> From: Peter Xu <peterx@redhat.com>
> >>>>
> >>>> This patchset contains quite a few refactorings to current multifd:
> >>>>
> >>>>    - It picked up some patches from an old series of mine [0] (the last
> >>>>      patches were dropped, though; I did the cleanup slightly differently):
> >>>>
> >>>>      I still managed to include one patch to split pending_job, but I
> >>>>      rewrote the patch here.
> >>>>
> >>>>    - It tries to cleanup multiple multifd paths here and there, the ultimate
> >>>>      goal is to redefine send_prepare() to be something like:
> >>>>
> >>>>        p->pages ----------->  send_prepare() -------------> IOVs
> >>>>
> >>>>      So that there's no obvious change yet on multifd_ops besides redefined
> >>>>      interface for send_prepare().  We may want a separate OPs for file
> >>>>      later.
> >>>>
> >>>> For 2), one benefit is already presented by Fabiano in his other series [1]
> >>>> on cleaning up zero copy, but this patchset addressed it quite differently,
> >>>> and hopefully also more gradually.  The other benefit is for sure if we
> >>>> have a more concrete API for send_prepare() and if we can reach an initial
> >>>> consensus, then we can have the recent compression accelerators rebased on
> >>>> top of this one.
> >>>>
> >>>> This also prepares for the case where the input can be extended to even not
> >>>> any p->pages, but arbitrary data (like VFIO's potential use case in the
> >>>> future?).  But that will also for later even if reasonable.
> >>>>
> >>>> Please have a look.  Thanks,
> >>>>
> >>>> [0] https://lore.kernel.org/r/20231022201211.452861-1-peterx@redhat.com
> >>>> [1] https://lore.kernel.org/qemu-devel/20240126221943.26628-1-farosas@suse.de
> >>>>
> >>>> Peter Xu (14):
> >>>>    migration/multifd: Drop stale comment for multifd zero copy
> >>>>    migration/multifd: multifd_send_kick_main()
> >>>>    migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths
> >>>>    migration/multifd: Postpone reset of MultiFDPages_t
> >>>>    migration/multifd: Drop MultiFDSendParams.normal[] array
> >>>>    migration/multifd: Separate SYNC request with normal jobs
> >>>>    migration/multifd: Simplify locking in sender thread
> >>>>    migration/multifd: Drop pages->num check in sender thread
> >>>>    migration/multifd: Rename p->num_packets and clean it up
> >>>>    migration/multifd: Move total_normal_pages accounting
> >>>>    migration/multifd: Move trace_multifd_send|recv()
> >>>>    migration/multifd: multifd_send_prepare_header()
> >>>>    migration/multifd: Move header prepare/fill into send_prepare()
> >>>>    migration/multifd: Forbid spurious wakeups
> >>>>
> >>>>   migration/multifd.h      |  34 +++--
> >>>>   migration/multifd-zlib.c |  11 +-
> >>>>   migration/multifd-zstd.c |  11 +-
> >>>>   migration/multifd.c      | 291 +++++++++++++++++++--------------------
> >>>>   4 files changed, 182 insertions(+), 165 deletions(-)
> >>> This series didn't survive my 9999 iterations test on the opensuse
> >>> machine.
> >>>
> >>> # Running /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client
> >>> ...
> >>> kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)
> >>>
> >>>
> >>> #0  0x00005575dda06399 in qemu_mutex_lock_impl (mutex=0x18, file=0x5575ddce9cc3 "../util/qemu-thread-posix.c", line=275) at ../util/qemu-thread-posix.c:92
> >>> #1  0x00005575dda06a94 in qemu_sem_post (sem=0x18) at ../util/qemu-thread-posix.c:275
> >>> #2  0x00005575dd56a512 in multifd_send_thread (opaque=0x5575df054ef8) at ../migration/multifd.c:720
> >>> #3  0x00005575dda0709b in qemu_thread_start (args=0x7fd404001d50) at ../util/qemu-thread-posix.c:541
> >>> #4  0x00007fd45e8a26ea in start_thread (arg=0x7fd3faffd700) at pthread_create.c:477
> >>> #5  0x00007fd45cd2150f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> >>>
> >>> The multifd thread is posting channels_ready with an already freed
> >>> multifd_send_state.
> >>>
> >>> This is the bug Avihai has hit. We're going into multifd_save_cleanup()
> >>> so early that multifd_new_send_channel_async() hasn't even had the
> >>> chance to set p->running. So it misses the join and frees everything up
> >>> while a second multifd thread is just starting.
> >> Thanks for doing that.
> >>
> >> Would this series makes that bug easier to happen?
> >
> > I think so.
> > Patch #3 added an extra multifd_send_should_exit() check in 
> > multifd_send_sync_main(), so now it can exit early if the first channel 
> > fails.
> > Plus, now migration state is set to FAILED early by:
> > multifd_new_send_channel_async()->multifd_send_terminate_threads() and 
> > multifd_tls_outgoing_handshake()->multifd_send_terminate_threads()
> > so migration_iteration_run() is completely skipped because 
> > migration_is_active() check before it will return false.
> >
> > I *think* this is what makes main migration thread finish earlier and 
> > call multifd_save_cleanup() earlier, at least for me.
> >
> 
> I'm doing some experiments with a global semaphore like channels_ready
> instead of a per-channel structure like you suggested. I think we only
> need to have a point past which we're assured no more channels will be
> created. With that we'd only need one post at
> multifd_new_send_channel_async.

Fabiano, Avihai,

If this series does not make things drastically worse, I would leave that
issue alone for now and move on with reposting this one, in the hope that
we still have time to address it for 9.0 (the issue itself has existed for
much longer).  I do plan to merge this one earlier if possible, since that
should make it easier for the accelerator projects to rebase on top.

If I missed something, please feel free to raise it again on v2.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 13/14] migration/multifd: Move header prepare/fill into send_prepare()
  2024-01-31 10:31 ` [PATCH 13/14] migration/multifd: Move header prepare/fill into send_prepare() peterx
  2024-01-31 21:42   ` Fabiano Rosas
@ 2024-02-02  3:57   ` Peter Xu
  1 sibling, 0 replies; 45+ messages in thread
From: Peter Xu @ 2024-02-02  3:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: Bryan Zhang, Prasad Pandit, Fabiano Rosas, Yuan Liu,
	Avihai Horon, Hao Xiang

On Wed, Jan 31, 2024 at 06:31:10PM +0800, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> This patch redefines the interface of ->send_prepare().  It further
> simplifies multifd_send_thread() especially on zero copy.
> 
> Now with the new interface, we require the hook to do all the work for
> preparing the IOVs to send.  After it's completed, the IOVs should be ready
> to be dumped into the specific multifd QIOChannel later.
> 
> So now the API looks like:
> 
>   p->pages ----------->  send_prepare() -------------> IOVs
> 
> This also prepares for the case where the input can be extended to even not
> any p->pages.  But that's for later.
> 
> This patch will achieve similar goal of what Fabiano used to propose here:
> 
> https://lore.kernel.org/r/20240126221943.26628-1-farosas@suse.de
> 
> However the send() interface may not be necessary.  I'm boldly attaching a
> "Co-developed-by" for Fabiano.
> 
> Co-developed-by: Fabiano Rosas <farosas@suse.de>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Just a heads-up: I plan to also squash something like the below into it.
That's mostly Fabiano's:

https://lore.kernel.org/r/20240126221943.26628-6-farosas@suse.de

But instead of overwriting write_flags in the hook, I made it a
conditional "OR", in case write_flags gets extended later in common paths
and the plain assignment would then be overlooked.

In short, I'll keep all the zero-copy changes together in this single
patch, hopefully making it clearer.

=====
diff --git a/migration/multifd.c b/migration/multifd.c
index cd4467aff4..6aa44340de 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -50,15 +50,15 @@ typedef struct {
 /**
  * nocomp_send_setup: setup send side
  *
- * For no compression this function does nothing.
- *
- * Returns 0 for success or -1 for error
- *
  * @p: Params for the channel that we are using
  * @errp: pointer to an error
  */
 static int nocomp_send_setup(MultiFDSendParams *p, Error **errp)
 {
+    if (migrate_zero_copy_send()) {
+        p->write_flags |= QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
+    }
+
     return 0;
 }
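
For context (not part of the squash itself): the flag set above is then
consumed when the sender thread flushes the IOVs, roughly as:

    ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num,
                                      NULL, 0, p->write_flags, &local_err);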

-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] migration/multifd: Postpone reset of MultiFDPages_t
  2024-02-02  0:37           ` Peter Xu
@ 2024-02-02 12:15             ` Fabiano Rosas
  0 siblings, 0 replies; 45+ messages in thread
From: Fabiano Rosas @ 2024-02-02 12:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Bryan Zhang, Prasad Pandit, Yuan Liu, Avihai Horon,
	Hao Xiang

Peter Xu <peterx@redhat.com> writes:

> On Fri, Feb 02, 2024 at 08:28:47AM +0800, Peter Xu wrote:
>> > Pages allocated is nonsense. See if you agree with its removal:
>> > https://gitlab.com/farosas/qemu/-/commit/7cfff1a3e31b271e901a6c08d8b5d8c01b680e4d
>> > 
>> > ---
>> > From 7cfff1a3e31b271e901a6c08d8b5d8c01b680e4d Mon Sep 17 00:00:00 2001
>> > From: Fabiano Rosas <farosas@suse.de>
>> > Date: Tue, 24 Oct 2023 19:03:41 -0300
>> > Subject: [PATCH] multifd: Remove MultiFDPage_t:allocated
>> > 
>> > When dealing with RAM, having a field called 'allocated' is
>> > confusing. This field simply holds number of pages that fit in a
>> > multifd packet.
>> > 
>> > Since it is a constant dependent on the size of the multifd packet,
>> > remove it and instead use the page size and MULTIFD_PACKET_SIZE
>> > directly.
>> > 
>> > This is another step in the direction of having no mentions of 'page'
>> > in the multifd send thread.
>> > 
>> > Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> > ---
>> >  migration/multifd.c | 6 ++----
>> >  migration/multifd.h | 2 --
>> >  2 files changed, 2 insertions(+), 6 deletions(-)
>> > 
>> > diff --git a/migration/multifd.c b/migration/multifd.c
>> > index bdefce27706..83fb2caab04 100644
>> > --- a/migration/multifd.c
>> > +++ b/migration/multifd.c
>> > @@ -241,7 +241,6 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
>> >  {
>> >      MultiFDPages_t *pages = g_new0(MultiFDPages_t, 1);
>> >  
>> > -    pages->allocated = n;
>> >      pages->offset = g_new0(ram_addr_t, n);
>> >      pages->page_size = qemu_target_page_size();
>> >  
>> > @@ -251,7 +250,6 @@ static MultiFDPages_t *multifd_pages_init(uint32_t n)
>> >  static void multifd_pages_clear(MultiFDPages_t *pages)
>> >  {
>> >      pages->num = 0;
>> > -    pages->allocated = 0;
>> >      pages->block = NULL;
>> >      g_free(pages->offset);
>> >      pages->offset = NULL;
>> > @@ -264,7 +262,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
>> >      int i;
>> >  
>> >      packet->flags = cpu_to_be32(p->flags);
>> > -    packet->pages_alloc = cpu_to_be32(p->pages->allocated);
>> > +    packet->pages_alloc = cpu_to_be32(MULTIFD_PACKET_SIZE / p->pages->page_size);
>> >      packet->normal_pages = cpu_to_be32(p->pages->num);
>> >      packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>> >      packet->packet_num = cpu_to_be64(p->packet_num);
>> > @@ -451,7 +449,7 @@ int multifd_queue_page(RAMBlock *block, ram_addr_t offset)
>> >          pages->offset[pages->num] = offset;
>> >          pages->num++;
>> >  
>> > -        if (pages->num < pages->allocated) {
>> > +        if (pages->num * pages->page_size < MULTIFD_PACKET_SIZE) {
>> >              return 1;
>> >          }
>> >      } else {
>> > diff --git a/migration/multifd.h b/migration/multifd.h
>> > index 655f8d5eeb4..d1342296d63 100644
>> > --- a/migration/multifd.h
>> > +++ b/migration/multifd.h
>> > @@ -56,8 +56,6 @@ typedef struct {
>> >  typedef struct {
>> >      /* number of used pages */
>> >      uint32_t num;
>> > -    /* number of allocated pages */
>> > -    uint32_t allocated;
>> >      /* guest page size */
>> >      uint32_t page_size;
>> >      /* offset of each page */
>> > -- 
>> 
>> I agree.
>> 
>> Even if we would like to add a parameter to set up the allocated size (I
>> remember one of the accelerator series has it), it'll still be a global
>> variable rather than a per-pages thing.
>> 
>> I can cherry-pick this and post it together; it will need a rebase, but I
>> can do that.
>
> I see a slight step back here when rebasing, since we'll calculate n_pages
> every time we enqueue a page:
>
> static inline bool multifd_queue_full(MultiFDPages_t *pages)
> {
>     return pages->num == (MULTIFD_PACKET_SIZE / pages->page_size);
> }
>
> The "allocated" is still good to cache the value.  Fabiano, would it make
> sense we still use a global var (perhaps in multifd_save_state?) to cache
> this?

Yep.

>
> I'll leave this alone for now, I think, but again I agree we should have
> something similar.

Ok, no problem. I can change this at another time.


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2024-02-02 12:16 UTC | newest]

Thread overview: 45+ messages
2024-01-31 10:30 [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups peterx
2024-01-31 10:30 ` [PATCH 01/14] migration/multifd: Drop stale comment for multifd zero copy peterx
2024-01-31 10:30 ` [PATCH 02/14] migration/multifd: multifd_send_kick_main() peterx
2024-01-31 10:31 ` [PATCH 03/14] migration/multifd: Drop MultiFDSendParams.quit, cleanup error paths peterx
2024-01-31 15:05   ` Fabiano Rosas
2024-02-01  9:28     ` Peter Xu
2024-02-01 13:30       ` Fabiano Rosas
2024-02-02  0:21         ` Peter Xu
2024-01-31 10:31 ` [PATCH 04/14] migration/multifd: Postpone reset of MultiFDPages_t peterx
2024-01-31 15:27   ` Fabiano Rosas
2024-02-01 10:01     ` Peter Xu
2024-02-01 15:21       ` Fabiano Rosas
2024-02-02  0:28         ` Peter Xu
2024-02-02  0:37           ` Peter Xu
2024-02-02 12:15             ` Fabiano Rosas
2024-01-31 10:31 ` [PATCH 05/14] migration/multifd: Drop MultiFDSendParams.normal[] array peterx
2024-01-31 16:02   ` Fabiano Rosas
2024-01-31 10:31 ` [PATCH 06/14] migration/multifd: Separate SYNC request with normal jobs peterx
2024-01-31 18:45   ` Fabiano Rosas
2024-01-31 10:31 ` [PATCH 07/14] migration/multifd: Simplify locking in sender thread peterx
2024-01-31 20:21   ` Fabiano Rosas
2024-02-01 10:37     ` Peter Xu
2024-01-31 10:31 ` [PATCH 08/14] migration/multifd: Drop pages->num check " peterx
2024-01-31 21:19   ` Fabiano Rosas
2024-01-31 10:31 ` [PATCH 09/14] migration/multifd: Rename p->num_packets and clean it up peterx
2024-01-31 21:24   ` Fabiano Rosas
2024-01-31 10:31 ` [PATCH 10/14] migration/multifd: Move total_normal_pages accounting peterx
2024-01-31 21:26   ` Fabiano Rosas
2024-01-31 10:31 ` [PATCH 11/14] migration/multifd: Move trace_multifd_send|recv() peterx
2024-01-31 21:26   ` Fabiano Rosas
2024-01-31 10:31 ` [PATCH 12/14] migration/multifd: multifd_send_prepare_header() peterx
2024-01-31 21:32   ` Fabiano Rosas
2024-02-01 10:02     ` Peter Xu
2024-01-31 10:31 ` [PATCH 13/14] migration/multifd: Move header prepare/fill into send_prepare() peterx
2024-01-31 21:42   ` Fabiano Rosas
2024-02-01 10:15     ` Peter Xu
2024-02-02  3:57   ` Peter Xu
2024-01-31 10:31 ` [PATCH 14/14] migration/multifd: Forbid spurious wakeups peterx
2024-01-31 21:43   ` Fabiano Rosas
2024-02-01  6:01   ` Peter Xu
2024-01-31 22:49 ` [PATCH 00/14] migration/multifd: Refactor ->send_prepare() and cleanups Fabiano Rosas
2024-02-01  5:47   ` Peter Xu
2024-02-01 12:51     ` Avihai Horon
2024-02-01 21:46       ` Fabiano Rosas
2024-02-02  2:12         ` Peter Xu
