* [RFC PATCH 0/4] MultiFD zero-copy improvements
@ 2022-10-25  4:47 Leonardo Bras
  2022-10-25  4:47 ` [RFC PATCH 1/4] migration/multifd/zero-copy: Create helper function for flushing Leonardo Bras
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Leonardo Bras @ 2022-10-25  4:47 UTC (permalink / raw)
  To: Daniel P. Berrangé, Juan Quintela, Dr. David Alan Gilbert, Peter Xu
  Cc: Leonardo Bras, qemu-devel

This is an RFC for an improvement suggested by Juan during KVM Forum, plus a
few optimizations I found along the way.

Patch #1 just moves code into a helper; it should have no functional impact.

Patch #2 is my implementation of Juan's suggestion. For the array size I
went with the simplest approach I could think of: a fixed, compile-time value.
I am not sure if this is fine, or if the array size should instead be
provided by the user, either via QMP or on the command line.
That's an important point I really need feedback on.

Patch #3: Improve the qio_channel_flush() interface so a flush can wait for
only some of the writes to finish instead of all of them. This reduces the
waiting time, since the most recent writes/sends take the longest to finish,
while the older ones have probably completed by the time the first recvmsg()
syscall returns.
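
For illustration, the new call from patch #3 would look roughly like this
(the value 16 is only an example, not a proposed default):

    Error *err = NULL;

    /* Return once at most 16 zero-copy writes are still pending on this
     * channel; passing 0 keeps the current flush-everything behavior. */
    if (qio_channel_flush(ioc, 16, &err) < 0) {
        error_report_err(err);
    }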

Patch #4 uses #3 in multifd zero-copy. It flushes only up to the LRU half of
the header array, allowing more writes to be issued while the most recent
ones are still in flight, instead of waiting for everything to finish before
sending more.
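
Condensed from the diff in patch #4, the per-channel send loop then does
roughly this (sketch only, see the patch for the real code):

    p->packet_idx = (p->packet_idx + 1) % HEADER_ARR_SZ;
    /* Whenever half of the header ring has been consumed, wait until at
     * most the newest half is still pending, so the LRU half can be reused. */
    if (!(p->packet_idx % (HEADER_ARR_SZ / 2)) &&
        multifd_zero_copy_flush(p->c, HEADER_ARR_SZ / 2) < 0) {
        break;
    }
    header = (void *)p->packet + p->packet_idx * p->packet_len;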

It all works fine in my tests, but maybe I missed some corner case.
Please provide any feedback you see fit.

Thank you all!

Best regards,
Leo

Leonardo Bras (4):
  migration/multifd/zero-copy: Create helper function for flushing
  migration/multifd/zero-copy: Merge header & pages send in a single
    write
  QIOChannel: Add max_pending parameter to qio_channel_flush()
  migration/multifd/zero-copy: Flush only the LRU half of the header
    array

 include/io/channel.h |  7 +++-
 migration/multifd.h  |  5 ++-
 io/channel-socket.c  |  5 ++-
 io/channel.c         |  5 ++-
 migration/multifd.c  | 88 ++++++++++++++++++++++++++------------------
 5 files changed, 68 insertions(+), 42 deletions(-)

-- 
2.38.0




* [RFC PATCH 1/4] migration/multifd/zero-copy: Create helper function for flushing
  2022-10-25  4:47 [RFC PATCH 0/4] MultiFD zero-copy improvements Leonardo Bras
@ 2022-10-25  4:47 ` Leonardo Bras
  2022-10-25  9:44   ` Juan Quintela
  2022-10-25  4:47 ` [RFC PATCH 2/4] migration/multifd/zero-copy: Merge header & pages send in a single write Leonardo Bras
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Leonardo Bras @ 2022-10-25  4:47 UTC (permalink / raw)
  To: Daniel P. Berrangé, Juan Quintela, Dr. David Alan Gilbert, Peter Xu
  Cc: Leonardo Bras, qemu-devel

Move the flushing code from multifd_send_sync_main() into a new helper, and
call that helper from multifd_send_sync_main().

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 586ddc9d65..509bbbe3bf 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -566,6 +566,23 @@ void multifd_save_cleanup(void)
     multifd_send_state = NULL;
 }
 
+static int multifd_zero_copy_flush(QIOChannel *c)
+{
+    int ret;
+    Error *err = NULL;
+
+    ret = qio_channel_flush(c, &err);
+    if (ret < 0) {
+        error_report_err(err);
+        return -1;
+    }
+    if (ret == 1) {
+        dirty_sync_missed_zero_copy();
+    }
+
+    return ret;
+}
+
 int multifd_send_sync_main(QEMUFile *f)
 {
     int i;
@@ -616,17 +633,8 @@ int multifd_send_sync_main(QEMUFile *f)
         qemu_mutex_unlock(&p->mutex);
         qemu_sem_post(&p->sem);
 
-        if (flush_zero_copy && p->c) {
-            int ret;
-            Error *err = NULL;
-
-            ret = qio_channel_flush(p->c, &err);
-            if (ret < 0) {
-                error_report_err(err);
-                return -1;
-            } else if (ret == 1) {
-                dirty_sync_missed_zero_copy();
-            }
+        if (flush_zero_copy && p->c && (multifd_zero_copy_flush(p->c) < 0)) {
+            return -1;
         }
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
-- 
2.38.0




* [RFC PATCH 2/4] migration/multifd/zero-copy: Merge header & pages send in a single write
  2022-10-25  4:47 [RFC PATCH 0/4] MultiFD zero-copy improvements Leonardo Bras
  2022-10-25  4:47 ` [RFC PATCH 1/4] migration/multifd/zero-copy: Create helper function for flushing Leonardo Bras
@ 2022-10-25  4:47 ` Leonardo Bras
  2022-10-25  9:51   ` Juan Quintela
  2022-10-25  4:47 ` [RFC PATCH 3/4] QIOChannel: Add max_pending parameter to qio_channel_flush() Leonardo Bras
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Leonardo Bras @ 2022-10-25  4:47 UTC (permalink / raw)
  To: Daniel P. Berrangé, Juan Quintela, Dr. David Alan Gilbert, Peter Xu
  Cc: Leonardo Bras, qemu-devel

When zero-copy-send is enabled, each loop iteration of multifd_send_thread
calls qio_channel_write_*() twice: once to send the header without the
zero-copy flag, and once to send the memory pages with the zero-copy flag
enabled.

This ends up calling two syscalls per loop iteration, where one should be
enough.

Also, since non-zero-copy writes behave synchronously while zero-copy writes
behave asynchronously, this ends up interleaving synchronous and asynchronous
writes, giving up performance that could otherwise be gained.

The header was sent without the zero-copy flag in a separate write because
the header memory area would be reused in the next write, so it was almost
certain to have changed before the kernel could send the packet.

To send the header with zero-copy as well, create an array of header areas
instead of a single one, and use a different header area for each write.
Also, flush the send queue once all the headers have been used.

To avoid adding a zero-copy conditional to multifd_send_fill_packet(), add a
packet parameter (the packet that should be filled). This makes it simpler to
pick which element of the array is used as the header.

Suggested-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.h |  5 ++++-
 migration/multifd.c | 52 ++++++++++++++++++++++++---------------------
 2 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 519f498643..7f200c0286 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -90,6 +90,7 @@ typedef struct {
 
     /* this mutex protects the following parameters */
     QemuMutex mutex;
+
     /* is this channel thread running */
     bool running;
     /* should this thread finish */
@@ -109,8 +110,10 @@ typedef struct {
 
     /* thread local variables. No locking required */
 
-    /* pointer to the packet */
+    /* pointer to the packet array */
     MultiFDPacket_t *packet;
+    /* index of the packet array */
+    uint32_t packet_idx;
     /* size of the next packet that contains pages */
     uint32_t next_packet_size;
     /* packets sent through this channel */
diff --git a/migration/multifd.c b/migration/multifd.c
index 509bbbe3bf..aa18c47141 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -34,6 +34,8 @@
 #define MULTIFD_MAGIC 0x11223344U
 #define MULTIFD_VERSION 1
 
+#define HEADER_ARR_SZ 1024
+
 typedef struct {
     uint32_t magic;
     uint32_t version;
@@ -255,9 +257,9 @@ static void multifd_pages_clear(MultiFDPages_t *pages)
     g_free(pages);
 }
 
-static void multifd_send_fill_packet(MultiFDSendParams *p)
+static void multifd_send_fill_packet(MultiFDSendParams *p,
+                                     MultiFDPacket_t *packet)
 {
-    MultiFDPacket_t *packet = p->packet;
     int i;
 
     packet->flags = cpu_to_be32(p->flags);
@@ -547,6 +549,7 @@ void multifd_save_cleanup(void)
         p->packet_len = 0;
         g_free(p->packet);
         p->packet = NULL;
+        p->packet_idx = 0;
         g_free(p->iov);
         p->iov = NULL;
         g_free(p->normal);
@@ -651,6 +654,7 @@ int multifd_send_sync_main(QEMUFile *f)
 static void *multifd_send_thread(void *opaque)
 {
     MultiFDSendParams *p = opaque;
+    MultiFDPacket_t *header = p->packet;
     Error *local_err = NULL;
     int ret = 0;
     bool use_zero_copy_send = migrate_use_zero_copy_send();
@@ -676,13 +680,9 @@ static void *multifd_send_thread(void *opaque)
         if (p->pending_job) {
             uint64_t packet_num = p->packet_num;
             uint32_t flags = p->flags;
-            p->normal_num = 0;
 
-            if (use_zero_copy_send) {
-                p->iovs_num = 0;
-            } else {
-                p->iovs_num = 1;
-            }
+            p->normal_num = 0;
+            p->iovs_num = 1;
 
             for (int i = 0; i < p->pages->num; i++) {
                 p->normal[p->normal_num] = p->pages->offset[i];
@@ -696,7 +696,8 @@ static void *multifd_send_thread(void *opaque)
                     break;
                 }
             }
-            multifd_send_fill_packet(p);
+
+            multifd_send_fill_packet(p, header);
             p->flags = 0;
             p->num_packets++;
             p->total_normal_pages += p->normal_num;
@@ -707,18 +708,8 @@ static void *multifd_send_thread(void *opaque)
             trace_multifd_send(p->id, packet_num, p->normal_num, flags,
                                p->next_packet_size);
 
-            if (use_zero_copy_send) {
-                /* Send header first, without zerocopy */
-                ret = qio_channel_write_all(p->c, (void *)p->packet,
-                                            p->packet_len, &local_err);
-                if (ret != 0) {
-                    break;
-                }
-            } else {
-                /* Send header using the same writev call */
-                p->iov[0].iov_len = p->packet_len;
-                p->iov[0].iov_base = p->packet;
-            }
+            p->iov[0].iov_len = p->packet_len;
+            p->iov[0].iov_base = header;
 
             ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
                                               0, p->write_flags, &local_err);
@@ -726,6 +717,14 @@ static void *multifd_send_thread(void *opaque)
                 break;
             }
 
+            if (use_zero_copy_send) {
+                p->packet_idx = (p->packet_idx + 1) % HEADER_ARR_SZ;
+
+                if (!p->packet_idx && (multifd_zero_copy_flush(p->c) < 0)) {
+                    break;
+                }
+                header = (void *)p->packet + p->packet_idx * p->packet_len;
+            }
             qemu_mutex_lock(&p->mutex);
             p->pending_job--;
             qemu_mutex_unlock(&p->mutex);
@@ -930,6 +929,7 @@ int multifd_save_setup(Error **errp)
 
     for (i = 0; i < thread_count; i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
+        int j;
 
         qemu_mutex_init(&p->mutex);
         qemu_sem_init(&p->sem, 0);
@@ -940,9 +940,13 @@ int multifd_save_setup(Error **errp)
         p->pages = multifd_pages_init(page_count);
         p->packet_len = sizeof(MultiFDPacket_t)
                       + sizeof(uint64_t) * page_count;
-        p->packet = g_malloc0(p->packet_len);
-        p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
-        p->packet->version = cpu_to_be32(MULTIFD_VERSION);
+        p->packet = g_malloc0_n(HEADER_ARR_SZ, p->packet_len);
+        for (j = 0; j < HEADER_ARR_SZ ; j++) {
+            MultiFDPacket_t *packet = (void *)p->packet + j * p->packet_len;
+            packet->magic = cpu_to_be32(MULTIFD_MAGIC);
+            packet->version = cpu_to_be32(MULTIFD_VERSION);
+        }
+        p->packet_idx = 0;
         p->name = g_strdup_printf("multifdsend_%d", i);
         /* We need one extra place for the packet header */
         p->iov = g_new0(struct iovec, page_count + 1);
-- 
2.38.0




* [RFC PATCH 3/4] QIOChannel: Add max_pending parameter to qio_channel_flush()
  2022-10-25  4:47 [RFC PATCH 0/4] MultiFD zero-copy improvements Leonardo Bras
  2022-10-25  4:47 ` [RFC PATCH 1/4] migration/multifd/zero-copy: Create helper function for flushing Leonardo Bras
  2022-10-25  4:47 ` [RFC PATCH 2/4] migration/multifd/zero-copy: Merge header & pages send in a single write Leonardo Bras
@ 2022-10-25  4:47 ` Leonardo Bras
  2022-10-25  4:47 ` [RFC PATCH 4/4] migration/multifd/zero-copy: Flush only the LRU half of the header array Leonardo Bras
  2022-10-25 16:36 ` [RFC PATCH 0/4] MultiFD zero-copy improvements Peter Xu
  4 siblings, 0 replies; 12+ messages in thread
From: Leonardo Bras @ 2022-10-25  4:47 UTC (permalink / raw)
  To: Daniel P. Berrangé, Juan Quintela, Dr. David Alan Gilbert, Peter Xu
  Cc: Leonardo Bras, qemu-devel

Zero-copy write in Linux is an asynchronous kind of write: the send is not
finished by the time the write call returns. Since it is also zero-copy,
incorrect data may be sent if the write buffer gets modified after the write
returns.

To check whether a zero-copy write has finished, qio_channel implements a
flush operation: qio_channel_flush(). As of today, by the time the flush
returns, user code knows for sure that all previous zero-copy writes have
finished, and it is safe to modify any write buffer.
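
As background, the kernel mechanism underneath looks roughly like this (plain
Linux MSG_ZEROCOPY usage, not QEMU code; fd/buf/len assumed to be an already
connected socket and a buffer, error handling omitted):

    #include <sys/socket.h>

    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

    /* Returns once the pages are pinned, not once they are actually sent */
    send(fd, buf, len, MSG_ZEROCOPY);

    /* Completions arrive later on the socket's error queue; this is what
     * qio_channel_socket_flush() polls with recvmsg(). */
    struct msghdr msg = { 0 };
    char control[128];
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    recvmsg(fd, &msg, MSG_ERRQUEUE);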

While this kind of flush is necessary, it may take a while if there has been
a recent write, as the OS has to wait for everything to be sent before
returning and allowing the write buffers to be reused or freed.

One option that can improve performance in some scenarios is to modify flush
so it only guarantees that at most N zero-copy writes remain pending, allowing
some of the write buffers to be reused. This lets flush return much faster,
since it does not need to wait for the most recent writes to complete.

If (N == 0), then it replicates the previous flushing behavior.

Add a max_pending parameter to qio_channel_flush() so the caller can decide
the maximum number of writes that may remain pending when the function
returns. Also, implement this change in qio_channel_socket_flush().

Change the current callers of qio_channel_flush() to pass (max_pending == 0),
so the flush-all behavior is maintained.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/io/channel.h | 7 +++++--
 io/channel-socket.c  | 5 +++--
 io/channel.c         | 5 +++--
 migration/multifd.c  | 2 +-
 4 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index c680ee7480..9ec1978a26 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -141,6 +141,7 @@ struct QIOChannelClass {
                                   IOHandler *io_write,
                                   void *opaque);
     int (*io_flush)(QIOChannel *ioc,
+                    int max_pending,
                     Error **errp);
 };
 
@@ -875,11 +876,12 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
 /**
  * qio_channel_flush:
  * @ioc: the channel object
+ * @max_pending: Maximum remaining writes allowed in queue upon returning
  * @errp: pointer to a NULL-initialized error object
  *
- * Will block until every packet queued with
+ * Will block until there are at most max_pending writes called with
  * qio_channel_writev_full() + QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
- * is sent, or return in case of any error.
+ * pending, or return in case of any error.
  *
  * If not implemented, acts as a no-op, and returns 0.
  *
@@ -889,6 +891,7 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
  */
 
 int qio_channel_flush(QIOChannel *ioc,
+                      int max_pending,
                       Error **errp);
 
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-socket.c b/io/channel-socket.c
index b76dca9cc1..3d0c2b8a14 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -708,6 +708,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 
 #ifdef QEMU_MSG_ZEROCOPY
 static int qio_channel_socket_flush(QIOChannel *ioc,
+                                    int max_pending,
                                     Error **errp)
 {
     QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
@@ -718,7 +719,7 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
     int received;
     int ret;
 
-    if (sioc->zero_copy_queued == sioc->zero_copy_sent) {
+    if (sioc->zero_copy_queued - sioc->zero_copy_sent <= max_pending) {
         return 0;
     }
 
@@ -728,7 +729,7 @@ static int qio_channel_socket_flush(QIOChannel *ioc,
 
     ret = 1;
 
-    while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
+    while (sioc->zero_copy_queued - sioc->zero_copy_sent > max_pending) {
         received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
         if (received < 0) {
             switch (errno) {
diff --git a/io/channel.c b/io/channel.c
index 0640941ac5..9d9f15af50 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -490,7 +490,8 @@ off_t qio_channel_io_seek(QIOChannel *ioc,
 }
 
 int qio_channel_flush(QIOChannel *ioc,
-                                Error **errp)
+                      int max_pending,
+                      Error **errp)
 {
     QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
 
@@ -499,7 +500,7 @@ int qio_channel_flush(QIOChannel *ioc,
         return 0;
     }
 
-    return klass->io_flush(ioc, errp);
+    return klass->io_flush(ioc, max_pending, errp);
 }
 
 
diff --git a/migration/multifd.c b/migration/multifd.c
index aa18c47141..c5d1f911a4 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -574,7 +574,7 @@ static int multifd_zero_copy_flush(QIOChannel *c)
     int ret;
     Error *err = NULL;
 
-    ret = qio_channel_flush(c, &err);
+    ret = qio_channel_flush(c, 0, &err);
     if (ret < 0) {
         error_report_err(err);
         return -1;
-- 
2.38.0




* [RFC PATCH 4/4] migration/multifd/zero-copy: Flush only the LRU half of the header array
  2022-10-25  4:47 [RFC PATCH 0/4] MultiFD zero-copy improvements Leonardo Bras
                   ` (2 preceding siblings ...)
  2022-10-25  4:47 ` [RFC PATCH 3/4] QIOChannel: Add max_pending parameter to qio_channel_flush() Leonardo Bras
@ 2022-10-25  4:47 ` Leonardo Bras
  2022-10-25  8:35   ` Daniel P. Berrangé
  2022-10-25 16:36 ` [RFC PATCH 0/4] MultiFD zero-copy improvements Peter Xu
  4 siblings, 1 reply; 12+ messages in thread
From: Leonardo Bras @ 2022-10-25  4:47 UTC (permalink / raw)
  To: Daniel P. Berrangé, Juan Quintela, Dr. David Alan Gilbert, Peter Xu
  Cc: Leonardo Bras, qemu-devel

Zero-copy multifd migration sends both the header and the memory pages in a
single syscall. Since it is necessary to flush before reusing a header, a
header array was implemented, so each write call uses a different header and
flushing only takes place after all headers have been used, meaning one flush
for every N writes.

This method has a bottleneck, though: after the last write, a flush has to
wait for all pending writes to finish, which can be many, meaning the
recvmsg() syscall in qio_channel_socket_flush() gets called a lot. On top of
that, it creates a period between the flush and the next write during which
the I/O queue is empty and nothing is being sent.

To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
until at most half of the array is still in use (i.e. the LRU half of the
array can be reused).

Flushing only up to the LRU half of the array is much faster, since it does
not have to wait for the most recent writes to finish, which makes up for
having to flush twice per array.

As the main benefit, this approach keeps the I/O queue from going empty while
there is still data to be sent, making it easier to sustain maximum I/O
throughput while consuming less CPU time.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index c5d1f911a4..fe9df460f6 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -569,12 +569,13 @@ void multifd_save_cleanup(void)
     multifd_send_state = NULL;
 }
 
-static int multifd_zero_copy_flush(QIOChannel *c)
+static int multifd_zero_copy_flush(QIOChannel *c,
+                                   int max_remaining)
 {
     int ret;
     Error *err = NULL;
 
-    ret = qio_channel_flush(c, 0, &err);
+    ret = qio_channel_flush(c, max_remaining, &err);
     if (ret < 0) {
         error_report_err(err);
         return -1;
@@ -636,7 +637,7 @@ int multifd_send_sync_main(QEMUFile *f)
         qemu_mutex_unlock(&p->mutex);
         qemu_sem_post(&p->sem);
 
-        if (flush_zero_copy && p->c && (multifd_zero_copy_flush(p->c) < 0)) {
+        if (flush_zero_copy && p->c && (multifd_zero_copy_flush(p->c, 0) < 0)) {
             return -1;
         }
     }
@@ -719,12 +720,17 @@ static void *multifd_send_thread(void *opaque)
 
             if (use_zero_copy_send) {
                 p->packet_idx = (p->packet_idx + 1) % HEADER_ARR_SZ;
-
-                if (!p->packet_idx && (multifd_zero_copy_flush(p->c) < 0)) {
+                /*
+                 * When half of the array has been used, flush to make sure the
+                 * next half is available
+                 */
+                if (!(p->packet_idx % (HEADER_ARR_SZ / 2)) &&
+                    (multifd_zero_copy_flush(p->c, HEADER_ARR_SZ / 2) < 0)) {
                     break;
                 }
                 header = (void *)p->packet + p->packet_idx * p->packet_len;
             }
+
             qemu_mutex_lock(&p->mutex);
             p->pending_job--;
             qemu_mutex_unlock(&p->mutex);
-- 
2.38.0




* Re: [RFC PATCH 4/4] migration/multifd/zero-copy: Flush only the LRU half of the header array
  2022-10-25  4:47 ` [RFC PATCH 4/4] migration/multifd/zero-copy: Flush only the LRU half of the header array Leonardo Bras
@ 2022-10-25  8:35   ` Daniel P. Berrangé
  2022-10-25 10:07     ` Juan Quintela
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel P. Berrangé @ 2022-10-25  8:35 UTC (permalink / raw)
  To: Leonardo Bras; +Cc: Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel

On Tue, Oct 25, 2022 at 01:47:31AM -0300, Leonardo Bras wrote:
> Zero-copy multifd migration sends both the header and the memory pages in a
> single syscall. Since it's necessary to flush before reusing the header, a
> header array was implemented, so each write call uses a different
> array, and flushing only take place after all headers have been used,
> meaning 1 flush for each N writes.
> 
> This method has a bottleneck, though: After the last write, a flush will
> have to wait for all writes to finish, which will be a lot, meaning the
> recvmsg() syscall called in qio_channel_socket_flush() will be called a
> lot. On top of that, it will create a time period when the I/O queue is
> empty and nothing is getting send: between the flush and the next write.
> 
> To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
> until at most half of the array is still in use. (i.e. the LRU half of the
> array can be reused)
> 
> Flushing for the LRU half of the array is much faster, since it does not
> have to wait for the most recent writes to finish, making up for having
> to flush twice per array.
> 
> As a main benefit, this approach keeps the I/O queue from being empty while
> there are still data to be sent, making it easier to keep the I/O maximum
> throughput while consuming less cpu time.

Doesn't this defeat the reason for adding the flush in the first
place, which was to ensure that a migration iteration was fully
sent before starting the next iteration over RAM? If it is OK to
only partially flush on each iteration, then why do we need to
flush at all?


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [RFC PATCH 1/4] migration/multifd/zero-copy: Create helper function for flushing
  2022-10-25  4:47 ` [RFC PATCH 1/4] migration/multifd/zero-copy: Create helper function for flushing Leonardo Bras
@ 2022-10-25  9:44   ` Juan Quintela
  0 siblings, 0 replies; 12+ messages in thread
From: Juan Quintela @ 2022-10-25  9:44 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Daniel P. Berrangé, Dr. David Alan Gilbert, Peter Xu, qemu-devel

Leonardo Bras <leobras@redhat.com> wrote:
> Move flushing code from multifd_send_sync_main() to a new helper, and call
> it in multifd_send_sync_main().
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>




* Re: [RFC PATCH 2/4] migration/multifd/zero-copy: Merge header & pages send in a single write
  2022-10-25  4:47 ` [RFC PATCH 2/4] migration/multifd/zero-copy: Merge header & pages send in a single write Leonardo Bras
@ 2022-10-25  9:51   ` Juan Quintela
  2022-10-25 13:28     ` Leonardo Brás
  0 siblings, 1 reply; 12+ messages in thread
From: Juan Quintela @ 2022-10-25  9:51 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Daniel P. Berrangé, Dr. David Alan Gilbert, Peter Xu, qemu-devel

Leonardo Bras <leobras@redhat.com> wrote:
> When zero-copy-send is enabled, each loop iteration of the
> multifd_send_thread will calls for qio_channel_write_*() twice: The first
> one for sending the header without zero-copy flag and the second one for
> sending the memory pages, with zero-copy flag enabled.
>
> This ends up calling two syscalls per loop iteration, where one should be
> enough.
>
> Also, since the behavior for non-zero-copy write is synchronous, and the
> behavior for zero-copy write is asynchronous, it ends up interleaving
> synchronous and asynchronous writes, hurting performance that could
> otherwise be improved.
>
> The choice of sending the header without the zero-copy flag in a separated
> write happened because the header memory area would be reused in the next
> write, so it was almost certain to have changed before the kernel could
> send the packet.
>
> To send the packet with zero-copy, create an array of header area instead
> of a single one, and use a different header area after each write. Also,
> flush the sending queue after all the headers have been used.
>
> To avoid adding a zero-copy conditional in multifd_send_fill_packet(),
> add a packet parameter (the packet that should be filled). This way it's
> simpler to pick which element of the array will be used as a header.
>
> Suggested-by: Juan Quintela <quintela@redhat.com>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>

>  
> +            if (use_zero_copy_send) {
> +                p->packet_idx = (p->packet_idx + 1) % HEADER_ARR_SZ;
> +
> +                if (!p->packet_idx && (multifd_zero_copy_flush(p->c) < 0)) {
> +                    break;
> +                }
> +                header = (void *)p->packet + p->packet_idx * p->packet_len;

Isn't this equivalent to?

      header = &(p->packet[p->packet_idx]);

>      for (i = 0; i < thread_count; i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
> +        int j;

For new code you can:

>          qemu_mutex_init(&p->mutex);
>          qemu_sem_init(&p->sem, 0);
> @@ -940,9 +940,13 @@ int multifd_save_setup(Error **errp)
>          p->pages = multifd_pages_init(page_count);
>          p->packet_len = sizeof(MultiFDPacket_t)
>                        + sizeof(uint64_t) * page_count;
> -        p->packet = g_malloc0(p->packet_len);
> -        p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
> -        p->packet->version = cpu_to_be32(MULTIFD_VERSION);
> +        p->packet = g_malloc0_n(HEADER_ARR_SZ, p->packet_len);
> +        for (j = 0; j < HEADER_ARR_SZ ; j++) {

           for (int j = 0; j < HEADER_ARR_SZ ; j++) {

> +            MultiFDPacket_t *packet = (void *)p->packet + j * p->packet_len;
> +            packet->magic = cpu_to_be32(MULTIFD_MAGIC);
> +            packet->version = cpu_to_be32(MULTIFD_VERSION);

Can't you use here:

            packet[j].magic = cpu_to_be32(MULTIFD_MAGIC);
            packet[j].version = cpu_to_be32(MULTIFD_VERSION);

And call it a day?

The rest is fine for me.  Thanks for the effort.

Later, Juan.




* Re: [RFC PATCH 4/4] migration/multifd/zero-copy: Flush only the LRU half of the header array
  2022-10-25  8:35   ` Daniel P. Berrangé
@ 2022-10-25 10:07     ` Juan Quintela
  2022-10-25 13:47       ` Leonardo Brás
  0 siblings, 1 reply; 12+ messages in thread
From: Juan Quintela @ 2022-10-25 10:07 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Leonardo Bras, Dr. David Alan Gilbert, Peter Xu, qemu-devel

Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Tue, Oct 25, 2022 at 01:47:31AM -0300, Leonardo Bras wrote:
>> Zero-copy multifd migration sends both the header and the memory pages in a
>> single syscall. Since it's necessary to flush before reusing the header, a
>> header array was implemented, so each write call uses a different
>> array, and flushing only take place after all headers have been used,
>> meaning 1 flush for each N writes.
>> 
>> This method has a bottleneck, though: After the last write, a flush will
>> have to wait for all writes to finish, which will be a lot, meaning the
>> recvmsg() syscall called in qio_channel_socket_flush() will be called a
>> lot. On top of that, it will create a time period when the I/O queue is
>> empty and nothing is getting send: between the flush and the next write.
>> 
>> To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
>> until at most half of the array is still in use. (i.e. the LRU half of the
>> array can be reused)
>> 
>> Flushing for the LRU half of the array is much faster, since it does not
>> have to wait for the most recent writes to finish, making up for having
>> to flush twice per array.
>> 
>> As a main benefit, this approach keeps the I/O queue from being empty while
>> there are still data to be sent, making it easier to keep the I/O maximum
>> throughput while consuming less cpu time.
>
> Doesn't this defeat the reason for adding the flush in the first
> place, which was to ensure that a migration iteration was fully
> sent before starting the next iteration over RAM ? If it is OK to
> only partially flush on each iteration, then why do we need to
> flush at all ?

Now we need to do the flush in two places:
- on sync_main (the old place)
- on the migration thread, when we run out of array entries.
  This one has nothing to do with correctness, it is just that we have
  more space than needed.

So, in this regard, I think that the patches are correct.

But on the other hand, I am not sure that I like the size of the array.
Leonardo is using 1024 entries for each migration channel.  That means each
channel is allowed 1024 entries * 512 KB per packet = 512 MB of unflushed
data.  I think that this is still too much.

I will need some data from testing, but my understanding of how zero-copy
works is that having around 10MB in each channel would be more than
enough to saturate the link.  And once the data in flight is that much
smaller, we can just flush it when we run out of entries.

My idea here was to work out the size the other way around: add a parameter
where the user says how much memory is available for mlocking, and give each
channel memory/channels array entries (rough sketch below).  That will:

a - limit the amount of mlocked memory that we need.
    10MB/channel for 10 channels is 100MB of mlocked memory, which looks
    reasonable for a guest that has many gigabytes of RAM.

b - We don't synchronize after each write, because I still claim that
    doing a non-asynchronous write on the channel just syncs everything
    (otherwise I can't see how the hardware could even work).
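
A rough sketch of that sizing (the parameter name is hypothetical, just to
show the arithmetic; MULTIFD_PACKET_SIZE is the existing 512 KB packet size):

    uint64_t budget   = migrate_zero_copy_mlock_limit(); /* hypothetical user knob */
    uint32_t channels = migrate_multifd_channels();
    /* header/packet slots per channel, i.e. what replaces HEADER_ARR_SZ */
    uint32_t entries  = budget / (channels * MULTIFD_PACKET_SIZE);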

So I guess that the best thing to decide this is:
- get a couple of nice 100Gigabit networking cards
- Enable 16 channels or so, so we know that the CPU is not going to be
  the bottleneck
- test with this patch
- remove the last two patches and test with several array sizes
  (10, 20, 30, ...); we will see that after some size, performance
  stops improving.
- Use that value as the default one.

What do you think?

Later, Juan.

PS:  Yes, I don't like adding another parameter; so either you recompile
     with different values, or we simply don't add the parameter once we
     have found the right value.




* Re: [RFC PATCH 2/4] migration/multifd/zero-copy: Merge header & pages send in a single write
  2022-10-25  9:51   ` Juan Quintela
@ 2022-10-25 13:28     ` Leonardo Brás
  0 siblings, 0 replies; 12+ messages in thread
From: Leonardo Brás @ 2022-10-25 13:28 UTC (permalink / raw)
  To: quintela
  Cc: Daniel P.Berrangé, Dr. David Alan Gilbert, Peter Xu, qemu-devel

On Tue, 2022-10-25 at 11:51 +0200, Juan Quintela wrote:
> Leonardo Bras <leobras@redhat.com> wrote:
> > When zero-copy-send is enabled, each loop iteration of the
> > multifd_send_thread will calls for qio_channel_write_*() twice: The first
> > one for sending the header without zero-copy flag and the second one for
> > sending the memory pages, with zero-copy flag enabled.
> > 
> > This ends up calling two syscalls per loop iteration, where one should be
> > enough.
> > 
> > Also, since the behavior for non-zero-copy write is synchronous, and the
> > behavior for zero-copy write is asynchronous, it ends up interleaving
> > synchronous and asynchronous writes, hurting performance that could
> > otherwise be improved.
> > 
> > The choice of sending the header without the zero-copy flag in a separated
> > write happened because the header memory area would be reused in the next
> > write, so it was almost certain to have changed before the kernel could
> > send the packet.
> > 
> > To send the packet with zero-copy, create an array of header area instead
> > of a single one, and use a different header area after each write. Also,
> > flush the sending queue after all the headers have been used.
> > 
> > To avoid adding a zero-copy conditional in multifd_send_fill_packet(),
> > add a packet parameter (the packet that should be filled). This way it's
> > simpler to pick which element of the array will be used as a header.
> > 
> > Suggested-by: Juan Quintela <quintela@redhat.com>
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> 
> >  
> > +            if (use_zero_copy_send) {
> > +                p->packet_idx = (p->packet_idx + 1) % HEADER_ARR_SZ;
> > +
> > +                if (!p->packet_idx && (multifd_zero_copy_flush(p->c) < 0)) {
> > +                    break;
> > +                }
> > +                header = (void *)p->packet + p->packet_idx * p->packet_len;
> 
> Isn't this equivalent to?
> 
>       header = &(p->packet[p->packet_idx]);


So, that's the thing:
MultiFDPacket_t has a flexible array member (offset[]) at the end of the
struct, which is used together with dynamic allocation to create the effect
of a variable-size struct.

This is used to make sure the packet can carry the offset of every memory
page being sent, regardless of the target page size:

multifd_save_setup() {
	uint32_t page_count = MULTIFD_PACKET_SIZE / qemu_target_page_size();
	[...]
	p->packet_len = sizeof(MultiFDPacket_t)
                      + sizeof(uint64_t) * page_count;
	p->packet = g_malloc0(p->packet_len);
	[...]
}

So the size of each header is actually p->packet_len, not
sizeof(MultiFDPacket_t). This means the default pointer arithmetic does not
work for this case, and

header = &(p->packet[p->packet_idx]) 

will actually point to the wrong place in memory, since it does not account
for the actual size of the offset[] array.
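
A simplified sketch of the layout (fixed fields abbreviated) to illustrate the
point:

    typedef struct {
        uint32_t magic;
        uint32_t version;
        /* ... more fixed-size fields ... */
        uint64_t offset[];  /* flexible array member, page_count entries */
    } MultiFDPacket_t;

    /* sizeof(MultiFDPacket_t) does not include offset[], so indexing must
     * step by the allocated size, not the declared one: */
    header = &p->packet[p->packet_idx];                           /* wrong here */
    header = (void *)p->packet + p->packet_idx * p->packet_len;   /* correct   */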

> 
> >      for (i = 0; i < thread_count; i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> > +        int j;
> 
> For new code you can:
> 
> >          qemu_mutex_init(&p->mutex);
> >          qemu_sem_init(&p->sem, 0);
> > @@ -940,9 +940,13 @@ int multifd_save_setup(Error **errp)
> >          p->pages = multifd_pages_init(page_count);
> >          p->packet_len = sizeof(MultiFDPacket_t)
> >                        + sizeof(uint64_t) * page_count;
> > -        p->packet = g_malloc0(p->packet_len);
> > -        p->packet->magic = cpu_to_be32(MULTIFD_MAGIC);
> > -        p->packet->version = cpu_to_be32(MULTIFD_VERSION);
> > +        p->packet = g_malloc0_n(HEADER_ARR_SZ, p->packet_len);
> > +        for (j = 0; j < HEADER_ARR_SZ ; j++) {
> 
>            for (int j = 0; j < HEADER_ARR_SZ ; j++) {

Oh, sweet. I am very fond of C99, just not used to it getting accepted upstream.
Thanks for the tip :)

> 
> > +            MultiFDPacket_t *packet = (void *)p->packet + j * p->packet_len;
> > +            packet->magic = cpu_to_be32(MULTIFD_MAGIC);
> > +            packet->version = cpu_to_be32(MULTIFD_VERSION);
> 
> Can't you use here:
> 
>             packet[j].magic = cpu_to_be32(MULTIFD_MAGIC);
>             packet[j].version = cpu_to_be32(MULTIFD_VERSION);
> 
> And call it a day?

Not really: the actual size of each array element is different from
sizeof(MultiFDPacket_t), so it does not work. See above.

> 
> The rest is fine for me.  Thanks for the effort.

Thanks for reviewing Juan!

Leo

> 
> Later, Juan.
> 




* Re: [RFC PATCH 4/4] migration/multifd/zero-copy: Flush only the LRU half of the header array
  2022-10-25 10:07     ` Juan Quintela
@ 2022-10-25 13:47       ` Leonardo Brás
  0 siblings, 0 replies; 12+ messages in thread
From: Leonardo Brás @ 2022-10-25 13:47 UTC (permalink / raw)
  To: quintela, Daniel P.Berrangé
  Cc: Dr. David Alan Gilbert, Peter Xu, qemu-devel

On Tue, 2022-10-25 at 12:07 +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > On Tue, Oct 25, 2022 at 01:47:31AM -0300, Leonardo Bras wrote:
> > > Zero-copy multifd migration sends both the header and the memory pages in a
> > > single syscall. Since it's necessary to flush before reusing the header, a
> > > header array was implemented, so each write call uses a different
> > > array, and flushing only take place after all headers have been used,
> > > meaning 1 flush for each N writes.
> > > 
> > > This method has a bottleneck, though: After the last write, a flush will
> > > have to wait for all writes to finish, which will be a lot, meaning the
> > > recvmsg() syscall called in qio_channel_socket_flush() will be called a
> > > lot. On top of that, it will create a time period when the I/O queue is
> > > empty and nothing is getting send: between the flush and the next write.
> > > 
> > > To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
> > > until at most half of the array is still in use. (i.e. the LRU half of the
> > > array can be reused)
> > > 
> > > Flushing for the LRU half of the array is much faster, since it does not
> > > have to wait for the most recent writes to finish, making up for having
> > > to flush twice per array.
> > > 
> > > As a main benefit, this approach keeps the I/O queue from being empty while
> > > there are still data to be sent, making it easier to keep the I/O maximum
> > > throughput while consuming less cpu time.
> > 
> > Doesn't this defeat the reason for adding the flush in the first
> > place, which was to ensure that a migration iteration was fully
> > sent before starting the next iteration over RAM ? If it is OK to
> > only partially flush on each iteration, then why do we need to
> > flush at all ?
> 
> Now we need to do the flush in two places:
> - on sync_main (the old place)
> - on the migration thread, when we run out of array entries.
>   This one has nothing to do with correctness, it is just that we have
>   more space than needed.

That's correct! In fact, sync_main (the old place) calls flush with
max_remaining = 0, meaning it only returns when there is nothing else
pending.

> 
> So, in this regard, I think that the patches are correct.
> 
> But on the other hand, I am not sure that I like the size of the array.
> Leonardo is using 1024 entries for each migration channel.  That means
> that it allows it to have 1024 entries * 512 KB each packet is 512MB of
> unflushed data in each channel.  I think that this is still too much.

The size is correct, meaning we need to allow 512MB of locked memory per multifd
channel.

> 
> I will need some data from testing, but my understanding on how Zero
> Copy work is that having around 10MB in each channel would be more than
> enough to saturate the link.  And once that the data inflight is
> smaller, we can just flush it when we get out of channels.
> 
> My idea here was to work the size the other way around, add a parameter
> to the user about how much memory is he available for mlocking, and
> just do a memory/channels array entries on each channel.  That will:
> 
> a - limit the amount of mlocked memory that we need
>     10MB/channel for 10 channels is 100MB of mlocked memory, for a guest
>     that has lots of Gigabytes of RAM looks reasonable.
> 
> b - We don't synchronize after each write, because I still claim than
>     doing a non asynchronous write on the channel just syncs everything
>     (otherwise I can't see how the hardware can even work).

So the user sets the amount of locked memory available and we split it
between the channels during migration. That seems like a good strategy, since
it makes good use of the hardware regardless of the channel count.

> 
> So I guess that the best thing to decide this is:
> - get a couple of nice 100Gigabit networking cards
> - Enable 16 channels or so, so we know that the CPU is not going to be
>   the bottleneck
> - test with this patch
> - remove last two patches and test with several array sizes
>   (10, 20, 30,..) and we will see that after some size, performance will
>   not improve at all.
> - Got that value as default one.
> 
> What do you think?

Most of it is good. I only disagree on removing the last two patches.

I did some tests with smaller array sizes, and without the last two patches
that would totally destroy performance. I blame the fact that every N writes
it has to wait for the queue to be completely empty before adding more data
to send.

Also, waiting for the completion of a send that was just scheduled takes
multiple recvmsg() syscalls, raising CPU usage.

Other than that, seems a good strategy.

Thank you both for reviewing!

Best regards,
Leo

> 
> Later, Juan.
> 
> PD.  Yes, I don't like to add another parameter, so you can recompile
>      with different values, or we will not add the parameter once that
>      we find a right value.
> 




* Re: [RFC PATCH 0/4] MultiFD zero-copy improvements
  2022-10-25  4:47 [RFC PATCH 0/4] MultiFD zero-copy improvements Leonardo Bras
                   ` (3 preceding siblings ...)
  2022-10-25  4:47 ` [RFC PATCH 4/4] migration/multifd/zero-copy: Flush only the LRU half of the header array Leonardo Bras
@ 2022-10-25 16:36 ` Peter Xu
  4 siblings, 0 replies; 12+ messages in thread
From: Peter Xu @ 2022-10-25 16:36 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel

On Tue, Oct 25, 2022 at 01:47:27AM -0300, Leonardo Bras wrote:
> It all works fine in my tests, but maybe I missed some cornercase.
> Please provide any feedback you find fit.

Personally I very much like the whole series. :)

I just want to make sure it won't start failing for users when they switch to
this new version of QEMU, due to locked-memory restrictions which I expect to
grow.  But I see that there are discussions on patch 4 about further
optimizing the locked memory, so let's keep that discussion there.

The other thing is that since this is a patchset purely for performance
optimization, it would be nice to attach performance numbers.  If it does not
show a difference worth mentioning on slow networks, that is worth stating
too.  And ultimately I believe anyone would be interested to know in which
cases (e.g. a 100G NIC?) this starts to pay off.

Thanks!

-- 
Peter Xu



