* [PATCH v4 0/4] MSG_ZEROCOPY fixes & improvements
@ 2022-06-20  5:39 Leonardo Bras
  2022-06-20  5:39 ` [PATCH v4 1/4] QIOChannelSocket: Introduce assert and reduce ifdefs to improve readability Leonardo Bras
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Leonardo Bras @ 2022-06-20  5:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Peter Xu, 徐闯
  Cc: Leonardo Bras, qemu-devel

Patches 1 & 2 are about a fix needed to make zero-copy flush work.
The code that increments the sendmsg counter went missing in the last
versions of the patchset, causing the flush to never happen.

Patch 3 changes when the flush happens, which improves sending
performance.

Patch 4 makes zero-copy-send a migration capability instead of a
migration parameter, which makes more sense and simplifies the libvirt
implementation.

Leonardo Bras (4):
  QIOChannelSocket: Introduce assert and reduce ifdefs to improve
    readability
  QIOChannelSocket: Fix zero-copy send so socket flush works
  migration: zero-copy flush only at the end of bitmap scanning
  migration: Change zero_copy_send from migration parameter to migration
    capability

 qapi/migration.json   | 33 +++++++-----------------
 migration/multifd.h   |  1 +
 io/channel-socket.c   | 19 ++++++++++----
 migration/migration.c | 52 ++++++++++++++------------------------
 migration/multifd.c   | 58 ++++++++++++++++++++++++-------------------
 migration/ram.c       |  7 ++++++
 monitor/hmp-cmds.c    |  6 -----
 7 files changed, 83 insertions(+), 93 deletions(-)

-- 
2.36.1




* [PATCH v4 1/4] QIOChannelSocket: Introduce assert and reduce ifdefs to improve readability
  2022-06-20  5:39 [PATCH v4 0/4] MSG_ZEROCOPY fixes & improvements Leonardo Bras
@ 2022-06-20  5:39 ` Leonardo Bras
  2022-06-20  8:47   ` Juan Quintela
  2022-06-20 15:27   ` Peter Xu
  2022-06-20  5:39 ` [PATCH v4 2/4] QIOChannelSocket: Fix zero-copy send so socket flush works Leonardo Bras
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 21+ messages in thread
From: Leonardo Bras @ 2022-06-20  5:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Peter Xu, 徐闯
  Cc: Leonardo Bras, qemu-devel

During implementation of the MSG_ZEROCOPY feature, a lot of #ifdefs were
introduced, particularly at qio_channel_socket_writev().

Rewrite some of those changes so the code is easier to read.

Also, introduce an assert to help detect incorrect zero-copy usage when
it's disabled at build time.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 io/channel-socket.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/io/channel-socket.c b/io/channel-socket.c
index dc9c165de1..dac9e60c20 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -578,11 +578,17 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
         memcpy(CMSG_DATA(cmsg), fds, fdsize);
     }
 
-#ifdef QEMU_MSG_ZEROCOPY
     if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+#ifdef QEMU_MSG_ZEROCOPY
         sflags = MSG_ZEROCOPY;
-    }
+#else
+        /*
+         * We expect QIOChannel class entry point to have
+         * blocked this code path already
+         */
+        g_assert_not_reached();
 #endif
+    }
 
  retry:
     ret = sendmsg(sioc->fd, &msg, sflags);
@@ -592,15 +598,13 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
             return QIO_CHANNEL_ERR_BLOCK;
         case EINTR:
             goto retry;
-#ifdef QEMU_MSG_ZEROCOPY
         case ENOBUFS:
-            if (sflags & MSG_ZEROCOPY) {
+            if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
                 error_setg_errno(errp, errno,
                                  "Process can't lock enough memory for using MSG_ZEROCOPY");
                 return -1;
             }
             break;
-#endif
         }
 
         error_setg_errno(errp, errno,
-- 
2.36.1




* [PATCH v4 2/4] QIOChannelSocket: Fix zero-copy send so socket flush works
  2022-06-20  5:39 [PATCH v4 0/4] MSG_ZEROCOPY fixes & improvements Leonardo Bras
  2022-06-20  5:39 ` [PATCH v4 1/4] QIOChannelSocket: Introduce assert and reduce ifdefs to improve readability Leonardo Bras
@ 2022-06-20  5:39 ` Leonardo Bras
  2022-06-20  8:48   ` Juan Quintela
  2022-06-20 15:27   ` Peter Xu
  2022-06-20  5:39 ` [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning Leonardo Bras
  2022-06-20  5:39 ` [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability Leonardo Bras
  3 siblings, 2 replies; 21+ messages in thread
From: Leonardo Bras @ 2022-06-20  5:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Peter Xu, 徐闯
  Cc: Leonardo Bras, qemu-devel

Somewhere between v6 and v7 of the zero-copy-send patchset, a crucial
part of the flushing mechanism went missing: incrementing zero_copy_queued.

Without that, the flushing interface becomes a no-op, and there is no
guarantee the buffer is really sent.

This can be as bad as causing RAM corruption during migration.

Fixes: 2bc58ffc2926 ("QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX")
Reported-by: 徐闯 <xuchuangxclwt@bytedance.com>
Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 io/channel-socket.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/io/channel-socket.c b/io/channel-socket.c
index dac9e60c20..4fa0402f54 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -611,6 +611,11 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
                          "Unable to write to socket");
         return -1;
     }
+
+    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+        sioc->zero_copy_queued++;
+    }
+
     return ret;
 }
 #else /* WIN32 */
-- 
2.36.1




* [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning
  2022-06-20  5:39 [PATCH v4 0/4] MSG_ZEROCOPY fixes & improvements Leonardo Bras
  2022-06-20  5:39 ` [PATCH v4 1/4] QIOChannelSocket: Introduce assert and reduce ifdefs to improve readability Leonardo Bras
  2022-06-20  5:39 ` [PATCH v4 2/4] QIOChannelSocket: Fix zero-copy send so socket flush works Leonardo Bras
@ 2022-06-20  5:39 ` Leonardo Bras
  2022-06-20  9:23   ` Juan Quintela
  2022-06-20  5:39 ` [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability Leonardo Bras
  3 siblings, 1 reply; 21+ messages in thread
From: Leonardo Bras @ 2022-06-20  5:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Peter Xu, 徐闯
  Cc: Leonardo Bras, qemu-devel

When sending memory pages with MSG_ZEROCOPY, it's necessary to flush
to make sure all dirty pages are sent before any newer version of them
is sent.

Currently, the flush happens every time at the end of ram_save_iterate(),
which usually happens around 20x per second, due to a timeout.

Change it so the flush happens only after a whole scan of the dirty bitmap,
so it never sends a newer version of a page before an older one, while
avoiding unnecessary overhead.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.h |  1 +
 migration/multifd.c | 58 ++++++++++++++++++++++++++-------------------
 migration/ram.c     |  7 ++++++
 3 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 4d8d89e5e5..e7cbdf1fb4 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -22,6 +22,7 @@ bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
 int multifd_send_sync_main(QEMUFile *f);
 int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
+int multifd_zero_copy_flush(void);
 
 /* Multifd Compression flags */
 #define MULTIFD_FLAG_SYNC (1 << 0)
diff --git a/migration/multifd.c b/migration/multifd.c
index 9282ab6aa4..ce4220a97d 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -566,10 +566,42 @@ void multifd_save_cleanup(void)
     multifd_send_state = NULL;
 }
 
+/*
+ * Flush zero-copy writes on every multifd channel
+ *
+ * When using zero-copy, it's necessary to flush the pages before any of
+ * the pages can be sent again, so we'll make sure the new version of the
+ * pages will always arrive _later_ than the old pages.
+ *
+ * Should be called only after we finished one whole scanning of
+ * all the dirty bitmaps.
+ */
+int multifd_zero_copy_flush(void)
+{
+    int i;
+    Error *local_err = NULL;
+
+    if (!migrate_use_multifd()) {
+        return 0;
+    }
+
+    for (i = 0; i < migrate_multifd_channels(); i++) {
+        MultiFDSendParams *p = &multifd_send_state->params[i];
+        int ret;
+
+        ret = qio_channel_flush(p->c, &local_err);
+        if (ret < 0) {
+            error_report_err(local_err);
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 int multifd_send_sync_main(QEMUFile *f)
 {
     int i;
-    bool flush_zero_copy;
 
     if (!migrate_use_multifd()) {
         return 0;
@@ -581,19 +613,6 @@ int multifd_send_sync_main(QEMUFile *f)
         }
     }
 
-    /*
-     * When using zero-copy, it's necessary to flush the pages before any of
-     * the pages can be sent again, so we'll make sure the new version of the
-     * pages will always arrive _later_ than the old pages.
-     *
-     * Currently we achieve this by flushing the zero-page requested writes
-     * per ram iteration, but in the future we could potentially optimize it
-     * to be less frequent, e.g. only after we finished one whole scanning of
-     * all the dirty bitmaps.
-     */
-
-    flush_zero_copy = migrate_use_zero_copy_send();
-
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
@@ -615,17 +634,6 @@ int multifd_send_sync_main(QEMUFile *f)
         ram_counters.transferred += p->packet_len;
         qemu_mutex_unlock(&p->mutex);
         qemu_sem_post(&p->sem);
-
-        if (flush_zero_copy && p->c) {
-            int ret;
-            Error *err = NULL;
-
-            ret = qio_channel_flush(p->c, &err);
-            if (ret < 0) {
-                error_report_err(err);
-                return -1;
-            }
-        }
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
diff --git a/migration/ram.c b/migration/ram.c
index 5f5e37f64d..514584e44f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2288,6 +2288,13 @@ static int ram_find_and_save_block(RAMState *rs)
     rs->last_seen_block = pss.block;
     rs->last_page = pss.page;
 
+    if (pss.complete_round && migrate_use_zero_copy_send()) {
+        int ret = multifd_zero_copy_flush();
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
     return pages;
 }
 
-- 
2.36.1




* [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability
  2022-06-20  5:39 [PATCH v4 0/4] MSG_ZEROCOPY fixes & improvements Leonardo Bras
                   ` (2 preceding siblings ...)
  2022-06-20  5:39 ` [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning Leonardo Bras
@ 2022-06-20  5:39 ` Leonardo Bras
  2022-06-20  5:46   ` Leonardo Bras Soares Passos
                     ` (3 more replies)
  3 siblings, 4 replies; 21+ messages in thread
From: Leonardo Bras @ 2022-06-20  5:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Peter Xu, 徐闯
  Cc: Leonardo Bras, qemu-devel

When originally implemented, zero_copy_send was designed as a Migration
parameter.

But considering how it is supposed to work, and the difference between
a capability and a parameter, zero-copy-send works better as a
capability.

Since the change was merged only recently, there is still time to make
it right and convert zero_copy_send into a Migration capability.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 qapi/migration.json   | 33 ++++++++-------------------
 migration/migration.c | 52 ++++++++++++++++---------------------------
 monitor/hmp-cmds.c    |  6 -----
 3 files changed, 28 insertions(+), 63 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 6130cd9fae..baf8d734de 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -461,6 +461,13 @@
 #                       procedure starts. The VM RAM is saved with running VM.
 #                       (since 6.0)
 #
+# @zero-copy-send: Controls behavior on sending memory pages on migration.
+#                  When true, enables a zero-copy mechanism for sending
+#                  memory pages, if host supports it.
+#                  Requires that QEMU be permitted to use locked memory
+#                  for guest RAM pages.
+#                  (since 7.1)
+#
 # Features:
 # @unstable: Members @x-colo and @x-ignore-shared are experimental.
 #
@@ -474,7 +481,8 @@
            'block', 'return-path', 'pause-before-switchover', 'multifd',
            'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
            { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
-           'validate-uuid', 'background-snapshot'] }
+           'validate-uuid', 'background-snapshot',
+           { 'name': 'zero-copy-send', 'if' : 'CONFIG_LINUX'}] }
 
 ##
 # @MigrationCapabilityStatus:
@@ -738,12 +746,6 @@
 #                      will consume more CPU.
 #                      Defaults to 1. (Since 5.0)
 #
-# @zero-copy-send: Controls behavior on sending memory pages on migration.
-#                  When true, enables a zero-copy mechanism for sending
-#                  memory pages, if host supports it.
-#                  Requires that QEMU be permitted to use locked memory
-#                  for guest RAM pages.
-#                  Defaults to false. (Since 7.1)
 #
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #                        aliases for the purpose of dirty bitmap migration.  Such
@@ -784,7 +786,6 @@
            'xbzrle-cache-size', 'max-postcopy-bandwidth',
            'max-cpu-throttle', 'multifd-compression',
            'multifd-zlib-level' ,'multifd-zstd-level',
-           { 'name': 'zero-copy-send', 'if' : 'CONFIG_LINUX'},
            'block-bitmap-mapping' ] }
 
 ##
@@ -911,13 +912,6 @@
 #                      will consume more CPU.
 #                      Defaults to 1. (Since 5.0)
 #
-# @zero-copy-send: Controls behavior on sending memory pages on migration.
-#                  When true, enables a zero-copy mechanism for sending
-#                  memory pages, if host supports it.
-#                  Requires that QEMU be permitted to use locked memory
-#                  for guest RAM pages.
-#                  Defaults to false. (Since 7.1)
-#
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #                        aliases for the purpose of dirty bitmap migration.  Such
 #                        aliases may for example be the corresponding names on the
@@ -972,7 +966,6 @@
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
-            '*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
             '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
 
 ##
@@ -1119,13 +1112,6 @@
 #                      will consume more CPU.
 #                      Defaults to 1. (Since 5.0)
 #
-# @zero-copy-send: Controls behavior on sending memory pages on migration.
-#                  When true, enables a zero-copy mechanism for sending
-#                  memory pages, if host supports it.
-#                  Requires that QEMU be permitted to use locked memory
-#                  for guest RAM pages.
-#                  Defaults to false. (Since 7.1)
-#
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #                        aliases for the purpose of dirty bitmap migration.  Such
 #                        aliases may for example be the corresponding names on the
@@ -1178,7 +1164,6 @@
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
-            '*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
             '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
 
 ##
diff --git a/migration/migration.c b/migration/migration.c
index 31739b2af9..cc253d66e3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -163,7 +163,8 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
     MIGRATION_CAPABILITY_COMPRESS,
     MIGRATION_CAPABILITY_XBZRLE,
     MIGRATION_CAPABILITY_X_COLO,
-    MIGRATION_CAPABILITY_VALIDATE_UUID);
+    MIGRATION_CAPABILITY_VALIDATE_UUID,
+    MIGRATION_CAPABILITY_ZERO_COPY_SEND);
 
 /* When we add fault tolerance, we could have several
    migrations at once.  For now we don't need to add
@@ -910,10 +911,6 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->multifd_zlib_level = s->parameters.multifd_zlib_level;
     params->has_multifd_zstd_level = true;
     params->multifd_zstd_level = s->parameters.multifd_zstd_level;
-#ifdef CONFIG_LINUX
-    params->has_zero_copy_send = true;
-    params->zero_copy_send = s->parameters.zero_copy_send;
-#endif
     params->has_xbzrle_cache_size = true;
     params->xbzrle_cache_size = s->parameters.xbzrle_cache_size;
     params->has_max_postcopy_bandwidth = true;
@@ -1275,6 +1272,18 @@ static bool migrate_caps_check(bool *cap_list,
         }
     }
 
+#ifdef CONFIG_LINUX
+    if (cap_list[MIGRATION_CAPABILITY_ZERO_COPY_SEND] &&
+        (!cap_list[MIGRATION_CAPABILITY_MULTIFD] ||
+         migrate_use_compression() ||
+         migrate_use_tls())) {
+        error_setg(errp,
+                   "Zero copy only available for non-compressed non-TLS multifd migration");
+        return false;
+    }
+#endif
+
+
     /* incoming side only */
     if (runstate_check(RUN_STATE_INMIGRATE) &&
         !migrate_multi_channels_is_allowed() &&
@@ -1497,16 +1506,6 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
         error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
         return false;
     }
-#ifdef CONFIG_LINUX
-    if (params->zero_copy_send &&
-        (!migrate_use_multifd() ||
-         params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
-         (params->tls_creds && *params->tls_creds))) {
-        error_setg(errp,
-                   "Zero copy only available for non-compressed non-TLS multifd migration");
-        return false;
-    }
-#endif
     return true;
 }
 
@@ -1580,11 +1579,6 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_multifd_compression) {
         dest->multifd_compression = params->multifd_compression;
     }
-#ifdef CONFIG_LINUX
-    if (params->has_zero_copy_send) {
-        dest->zero_copy_send = params->zero_copy_send;
-    }
-#endif
     if (params->has_xbzrle_cache_size) {
         dest->xbzrle_cache_size = params->xbzrle_cache_size;
     }
@@ -1697,11 +1691,6 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_multifd_compression) {
         s->parameters.multifd_compression = params->multifd_compression;
     }
-#ifdef CONFIG_LINUX
-    if (params->has_zero_copy_send) {
-        s->parameters.zero_copy_send = params->zero_copy_send;
-    }
-#endif
     if (params->has_xbzrle_cache_size) {
         s->parameters.xbzrle_cache_size = params->xbzrle_cache_size;
         xbzrle_cache_resize(params->xbzrle_cache_size, errp);
@@ -2593,7 +2582,7 @@ bool migrate_use_zero_copy_send(void)
 
     s = migrate_get_current();
 
-    return s->parameters.zero_copy_send;
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_ZERO_COPY_SEND];
 }
 #endif
 
@@ -4249,10 +4238,6 @@ static Property migration_properties[] = {
     DEFINE_PROP_UINT8("multifd-zstd-level", MigrationState,
                       parameters.multifd_zstd_level,
                       DEFAULT_MIGRATE_MULTIFD_ZSTD_LEVEL),
-#ifdef CONFIG_LINUX
-    DEFINE_PROP_BOOL("zero_copy_send", MigrationState,
-                      parameters.zero_copy_send, false),
-#endif
     DEFINE_PROP_SIZE("xbzrle-cache-size", MigrationState,
                       parameters.xbzrle_cache_size,
                       DEFAULT_MIGRATE_XBZRLE_CACHE_SIZE),
@@ -4290,6 +4275,10 @@ static Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
     DEFINE_PROP_MIG_CAP("x-background-snapshot",
             MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
+#ifdef CONFIG_LINUX
+    DEFINE_PROP_MIG_CAP("x-zero-copy-send",
+            MIGRATION_CAPABILITY_ZERO_COPY_SEND),
+#endif
 
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -4350,9 +4339,6 @@ static void migration_instance_init(Object *obj)
     params->has_multifd_compression = true;
     params->has_multifd_zlib_level = true;
     params->has_multifd_zstd_level = true;
-#ifdef CONFIG_LINUX
-    params->has_zero_copy_send = true;
-#endif
     params->has_xbzrle_cache_size = true;
     params->has_max_postcopy_bandwidth = true;
     params->has_max_cpu_throttle = true;
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 47a27326ee..ca98df0495 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1311,12 +1311,6 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_multifd_zstd_level = true;
         visit_type_uint8(v, param, &p->multifd_zstd_level, &err);
         break;
-#ifdef CONFIG_LINUX
-    case MIGRATION_PARAMETER_ZERO_COPY_SEND:
-        p->has_zero_copy_send = true;
-        visit_type_bool(v, param, &p->zero_copy_send, &err);
-        break;
-#endif
     case MIGRATION_PARAMETER_XBZRLE_CACHE_SIZE:
         p->has_xbzrle_cache_size = true;
         if (!visit_type_size(v, param, &cache_size, &err)) {
-- 
2.36.1




* Re: [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability
  2022-06-20  5:39 ` [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability Leonardo Bras
@ 2022-06-20  5:46   ` Leonardo Bras Soares Passos
  2022-06-20  9:34   ` Juan Quintela
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-06-20  5:46 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Peter Xu, 徐闯,
	Jiri Denemark
  Cc: qemu-devel

CC: Jiri Denemark <jdenemar@redhat.com>

On Mon, Jun 20, 2022 at 2:40 AM Leonardo Bras <leobras@redhat.com> wrote:
>
> When originally implemented, zero_copy_send was designed as a Migration
> parameter.
>
> But considering how it is supposed to work, and the difference between
> a capability and a parameter, zero-copy-send works better as a
> capability.
>
> Since the change was merged only recently, there is still time to make
> it right and convert zero_copy_send into a Migration capability.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  qapi/migration.json   | 33 ++++++++-------------------
>  migration/migration.c | 52 ++++++++++++++++---------------------------
>  monitor/hmp-cmds.c    |  6 -----
>  3 files changed, 28 insertions(+), 63 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 6130cd9fae..baf8d734de 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -461,6 +461,13 @@
>  #                       procedure starts. The VM RAM is saved with running VM.
>  #                       (since 6.0)
>  #
> +# @zero-copy-send: Controls behavior on sending memory pages on migration.
> +#                  When true, enables a zero-copy mechanism for sending
> +#                  memory pages, if host supports it.
> +#                  Requires that QEMU be permitted to use locked memory
> +#                  for guest RAM pages.
> +#                  (since 7.1)
> +#
>  # Features:
>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>  #
> @@ -474,7 +481,8 @@
>             'block', 'return-path', 'pause-before-switchover', 'multifd',
>             'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
> -           'validate-uuid', 'background-snapshot'] }
> +           'validate-uuid', 'background-snapshot',
> +           { 'name': 'zero-copy-send', 'if' : 'CONFIG_LINUX'}] }
>
>  ##
>  # @MigrationCapabilityStatus:
> @@ -738,12 +746,6 @@
>  #                      will consume more CPU.
>  #                      Defaults to 1. (Since 5.0)
>  #
> -# @zero-copy-send: Controls behavior on sending memory pages on migration.
> -#                  When true, enables a zero-copy mechanism for sending
> -#                  memory pages, if host supports it.
> -#                  Requires that QEMU be permitted to use locked memory
> -#                  for guest RAM pages.
> -#                  Defaults to false. (Since 7.1)
>  #
>  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
>  #                        aliases for the purpose of dirty bitmap migration.  Such
> @@ -784,7 +786,6 @@
>             'xbzrle-cache-size', 'max-postcopy-bandwidth',
>             'max-cpu-throttle', 'multifd-compression',
>             'multifd-zlib-level' ,'multifd-zstd-level',
> -           { 'name': 'zero-copy-send', 'if' : 'CONFIG_LINUX'},
>             'block-bitmap-mapping' ] }
>
>  ##
> @@ -911,13 +912,6 @@
>  #                      will consume more CPU.
>  #                      Defaults to 1. (Since 5.0)
>  #
> -# @zero-copy-send: Controls behavior on sending memory pages on migration.
> -#                  When true, enables a zero-copy mechanism for sending
> -#                  memory pages, if host supports it.
> -#                  Requires that QEMU be permitted to use locked memory
> -#                  for guest RAM pages.
> -#                  Defaults to false. (Since 7.1)
> -#
>  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
>  #                        aliases for the purpose of dirty bitmap migration.  Such
>  #                        aliases may for example be the corresponding names on the
> @@ -972,7 +966,6 @@
>              '*multifd-compression': 'MultiFDCompression',
>              '*multifd-zlib-level': 'uint8',
>              '*multifd-zstd-level': 'uint8',
> -            '*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
>              '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
>
>  ##
> @@ -1119,13 +1112,6 @@
>  #                      will consume more CPU.
>  #                      Defaults to 1. (Since 5.0)
>  #
> -# @zero-copy-send: Controls behavior on sending memory pages on migration.
> -#                  When true, enables a zero-copy mechanism for sending
> -#                  memory pages, if host supports it.
> -#                  Requires that QEMU be permitted to use locked memory
> -#                  for guest RAM pages.
> -#                  Defaults to false. (Since 7.1)
> -#
>  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
>  #                        aliases for the purpose of dirty bitmap migration.  Such
>  #                        aliases may for example be the corresponding names on the
> @@ -1178,7 +1164,6 @@
>              '*multifd-compression': 'MultiFDCompression',
>              '*multifd-zlib-level': 'uint8',
>              '*multifd-zstd-level': 'uint8',
> -            '*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
>              '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
>
>  ##
> diff --git a/migration/migration.c b/migration/migration.c
> index 31739b2af9..cc253d66e3 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -163,7 +163,8 @@ INITIALIZE_MIGRATE_CAPS_SET(check_caps_background_snapshot,
>      MIGRATION_CAPABILITY_COMPRESS,
>      MIGRATION_CAPABILITY_XBZRLE,
>      MIGRATION_CAPABILITY_X_COLO,
> -    MIGRATION_CAPABILITY_VALIDATE_UUID);
> +    MIGRATION_CAPABILITY_VALIDATE_UUID,
> +    MIGRATION_CAPABILITY_ZERO_COPY_SEND);
>
>  /* When we add fault tolerance, we could have several
>     migrations at once.  For now we don't need to add
> @@ -910,10 +911,6 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
>      params->multifd_zlib_level = s->parameters.multifd_zlib_level;
>      params->has_multifd_zstd_level = true;
>      params->multifd_zstd_level = s->parameters.multifd_zstd_level;
> -#ifdef CONFIG_LINUX
> -    params->has_zero_copy_send = true;
> -    params->zero_copy_send = s->parameters.zero_copy_send;
> -#endif
>      params->has_xbzrle_cache_size = true;
>      params->xbzrle_cache_size = s->parameters.xbzrle_cache_size;
>      params->has_max_postcopy_bandwidth = true;
> @@ -1275,6 +1272,18 @@ static bool migrate_caps_check(bool *cap_list,
>          }
>      }
>
> +#ifdef CONFIG_LINUX
> +    if (cap_list[MIGRATION_CAPABILITY_ZERO_COPY_SEND] &&
> +        (!cap_list[MIGRATION_CAPABILITY_MULTIFD] ||
> +         migrate_use_compression() ||
> +         migrate_use_tls())) {
> +        error_setg(errp,
> +                   "Zero copy only available for non-compressed non-TLS multifd migration");
> +        return false;
> +    }
> +#endif
> +
> +
>      /* incoming side only */
>      if (runstate_check(RUN_STATE_INMIGRATE) &&
>          !migrate_multi_channels_is_allowed() &&
> @@ -1497,16 +1506,6 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
>          error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
>          return false;
>      }
> -#ifdef CONFIG_LINUX
> -    if (params->zero_copy_send &&
> -        (!migrate_use_multifd() ||
> -         params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
> -         (params->tls_creds && *params->tls_creds))) {
> -        error_setg(errp,
> -                   "Zero copy only available for non-compressed non-TLS multifd migration");
> -        return false;
> -    }
> -#endif
>      return true;
>  }
>
> @@ -1580,11 +1579,6 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
>      if (params->has_multifd_compression) {
>          dest->multifd_compression = params->multifd_compression;
>      }
> -#ifdef CONFIG_LINUX
> -    if (params->has_zero_copy_send) {
> -        dest->zero_copy_send = params->zero_copy_send;
> -    }
> -#endif
>      if (params->has_xbzrle_cache_size) {
>          dest->xbzrle_cache_size = params->xbzrle_cache_size;
>      }
> @@ -1697,11 +1691,6 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
>      if (params->has_multifd_compression) {
>          s->parameters.multifd_compression = params->multifd_compression;
>      }
> -#ifdef CONFIG_LINUX
> -    if (params->has_zero_copy_send) {
> -        s->parameters.zero_copy_send = params->zero_copy_send;
> -    }
> -#endif
>      if (params->has_xbzrle_cache_size) {
>          s->parameters.xbzrle_cache_size = params->xbzrle_cache_size;
>          xbzrle_cache_resize(params->xbzrle_cache_size, errp);
> @@ -2593,7 +2582,7 @@ bool migrate_use_zero_copy_send(void)
>
>      s = migrate_get_current();
>
> -    return s->parameters.zero_copy_send;
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_ZERO_COPY_SEND];
>  }
>  #endif
>
> @@ -4249,10 +4238,6 @@ static Property migration_properties[] = {
>      DEFINE_PROP_UINT8("multifd-zstd-level", MigrationState,
>                        parameters.multifd_zstd_level,
>                        DEFAULT_MIGRATE_MULTIFD_ZSTD_LEVEL),
> -#ifdef CONFIG_LINUX
> -    DEFINE_PROP_BOOL("zero_copy_send", MigrationState,
> -                      parameters.zero_copy_send, false),
> -#endif
>      DEFINE_PROP_SIZE("xbzrle-cache-size", MigrationState,
>                        parameters.xbzrle_cache_size,
>                        DEFAULT_MIGRATE_XBZRLE_CACHE_SIZE),
> @@ -4290,6 +4275,10 @@ static Property migration_properties[] = {
>      DEFINE_PROP_MIG_CAP("x-multifd", MIGRATION_CAPABILITY_MULTIFD),
>      DEFINE_PROP_MIG_CAP("x-background-snapshot",
>              MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT),
> +#ifdef CONFIG_LINUX
> +    DEFINE_PROP_MIG_CAP("x-zero-copy-send",
> +            MIGRATION_CAPABILITY_ZERO_COPY_SEND),
> +#endif
>
>      DEFINE_PROP_END_OF_LIST(),
>  };
> @@ -4350,9 +4339,6 @@ static void migration_instance_init(Object *obj)
>      params->has_multifd_compression = true;
>      params->has_multifd_zlib_level = true;
>      params->has_multifd_zstd_level = true;
> -#ifdef CONFIG_LINUX
> -    params->has_zero_copy_send = true;
> -#endif
>      params->has_xbzrle_cache_size = true;
>      params->has_max_postcopy_bandwidth = true;
>      params->has_max_cpu_throttle = true;
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index 47a27326ee..ca98df0495 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -1311,12 +1311,6 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
>          p->has_multifd_zstd_level = true;
>          visit_type_uint8(v, param, &p->multifd_zstd_level, &err);
>          break;
> -#ifdef CONFIG_LINUX
> -    case MIGRATION_PARAMETER_ZERO_COPY_SEND:
> -        p->has_zero_copy_send = true;
> -        visit_type_bool(v, param, &p->zero_copy_send, &err);
> -        break;
> -#endif
>      case MIGRATION_PARAMETER_XBZRLE_CACHE_SIZE:
>          p->has_xbzrle_cache_size = true;
>          if (!visit_type_size(v, param, &cache_size, &err)) {
> --
> 2.36.1
>




* Re: [PATCH v4 1/4] QIOChannelSocket: Introduce assert and reduce ifdefs to improve readability
  2022-06-20  5:39 ` [PATCH v4 1/4] QIOChannelSocket: Introduce assert and reduce ifdefs to improve readability Leonardo Bras
@ 2022-06-20  8:47   ` Juan Quintela
  2022-06-20 15:27   ` Peter Xu
  1 sibling, 0 replies; 21+ messages in thread
From: Juan Quintela @ 2022-06-20  8:47 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Daniel P. Berrangé,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster, Peter Xu,
	徐闯,
	qemu-devel

Leonardo Bras <leobras@redhat.com> wrote:
> During implementation of the MSG_ZEROCOPY feature, a lot of #ifdefs were
> introduced, particularly at qio_channel_socket_writev().
>
> Rewrite some of those changes so the code is easier to read.
>
> Also, introduce an assert to help detect incorrect zero-copy usage when
> it's disabled at build time.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>




* Re: [PATCH v4 2/4] QIOChannelSocket: Fix zero-copy send so socket flush works
  2022-06-20  5:39 ` [PATCH v4 2/4] QIOChannelSocket: Fix zero-copy send so socket flush works Leonardo Bras
@ 2022-06-20  8:48   ` Juan Quintela
  2022-06-20 15:27   ` Peter Xu
  1 sibling, 0 replies; 21+ messages in thread
From: Juan Quintela @ 2022-06-20  8:48 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Daniel P. Berrangé,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster, Peter Xu,
	徐闯,
	qemu-devel

Leonardo Bras <leobras@redhat.com> wrote:
> Somewhere between v6 and v7 of the zero-copy-send patchset, a crucial
> part of the flushing mechanism went missing: incrementing zero_copy_queued.
>
> Without that, the flushing interface becomes a no-op, and there is no
> guarantee the buffer is really sent.
>
> This can be as bad as causing RAM corruption during migration.
>
> Fixes: 2bc58ffc2926 ("QIOChannelSocket: Implement io_writev zero copy
> flag & io_flush for CONFIG_LINUX")
> Reported-by: 徐闯 <xuchuangxclwt@bytedance.com>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>




* Re: [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning
  2022-06-20  5:39 ` [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning Leonardo Bras
@ 2022-06-20  9:23   ` Juan Quintela
  2022-06-20 15:44     ` Peter Xu
  2022-06-21  3:26     ` Leonardo Brás
  0 siblings, 2 replies; 21+ messages in thread
From: Juan Quintela @ 2022-06-20  9:23 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Daniel P. Berrangé,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster, Peter Xu,
	徐闯,
	qemu-devel

Leonardo Bras <leobras@redhat.com> wrote:
> When sending memory pages with MSG_ZEROCOPY, it's necessary to flush
> to make sure all dirty pages are sent before any newer version of them
> is sent.
>
> Currently, the flush happens every time at the end of ram_save_iterate(),
> which usually happens around 20x per second, due to a timeout.
>
> Change it so the flush happens only after a whole scan of the dirty bitmap,
> so it never sends a newer version of a page before an older one, while
> avoiding unnecessary overhead.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>

I agree that the previous one is too many flushes, but this one goes to the
other extreme and leaves too much memory uncommitted, and that is important
because otherwise we have an unlimited amount of dirty memory.

> +/*
> + * Flush zero-copy writes on every multifd channel
> + *
> + * When using zero-copy, it's necessary to flush the pages before any of
> + * the pages can be sent again, so we'll make sure the new version of the
> + * pages will always arrive _later_ than the old pages.
> + *
> + * Should be called only after we finished one whole scanning of
> + * all the dirty bitmaps.
> + */
> +int multifd_zero_copy_flush(void)
> +{
> +    int i;
> +    Error *local_err = NULL;
> +
> +    if (!migrate_use_multifd()) {
> +        return 0;
> +    }
> +
> +    for (i = 0; i < migrate_multifd_channels(); i++) {
> +        MultiFDSendParams *p = &multifd_send_state->params[i];
> +        int ret;
> +
> +        ret = qio_channel_flush(p->c, &local_err);
> +        if (ret < 0) {
> +            error_report_err(local_err);
> +            return ret;
> +        }
> +    }
> +
> +    return 0;
> +}


Here you flush every channel; you only want to do this at the end of a range.


>  int multifd_send_sync_main(QEMUFile *f)
>  {
>      int i;
> -    bool flush_zero_copy;
>  
>      if (!migrate_use_multifd()) {
>          return 0;
> @@ -581,19 +613,6 @@ int multifd_send_sync_main(QEMUFile *f)
>          }
>      }
>  
> -    /*
> -     * When using zero-copy, it's necessary to flush the pages before any of
> -     * the pages can be sent again, so we'll make sure the new version of the
> -     * pages will always arrive _later_ than the old pages.
> -     *
> -     * Currently we achieve this by flushing the zero-page requested writes
> -     * per ram iteration, but in the future we could potentially optimize it
> -     * to be less frequent, e.g. only after we finished one whole scanning of
> -     * all the dirty bitmaps.
> -     */
> -
> -    flush_zero_copy = migrate_use_zero_copy_send();
> -
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
>  
> @@ -615,17 +634,6 @@ int multifd_send_sync_main(QEMUFile *f)
>          ram_counters.transferred += p->packet_len;
>          qemu_mutex_unlock(&p->mutex);
>          qemu_sem_post(&p->sem);
> -
> -        if (flush_zero_copy && p->c) {
> -            int ret;
> -            Error *err = NULL;
> -
> -            ret = qio_channel_flush(p->c, &err);
> -            if (ret < 0) {
> -                error_report_err(err);
> -                return -1;
> -            }
> -        }

This synchronization already happens once every iteration through all
ram.

</me checks how>

And lo and behold, it doesn't.

The problem here is that we are calling multifd_send_sync_main() in the
wrong place, i.e. we are being too conservative.

We need to call multifd_send_sync_main() just before doing
migration_bitmap_sync().  The reason that we need to call that function
is exactly the same that we need to flush for zero_copy.

So, what we need to change is remove the call to
multifd_send_sync_main(), not how it handles zero_copy.

>      }
>      for (i = 0; i < migrate_multifd_channels(); i++) {
>          MultiFDSendParams *p = &multifd_send_state->params[i];
> diff --git a/migration/ram.c b/migration/ram.c
> index 5f5e37f64d..514584e44f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2288,6 +2288,13 @@ static int ram_find_and_save_block(RAMState *rs)
>      rs->last_seen_block = pss.block;
>      rs->last_page = pss.page;
>  
> +    if (pss.complete_round && migrate_use_zero_copy_send()) {
> +        int ret = multifd_zero_copy_flush();
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    }
> +

This place is not right either, because we can have a sync in the middle
for other reasons.

We call migration_bitmap_sync() in save_live_pending(), and that is not
when we finish the complete_round.

With that discussed, what I pointed out in the past is that you end up with
too much dirty memory on zero_copy.  When you have a multi-terabyte guest,
in a single round you have "potentially" dirty memory on each channel of:

   total_amount_memory / number of channels.

In a multi-terabyte guest, this is most probably going to be in the
dozens of gigabytes.  As far as I know there is no card/driver that will
benefit from so many pages in zero_copy, and the kernel will move to
synchronous copy at some point.  (In older threads, Daniel showed how to
test for this case.)

What I proposed is that you check in multifd_send_thread() how
much memory you have written since the last synchronization.  Once it
is big enough (I don't know the limits for the card; the docs that you
showed suggested the size is a few megabytes), you just sync at that
point and continue.

You still need to synchronize all threads at bitmap sync, but as said,
that is handled by multifd_send_sync_main(), and we should fix that
instead of changing where zero_copy flushes.

/* Removed not relevant bits of the function */

static void *multifd_send_thread(void *opaque)
{
    size_t zero_copy_bytes_sent = 0;

    ...

    while (true) {

            ....

            trace_multifd_send(p->id, packet_num, p->normal_num, p->zero_num,
                               p->flags, p->next_packet_size);

            if (use_zero_copy_send) {
                /* Send header first, without zerocopy */
                ret = qio_channel_write_all(p->c, (void *)p->packet,
                                            p->packet_len, &local_err);
                if (ret != 0) {
                    break;
                }

****** Note aside

Did you answer my question here about what happens when you do:

write_with_zero_copy(1MB); write_without_zero_copy(4KB); write_with_zero_copy(1MB); write_without_zero_copy(4KB);

My guess is that write_without_zero_copy(4KB) will just flush the
socket.  I can't think how it can work without doing that.
If so, we are not getting what we want.

            } else {
                /* Send header using the same writev call */
                p->iov[0].iov_len = p->packet_len;
                p->iov[0].iov_base = p->packet;
            }

            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
                                              0, p->write_flags, &local_err);
            if (ret != 0) {
                break;
            }

            qemu_mutex_lock(&p->mutex);
            p->num_packets++;
            p->total_normal_pages += p->normal_num;
            p->total_zero_pages += p->zero_num;
            p->pages->num = 0;
            p->pages->block = NULL;
            p->sent_bytes += p->packet_len;
            p->sent_bytes += p->next_packet_size;

**** Addition
            zero_copy_bytes_sent += p->packet_len + p->next_packet_size;

            p->pending_job--;
            qemu_mutex_unlock(&p->mutex);
***** Addition
            if (zero_copy_bytes_sent > Threshold) // 2MB/4MB?  I don't know
                    ret = qio_channel_flush(p->c, &local_err);
                    // Handle error somehow
                    //  If you want to be a pro, just change the
                    // Threshold depending on what the kernel answers.
                    // If it has to revert to synchronous sent, just
                    // decrease the threshold, otherwise increase it.

            if (p->flags & MULTIFD_FLAG_SYNC) {
                qemu_sem_post(&p->sem_sync);
            }
            qemu_sem_post(&multifd_send_state->channels_ready);
        } else if (p->quit) {
            qemu_mutex_unlock(&p->mutex);
            break;
        } else {
            qemu_mutex_unlock(&p->mutex);
            /* sometimes there are spurious wakeups */
        }
    }
    .............
}

What do you think?

Later, Juan.




* Re: [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability
  2022-06-20  5:39 ` [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability Leonardo Bras
  2022-06-20  5:46   ` Leonardo Bras Soares Passos
@ 2022-06-20  9:34   ` Juan Quintela
  2022-06-20 15:31   ` Peter Xu
  2022-06-21 12:34   ` Markus Armbruster
  3 siblings, 0 replies; 21+ messages in thread
From: Juan Quintela @ 2022-06-20  9:34 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Daniel P. Berrangé,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster, Peter Xu,
	徐闯,
	qemu-devel

Leonardo Bras <leobras@redhat.com> wrote:
> When originally implemented, zero_copy_send was designed as a Migration
> parameter.
>
> But considering how it is supposed to work, and the difference between
> a capability and a parameter, zero-copy-send works better as a
> capability.
>
> Since the change was merged only recently, there is still time to make
> it right and convert zero_copy_send into a Migration capability.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>


Note to myself: I would really love not to have to write so much
boilerplate code each time that we want to add a parameter/capability.

Adding the name, the documentation, and the check to set it up should be
all that is necessary.

A parameter needs to be defined twice in qapi, 6 times in migration.c
and once more in hmp.  Capabilities are better, but there is still room
for improvement.

Later, Juan.




* Re: [PATCH v4 1/4] QIOChannelSocket: Introduce assert and reduce ifdefs to improve readability
  2022-06-20  5:39 ` [PATCH v4 1/4] QIOChannelSocket: Introduce assert and reduce ifdefs to improve readability Leonardo Bras
  2022-06-20  8:47   ` Juan Quintela
@ 2022-06-20 15:27   ` Peter Xu
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Xu @ 2022-06-20 15:27 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, 徐闯,
	qemu-devel

On Mon, Jun 20, 2022 at 02:39:42AM -0300, Leonardo Bras wrote:
> During implementation of the MSG_ZEROCOPY feature, a lot of #ifdefs were
> introduced, particularly at qio_channel_socket_writev().
> 
> Rewrite some of those changes so the code is easier to read.
> 
> Also, introduce an assert to help detect incorrect zero-copy usage when
> it's disabled at build time.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v4 2/4] QIOChannelSocket: Fix zero-copy send so socket flush works
  2022-06-20  5:39 ` [PATCH v4 2/4] QIOChannelSocket: Fix zero-copy send so socket flush works Leonardo Bras
  2022-06-20  8:48   ` Juan Quintela
@ 2022-06-20 15:27   ` Peter Xu
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Xu @ 2022-06-20 15:27 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, 徐闯,
	qemu-devel

On Mon, Jun 20, 2022 at 02:39:43AM -0300, Leonardo Bras wrote:
> Somewhere between v6 and v7 of the zero-copy-send patchset, a crucial
> part of the flushing mechanism went missing: incrementing zero_copy_queued.
> 
> Without that, the flushing interface becomes a no-op, and there is no
> guarantee the buffer is really sent.
> 
> This can be as bad as causing RAM corruption during migration.
> 
> Fixes: 2bc58ffc2926 ("QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX")
> Reported-by: 徐闯 <xuchuangxclwt@bytedance.com>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability
  2022-06-20  5:39 ` [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability Leonardo Bras
  2022-06-20  5:46   ` Leonardo Bras Soares Passos
  2022-06-20  9:34   ` Juan Quintela
@ 2022-06-20 15:31   ` Peter Xu
  2022-06-20 15:51     ` Juan Quintela
  2022-06-21 12:34   ` Markus Armbruster
  3 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2022-06-20 15:31 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, 徐闯,
	qemu-devel

On Mon, Jun 20, 2022 at 02:39:45AM -0300, Leonardo Bras wrote:
> When originally implemented, zero_copy_send was designed as a Migration
> parameter.
> 
> But considering how it is supposed to work, and the difference between
> a capability and a parameter, zero-copy-send works better as a
> capability.
> 
> Since the change was merged only recently, there is still time to make
> it right and convert zero_copy_send into a Migration capability.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>

I assume this is a request from Libvirt? We don't have a release yet so
yeah probably we still have time..

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning
  2022-06-20  9:23   ` Juan Quintela
@ 2022-06-20 15:44     ` Peter Xu
  2022-06-21  3:35       ` Leonardo Brás
  2022-06-21  3:26     ` Leonardo Brás
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Xu @ 2022-06-20 15:44 UTC (permalink / raw)
  To: Juan Quintela, Leonardo Bras Soares Passos
  Cc: Leonardo Bras, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster,
	徐闯,
	qemu-devel

On Mon, Jun 20, 2022 at 11:23:53AM +0200, Juan Quintela wrote:
> With that discussed, what I pointed out in the past is that you end up with
> too much dirty memory on zero_copy.  When you have a multi-terabyte guest,
> in a single round you have "potentially" dirty memory on each channel of:
> 
>    total_amount_memory / number of channels.
> 
> In a multi-terabyte guest, this is most probably going to be in the
> dozens of gigabytes.  As far as I know there is no card/driver that will
> benefit from so many pages in zero_copy, and the kernel will move to
> synchronous copy at some point.  (In older threads, Daniel showed how to
> test for this case.)

I was wondering whether the kernel needs to cache a lot of messages for
zero copy if we don't flush it for a long time, as recvmsg(MSG_ERRQUEUE)
seems to fetch messages from the kernel one at a time.  And,
whether that queue has a limit in length or something.

Does it mean that when the kernel has cached enough of these
messages then it'll fall back to the no-zero-copy mode?  And probably that's
the way the kernel protects itself from using too much buffer for the error
msgs?

This reminded me - Leo, have you considered also adding a patch to
detect the "fallback to non-zero-copy" condition?  Because with it, when
the fallback happens at some point (e.g. when the guest memory is
larger than some value) we'll know.
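
For reference, that fallback information is exactly what the kernel puts on
the socket error queue for MSG_ZEROCOPY.  A minimal sketch of draining one
notification and spotting the copy fallback (plain Linux socket API, outside
QEMU; the helper name is made up for illustration):

#include <sys/socket.h>
#include <linux/errqueue.h>

/* Returns 1 if the kernel fell back to copying, 0 on a real zero-copy
 * completion, -1 if the error queue is currently empty. */
static int drain_one_zerocopy_completion(int fd)
{
    struct sock_extended_err *serr;
    struct cmsghdr *cm;
    char control[128];
    struct msghdr msg = {0};

    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
        return -1;                      /* e.g. EAGAIN: nothing pending */
    }

    for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        /* real code would also check cmsg_level/cmsg_type (IP_RECVERR) */
        serr = (struct sock_extended_err *)CMSG_DATA(cm);
        if (serr->ee_errno == 0 &&
            serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
            /* [ee_info, ee_data] is the range of completed sendmsg()s */
            return !!(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED);
        }
    }
    return 0;
}

SO_EE_CODE_ZEROCOPY_COPIED set in ee_code is the kernel saying it had to
copy after all, which is the condition such a patch could count or warn
about.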

Thanks,

-- 
Peter Xu




* Re: [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability
  2022-06-20 15:31   ` Peter Xu
@ 2022-06-20 15:51     ` Juan Quintela
  0 siblings, 0 replies; 21+ messages in thread
From: Juan Quintela @ 2022-06-20 15:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: Leonardo Bras, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster,
	徐闯,
	qemu-devel

Peter Xu <peterx@redhat.com> wrote:
> On Mon, Jun 20, 2022 at 02:39:45AM -0300, Leonardo Bras wrote:
>> When originally implemented, zero_copy_send was designed as a Migration
>> parameter.
>> 
>> But considering how it is supposed to work, and the difference between
>> a capability and a parameter, zero-copy-send works better as a
>> capability.
>> 
>> Since the change was merged only recently, there is still time to make
>> it right and convert zero_copy_send into a Migration capability.
>> 
>> Signed-off-by: Leonardo Bras <leobras@redhat.com>
>
> I assume this is a request from Libvirt? We don't have a release yet so
> yeah probably we still have time..
>
> Acked-by: Peter Xu <peterx@redhat.com>

Libvirt already has a concept of migration capabilities (bools) that
lets layered products check/discover/etc.  Putting it there instead
of in a parameter makes their life much easier (i.e. more common code).

As this code hasn't been in a stable release yet, I think it is ok.
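
For reference, after the conversion the knob would presumably be toggled
like any other capability (the name below is taken from the QAPI enum in
this patch), e.g. from HMP:

    migrate_set_capability zero-copy-send on

or via the QMP migrate-set-capabilities command, which is what libvirt
already uses for the other capabilities.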

Later, Juan.




* Re: [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning
  2022-06-20  9:23   ` Juan Quintela
  2022-06-20 15:44     ` Peter Xu
@ 2022-06-21  3:26     ` Leonardo Brás
  1 sibling, 0 replies; 21+ messages in thread
From: Leonardo Brás @ 2022-06-21  3:26 UTC (permalink / raw)
  To: quintela
  Cc: Daniel P.Berrangé,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster, Peter Xu,
	徐闯,
	qemu-devel

On Mon, 2022-06-20 at 11:23 +0200, Juan Quintela wrote:
> Leonardo Bras <leobras@redhat.com> wrote:
> > When sending memory pages with MSG_ZEROCOPY, it's necessary to flush
> > to make sure all dirty pages are sent before any newer version of them
> > is sent.
> > 
> > Currently, the flush happens every time at the end of ram_save_iterate(),
> > which usually happens around 20x per second, due to a timeout.
> > 
> > Change it so the flush happens only after a whole scan of the dirty bitmap,
> > so it never sends a newer version of a page before an older one, while
> > avoiding unnecessary overhead.
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> 
> I agree that the previous one is too many flushes, but this one goes to the
> other extreme and leaves too much memory uncommitted, and that is important
> because otherwise we have an unlimited amount of dirty memory.

I don't quite understand what you meant here.
What does it mean to be uncommitted in this context?
I don't see how we get unlimited amounts of dirty memory here. 

Zero-copy is not supposed to copy memory pages, so let's say all pages are dirty
and enqueued to send, but our network interface is stalling the send:
All the memory we would have allocated is VM ram pagecount x sizeof(iov), because
then we flush.

Unless... are you referring to locked memory (as dirty enqueued memory -> locked
memory)?
That would be a valid point: if we enqueue more memory than the locked memory
amount our user supports, the migration will fail.

But that would mean a very fast CPU (lots of sendmsgs) and a very slow network
interface. 
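
Just to put rough numbers on that worst case (a back-of-the-envelope
sketch of mine, assuming a 1 TiB guest with 4 KiB pages; the figures are
illustrative, not measured):

#include <stdio.h>

int main(void)
{
    /* Assumed sizes, purely for illustration. */
    unsigned long long guest_ram = 1ULL << 40;            /* 1 TiB */
    unsigned long long page_size = 4096;                  /* 4 KiB */
    unsigned long long pages     = guest_ram / page_size;

    /* iovec bookkeeping if every page were enqueued at once:
     * pagecount x sizeof(struct iovec) (16 bytes on 64-bit). */
    printf("pages: %llu, iovec metadata: %llu GiB\n",
           pages, (pages * 16) >> 30);

    /* The pages themselves stay locked until the flush, so the
     * worst case is the whole guest RAM pinned. */
    printf("worst-case locked guest memory: %llu GiB\n", guest_ram >> 30);

    return 0;
}

So the iovec bookkeeping stays small next to the pages themselves; the
pinned pages are the part that can actually hit the locked-memory limit.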

> 
> > +/*
> > + * Set zero_copy_flush = true for every multifd channel
> > + *
> > + * When using zero-copy, it's necessary to flush the pages before any of
> > + * the pages can be sent again, so we'll make sure the new version of the
> > + * pages will always arrive _later_ than the old pages.
> > + *
> > + * Should be called only after we finished one whole scanning of
> > + * all the dirty bitmaps.
> > + */
> > +int multifd_zero_copy_flush(void)
> > +{
> > +    int i;
> > +    Error *local_err = NULL;
> > +
> > +    if (!migrate_use_multifd()) {
> > +        return 0;
> > +    }
> > +
> > +    for (i = 0; i < migrate_multifd_channels(); i++) {
> > +        MultiFDSendParams *p = &multifd_send_state->params[i];
> > +        int ret;
> > +
> > +        ret = qio_channel_flush(p->c, &local_err);
> > +        if (ret < 0) {
> > +            error_report_err(local_err);
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> 
> 
> Here you flush every channel.  You only want to do this at the end of a range.

Yes, the idea is to flush after a full scan of the dirty-bitmap.

> 
> 
> >  int multifd_send_sync_main(QEMUFile *f)
> >  {
> >      int i;
> > -    bool flush_zero_copy;
> >  
> >      if (!migrate_use_multifd()) {
> >          return 0;
> > @@ -581,19 +613,6 @@ int multifd_send_sync_main(QEMUFile *f)
> >          }
> >      }
> >  
> > -    /*
> > -     * When using zero-copy, it's necessary to flush the pages before any
> > of
> > -     * the pages can be sent again, so we'll make sure the new version of
> > the
> > -     * pages will always arrive _later_ than the old pages.
> > -     *
> > -     * Currently we achieve this by flushing the zero-page requested writes
> > -     * per ram iteration, but in the future we could potentially optimize
> > it
> > -     * to be less frequent, e.g. only after we finished one whole scanning
> > of
> > -     * all the dirty bitmaps.
> > -     */
> > -
> > -    flush_zero_copy = migrate_use_zero_copy_send();
> > -
> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> >  
> > @@ -615,17 +634,6 @@ int multifd_send_sync_main(QEMUFile *f)
> >          ram_counters.transferred += p->packet_len;
> >          qemu_mutex_unlock(&p->mutex);
> >          qemu_sem_post(&p->sem);
> > -
> > -        if (flush_zero_copy && p->c) {
> > -            int ret;
> > -            Error *err = NULL;
> > -
> > -            ret = qio_channel_flush(p->c, &err);
> > -            if (ret < 0) {
> > -                error_report_err(err);
> > -                return -1;
> > -            }
> > -        }
> 
> This synchronization already happens once every iteration through all
> ram.
> 
> </me checks how>
> 
> And lo and behold, it doesn't.
> 
> The problem here is that we are calling multifd_send_sync_main() in the
> wrong place, i.e. we are being too conservative.
> 
> We need to call multifd_send_sync_main() just before doing
> migration_bitmap_sync().  The reason that we need to call that function
> is exactly the same as the reason we need to flush for zero_copy.
> 
> So, what we need to change is remove the call to
> multifd_send_sync_main(), not how it handles zero_copy.

So, IIUC, multifd has been syncing in a conservative way (i.e. more than
actually needed), and we need to make the same improvement (sync less) in
multifd too, instead of just changing it for zero-copy. Is that correct?

> 
> >      }
> >      for (i = 0; i < migrate_multifd_channels(); i++) {
> >          MultiFDSendParams *p = &multifd_send_state->params[i];
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 5f5e37f64d..514584e44f 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -2288,6 +2288,13 @@ static int ram_find_and_save_block(RAMState *rs)
> >      rs->last_seen_block = pss.block;
> >      rs->last_page = pss.page;
> >  
> > +    if (pss.complete_round && migrate_use_zero_copy_send()) {
> > +        int ret = multifd_zero_copy_flush();
> > +        if (ret < 0) {
> > +            return ret;
> > +        }
> > +    }
> > +
> 
> This place is not right either, because we can have a sync in the middle
> for other reasons.

What do you mean by "in the middle" here? 
Does it mean we should not have multifd code in this function?

> 
> We call migration_bitmap_sync() in save_live_pending(), and that is not
> when we finish the complete_round.

Agree. 
We could add/set a flag in multifd, say in the above area, and then sync/flush
in the correct place (in the future).

But in my experience debugging my code, I found that the loop in
ram_save_iterate() will not stop/break at "bitmap scan end", so we may end up
sending the same page twice before a flush, which is what we want to avoid.

> 
> Having discussed this, what I asked about in the past is that you end up with
> too much dirty memory with zero_copy.  When you have a Multiterabyte guest, in
> a single round you have "potentially" dirty memory on each channel of:
> 
>    total_amount_memory / number of channels.

If dirty memory -> locked memory, yes, max_locked_memory == guest_ram.

> 
> In a Multiterabyte guest, this is more than likely going to be in the
> dozens of gigabytes.  As far as I know there is no card/driver that will
> benefit from so many pages in zero_copy, and the kernel will move to
> synchronous copy at some point.  (In older threads, daniel showed how to
> test for this case).

The idea of MSG_ZEROCOPY is to avoid copying the page contents over to
kernel space. I don't recall any reason a network card would drop back to
copying instead of keeping zero-copy, unless it does not support it
(scatter-gather not supported, for example), but in that case it would never
use zero-copy in the first place.

IIRC, as long as we have enough locked memory, there should be no problem for
the driver to keep using zero-copy send.

Well, unless you mean the gain of using zero-copy would be irrelevant compared
to the cost of locked memory in some scenarios, say a very powerful CPU with a
slow network interface, where using zero-copy would not affect migration
speed. But that is not a scenario for zero-copy anyway.

> 
> What I proposed is that you check in the migration_send_thread() how
> much memory you have written since the last synchronization.  Once it
> is big enough (I don't know the limits for the card; the docs that you
> showed suggested the size is a few megabytes), you just sync at that
> point and continue.

Would this help us keep the locked memory under control? Yes.
But it would come at the cost of flushing *much* more often than needed.

I mean, it would only be useful if the network card is completely stalling the
send, or if the cpu is scheduling more zero-copy sends than the network card can
process.

The point here is: we don't need to keep track of how much has been sent, but
rather of how much is still enqueued (i.e. locked memory). Maybe there is a
cheap way to track this value, and we could flush when we detect it's too high.
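
Something along these lines is what I have in mind, as a fragment on top of
the multifd send loop quoted above (an untested sketch; 'enqueued_bytes' and
'ZERO_COPY_LOCKED_LIMIT' are names I made up, not existing code):

    size_t enqueued_bytes = 0;   /* hypothetical per-channel counter of
                                  * bytes still locked by zero-copy sends */
    ...
            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
                                              0, p->write_flags, &local_err);
            if (ret != 0) {
                break;
            }

            if (use_zero_copy_send) {
                enqueued_bytes += p->next_packet_size;

                /* Flush on "too much locked", not on "N bytes sent":
                 * after the flush the kernel is done with the buffers,
                 * so the counter can drop back to zero. */
                if (enqueued_bytes > ZERO_COPY_LOCKED_LIMIT) {
                    if (qio_channel_flush(p->c, &local_err) < 0) {
                        error_report_err(local_err);
                        break;
                    }
                    enqueued_bytes = 0;
                }
            }

Ideally the counter would also drop as completion notifications show up on
the error queue, without a full flush, but the flush is the simple version.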

> 
> You still need to synchronize all threads at bitmap sync, but as said,
> that is handled by multifd_send_sync_main(), and we should fix that
> instead of changing where zero_copy flushes.

OK, so we keep the zero-copy flush inside multifd_send_sync_main() and
change where it's called so that it matches the end of the dirty-bitmap scan.
Is that correct?
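
i.e. reusing the hook from the quoted ram.c hunk, but going through the
existing multifd sync (an untested sketch; I'm assuming rs->f is the right
QEMUFile to pass here, and leaving aside for now whether
ram_find_and_save_block() is the right place at all):

    if (pss.complete_round) {
        /* One sync per full pass over the dirty bitmap, with the
         * zero-copy flush kept inside multifd_send_sync_main(). */
        int ret = multifd_send_sync_main(rs->f);
        if (ret < 0) {
            return ret;
        }
    }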

> 
> /* Removed not relevant bits of the function */
> 
> static void *multifd_send_thread(void *opaque)
> {
>     size_t zero_copy_bytes_sent = 0;
> 
>     ...
> 
>     while (true) {
> 
>             ....
> 
>             trace_multifd_send(p->id, packet_num, p->normal_num, p->zero_num,
>                                p->flags, p->next_packet_size);
> 
>             if (use_zero_copy_send) {
>                 /* Send header first, without zerocopy */
>                 ret = qio_channel_write_all(p->c, (void *)p->packet,
>                                             p->packet_len, &local_err);
>                 if (ret != 0) {
>                     break;
>                 }
> 
> ****** Note aside
> 
> Did you answer my question here about what happens when you do:
> 
> write_with_zero_copy(1MB); write_without_zero_copy(4KB);
> write_with_zero_copy(1MB); write_without_zero_copy(4KB);
> 
> My guess is that write_without_zero_copy(4KB) will just flush the
> socket.  I can't think how it can work without doing that.

To be honest, I did not follow the MSG_ZEROCOPY code to this extent. 

But it depends. If write_without_zero_copy() has MSG_DONTWAIT we could have
everything being sent in a non-blocking way.

But, of course, taking into account that in multifd we are using a blocking
sendmsg(), I also agree that it is probably flushing the packets previously sent
with MSG_ZEROCOPY (or we could have issues on the receiving side).

But that's an assumption. I still need to dig into the code to be sure about
that aspect. I will work on it asap.

> If so, we are not getting what we want.

In fact yes, we are.
We are avoiding copying the buffer, which is what we intend to do, and avoiding
the cost of the 'flush' syscall happening too often. 

If we think multifd is supposed to have blocking sendmsg(), then we are doing
exactly that here, with zero-copy for the heavy part, which is ideal IMHO.

We could try adding MSG_DONTWAIT to the sendmsg() in multifd (which would take
the blocking / synchronous send away), but I have no idea what impact it would
have on multifd's inner workings.

> 
>             } else {
>                 /* Send header using the same writev call */
>                 p->iov[0].iov_len = p->packet_len;
>                 p->iov[0].iov_base = p->packet;
>             }
> 
>             ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
>                                               0, p->write_flags, &local_err);
>             if (ret != 0) {
>                 break;
>             }
> 
>             qemu_mutex_lock(&p->mutex);
>             p->num_packets++;
>             p->total_normal_pages += p->normal_num;
>             p->total_zero_pages += p->zero_num;
>             p->pages->num = 0;
>             p->pages->block = NULL;
>             p->sent_bytes += p->packet_len;;
>             p->sent_bytes += p->next_packet_size;
> 
> **** Addition
>             zero_copy_bytes_sent += p->packet_len + p->next_packet_size;
> 
>             p->pending_job--;
>             qemu_mutex_unlock(&p->mutex);
> ***** Addition
>             if (zero_copy_bytes_sent > Threshold) // 2MB/4MB?  I don't know
>                     ret = qio_channel_flush(p->c, &local_err);
>                     // Handle error somehow
>                     //  If you want to be a pro, just change the
>                     // Threshold depending on what the kernel answers.
>                     // If it has to revert to synchronous sent, just
>                     // decrease the threshold, otherwise increase it.
> 
>             if (p->flags & MULTIFD_FLAG_SYNC) {
>                 qemu_sem_post(&p->sem_sync);
>             }
>             qemu_sem_post(&multifd_send_state->channels_ready);
>         } else if (p->quit) {
>             qemu_mutex_unlock(&p->mutex);
>             break;
>         } else {
>             qemu_mutex_unlock(&p->mutex);
>             /* sometimes there are spurious wakeups */
>         }
>     }
>     .............
> }
> 
> What do you think?

It's a good idea to keep locked memory at bay, but I think flushing every N
bytes sent is too often and will kill performance. Even more so if we assume
the non-zero-copy blocking sendmsg() is already flushing for us.

> 
> Later, Juan.
> 

Thanks Juan!

Best regards,
Leo



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning
  2022-06-20 15:44     ` Peter Xu
@ 2022-06-21  3:35       ` Leonardo Brás
  2022-06-21 14:51         ` Juan Quintela
  2022-06-21 15:04         ` Peter Xu
  0 siblings, 2 replies; 21+ messages in thread
From: Leonardo Brás @ 2022-06-21  3:35 UTC (permalink / raw)
  To: Peter Xu, Juan Quintela, Leonardo Bras Soares Passos
  Cc: Daniel P. Berrangé,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster,
	徐闯,
	qemu-devel

On Mon, 2022-06-20 at 11:44 -0400, Peter Xu wrote:
> On Mon, Jun 20, 2022 at 11:23:53AM +0200, Juan Quintela wrote:
> > Having discussed this, what I asked about in the past is that you end up with
> > too much dirty memory with zero_copy.  When you have a Multiterabyte guest, in
> > a single round you have "potentially" dirty memory on each channel of:
> >
> >    total_amount_memory / number of channels.
> >
> > In a Multiterabyte guest, this is more than likely going to be in the
> > dozens of gigabytes.  As far as I know there is no card/driver that will
> > benefit from so many pages in zero_copy, and the kernel will move to
> > synchronous copy at some point.  (In older threads, daniel showed how to
> > test for this case).
> 
> I was wondering whether the kernel needs to cache a lot of messages for
> zero copy if we don't flush it for a long time, as recvmsg(MSG_ERRQUEUE)
> seems to be fetching one message from the kernel one at a time.  And,
> whether that queue has a limit in length or something.

IIRC, if all messages look the same, it 'merges' them in a single message, like,
'this range has these flags and output'.

So, if no issue happens, we should have a single message with the confirmation
of all sent buffers, meaning just a little memory is used for that.

> 
> Does it mean that when the kernel has cached enough of these
> messages then it'll fall back to the no-zero-copy mode?  And probably that's
> the way the kernel protects itself from using too much buffer for the error
> msgs?

Since it merges the messages, I don't think it uses a lot of space for that.

IIRC, the kernel will fall back to copying only if the network adapter / driver
does not support MSG_ZEROCOPY, like when it does not support scatter-gather.

> 
> This reminded me - Leo, have you considered adding a patch to detect
> the "fallback to non-zero-copy" condition?  Because with it, when the
> fallback happens at some point (e.g. when the guest memory is larger
> than some value), we'll know.

I still did not consider that, but sure, how do you see that working?

We can't just disable zero-copy-send because the user actually opted in, so we
could instead add a one time error message for when it falls back to copying, as
it should happen in the first try of zero-copy send.

Or we could fail the migration, stating the interface does not support
MSG_ZEROCOPY, since it should happen in the first sendmsg().

I would personally opt for the last option.

What do you think?
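
For reference, this is roughly how I understand the completion notifications
can be read back from the error queue, and how a copy fallback is reported
there (a standalone sketch for an IPv4 TCP socket, based on my reading of the
MSG_ZEROCOPY interface; 'sock_fd' is a placeholder and I have not run this
exact code):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

/* Returns 1 if any completed range fell back to copying, 0 if all
 * completed as zero-copy, -1 on error. */
int check_zerocopy_completions(int sock_fd)
{
    int copied = 0;

    for (;;) {
        char control[128];
        struct msghdr msg;
        struct cmsghdr *cm;

        memset(&msg, 0, sizeof(msg));
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        ssize_t r = recvmsg(sock_fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT);
        if (r < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                return copied;    /* no more notifications queued */
            }
            return -1;
        }

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
            struct sock_extended_err *serr;

            if (cm->cmsg_level != SOL_IP || cm->cmsg_type != IP_RECVERR) {
                continue;
            }
            serr = (struct sock_extended_err *)CMSG_DATA(cm);
            if (serr->ee_errno != 0 ||
                serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
                continue;
            }

            /* One notification covers a whole range of zero-copy
             * sendmsg() calls, ee_info..ee_data, which is why the
             * merged completions stay small. */
            printf("completed sends %u..%u\n", serr->ee_info, serr->ee_data);

            /* The kernel sets this flag when it had to copy the data
             * instead of sending it zero-copy. */
            if (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED) {
                copied = 1;
            }
        }
    }
}

If that flag only ever shows up on the very first flush, failing the
migration right away (the option I'd prefer above) seems doable; if it can
also show up later, only warning may be the safer choice.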

> 
> Thanks,
> 

Thanks Peter!

Best regards,
Leo



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability
  2022-06-20  5:39 ` [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability Leonardo Bras
                     ` (2 preceding siblings ...)
  2022-06-20 15:31   ` Peter Xu
@ 2022-06-21 12:34   ` Markus Armbruster
  3 siblings, 0 replies; 21+ messages in thread
From: Markus Armbruster @ 2022-06-21 12:34 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake, Peter Xu,
	徐闯,
	qemu-devel

Leonardo Bras <leobras@redhat.com> writes:

> When originally implemented, zero_copy_send was designed as a Migration
> parameter.
>
> But taking into account how it is supposed to work, and what the
> difference between a capability and a parameter is, it only makes sense
> that zero-copy-send would work better as a capability.
>
> Taking into account how recently the change got merged, it was decided
> that there is still time to make it right, and convert zero_copy_send into
> a Migration capability.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>

QAPI schema
Acked-by: Markus Armbruster <armbru@redhat.com>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning
  2022-06-21  3:35       ` Leonardo Brás
@ 2022-06-21 14:51         ` Juan Quintela
  2022-06-21 15:09           ` Peter Xu
  2022-06-21 15:04         ` Peter Xu
  1 sibling, 1 reply; 21+ messages in thread
From: Juan Quintela @ 2022-06-21 14:51 UTC (permalink / raw)
  To: Leonardo Brás
  Cc: Peter Xu, Leonardo Bras Soares Passos, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster,
	徐闯,
	qemu-devel

Leonardo Brás <leobras@redhat.com> wrote:
> On Mon, 2022-06-20 at 11:44 -0400, Peter Xu wrote:
>> On Mon, Jun 20, 2022 at 11:23:53AM +0200, Juan Quintela wrote:
>> > Having discussed this, what I asked about in the past is that you end up with
>> > too much dirty memory with zero_copy.  When you have a Multiterabyte guest, in
>> > a single round you have "potentially" dirty memory on each channel of:
>> >
>> >    total_amount_memory / number of channels.
>> >
>> > In a Multiterabyte guest, this is more than likely going to be in the
>> > dozens of gigabytes.  As far as I know there is no card/driver that will
>> > benefit from so many pages in zero_copy, and the kernel will move to
>> > synchronous copy at some point.  (In older threads, daniel showed how to
>> > test for this case).
>> 
>> I was wondering whether the kernel needs to cache a lot of messages for
>> zero copy if we don't flush it for a long time, as recvmsg(MSG_ERRQUEUE)
>> seems to be fetching one message from the kernel one at a time.  And,
>> whether that queue has a limit in length or something.
>
> IIRC, if all messages look the same, it 'merges' them in a single message, like,
> 'this range has these flags and output'.
>
> So, if no issue happens, we should have a single message with the confirmation
> of all sent buffers, meaning just a little memory is used for that.
>
>> 
>> Does it mean that when the kernel has cached enough of these
>> messages then it'll fall back to the no-zero-copy mode?  And probably that's
>> the way the kernel protects itself from using too much buffer for the error
>> msgs?
>
> Since it merges the messages, I don't think it uses a lot of space for that.
>
> IIRC, the kernel will fall back to copying only if the network adapter / driver
> does not support MSG_ZEROCOPY, like when it does not support scatter-gather.

My understanding is that it will fall back when you have too much stuff
in flight.

>> 
>> This reminded me - Leo, have you considered adding a patch to detect
>> the "fallback to non-zero-copy" condition?  Because with it, when the
>> fallback happens at some point (e.g. when the guest memory is larger
>> than some value), we'll know.
>
> I still did not consider that, but sure, how do you see that working?

send with zero_copy(1MB)
send with zero_copy(1MB)
.... (repeat)
at some point the kernel decides:
  sync the whole queue
  send the next packet synchronously.

We are not wondering whether the kernel does this (it does).  What we are
wondering is when it does it, i.e. after 1MB worth of writes, 2MB, 10MB,
....
That is the part that depends on the kernel/network card/driver.


> We can't just disable zero-copy-send because the user actually opted in, so we
> could instead add a one time error message for when it falls back to copying, as
> it should happen in the first try of zero-copy send.

On your 1st (or 2nd) series, Dan Berrange explained how to use the
error message interface to detect it.

> Or we could fail the migration, stating the interface does not support
> MSG_ZEROCOPY, since it should happen in the first sendmsg().

> I would personally opt for the last option.
>
> What do you think?

Later, Juan.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning
  2022-06-21  3:35       ` Leonardo Brás
  2022-06-21 14:51         ` Juan Quintela
@ 2022-06-21 15:04         ` Peter Xu
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Xu @ 2022-06-21 15:04 UTC (permalink / raw)
  To: Leonardo Brás
  Cc: Juan Quintela, Leonardo Bras Soares Passos,
	Daniel P. Berrangé,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster,
	徐闯,
	qemu-devel

On Tue, Jun 21, 2022 at 12:35:54AM -0300, Leonardo Brás wrote:
> > This reminded me - Leo, have you considered adding a patch to detect
> > the "fallback to non-zero-copy" condition?  Because with it, when the
> > fallback happens at some point (e.g. when the guest memory is larger
> > than some value), we'll know.
> 
> I still did not consider that, but sure, how do you see that working?
> 
> We can't just disable zero-copy-send because the user actually opted in, so we
> could instead add a one time error message for when it falls back to copying, as
> it should happen in the first try of zero-copy send.
> 
> Or we could fail the migration, stating the interface does not support
> MSG_ZEROCOPY, since it should happen in the first sendmsg().
> 
> I would personally opt for the last option.
> 
> What do you think?

I don't have a very strong feeling about it, but yes, that sounds proper to me.
If one day we'd like zero-copy send to be on by default then we'll consider
the other way round, but that's maybe not necessary for now.  Thanks.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning
  2022-06-21 14:51         ` Juan Quintela
@ 2022-06-21 15:09           ` Peter Xu
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2022-06-21 15:09 UTC (permalink / raw)
  To: Juan Quintela
  Cc: Leonardo Brás, Leonardo Bras Soares Passos,
	Daniel P. Berrangé,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster,
	徐闯,
	qemu-devel

On Tue, Jun 21, 2022 at 04:51:39PM +0200, Juan Quintela wrote:
> > IIRC, the kernel will fall back to copying only if the network adapter / driver
> > does not support MSG_ZEROCOPY, like when it does not support scatter-gather.
> 
> My understanding is that it will fall back when you have too much stuff
> in flight.

I think we'd better figure this out soon, because if so then IMHO we can't
simply fail the migration when the fallback happens.  If we're not sure
about that, we can always stay on the safe side and only dump an error.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-06-21 15:11 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-20  5:39 [PATCH v4 0/4] MSG_ZEROCOPY fixes & improvements Leonardo Bras
2022-06-20  5:39 ` [PATCH v4 1/4] QIOChannelSocket: Introduce assert and reduce ifdefs to improve readability Leonardo Bras
2022-06-20  8:47   ` Juan Quintela
2022-06-20 15:27   ` Peter Xu
2022-06-20  5:39 ` [PATCH v4 2/4] QIOChannelSocket: Fix zero-copy send so socket flush works Leonardo Bras
2022-06-20  8:48   ` Juan Quintela
2022-06-20 15:27   ` Peter Xu
2022-06-20  5:39 ` [PATCH v4 3/4] migration: zero-copy flush only at the end of bitmap scanning Leonardo Bras
2022-06-20  9:23   ` Juan Quintela
2022-06-20 15:44     ` Peter Xu
2022-06-21  3:35       ` Leonardo Brás
2022-06-21 14:51         ` Juan Quintela
2022-06-21 15:09           ` Peter Xu
2022-06-21 15:04         ` Peter Xu
2022-06-21  3:26     ` Leonardo Brás
2022-06-20  5:39 ` [PATCH v4 4/4] migration: Change zero_copy_send from migration parameter to migration capability Leonardo Bras
2022-06-20  5:46   ` Leonardo Bras Soares Passos
2022-06-20  9:34   ` Juan Quintela
2022-06-20 15:31   ` Peter Xu
2022-06-20 15:51     ` Juan Quintela
2022-06-21 12:34   ` Markus Armbruster
