qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 0/6] MSG_ZEROCOPY + multifd
@ 2021-12-09  9:39 Leonardo Bras
  2021-12-09  9:39 ` [PATCH v6 1/6] QIOChannel: Add io_writev_zero_copy & io_flush_zero_copy callbacks Leonardo Bras
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Leonardo Bras @ 2021-12-09  9:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster
  Cc: Leonardo Bras, qemu-devel

This patch series intends to enable MSG_ZEROCOPY in QIOChannel, and make
use of it for multifd migration performance improvement, by reducing cpu
usage.

Patch #1 creates new callbacks for QIOChannel, allowing the implementation
of zero copy writing.

Patch #2 reworks qio_channel_socket_writev() so it accepts flags for
that are later passed to sendmsg().

Patch #3 implements writev_zero_copy and flush_zero_copy on QIOChannelSocket,
making use of MSG_ZEROCOPY on Linux.

Patch #4 adds a "zero_copy" migration property, only available with
CONFIG_LINUX, and compiled-out in any other architectures.
This migration property has to be enabled before multifd migration starts.

Patch #5 adds a helper function that allows to see if TLS is going to be used.
This helper will be later used in patch #6.

Patch #6 Makes use of QIOChannelSocket zero_copy implementation on
nocomp multifd migration.

Results:
In preliminary tests, the resource usage of __sys_sendmsg() reduced 15 times,
and the overall migration took 13-22% less time, based in synthetic cpu
workload.

In further tests, it was noted that, on multifd migration with 8 channels:
- On idle hosts, migration time reduced in 10% to 21%.
- On hosts busy with heavy cpu stress (1 stress thread per cpu, but
  not cpu-pinned) migration time reduced in ~25% by enabling zero-copy.
- On hosts with heavy cpu-pinned workloads (1 stress thread per cpu, 
  cpu-pinned), migration time reducted in ~66% by enabling zero-copy.

Above tests setup:
- Sending and Receiving hosts:
  - CPU : Intel(R) Xeon(R) Platinum 8276L CPU @ 2.20GHz (448 CPUS)
  - Network card: E810-C (100Gbps)
  - >1TB RAM
  - QEMU: Upstream master branch + This patchset
  - Linux: Upstream v5.15 
- VM configuration:
  - 28 VCPUs
  - 512GB RAM


---
Changes since v5:
- flush_zero_copy now returns -1 on fail, 0 on success, and 1 when all
  processed writes were not able to use zerocopy in kernel.
- qio_channel_socket_poll() removed, using qio_channel_wait() instead
- ENOBUFS is now processed inside qio_channel_socket_writev_flags()
- Most zerocopy parameter validation moved to migrate_params_check(),
  leaving only feature test to socket_outgoing_migration() callback
- Naming went from *zerocopy to *zero_copy or *zero-copy, due to QAPI/QMP
  preferences
- Improved docs

Changes since v4:
- 3 patches got splitted in 6
- Flush is used for syncing after each iteration, instead of only at the end
- If zerocopy is not available, fail in connect instead of failing on write
- 'multifd-zerocopy' property renamed to 'zerocopy'
- Fail migrations that don't support zerocopy, if it's enabled.
- Instead of checking for zerocopy at each write, save the flags in
  MultiFDSendParams->write_flags and use them on write
- Reorganized flag usage in QIOChannelSocket 
- A lot of typos fixed
- More doc on buffer restrictions

Changes since v3:
- QIOChannel interface names changed from io_async_{writev,flush} to
  io_{writev,flush}_zerocopy
- Instead of falling back in case zerocopy is not implemented, return
  error and abort operation.
- Flush now waits as long as needed, or return error in case anything
  goes wrong, aborting the operation.
- Zerocopy is now conditional in multifd, being set by parameter
  multifd-zerocopy
- Moves zerocopy_flush to multifd_send_sync_main() from multifd_save_cleanup
  so migration can abort if flush goes wrong.
- Several other small improvements

Changes since v2:
- Patch #1: One more fallback
- Patch #2: Fall back to sync if fails to lock buffer memory in MSG_ZEROCOPY send.

Changes since v1:
- Reimplemented the patchset using async_write + async_flush approach.
- Implemented a flush to be able to tell whenever all data was written.


Leonardo Bras (6):
  QIOChannel: Add io_writev_zero_copy & io_flush_zero_copy callbacks
  QIOChannelSocket: Add flags parameter for writing
  QIOChannelSocket: Implement io_writev_zero_copy & io_flush_zero_copy
    for CONFIG_LINUX
  migration: Add zero-copy parameter for QMP/HMP for Linux
  migration: Add migrate_use_tls() helper
  multifd: Implement zero copy write in multifd migration
    (multifd-zero-copy)

 qapi/migration.json         |  24 ++++++
 include/io/channel-socket.h |   2 +
 include/io/channel.h        |  98 +++++++++++++++++++++---
 migration/migration.h       |   6 ++
 migration/multifd.h         |   4 +-
 io/channel-socket.c         | 145 +++++++++++++++++++++++++++++++++---
 io/channel.c                |  66 +++++++++++++---
 migration/channel.c         |   6 +-
 migration/migration.c       |  49 ++++++++++++
 migration/multifd.c         |  45 ++++++++---
 migration/ram.c             |  29 ++++++--
 migration/socket.c          |   6 ++
 monitor/hmp-cmds.c          |   6 ++
 13 files changed, 434 insertions(+), 52 deletions(-)

-- 
2.33.1



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v6 1/6] QIOChannel: Add io_writev_zero_copy & io_flush_zero_copy callbacks
  2021-12-09  9:39 [PATCH v6 0/6] MSG_ZEROCOPY + multifd Leonardo Bras
@ 2021-12-09  9:39 ` Leonardo Bras
  2021-12-10 12:15   ` Daniel P. Berrangé
  2021-12-09  9:39 ` [PATCH v6 2/6] QIOChannelSocket: Add flags parameter for writing Leonardo Bras
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 8+ messages in thread
From: Leonardo Bras @ 2021-12-09  9:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster
  Cc: Leonardo Bras, qemu-devel

Adds io_writev_zero_copy and io_flush_zero_copy as optional callback to QIOChannelClass,
allowing the implementation of zero copy writes by subclasses.

How to use them:
- Write data using qio_channel_writev_zero_copy(),
- Wait write completion with qio_channel_flush_zero_copy().

Notes:
As some zero copy implementations work asynchronously, it's
recommended to keep the write buffer untouched until the return of
qio_channel_flush_zero_copy(), to avoid the risk of sending an updated
buffer instead of the one at the write.

As the new callbacks are optional, if a subclass does not implement them, then:
- io_writev_zero_copy will return -1,
- io_flush_zero_copy will return 0 without changing anything.

Also, some functions like qio_channel_writev_full_all() were adapted to
receive a flag parameter. That allows shared code between zero copy and
non-zero copy writev, and also an easier implementation on new flags.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/io/channel.h | 98 +++++++++++++++++++++++++++++++++++++++-----
 io/channel.c         | 66 +++++++++++++++++++++++------
 2 files changed, 142 insertions(+), 22 deletions(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index 88988979f8..83fa970a19 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -32,12 +32,15 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass,
 
 #define QIO_CHANNEL_ERR_BLOCK -2
 
+#define QIO_CHANNEL_WRITE_FLAG_ZERO_COPY 0x1
+
 typedef enum QIOChannelFeature QIOChannelFeature;
 
 enum QIOChannelFeature {
     QIO_CHANNEL_FEATURE_FD_PASS,
     QIO_CHANNEL_FEATURE_SHUTDOWN,
     QIO_CHANNEL_FEATURE_LISTEN,
+    QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
 };
 
 
@@ -136,6 +139,12 @@ struct QIOChannelClass {
                                   IOHandler *io_read,
                                   IOHandler *io_write,
                                   void *opaque);
+    ssize_t (*io_writev_zero_copy)(QIOChannel *ioc,
+                                   const struct iovec *iov,
+                                   size_t niov,
+                                   Error **errp);
+    int (*io_flush_zero_copy)(QIOChannel *ioc,
+                              Error **errp);
 };
 
 /* General I/O handling functions */
@@ -321,10 +330,11 @@ int qio_channel_readv_all(QIOChannel *ioc,
 
 
 /**
- * qio_channel_writev_all:
+ * qio_channel_writev_all_flags:
  * @ioc: the channel object
  * @iov: the array of memory regions to write data from
  * @niov: the length of the @iov array
+ * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
  * @errp: pointer to a NULL-initialized error object
  *
  * Write data to the IO channel, reading it from the
@@ -337,12 +347,23 @@ int qio_channel_readv_all(QIOChannel *ioc,
  * to be written, yielding from the current coroutine
  * if required.
  *
+ * If QIO_CHANNEL_WRITE_FLAG_ZERO_COPY is passed in flags,
+ * instead of waiting for all requested data to be written,
+ * this function will wait until it's all queued for writing.
+ * In this case, if the buffer gets changed between queueing and
+ * sending, the updated buffer will be sent. If this is not a
+ * desired behavior, it's suggested to call qio_channel_flush_zero_copy()
+ * before reusing the buffer.
+ *
  * Returns: 0 if all bytes were written, or -1 on error
  */
-int qio_channel_writev_all(QIOChannel *ioc,
-                           const struct iovec *iov,
-                           size_t niov,
-                           Error **erp);
+int qio_channel_writev_all_flags(QIOChannel *ioc,
+                                 const struct iovec *iov,
+                                 size_t niov,
+                                 int flags,
+                                 Error **errp);
+#define qio_channel_writev_all(ioc, iov, niov, errp) \
+    qio_channel_writev_all_flags(ioc, iov, niov, 0, errp)
 
 /**
  * qio_channel_readv:
@@ -831,12 +852,13 @@ int qio_channel_readv_full_all(QIOChannel *ioc,
                                Error **errp);
 
 /**
- * qio_channel_writev_full_all:
+ * qio_channel_writev_full_all_flags:
  * @ioc: the channel object
  * @iov: the array of memory regions to write data from
  * @niov: the length of the @iov array
  * @fds: an array of file handles to send
  * @nfds: number of file handles in @fds
+ * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
  * @errp: pointer to a NULL-initialized error object
  *
  *
@@ -846,13 +868,69 @@ int qio_channel_readv_full_all(QIOChannel *ioc,
  * to be written, yielding from the current coroutine
  * if required.
  *
+ * If QIO_CHANNEL_WRITE_FLAG_ZERO_COPY is passed in flags,
+ * instead of waiting for all requested data to be written,
+ * this function will wait until it's all queued for writing.
+ * In this case, if the buffer gets changed between queueing and
+ * sending, the updated buffer will be sent. If this is not a
+ * desired behavior, it's suggested to call qio_channel_flush_zero_copy()
+ * before reusing the buffer.
+ *
  * Returns: 0 if all bytes were written, or -1 on error
  */
 
-int qio_channel_writev_full_all(QIOChannel *ioc,
-                                const struct iovec *iov,
-                                size_t niov,
-                                int *fds, size_t nfds,
+int qio_channel_writev_full_all_flags(QIOChannel *ioc,
+                                      const struct iovec *iov,
+                                      size_t niov,
+                                      int *fds, size_t nfds,
+                                      int flags, Error **errp);
+#define qio_channel_writev_full_all(ioc, iov, niov, fds, nfds, errp) \
+    qio_channel_writev_full_all_flags(ioc, iov, niov, fds, nfds, 0, errp)
+
+/**
+ * qio_channel_writev_zero_copy:
+ * @ioc: the channel object
+ * @iov: the array of memory regions to write data from
+ * @niov: the length of the @iov array
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Behaves like qio_channel_writev, but may write
+ * data asynchronously while avoiding unnecessary data copy.
+ * This function may return before any data is actually written,
+ * but will queue every buffer for writing.
+ *
+ * Some implementations require the buffer region to be locked,
+ * so if there is not enough locked memory available to the process
+ * this function will fail.
+ *
+ * If at some point it's necessary to wait for all data to be
+ * written, use qio_channel_flush_zero_copy().
+ *
+ * If zero copy is not available, returns -1 and set errp.
+ */
+
+ssize_t qio_channel_writev_zero_copy(QIOChannel *ioc,
+                                     const struct iovec *iov,
+                                     size_t niov,
+                                     Error **errp);
+
+/**
+ * qio_channel_flush_zero_copy:
+ * @ioc: the channel object
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Will block until every packet queued with
+ * qio_channel_writev_zero_copy() is sent, or return
+ * in case of any error.
+ *
+ * If not implemented, acts as a no-op, and returns 0.
+ *
+ * Returns -1 if any error is found,
+ *          1 if every send failed to use zero copy.
+ *          0 otherwise.
+ */
+
+int qio_channel_flush_zero_copy(QIOChannel *ioc,
                                 Error **errp);
 
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel.c b/io/channel.c
index e8b019dc36..d04a6772c8 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -212,19 +212,21 @@ int qio_channel_readv_full_all(QIOChannel *ioc,
     return ret;
 }
 
-int qio_channel_writev_all(QIOChannel *ioc,
-                           const struct iovec *iov,
-                           size_t niov,
-                           Error **errp)
+int qio_channel_writev_all_flags(QIOChannel *ioc,
+                                 const struct iovec *iov,
+                                 size_t niov,
+                                 int flags,
+                                 Error **errp)
 {
-    return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0, errp);
+    return qio_channel_writev_full_all_flags(ioc, iov, niov, NULL, 0, flags,
+                                             errp);
 }
 
-int qio_channel_writev_full_all(QIOChannel *ioc,
-                                const struct iovec *iov,
-                                size_t niov,
-                                int *fds, size_t nfds,
-                                Error **errp)
+int qio_channel_writev_full_all_flags(QIOChannel *ioc,
+                                      const struct iovec *iov,
+                                      size_t niov,
+                                      int *fds, size_t nfds,
+                                      int flags, Error **errp)
 {
     int ret = -1;
     struct iovec *local_iov = g_new(struct iovec, niov);
@@ -237,8 +239,16 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
 
     while (nlocal_iov > 0) {
         ssize_t len;
-        len = qio_channel_writev_full(ioc, local_iov, nlocal_iov, fds, nfds,
-                                      errp);
+
+        if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+            assert(fds == NULL && nfds == 0);
+            len = qio_channel_writev_zero_copy(ioc, local_iov, nlocal_iov,
+                                               errp);
+        } else {
+            len = qio_channel_writev_full(ioc, local_iov, nlocal_iov, fds, nfds,
+                                          errp);
+        }
+
         if (len == QIO_CHANNEL_ERR_BLOCK) {
             if (qemu_in_coroutine()) {
                 qio_channel_yield(ioc, G_IO_OUT);
@@ -474,6 +484,38 @@ off_t qio_channel_io_seek(QIOChannel *ioc,
 }
 
 
+ssize_t qio_channel_writev_zero_copy(QIOChannel *ioc,
+                                     const struct iovec *iov,
+                                     size_t niov,
+                                     Error **errp)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (!klass->io_writev_zero_copy ||
+        !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
+        error_setg_errno(errp, EINVAL,
+                         "Channel does not support zero copy writev");
+        return -1;
+    }
+
+    return klass->io_writev_zero_copy(ioc, iov, niov, errp);
+}
+
+
+int qio_channel_flush_zero_copy(QIOChannel *ioc,
+                                Error **errp)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (!klass->io_flush_zero_copy ||
+        !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
+        return 0;
+    }
+
+    return klass->io_flush_zero_copy(ioc, errp);
+}
+
+
 static void qio_channel_restart_read(void *opaque)
 {
     QIOChannel *ioc = opaque;
-- 
2.33.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v6 2/6] QIOChannelSocket: Add flags parameter for writing
  2021-12-09  9:39 [PATCH v6 0/6] MSG_ZEROCOPY + multifd Leonardo Bras
  2021-12-09  9:39 ` [PATCH v6 1/6] QIOChannel: Add io_writev_zero_copy & io_flush_zero_copy callbacks Leonardo Bras
@ 2021-12-09  9:39 ` Leonardo Bras
  2021-12-09  9:39 ` [PATCH v6 3/6] QIOChannelSocket: Implement io_writev_zero_copy & io_flush_zero_copy for CONFIG_LINUX Leonardo Bras
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Leonardo Bras @ 2021-12-09  9:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster
  Cc: Leonardo Bras, qemu-devel

Change qio_channel_socket_writev() in order to accept flags, so its possible
to selectively make use of sendmsg() flags.

qio_channel_socket_writev() contents were moved to a helper function
qio_channel_socket_writev_flags() which accepts an extra argument for flags.
(This argument is passed directly to sendmsg().

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 io/channel-socket.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/io/channel-socket.c b/io/channel-socket.c
index 606ec97cf7..b57a27bf91 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -520,12 +520,13 @@ static ssize_t qio_channel_socket_readv(QIOChannel *ioc,
     return ret;
 }
 
-static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
-                                         const struct iovec *iov,
-                                         size_t niov,
-                                         int *fds,
-                                         size_t nfds,
-                                         Error **errp)
+static ssize_t qio_channel_socket_writev_flags(QIOChannel *ioc,
+                                               const struct iovec *iov,
+                                               size_t niov,
+                                               int *fds,
+                                               size_t nfds,
+                                               int flags,
+                                               Error **errp)
 {
     QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
     ssize_t ret;
@@ -558,7 +559,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
     }
 
  retry:
-    ret = sendmsg(sioc->fd, &msg, 0);
+    ret = sendmsg(sioc->fd, &msg, flags);
     if (ret <= 0) {
         if (errno == EAGAIN) {
             return QIO_CHANNEL_ERR_BLOCK;
@@ -572,6 +573,17 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
     }
     return ret;
 }
+
+static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
+                                         const struct iovec *iov,
+                                         size_t niov,
+                                         int *fds,
+                                         size_t nfds,
+                                         Error **errp)
+{
+    return qio_channel_socket_writev_flags(ioc, iov, niov, fds, nfds, 0, errp);
+}
+
 #else /* WIN32 */
 static ssize_t qio_channel_socket_readv(QIOChannel *ioc,
                                         const struct iovec *iov,
-- 
2.33.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v6 3/6] QIOChannelSocket: Implement io_writev_zero_copy & io_flush_zero_copy for CONFIG_LINUX
  2021-12-09  9:39 [PATCH v6 0/6] MSG_ZEROCOPY + multifd Leonardo Bras
  2021-12-09  9:39 ` [PATCH v6 1/6] QIOChannel: Add io_writev_zero_copy & io_flush_zero_copy callbacks Leonardo Bras
  2021-12-09  9:39 ` [PATCH v6 2/6] QIOChannelSocket: Add flags parameter for writing Leonardo Bras
@ 2021-12-09  9:39 ` Leonardo Bras
  2021-12-09  9:39 ` [PATCH v6 4/6] migration: Add zero-copy parameter for QMP/HMP for Linux Leonardo Bras
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Leonardo Bras @ 2021-12-09  9:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster
  Cc: Leonardo Bras, qemu-devel

For CONFIG_LINUX, implement the new optional callbacks io_write_zero_copy and
io_flush_zero_copy on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
feature is available in the host kernel, which is checked on
qio_channel_socket_connect_sync()

qio_channel_socket_flush_zero_copy() was implemented by counting how many times
sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
socket's error queue, in order to find how many of them finished sending.
Flush will loop until those counters are the same, or until some error occurs.

A new function qio_channel_socket_poll() was also created in order to avoid
busy-looping recvmsg() in qio_channel_socket_flush_zero_copy() while waiting for
updates in socket's error queue.

Notes on using writev_zero_copy():
1: Buffer
- As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying,
some caution is necessary to avoid overwriting any buffer before it's sent.
If something like this happen, a newer version of the buffer may be sent instead.
- If this is a problem, it's recommended to call flush_zero_copy() before freeing
or re-using the buffer.

2: Locked memory
- When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and
unlocked after it's sent.
- Depending on the size of each buffer, and how often it's sent, it may require
a larger amount of locked memory than usually available to non-root user.
- If the required amount of locked memory is not available, writev_zero_copy
will return an error, which can abort an operation like migration,
- Because of this, when an user code wants to add zero copy as a feature, it
requires a mechanism to disable it, so it can still be accessible to less
privileged users.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/io/channel-socket.h |   2 +
 io/channel-socket.c         | 119 +++++++++++++++++++++++++++++++++++-
 2 files changed, 118 insertions(+), 3 deletions(-)

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index e747e63514..513c428fe4 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -47,6 +47,8 @@ struct QIOChannelSocket {
     socklen_t localAddrLen;
     struct sockaddr_storage remoteAddr;
     socklen_t remoteAddrLen;
+    ssize_t zero_copy_queued;
+    ssize_t zero_copy_sent;
 };
 
 
diff --git a/io/channel-socket.c b/io/channel-socket.c
index b57a27bf91..d0c91662c1 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -26,6 +26,10 @@
 #include "io/channel-watch.h"
 #include "trace.h"
 #include "qapi/clone-visitor.h"
+#ifdef CONFIG_LINUX
+#include <linux/errqueue.h>
+#include <bits/socket.h>
+#endif
 
 #define SOCKET_MAX_FDS 16
 
@@ -55,6 +59,8 @@ qio_channel_socket_new(void)
 
     sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
     sioc->fd = -1;
+    sioc->zero_copy_queued = 0;
+    sioc->zero_copy_sent = 0;
 
     ioc = QIO_CHANNEL(sioc);
     qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -154,6 +160,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
         return -1;
     }
 
+#ifdef CONFIG_LINUX
+    int ret, v = 1;
+    ret = qemu_setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
+    if (ret == 0) {
+        /* Zero copy available on host */
+        qio_channel_set_feature(QIO_CHANNEL(ioc),
+                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
+    }
+#endif
+
     return 0;
 }
 
@@ -561,12 +577,19 @@ static ssize_t qio_channel_socket_writev_flags(QIOChannel *ioc,
  retry:
     ret = sendmsg(sioc->fd, &msg, flags);
     if (ret <= 0) {
-        if (errno == EAGAIN) {
+        switch (errno) {
+        case EAGAIN:
             return QIO_CHANNEL_ERR_BLOCK;
-        }
-        if (errno == EINTR) {
+        case EINTR:
             goto retry;
+        case ENOBUFS:
+            if (flags & MSG_ZEROCOPY) {
+                error_setg_errno(errp, errno,
+                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
+                return -1;
+            }
         }
+
         error_setg_errno(errp, errno,
                          "Unable to write to socket");
         return -1;
@@ -670,6 +693,92 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 }
 #endif /* WIN32 */
 
+
+#ifdef CONFIG_LINUX
+
+static ssize_t qio_channel_socket_writev_zero_copy(QIOChannel *ioc,
+                                                   const struct iovec *iov,
+                                                   size_t niov,
+                                                   Error **errp)
+{
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    ssize_t ret;
+
+    ret = qio_channel_socket_writev_flags(ioc, iov, niov, NULL, 0,
+                                          MSG_ZEROCOPY, errp);
+    if (ret > 0) {
+        sioc->zero_copy_queued++;
+    }
+
+    return ret;
+}
+
+static int qio_channel_socket_flush_zero_copy(QIOChannel *ioc,
+                                              Error **errp)
+{
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    struct msghdr msg = {};
+    struct sock_extended_err *serr;
+    struct cmsghdr *cm;
+    char control[CMSG_SPACE(sizeof(*serr))];
+    int received;
+    int ret = 1;
+
+    msg.msg_control = control;
+    msg.msg_controllen = sizeof(control);
+    memset(control, 0, sizeof(control));
+
+    while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
+        received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
+        if (received < 0) {
+            switch (errno) {
+            case EAGAIN:
+                /* Nothing on errqueue, wait until something is available */
+                qio_channel_wait(ioc, G_IO_ERR);
+                continue;
+            case EINTR:
+                continue;
+            default:
+                error_setg_errno(errp, errno,
+                                 "Unable to read errqueue");
+                return -1;
+            }
+        }
+
+        cm = CMSG_FIRSTHDR(&msg);
+        if (cm->cmsg_level != SOL_IP &&
+            cm->cmsg_type != IP_RECVERR) {
+            error_setg_errno(errp, EPROTOTYPE,
+                             "Wrong cmsg in errqueue");
+            return -1;
+        }
+
+        serr = (void *) CMSG_DATA(cm);
+        if (serr->ee_errno != SO_EE_ORIGIN_NONE) {
+            error_setg_errno(errp, serr->ee_errno,
+                             "Error on socket");
+            return -1;
+        }
+        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+            error_setg_errno(errp, serr->ee_origin,
+                             "Error not from zero copy");
+            return -1;
+        }
+
+        /* No errors, count successfully finished sendmsg()*/
+        sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
+
+        /* If any sendmsg() succeeded using zero copy, return 0 at the end */
+        if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
+            ret = 0;
+        }
+    }
+
+    return ret;
+}
+
+#endif /* CONFIG_LINUX */
+
 static int
 qio_channel_socket_set_blocking(QIOChannel *ioc,
                                 bool enabled,
@@ -799,6 +908,10 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
     ioc_klass->io_set_delay = qio_channel_socket_set_delay;
     ioc_klass->io_create_watch = qio_channel_socket_create_watch;
     ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
+#ifdef CONFIG_LINUX
+    ioc_klass->io_writev_zero_copy = qio_channel_socket_writev_zero_copy;
+    ioc_klass->io_flush_zero_copy = qio_channel_socket_flush_zero_copy;
+#endif
 }
 
 static const TypeInfo qio_channel_socket_info = {
-- 
2.33.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v6 4/6] migration: Add zero-copy parameter for QMP/HMP for Linux
  2021-12-09  9:39 [PATCH v6 0/6] MSG_ZEROCOPY + multifd Leonardo Bras
                   ` (2 preceding siblings ...)
  2021-12-09  9:39 ` [PATCH v6 3/6] QIOChannelSocket: Implement io_writev_zero_copy & io_flush_zero_copy for CONFIG_LINUX Leonardo Bras
@ 2021-12-09  9:39 ` Leonardo Bras
  2021-12-09  9:39 ` [PATCH v6 5/6] migration: Add migrate_use_tls() helper Leonardo Bras
  2021-12-09  9:39 ` [PATCH v6 6/6] multifd: Implement zero copy write in multifd migration (multifd-zero-copy) Leonardo Bras
  5 siblings, 0 replies; 8+ messages in thread
From: Leonardo Bras @ 2021-12-09  9:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster
  Cc: Leonardo Bras, qemu-devel

Add property that allows zero-copy migration of memory pages,
and also includes a helper function migrate_use_zero_copy() to check
if it's enabled.

No code is introduced to actually do the migration, but it allow
future implementations to enable/disable this feature.

On non-Linux builds this parameter is compiled-out.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 qapi/migration.json   | 24 ++++++++++++++++++++++++
 migration/migration.h |  5 +++++
 migration/migration.c | 32 ++++++++++++++++++++++++++++++++
 migration/socket.c    |  5 +++++
 monitor/hmp-cmds.c    |  6 ++++++
 5 files changed, 72 insertions(+)

diff --git a/qapi/migration.json b/qapi/migration.json
index bbfd48cf0b..2e62ea6ebd 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -730,6 +730,13 @@
 #                      will consume more CPU.
 #                      Defaults to 1. (Since 5.0)
 #
+# @zero-copy: Controls behavior on sending memory pages on migration.
+#             When true, enables a zero-copy mechanism for sending memory
+#             pages, if host supports it.
+#             Requires that QEMU be permitted to use locked memory for guest
+#             RAM pages.
+#             Defaults to false. (Since 7.0)
+#
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #                        aliases for the purpose of dirty bitmap migration.  Such
 #                        aliases may for example be the corresponding names on the
@@ -769,6 +776,7 @@
            'xbzrle-cache-size', 'max-postcopy-bandwidth',
            'max-cpu-throttle', 'multifd-compression',
            'multifd-zlib-level' ,'multifd-zstd-level',
+           { 'name': 'zero-copy', 'if' : 'CONFIG_LINUX'},
            'block-bitmap-mapping' ] }
 
 ##
@@ -895,6 +903,13 @@
 #                      will consume more CPU.
 #                      Defaults to 1. (Since 5.0)
 #
+# @zero-copy: Controls behavior on sending memory pages on migration.
+#             When true, enables a zero-copy mechanism for sending memory
+#             pages, if host supports it.
+#             Requires that QEMU be permitted to use locked memory for guest
+#             RAM pages.
+#             Defaults to false. (Since 7.0)
+#
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #                        aliases for the purpose of dirty bitmap migration.  Such
 #                        aliases may for example be the corresponding names on the
@@ -949,6 +964,7 @@
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
+            '*zero-copy': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
             '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
 
 ##
@@ -1095,6 +1111,13 @@
 #                      will consume more CPU.
 #                      Defaults to 1. (Since 5.0)
 #
+# @zero-copy: Controls behavior on sending memory pages on migration.
+#             When true, enables a zero-copy mechanism for sending memory
+#             pages, if host supports it.
+#             Requires that QEMU be permitted to use locked memory for guest
+#             RAM pages.
+#             Defaults to false. (Since 7.0)
+#
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #                        aliases for the purpose of dirty bitmap migration.  Such
 #                        aliases may for example be the corresponding names on the
@@ -1147,6 +1170,7 @@
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
+            '*zero-copy': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
             '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
 
 ##
diff --git a/migration/migration.h b/migration/migration.h
index 8130b703eb..1489eeb165 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -339,6 +339,11 @@ MultiFDCompression migrate_multifd_compression(void);
 int migrate_multifd_zlib_level(void);
 int migrate_multifd_zstd_level(void);
 
+#ifdef CONFIG_LINUX
+bool migrate_use_zero_copy(void);
+#else
+#define migrate_use_zero_copy() (false)
+#endif
 int migrate_use_xbzrle(void);
 uint64_t migrate_xbzrle_cache_size(void);
 bool migrate_colo_enabled(void);
diff --git a/migration/migration.c b/migration/migration.c
index abaf6f9e3d..109d11e4a1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -886,6 +886,10 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->multifd_zlib_level = s->parameters.multifd_zlib_level;
     params->has_multifd_zstd_level = true;
     params->multifd_zstd_level = s->parameters.multifd_zstd_level;
+#ifdef CONFIG_LINUX
+    params->has_zero_copy = true;
+    params->zero_copy = s->parameters.zero_copy;
+#endif
     params->has_xbzrle_cache_size = true;
     params->xbzrle_cache_size = s->parameters.xbzrle_cache_size;
     params->has_max_postcopy_bandwidth = true;
@@ -1538,6 +1542,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_multifd_compression) {
         dest->multifd_compression = params->multifd_compression;
     }
+#ifdef CONFIG_LINUX
+    if (params->has_zero_copy) {
+        dest->zero_copy = params->zero_copy;
+    }
+#endif
     if (params->has_xbzrle_cache_size) {
         dest->xbzrle_cache_size = params->xbzrle_cache_size;
     }
@@ -1650,6 +1659,11 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_multifd_compression) {
         s->parameters.multifd_compression = params->multifd_compression;
     }
+#ifdef CONFIG_LINUX
+    if (params->has_zero_copy) {
+        s->parameters.zero_copy = params->zero_copy;
+    }
+#endif
     if (params->has_xbzrle_cache_size) {
         s->parameters.xbzrle_cache_size = params->xbzrle_cache_size;
         xbzrle_cache_resize(params->xbzrle_cache_size, errp);
@@ -2540,6 +2554,17 @@ int migrate_multifd_zstd_level(void)
     return s->parameters.multifd_zstd_level;
 }
 
+#ifdef CONFIG_LINUX
+bool migrate_use_zero_copy(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->parameters.zero_copy;
+}
+#endif
+
 int migrate_use_xbzrle(void)
 {
     MigrationState *s;
@@ -4190,6 +4215,10 @@ static Property migration_properties[] = {
     DEFINE_PROP_UINT8("multifd-zstd-level", MigrationState,
                       parameters.multifd_zstd_level,
                       DEFAULT_MIGRATE_MULTIFD_ZSTD_LEVEL),
+#ifdef CONFIG_LINUX
+    DEFINE_PROP_BOOL("zero_copy", MigrationState,
+                      parameters.zero_copy, false),
+#endif
     DEFINE_PROP_SIZE("xbzrle-cache-size", MigrationState,
                       parameters.xbzrle_cache_size,
                       DEFAULT_MIGRATE_XBZRLE_CACHE_SIZE),
@@ -4287,6 +4316,9 @@ static void migration_instance_init(Object *obj)
     params->has_multifd_compression = true;
     params->has_multifd_zlib_level = true;
     params->has_multifd_zstd_level = true;
+#ifdef CONFIG_LINUX
+    params->has_zero_copy = true;
+#endif
     params->has_xbzrle_cache_size = true;
     params->has_max_postcopy_bandwidth = true;
     params->has_max_cpu_throttle = true;
diff --git a/migration/socket.c b/migration/socket.c
index 05705a32d8..f7a77aafd3 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -77,6 +77,11 @@ static void socket_outgoing_migration(QIOTask *task,
     } else {
         trace_migration_socket_outgoing_connected(data->hostname);
     }
+
+    if (migrate_use_zero_copy()) {
+        error_setg(&err, "Zero copy not available in migration");
+    }
+
     migration_channel_connect(data->s, sioc, data->hostname, err);
     object_unref(OBJECT(sioc));
 }
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 9c91bf93e9..71d16f86ac 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1297,6 +1297,12 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_multifd_zstd_level = true;
         visit_type_uint8(v, param, &p->multifd_zstd_level, &err);
         break;
+#ifdef CONFIG_LINUX
+    case MIGRATION_PARAMETER_ZERO_COPY:
+        p->has_zero_copy = true;
+        visit_type_bool(v, param, &p->zero_copy, &err);
+        break;
+#endif
     case MIGRATION_PARAMETER_XBZRLE_CACHE_SIZE:
         p->has_xbzrle_cache_size = true;
         if (!visit_type_size(v, param, &cache_size, &err)) {
-- 
2.33.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v6 5/6] migration: Add migrate_use_tls() helper
  2021-12-09  9:39 [PATCH v6 0/6] MSG_ZEROCOPY + multifd Leonardo Bras
                   ` (3 preceding siblings ...)
  2021-12-09  9:39 ` [PATCH v6 4/6] migration: Add zero-copy parameter for QMP/HMP for Linux Leonardo Bras
@ 2021-12-09  9:39 ` Leonardo Bras
  2021-12-09  9:39 ` [PATCH v6 6/6] multifd: Implement zero copy write in multifd migration (multifd-zero-copy) Leonardo Bras
  5 siblings, 0 replies; 8+ messages in thread
From: Leonardo Bras @ 2021-12-09  9:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster
  Cc: Leonardo Bras, qemu-devel

A lot of places check parameters.tls_creds in order to evaluate if TLS is
in use, and sometimes call migrate_get_current() just for that test.

Add new helper function migrate_use_tls() in order to simplify testing
for TLS usage.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 migration/migration.h | 1 +
 migration/channel.c   | 6 +++---
 migration/migration.c | 9 +++++++++
 migration/multifd.c   | 5 +----
 4 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index 1489eeb165..445d95bbf2 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -344,6 +344,7 @@ bool migrate_use_zero_copy(void);
 #else
 #define migrate_use_zero_copy() (false)
 #endif
+int migrate_use_tls(void);
 int migrate_use_xbzrle(void);
 uint64_t migrate_xbzrle_cache_size(void);
 bool migrate_colo_enabled(void);
diff --git a/migration/channel.c b/migration/channel.c
index c4fc000a1a..1a45b75d29 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -32,16 +32,16 @@
  */
 void migration_channel_process_incoming(QIOChannel *ioc)
 {
-    MigrationState *s = migrate_get_current();
     Error *local_err = NULL;
 
     trace_migration_set_incoming_channel(
         ioc, object_get_typename(OBJECT(ioc)));
 
-    if (s->parameters.tls_creds &&
-        *s->parameters.tls_creds &&
+    if (migrate_use_tls() &&
         !object_dynamic_cast(OBJECT(ioc),
                              TYPE_QIO_CHANNEL_TLS)) {
+        MigrationState *s = migrate_get_current();
+
         migration_tls_channel_process_incoming(s, ioc, &local_err);
     } else {
         migration_ioc_register_yank(ioc);
diff --git a/migration/migration.c b/migration/migration.c
index 109d11e4a1..8e50f7508a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2565,6 +2565,15 @@ bool migrate_use_zero_copy(void)
 }
 #endif
 
+int migrate_use_tls(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->parameters.tls_creds && *s->parameters.tls_creds;
+}
+
 int migrate_use_xbzrle(void)
 {
     MigrationState *s;
diff --git a/migration/multifd.c b/migration/multifd.c
index 7c9deb1921..b32b756147 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -794,14 +794,11 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
                                     QIOChannel *ioc,
                                     Error *error)
 {
-    MigrationState *s = migrate_get_current();
-
     trace_multifd_set_outgoing_channel(
         ioc, object_get_typename(OBJECT(ioc)), p->tls_hostname, error);
 
     if (!error) {
-        if (s->parameters.tls_creds &&
-            *s->parameters.tls_creds &&
+        if (migrate_use_tls() &&
             !object_dynamic_cast(OBJECT(ioc),
                                  TYPE_QIO_CHANNEL_TLS)) {
             multifd_tls_channel_connect(p, ioc, &error);
-- 
2.33.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v6 6/6] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
  2021-12-09  9:39 [PATCH v6 0/6] MSG_ZEROCOPY + multifd Leonardo Bras
                   ` (4 preceding siblings ...)
  2021-12-09  9:39 ` [PATCH v6 5/6] migration: Add migrate_use_tls() helper Leonardo Bras
@ 2021-12-09  9:39 ` Leonardo Bras
  5 siblings, 0 replies; 8+ messages in thread
From: Leonardo Bras @ 2021-12-09  9:39 UTC (permalink / raw)
  To: Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster
  Cc: Leonardo Bras, qemu-devel

Implement zero copy on nocomp_send_write(), by making use of QIOChannel
zero copy interface.

Change multifd_send_sync_main() so it can distinguish each iteration sync from
the setup and the completion, so a flush_zero_copy() can be called
after each iteration in order to make sure all dirty pages are sent
before a new iteration is started.

Also make it return -1 if flush_zero_copy() fails, in order to cancel
the migration process, and avoid resuming the guest in the target host
without receiving all current RAM.

This will work fine on RAM migration because the RAM pages are not usually freed,
and there is no problem on changing the pages content between writev_zero_copy() and
the actual sending of the buffer, because this change will dirty the page and
cause it to be re-sent on a next iteration anyway.

A lot of locked memory may be needed in order to use multid migration
with zero-copy enabled, so disabling the feature should be necessary for
low-privileged users trying to perform multifd migrations.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.h   |  4 +++-
 migration/migration.c | 11 ++++++++++-
 migration/multifd.c   | 40 +++++++++++++++++++++++++++++++++++-----
 migration/ram.c       | 29 ++++++++++++++++++++++-------
 migration/socket.c    |  5 +++--
 5 files changed, 73 insertions(+), 16 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 15c50ca0b2..37941c1872 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -22,7 +22,7 @@ int multifd_load_cleanup(Error **errp);
 bool multifd_recv_all_channels_created(void);
 bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
-void multifd_send_sync_main(QEMUFile *f);
+int multifd_send_sync_main(QEMUFile *f, bool sync);
 int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
 
 /* Multifd Compression flags */
@@ -97,6 +97,8 @@ typedef struct {
     uint32_t packet_len;
     /* pointer to the packet */
     MultiFDPacket_t *packet;
+    /* multifd flags for sending ram */
+    int write_flags;
     /* multifd flags for each packet */
     uint32_t flags;
     /* size of the next packet that contains pages */
diff --git a/migration/migration.c b/migration/migration.c
index 8e50f7508a..4e154ff901 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1468,7 +1468,16 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
         error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
         return false;
     }
-
+#ifdef CONFIG_LINUX
+    if (params->zero_copy &&
+        (!migrate_use_multifd() ||
+         params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
+         (params->tls_creds && *params->tls_creds))) {
+        error_setg(errp,
+                   "Zero copy only available for non-compressed non-TLS multifd migration");
+        return false;
+    }
+#endif
     return true;
 }
 
diff --git a/migration/multifd.c b/migration/multifd.c
index b32b756147..4c718cddf7 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -105,7 +105,8 @@ static int nocomp_send_prepare(MultiFDSendParams *p, uint32_t used,
  */
 static int nocomp_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
 {
-    return qio_channel_writev_all(p->c, p->pages->iov, used, errp);
+    return qio_channel_writev_all_flags(p->c, p->pages->iov, used,
+                                        p->write_flags, errp);
 }
 
 /**
@@ -578,19 +579,28 @@ void multifd_save_cleanup(void)
     multifd_send_state = NULL;
 }
 
-void multifd_send_sync_main(QEMUFile *f)
+int multifd_send_sync_main(QEMUFile *f, bool sync)
 {
     int i;
+    bool flush_zero_copy;
 
     if (!migrate_use_multifd()) {
-        return;
+        return 0;
     }
     if (multifd_send_state->pages->used) {
         if (multifd_send_pages(f) < 0) {
             error_report("%s: multifd_send_pages fail", __func__);
-            return;
+            return 0;
         }
     }
+
+    /*
+     * When using zero-copy, it's necessary to flush after each iteration to
+     * make sure pages from earlier iterations don't end up replacing newer
+     * pages.
+     */
+    flush_zero_copy = sync && migrate_use_zero_copy();
+
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
@@ -601,7 +611,7 @@ void multifd_send_sync_main(QEMUFile *f)
         if (p->quit) {
             error_report("%s: channel %d has already quit", __func__, i);
             qemu_mutex_unlock(&p->mutex);
-            return;
+            return 0;
         }
 
         p->packet_num = multifd_send_state->packet_num++;
@@ -612,6 +622,17 @@ void multifd_send_sync_main(QEMUFile *f)
         ram_counters.transferred += p->packet_len;
         qemu_mutex_unlock(&p->mutex);
         qemu_sem_post(&p->sem);
+
+        if (flush_zero_copy) {
+            int ret;
+            Error *err = NULL;
+
+            ret = qio_channel_flush_zero_copy(p->c, &err);
+            if (ret < 0) {
+                error_report_err(err);
+                return -1;
+            }
+        }
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
@@ -620,6 +641,8 @@ void multifd_send_sync_main(QEMUFile *f)
         qemu_sem_wait(&p->sem_sync);
     }
     trace_multifd_send_sync_main(multifd_send_state->packet_num);
+
+    return 0;
 }
 
 static void *multifd_send_thread(void *opaque)
@@ -917,6 +940,13 @@ int multifd_save_setup(Error **errp)
         p->packet->version = cpu_to_be32(MULTIFD_VERSION);
         p->name = g_strdup_printf("multifdsend_%d", i);
         p->tls_hostname = g_strdup(s->hostname);
+
+        if (migrate_use_zero_copy()) {
+            p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
+        } else {
+            p->write_flags = 0;
+        }
+
         socket_send_channel_create(multifd_new_send_channel_async, p);
     }
 
diff --git a/migration/ram.c b/migration/ram.c
index 863035d235..0b3ddbffc1 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2992,6 +2992,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 {
     RAMState **rsp = opaque;
     RAMBlock *block;
+    int ret;
 
     if (compress_threads_save_setup()) {
         return -1;
@@ -3026,7 +3027,11 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     ram_control_before_iterate(f, RAM_CONTROL_SETUP);
     ram_control_after_iterate(f, RAM_CONTROL_SETUP);
 
-    multifd_send_sync_main(f);
+    ret =  multifd_send_sync_main(f, false);
+    if (ret < 0) {
+        return ret;
+    }
+
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
     qemu_fflush(f);
 
@@ -3135,7 +3140,11 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 out:
     if (ret >= 0
         && migration_is_setup_or_active(migrate_get_current()->state)) {
-        multifd_send_sync_main(rs->f);
+        ret = multifd_send_sync_main(rs->f, true);
+        if (ret < 0) {
+            return ret;
+        }
+
         qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
         qemu_fflush(f);
         ram_counters.transferred += 8;
@@ -3193,13 +3202,19 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         ram_control_after_iterate(f, RAM_CONTROL_FINISH);
     }
 
-    if (ret >= 0) {
-        multifd_send_sync_main(rs->f);
-        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
-        qemu_fflush(f);
+    if (ret < 0) {
+        return ret;
     }
 
-    return ret;
+    ret = multifd_send_sync_main(rs->f, false);
+    if (ret < 0) {
+        return ret;
+    }
+
+    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+    qemu_fflush(f);
+
+    return 0;
 }
 
 static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
diff --git a/migration/socket.c b/migration/socket.c
index f7a77aafd3..23b03e6190 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -78,8 +78,9 @@ static void socket_outgoing_migration(QIOTask *task,
         trace_migration_socket_outgoing_connected(data->hostname);
     }
 
-    if (migrate_use_zero_copy()) {
-        error_setg(&err, "Zero copy not available in migration");
+    if (migrate_use_zero_copy() &&
+        !qio_channel_has_feature(sioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
+        error_setg(&err, "Zero copy feature not detected in host kernel");
     }
 
     migration_channel_connect(data->s, sioc, data->hostname, err);
-- 
2.33.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v6 1/6] QIOChannel: Add io_writev_zero_copy & io_flush_zero_copy callbacks
  2021-12-09  9:39 ` [PATCH v6 1/6] QIOChannel: Add io_writev_zero_copy & io_flush_zero_copy callbacks Leonardo Bras
@ 2021-12-10 12:15   ` Daniel P. Berrangé
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel P. Berrangé @ 2021-12-10 12:15 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: qemu-devel, Markus Armbruster, Eric Blake,
	Dr. David Alan Gilbert, Juan Quintela

On Thu, Dec 09, 2021 at 06:39:19AM -0300, Leonardo Bras wrote:
> Adds io_writev_zero_copy and io_flush_zero_copy as optional callback to QIOChannelClass,
> allowing the implementation of zero copy writes by subclasses.
> 
> How to use them:
> - Write data using qio_channel_writev_zero_copy(),
> - Wait write completion with qio_channel_flush_zero_copy().
> 
> Notes:
> As some zero copy implementations work asynchronously, it's
> recommended to keep the write buffer untouched until the return of
> qio_channel_flush_zero_copy(), to avoid the risk of sending an updated
> buffer instead of the one at the write.
> 
> As the new callbacks are optional, if a subclass does not implement them, then:
> - io_writev_zero_copy will return -1,
> - io_flush_zero_copy will return 0 without changing anything.
> 
> Also, some functions like qio_channel_writev_full_all() were adapted to
> receive a flag parameter. That allows shared code between zero copy and
> non-zero copy writev, and also an easier implementation on new flags.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  include/io/channel.h | 98 +++++++++++++++++++++++++++++++++++++++-----
>  io/channel.c         | 66 +++++++++++++++++++++++------
>  2 files changed, 142 insertions(+), 22 deletions(-)
> 
> diff --git a/include/io/channel.h b/include/io/channel.h
> index 88988979f8..83fa970a19 100644
> --- a/include/io/channel.h
> +++ b/include/io/channel.h
> @@ -32,12 +32,15 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass,
>  
>  #define QIO_CHANNEL_ERR_BLOCK -2
>  
> +#define QIO_CHANNEL_WRITE_FLAG_ZERO_COPY 0x1
> +
>  typedef enum QIOChannelFeature QIOChannelFeature;
>  
>  enum QIOChannelFeature {
>      QIO_CHANNEL_FEATURE_FD_PASS,
>      QIO_CHANNEL_FEATURE_SHUTDOWN,
>      QIO_CHANNEL_FEATURE_LISTEN,
> +    QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
>  };
>  
>  
> @@ -136,6 +139,12 @@ struct QIOChannelClass {
>                                    IOHandler *io_read,
>                                    IOHandler *io_write,
>                                    void *opaque);
> +    ssize_t (*io_writev_zero_copy)(QIOChannel *ioc,
> +                                   const struct iovec *iov,
> +                                   size_t niov,
> +                                   Error **errp);
> +    int (*io_flush_zero_copy)(QIOChannel *ioc,
> +                              Error **errp);
>  };

I've still got the same feedback as previous iterations. It
does not make sense to having both separate callbacks / APIs
and also add flags to existing methods. It just solves thue
same problem twice which si redundant.

I had suggested separate callbacks originally because I
thought we would need to have different signature with
ability to get completions. We've done completions with
a separate API call though.

So the separate zero_copy methods aren't so compelling
as an idea, and we could just use flags only in
retrospect.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-12-10 12:17 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-09  9:39 [PATCH v6 0/6] MSG_ZEROCOPY + multifd Leonardo Bras
2021-12-09  9:39 ` [PATCH v6 1/6] QIOChannel: Add io_writev_zero_copy & io_flush_zero_copy callbacks Leonardo Bras
2021-12-10 12:15   ` Daniel P. Berrangé
2021-12-09  9:39 ` [PATCH v6 2/6] QIOChannelSocket: Add flags parameter for writing Leonardo Bras
2021-12-09  9:39 ` [PATCH v6 3/6] QIOChannelSocket: Implement io_writev_zero_copy & io_flush_zero_copy for CONFIG_LINUX Leonardo Bras
2021-12-09  9:39 ` [PATCH v6 4/6] migration: Add zero-copy parameter for QMP/HMP for Linux Leonardo Bras
2021-12-09  9:39 ` [PATCH v6 5/6] migration: Add migrate_use_tls() helper Leonardo Bras
2021-12-09  9:39 ` [PATCH v6 6/6] multifd: Implement zero copy write in multifd migration (multifd-zero-copy) Leonardo Bras

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).