All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v11 0/7] MSG_ZEROCOPY + multifd
@ 2022-05-04 19:18 Leonardo Bras
  2022-05-04 19:18 ` [PATCH v11 1/7] QIOChannel: Add flags on io_writev and introduce io_flush callback Leonardo Bras
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Leonardo Bras @ 2022-05-04 19:18 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

This patch series intends to enable MSG_ZEROCOPY in QIOChannel, and make
use of it for multifd migration performance improvement, by reducing cpu
usage.

Patch #1 creates new callbacks for QIOChannel, allowing the implementation
of zero copy writing.

Patch #2 implements io_writev flags and io_flush() on QIOChannelSocket,
making use of MSG_ZEROCOPY on Linux.

Patch #3 adds a "zero_copy_send" migration property, only available with
CONFIG_LINUX, and compiled-out in any other architectures.
This migration property has to be enabled before multifd migration starts.

Patch #4 adds a helper function that allows to see if TLS is going to be used.
This helper will be later used in patch #5.

Patch #5 changes multifd_send_sync_main() so it returns int instead of void.
The return value is used to understand if any error happened in the function,
allowing migration to possible fail earlier.

Patch #6 implements an workaround: The behavior introduced in d48c3a0445 is
hard to deal with in zerocopy, so a workaround is introduced to send the
header in a different syscall, without MSG_ZEROCOPY.

Patch #7 Makes use of QIOChannelSocket zero_copy implementation on
nocomp multifd migration.

Results:
In preliminary tests, the resource usage of __sys_sendmsg() reduced 15 times,
and the overall migration took 13-22% less time, based in synthetic cpu
workload.

In further tests, it was noted that, on multifd migration with 8 channels:
- On idle hosts, migration time reduced in 10% to 21%.
- On hosts busy with heavy cpu stress (1 stress thread per cpu, but
  not cpu-pinned) migration time reduced in ~25% by enabling zero-copy.
- On hosts with heavy cpu-pinned workloads (1 stress thread per cpu, 
  cpu-pinned), migration time reducted in ~66% by enabling zero-copy.

Above tests setup:
- Sending and Receiving hosts:
  - CPU : Intel(R) Xeon(R) Platinum 8276L CPU @ 2.20GHz (448 CPUS)
  - Network card: E810-C (100Gbps)
  - >1TB RAM
  - QEMU: Upstream master branch + This patchset
  - Linux: Upstream v5.15 
- VM configuration:
  - 28 VCPUs
  - 512GB RAM


---
Changes since v10:
- Patch #2 was breaking build on systems with glibc < glibc-2.27,
  and probably non-linux builds.
- Also on Patch #2, replaced bits/socket.h with sys/socket.h,
  (thanks Peter Xu)

Changes since v9:
- Patch #6 got simplified and improved (thanks Daniel)
- Patch #7 got better comments (thanks Peter Xu)

Changes since v8:
- Inserted two new patches #5 & #6, previous patch #5 is now #7.
- Workaround an optimization introduced in d48c3a0445
- Removed unnecessary assert in qio_channel_writev_full_all

Changes since v7:
- Migration property renamed from zero-copy to zero-copy-send
- A few early tests added to help misconfigurations to fail earlier
- qio_channel_full*_flags() renamed back to qio_channel_full*()
- multifd_send_sync_main() reverted back to not receiving a flag,
  so it always sync zero-copy when enabled.
- Improve code quality on a few points

Changes since v6:
- Remove io_writev_zero_copy(), and makes use of io_writev() new flags
  to achieve the same results.
- Rename io_flush_zero_copy() to io_flush()
- Previous patch #2 became too small, so it was squashed in previous
  patch #3 (now patch #2)

Changes since v5:
- flush_zero_copy now returns -1 on fail, 0 on success, and 1 when all
  processed writes were not able to use zerocopy in kernel.
- qio_channel_socket_poll() removed, using qio_channel_wait() instead
- ENOBUFS is now processed inside qio_channel_socket_writev_flags()
- Most zerocopy parameter validation moved to migrate_params_check(),
  leaving only feature test to socket_outgoing_migration() callback
- Naming went from *zerocopy to *zero_copy or *zero-copy, due to QAPI/QMP
  preferences
- Improved docs

Changes since v4:
- 3 patches got splitted in 6
- Flush is used for syncing after each iteration, instead of only at the end
- If zerocopy is not available, fail in connect instead of failing on write
- 'multifd-zerocopy' property renamed to 'zerocopy'
- Fail migrations that don't support zerocopy, if it's enabled.
- Instead of checking for zerocopy at each write, save the flags in
  MultiFDSendParams->write_flags and use them on write
- Reorganized flag usage in QIOChannelSocket 
- A lot of typos fixed
- More doc on buffer restrictions

Changes since v3:
- QIOChannel interface names changed from io_async_{writev,flush} to
  io_{writev,flush}_zerocopy
- Instead of falling back in case zerocopy is not implemented, return
  error and abort operation.
- Flush now waits as long as needed, or return error in case anything
  goes wrong, aborting the operation.
- Zerocopy is now conditional in multifd, being set by parameter
  multifd-zerocopy
- Moves zerocopy_flush to multifd_send_sync_main() from multifd_save_cleanup
  so migration can abort if flush goes wrong.
- Several other small improvements

Changes since v2:
- Patch #1: One more fallback
- Patch #2: Fall back to sync if fails to lock buffer memory in MSG_ZEROCOPY send.

Changes since v1:
- Reimplemented the patchset using async_write + async_flush approach.
- Implemented a flush to be able to tell whenever all data was written.

Leonardo Bras (7):
  QIOChannel: Add flags on io_writev and introduce io_flush callback
  QIOChannelSocket: Implement io_writev zero copy flag & io_flush for
    CONFIG_LINUX
  migration: Add zero-copy-send parameter for QMP/HMP for Linux
  migration: Add migrate_use_tls() helper
  multifd: multifd_send_sync_main now returns negative on error
  multifd: Send header packet without flags if zero-copy-send is enabled
  multifd: Implement zero copy write in multifd migration
    (multifd-zero-copy)

 qapi/migration.json                 |  24 ++++++
 include/io/channel-socket.h         |   2 +
 include/io/channel.h                |  38 ++++++++-
 migration/migration.h               |   6 ++
 migration/multifd.h                 |   4 +-
 chardev/char-io.c                   |   2 +-
 hw/remote/mpqemu-link.c             |   2 +-
 io/channel-buffer.c                 |   1 +
 io/channel-command.c                |   1 +
 io/channel-file.c                   |   1 +
 io/channel-socket.c                 | 122 +++++++++++++++++++++++++++-
 io/channel-tls.c                    |   1 +
 io/channel-websock.c                |   1 +
 io/channel.c                        |  49 ++++++++---
 migration/channel.c                 |   3 +-
 migration/migration.c               |  52 +++++++++++-
 migration/multifd.c                 |  74 ++++++++++++++---
 migration/ram.c                     |  29 +++++--
 migration/rdma.c                    |   1 +
 migration/socket.c                  |  12 ++-
 monitor/hmp-cmds.c                  |   6 ++
 scsi/pr-manager-helper.c            |   2 +-
 tests/unit/test-io-channel-socket.c |   1 +
 23 files changed, 390 insertions(+), 44 deletions(-)

-- 
2.36.0



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v11 1/7] QIOChannel: Add flags on io_writev and introduce io_flush callback
  2022-05-04 19:18 [PATCH v11 0/7] MSG_ZEROCOPY + multifd Leonardo Bras
@ 2022-05-04 19:18 ` Leonardo Bras
  2022-05-04 19:18 ` [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX Leonardo Bras
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Leonardo Bras @ 2022-05-04 19:18 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

Add flags to io_writev and introduce io_flush as optional callback to
QIOChannelClass, allowing the implementation of zero copy writes by
subclasses.

How to use them:
- Write data using qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
- Wait write completion with qio_channel_flush().

Notes:
As some zero copy write implementations work asynchronously, it's
recommended to keep the write buffer untouched until the return of
qio_channel_flush(), to avoid the risk of sending an updated buffer
instead of the buffer state during write.

As io_flush callback is optional, if a subclass does not implement it, then:
- io_flush will return 0 without changing anything.

Also, some functions like qio_channel_writev_full_all() were adapted to
receive a flag parameter. That allows shared code between zero copy and
non-zero copy writev, and also an easier implementation on new flags.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 include/io/channel.h                | 38 +++++++++++++++++++++-
 chardev/char-io.c                   |  2 +-
 hw/remote/mpqemu-link.c             |  2 +-
 io/channel-buffer.c                 |  1 +
 io/channel-command.c                |  1 +
 io/channel-file.c                   |  1 +
 io/channel-socket.c                 |  2 ++
 io/channel-tls.c                    |  1 +
 io/channel-websock.c                |  1 +
 io/channel.c                        | 49 +++++++++++++++++++++++------
 migration/rdma.c                    |  1 +
 scsi/pr-manager-helper.c            |  2 +-
 tests/unit/test-io-channel-socket.c |  1 +
 13 files changed, 88 insertions(+), 14 deletions(-)

diff --git a/include/io/channel.h b/include/io/channel.h
index 88988979f8..c680ee7480 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -32,12 +32,15 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass,
 
 #define QIO_CHANNEL_ERR_BLOCK -2
 
+#define QIO_CHANNEL_WRITE_FLAG_ZERO_COPY 0x1
+
 typedef enum QIOChannelFeature QIOChannelFeature;
 
 enum QIOChannelFeature {
     QIO_CHANNEL_FEATURE_FD_PASS,
     QIO_CHANNEL_FEATURE_SHUTDOWN,
     QIO_CHANNEL_FEATURE_LISTEN,
+    QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
 };
 
 
@@ -104,6 +107,7 @@ struct QIOChannelClass {
                          size_t niov,
                          int *fds,
                          size_t nfds,
+                         int flags,
                          Error **errp);
     ssize_t (*io_readv)(QIOChannel *ioc,
                         const struct iovec *iov,
@@ -136,6 +140,8 @@ struct QIOChannelClass {
                                   IOHandler *io_read,
                                   IOHandler *io_write,
                                   void *opaque);
+    int (*io_flush)(QIOChannel *ioc,
+                    Error **errp);
 };
 
 /* General I/O handling functions */
@@ -228,6 +234,7 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc,
  * @niov: the length of the @iov array
  * @fds: an array of file handles to send
  * @nfds: number of file handles in @fds
+ * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
  * @errp: pointer to a NULL-initialized error object
  *
  * Write data to the IO channel, reading it from the
@@ -260,6 +267,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
                                 size_t niov,
                                 int *fds,
                                 size_t nfds,
+                                int flags,
                                 Error **errp);
 
 /**
@@ -837,6 +845,7 @@ int qio_channel_readv_full_all(QIOChannel *ioc,
  * @niov: the length of the @iov array
  * @fds: an array of file handles to send
  * @nfds: number of file handles in @fds
+ * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*)
  * @errp: pointer to a NULL-initialized error object
  *
  *
@@ -846,6 +855,14 @@ int qio_channel_readv_full_all(QIOChannel *ioc,
  * to be written, yielding from the current coroutine
  * if required.
  *
+ * If QIO_CHANNEL_WRITE_FLAG_ZERO_COPY is passed in flags,
+ * instead of waiting for all requested data to be written,
+ * this function will wait until it's all queued for writing.
+ * In this case, if the buffer gets changed between queueing and
+ * sending, the updated buffer will be sent. If this is not a
+ * desired behavior, it's suggested to call qio_channel_flush()
+ * before reusing the buffer.
+ *
  * Returns: 0 if all bytes were written, or -1 on error
  */
 
@@ -853,6 +870,25 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
                                 const struct iovec *iov,
                                 size_t niov,
                                 int *fds, size_t nfds,
-                                Error **errp);
+                                int flags, Error **errp);
+
+/**
+ * qio_channel_flush:
+ * @ioc: the channel object
+ * @errp: pointer to a NULL-initialized error object
+ *
+ * Will block until every packet queued with
+ * qio_channel_writev_full() + QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
+ * is sent, or return in case of any error.
+ *
+ * If not implemented, acts as a no-op, and returns 0.
+ *
+ * Returns -1 if any error is found,
+ *          1 if every send failed to use zero copy.
+ *          0 otherwise.
+ */
+
+int qio_channel_flush(QIOChannel *ioc,
+                      Error **errp);
 
 #endif /* QIO_CHANNEL_H */
diff --git a/chardev/char-io.c b/chardev/char-io.c
index 8ced184160..4451128cba 100644
--- a/chardev/char-io.c
+++ b/chardev/char-io.c
@@ -122,7 +122,7 @@ int io_channel_send_full(QIOChannel *ioc,
 
         ret = qio_channel_writev_full(
             ioc, &iov, 1,
-            fds, nfds, NULL);
+            fds, nfds, 0, NULL);
         if (ret == QIO_CHANNEL_ERR_BLOCK) {
             if (offset) {
                 return offset;
diff --git a/hw/remote/mpqemu-link.c b/hw/remote/mpqemu-link.c
index 2a4aa651ca..9bd98e8219 100644
--- a/hw/remote/mpqemu-link.c
+++ b/hw/remote/mpqemu-link.c
@@ -68,7 +68,7 @@ bool mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc, Error **errp)
     }
 
     if (!qio_channel_writev_full_all(ioc, send, G_N_ELEMENTS(send),
-                                    fds, nfds, errp)) {
+                                    fds, nfds, 0, errp)) {
         ret = true;
     } else {
         trace_mpqemu_send_io_error(msg->cmd, msg->size, nfds);
diff --git a/io/channel-buffer.c b/io/channel-buffer.c
index baa4e2b089..bf52011be2 100644
--- a/io/channel-buffer.c
+++ b/io/channel-buffer.c
@@ -81,6 +81,7 @@ static ssize_t qio_channel_buffer_writev(QIOChannel *ioc,
                                          size_t niov,
                                          int *fds,
                                          size_t nfds,
+                                         int flags,
                                          Error **errp)
 {
     QIOChannelBuffer *bioc = QIO_CHANNEL_BUFFER(ioc);
diff --git a/io/channel-command.c b/io/channel-command.c
index 338da73ade..54560464ae 100644
--- a/io/channel-command.c
+++ b/io/channel-command.c
@@ -258,6 +258,7 @@ static ssize_t qio_channel_command_writev(QIOChannel *ioc,
                                           size_t niov,
                                           int *fds,
                                           size_t nfds,
+                                          int flags,
                                           Error **errp)
 {
     QIOChannelCommand *cioc = QIO_CHANNEL_COMMAND(ioc);
diff --git a/io/channel-file.c b/io/channel-file.c
index d7cf6d278f..ef6807a6be 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -114,6 +114,7 @@ static ssize_t qio_channel_file_writev(QIOChannel *ioc,
                                        size_t niov,
                                        int *fds,
                                        size_t nfds,
+                                       int flags,
                                        Error **errp)
 {
     QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 9f5ddf68b6..696a04dc9c 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -524,6 +524,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
                                          size_t niov,
                                          int *fds,
                                          size_t nfds,
+                                         int flags,
                                          Error **errp)
 {
     QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
@@ -619,6 +620,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
                                          size_t niov,
                                          int *fds,
                                          size_t nfds,
+                                         int flags,
                                          Error **errp)
 {
     QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
diff --git a/io/channel-tls.c b/io/channel-tls.c
index 2ae1b92fc0..4ce890a538 100644
--- a/io/channel-tls.c
+++ b/io/channel-tls.c
@@ -301,6 +301,7 @@ static ssize_t qio_channel_tls_writev(QIOChannel *ioc,
                                       size_t niov,
                                       int *fds,
                                       size_t nfds,
+                                      int flags,
                                       Error **errp)
 {
     QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
diff --git a/io/channel-websock.c b/io/channel-websock.c
index 55145a6a8c..9619906ac3 100644
--- a/io/channel-websock.c
+++ b/io/channel-websock.c
@@ -1127,6 +1127,7 @@ static ssize_t qio_channel_websock_writev(QIOChannel *ioc,
                                           size_t niov,
                                           int *fds,
                                           size_t nfds,
+                                          int flags,
                                           Error **errp)
 {
     QIOChannelWebsock *wioc = QIO_CHANNEL_WEBSOCK(ioc);
diff --git a/io/channel.c b/io/channel.c
index e8b019dc36..0640941ac5 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -72,18 +72,32 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
                                 size_t niov,
                                 int *fds,
                                 size_t nfds,
+                                int flags,
                                 Error **errp)
 {
     QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
 
-    if ((fds || nfds) &&
-        !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
+    if (fds || nfds) {
+        if (!qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_FD_PASS)) {
+            error_setg_errno(errp, EINVAL,
+                             "Channel does not support file descriptor passing");
+            return -1;
+        }
+        if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+            error_setg_errno(errp, EINVAL,
+                             "Zero Copy does not support file descriptor passing");
+            return -1;
+        }
+    }
+
+    if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) &&
+        !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
         error_setg_errno(errp, EINVAL,
-                         "Channel does not support file descriptor passing");
+                         "Requested Zero Copy feature is not available");
         return -1;
     }
 
-    return klass->io_writev(ioc, iov, niov, fds, nfds, errp);
+    return klass->io_writev(ioc, iov, niov, fds, nfds, flags, errp);
 }
 
 
@@ -217,14 +231,14 @@ int qio_channel_writev_all(QIOChannel *ioc,
                            size_t niov,
                            Error **errp)
 {
-    return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0, errp);
+    return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0, 0, errp);
 }
 
 int qio_channel_writev_full_all(QIOChannel *ioc,
                                 const struct iovec *iov,
                                 size_t niov,
                                 int *fds, size_t nfds,
-                                Error **errp)
+                                int flags, Error **errp)
 {
     int ret = -1;
     struct iovec *local_iov = g_new(struct iovec, niov);
@@ -237,8 +251,10 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
 
     while (nlocal_iov > 0) {
         ssize_t len;
-        len = qio_channel_writev_full(ioc, local_iov, nlocal_iov, fds, nfds,
-                                      errp);
+
+        len = qio_channel_writev_full(ioc, local_iov, nlocal_iov, fds,
+                                            nfds, flags, errp);
+
         if (len == QIO_CHANNEL_ERR_BLOCK) {
             if (qemu_in_coroutine()) {
                 qio_channel_yield(ioc, G_IO_OUT);
@@ -277,7 +293,7 @@ ssize_t qio_channel_writev(QIOChannel *ioc,
                            size_t niov,
                            Error **errp)
 {
-    return qio_channel_writev_full(ioc, iov, niov, NULL, 0, errp);
+    return qio_channel_writev_full(ioc, iov, niov, NULL, 0, 0, errp);
 }
 
 
@@ -297,7 +313,7 @@ ssize_t qio_channel_write(QIOChannel *ioc,
                           Error **errp)
 {
     struct iovec iov = { .iov_base = (char *)buf, .iov_len = buflen };
-    return qio_channel_writev_full(ioc, &iov, 1, NULL, 0, errp);
+    return qio_channel_writev_full(ioc, &iov, 1, NULL, 0, 0, errp);
 }
 
 
@@ -473,6 +489,19 @@ off_t qio_channel_io_seek(QIOChannel *ioc,
     return klass->io_seek(ioc, offset, whence, errp);
 }
 
+int qio_channel_flush(QIOChannel *ioc,
+                                Error **errp)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (!klass->io_flush ||
+        !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
+        return 0;
+    }
+
+    return klass->io_flush(ioc, errp);
+}
+
 
 static void qio_channel_restart_read(void *opaque)
 {
diff --git a/migration/rdma.c b/migration/rdma.c
index ef1e65ec36..672d1958a9 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2840,6 +2840,7 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
                                        size_t niov,
                                        int *fds,
                                        size_t nfds,
+                                       int flags,
                                        Error **errp)
 {
     QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
diff --git a/scsi/pr-manager-helper.c b/scsi/pr-manager-helper.c
index 451c7631b7..3be52a98d5 100644
--- a/scsi/pr-manager-helper.c
+++ b/scsi/pr-manager-helper.c
@@ -77,7 +77,7 @@ static int pr_manager_helper_write(PRManagerHelper *pr_mgr,
         iov.iov_base = (void *)buf;
         iov.iov_len = sz;
         n_written = qio_channel_writev_full(QIO_CHANNEL(pr_mgr->ioc), &iov, 1,
-                                            nfds ? &fd : NULL, nfds, errp);
+                                            nfds ? &fd : NULL, nfds, 0, errp);
 
         if (n_written <= 0) {
             assert(n_written != QIO_CHANNEL_ERR_BLOCK);
diff --git a/tests/unit/test-io-channel-socket.c b/tests/unit/test-io-channel-socket.c
index c49eec1f03..6713886d02 100644
--- a/tests/unit/test-io-channel-socket.c
+++ b/tests/unit/test-io-channel-socket.c
@@ -444,6 +444,7 @@ static void test_io_channel_unix_fd_pass(void)
                             G_N_ELEMENTS(iosend),
                             fdsend,
                             G_N_ELEMENTS(fdsend),
+                            0,
                             &error_abort);
 
     qio_channel_readv_full(dst,
-- 
2.36.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
  2022-05-04 19:18 [PATCH v11 0/7] MSG_ZEROCOPY + multifd Leonardo Bras
  2022-05-04 19:18 ` [PATCH v11 1/7] QIOChannel: Add flags on io_writev and introduce io_flush callback Leonardo Bras
@ 2022-05-04 19:18 ` Leonardo Bras
  2022-05-04 19:53   ` Peter Xu
  2022-05-05  8:05   ` Daniel P. Berrangé
  2022-05-04 19:18 ` [PATCH v11 3/7] migration: Add zero-copy-send parameter for QMP/HMP for Linux Leonardo Bras
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 16+ messages in thread
From: Leonardo Bras @ 2022-05-04 19:18 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

For CONFIG_LINUX, implement the new zero copy flag and the optional callback
io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
feature is available in the host kernel, which is checked on
qio_channel_socket_connect_sync()

qio_channel_socket_flush() was implemented by counting how many times
sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
socket's error queue, in order to find how many of them finished sending.
Flush will loop until those counters are the same, or until some error occurs.

Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
1: Buffer
- As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying,
some caution is necessary to avoid overwriting any buffer before it's sent.
If something like this happen, a newer version of the buffer may be sent instead.
- If this is a problem, it's recommended to call qio_channel_flush() before freeing
or re-using the buffer.

2: Locked memory
- When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and
unlocked after it's sent.
- Depending on the size of each buffer, and how often it's sent, it may require
a larger amount of locked memory than usually available to non-root user.
- If the required amount of locked memory is not available, writev_zero_copy
will return an error, which can abort an operation like migration,
- Because of this, when an user code wants to add zero copy as a feature, it
requires a mechanism to disable it, so it can still be accessible to less
privileged users.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 include/io/channel-socket.h |   2 +
 io/channel-socket.c         | 120 ++++++++++++++++++++++++++++++++++--
 2 files changed, 118 insertions(+), 4 deletions(-)

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index e747e63514..513c428fe4 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -47,6 +47,8 @@ struct QIOChannelSocket {
     socklen_t localAddrLen;
     struct sockaddr_storage remoteAddr;
     socklen_t remoteAddrLen;
+    ssize_t zero_copy_queued;
+    ssize_t zero_copy_sent;
 };
 
 
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 696a04dc9c..ae756ce166 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -25,9 +25,25 @@
 #include "io/channel-watch.h"
 #include "trace.h"
 #include "qapi/clone-visitor.h"
+#ifdef CONFIG_LINUX
+#include <linux/errqueue.h>
+#include <sys/socket.h>
+#endif
 
 #define SOCKET_MAX_FDS 16
 
+/*
+ * Zero-copy defines bellow are included to avoid breaking builds on systems
+ * that don't support MSG_ZEROCOPY, while keeping the functions more readable
+ * (without a lot of ifdefs).
+ */
+#ifndef MSG_ZEROCOPY
+#define MSG_ZEROCOPY 0x4000000
+#endif
+#ifndef SO_ZEROCOPY
+#define SO_ZEROCOPY 60
+#endif
+
 SocketAddress *
 qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
                                      Error **errp)
@@ -54,6 +70,8 @@ qio_channel_socket_new(void)
 
     sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
     sioc->fd = -1;
+    sioc->zero_copy_queued = 0;
+    sioc->zero_copy_sent = 0;
 
     ioc = QIO_CHANNEL(sioc);
     qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -153,6 +171,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
         return -1;
     }
 
+#ifdef CONFIG_LINUX
+    int ret, v = 1;
+    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
+    if (ret == 0) {
+        /* Zero copy available on host */
+        qio_channel_set_feature(QIO_CHANNEL(ioc),
+                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
+    }
+#endif
+
     return 0;
 }
 
@@ -533,6 +561,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
     char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)];
     size_t fdsize = sizeof(int) * nfds;
     struct cmsghdr *cmsg;
+    int sflags = 0;
 
     memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
 
@@ -557,15 +586,27 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
         memcpy(CMSG_DATA(cmsg), fds, fdsize);
     }
 
+    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
+        sflags = MSG_ZEROCOPY;
+    }
+
  retry:
-    ret = sendmsg(sioc->fd, &msg, 0);
+    ret = sendmsg(sioc->fd, &msg, sflags);
     if (ret <= 0) {
-        if (errno == EAGAIN) {
+        switch (errno) {
+        case EAGAIN:
             return QIO_CHANNEL_ERR_BLOCK;
-        }
-        if (errno == EINTR) {
+        case EINTR:
             goto retry;
+        case ENOBUFS:
+            if (sflags & MSG_ZEROCOPY) {
+                error_setg_errno(errp, errno,
+                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
+                return -1;
+            }
+            break;
         }
+
         error_setg_errno(errp, errno,
                          "Unable to write to socket");
         return -1;
@@ -659,6 +700,74 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
 }
 #endif /* WIN32 */
 
+
+#ifdef CONFIG_LINUX
+static int qio_channel_socket_flush(QIOChannel *ioc,
+                                    Error **errp)
+{
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    struct msghdr msg = {};
+    struct sock_extended_err *serr;
+    struct cmsghdr *cm;
+    char control[CMSG_SPACE(sizeof(*serr))];
+    int received;
+    int ret = 1;
+
+    msg.msg_control = control;
+    msg.msg_controllen = sizeof(control);
+    memset(control, 0, sizeof(control));
+
+    while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
+        received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
+        if (received < 0) {
+            switch (errno) {
+            case EAGAIN:
+                /* Nothing on errqueue, wait until something is available */
+                qio_channel_wait(ioc, G_IO_ERR);
+                continue;
+            case EINTR:
+                continue;
+            default:
+                error_setg_errno(errp, errno,
+                                 "Unable to read errqueue");
+                return -1;
+            }
+        }
+
+        cm = CMSG_FIRSTHDR(&msg);
+        if (cm->cmsg_level != SOL_IP &&
+            cm->cmsg_type != IP_RECVERR) {
+            error_setg_errno(errp, EPROTOTYPE,
+                             "Wrong cmsg in errqueue");
+            return -1;
+        }
+
+        serr = (void *) CMSG_DATA(cm);
+        if (serr->ee_errno != SO_EE_ORIGIN_NONE) {
+            error_setg_errno(errp, serr->ee_errno,
+                             "Error on socket");
+            return -1;
+        }
+        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+            error_setg_errno(errp, serr->ee_origin,
+                             "Error not from zero copy");
+            return -1;
+        }
+
+        /* No errors, count successfully finished sendmsg()*/
+        sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
+
+        /* If any sendmsg() succeeded using zero copy, return 0 at the end */
+        if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
+            ret = 0;
+        }
+    }
+
+    return ret;
+}
+
+#endif /* CONFIG_LINUX */
+
 static int
 qio_channel_socket_set_blocking(QIOChannel *ioc,
                                 bool enabled,
@@ -789,6 +898,9 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
     ioc_klass->io_set_delay = qio_channel_socket_set_delay;
     ioc_klass->io_create_watch = qio_channel_socket_create_watch;
     ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
+#ifdef CONFIG_LINUX
+    ioc_klass->io_flush = qio_channel_socket_flush;
+#endif
 }
 
 static const TypeInfo qio_channel_socket_info = {
-- 
2.36.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v11 3/7] migration: Add zero-copy-send parameter for QMP/HMP for Linux
  2022-05-04 19:18 [PATCH v11 0/7] MSG_ZEROCOPY + multifd Leonardo Bras
  2022-05-04 19:18 ` [PATCH v11 1/7] QIOChannel: Add flags on io_writev and introduce io_flush callback Leonardo Bras
  2022-05-04 19:18 ` [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX Leonardo Bras
@ 2022-05-04 19:18 ` Leonardo Bras
  2022-05-05  5:55   ` Markus Armbruster
  2022-05-04 19:18 ` [PATCH v11 4/7] migration: Add migrate_use_tls() helper Leonardo Bras
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Leonardo Bras @ 2022-05-04 19:18 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

Add property that allows zero-copy migration of memory pages
on the sending side, and also includes a helper function
migrate_use_zero_copy_send() to check if it's enabled.

No code is introduced to actually do the migration, but it allow
future implementations to enable/disable this feature.

On non-Linux builds this parameter is compiled-out.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
---
 qapi/migration.json   | 24 ++++++++++++++++++++++++
 migration/migration.h |  5 +++++
 migration/migration.c | 32 ++++++++++++++++++++++++++++++++
 migration/socket.c    | 11 +++++++++--
 monitor/hmp-cmds.c    |  6 ++++++
 5 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 409eb086a2..04246481ce 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -741,6 +741,13 @@
 #                      will consume more CPU.
 #                      Defaults to 1. (Since 5.0)
 #
+# @zero-copy-send: Controls behavior on sending memory pages on migration.
+#                  When true, enables a zero-copy mechanism for sending memory
+#                  pages, if host supports it.
+#                  Requires that QEMU be permitted to use locked memory for guest
+#                  RAM pages.
+#                  Defaults to false. (Since 7.1)
+#
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #                        aliases for the purpose of dirty bitmap migration.  Such
 #                        aliases may for example be the corresponding names on the
@@ -780,6 +787,7 @@
            'xbzrle-cache-size', 'max-postcopy-bandwidth',
            'max-cpu-throttle', 'multifd-compression',
            'multifd-zlib-level' ,'multifd-zstd-level',
+           { 'name': 'zero-copy-send', 'if' : 'CONFIG_LINUX'},
            'block-bitmap-mapping' ] }
 
 ##
@@ -906,6 +914,13 @@
 #                      will consume more CPU.
 #                      Defaults to 1. (Since 5.0)
 #
+# @zero-copy-send: Controls behavior on sending memory pages on migration.
+#                  When true, enables a zero-copy mechanism for sending memory
+#                  pages, if host supports it.
+#                  Requires that QEMU be permitted to use locked memory for guest
+#                  RAM pages.
+#                  Defaults to false. (Since 7.1)
+#
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #                        aliases for the purpose of dirty bitmap migration.  Such
 #                        aliases may for example be the corresponding names on the
@@ -960,6 +975,7 @@
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
+            '*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
             '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
 
 ##
@@ -1106,6 +1122,13 @@
 #                      will consume more CPU.
 #                      Defaults to 1. (Since 5.0)
 #
+# @zero-copy-send: Controls behavior on sending memory pages on migration.
+#                  When true, enables a zero-copy mechanism for sending memory
+#                  pages, if host supports it.
+#                  Requires that QEMU be permitted to use locked memory for guest
+#                  RAM pages.
+#                  Defaults to false. (Since 7.1)
+#
 # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
 #                        aliases for the purpose of dirty bitmap migration.  Such
 #                        aliases may for example be the corresponding names on the
@@ -1158,6 +1181,7 @@
             '*multifd-compression': 'MultiFDCompression',
             '*multifd-zlib-level': 'uint8',
             '*multifd-zstd-level': 'uint8',
+            '*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
             '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
 
 ##
diff --git a/migration/migration.h b/migration/migration.h
index a863032b71..e8f2941a55 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -375,6 +375,11 @@ MultiFDCompression migrate_multifd_compression(void);
 int migrate_multifd_zlib_level(void);
 int migrate_multifd_zstd_level(void);
 
+#ifdef CONFIG_LINUX
+bool migrate_use_zero_copy_send(void);
+#else
+#define migrate_use_zero_copy_send() (false)
+#endif
 int migrate_use_xbzrle(void);
 uint64_t migrate_xbzrle_cache_size(void);
 bool migrate_colo_enabled(void);
diff --git a/migration/migration.c b/migration/migration.c
index 5a31b23bd6..3e91f4b5e2 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -910,6 +910,10 @@ MigrationParameters *qmp_query_migrate_parameters(Error **errp)
     params->multifd_zlib_level = s->parameters.multifd_zlib_level;
     params->has_multifd_zstd_level = true;
     params->multifd_zstd_level = s->parameters.multifd_zstd_level;
+#ifdef CONFIG_LINUX
+    params->has_zero_copy_send = true;
+    params->zero_copy_send = s->parameters.zero_copy_send;
+#endif
     params->has_xbzrle_cache_size = true;
     params->xbzrle_cache_size = s->parameters.xbzrle_cache_size;
     params->has_max_postcopy_bandwidth = true;
@@ -1567,6 +1571,11 @@ static void migrate_params_test_apply(MigrateSetParameters *params,
     if (params->has_multifd_compression) {
         dest->multifd_compression = params->multifd_compression;
     }
+#ifdef CONFIG_LINUX
+    if (params->has_zero_copy_send) {
+        dest->zero_copy_send = params->zero_copy_send;
+    }
+#endif
     if (params->has_xbzrle_cache_size) {
         dest->xbzrle_cache_size = params->xbzrle_cache_size;
     }
@@ -1679,6 +1688,11 @@ static void migrate_params_apply(MigrateSetParameters *params, Error **errp)
     if (params->has_multifd_compression) {
         s->parameters.multifd_compression = params->multifd_compression;
     }
+#ifdef CONFIG_LINUX
+    if (params->has_zero_copy_send) {
+        s->parameters.zero_copy_send = params->zero_copy_send;
+    }
+#endif
     if (params->has_xbzrle_cache_size) {
         s->parameters.xbzrle_cache_size = params->xbzrle_cache_size;
         xbzrle_cache_resize(params->xbzrle_cache_size, errp);
@@ -2563,6 +2577,17 @@ int migrate_multifd_zstd_level(void)
     return s->parameters.multifd_zstd_level;
 }
 
+#ifdef CONFIG_LINUX
+bool migrate_use_zero_copy_send(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->parameters.zero_copy_send;
+}
+#endif
+
 int migrate_use_xbzrle(void)
 {
     MigrationState *s;
@@ -4206,6 +4231,10 @@ static Property migration_properties[] = {
     DEFINE_PROP_UINT8("multifd-zstd-level", MigrationState,
                       parameters.multifd_zstd_level,
                       DEFAULT_MIGRATE_MULTIFD_ZSTD_LEVEL),
+#ifdef CONFIG_LINUX
+    DEFINE_PROP_BOOL("zero_copy_send", MigrationState,
+                      parameters.zero_copy_send, false),
+#endif
     DEFINE_PROP_SIZE("xbzrle-cache-size", MigrationState,
                       parameters.xbzrle_cache_size,
                       DEFAULT_MIGRATE_XBZRLE_CACHE_SIZE),
@@ -4303,6 +4332,9 @@ static void migration_instance_init(Object *obj)
     params->has_multifd_compression = true;
     params->has_multifd_zlib_level = true;
     params->has_multifd_zstd_level = true;
+#ifdef CONFIG_LINUX
+    params->has_zero_copy_send = true;
+#endif
     params->has_xbzrle_cache_size = true;
     params->has_max_postcopy_bandwidth = true;
     params->has_max_cpu_throttle = true;
diff --git a/migration/socket.c b/migration/socket.c
index 05705a32d8..3754d8f72c 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -74,9 +74,16 @@ static void socket_outgoing_migration(QIOTask *task,
 
     if (qio_task_propagate_error(task, &err)) {
         trace_migration_socket_outgoing_error(error_get_pretty(err));
-    } else {
-        trace_migration_socket_outgoing_connected(data->hostname);
+           goto out;
     }
+
+    trace_migration_socket_outgoing_connected(data->hostname);
+
+    if (migrate_use_zero_copy_send()) {
+        error_setg(&err, "Zero copy send not available in migration");
+    }
+
+out:
     migration_channel_connect(data->s, sioc, data->hostname, err);
     object_unref(OBJECT(sioc));
 }
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 93061a11af..622c783c32 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1309,6 +1309,12 @@ void hmp_migrate_set_parameter(Monitor *mon, const QDict *qdict)
         p->has_multifd_zstd_level = true;
         visit_type_uint8(v, param, &p->multifd_zstd_level, &err);
         break;
+#ifdef CONFIG_LINUX
+    case MIGRATION_PARAMETER_ZERO_COPY_SEND:
+        p->has_zero_copy_send = true;
+        visit_type_bool(v, param, &p->zero_copy_send, &err);
+        break;
+#endif
     case MIGRATION_PARAMETER_XBZRLE_CACHE_SIZE:
         p->has_xbzrle_cache_size = true;
         if (!visit_type_size(v, param, &cache_size, &err)) {
-- 
2.36.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v11 4/7] migration: Add migrate_use_tls() helper
  2022-05-04 19:18 [PATCH v11 0/7] MSG_ZEROCOPY + multifd Leonardo Bras
                   ` (2 preceding siblings ...)
  2022-05-04 19:18 ` [PATCH v11 3/7] migration: Add zero-copy-send parameter for QMP/HMP for Linux Leonardo Bras
@ 2022-05-04 19:18 ` Leonardo Bras
  2022-05-04 19:18 ` [PATCH v11 5/7] multifd: multifd_send_sync_main now returns negative on error Leonardo Bras
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Leonardo Bras @ 2022-05-04 19:18 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

A lot of places check parameters.tls_creds in order to evaluate if TLS is
in use, and sometimes call migrate_get_current() just for that test.

Add new helper function migrate_use_tls() in order to simplify testing
for TLS usage.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 migration/migration.h | 1 +
 migration/channel.c   | 3 +--
 migration/migration.c | 9 +++++++++
 migration/multifd.c   | 5 +----
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index e8f2941a55..485d58b95f 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -380,6 +380,7 @@ bool migrate_use_zero_copy_send(void);
 #else
 #define migrate_use_zero_copy_send() (false)
 #endif
+int migrate_use_tls(void);
 int migrate_use_xbzrle(void);
 uint64_t migrate_xbzrle_cache_size(void);
 bool migrate_colo_enabled(void);
diff --git a/migration/channel.c b/migration/channel.c
index c6a8dcf1d7..a162d00fea 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -38,8 +38,7 @@ void migration_channel_process_incoming(QIOChannel *ioc)
     trace_migration_set_incoming_channel(
         ioc, object_get_typename(OBJECT(ioc)));
 
-    if (s->parameters.tls_creds &&
-        *s->parameters.tls_creds &&
+    if (migrate_use_tls() &&
         !object_dynamic_cast(OBJECT(ioc),
                              TYPE_QIO_CHANNEL_TLS)) {
         migration_tls_channel_process_incoming(s, ioc, &local_err);
diff --git a/migration/migration.c b/migration/migration.c
index 3e91f4b5e2..4b6df2eb5e 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2588,6 +2588,15 @@ bool migrate_use_zero_copy_send(void)
 }
 #endif
 
+int migrate_use_tls(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->parameters.tls_creds && *s->parameters.tls_creds;
+}
+
 int migrate_use_xbzrle(void)
 {
     MigrationState *s;
diff --git a/migration/multifd.c b/migration/multifd.c
index 9ea4f581e2..2a8c8570c3 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -782,15 +782,12 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
                                     QIOChannel *ioc,
                                     Error *error)
 {
-    MigrationState *s = migrate_get_current();
-
     trace_multifd_set_outgoing_channel(
         ioc, object_get_typename(OBJECT(ioc)),
         migrate_get_current()->hostname, error);
 
     if (!error) {
-        if (s->parameters.tls_creds &&
-            *s->parameters.tls_creds &&
+        if (migrate_use_tls() &&
             !object_dynamic_cast(OBJECT(ioc),
                                  TYPE_QIO_CHANNEL_TLS)) {
             multifd_tls_channel_connect(p, ioc, &error);
-- 
2.36.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v11 5/7] multifd: multifd_send_sync_main now returns negative on error
  2022-05-04 19:18 [PATCH v11 0/7] MSG_ZEROCOPY + multifd Leonardo Bras
                   ` (3 preceding siblings ...)
  2022-05-04 19:18 ` [PATCH v11 4/7] migration: Add migrate_use_tls() helper Leonardo Bras
@ 2022-05-04 19:18 ` Leonardo Bras
  2022-05-04 19:18 ` [PATCH v11 6/7] multifd: Send header packet without flags if zero-copy-send is enabled Leonardo Bras
  2022-05-04 19:18 ` [PATCH v11 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy) Leonardo Bras
  6 siblings, 0 replies; 16+ messages in thread
From: Leonardo Bras @ 2022-05-04 19:18 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

Even though multifd_send_sync_main() currently emits error_reports, it's
callers don't really check it before continuing.

Change multifd_send_sync_main() to return -1 on error and 0 on success.
Also change all it's callers to make use of this change and possibly fail
earlier.

(This change is important to next patch on  multifd zero copy
implementation, to make it sure an error in zero-copy flush does not go
unnoticed.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 migration/multifd.h |  2 +-
 migration/multifd.c | 10 ++++++----
 migration/ram.c     | 29 ++++++++++++++++++++++-------
 3 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index 7d0effcb03..bcf5992945 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -20,7 +20,7 @@ int multifd_load_cleanup(Error **errp);
 bool multifd_recv_all_channels_created(void);
 bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp);
 void multifd_recv_sync_main(void);
-void multifd_send_sync_main(QEMUFile *f);
+int multifd_send_sync_main(QEMUFile *f);
 int multifd_queue_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset);
 
 /* Multifd Compression flags */
diff --git a/migration/multifd.c b/migration/multifd.c
index 2a8c8570c3..15fb668e64 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -566,17 +566,17 @@ void multifd_save_cleanup(void)
     multifd_send_state = NULL;
 }
 
-void multifd_send_sync_main(QEMUFile *f)
+int multifd_send_sync_main(QEMUFile *f)
 {
     int i;
 
     if (!migrate_use_multifd()) {
-        return;
+        return 0;
     }
     if (multifd_send_state->pages->num) {
         if (multifd_send_pages(f) < 0) {
             error_report("%s: multifd_send_pages fail", __func__);
-            return;
+            return -1;
         }
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
@@ -589,7 +589,7 @@ void multifd_send_sync_main(QEMUFile *f)
         if (p->quit) {
             error_report("%s: channel %d has already quit", __func__, i);
             qemu_mutex_unlock(&p->mutex);
-            return;
+            return -1;
         }
 
         p->packet_num = multifd_send_state->packet_num++;
@@ -608,6 +608,8 @@ void multifd_send_sync_main(QEMUFile *f)
         qemu_sem_wait(&p->sem_sync);
     }
     trace_multifd_send_sync_main(multifd_send_state->packet_num);
+
+    return 0;
 }
 
 static void *multifd_send_thread(void *opaque)
diff --git a/migration/ram.c b/migration/ram.c
index a2489a2699..5f5e37f64d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2909,6 +2909,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 {
     RAMState **rsp = opaque;
     RAMBlock *block;
+    int ret;
 
     if (compress_threads_save_setup()) {
         return -1;
@@ -2943,7 +2944,11 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     ram_control_before_iterate(f, RAM_CONTROL_SETUP);
     ram_control_after_iterate(f, RAM_CONTROL_SETUP);
 
-    multifd_send_sync_main(f);
+    ret =  multifd_send_sync_main(f);
+    if (ret < 0) {
+        return ret;
+    }
+
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
     qemu_fflush(f);
 
@@ -3052,7 +3057,11 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 out:
     if (ret >= 0
         && migration_is_setup_or_active(migrate_get_current()->state)) {
-        multifd_send_sync_main(rs->f);
+        ret = multifd_send_sync_main(rs->f);
+        if (ret < 0) {
+            return ret;
+        }
+
         qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
         qemu_fflush(f);
         ram_transferred_add(8);
@@ -3112,13 +3121,19 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         ram_control_after_iterate(f, RAM_CONTROL_FINISH);
     }
 
-    if (ret >= 0) {
-        multifd_send_sync_main(rs->f);
-        qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
-        qemu_fflush(f);
+    if (ret < 0) {
+        return ret;
     }
 
-    return ret;
+    ret = multifd_send_sync_main(rs->f);
+    if (ret < 0) {
+        return ret;
+    }
+
+    qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
+    qemu_fflush(f);
+
+    return 0;
 }
 
 static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
-- 
2.36.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v11 6/7] multifd: Send header packet without flags if zero-copy-send is enabled
  2022-05-04 19:18 [PATCH v11 0/7] MSG_ZEROCOPY + multifd Leonardo Bras
                   ` (4 preceding siblings ...)
  2022-05-04 19:18 ` [PATCH v11 5/7] multifd: multifd_send_sync_main now returns negative on error Leonardo Bras
@ 2022-05-04 19:18 ` Leonardo Bras
  2022-05-04 19:18 ` [PATCH v11 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy) Leonardo Bras
  6 siblings, 0 replies; 16+ messages in thread
From: Leonardo Bras @ 2022-05-04 19:18 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

Since d48c3a0445 ("multifd: Use a single writev on the send side"),
sending the header packet and the memory pages happens in the same
writev, which can potentially make the migration faster.

Using channel-socket as example, this works well with the default copying
mechanism of sendmsg(), but with zero-copy-send=true, it will cause
the migration to often break.

This happens because the header packet buffer gets reused quite often,
and there is a high chance that by the time the MSG_ZEROCOPY mechanism get
to send the buffer, it has already changed, sending the wrong data and
causing the migration to abort.

It means that, as it is, the buffer for the header packet is not suitable
for sending with MSG_ZEROCOPY.

In order to enable zero copy for multifd, send the header packet on an
individual write(), without any flags, and the remanining pages with a
writev(), as it was happening before. This only changes how a migration
with zero-copy-send=true works, not changing any current behavior for
migrations with zero-copy-send=false.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 migration/multifd.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 15fb668e64..2541cd2322 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -617,6 +617,7 @@ static void *multifd_send_thread(void *opaque)
     MultiFDSendParams *p = opaque;
     Error *local_err = NULL;
     int ret = 0;
+    bool use_zero_copy_send = migrate_use_zero_copy_send();
 
     trace_multifd_send_thread_start(p->id);
     rcu_register_thread();
@@ -639,9 +640,14 @@ static void *multifd_send_thread(void *opaque)
         if (p->pending_job) {
             uint64_t packet_num = p->packet_num;
             uint32_t flags = p->flags;
-            p->iovs_num = 1;
             p->normal_num = 0;
 
+            if (use_zero_copy_send) {
+                p->iovs_num = 0;
+            } else {
+                p->iovs_num = 1;
+            }
+
             for (int i = 0; i < p->pages->num; i++) {
                 p->normal[p->normal_num] = p->pages->offset[i];
                 p->normal_num++;
@@ -665,8 +671,18 @@ static void *multifd_send_thread(void *opaque)
             trace_multifd_send(p->id, packet_num, p->normal_num, flags,
                                p->next_packet_size);
 
-            p->iov[0].iov_len = p->packet_len;
-            p->iov[0].iov_base = p->packet;
+            if (use_zero_copy_send) {
+                /* Send header first, without zerocopy */
+                ret = qio_channel_write_all(p->c, (void *)p->packet,
+                                            p->packet_len, &local_err);
+                if (ret != 0) {
+                    break;
+                }
+            } else {
+                /* Send header using the same writev call */
+                p->iov[0].iov_len = p->packet_len;
+                p->iov[0].iov_base = p->packet;
+            }
 
             ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
                                          &local_err);
-- 
2.36.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v11 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)
  2022-05-04 19:18 [PATCH v11 0/7] MSG_ZEROCOPY + multifd Leonardo Bras
                   ` (5 preceding siblings ...)
  2022-05-04 19:18 ` [PATCH v11 6/7] multifd: Send header packet without flags if zero-copy-send is enabled Leonardo Bras
@ 2022-05-04 19:18 ` Leonardo Bras
  6 siblings, 0 replies; 16+ messages in thread
From: Leonardo Bras @ 2022-05-04 19:18 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
writev + flags & flush interface.

Change multifd_send_sync_main() so flush_zero_copy() can be called
after each iteration in order to make sure all dirty pages are sent before
a new iteration is started. It will also flush at the beginning and at the
end of migration.

Also make it return -1 if flush_zero_copy() fails, in order to cancel
the migration process, and avoid resuming the guest in the target host
without receiving all current RAM.

This will work fine on RAM migration because the RAM pages are not usually freed,
and there is no problem on changing the pages content between writev_zero_copy() and
the actual sending of the buffer, because this change will dirty the page and
cause it to be re-sent on a next iteration anyway.

A lot of locked memory may be needed in order to use multifd migration
with zero-copy enabled, so disabling the feature should be necessary for
low-privileged users trying to perform multifd migrations.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 migration/multifd.h   |  2 ++
 migration/migration.c | 11 ++++++++++-
 migration/multifd.c   | 37 +++++++++++++++++++++++++++++++++++--
 migration/socket.c    |  5 +++--
 4 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/migration/multifd.h b/migration/multifd.h
index bcf5992945..4d8d89e5e5 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -92,6 +92,8 @@ typedef struct {
     uint32_t packet_len;
     /* pointer to the packet */
     MultiFDPacket_t *packet;
+    /* multifd flags for sending ram */
+    int write_flags;
     /* multifd flags for each packet */
     uint32_t flags;
     /* size of the next packet that contains pages */
diff --git a/migration/migration.c b/migration/migration.c
index 4b6df2eb5e..31739b2af9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1497,7 +1497,16 @@ static bool migrate_params_check(MigrationParameters *params, Error **errp)
         error_prepend(errp, "Invalid mapping given for block-bitmap-mapping: ");
         return false;
     }
-
+#ifdef CONFIG_LINUX
+    if (params->zero_copy_send &&
+        (!migrate_use_multifd() ||
+         params->multifd_compression != MULTIFD_COMPRESSION_NONE ||
+         (params->tls_creds && *params->tls_creds))) {
+        error_setg(errp,
+                   "Zero copy only available for non-compressed non-TLS multifd migration");
+        return false;
+    }
+#endif
     return true;
 }
 
diff --git a/migration/multifd.c b/migration/multifd.c
index 2541cd2322..9282ab6aa4 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -569,6 +569,7 @@ void multifd_save_cleanup(void)
 int multifd_send_sync_main(QEMUFile *f)
 {
     int i;
+    bool flush_zero_copy;
 
     if (!migrate_use_multifd()) {
         return 0;
@@ -579,6 +580,20 @@ int multifd_send_sync_main(QEMUFile *f)
             return -1;
         }
     }
+
+    /*
+     * When using zero-copy, it's necessary to flush the pages before any of
+     * the pages can be sent again, so we'll make sure the new version of the
+     * pages will always arrive _later_ than the old pages.
+     *
+     * Currently we achieve this by flushing the zero-page requested writes
+     * per ram iteration, but in the future we could potentially optimize it
+     * to be less frequent, e.g. only after we finished one whole scanning of
+     * all the dirty bitmaps.
+     */
+
+    flush_zero_copy = migrate_use_zero_copy_send();
+
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
 
@@ -600,6 +615,17 @@ int multifd_send_sync_main(QEMUFile *f)
         ram_counters.transferred += p->packet_len;
         qemu_mutex_unlock(&p->mutex);
         qemu_sem_post(&p->sem);
+
+        if (flush_zero_copy && p->c) {
+            int ret;
+            Error *err = NULL;
+
+            ret = qio_channel_flush(p->c, &err);
+            if (ret < 0) {
+                error_report_err(err);
+                return -1;
+            }
+        }
     }
     for (i = 0; i < migrate_multifd_channels(); i++) {
         MultiFDSendParams *p = &multifd_send_state->params[i];
@@ -684,8 +710,8 @@ static void *multifd_send_thread(void *opaque)
                 p->iov[0].iov_base = p->packet;
             }
 
-            ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
-                                         &local_err);
+            ret = qio_channel_writev_full_all(p->c, p->iov, p->iovs_num, NULL,
+                                              0, p->write_flags, &local_err);
             if (ret != 0) {
                 break;
             }
@@ -913,6 +939,13 @@ int multifd_save_setup(Error **errp)
         /* We need one extra place for the packet header */
         p->iov = g_new0(struct iovec, page_count + 1);
         p->normal = g_new0(ram_addr_t, page_count);
+
+        if (migrate_use_zero_copy_send()) {
+            p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
+        } else {
+            p->write_flags = 0;
+        }
+
         socket_send_channel_create(multifd_new_send_channel_async, p);
     }
 
diff --git a/migration/socket.c b/migration/socket.c
index 3754d8f72c..4fd5e85f50 100644
--- a/migration/socket.c
+++ b/migration/socket.c
@@ -79,8 +79,9 @@ static void socket_outgoing_migration(QIOTask *task,
 
     trace_migration_socket_outgoing_connected(data->hostname);
 
-    if (migrate_use_zero_copy_send()) {
-        error_setg(&err, "Zero copy send not available in migration");
+    if (migrate_use_zero_copy_send() &&
+        !qio_channel_has_feature(sioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
+        error_setg(&err, "Zero copy send feature not detected in host kernel");
     }
 
 out:
-- 
2.36.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
  2022-05-04 19:18 ` [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX Leonardo Bras
@ 2022-05-04 19:53   ` Peter Xu
  2022-05-05  4:20     ` Leonardo Bras Soares Passos
  2022-05-05  8:05   ` Daniel P. Berrangé
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Xu @ 2022-05-04 19:53 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, qemu-devel, qemu-block

On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote:
> +/*
> + * Zero-copy defines bellow are included to avoid breaking builds on systems
> + * that don't support MSG_ZEROCOPY, while keeping the functions more readable
> + * (without a lot of ifdefs).
> + */
> +#ifndef MSG_ZEROCOPY
> +#define MSG_ZEROCOPY 0x4000000
> +#endif
> +#ifndef SO_ZEROCOPY
> +#define SO_ZEROCOPY 60
> +#endif

So this will define these two values on e.g. FreeBSD, while they do not
make sense at all there because these numbers are pure magics and
meaningless outside Linux..

I don't think it's anything dangerous, but IMHO it's another way of being
not clean comparing of using some "#ifdef"s.  Comparing to this approach
the "use #ifdef" approach is actually slightly more cleaner to me. :)

Let's wait for some other inputs.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
  2022-05-04 19:53   ` Peter Xu
@ 2022-05-05  4:20     ` Leonardo Bras Soares Passos
  2022-05-05 12:41       ` Peter Xu
  0 siblings, 1 reply; 16+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-05-05  4:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, qemu-devel, qemu-block

On Wed, May 4, 2022 at 4:53 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote:
> > +/*
> > + * Zero-copy defines bellow are included to avoid breaking builds on systems
> > + * that don't support MSG_ZEROCOPY, while keeping the functions more readable
> > + * (without a lot of ifdefs).
> > + */
> > +#ifndef MSG_ZEROCOPY
> > +#define MSG_ZEROCOPY 0x4000000
> > +#endif
> > +#ifndef SO_ZEROCOPY
> > +#define SO_ZEROCOPY 60
> > +#endif
>
> So this will define these two values on e.g. FreeBSD, while they do not
> make sense at all there because these numbers are pure magics and
> meaningless outside Linux..

Correct.
But since only in Linux it's possible to set the
QIO_CHANNEL_WRITE_FLAG_ZERO_COPY flag, sflags will always be zero and
it would never try using MSG_ZEROCOPY outside Linux.

> I don't think it's anything dangerous, but IMHO it's another way of being
> not clean comparing of using some "#ifdef"s.  Comparing to this approach
> the "use #ifdef" approach is actually slightly more cleaner to me. :)
>

This requires:
- Creating a define such as 'QEMU_MSG_ZEROCOPY', that needs to include
<sys/socket.h> to get some flags:
#define QEMU_MSG_ZEROCOPY defined(CONFIG_LINUX) &&
defined(MSG_ZEROCOPY) && defined(SO_ZEROCOPY)
- Making it available for all code in this patchset that does "ifdef
CONFIG_LINUX'
(migration/migration.c/h, qapi/migration.json, monitor/hmp-cmds.c,
io/channel-socket.c)
- Replace current usage of CONFIG_LINUX in this patchset for QEMU_MSG_ZEROCOPY
- Change qio_channel_socket_writev() so the current 2 usages of
MSG_ZEROCOPY are surrounded by ifdef QEMU_MSG_ZEROCOPY.

Pros of above approach (1):
- Smaller binary: The whole MSG_ZEROCOPY code is compiled out if the
building system does not support it.
- Since it's compiled out, there is a couple lines of less code
running if the building system does not support it
- It's not even possible to set this option in MigrationSetParams,
which will return an error.

Pros of current approach (2):
- Define is local to file (I am not sure if it's ok to create a
'global' define for above approach, including <sys/socket.h> bits)
- A build system that does not support MSG_ZEROCOPY can produce a
binary that can use MSG_ZEROCOPY if the target system supports it.
- There are no #ifdefs on qio_channel_socket_writev()

(2) is already implemented in v11, but I have no issue implementing
(1) for v12 if it's ok to create this 'global' define.

> Let's wait for some other inputs.

Agree.
Having the pros of each approach clear, I would like some input on
what is better for the project.

Best regards,
Leo



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v11 3/7] migration: Add zero-copy-send parameter for QMP/HMP for Linux
  2022-05-04 19:18 ` [PATCH v11 3/7] migration: Add zero-copy-send parameter for QMP/HMP for Linux Leonardo Bras
@ 2022-05-05  5:55   ` Markus Armbruster
  0 siblings, 0 replies; 16+ messages in thread
From: Markus Armbruster @ 2022-05-05  5:55 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, Peter Xu, qemu-devel, qemu-block

Leonardo Bras <leobras@redhat.com> writes:

> Add property that allows zero-copy migration of memory pages
> on the sending side, and also includes a helper function
> migrate_use_zero_copy_send() to check if it's enabled.
>
> No code is introduced to actually do the migration, but it allow
> future implementations to enable/disable this feature.
>
> On non-Linux builds this parameter is compiled-out.
>
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> Acked-by: Markus Armbruster <armbru@redhat.com>
> ---
>  qapi/migration.json   | 24 ++++++++++++++++++++++++
>  migration/migration.h |  5 +++++
>  migration/migration.c | 32 ++++++++++++++++++++++++++++++++
>  migration/socket.c    | 11 +++++++++--
>  monitor/hmp-cmds.c    |  6 ++++++
>  5 files changed, 76 insertions(+), 2 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 409eb086a2..04246481ce 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -741,6 +741,13 @@
>  #                      will consume more CPU.
>  #                      Defaults to 1. (Since 5.0)
>  #
> +# @zero-copy-send: Controls behavior on sending memory pages on migration.
> +#                  When true, enables a zero-copy mechanism for sending memory
> +#                  pages, if host supports it.
> +#                  Requires that QEMU be permitted to use locked memory for guest

Please wrap lines around column 75.  More of the same below.

> +#                  RAM pages.
> +#                  Defaults to false. (Since 7.1)
> +#
>  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
>  #                        aliases for the purpose of dirty bitmap migration.  Such
>  #                        aliases may for example be the corresponding names on the
> @@ -780,6 +787,7 @@
>             'xbzrle-cache-size', 'max-postcopy-bandwidth',
>             'max-cpu-throttle', 'multifd-compression',
>             'multifd-zlib-level' ,'multifd-zstd-level',
> +           { 'name': 'zero-copy-send', 'if' : 'CONFIG_LINUX'},
>             'block-bitmap-mapping' ] }
>  
>  ##
> @@ -906,6 +914,13 @@
>  #                      will consume more CPU.
>  #                      Defaults to 1. (Since 5.0)
>  #
> +# @zero-copy-send: Controls behavior on sending memory pages on migration.
> +#                  When true, enables a zero-copy mechanism for sending memory
> +#                  pages, if host supports it.
> +#                  Requires that QEMU be permitted to use locked memory for guest
> +#                  RAM pages.
> +#                  Defaults to false. (Since 7.1)
> +#
>  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
>  #                        aliases for the purpose of dirty bitmap migration.  Such
>  #                        aliases may for example be the corresponding names on the
> @@ -960,6 +975,7 @@
>              '*multifd-compression': 'MultiFDCompression',
>              '*multifd-zlib-level': 'uint8',
>              '*multifd-zstd-level': 'uint8',
> +            '*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
>              '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
>  
>  ##
> @@ -1106,6 +1122,13 @@
>  #                      will consume more CPU.
>  #                      Defaults to 1. (Since 5.0)
>  #
> +# @zero-copy-send: Controls behavior on sending memory pages on migration.
> +#                  When true, enables a zero-copy mechanism for sending memory
> +#                  pages, if host supports it.
> +#                  Requires that QEMU be permitted to use locked memory for guest
> +#                  RAM pages.
> +#                  Defaults to false. (Since 7.1)
> +#
>  # @block-bitmap-mapping: Maps block nodes and bitmaps on them to
>  #                        aliases for the purpose of dirty bitmap migration.  Such
>  #                        aliases may for example be the corresponding names on the
> @@ -1158,6 +1181,7 @@
>              '*multifd-compression': 'MultiFDCompression',
>              '*multifd-zlib-level': 'uint8',
>              '*multifd-zstd-level': 'uint8',
> +            '*zero-copy-send': { 'type': 'bool', 'if': 'CONFIG_LINUX' },
>              '*block-bitmap-mapping': [ 'BitmapMigrationNodeAlias' ] } }
>  
>  ##

[...]



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
  2022-05-04 19:18 ` [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX Leonardo Bras
  2022-05-04 19:53   ` Peter Xu
@ 2022-05-05  8:05   ` Daniel P. Berrangé
  2022-05-05 15:42     ` Leonardo Bras Soares Passos
  1 sibling, 1 reply; 16+ messages in thread
From: Daniel P. Berrangé @ 2022-05-05  8:05 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Juan Quintela,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster, Fam Zheng,
	Peter Xu, qemu-devel, qemu-block

On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote:
> For CONFIG_LINUX, implement the new zero copy flag and the optional callback
> io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
> feature is available in the host kernel, which is checked on
> qio_channel_socket_connect_sync()
> 
> qio_channel_socket_flush() was implemented by counting how many times
> sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
> socket's error queue, in order to find how many of them finished sending.
> Flush will loop until those counters are the same, or until some error occurs.
> 
> Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
> 1: Buffer
> - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying,
> some caution is necessary to avoid overwriting any buffer before it's sent.
> If something like this happen, a newer version of the buffer may be sent instead.
> - If this is a problem, it's recommended to call qio_channel_flush() before freeing
> or re-using the buffer.
> 
> 2: Locked memory
> - When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and
> unlocked after it's sent.
> - Depending on the size of each buffer, and how often it's sent, it may require
> a larger amount of locked memory than usually available to non-root user.
> - If the required amount of locked memory is not available, writev_zero_copy
> will return an error, which can abort an operation like migration,
> - Because of this, when an user code wants to add zero copy as a feature, it
> requires a mechanism to disable it, so it can still be accessible to less
> privileged users.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> ---
>  include/io/channel-socket.h |   2 +
>  io/channel-socket.c         | 120 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 118 insertions(+), 4 deletions(-)
> 
> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> index e747e63514..513c428fe4 100644
> --- a/include/io/channel-socket.h
> +++ b/include/io/channel-socket.h
> @@ -47,6 +47,8 @@ struct QIOChannelSocket {
>      socklen_t localAddrLen;
>      struct sockaddr_storage remoteAddr;
>      socklen_t remoteAddrLen;
> +    ssize_t zero_copy_queued;
> +    ssize_t zero_copy_sent;
>  };
>  
>  
> diff --git a/io/channel-socket.c b/io/channel-socket.c
> index 696a04dc9c..ae756ce166 100644
> --- a/io/channel-socket.c
> +++ b/io/channel-socket.c
> @@ -25,9 +25,25 @@
>  #include "io/channel-watch.h"
>  #include "trace.h"
>  #include "qapi/clone-visitor.h"
> +#ifdef CONFIG_LINUX
> +#include <linux/errqueue.h>
> +#include <sys/socket.h>
> +#endif
>  
>  #define SOCKET_MAX_FDS 16
>  
> +/*
> + * Zero-copy defines bellow are included to avoid breaking builds on systems
> + * that don't support MSG_ZEROCOPY, while keeping the functions more readable
> + * (without a lot of ifdefs).
> + */
> +#ifndef MSG_ZEROCOPY
> +#define MSG_ZEROCOPY 0x4000000
> +#endif
> +#ifndef SO_ZEROCOPY
> +#define SO_ZEROCOPY 60
> +#endif

Please put these behind CONFIG_LINUX to make it clear to readers that
this is entirely Linux specific


> +
>  SocketAddress *
>  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
>                                       Error **errp)
> @@ -54,6 +70,8 @@ qio_channel_socket_new(void)
>  
>      sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
>      sioc->fd = -1;
> +    sioc->zero_copy_queued = 0;
> +    sioc->zero_copy_sent = 0;
>  
>      ioc = QIO_CHANNEL(sioc);
>      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> @@ -153,6 +171,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
>          return -1;
>      }
>  
> +#ifdef CONFIG_LINUX
> +    int ret, v = 1;
> +    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> +    if (ret == 0) {
> +        /* Zero copy available on host */
> +        qio_channel_set_feature(QIO_CHANNEL(ioc),
> +                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
> +    }
> +#endif
> +
>      return 0;
>  }
>  
> @@ -533,6 +561,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>      char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)];
>      size_t fdsize = sizeof(int) * nfds;
>      struct cmsghdr *cmsg;
> +    int sflags = 0;
>  
>      memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
>  
> @@ -557,15 +586,27 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>          memcpy(CMSG_DATA(cmsg), fds, fdsize);
>      }
>  
> +    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> +        sflags = MSG_ZEROCOPY;
> +    }

Also should be behind CONFIG_LINUX

> +
>   retry:
> -    ret = sendmsg(sioc->fd, &msg, 0);
> +    ret = sendmsg(sioc->fd, &msg, sflags);
>      if (ret <= 0) {
> -        if (errno == EAGAIN) {
> +        switch (errno) {
> +        case EAGAIN:
>              return QIO_CHANNEL_ERR_BLOCK;
> -        }
> -        if (errno == EINTR) {
> +        case EINTR:
>              goto retry;
> +        case ENOBUFS:
> +            if (sflags & MSG_ZEROCOPY) {
> +                error_setg_errno(errp, errno,
> +                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
> +                return -1;
> +            }
> +            break;

And this ENOBUGS case behind CONFIG_LINUX

>          }
> +
>          error_setg_errno(errp, errno,
>                           "Unable to write to socket");
>          return -1;
> @@ -659,6 +700,74 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>  }
>  #endif /* WIN32 */
>  
> +
> +#ifdef CONFIG_LINUX
> +static int qio_channel_socket_flush(QIOChannel *ioc,
> +                                    Error **errp)
> +{
> +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> +    struct msghdr msg = {};
> +    struct sock_extended_err *serr;
> +    struct cmsghdr *cm;
> +    char control[CMSG_SPACE(sizeof(*serr))];
> +    int received;
> +    int ret = 1;
> +
> +    msg.msg_control = control;
> +    msg.msg_controllen = sizeof(control);
> +    memset(control, 0, sizeof(control));
> +
> +    while (sioc->zero_copy_sent < sioc->zero_copy_queued) {
> +        received = recvmsg(sioc->fd, &msg, MSG_ERRQUEUE);
> +        if (received < 0) {
> +            switch (errno) {
> +            case EAGAIN:
> +                /* Nothing on errqueue, wait until something is available */
> +                qio_channel_wait(ioc, G_IO_ERR);
> +                continue;
> +            case EINTR:
> +                continue;
> +            default:
> +                error_setg_errno(errp, errno,
> +                                 "Unable to read errqueue");
> +                return -1;
> +            }
> +        }
> +
> +        cm = CMSG_FIRSTHDR(&msg);
> +        if (cm->cmsg_level != SOL_IP &&
> +            cm->cmsg_type != IP_RECVERR) {
> +            error_setg_errno(errp, EPROTOTYPE,
> +                             "Wrong cmsg in errqueue");
> +            return -1;
> +        }
> +
> +        serr = (void *) CMSG_DATA(cm);
> +        if (serr->ee_errno != SO_EE_ORIGIN_NONE) {
> +            error_setg_errno(errp, serr->ee_errno,
> +                             "Error on socket");
> +            return -1;
> +        }
> +        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
> +            error_setg_errno(errp, serr->ee_origin,
> +                             "Error not from zero copy");
> +            return -1;
> +        }
> +
> +        /* No errors, count successfully finished sendmsg()*/
> +        sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
> +
> +        /* If any sendmsg() succeeded using zero copy, return 0 at the end */
> +        if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
> +            ret = 0;
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +#endif /* CONFIG_LINUX */
> +
>  static int
>  qio_channel_socket_set_blocking(QIOChannel *ioc,
>                                  bool enabled,
> @@ -789,6 +898,9 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
>      ioc_klass->io_set_delay = qio_channel_socket_set_delay;
>      ioc_klass->io_create_watch = qio_channel_socket_create_watch;
>      ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
> +#ifdef CONFIG_LINUX
> +    ioc_klass->io_flush = qio_channel_socket_flush;
> +#endif
>  }
>  
>  static const TypeInfo qio_channel_socket_info = {
> -- 
> 2.36.0
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
  2022-05-05  4:20     ` Leonardo Bras Soares Passos
@ 2022-05-05 12:41       ` Peter Xu
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2022-05-05 12:41 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake,
	Markus Armbruster, Fam Zheng, qemu-devel, qemu-block

On Thu, May 05, 2022 at 01:20:04AM -0300, Leonardo Bras Soares Passos wrote:
> (2) is already implemented in v11, but I have no issue implementing
> (1) for v12 if it's ok to create this 'global' define.

Dan's suggestion in the other thread sounds good to me with current
approach, on having CONFIG_LINUX to wrap the defines.

But note that it's still prone to change in the future, e.g., when other
qemu modules start to use MSG_ZEROCOPY, we'll probably at least move the
defines into some other headers.  We could probably need to clean things up
when it happens.

But I won't strongly ask for that - we can leave that for later.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
  2022-05-05  8:05   ` Daniel P. Berrangé
@ 2022-05-05 15:42     ` Leonardo Bras Soares Passos
  2022-05-05 15:55       ` Daniel P. Berrangé
  0 siblings, 1 reply; 16+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-05-05 15:42 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Juan Quintela,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster, Fam Zheng,
	Peter Xu, qemu-devel, qemu-block

On Thu, May 5, 2022 at 5:05 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote:
> > For CONFIG_LINUX, implement the new zero copy flag and the optional callback
> > io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
> > feature is available in the host kernel, which is checked on
> > qio_channel_socket_connect_sync()
> >
> > qio_channel_socket_flush() was implemented by counting how many times
> > sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
> > socket's error queue, in order to find how many of them finished sending.
> > Flush will loop until those counters are the same, or until some error occurs.
> >
> > Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
> > 1: Buffer
> > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid copying,
> > some caution is necessary to avoid overwriting any buffer before it's sent.
> > If something like this happen, a newer version of the buffer may be sent instead.
> > - If this is a problem, it's recommended to call qio_channel_flush() before freeing
> > or re-using the buffer.
> >
> > 2: Locked memory
> > - When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and
> > unlocked after it's sent.
> > - Depending on the size of each buffer, and how often it's sent, it may require
> > a larger amount of locked memory than usually available to non-root user.
> > - If the required amount of locked memory is not available, writev_zero_copy
> > will return an error, which can abort an operation like migration,
> > - Because of this, when an user code wants to add zero copy as a feature, it
> > requires a mechanism to disable it, so it can still be accessible to less
> > privileged users.
> >
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> > Reviewed-by: Juan Quintela <quintela@redhat.com>
> > ---
> >  include/io/channel-socket.h |   2 +
> >  io/channel-socket.c         | 120 ++++++++++++++++++++++++++++++++++--
> >  2 files changed, 118 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> > index e747e63514..513c428fe4 100644
> > --- a/include/io/channel-socket.h
> > +++ b/include/io/channel-socket.h
> > @@ -47,6 +47,8 @@ struct QIOChannelSocket {
> >      socklen_t localAddrLen;
> >      struct sockaddr_storage remoteAddr;
> >      socklen_t remoteAddrLen;
> > +    ssize_t zero_copy_queued;
> > +    ssize_t zero_copy_sent;
> >  };
> >
> >
> > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > index 696a04dc9c..ae756ce166 100644
> > --- a/io/channel-socket.c
> > +++ b/io/channel-socket.c
> > @@ -25,9 +25,25 @@
> >  #include "io/channel-watch.h"
> >  #include "trace.h"
> >  #include "qapi/clone-visitor.h"
> > +#ifdef CONFIG_LINUX
> > +#include <linux/errqueue.h>
> > +#include <sys/socket.h>
> > +#endif
> >
> >  #define SOCKET_MAX_FDS 16
> >
> > +/*
> > + * Zero-copy defines bellow are included to avoid breaking builds on systems
> > + * that don't support MSG_ZEROCOPY, while keeping the functions more readable
> > + * (without a lot of ifdefs).
> > + */
> > +#ifndef MSG_ZEROCOPY
> > +#define MSG_ZEROCOPY 0x4000000
> > +#endif
> > +#ifndef SO_ZEROCOPY
> > +#define SO_ZEROCOPY 60
> > +#endif
>
> Please put these behind CONFIG_LINUX to make it clear to readers that
> this is entirely Linux specific
>
>
> > +
> >  SocketAddress *
> >  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
> >                                       Error **errp)
> > @@ -54,6 +70,8 @@ qio_channel_socket_new(void)
> >
> >      sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
> >      sioc->fd = -1;
> > +    sioc->zero_copy_queued = 0;
> > +    sioc->zero_copy_sent = 0;
> >
> >      ioc = QIO_CHANNEL(sioc);
> >      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> > @@ -153,6 +171,16 @@ int qio_channel_socket_connect_sync(QIOChannelSocket *ioc,
> >          return -1;
> >      }
> >
> > +#ifdef CONFIG_LINUX
> > +    int ret, v = 1;
> > +    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > +    if (ret == 0) {
> > +        /* Zero copy available on host */
> > +        qio_channel_set_feature(QIO_CHANNEL(ioc),
> > +                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
> > +    }
> > +#endif
> > +
> >      return 0;
> >  }
> >
> > @@ -533,6 +561,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> >      char control[CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS)];
> >      size_t fdsize = sizeof(int) * nfds;
> >      struct cmsghdr *cmsg;
> > +    int sflags = 0;
> >
> >      memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));
> >
> > @@ -557,15 +586,27 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> >          memcpy(CMSG_DATA(cmsg), fds, fdsize);
> >      }
> >
> > +    if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
> > +        sflags = MSG_ZEROCOPY;
> > +    }
>
> Also should be behind CONFIG_LINUX


Hello Daniel,

But what if this gets compiled in a Linux system without MSG_ZEROCOPY support?
As qapi will have zero-copy-send as an option we could have this scenario:

- User request migration using zero-copy-send
- multifd_save_setup() will set write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
- In qio_channel_socket_connect_sync(): setsockopt() part will be
compiled-out, so no error here
- above part in qio_channel_socket_writev() will be commented-out,
which means write_flags will be ignored
- sflags will not contain MSG_ZEROCOPY, so sendmsg() will use copy-mode
- migration will succeed

In the above case, the user has all the reason to think migration is
using MSG_ZEROCOPY, but in fact it's quietly falling back to
copy-mode.

That's why I suggested creating a 'global' config usiing SO_ZEROCOPY &
MSG_ZEROCOPY & CONFIG_LINUX so we can use in qapi and have no chance
of even offering zero-copy-send if we don't have it.

Another local option is to do implement your suggestions, and also
change qio_channel_socket_connect_sync() so it returns an error if
MSG_ZEROCOPY && SO_ZEROCOPY is not present, such as:

+#ifdef CONFIG_LINUX
+#if defined(MSG_ZEROCOPY)  && defined(SO_ZEROCOPY)
+    int ret, v = 1;
+    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
+    if (ret == 0) {
+        /* Zero copy available on host */
+        qio_channel_set_feature(QIO_CHANNEL(ioc),
+                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
+    }
+#else
+    error_setg_errno(errp, errno,"MSG_ZEROCOPY not available");
+    return -1;
+#endif
+#endif

What do you think?

Best regards,
Leo

>
> > +
> >   retry:
> > -    ret = sendmsg(sioc->fd, &msg, 0);
> > +    ret = sendmsg(sioc->fd, &msg, sflags);
> >      if (ret <= 0) {
> > -        if (errno == EAGAIN) {
> > +        switch (errno) {
> > +        case EAGAIN:
> >              return QIO_CHANNEL_ERR_BLOCK;
> > -        }
> > -        if (errno == EINTR) {
> > +        case EINTR:
> >              goto retry;
> > +        case ENOBUFS:
> > +            if (sflags & MSG_ZEROCOPY) {
> > +                error_setg_errno(errp, errno,
> > +                                 "Process can't lock enough memory for using MSG_ZEROCOPY");
> > +                return -1;
> > +            }
> > +            break;
>
> And this ENOBUGS case behind CONFIG_LINUX
>


[...]



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
  2022-05-05 15:42     ` Leonardo Bras Soares Passos
@ 2022-05-05 15:55       ` Daniel P. Berrangé
  2022-05-05 17:01         ` Leonardo Bras Soares Passos
  0 siblings, 1 reply; 16+ messages in thread
From: Daniel P. Berrangé @ 2022-05-05 15:55 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Juan Quintela,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster, Fam Zheng,
	Peter Xu, qemu-devel, qemu-block

On Thu, May 05, 2022 at 12:42:47PM -0300, Leonardo Bras Soares Passos wrote:
> 
> Hello Daniel,
> 
> But what if this gets compiled in a Linux system without MSG_ZEROCOPY support?
> As qapi will have zero-copy-send as an option we could have this scenario:
> 
> - User request migration using zero-copy-send
> - multifd_save_setup() will set write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
> - In qio_channel_socket_connect_sync(): setsockopt() part will be
> compiled-out, so no error here
> - above part in qio_channel_socket_writev() will be commented-out,
> which means write_flags will be ignored
> - sflags will not contain MSG_ZEROCOPY, so sendmsg() will use copy-mode
> - migration will succeed
> 
> In the above case, the user has all the reason to think migration is
> using MSG_ZEROCOPY, but in fact it's quietly falling back to
> copy-mode.

I think we're ok because qio_channel_writev_full() does

    if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) &&
        !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
        error_setg_errno(errp, EINVAL,
                         "Requested Zero Copy feature is not available");
        return -1;
    }

and since there's no way for QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY to
get set when MSG_ZEROCOPY is compiled out, we'll trigger the error
condition.

> That's why I suggested creating a 'global' config usiing SO_ZEROCOPY &
> MSG_ZEROCOPY & CONFIG_LINUX so we can use in qapi and have no chance
> of even offering zero-copy-send if we don't have it.
> 
> Another local option is to do implement your suggestions, and also
> change qio_channel_socket_connect_sync() so it returns an error if
> MSG_ZEROCOPY && SO_ZEROCOPY is not present, such as:
> 
> +#ifdef CONFIG_LINUX
> +#if defined(MSG_ZEROCOPY)  && defined(SO_ZEROCOPY)
> +    int ret, v = 1;
> +    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> +    if (ret == 0) {
> +        /* Zero copy available on host */
> +        qio_channel_set_feature(QIO_CHANNEL(ioc),
> +                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
> +    }
> +#else
> +    error_setg_errno(errp, errno,"MSG_ZEROCOPY not available");
> +    return -1;
> +#endif
> +#endif

Do we actually need the ifdef CONFIG_LINUX bit at all ?

Sufficient to just have the check for MSG_ZEROCOPY + SO_ZEROCOPY,
which will fail on non-Linux anyway.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX
  2022-05-05 15:55       ` Daniel P. Berrangé
@ 2022-05-05 17:01         ` Leonardo Bras Soares Passos
  0 siblings, 0 replies; 16+ messages in thread
From: Leonardo Bras Soares Passos @ 2022-05-05 17:01 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Juan Quintela,
	Dr. David Alan Gilbert, Eric Blake, Markus Armbruster, Fam Zheng,
	Peter Xu, qemu-devel, qemu-block

On Thu, May 5, 2022 at 12:55 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, May 05, 2022 at 12:42:47PM -0300, Leonardo Bras Soares Passos wrote:
> >
> > Hello Daniel,
> >
> > But what if this gets compiled in a Linux system without MSG_ZEROCOPY support?
> > As qapi will have zero-copy-send as an option we could have this scenario:
> >
> > - User request migration using zero-copy-send
> > - multifd_save_setup() will set write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY
> > - In qio_channel_socket_connect_sync(): setsockopt() part will be
> > compiled-out, so no error here
> > - above part in qio_channel_socket_writev() will be commented-out,
> > which means write_flags will be ignored
> > - sflags will not contain MSG_ZEROCOPY, so sendmsg() will use copy-mode
> > - migration will succeed
> >
> > In the above case, the user has all the reason to think migration is
> > using MSG_ZEROCOPY, but in fact it's quietly falling back to
> > copy-mode.
>
> I think we're ok because qio_channel_writev_full() does
>
>     if ((flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) &&
>         !qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
>         error_setg_errno(errp, EINVAL,
>                          "Requested Zero Copy feature is not available");
>         return -1;
>     }
>
> and since there's no way for QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY to
> get set when MSG_ZEROCOPY is compiled out, we'll trigger the error
> condition.

Oh, that's right. It will fail in the first writev(), I was just
considering failing during setup.

>
> > That's why I suggested creating a 'global' config usiing SO_ZEROCOPY &
> > MSG_ZEROCOPY & CONFIG_LINUX so we can use in qapi and have no chance
> > of even offering zero-copy-send if we don't have it.
> >
> > Another local option is to do implement your suggestions, and also
> > change qio_channel_socket_connect_sync() so it returns an error if
> > MSG_ZEROCOPY && SO_ZEROCOPY is not present, such as:
> >
> > +#ifdef CONFIG_LINUX
> > +#if defined(MSG_ZEROCOPY)  && defined(SO_ZEROCOPY)
> > +    int ret, v = 1;
> > +    ret = setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > +    if (ret == 0) {
> > +        /* Zero copy available on host */
> > +        qio_channel_set_feature(QIO_CHANNEL(ioc),
> > +                                QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY);
> > +    }
> > +#else
> > +    error_setg_errno(errp, errno,"MSG_ZEROCOPY not available");
> > +    return -1;
> > +#endif
> > +#endif
>
> Do we actually need the ifdef CONFIG_LINUX bit at all ?
>
> Sufficient to just have the check for MSG_ZEROCOPY + SO_ZEROCOPY,
> which will fail on non-Linux anyway.

By some include issue, or by future implementations we can have
MSG_ZEROCOPY or SO_ZEROCOPY getting defined in OS other than Linux,
which would introduce some headaches.

Since you pointed out that migration will fail on writev, the above
piece of code is not necessary.
We could have a local define that equals to (MSG_ZEROCOPY &&
SO_ZEROCOPY && CONFIG_LINUX) so that we can make the code simpler
where needed.

I will work on a v12 and send it here.

Best regards,
Leo



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-05-05 17:21 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-04 19:18 [PATCH v11 0/7] MSG_ZEROCOPY + multifd Leonardo Bras
2022-05-04 19:18 ` [PATCH v11 1/7] QIOChannel: Add flags on io_writev and introduce io_flush callback Leonardo Bras
2022-05-04 19:18 ` [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX Leonardo Bras
2022-05-04 19:53   ` Peter Xu
2022-05-05  4:20     ` Leonardo Bras Soares Passos
2022-05-05 12:41       ` Peter Xu
2022-05-05  8:05   ` Daniel P. Berrangé
2022-05-05 15:42     ` Leonardo Bras Soares Passos
2022-05-05 15:55       ` Daniel P. Berrangé
2022-05-05 17:01         ` Leonardo Bras Soares Passos
2022-05-04 19:18 ` [PATCH v11 3/7] migration: Add zero-copy-send parameter for QMP/HMP for Linux Leonardo Bras
2022-05-05  5:55   ` Markus Armbruster
2022-05-04 19:18 ` [PATCH v11 4/7] migration: Add migrate_use_tls() helper Leonardo Bras
2022-05-04 19:18 ` [PATCH v11 5/7] multifd: multifd_send_sync_main now returns negative on error Leonardo Bras
2022-05-04 19:18 ` [PATCH v11 6/7] multifd: Send header packet without flags if zero-copy-send is enabled Leonardo Bras
2022-05-04 19:18 ` [PATCH v11 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy) Leonardo Bras

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.