* [PATCH v1 0/3] QIOChannel flags + multifd zerocopy
@ 2021-08-31 11:02 Leonardo Bras
  2021-08-31 11:02 ` [PATCH v1 1/3] io: Enable write flags for QIOChannel Leonardo Bras
                   ` (3 more replies)
  0 siblings, 4 replies; 62+ messages in thread
From: Leonardo Bras @ 2021-08-31 11:02 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

This patch series intends to enable MSG_ZEROCOPY in QIOChannel, and to
make use of it to improve multifd migration performance.

Patch #1 enables the use of flags in qio_channel_write*(), allowing
more flexibility in using the channel.
It was designed with MSG_ZEROCOPY usage in mind, where it's useful to
have an easy way to choose which packets are sent with the flag, but it
also makes the interface more flexible for future use.

Patch #2 adds the MSG_ZEROCOPY feature and defines the enablement
mechanics, without enabling it in any code yet.

Patch #3 enables MSG_ZEROCOPY for migration / multifd.


Results:
So far, the resource usage of __sys_sendmsg() dropped by a factor of 15,
and the overall migration took 13-18% less time, based on a synthetic
workload.

The objective is to reduce migration time on hosts with heavy CPU usage.

Leonardo Bras (3):
  io: Enable write flags for QIOChannel
  io: Add zerocopy and errqueue
  migration: multifd: Enable zerocopy

 chardev/char-io.c                   |  2 +-
 hw/remote/mpqemu-link.c             |  2 +-
 include/io/channel-socket.h         |  2 +
 include/io/channel.h                | 85 +++++++++++++++++++++++------
 io/channel-buffer.c                 |  1 +
 io/channel-command.c                |  1 +
 io/channel-file.c                   |  1 +
 io/channel-socket.c                 | 80 ++++++++++++++++++++++++++-
 io/channel-tls.c                    | 12 ++++
 io/channel-websock.c                | 10 ++++
 io/channel.c                        | 64 +++++++++++++---------
 migration/multifd-zlib.c            |  7 ++-
 migration/multifd-zstd.c            |  7 ++-
 migration/multifd.c                 |  9 ++-
 migration/multifd.h                 |  3 +-
 migration/rdma.c                    |  1 +
 scsi/pr-manager-helper.c            |  2 +-
 tests/unit/test-io-channel-socket.c |  1 +
 18 files changed, 235 insertions(+), 55 deletions(-)

-- 
2.33.0



^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH v1 1/3] io: Enable write flags for QIOChannel
  2021-08-31 11:02 [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Leonardo Bras
@ 2021-08-31 11:02 ` Leonardo Bras
  2021-09-01 20:54   ` Eric Blake
  2021-08-31 11:02 ` [PATCH v1 2/3] io: Add zerocopy and errqueue Leonardo Bras
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 62+ messages in thread
From: Leonardo Bras @ 2021-08-31 11:02 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

Some syscalls used for writing, such as sendmsg(), accept flags that
can modify their behavior, even allowing the use of features such as
MSG_ZEROCOPY.

Change the qio_channel_write*() interface to allow passing down flags,
enabling more flexible use of QIOChannel.

At first, its use is enabled for QIOChannelSocket, but it can easily be
extended to any other QIOChannel implementation.
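
As an illustration of the caller side (not part of this patch), a write
can now opt into zerocopy per call via the new flags argument; the
MSG_ZEROCOPY value and the 10KB threshold here are assumptions taken
from the Linux socket headers and docs:

#include "io/channel.h"
#include <sys/socket.h>

/* Hypothetical helper: request zerocopy only for large payloads and
 * keep the default copying path otherwise. */
static ssize_t send_buf(QIOChannel *ioc, const char *buf, size_t len,
                        Error **errp)
{
    int flags = (len >= 10 * 1024) ? MSG_ZEROCOPY : 0;

    return qio_channel_write_flags(ioc, buf, len, flags, errp);
}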

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 chardev/char-io.c                   |  2 +-
 hw/remote/mpqemu-link.c             |  2 +-
 include/io/channel.h                | 56 ++++++++++++++++++++---------
 io/channel-buffer.c                 |  1 +
 io/channel-command.c                |  1 +
 io/channel-file.c                   |  1 +
 io/channel-socket.c                 |  4 ++-
 io/channel-tls.c                    |  1 +
 io/channel-websock.c                |  1 +
 io/channel.c                        | 53 ++++++++++++++-------------
 migration/rdma.c                    |  1 +
 scsi/pr-manager-helper.c            |  2 +-
 tests/unit/test-io-channel-socket.c |  1 +
 13 files changed, 81 insertions(+), 45 deletions(-)

diff --git a/chardev/char-io.c b/chardev/char-io.c
index 8ced184160..4ea7b1ee2a 100644
--- a/chardev/char-io.c
+++ b/chardev/char-io.c
@@ -122,7 +122,7 @@ int io_channel_send_full(QIOChannel *ioc,
 
         ret = qio_channel_writev_full(
             ioc, &iov, 1,
-            fds, nfds, NULL);
+            fds, 0, nfds, NULL);
         if (ret == QIO_CHANNEL_ERR_BLOCK) {
             if (offset) {
                 return offset;
diff --git a/hw/remote/mpqemu-link.c b/hw/remote/mpqemu-link.c
index 7e841820e5..0d13321ef0 100644
--- a/hw/remote/mpqemu-link.c
+++ b/hw/remote/mpqemu-link.c
@@ -69,7 +69,7 @@ bool mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc, Error **errp)
     }
 
     if (!qio_channel_writev_full_all(ioc, send, G_N_ELEMENTS(send),
-                                    fds, nfds, errp)) {
+                                     fds, nfds, 0, errp)) {
         ret = true;
     } else {
         trace_mpqemu_send_io_error(msg->cmd, msg->size, nfds);
diff --git a/include/io/channel.h b/include/io/channel.h
index 88988979f8..dada9ebaaf 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -104,6 +104,7 @@ struct QIOChannelClass {
                          size_t niov,
                          int *fds,
                          size_t nfds,
+                         int flags,
                          Error **errp);
     ssize_t (*io_readv)(QIOChannel *ioc,
                         const struct iovec *iov,
@@ -260,6 +261,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
                                 size_t niov,
                                 int *fds,
                                 size_t nfds,
+                                int flags,
                                 Error **errp);
 
 /**
@@ -325,6 +327,7 @@ int qio_channel_readv_all(QIOChannel *ioc,
  * @ioc: the channel object
  * @iov: the array of memory regions to write data from
  * @niov: the length of the @iov array
+ * @flags: optional sending flags
  * @errp: pointer to a NULL-initialized error object
  *
  * Write data to the IO channel, reading it from the
@@ -339,10 +342,14 @@ int qio_channel_readv_all(QIOChannel *ioc,
  *
  * Returns: 0 if all bytes were written, or -1 on error
  */
-int qio_channel_writev_all(QIOChannel *ioc,
-                           const struct iovec *iov,
-                           size_t niov,
-                           Error **erp);
+int qio_channel_writev_all_flags(QIOChannel *ioc,
+                                 const struct iovec *iov,
+                                 size_t niov,
+                                 int flags,
+                                 Error **errp);
+
+#define qio_channel_writev_all(ioc, iov, niov, errp) \
+    qio_channel_writev_all_flags(ioc, iov, niov, 0, errp)
 
 /**
  * qio_channel_readv:
@@ -364,15 +371,21 @@ ssize_t qio_channel_readv(QIOChannel *ioc,
  * @ioc: the channel object
  * @iov: the array of memory regions to write data from
  * @niov: the length of the @iov array
+ * @flags: optional sending flags
  * @errp: pointer to a NULL-initialized error object
  *
  * Behaves as qio_channel_writev_full() but does not support
  * sending of file handles.
  */
-ssize_t qio_channel_writev(QIOChannel *ioc,
-                           const struct iovec *iov,
-                           size_t niov,
-                           Error **errp);
+ssize_t qio_channel_writev_flags(QIOChannel *ioc,
+                                 const struct iovec *iov,
+                                 size_t niov,
+                                 int flags,
+                                 Error **errp);
+
+#define qio_channel_writev(ioc, iov, niov, errp) \
+    qio_channel_writev_flags(ioc, iov, niov, 0, errp)
+
 
 /**
  * qio_channel_read:
@@ -395,16 +408,21 @@ ssize_t qio_channel_read(QIOChannel *ioc,
  * @ioc: the channel object
  * @buf: the memory regions to send data from
  * @buflen: the length of @buf
+ * @flags: optional sending flags
  * @errp: pointer to a NULL-initialized error object
  *
  * Behaves as qio_channel_writev_full() but does not support
  * sending of file handles, and only supports writing from a
  * single memory region.
  */
-ssize_t qio_channel_write(QIOChannel *ioc,
-                          const char *buf,
-                          size_t buflen,
-                          Error **errp);
+ssize_t qio_channel_write_flags(QIOChannel *ioc,
+                                const char *buf,
+                                size_t buflen,
+                                int flags,
+                                Error **errp);
+
+#define qio_channel_write(ioc, buf, buflen, errp) \
+     qio_channel_write_flags(ioc, buf, buflen, 0, errp)
 
 /**
  * qio_channel_read_all_eof:
@@ -453,6 +471,7 @@ int qio_channel_read_all(QIOChannel *ioc,
  * @ioc: the channel object
  * @buf: the memory region to write data into
  * @buflen: the number of bytes to @buf
+ * @flags: optional sending flags
  * @errp: pointer to a NULL-initialized error object
  *
  * Writes @buflen bytes from @buf, possibly blocking or (if the
@@ -462,10 +481,14 @@ int qio_channel_read_all(QIOChannel *ioc,
  *
  * Returns: 0 if all bytes were written, or -1 on error
  */
-int qio_channel_write_all(QIOChannel *ioc,
-                          const char *buf,
-                          size_t buflen,
-                          Error **errp);
+int qio_channel_write_all_flags(QIOChannel *ioc,
+                                const char *buf,
+                                size_t buflen,
+                                int flags,
+                                Error **errp);
+
+#define qio_channel_write_all(ioc, buf, buflen, errp) \
+    qio_channel_write_all_flags(ioc, buf, buflen, 0, errp)
 
 /**
  * qio_channel_set_blocking:
@@ -853,6 +876,7 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
                                 const struct iovec *iov,
                                 size_t niov,
                                 int *fds, size_t nfds,
+                                int flags,
                                 Error **errp);
 
 #endif /* QIO_CHANNEL_H */
diff --git a/io/channel-buffer.c b/io/channel-buffer.c
index baa4e2b089..bf52011be2 100644
--- a/io/channel-buffer.c
+++ b/io/channel-buffer.c
@@ -81,6 +81,7 @@ static ssize_t qio_channel_buffer_writev(QIOChannel *ioc,
                                          size_t niov,
                                          int *fds,
                                          size_t nfds,
+                                         int flags,
                                          Error **errp)
 {
     QIOChannelBuffer *bioc = QIO_CHANNEL_BUFFER(ioc);
diff --git a/io/channel-command.c b/io/channel-command.c
index b2a9e27138..5ff1691bad 100644
--- a/io/channel-command.c
+++ b/io/channel-command.c
@@ -258,6 +258,7 @@ static ssize_t qio_channel_command_writev(QIOChannel *ioc,
                                           size_t niov,
                                           int *fds,
                                           size_t nfds,
+                                          int flags,
                                           Error **errp)
 {
     QIOChannelCommand *cioc = QIO_CHANNEL_COMMAND(ioc);
diff --git a/io/channel-file.c b/io/channel-file.c
index c4bf799a80..348a48545e 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -114,6 +114,7 @@ static ssize_t qio_channel_file_writev(QIOChannel *ioc,
                                        size_t niov,
                                        int *fds,
                                        size_t nfds,
+                                       int flags,
                                        Error **errp)
 {
     QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 606ec97cf7..e377e7303d 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -525,6 +525,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
                                          size_t niov,
                                          int *fds,
                                          size_t nfds,
+                                         int flags,
                                          Error **errp)
 {
     QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
@@ -558,7 +559,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
     }
 
  retry:
-    ret = sendmsg(sioc->fd, &msg, 0);
+    ret = sendmsg(sioc->fd, &msg, flags);
     if (ret <= 0) {
         if (errno == EAGAIN) {
             return QIO_CHANNEL_ERR_BLOCK;
@@ -620,6 +621,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
                                          size_t niov,
                                          int *fds,
                                          size_t nfds,
+                                         int flags,
                                          Error **errp)
 {
     QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
diff --git a/io/channel-tls.c b/io/channel-tls.c
index 2ae1b92fc0..4ce890a538 100644
--- a/io/channel-tls.c
+++ b/io/channel-tls.c
@@ -301,6 +301,7 @@ static ssize_t qio_channel_tls_writev(QIOChannel *ioc,
                                       size_t niov,
                                       int *fds,
                                       size_t nfds,
+                                      int flags,
                                       Error **errp)
 {
     QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
diff --git a/io/channel-websock.c b/io/channel-websock.c
index 70889bb54d..035dd6075b 100644
--- a/io/channel-websock.c
+++ b/io/channel-websock.c
@@ -1127,6 +1127,7 @@ static ssize_t qio_channel_websock_writev(QIOChannel *ioc,
                                           size_t niov,
                                           int *fds,
                                           size_t nfds,
+                                          int flags,
                                           Error **errp)
 {
     QIOChannelWebsock *wioc = QIO_CHANNEL_WEBSOCK(ioc);
diff --git a/io/channel.c b/io/channel.c
index e8b019dc36..ee3cb83d4d 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -72,6 +72,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
                                 size_t niov,
                                 int *fds,
                                 size_t nfds,
+                                int flags,
                                 Error **errp)
 {
     QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
@@ -83,7 +84,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
         return -1;
     }
 
-    return klass->io_writev(ioc, iov, niov, fds, nfds, errp);
+    return klass->io_writev(ioc, iov, niov, fds, nfds, flags, errp);
 }
 
 
@@ -212,18 +213,20 @@ int qio_channel_readv_full_all(QIOChannel *ioc,
     return ret;
 }
 
-int qio_channel_writev_all(QIOChannel *ioc,
-                           const struct iovec *iov,
-                           size_t niov,
-                           Error **errp)
+int qio_channel_writev_all_flags(QIOChannel *ioc,
+                                 const struct iovec *iov,
+                                 size_t niov,
+                                 int flags,
+                                 Error **errp)
 {
-    return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0, errp);
+    return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0, flags, errp);
 }
 
 int qio_channel_writev_full_all(QIOChannel *ioc,
                                 const struct iovec *iov,
                                 size_t niov,
                                 int *fds, size_t nfds,
+                                int flags,
                                 Error **errp)
 {
     int ret = -1;
@@ -238,7 +241,7 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
     while (nlocal_iov > 0) {
         ssize_t len;
         len = qio_channel_writev_full(ioc, local_iov, nlocal_iov, fds, nfds,
-                                      errp);
+                                      flags, errp);
         if (len == QIO_CHANNEL_ERR_BLOCK) {
             if (qemu_in_coroutine()) {
                 qio_channel_yield(ioc, G_IO_OUT);
@@ -272,15 +275,15 @@ ssize_t qio_channel_readv(QIOChannel *ioc,
 }
 
 
-ssize_t qio_channel_writev(QIOChannel *ioc,
-                           const struct iovec *iov,
-                           size_t niov,
-                           Error **errp)
+ssize_t qio_channel_writev_flags(QIOChannel *ioc,
+                                 const struct iovec *iov,
+                                 size_t niov,
+                                 int flags,
+                                 Error **errp)
 {
-    return qio_channel_writev_full(ioc, iov, niov, NULL, 0, errp);
+    return qio_channel_writev_full(ioc, iov, niov, NULL, 0, flags, errp);
 }
 
-
 ssize_t qio_channel_read(QIOChannel *ioc,
                          char *buf,
                          size_t buflen,
@@ -291,16 +294,16 @@ ssize_t qio_channel_read(QIOChannel *ioc,
 }
 
 
-ssize_t qio_channel_write(QIOChannel *ioc,
-                          const char *buf,
-                          size_t buflen,
-                          Error **errp)
+ssize_t qio_channel_write_flags(QIOChannel *ioc,
+                                const char *buf,
+                                size_t buflen,
+                                int flags,
+                                Error **errp)
 {
     struct iovec iov = { .iov_base = (char *)buf, .iov_len = buflen };
-    return qio_channel_writev_full(ioc, &iov, 1, NULL, 0, errp);
+    return qio_channel_writev_full(ioc, &iov, 1, NULL, 0, flags, errp);
 }
 
-
 int qio_channel_read_all_eof(QIOChannel *ioc,
                              char *buf,
                              size_t buflen,
@@ -321,16 +324,16 @@ int qio_channel_read_all(QIOChannel *ioc,
 }
 
 
-int qio_channel_write_all(QIOChannel *ioc,
-                          const char *buf,
-                          size_t buflen,
-                          Error **errp)
+int qio_channel_write_all_flags(QIOChannel *ioc,
+                                const char *buf,
+                                size_t buflen,
+                                int flags,
+                                Error **errp)
 {
     struct iovec iov = { .iov_base = (char *)buf, .iov_len = buflen };
-    return qio_channel_writev_all(ioc, &iov, 1, errp);
+    return qio_channel_writev_all_flags(ioc, &iov, 1, flags, errp);
 }
 
-
 int qio_channel_set_blocking(QIOChannel *ioc,
                               bool enabled,
                               Error **errp)
diff --git a/migration/rdma.c b/migration/rdma.c
index 5c2d113aa9..bc0558b70e 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -2713,6 +2713,7 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
                                        size_t niov,
                                        int *fds,
                                        size_t nfds,
+                                       int flags,
                                        Error **errp)
 {
     QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
diff --git a/scsi/pr-manager-helper.c b/scsi/pr-manager-helper.c
index 451c7631b7..3be52a98d5 100644
--- a/scsi/pr-manager-helper.c
+++ b/scsi/pr-manager-helper.c
@@ -77,7 +77,7 @@ static int pr_manager_helper_write(PRManagerHelper *pr_mgr,
         iov.iov_base = (void *)buf;
         iov.iov_len = sz;
         n_written = qio_channel_writev_full(QIO_CHANNEL(pr_mgr->ioc), &iov, 1,
-                                            nfds ? &fd : NULL, nfds, errp);
+                                            nfds ? &fd : NULL, nfds, 0, errp);
 
         if (n_written <= 0) {
             assert(n_written != QIO_CHANNEL_ERR_BLOCK);
diff --git a/tests/unit/test-io-channel-socket.c b/tests/unit/test-io-channel-socket.c
index c49eec1f03..6713886d02 100644
--- a/tests/unit/test-io-channel-socket.c
+++ b/tests/unit/test-io-channel-socket.c
@@ -444,6 +444,7 @@ static void test_io_channel_unix_fd_pass(void)
                             G_N_ELEMENTS(iosend),
                             fdsend,
                             G_N_ELEMENTS(fdsend),
+                            0,
                             &error_abort);
 
     qio_channel_readv_full(dst,
-- 
2.33.0



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-08-31 11:02 [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Leonardo Bras
  2021-08-31 11:02 ` [PATCH v1 1/3] io: Enable write flags for QIOChannel Leonardo Bras
@ 2021-08-31 11:02 ` Leonardo Bras
  2021-08-31 12:57   ` Daniel P. Berrangé
  2021-08-31 11:02 ` [PATCH v1 3/3] migration: multifd: Enable zerocopy Leonardo Bras
  2021-08-31 21:24 ` [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Peter Xu
  3 siblings, 1 reply; 62+ messages in thread
From: Leonardo Bras @ 2021-08-31 11:02 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
send calls. It does so by avoiding copying user data into kernel buffers.

To make it work, three steps are needed:
1 - A setsockopt() system call, enabling SO_ZEROCOPY
2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
3 - Process the socket's error queue, dealing with any error
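
For reference, a minimal plain-socket sketch of these three steps
(illustrative only, not the QIOChannel code; assumes Linux >= 4.14):

#include <errno.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
    char control[256];
    struct msghdr emsg = { .msg_control = control,
                           .msg_controllen = sizeof(control) };

    /* Step 1: enable SO_ZEROCOPY once per socket */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
        return -1;
    }
    /* Step 2: ask for zerocopy on this particular send */
    if (sendmsg(fd, &msg, MSG_ZEROCOPY) < 0) {
        return -1;
    }
    /* Step 3: later, read completion/error notifications from the
     * socket's error queue (an empty queue returns EAGAIN) */
    if (recvmsg(fd, &emsg, MSG_ERRQUEUE) < 0 && errno != EAGAIN) {
        return -1;
    }
    return 0;
}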

Zerocopy has its costs, so it only improves performance if the sending
buffer is big (10KB, according to the Linux docs).

Step 2 makes it possible to use the same socket to send data with both
zerocopy and the default copying approach, so the application can
choose what is best for each packet.

To implement step 1, an optional set_zerocopy() interface was added to
QIOChannel, allowing the calling code to enable or disable it.

Step 2 is done by the calling code at each qio_channel_write*() that
would benefit from zerocopy.

Step 3 is done by qio_channel_socket_errq_proc(), which runs after
SOCKET_ERRQ_THRESH (16k) iovs have been sent and deals with any error
found.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 include/io/channel-socket.h |  2 +
 include/io/channel.h        | 29 ++++++++++++++
 io/channel-socket.c         | 76 +++++++++++++++++++++++++++++++++++++
 io/channel-tls.c            | 11 ++++++
 io/channel-websock.c        |  9 +++++
 io/channel.c                | 11 ++++++
 6 files changed, 138 insertions(+)

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index e747e63514..09dffe059f 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -47,6 +47,8 @@ struct QIOChannelSocket {
     socklen_t localAddrLen;
     struct sockaddr_storage remoteAddr;
     socklen_t remoteAddrLen;
+    size_t errq_pending;
+    bool zerocopy_enabled;
 };
 
 
diff --git a/include/io/channel.h b/include/io/channel.h
index dada9ebaaf..de10a78b10 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -137,6 +137,8 @@ struct QIOChannelClass {
                                   IOHandler *io_read,
                                   IOHandler *io_write,
                                   void *opaque);
+    void (*io_set_zerocopy)(QIOChannel *ioc,
+                            bool enabled);
 };
 
 /* General I/O handling functions */
@@ -570,6 +572,33 @@ int qio_channel_shutdown(QIOChannel *ioc,
 void qio_channel_set_delay(QIOChannel *ioc,
                            bool enabled);
 
+/**
+ * qio_channel_set_zerocopy:
+ * @ioc: the channel object
+ * @enabled: the new flag state
+ *
+ * Controls whether the underlying transport is
+ * permitted to use zerocopy to avoid copying the
+ * sending buffer in kernel. If @enabled is true, then the
+ * writes may avoid buffer copy in kernel. If @enabled
+ * is false, writes will cause the kernel to always
+ * copy the buffer contents before sending.
+ *
+ * In order to make a write with the zerocopy feature,
+ * it's also necessary to send each packet with the
+ * MSG_ZEROCOPY flag. With this, it's possible to
+ * select only the writes that would benefit from
+ * the use of zerocopy, i.e. the ones with larger
+ * buffers.
+ *
+ * This feature was added in Linux 4.14, so older
+ * kernels will fail when enabling it. This is not an
+ * issue, since they will fall back to the default
+ * copying approach.
+ */
+void qio_channel_set_zerocopy(QIOChannel *ioc,
+                              bool enabled);
+
 /**
  * qio_channel_set_cork:
  * @ioc: the channel object
diff --git a/io/channel-socket.c b/io/channel-socket.c
index e377e7303d..a69fec7315 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -26,8 +26,10 @@
 #include "io/channel-watch.h"
 #include "trace.h"
 #include "qapi/clone-visitor.h"
+#include <linux/errqueue.h>
 
 #define SOCKET_MAX_FDS 16
+#define SOCKET_ERRQ_THRESH 16384
 
 SocketAddress *
 qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
@@ -55,6 +57,8 @@ qio_channel_socket_new(void)
 
     sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
     sioc->fd = -1;
+    sioc->zerocopy_enabled = false;
+    sioc->errq_pending = 0;
 
     ioc = QIO_CHANNEL(sioc);
     qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
@@ -520,6 +524,54 @@ static ssize_t qio_channel_socket_readv(QIOChannel *ioc,
     return ret;
 }
 
+static void qio_channel_socket_errq_proc(QIOChannelSocket *sioc,
+                                         Error **errp)
+{
+    int fd = sioc->fd;
+    int ret;
+    struct msghdr msg = {};
+    struct sock_extended_err *serr;
+    struct cmsghdr *cm;
+
+    do {
+        ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+        if (ret <= 0) {
+            if (ret == 0 || errno == EAGAIN) {
+                /* Nothing on errqueue */
+                 sioc->errq_pending = 0;
+                 break;
+            }
+            if (errno == EINTR) {
+                continue;
+            }
+
+            error_setg_errno(errp, errno,
+                             "Unable to read errqueue");
+            break;
+        }
+
+        cm = CMSG_FIRSTHDR(&msg);
+        if (cm->cmsg_level != SOL_IP &&
+            cm->cmsg_type != IP_RECVERR) {
+            error_setg_errno(errp, EPROTOTYPE,
+                             "Wrong cmsg in errqueue");
+            break;
+        }
+
+        serr = (void *) CMSG_DATA(cm);
+        if (serr->ee_errno != 0) {
+            error_setg_errno(errp, serr->ee_errno,
+                             "Error on socket");
+            break;
+        }
+        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+            error_setg_errno(errp, serr->ee_origin,
+                             "Error not from zerocopy");
+            break;
+        }
+    } while (true);
+}
+
 static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
                                          const struct iovec *iov,
                                          size_t niov,
@@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
                          "Unable to write to socket");
         return -1;
     }
+
+    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
+        sioc->errq_pending += niov;
+        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
+            qio_channel_socket_errq_proc(sioc, errp);
+        }
+    }
+
     return ret;
 }
 #else /* WIN32 */
@@ -689,6 +749,21 @@ qio_channel_socket_set_delay(QIOChannel *ioc,
 }
 
 
+static void
+qio_channel_socket_set_zerocopy(QIOChannel *ioc,
+                                bool enabled)
+{
+    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
+    int v = enabled ? 1 : 0;
+    int ret;
+
+    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
+    if (ret >= 0) {
+        sioc->zerocopy_enabled = true;
+    }
+}
+
+
 static void
 qio_channel_socket_set_cork(QIOChannel *ioc,
                             bool enabled)
@@ -789,6 +864,7 @@ static void qio_channel_socket_class_init(ObjectClass *klass,
     ioc_klass->io_set_delay = qio_channel_socket_set_delay;
     ioc_klass->io_create_watch = qio_channel_socket_create_watch;
     ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler;
+    ioc_klass->io_set_zerocopy = qio_channel_socket_set_zerocopy;
 }
 
 static const TypeInfo qio_channel_socket_info = {
diff --git a/io/channel-tls.c b/io/channel-tls.c
index 4ce890a538..bf44b0f7b0 100644
--- a/io/channel-tls.c
+++ b/io/channel-tls.c
@@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
     qio_channel_set_delay(tioc->master, enabled);
 }
 
+
+static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
+                                         bool enabled)
+{
+    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
+
+    qio_channel_set_zerocopy(tioc->master, enabled);
+}
+
+
 static void qio_channel_tls_set_cork(QIOChannel *ioc,
                                      bool enabled)
 {
@@ -416,6 +426,7 @@ static void qio_channel_tls_class_init(ObjectClass *klass,
     ioc_klass->io_shutdown = qio_channel_tls_shutdown;
     ioc_klass->io_create_watch = qio_channel_tls_create_watch;
     ioc_klass->io_set_aio_fd_handler = qio_channel_tls_set_aio_fd_handler;
+    ioc_klass->io_set_zerocopy = qio_channel_tls_set_zerocopy;
 }
 
 static const TypeInfo qio_channel_tls_info = {
diff --git a/io/channel-websock.c b/io/channel-websock.c
index 035dd6075b..4e9491966b 100644
--- a/io/channel-websock.c
+++ b/io/channel-websock.c
@@ -1194,6 +1194,14 @@ static void qio_channel_websock_set_delay(QIOChannel *ioc,
     qio_channel_set_delay(tioc->master, enabled);
 }
 
+static void qio_channel_websock_set_zerocopy(QIOChannel *ioc,
+                                             bool enabled)
+{
+    QIOChannelWebsock *tioc = QIO_CHANNEL_WEBSOCK(ioc);
+
+    qio_channel_set_zerocopy(tioc->master, enabled);
+}
+
 static void qio_channel_websock_set_cork(QIOChannel *ioc,
                                          bool enabled)
 {
@@ -1318,6 +1326,7 @@ static void qio_channel_websock_class_init(ObjectClass *klass,
     ioc_klass->io_close = qio_channel_websock_close;
     ioc_klass->io_shutdown = qio_channel_websock_shutdown;
     ioc_klass->io_create_watch = qio_channel_websock_create_watch;
+    ioc_klass->io_set_zerocopy = qio_channel_websock_set_zerocopy;
 }
 
 static const TypeInfo qio_channel_websock_info = {
diff --git a/io/channel.c b/io/channel.c
index ee3cb83d4d..476440e8b2 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -450,6 +450,17 @@ void qio_channel_set_delay(QIOChannel *ioc,
 }
 
 
+void qio_channel_set_zerocopy(QIOChannel *ioc,
+                              bool enabled)
+{
+    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
+
+    if (klass->io_set_zerocopy) {
+        klass->io_set_zerocopy(ioc, enabled);
+    }
+}
+
+
 void qio_channel_set_cork(QIOChannel *ioc,
                           bool enabled)
 {
-- 
2.33.0



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-08-31 11:02 [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Leonardo Bras
  2021-08-31 11:02 ` [PATCH v1 1/3] io: Enable write flags for QIOChannel Leonardo Bras
  2021-08-31 11:02 ` [PATCH v1 2/3] io: Add zerocopy and errqueue Leonardo Bras
@ 2021-08-31 11:02 ` Leonardo Bras
  2021-08-31 13:16   ` Daniel P. Berrangé
  2021-08-31 21:24 ` [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Peter Xu
  3 siblings, 1 reply; 62+ messages in thread
From: Leonardo Bras @ 2021-08-31 11:02 UTC (permalink / raw)
  To: Marc-André Lureau, Paolo Bonzini, Elena Ufimtseva,
	Jagannathan Raman, John G Johnson, Daniel P. Berrangé,
	Juan Quintela, Dr. David Alan Gilbert, Fam Zheng, Peter Xu
  Cc: Leonardo Bras, qemu-devel, qemu-block

Call qio_channel_set_zerocopy(true) at the start of every multifd thread.

Change the send_write() interface of multifd, allowing it to pass down
flags for qio_channel_write*().

Pass down the MSG_ZEROCOPY flag when sending memory pages, while keeping
the other data on the default copying approach.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd-zlib.c | 7 ++++---
 migration/multifd-zstd.c | 7 ++++---
 migration/multifd.c      | 9 ++++++---
 migration/multifd.h      | 3 ++-
 4 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
index ab4ba75d75..d8cce1810a 100644
--- a/migration/multifd-zlib.c
+++ b/migration/multifd-zlib.c
@@ -160,12 +160,13 @@ static int zlib_send_prepare(MultiFDSendParams *p, uint32_t used, Error **errp)
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int zlib_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int zlib_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+                           Error **errp)
 {
     struct zlib_data *z = p->data;
 
-    return qio_channel_write_all(p->c, (void *)z->zbuff, p->next_packet_size,
-                                 errp);
+    return qio_channel_write_all_flags(p->c, (void *)z->zbuff,
+                                       p->next_packet_size, flags, errp);
 }
 
 /**
diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
index 693bddf8c9..fa063fd33e 100644
--- a/migration/multifd-zstd.c
+++ b/migration/multifd-zstd.c
@@ -171,12 +171,13 @@ static int zstd_send_prepare(MultiFDSendParams *p, uint32_t used, Error **errp)
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int zstd_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int zstd_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+                           Error **errp)
 {
     struct zstd_data *z = p->data;
 
-    return qio_channel_write_all(p->c, (void *)z->zbuff, p->next_packet_size,
-                                 errp);
+    return qio_channel_write_all_flags(p->c, (void *)z->zbuff,
+                                       p->next_packet_size, flags, errp);
 }
 
 /**
diff --git a/migration/multifd.c b/migration/multifd.c
index 377da78f5b..097621c12c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -103,9 +103,10 @@ static int nocomp_send_prepare(MultiFDSendParams *p, uint32_t used,
  * @used: number of pages used
  * @errp: pointer to an error
  */
-static int nocomp_send_write(MultiFDSendParams *p, uint32_t used, Error **errp)
+static int nocomp_send_write(MultiFDSendParams *p, uint32_t used, int flags,
+                             Error **errp)
 {
-    return qio_channel_writev_all(p->c, p->pages->iov, used, errp);
+    return qio_channel_writev_all_flags(p->c, p->pages->iov, used, flags, errp);
 }
 
 /**
@@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
             }
 
             if (used) {
-                ret = multifd_send_state->ops->send_write(p, used, &local_err);
+                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
+                                                          &local_err);
                 if (ret != 0) {
                     break;
                 }
@@ -815,6 +817,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
         } else {
             /* update for tls qio channel */
             p->c = ioc;
+            qio_channel_set_zerocopy(ioc, true);
             qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
                                    QEMU_THREAD_JOINABLE);
        }
diff --git a/migration/multifd.h b/migration/multifd.h
index 8d6751f5ed..7243ba4185 100644
--- a/migration/multifd.h
+++ b/migration/multifd.h
@@ -157,7 +157,8 @@ typedef struct {
     /* Prepare the send packet */
     int (*send_prepare)(MultiFDSendParams *p, uint32_t used, Error **errp);
     /* Write the send packet */
-    int (*send_write)(MultiFDSendParams *p, uint32_t used, Error **errp);
+    int (*send_write)(MultiFDSendParams *p, uint32_t used, int flags,
+                      Error **errp);
     /* Setup for receiving side */
     int (*recv_setup)(MultiFDRecvParams *p, Error **errp);
     /* Cleanup for receiving side */
-- 
2.33.0



^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-08-31 11:02 ` [PATCH v1 2/3] io: Add zerocopy and errqueue Leonardo Bras
@ 2021-08-31 12:57   ` Daniel P. Berrangé
  2021-08-31 20:27     ` Peter Xu
  2021-09-02  6:38     ` Leonardo Bras Soares Passos
  0 siblings, 2 replies; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-08-31 12:57 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> send calls. It does so by avoiding copying user data into kernel buffers.
> 
> To make it work, three steps are needed:
> 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> 3 - Process the socket's error queue, dealing with any error

AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.

It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
from a synchronous call to an asynchronous call.

It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
until an asynchronous completion notification has been received from
the socket error queue. These notifications are not required to
arrive in-order, even for a TCP stream, because the kernel hangs on
to the buffer if a re-transmit is needed.

https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html

  "Page pinning also changes system call semantics. It temporarily 
   shares the buffer between process and network stack. Unlike with
   copying, the process cannot immediately overwrite the buffer 
   after system call return without possibly modifying the data in 
   flight. Kernel integrity is not affected, but a buggy program
   can possibly corrupt its own data stream."

AFAICT, the design added in this patch does not provide any way
to honour these requirements around buffer lifetime.

I can't see how we can introduce MSG_ZEROCOPY in any seamless
way. The buffer lifetime requirements imply the need for an API
design that is fundamentally different for asynchronous usage,
with a callback to notify when the write has finished/failed.
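
To make that concrete, here is a rough sketch (based on the msg_zerocopy
doc above, not a proposed API) of the completion tracking needed: the
kernel numbers zerocopy sends per socket, and each SO_EE_ORIGIN_ZEROCOPY
notification reports a closed range [ee_info, ee_data] of sends whose
buffers may now be reused:

#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

/* Returns 1 and fills [*lo, *hi] with a completed-send range, 0 if the
 * error queue is empty, -1 on a real transmission error. */
static int zerocopy_completions(int fd, uint32_t *lo, uint32_t *hi)
{
    char control[256];
    struct msghdr msg = { .msg_control = control,
                          .msg_controllen = sizeof(control) };
    struct cmsghdr *cm;
    struct sock_extended_err *serr;

    if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
        return errno == EAGAIN ? 0 : -1;
    }
    for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        if (cm->cmsg_level != SOL_IP || cm->cmsg_type != IP_RECVERR) {
            continue;
        }
        serr = (struct sock_extended_err *)CMSG_DATA(cm);
        if (serr->ee_errno != 0 ||
            serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
            return -1;
        }
        *lo = serr->ee_info;   /* first completed zerocopy send */
        *hi = serr->ee_data;   /* last completed zerocopy send  */
        return 1;              /* buffers for sends lo..hi are reusable */
    }
    return 0;
}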

> Zerocopy has it's costs, so it will only get improved performance if
> the sending buffer is big (10KB, according to Linux docs).
> 
> The step 2 makes it possible to use the same socket to send data
> using both zerocopy and the default copying approach, so the application
> cat choose what is best for each packet.
> 
> To implement step 1, an optional set_zerocopy() interface was created
> in QIOChannel, allowing each using code to enable or disable it.
> 
> Step 2 will be enabled by the using code at each qio_channel_write*()
> that would benefit of zerocopy;
> 
> Step 3 is done with qio_channel_socket_errq_proc(), that runs after
> SOCKET_ERRQ_THRESH (16k) iovs sent, dealing with any error found.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  include/io/channel-socket.h |  2 +
>  include/io/channel.h        | 29 ++++++++++++++
>  io/channel-socket.c         | 76 +++++++++++++++++++++++++++++++++++++
>  io/channel-tls.c            | 11 ++++++
>  io/channel-websock.c        |  9 +++++
>  io/channel.c                | 11 ++++++
>  6 files changed, 138 insertions(+)
> 
> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> index e747e63514..09dffe059f 100644
> --- a/include/io/channel-socket.h
> +++ b/include/io/channel-socket.h
> @@ -47,6 +47,8 @@ struct QIOChannelSocket {
>      socklen_t localAddrLen;
>      struct sockaddr_storage remoteAddr;
>      socklen_t remoteAddrLen;
> +    size_t errq_pending;
> +    bool zerocopy_enabled;
>  };
>  
>  
> diff --git a/include/io/channel.h b/include/io/channel.h
> index dada9ebaaf..de10a78b10 100644
> --- a/include/io/channel.h
> +++ b/include/io/channel.h
> @@ -137,6 +137,8 @@ struct QIOChannelClass {
>                                    IOHandler *io_read,
>                                    IOHandler *io_write,
>                                    void *opaque);
> +    void (*io_set_zerocopy)(QIOChannel *ioc,
> +                            bool enabled);
>  };
>  
>  /* General I/O handling functions */
> @@ -570,6 +572,33 @@ int qio_channel_shutdown(QIOChannel *ioc,
>  void qio_channel_set_delay(QIOChannel *ioc,
>                             bool enabled);
>  
> +/**
> + * qio_channel_set_zerocopy:
> + * @ioc: the channel object
> + * @enabled: the new flag state
> + *
> + * Controls whether the underlying transport is
> + * permitted to use zerocopy to avoid copying the
> + * sending buffer in kernel. If @enabled is true, then the
> + * writes may avoid buffer copy in kernel. If @enabled
> + * is false, writes will cause the kernel to always
> + * copy the buffer contents before sending.
> + *
> + * In order to use make a write with zerocopy feature,
> + * it's also necessary to sent each packet with
> + * MSG_ZEROCOPY flag. With this, it's possible to
> + * to select only writes that would benefit from the
> + * use of zerocopy feature, i.e. the ones with larger
> + * buffers.
> + *
> + * This feature was added in Linux 4.14, so older
> + * versions will fail on enabling. This is not an
> + * issue, since it will fall-back to default copying
> + * approach.
> + */
> +void qio_channel_set_zerocopy(QIOChannel *ioc,
> +                              bool enabled);
> +
>  /**
>   * qio_channel_set_cork:
>   * @ioc: the channel object
> diff --git a/io/channel-socket.c b/io/channel-socket.c
> index e377e7303d..a69fec7315 100644
> --- a/io/channel-socket.c
> +++ b/io/channel-socket.c
> @@ -26,8 +26,10 @@
>  #include "io/channel-watch.h"
>  #include "trace.h"
>  #include "qapi/clone-visitor.h"
> +#include <linux/errqueue.h>

That's going to fail to build on non-Linux.

>  
>  #define SOCKET_MAX_FDS 16
> +#define SOCKET_ERRQ_THRESH 16384
>  
>  SocketAddress *
>  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
> @@ -55,6 +57,8 @@ qio_channel_socket_new(void)
>  
>      sioc = QIO_CHANNEL_SOCKET(object_new(TYPE_QIO_CHANNEL_SOCKET));
>      sioc->fd = -1;
> +    sioc->zerocopy_enabled = false;
> +    sioc->errq_pending = 0;
>  
>      ioc = QIO_CHANNEL(sioc);
>      qio_channel_set_feature(ioc, QIO_CHANNEL_FEATURE_SHUTDOWN);
> @@ -520,6 +524,54 @@ static ssize_t qio_channel_socket_readv(QIOChannel *ioc,
>      return ret;
>  }
>  
> +static void qio_channel_socket_errq_proc(QIOChannelSocket *sioc,
> +                                         Error **errp)
> +{
> +    int fd = sioc->fd;
> +    int ret;
> +    struct msghdr msg = {};
> +    struct sock_extended_err *serr;
> +    struct cmsghdr *cm;
> +
> +    do {
> +        ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
> +        if (ret <= 0) {
> +            if (ret == 0 || errno == EAGAIN) {
> +                /* Nothing on errqueue */
> +                 sioc->errq_pending = 0;
> +                 break;
> +            }
> +            if (errno == EINTR) {
> +                continue;
> +            }
> +
> +            error_setg_errno(errp, errno,
> +                             "Unable to read errqueue");
> +            break;
> +        }
> +
> +        cm = CMSG_FIRSTHDR(&msg);
> +        if (cm->cmsg_level != SOL_IP &&
> +            cm->cmsg_type != IP_RECVERR) {
> +            error_setg_errno(errp, EPROTOTYPE,
> +                             "Wrong cmsg in errqueue");
> +            break;
> +        }
> +
> +        serr = (void *) CMSG_DATA(cm);
> +        if (serr->ee_errno != 0) {
> +            error_setg_errno(errp, serr->ee_errno,
> +                             "Error on socket");
> +            break;
> +        }
> +        if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
> +            error_setg_errno(errp, serr->ee_origin,
> +                             "Error not from zerocopy");
> +            break;
> +        }
> +    } while (true);
> +}
> +
>  static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>                                           const struct iovec *iov,
>                                           size_t niov,
> @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>                           "Unable to write to socket");
>          return -1;
>      }
> +
> +    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
> +        sioc->errq_pending += niov;
> +        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
> +            qio_channel_socket_errq_proc(sioc, errp);
> +        }
> +    }

This silently loses any errors set by up to the final
SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set.

It also means that if you have a series of writes without
MSG_ZEROCOPY, it'll delay checking any pending
errors.

I would suggest checking in close(), but as mentioned
earlier, I think the design is flawed because the caller
fundamentally needs to know about completion for every
single write they make in order to know when the buffer
can be released / reused.

> +static void
> +qio_channel_socket_set_zerocopy(QIOChannel *ioc,
> +                                bool enabled)
> +{
> +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> +    int v = enabled ? 1 : 0;
> +    int ret;
> +
> +    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> +    if (ret >= 0) {
> +        sioc->zerocopy_enabled = true;
> +    }
> +}

Surely we need to tell the caller whether this succeeded, otherwise
the later sendmsg() is going to fail with EINVAL on older kernels
where MSG_ZEROCOPY is not supported.


> diff --git a/io/channel-tls.c b/io/channel-tls.c
> index 4ce890a538..bf44b0f7b0 100644
> --- a/io/channel-tls.c
> +++ b/io/channel-tls.c
> @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
>      qio_channel_set_delay(tioc->master, enabled);
>  }
>  
> +
> +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
> +                                         bool enabled)
> +{
> +    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
> +
> +    qio_channel_set_zerocopy(tioc->master, enabled);
> +}

This is going to be unsafe because gnutls will discard/reuse the
buffer for the ciphertext after every write(). I can't see a
way to safely enable MSG_ZEROCOPY when TLS is used.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-08-31 11:02 ` [PATCH v1 3/3] migration: multifd: Enable zerocopy Leonardo Bras
@ 2021-08-31 13:16   ` Daniel P. Berrangé
  2021-08-31 20:29     ` Peter Xu
  2021-09-02  7:22     ` Leonardo Bras Soares Passos
  0 siblings, 2 replies; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-08-31 13:16 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> 
> Change the send_write() interface of multifd, allowing it to pass down
> flags for qio_channel_write*().
> 
> Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> other data being sent at the default copying approach.
> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  migration/multifd-zlib.c | 7 ++++---
>  migration/multifd-zstd.c | 7 ++++---
>  migration/multifd.c      | 9 ++++++---
>  migration/multifd.h      | 3 ++-
>  4 files changed, 16 insertions(+), 10 deletions(-)

> @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
>              }
>  
>              if (used) {
> -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> +                                                          &local_err);

I don't think it is valid to unconditionally enable this feature due to the
resource usage implications

https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html

  "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
   if the socket option was not set, the socket exceeds its optmem 
   limit or the user exceeds its ulimit on locked pages."

The limit on locked pages is something that looks very likely to be
exceeded unless you happen to be running a QEMU config that already
implies locked memory (eg PCI assignment)
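
For context, the ulimit in question is RLIMIT_MEMLOCK, which a process
can at least inspect up front (illustrative only, not a proposed check
for QEMU):

#include <stdio.h>
#include <sys/resource.h>

/* Print the locked-memory limit that MSG_ZEROCOPY page pinning is
 * charged against; the default soft limit is often quite small. */
static void report_memlock_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0) {
        fprintf(stderr, "RLIMIT_MEMLOCK: soft=%llu hard=%llu bytes\n",
                (unsigned long long)rl.rlim_cur,
                (unsigned long long)rl.rlim_max);
    }
}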


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-08-31 12:57   ` Daniel P. Berrangé
@ 2021-08-31 20:27     ` Peter Xu
  2021-09-01  8:50       ` Daniel P. Berrangé
  2021-09-02  6:59       ` Leonardo Bras Soares Passos
  2021-09-02  6:38     ` Leonardo Bras Soares Passos
  1 sibling, 2 replies; 62+ messages in thread
From: Peter Xu @ 2021-08-31 20:27 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel, Leonardo Bras,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > send calls. It does so by avoiding copying user data into kernel buffers.
> > 
> > To make it work, three steps are needed:
> > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > 3 - Process the socket's error queue, dealing with any error
> 
> AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> 
> It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> from a synchronous call to an asynchronous call.
> 
> It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> until an asynchronous completion notification has been received from
> the socket error queue. These notifications are not required to
> arrive in-order, even for a TCP stream, because the kernel hangs on
> to the buffer if a re-transmit is needed.
> 
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> 
>   "Page pinning also changes system call semantics. It temporarily 
>    shares the buffer between process and network stack. Unlike with
>    copying, the process cannot immediately overwrite the buffer 
>    after system call return without possibly modifying the data in 
>    flight. Kernel integrity is not affected, but a buggy program
>    can possibly corrupt its own data stream."
> 
> AFAICT, the design added in this patch does not provide any way
> to honour these requirements around buffer lifetime.
> 
> I can't see how we can introduce MSG_ZEROCOPY in any seemless
> way. The buffer lifetime requirements imply need for an API
> design that is fundamentally different for asynchronous usage,
> with a callback to notify when the write has finished/failed.

Regarding buffer reuse - it indeed has a very deep implication on the buffer
being available and it's not obvious at all.  Just to mention that the initial
user of this work will make sure all zero copy buffers will be guest pages only
(as it's only used in multi-fd), so they should always be there during the
process.

I think asking for a complete design still makes sense.  E.g., for precopy
before we flush device states and completes the migration, we may want to at
least have a final ack on all the zero-copies of guest pages to guarantee they
are flushed.

IOW, we need to make sure the last piece of the migration stream lands after
the guest pages, so that the dest VM will always contain the latest page data
when the dest VM starts.  So far I don't see how the current code guarantees
that.

In short, we may at least want a way to make sure all zero-copied buffers are
done being used and have been sent by the time some function returns
(e.g., qio_channel_flush()).  That may require us to do some accounting of
when we called sendmsg(MSG_ZEROCOPY); meanwhile we would need to read out the
ee_data field within the SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() on the
error queue, and keep that information somewhere too.
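
A very rough sketch of that accounting (hypothetical names;
zerocopy_completions() stands in for the errqueue reading sketched
earlier in the thread, returning a completed [lo, hi] send range):

#include <stdint.h>

int zerocopy_completions(int fd, uint32_t *lo, uint32_t *hi); /* see earlier sketch */

typedef struct {
    int      fd;
    uint32_t zc_submitted;  /* bumped on every sendmsg(..., MSG_ZEROCOPY) */
    uint32_t zc_completed;  /* advanced from the errqueue [lo, hi] ranges */
} ZeroCopyState;

/* Block until every submitted zerocopy send has been acknowledged on the
 * error queue; only then is it safe to finish migration / reuse pages. */
static int zerocopy_flush(ZeroCopyState *s)
{
    uint32_t lo, hi;

    while (s->zc_completed < s->zc_submitted) {
        int r = zerocopy_completions(s->fd, &lo, &hi);
        if (r < 0) {
            return -1;
        }
        if (r > 0) {
            s->zc_completed = hi + 1;   /* send counters start at 0 */
        }
        /* r == 0: nothing queued yet; a real implementation would poll() */
    }
    return 0;
}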

Some other side notes that reached my mind..

The qio_channel_writev_full() may not be suitable for async operations, as the
name "full" implies synchronous to me.  So maybe we can add a new helper for
zero copy on the channel?

We may also want a new QIOChannelFeature such as QIO_CHANNEL_FEATURE_ZEROCOPY,
so that qio_channel_writev_zerocopy() (or whatever name we come up with) fails
if that bit is not set in the qio channel features.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-08-31 13:16   ` Daniel P. Berrangé
@ 2021-08-31 20:29     ` Peter Xu
  2021-09-01  8:53       ` Daniel P. Berrangé
  2021-09-02  7:27       ` Leonardo Bras Soares Passos
  2021-09-02  7:22     ` Leonardo Bras Soares Passos
  1 sibling, 2 replies; 62+ messages in thread
From: Peter Xu @ 2021-08-31 20:29 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel, Leonardo Bras,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > 
> > Change the send_write() interface of multifd, allowing it to pass down
> > flags for qio_channel_write*().
> > 
> > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > other data being sent at the default copying approach.
> > 
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >  migration/multifd-zlib.c | 7 ++++---
> >  migration/multifd-zstd.c | 7 ++++---
> >  migration/multifd.c      | 9 ++++++---
> >  migration/multifd.h      | 3 ++-
> >  4 files changed, 16 insertions(+), 10 deletions(-)
> 
> > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> >              }
> >  
> >              if (used) {
> > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > +                                                          &local_err);
> 
> I don't think it is valid to unconditionally enable this feature due to the
> resource usage implications
> 
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> 
>   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
>    if the socket option was not set, the socket exceeds its optmem 
>    limit or the user exceeds its ulimit on locked pages."
> 
> The limit on locked pages is something that looks very likely to be
> exceeded unless you happen to be running a QEMU config that already
> implies locked memory (eg PCI assignment)

Yes, it would be great to have this as a migration capability in parallel to
multifd.  In the initial phase, if it's easier to implement it on multifd
only, we can add a dependency between the caps.  In the future we can remove
that dependency when the code is ready to go without multifd.  Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 0/3] QIOChannel flags + multifd zerocopy
  2021-08-31 11:02 [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Leonardo Bras
                   ` (2 preceding siblings ...)
  2021-08-31 11:02 ` [PATCH v1 3/3] migration: multifd: Enable zerocopy Leonardo Bras
@ 2021-08-31 21:24 ` Peter Xu
  2021-09-01 19:21   ` Leonardo Bras Soares Passos
  3 siblings, 1 reply; 62+ messages in thread
From: Peter Xu @ 2021-08-31 21:24 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

On Tue, Aug 31, 2021 at 08:02:36AM -0300, Leonardo Bras wrote:
> Results:
> So far, the resource usage of __sys_sendmsg() reduced 15 times, and the
> overall migration took 13-18% less time, based in synthetic workload.

Leo,

Could you share some of the details of your tests?  E.g., what's the
configuration of your VM for testing?  What's the migration time before/after
the patchset applied?  What is the network you're using?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-08-31 20:27     ` Peter Xu
@ 2021-09-01  8:50       ` Daniel P. Berrangé
  2021-09-01 15:52         ` Peter Xu
  2021-09-02  7:07         ` Leonardo Bras Soares Passos
  2021-09-02  6:59       ` Leonardo Bras Soares Passos
  1 sibling, 2 replies; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-01  8:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel, Leonardo Bras,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote:
> On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > 
> > > To make it work, three steps are needed:
> > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > 3 - Process the socket's error queue, dealing with any error
> > 
> > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > 
> > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > from a synchronous call to an asynchronous call.
> > 
> > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > until an asynchronous completion notification has been received from
> > the socket error queue. These notifications are not required to
> > arrive in-order, even for a TCP stream, because the kernel hangs on
> > to the buffer if a re-transmit is needed.
> > 
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > 
> >   "Page pinning also changes system call semantics. It temporarily 
> >    shares the buffer between process and network stack. Unlike with
> >    copying, the process cannot immediately overwrite the buffer 
> >    after system call return without possibly modifying the data in 
> >    flight. Kernel integrity is not affected, but a buggy program
> >    can possibly corrupt its own data stream."
> > 
> > AFAICT, the design added in this patch does not provide any way
> > to honour these requirements around buffer lifetime.
> > 
> > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > way. The buffer lifetime requirements imply need for an API
> > design that is fundamentally different for asynchronous usage,
> > with a callback to notify when the write has finished/failed.
> 
> Regarding buffer reuse - it indeed has a very deep implication on the buffer
> being available and it's not obvious at all.  Just to mention that the initial
> user of this work will make sure all zero copy buffers will be guest pages only
> (as it's only used in multi-fd), so they should always be there during the
> process.

That is not the case when migration is using TLS, because the buffers
transmitted are the ciphertext buffer, not the plaintext guest page.

> In short, we may just want to at least having a way to make sure all zero
> copied buffers are finished using and they're sent after some function returns
> (e.g., qio_channel_flush()).  That may require us to do some accounting on when
> we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> error queue and keep those information somewhere too.
> 
> Some other side notes that reached my mind..
> 
> The qio_channel_writev_full() may not be suitable for async operations, as the
> name "full" implies synchronous to me.  So maybe we can add a new helper for
> zero copy on the channel?

All the APIs that exist today are fundamentally only suitable for sync
operations. Supporting async correctly will definitely require a brand new set
of APIs separate from what exists today.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-08-31 20:29     ` Peter Xu
@ 2021-09-01  8:53       ` Daniel P. Berrangé
  2021-09-01 15:35         ` Peter Xu
  2021-09-02  7:27       ` Leonardo Bras Soares Passos
  1 sibling, 1 reply; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-01  8:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel, Leonardo Bras,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > 
> > > Change the send_write() interface of multifd, allowing it to pass down
> > > flags for qio_channel_write*().
> > > 
> > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > other data being sent at the default copying approach.
> > > 
> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > ---
> > >  migration/multifd-zlib.c | 7 ++++---
> > >  migration/multifd-zstd.c | 7 ++++---
> > >  migration/multifd.c      | 9 ++++++---
> > >  migration/multifd.h      | 3 ++-
> > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > 
> > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > >              }
> > >  
> > >              if (used) {
> > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > +                                                          &local_err);
> > 
> > I don't think it is valid to unconditionally enable this feature due to the
> > resource usage implications
> > 
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > 
> >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> >    if the socket option was not set, the socket exceeds its optmem 
> >    limit or the user exceeds its ulimit on locked pages."
> > 
> > The limit on locked pages is something that looks very likely to be
> > exceeded unless you happen to be running a QEMU config that already
> > implies locked memory (eg PCI assignment)
> 
> Yes it would be great to be a migration capability in parallel to multifd. At
> initial phase if it's easy to be implemented on multi-fd only, we can add a
> dependency between the caps.  In the future we can remove that dependency when
> the code is ready to go without multifd.  Thanks,

Also, I'm wondering how zerocopy support interacts with kernel support
for kTLS and multipath-TCP, both of which we want to be able to use
with migration.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-01  8:53       ` Daniel P. Berrangé
@ 2021-09-01 15:35         ` Peter Xu
  2021-09-01 15:44           ` Daniel P. Berrangé
  2021-09-02  7:23           ` Jason Wang
  0 siblings, 2 replies; 62+ messages in thread
From: Peter Xu @ 2021-09-01 15:35 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Jason Wang, Dr. David Alan Gilbert, qemu-devel,
	Leonardo Bras, Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > 
> > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > flags for qio_channel_write*().
> > > > 
> > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > other data being sent at the default copying approach.
> > > > 
> > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > ---
> > > >  migration/multifd-zlib.c | 7 ++++---
> > > >  migration/multifd-zstd.c | 7 ++++---
> > > >  migration/multifd.c      | 9 ++++++---
> > > >  migration/multifd.h      | 3 ++-
> > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > 
> > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > >              }
> > > >  
> > > >              if (used) {
> > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > +                                                          &local_err);
> > > 
> > > I don't think it is valid to unconditionally enable this feature due to the
> > > resource usage implications
> > > 
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > 
> > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> > >    if the socket option was not set, the socket exceeds its optmem 
> > >    limit or the user exceeds its ulimit on locked pages."
> > > 
> > > The limit on locked pages is something that looks very likely to be
> > > exceeded unless you happen to be running a QEMU config that already
> > > implies locked memory (eg PCI assignment)
> > 
> > Yes it would be great to be a migration capability in parallel to multifd. At
> > initial phase if it's easy to be implemented on multi-fd only, we can add a
> > dependency between the caps.  In the future we can remove that dependency when
> > the code is ready to go without multifd.  Thanks,
> 
> Also, I'm wondering how zerocopy support interacts with kernel support
> for kTLS and multipath-TCP, both of which we want to be able to use
> with migration.

Copying Jason Wang for net implications between these features on kernel side
and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).

From the safe side we may want to only enable one of them until we prove
they'll work together I guess..

Not an immediate concern, as I don't really think any of them is really
explicitly supported in qemu.

KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
ZEROCOPY mutually exclusive anyway because at least the userspace TLS code of
gnutls won't have a way to maintain the tls buffers used by zerocopy.  So at
least we need some knob to detect whether kTLS is enabled in gnutls.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-01 15:35         ` Peter Xu
@ 2021-09-01 15:44           ` Daniel P. Berrangé
  2021-09-01 16:01             ` Peter Xu
                               ` (2 more replies)
  2021-09-02  7:23           ` Jason Wang
  1 sibling, 3 replies; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-01 15:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Jason Wang, Dr. David Alan Gilbert, qemu-devel,
	Leonardo Bras, Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Wed, Sep 01, 2021 at 11:35:33AM -0400, Peter Xu wrote:
> On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > 
> > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > flags for qio_channel_write*().
> > > > > 
> > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > other data being sent at the default copying approach.
> > > > > 
> > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > ---
> > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > >  migration/multifd.c      | 9 ++++++---
> > > > >  migration/multifd.h      | 3 ++-
> > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > 
> > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > >              }
> > > > >  
> > > > >              if (used) {
> > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > +                                                          &local_err);
> > > > 
> > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > resource usage implications
> > > > 
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > 
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> > > >    if the socket option was not set, the socket exceeds its optmem 
> > > >    limit or the user exceeds its ulimit on locked pages."
> > > > 
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > > 
> > > Yes it would be great to be a migration capability in parallel to multifd. At
> > > initial phase if it's easy to be implemented on multi-fd only, we can add a
> > > dependency between the caps.  In the future we can remove that dependency when
> > > the code is ready to go without multifd.  Thanks,
> > 
> > Also, I'm wondering how zerocopy support interacts with kernel support
> > for kTLS and multipath-TCP, both of which we want to be able to use
> > with migration.
> 
> Copying Jason Wang for net implications between these features on kernel side
> and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).
> 
> From the safe side we may want to only enable one of them until we prove
> they'll work together I guess..

MPTCP is good when we're network limited for migration

KTLS will be good when we're CPU limited on AES for migration,
which is essentially always when TLS is used.

ZEROCOPY will be good when we're CPU limited for data copy
on migration, or to reduce the impact on other concurrent
VMs on the same CPUs.

Ultimately we would benefit from all of them at the same
time, if it were technically possible to do.

> Not a immediate concern as I don't really think any of them is really
> explicitly supported in qemu.

QEMU has mptcp support already:

  commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7
  Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
  Date:   Wed Apr 21 12:28:34 2021 +0100

    sockets: Support multipath TCP
    
    Multipath TCP allows combining multiple interfaces/routes into a single
    socket, with very little work for the user/admin.
    
    It's enabled by 'mptcp' on most socket addresses:
    
       ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp

> KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> least we need some knob to detect whether kTLS is enabled in gnutls.

It isn't possible for gnutls to transparently enable KTLS, because
GNUTLS doesn't get to see the actual socket directly - it'll need
some work in QEMU to enable it.  We know MPTCP and KTLS are currently
mutually exclusive as they both use the same kernel network hooks
framework.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-01  8:50       ` Daniel P. Berrangé
@ 2021-09-01 15:52         ` Peter Xu
  2021-09-01 15:59           ` Daniel P. Berrangé
  2021-09-02  7:07         ` Leonardo Bras Soares Passos
  1 sibling, 1 reply; 62+ messages in thread
From: Peter Xu @ 2021-09-01 15:52 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel, Leonardo Bras,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Wed, Sep 01, 2021 at 09:50:56AM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote:
> > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > > 
> > > > To make it work, three steps are needed:
> > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > 3 - Process the socket's error queue, dealing with any error
> > > 
> > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > > 
> > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > from a synchronous call to an asynchronous call.
> > > 
> > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > until an asynchronous completion notification has been received from
> > > the socket error queue. These notifications are not required to
> > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > to the buffer if a re-transmit is needed.
> > > 
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > 
> > >   "Page pinning also changes system call semantics. It temporarily 
> > >    shares the buffer between process and network stack. Unlike with
> > >    copying, the process cannot immediately overwrite the buffer 
> > >    after system call return without possibly modifying the data in 
> > >    flight. Kernel integrity is not affected, but a buggy program
> > >    can possibly corrupt its own data stream."
> > > 
> > > AFAICT, the design added in this patch does not provide any way
> > > to honour these requirements around buffer lifetime.
> > > 
> > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > way. The buffer lifetime requirements imply need for an API
> > > design that is fundamentally different for asynchronous usage,
> > > with a callback to notify when the write has finished/failed.
> > 
> > Regarding buffer reuse - it indeed has a very deep implication on the buffer
> > being available and it's not obvious at all.  Just to mention that the initial
> > user of this work will make sure all zero copy buffers will be guest pages only
> > (as it's only used in multi-fd), so they should always be there during the
> > process.
> 
> That is not the case when migration is using TLS, because the buffers
> transmitted are the ciphertext buffer, not the plaintext guest page.

Agreed.

> 
> > In short, we may just want to at least having a way to make sure all zero
> > copied buffers are finished using and they're sent after some function returns
> > (e.g., qio_channel_flush()).  That may require us to do some accounting on when
> > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> > error queue and keep those information somewhere too.
> > 
> > Some other side notes that reached my mind..
> > 
> > The qio_channel_writev_full() may not be suitable for async operations, as the
> > name "full" implies synchronous to me.  So maybe we can add a new helper for
> > zero copy on the channel?
> 
> All the APIs that exist today are fundamentally only suitable for sync
> operations. Supporting async correctly will definitely a brand new APIs
> separate from what exists today.

Yes.  What I wanted to say is maybe we can still keep the io_writev() interface
untouched, but just add a new interface at qio_channel_writev_full() level.

IOW, we could comment on io_writev() that it can be either sync or async now,
just like sendmsg() has that implication too with different flag passed in.
When calling io_writev() with different upper helpers, QIO channel could
identify what flag to pass over to io_writev().
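
For example, a sketch of the idea (assuming the flags argument this series adds
to qio_channel_writev_full(); the flag name and its value are placeholders):

#include "qemu/osdep.h"
#include "io/channel.h"

#define QIO_CHANNEL_WRITE_FLAG_ZEROCOPY 0x1  /* placeholder value */

/* io_writev() stays a single callback; only the upper-level helpers differ
 * in which flags they pass down (the existing helpers keep passing 0). */
ssize_t qio_channel_writev_zerocopy(QIOChannel *ioc,
                                    const struct iovec *iov, size_t niov,
                                    Error **errp)
{
    return qio_channel_writev_full(ioc, iov, niov, NULL, 0,
                                   QIO_CHANNEL_WRITE_FLAG_ZEROCOPY, errp);
}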

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-01 15:52         ` Peter Xu
@ 2021-09-01 15:59           ` Daniel P. Berrangé
  0 siblings, 0 replies; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-01 15:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel, Leonardo Bras,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Wed, Sep 01, 2021 at 11:52:13AM -0400, Peter Xu wrote:
> On Wed, Sep 01, 2021 at 09:50:56AM +0100, Daniel P. Berrangé wrote:
> > On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote:
> > > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> > > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > > > 
> > > > > To make it work, three steps are needed:
> > > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > > 3 - Process the socket's error queue, dealing with any error
> > > > 
> > > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > > > 
> > > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > > from a synchronous call to an asynchronous call.
> > > > 
> > > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > > until an asynchronous completion notification has been received from
> > > > the socket error queue. These notifications are not required to
> > > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > > to the buffer if a re-transmit is needed.
> > > > 
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > 
> > > >   "Page pinning also changes system call semantics. It temporarily 
> > > >    shares the buffer between process and network stack. Unlike with
> > > >    copying, the process cannot immediately overwrite the buffer 
> > > >    after system call return without possibly modifying the data in 
> > > >    flight. Kernel integrity is not affected, but a buggy program
> > > >    can possibly corrupt its own data stream."
> > > > 
> > > > AFAICT, the design added in this patch does not provide any way
> > > > to honour these requirements around buffer lifetime.
> > > > 
> > > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > > way. The buffer lifetime requirements imply need for an API
> > > > design that is fundamentally different for asynchronous usage,
> > > > with a callback to notify when the write has finished/failed.
> > > 
> > > Regarding buffer reuse - it indeed has a very deep implication on the buffer
> > > being available and it's not obvious at all.  Just to mention that the initial
> > > user of this work will make sure all zero copy buffers will be guest pages only
> > > (as it's only used in multi-fd), so they should always be there during the
> > > process.
> > 
> > That is not the case when migration is using TLS, because the buffers
> > transmitted are the ciphertext buffer, not the plaintext guest page.
> 
> Agreed.
> 
> > 
> > > In short, we may just want to at least having a way to make sure all zero
> > > copied buffers are finished using and they're sent after some function returns
> > > (e.g., qio_channel_flush()).  That may require us to do some accounting on when
> > > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> > > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> > > error queue and keep those information somewhere too.
> > > 
> > > Some other side notes that reached my mind..
> > > 
> > > The qio_channel_writev_full() may not be suitable for async operations, as the
> > > name "full" implies synchronous to me.  So maybe we can add a new helper for
> > > zero copy on the channel?
> > 
> > All the APIs that exist today are fundamentally only suitable for sync
> > operations. Supporting async correctly will definitely a brand new APIs
> > separate from what exists today.
> 
> Yes.  What I wanted to say is maybe we can still keep the io_writev() interface
> untouched, but just add a new interface at qio_channel_writev_full() level.
> 
> IOW, we could comment on io_writev() that it can be either sync or async now,
> just like sendmsg() has that implication too with different flag passed in.
> When calling io_writev() with different upper helpers, QIO channel could
> identify what flag to pass over to io_writev().

I don't think we should overload any of the existing methods with extra
parameters for async. Introduce a completely new set of methods exclusively
for this alternative interaction model.  I can kinda understand why they
took the approach they did with sendmsg, but I wouldn't use it as design
motivation for QEMU (except as an example of what not to do).

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-01 15:44           ` Daniel P. Berrangé
@ 2021-09-01 16:01             ` Peter Xu
  2021-09-02  7:57             ` Leonardo Bras Soares Passos
  2021-09-07 11:13             ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 62+ messages in thread
From: Peter Xu @ 2021-09-01 16:01 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Jason Wang, Dr. David Alan Gilbert, qemu-devel,
	Leonardo Bras, Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Wed, Sep 01, 2021 at 04:44:30PM +0100, Daniel P. Berrangé wrote:
> QEMU has mptcp support already:
> 
>   commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7
>   Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
>   Date:   Wed Apr 21 12:28:34 2021 +0100
> 
>     sockets: Support multipath TCP
>     
>     Multipath TCP allows combining multiple interfaces/routes into a single
>     socket, with very little work for the user/admin.
>     
>     It's enabled by 'mptcp' on most socket addresses:
>     
>        ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp

Oops, I totally forgot about that, sorry!

> 
> > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> > gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> > least we need some knob to detect whether kTLS is enabled in gnutls.
> 
> It isn't possible for gnutls to transparently enable KTLS, because
> GNUTLS doesn't get to see the actual socket directly - it'll need
> some work in QEMU to enable it.  We know MPTCP and KTLS are currently
> mutually exclusive as they both use the same kernel network hooks
> framework.

Then we may need to at least figure out whether zerocopy needs to mask out
mptcp.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 0/3] QIOChannel flags + multifd zerocopy
  2021-08-31 21:24 ` [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Peter Xu
@ 2021-09-01 19:21   ` Leonardo Bras Soares Passos
  0 siblings, 0 replies; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-01 19:21 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

Hello Peter,

On Tue, Aug 31, 2021 at 6:24 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Aug 31, 2021 at 08:02:36AM -0300, Leonardo Bras wrote:
> > Results:
> > So far, the resource usage of __sys_sendmsg() reduced 15 times, and the
> > overall migration took 13-18% less time, based in synthetic workload.
>
> Leo,
>
> Could you share some of the details of your tests?  E.g., what's the
> configuration of your VM for testing?  What's the migration time before/after
> the patchset applied?  What is the network you're using?
>
> Thanks,
>
> --
> Peter Xu
>

Sure,
- Both receiving and sending hosts have 128GB ram and a 10Gbps network interface
  - There is a direct connection between the network interfaces.
- The guest has 100GB ram, mem-lock=on and enable-kvm.
- Before sending, I use a simple application to completely fill all
guest pages with unique values, to avoid duplicated pages and zeroed
pages.

On a single test:

Without zerocopy (qemu/master)
- Migration took 123355ms, with an average of 6912.58 Mbps
With Zerocopy:
- Migration took 108514ms, with an average of 7858.39 Mbps

This represents a throughput improvement of around 13.6%.

Comparing perf recorded during default and zerocopy migration:
Without zerocopy:
- copy_user_generic_string() uses 5.4% of cpu time
- __sys_sendmsg() uses 5.19% of cpu time
With zerocopy:
- copy_user_generic_string() uses 0.02% of cpu time (~1/270 of the original)
- __sys_sendmsg() uses 0.34% of cpu time (~1/15 of the original)



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 1/3] io: Enable write flags for QIOChannel
  2021-08-31 11:02 ` [PATCH v1 1/3] io: Enable write flags for QIOChannel Leonardo Bras
@ 2021-09-01 20:54   ` Eric Blake
  2021-09-02  8:26     ` Leonardo Bras Soares Passos
  0 siblings, 1 reply; 62+ messages in thread
From: Eric Blake @ 2021-09-01 20:54 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng

On Tue, Aug 31, 2021 at 08:02:37AM -0300, Leonardo Bras wrote:
> Some syscalls used for writting, such as sendmsg(), accept flags that
> can modify their behavior, even allowing the usage of features such as
> MSG_ZEROCOPY.
> 
> Change qio_channel_write*() interface to allow passing down flags,
> allowing a more flexible use of IOChannel.
> 
> At first, it's use is enabled for QIOChannelSocket, but can be easily
> extended to any other QIOChannel implementation.

As a followup to this patch, I wonder if we can also get performance
improvements by implementing MSG_MORE, and using that in preference to
corking/uncorking to better indicate that back-to-back short messages
may behave better when grouped together over the wire.  At least the
NBD code could make use of it (going off of my experience with the
libnbd project demonstrating a performance boost when we added
MSG_MORE support there).
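
Roughly the pattern I have in mind, sketched with raw socket calls rather than
the QIOChannel wrappers (the helper is invented just for illustration):

#include <sys/socket.h>
#include <sys/uio.h>

/* Send a short header and its payload back to back, hinting with MSG_MORE
 * instead of bracketing the pair with TCP_CORK/uncork setsockopt() calls. */
static ssize_t send_hdr_then_payload(int fd,
                                     const void *hdr, size_t hdr_len,
                                     const void *payload, size_t payload_len)
{
    struct iovec hv = { .iov_base = (void *)hdr, .iov_len = hdr_len };
    struct iovec pv = { .iov_base = (void *)payload, .iov_len = payload_len };
    struct msghdr h = { .msg_iov = &hv, .msg_iovlen = 1 };
    struct msghdr p = { .msg_iov = &pv, .msg_iovlen = 1 };

    if (sendmsg(fd, &h, MSG_MORE) < 0) {   /* more data follows, hold off */
        return -1;
    }
    return sendmsg(fd, &p, 0);             /* payload; kernel may coalesce */
}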

> 
> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> ---
>  chardev/char-io.c                   |  2 +-
>  hw/remote/mpqemu-link.c             |  2 +-
>  include/io/channel.h                | 56 ++++++++++++++++++++---------
>  io/channel-buffer.c                 |  1 +
>  io/channel-command.c                |  1 +
>  io/channel-file.c                   |  1 +
>  io/channel-socket.c                 |  4 ++-
>  io/channel-tls.c                    |  1 +
>  io/channel-websock.c                |  1 +
>  io/channel.c                        | 53 ++++++++++++++-------------
>  migration/rdma.c                    |  1 +
>  scsi/pr-manager-helper.c            |  2 +-
>  tests/unit/test-io-channel-socket.c |  1 +
>  13 files changed, 81 insertions(+), 45 deletions(-)
> 
> diff --git a/chardev/char-io.c b/chardev/char-io.c
> index 8ced184160..4ea7b1ee2a 100644
> --- a/chardev/char-io.c
> +++ b/chardev/char-io.c
> @@ -122,7 +122,7 @@ int io_channel_send_full(QIOChannel *ioc,
>  
>          ret = qio_channel_writev_full(
>              ioc, &iov, 1,
> -            fds, nfds, NULL);
> +            fds, 0, nfds, NULL);

0 before nfds here...

>          if (ret == QIO_CHANNEL_ERR_BLOCK) {
>              if (offset) {
>                  return offset;
> diff --git a/hw/remote/mpqemu-link.c b/hw/remote/mpqemu-link.c
> index 7e841820e5..0d13321ef0 100644
> --- a/hw/remote/mpqemu-link.c
> +++ b/hw/remote/mpqemu-link.c
> @@ -69,7 +69,7 @@ bool mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc, Error **errp)
>      }
>  
>      if (!qio_channel_writev_full_all(ioc, send, G_N_ELEMENTS(send),
> -                                    fds, nfds, errp)) {
> +                                     fds, nfds, 0, errp)) {

Thanks for fixing the broken indentation.

...but after nfds here, so one is wrong; up to this point in a linear
review, I can't tell which was intended...

> +++ b/include/io/channel.h
> @@ -104,6 +104,7 @@ struct QIOChannelClass {
>                           size_t niov,
>                           int *fds,
>                           size_t nfds,
> +                         int flags,
>                           Error **errp);

...and finally I see that in general, you wanted to add the argument
after.  Looks like the change to char-io.c is buggy.

(You can use scripts/git.orderfile as a way to force your patch to
list the .h file first, to make it easier for linear review).

>      ssize_t (*io_readv)(QIOChannel *ioc,
>                          const struct iovec *iov,
> @@ -260,6 +261,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
>                                  size_t niov,
>                                  int *fds,
>                                  size_t nfds,
> +                                int flags,
>                                  Error **errp);
>  
>  /**
> @@ -325,6 +327,7 @@ int qio_channel_readv_all(QIOChannel *ioc,
>   * @ioc: the channel object
>   * @iov: the array of memory regions to write data from
>   * @niov: the length of the @iov array
> + * @flags: optional sending flags
>   * @errp: pointer to a NULL-initialized error object
>   *
>   * Write data to the IO channel, reading it from the
> @@ -339,10 +342,14 @@ int qio_channel_readv_all(QIOChannel *ioc,
>   *
>   * Returns: 0 if all bytes were written, or -1 on error
>   */
> -int qio_channel_writev_all(QIOChannel *ioc,
> -                           const struct iovec *iov,
> -                           size_t niov,
> -                           Error **erp);
> +int qio_channel_writev_all_flags(QIOChannel *ioc,
> +                                 const struct iovec *iov,
> +                                 size_t niov,
> +                                 int flags,
> +                                 Error **errp);

You changed the function name here, but not in the comment beforehand.

> +
> +#define qio_channel_writev_all(ioc, iov, niov, errp) \
> +    qio_channel_writev_all_flags(ioc, iov, niov, 0, errp)

In most cases, you were merely adding a new function to minimize churn
to existing callers while making the old name a macro,...

> @@ -853,6 +876,7 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
>                                  const struct iovec *iov,
>                                  size_t niov,
>                                  int *fds, size_t nfds,
> +                                int flags,
>                                  Error **errp);

...but this one you changed in-place.  Any reason?  It might be nice
to mention how you chose which functions to wrap (to minimize churn to
existing clients) vs. modify signatures.

>  
>  #endif /* QIO_CHANNEL_H */
> diff --git a/io/channel-buffer.c b/io/channel-buffer.c
> index baa4e2b089..bf52011be2 100644
> --- a/io/channel-buffer.c
> +++ b/io/channel-buffer.c
> @@ -81,6 +81,7 @@ static ssize_t qio_channel_buffer_writev(QIOChannel *ioc,
>                                           size_t niov,
>                                           int *fds,
>                                           size_t nfds,
> +                                         int flags,
>                                           Error **errp)
>  {
>      QIOChannelBuffer *bioc = QIO_CHANNEL_BUFFER(ioc);

Would it be wise to check that flags only contains values we can honor
in this (and all other) implementations of qio backends?  Do we need
some way for a backend to advertise to the core qio code which flags
it is willing to accept?

> +++ b/io/channel-socket.c
> @@ -525,6 +525,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>                                           size_t niov,
>                                           int *fds,
>                                           size_t nfds,
> +                                         int flags,
>                                           Error **errp)
>  {
>      QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> @@ -558,7 +559,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
>      }
>  
>   retry:
> -    ret = sendmsg(sioc->fd, &msg, 0);
> +    ret = sendmsg(sioc->fd, &msg, flags);

Because of this line, we are forcing our use of flags to be exactly
the same set of MSG_* flags understood by sendmsg(), which feels a bit
fragile.  Wouldn't it be safer to define our own set of QIO_MSG_
flags, and map those into whatever flag values the underlying backends
can support?  After all, one thing I learned on libnbd is that
MSG_MORE is not universally portable, but the goal of qio is to
abstract away things so that the rest of the code doesn't have to do
#ifdef tests everywhere, but instead let the qio code deal with it
(whether to ignore an unsupported flag because it is only an
optimization hint, or to return an error because we depend on the
behavior change the flag would cause if supported, or...).  And that
goes back to my question of whether backends should have a way to
inform the qio core which flags they can support.
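
Something along these lines is what I'm thinking of (all names invented purely
for illustration):

#include "qemu/osdep.h"
#include "qapi/error.h"
#include <sys/socket.h>

typedef enum {
    QIO_MSG_FLAG_ZEROCOPY = (1 << 0),   /* behaviour the caller depends on */
    QIO_MSG_FLAG_MORE     = (1 << 1),   /* pure optimization hint */
} QIOMsgFlags;

static int qio_msg_flags_to_sendmsg(int qio_flags, Error **errp)
{
    int sflags = 0;

    if (qio_flags & QIO_MSG_FLAG_ZEROCOPY) {
#ifdef MSG_ZEROCOPY
        sflags |= MSG_ZEROCOPY;
#else
        /* behaviour change we rely on: must report failure, not ignore */
        error_setg(errp, "Zero copy requested but not supported on this host");
        return -1;
#endif
    }
    if (qio_flags & QIO_MSG_FLAG_MORE) {
#ifdef MSG_MORE
        sflags |= MSG_MORE;  /* just a hint: silently dropped if unsupported */
#endif
    }
    return sflags;
}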

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-08-31 12:57   ` Daniel P. Berrangé
  2021-08-31 20:27     ` Peter Xu
@ 2021-09-02  6:38     ` Leonardo Bras Soares Passos
  2021-09-02  8:47       ` Daniel P. Berrangé
  1 sibling, 1 reply; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02  6:38 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

Hello Daniel, thank you for the feedback!

Comments inline.

On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > send calls. It does so by avoiding copying user data into kernel buffers.
> >
> > To make it work, three steps are needed:
> > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > 3 - Process the socket's error queue, dealing with any error
>
> AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
>
> It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> from a synchronous call to an asynchronous call.

You are correct. I tried to adapt the io_writev() interface to use MSG_ZEROCOPY
in a somewhat synchronous way, but returning errp (and sometimes closing the
channel because of it) was a poor implementation.

>
> It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> until an asynchronous completion notification has been received from
> the socket error queue. These notifications are not required to
> arrive in-order, even for a TCP stream, because the kernel hangs on
> to the buffer if a re-transmit is needed.
>
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
>
>   "Page pinning also changes system call semantics. It temporarily
>    shares the buffer between process and network stack. Unlike with
>    copying, the process cannot immediately overwrite the buffer
>    after system call return without possibly modifying the data in
>    flight. Kernel integrity is not affected, but a buggy program
>    can possibly corrupt its own data stream."
>

From the above piece of documentation, I gather there is no problem in
overwriting the buffer, but a corrupt or newer version of the memory may
be sent instead of the original one. I am pointing this out because there
are workloads, like page migration, that would not be impacted, given that
once the buffer is changed, the page is dirtied and will be re-sent.

But I agree with you.
It's not a good choice to expect all users to behave like that,
and since an interface for dealing with those errors is not provided
to the calling code, there is no way of using that in other scenarios.

> AFAICT, the design added in this patch does not provide any way
> to honour these requirements around buffer lifetime.
>
> I can't see how we can introduce MSG_ZEROCOPY in any seemless
> way. The buffer lifetime requirements imply need for an API
> design that is fundamentally different for asynchronous usage,
> with a callback to notify when the write has finished/failed.
>

That is a good point.
Proposing a new optional method like io_async_writev() + an error
checking mechanism could do the job.
io_async_writev() could fall back to io_writev() in cases where it's not
implemented (see the sketch after the options below).

I am not sure about the error checking yet.
Options I can see are:
1 - A callback, as you suggested, which IIUC would be provided by
code using the QIOChannel, and would only fix the reported errors,
leaving the responsibility of checking for errors to the IOChannel code.

2 - A new method, maybe io_async_errck(), which would be called
whenever the calling code wants to deal with pending errors. It could
return an array/list of IOVs that need to be re-sent, for example,
and code using QIOChannel could deal with it however it wants.
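
Just to make the fallback part concrete, a rough sketch of what I mean
(io_async_writev here is a hypothetical optional callback, not existing code):

#include "qemu/osdep.h"
#include "io/channel.h"

ssize_t qio_channel_async_writev(QIOChannel *ioc,
                                 const struct iovec *iov, size_t niov,
                                 Error **errp)
{
    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);

    if (klass->io_async_writev) {
        return klass->io_async_writev(ioc, iov, niov, errp);
    }
    /* Channel has no async support: fall back to the existing synchronous
     * path, which is always safe (the data is copied before returning). */
    return qio_channel_writev(ioc, iov, niov, errp);
}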

[snip]

> >   * qio_channel_set_cork:
> >   * @ioc: the channel object
> > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > index e377e7303d..a69fec7315 100644
> > --- a/io/channel-socket.c
> > +++ b/io/channel-socket.c
> > @@ -26,8 +26,10 @@
> >  #include "io/channel-watch.h"
> >  #include "trace.h"
> >  #include "qapi/clone-visitor.h"
> > +#include <linux/errqueue.h>
>
> That's going to fail to biuld on non-Linux

Good catch, thanks!

[snip]

> > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> >                           "Unable to write to socket");
> >          return -1;
> >      }
> > +
> > +    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
> > +        sioc->errq_pending += niov;
> > +        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
> > +            qio_channel_socket_errq_proc(sioc, errp);
> > +        }
> > +    }
>
> This silently looses any errors set from upto the final
> SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set.

You are right.

>
> Also means if you have a series of writes without
> MSG_ZEROCOPY, it'll delay checking any pending
> errors.

That's expected... if only sends without MSG_ZEROCOPY are happening,
it means the ones sent with zerocopy can wait. The problem would be
the above case.

>
> I would suggest checkig in close(), but as mentioned
> earlier, I think the design is flawed because the caller
> fundamentally needs to know about completion for every
> single write they make in order to know when the buffer
> can be released / reused.

Well, there could be a flush mechanism (maybe in io_sync_errck(), activated
with a parameter flag, or in a different method if a callback is preferred):
In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
syscall after each packet sent, and this means the fd gets a signal after each
sendmsg() happens, with error or not.

We could harness this with a poll() and a relatively high timeout:
- We stop sending packets, and then call poll().
- Whenever poll() returns 0, it means a timeout happened, and so it took too
long without any new signal arriving, meaning all the packets are sent.
- If it returns anything else, we go back to fixing the errors found (re-send).

The problem may be defining the value of this timeout, but it could be called
only when zerocopy is active.
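
As a rough sketch of that idea (process_errqueue() is a hypothetical helper
that would drain the error queue and handle/re-send whatever failed):

#include <poll.h>

void process_errqueue(int fd);  /* hypothetical helper */

static int zerocopy_flush(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = 0 };  /* POLLERR always reported */

    for (;;) {
        int ret = poll(&pfd, 1, timeout_ms);

        if (ret == 0) {
            return 0;   /* timed out: assume all pending sends completed */
        }
        if (ret < 0) {
            return -1;
        }
        if (pfd.revents & POLLERR) {
            process_errqueue(fd);
        }
    }
}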

What do you think?


>
> > +static void
> > +qio_channel_socket_set_zerocopy(QIOChannel *ioc,
> > +                                bool enabled)
> > +{
> > +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > +    int v = enabled ? 1 : 0;
> > +    int ret;
> > +
> > +    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > +    if (ret >= 0) {
> > +        sioc->zerocopy_enabled = true;
> > +    }
> > +}
>
> Surely we need to tell the caller wether this succeeed, otherwise
> the later sendmsg() is going to fail with EINVAL on older kernels
> where MSG_ZEROCOPY is not supported.

Yeah, that was the idea behind sioc->zerocopy_enabled, but in fact it should
have been something like 'sioc->zerocopy_enabled = (ret >= 0) && enabled'; this
way it would reflect zerocopy only if it was enabled and the syscall worked,
and not-zerocopy otherwise.

Or something like this, if we want it to stay on zerocopy when turning it off
fails:
if (ret >= 0) {
    sioc->zerocopy_enabled = enabled;
}

>
>
> > diff --git a/io/channel-tls.c b/io/channel-tls.c
> > index 4ce890a538..bf44b0f7b0 100644
> > --- a/io/channel-tls.c
> > +++ b/io/channel-tls.c
> > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
> >      qio_channel_set_delay(tioc->master, enabled);
> >  }
> >
> > +
> > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
> > +                                         bool enabled)
> > +{
> > +    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
> > +
> > +    qio_channel_set_zerocopy(tioc->master, enabled);
> > +}
>
> This is going to be unsafe because gnutls will discard/reuse the
> buffer for the ciphertext after every write(). I can't see a
> way to safely enable MSG_ZEROCOPY when TLS is used.

Yeah, that makes sense.
It would make more sense to implement KTLS, as IIRC it kind of does
'zerocopy', since it saves the encrypted data directly into a kernel buffer.

We could implement KTLS as io_async_writev() for Channel_TLS, and change this
flag to async_enabled. If KTLS is not available, it would fall back to using
gnuTLS on io_writev, just like it would happen with zerocopy.

>
>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-08-31 20:27     ` Peter Xu
  2021-09-01  8:50       ` Daniel P. Berrangé
@ 2021-09-02  6:59       ` Leonardo Bras Soares Passos
  2021-09-07 16:44         ` Peter Xu
  1 sibling, 1 reply; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02  6:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

Thanks for this feedback Peter!

I ended up reading/replying to the e-mails in thread order, so I may have
been redundant with your argument, sorry about that.

I will add my comments inline, but I will also add references to the
previous mail I sent to Daniel, so please read it too.

On Tue, Aug 31, 2021 at 5:27 PM Peter Xu <peterx@redhat.com> wrote:
[snip]
> >
> > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > way. The buffer lifetime requirements imply need for an API
> > design that is fundamentally different for asynchronous usage,
> > with a callback to notify when the write has finished/failed.
>
> Regarding buffer reuse - it indeed has a very deep implication on the buffer
> being available and it's not obvious at all.  Just to mention that the initial
> user of this work will make sure all zero copy buffers will be guest pages only
> (as it's only used in multi-fd), so they should always be there during the
> process.

Thanks for pointing that out, that's what I had in mind at the time.

>
> I think asking for a complete design still makes sense.

Agree, since I am touching QIOChannel, it makes sense to make it work for
other code that uses it too, not only our case.

>  E.g., for precopy
> before we flush device states and completes the migration, we may want to at
> least have a final ack on all the zero-copies of guest pages to guarantee they
> are flushed.
>
> IOW, we need to make sure the last piece of migration stream lands after the
> guest pages so that the dest VM will always contain the latest page data when
> dest VM starts.  So far I don't see how current code guaranteed that.
>
> In short, we may just want to at least having a way to make sure all zero
> copied buffers are finished using and they're sent after some function returns
> (e.g., qio_channel_flush()).  That may require us to do some accounting on when
> we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> error queue and keep those information somewhere too.

Yeah, that's correct.
I haven't fully studied what the returned data represents, but I suppose
this could be a way to fix that. In my previous reply to Daniel I pointed
out a way we may achieve a flush behavior with poll() too, but it could be
a little hacky.

>
> Some other side notes that reached my mind..
>
> The qio_channel_writev_full() may not be suitable for async operations, as the
> name "full" implies synchronous to me.  So maybe we can add a new helper for
> zero copy on the channel?
>
> We may also want a new QIOChannelFeature as QIO_CHANNEL_FEATURE_ZEROCOPY, then
> we fail qio_channel_writv_zerocopy() (or whatever name we come up with) if that
> bit is not set in qio channel features.

I also suggested something like that, but I thought it could be good if we could
fall back to io_writev() if we didn't have the zerocopy feature (or
the async feature).

What do you think?

>
> Thanks,
>
> --
> Peter Xu
>

I really appreciate your suggestions, thanks Peter!

Best regards,
Leonardo



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-01  8:50       ` Daniel P. Berrangé
  2021-09-01 15:52         ` Peter Xu
@ 2021-09-02  7:07         ` Leonardo Bras Soares Passos
  1 sibling, 0 replies; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02  7:07 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

Hello Daniel,
A few more comments:

On Wed, Sep 1, 2021 at 5:51 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote:
> > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > >
> > > > To make it work, three steps are needed:
> > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > 3 - Process the socket's error queue, dealing with any error
> > >
> > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > >
> > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > from a synchronous call to an asynchronous call.
> > >
> > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > until an asynchronous completion notification has been received from
> > > the socket error queue. These notifications are not required to
> > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > to the buffer if a re-transmit is needed.
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "Page pinning also changes system call semantics. It temporarily
> > >    shares the buffer between process and network stack. Unlike with
> > >    copying, the process cannot immediately overwrite the buffer
> > >    after system call return without possibly modifying the data in
> > >    flight. Kernel integrity is not affected, but a buggy program
> > >    can possibly corrupt its own data stream."
> > >
> > > AFAICT, the design added in this patch does not provide any way
> > > to honour these requirements around buffer lifetime.
> > >
> > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > way. The buffer lifetime requirements imply need for an API
> > > design that is fundamentally different for asynchronous usage,
> > > with a callback to notify when the write has finished/failed.
> >
> > Regarding buffer reuse - it indeed has a very deep implication on the buffer
> > being available and it's not obvious at all.  Just to mention that the initial
> > user of this work will make sure all zero copy buffers will be guest pages only
> > (as it's only used in multi-fd), so they should always be there during the
> > process.
>
> That is not the case when migration is using TLS, because the buffers
> transmitted are the ciphertext buffer, not the plaintext guest page.

That makes sense.
I am still working my way through the migration code, and I didn't
know about that part yet.

I think we can work on KTLS for this case, and implement something
'like zerocopy' for it. I added more details in the previous mail I sent you.


>
> > In short, we may just want to at least having a way to make sure all zero
> > copied buffers are finished using and they're sent after some function returns
> > (e.g., qio_channel_flush()).  That may require us to do some accounting on when
> > we called sendmsg(MSG_ZEROCOPY), meanwhile we should need to read out the
> > ee_data field within SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() for the
> > error queue and keep those information somewhere too.
> >
> > Some other side notes that reached my mind..
> >
> > The qio_channel_writev_full() may not be suitable for async operations, as the
> > name "full" implies synchronous to me.  So maybe we can add a new helper for
> > zero copy on the channel?
>
> All the APIs that exist today are fundamentally only suitable for sync
> operations. Supporting async correctly will definitely a brand new APIs
> separate from what exists today.

That's great. I was thinking of implementing this as an optional method for
QIOChannel; more details in the previous mail :). A rough sketch of the idea
is below.

(sorry, I ended up reading/replying to the mails in thread order)
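Just to make the idea concrete, a minimal sketch (the io_async_writev
callback and this wrapper are hypothetical names, not part of today's
QIOChannel API):

#include "qemu/osdep.h"
#include "io/channel.h"

/*
 * Hypothetical wrapper: if a channel implementation provides an
 * asynchronous write callback (here called io_async_writev, which does
 * not exist today), use it; otherwise fall back to the existing
 * synchronous copying path.
 */
ssize_t qio_channel_async_writev(QIOChannel *ioc,
                                 const struct iovec *iov,
                                 size_t niov,
                                 Error **errp)
{
    QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);

    if (klass->io_async_writev) {
        return klass->io_async_writev(ioc, iov, niov, errp);
    }
    /* No async support: behave exactly like today's API */
    return qio_channel_writev(ioc, iov, niov, errp);
}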

>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Best regards,
Leonardo Bras



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-08-31 13:16   ` Daniel P. Berrangé
  2021-08-31 20:29     ` Peter Xu
@ 2021-09-02  7:22     ` Leonardo Bras Soares Passos
  2021-09-02  8:20       ` Daniel P. Berrangé
  2021-09-07 18:32       ` Peter Xu
  1 sibling, 2 replies; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02  7:22 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

Hello Daniel, thanks for the feedback !

On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> >
> > Change the send_write() interface of multifd, allowing it to pass down
> > flags for qio_channel_write*().
> >
> > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > other data being sent at the default copying approach.
> >
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >  migration/multifd-zlib.c | 7 ++++---
> >  migration/multifd-zstd.c | 7 ++++---
> >  migration/multifd.c      | 9 ++++++---
> >  migration/multifd.h      | 3 ++-
> >  4 files changed, 16 insertions(+), 10 deletions(-)
>
> > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> >              }
> >
> >              if (used) {
> > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > +                                                          &local_err);
>
> I don't think it is valid to unconditionally enable this feature due to the
> resource usage implications
>
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
>
>   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
>    if the socket option was not set, the socket exceeds its optmem
>    limit or the user exceeds its ulimit on locked pages."

You are correct, I unfortunately missed this part in the docs :(

> The limit on locked pages is something that looks very likely to be
> exceeded unless you happen to be running a QEMU config that already
> implies locked memory (eg PCI assignment)

Do you mean the limit a user has on locking memory?

If so, that makes sense. I remember I needed to raise the user's limit on
locked memory before using it, or add a capability to qemu first.

Maybe an option would be trying to mlock all guest memory before setting
zerocopy=on in qemu code. If it fails, we can print an error message and fall
back to not using zerocopy (following the idea of a new io_async_writev()
I mentioned in the previous mail). Something along these lines:
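(A minimal sketch of that fallback idea; the helper name and the
host_addr/ram_size parameters are just placeholders, not existing QEMU
code.)

#include <sys/mman.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Placeholder helper: try to lock the guest RAM region before enabling
 * zerocopy; if the memlock ulimit forbids it, stay on the copying path.
 */
static bool try_enable_zerocopy(void *host_addr, size_t ram_size)
{
    if (mlock(host_addr, ram_size) != 0) {
        perror("mlock of guest RAM failed, keeping zerocopy disabled");
        return false;  /* caller keeps using plain sendmsg() */
    }
    return true;       /* safe to setsockopt(SO_ZEROCOPY) + MSG_ZEROCOPY */
}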


>
>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Best regards,
Leonardo



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-01 15:35         ` Peter Xu
  2021-09-01 15:44           ` Daniel P. Berrangé
@ 2021-09-02  7:23           ` Jason Wang
  2021-09-02  8:08             ` Leonardo Bras Soares Passos
  1 sibling, 1 reply; 62+ messages in thread
From: Jason Wang @ 2021-09-02  7:23 UTC (permalink / raw)
  To: Peter Xu, Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel, Leonardo Bras,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng


在 2021/9/1 下午11:35, Peter Xu 写道:
> On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
>> On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
>>> On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
>>>> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
>>>>> Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
>>>>>
>>>>> Change the send_write() interface of multifd, allowing it to pass down
>>>>> flags for qio_channel_write*().
>>>>>
>>>>> Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
>>>>> other data being sent at the default copying approach.
>>>>>
>>>>> Signed-off-by: Leonardo Bras <leobras@redhat.com>
>>>>> ---
>>>>>   migration/multifd-zlib.c | 7 ++++---
>>>>>   migration/multifd-zstd.c | 7 ++++---
>>>>>   migration/multifd.c      | 9 ++++++---
>>>>>   migration/multifd.h      | 3 ++-
>>>>>   4 files changed, 16 insertions(+), 10 deletions(-)
>>>>> @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
>>>>>               }
>>>>>   
>>>>>               if (used) {
>>>>> -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
>>>>> +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
>>>>> +                                                          &local_err);
>>>> I don't think it is valid to unconditionally enable this feature due to the
>>>> resource usage implications
>>>>
>>>> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
>>>>
>>>>    "A zerocopy failure will return -1 with errno ENOBUFS. This happens
>>>>     if the socket option was not set, the socket exceeds its optmem
>>>>     limit or the user exceeds its ulimit on locked pages."
>>>>
>>>> The limit on locked pages is something that looks very likely to be
>>>> exceeded unless you happen to be running a QEMU config that already
>>>> implies locked memory (eg PCI assignment)
>>> Yes it would be great to be a migration capability in parallel to multifd. At
>>> initial phase if it's easy to be implemented on multi-fd only, we can add a
>>> dependency between the caps.  In the future we can remove that dependency when
>>> the code is ready to go without multifd.  Thanks,
>> Also, I'm wondering how zerocopy support interacts with kernel support
>> for kTLS and multipath-TCP, both of which we want to be able to use
>> with migration.
> Copying Jason Wang for net implications between these features on kernel side


Note that the MSG_ZEROCOPY is contributed by Google :)


> and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).


I think they can. Anyway, the kernel can choose to fall back to a data copy when necessary.

Note that the "zerocopy" is probably not correct here. What's better is 
"Enable MSG_ZEROCOPY" since:

1) kernel supports various kinds of zerocopy, for TX, it has supported 
sendfile() for many years.
2) MSG_ZEROCOPY is only used for TX but not RX
3) TCP rx zerocopy is only supported via mmap() which requires some 
extra configurations e.g 4K MTU, driver support for header split etc.

[1] https://www.youtube.com/watch?v=_ZfiQGWFvg0

Thanks


>
>  From the safe side we may want to only enable one of them until we prove
> they'll work together I guess..
>
> Not a immediate concern as I don't really think any of them is really
> explicitly supported in qemu.
>
> KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> least we need some knob to detect whether kTLS is enabled in gnutls.
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-08-31 20:29     ` Peter Xu
  2021-09-01  8:53       ` Daniel P. Berrangé
@ 2021-09-02  7:27       ` Leonardo Bras Soares Passos
  1 sibling, 0 replies; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02  7:27 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

Hello Peter, thank you for this feedback!

On Tue, Aug 31, 2021 at 5:29 PM Peter Xu <peterx@redhat.com> wrote:
> Yes it would be great to be a migration capability in parallel to multifd. At
> initial phase if it's easy to be implemented on multi-fd only, we can add a
> dependency between the caps.  In the future we can remove that dependency when
> the code is ready to go without multifd.  Thanks,

I thought the idea was to use MSG_ZEROCOPY whenever possible, and otherwise
fall back to the default copying mechanism.

In my replies to 2/3 I mentioned a new method in the QIOChannel interface
that would fall back to copying, which may be a way to avoid needing a
capability.

Best regards,
Leonardo



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-01 15:44           ` Daniel P. Berrangé
  2021-09-01 16:01             ` Peter Xu
@ 2021-09-02  7:57             ` Leonardo Bras Soares Passos
  2021-09-07 11:13             ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02  7:57 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Jason Wang, Dr. David Alan Gilbert, Peter Xu,
	qemu-devel, Paolo Bonzini, Marc-André Lureau, Fam Zheng

A few more comments on this one:

On Wed, Sep 1, 2021 at 12:44 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> > From the safe side we may want to only enable one of them until we prove
> > they'll work together I guess..
>
> MPTCP is good when we're network limited for migration
>
> KTLS will be good when we're CPU limited on AES for migration,
> which is essentially always when TLS is used.
>
> ZEROCOPY will be good when we're CPU limited for data copy
> on migration, or to reduce the impact on other concurrent
> VMs on the same CPUs.
>
> Ultimately we woudld benefit from all of them at the same
> time, if it were technically possible todo.

Maybe I misunderstood something, but IIRC KTLS already kind of does some
'zerocopy'.

I mean, on a usual setup, we would have a buffer A containing the data to be
encrypted, a buffer B that receives the encrypted data, and a buffer C, in
the kernel, that receives a copy of buffer B.

With KTLS, buffer A would be passed to the kernel and encrypted directly
into buffer C, avoiding one copy.
It's also possible that buffer A would first be copied to a buffer B' inside
the kernel and then encrypted into buffer C, but I am not sure that would
make sense.

Anyway, other than the B' case, it would make no sense to have MSG_ZEROCOPY
in QIOChannel_TLS, so that's something I need to change in this patchset.

>
> > Not a immediate concern as I don't really think any of them is really
> > explicitly supported in qemu.
>
> QEMU has mptcp support already:
>
>   commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7
>   Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
>   Date:   Wed Apr 21 12:28:34 2021 +0100
>
>     sockets: Support multipath TCP
>
>     Multipath TCP allows combining multiple interfaces/routes into a single
>     socket, with very little work for the user/admin.
>
>     It's enabled by 'mptcp' on most socket addresses:
>
>        ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp
>
> > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> > gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> > least we need some knob to detect whether kTLS is enabled in gnutls.
>
> It isn't possible for gnutls to transparently enable KTLS, because
> GNUTLS doesn't get to see the actual socket directly

Yes, IIRC it uses gnuTLS with callbacks instead.

> - it'll need
> some work in QEMU to enable it.

Maybe it's overkill, but what if we implement KTLS directly in QEMU
and fall back to gnuTLS when it's not available? Roughly along the lines
of the kernel API sketched below.
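(A rough sketch of the kernel side of that, based on the kernel TLS
documentation; it assumes a kernel/libc exposing TCP_ULP and linux/tls.h,
and the crypto material would have to come from the finished gnuTLS
handshake, which is omitted here.)

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>   /* TCP_ULP, on recent kernels/libc */
#include <linux/tls.h>     /* TLS_TX, tls12_crypto_info_aes_gcm_128 */

#ifndef SOL_TLS
#define SOL_TLS 282
#endif

/* Attach the kernel TLS ULP and program the TX key; returns -1 to
 * signal that the caller should fall back to userspace gnuTLS. */
static int enable_ktls_tx(int fd)
{
    struct tls12_crypto_info_aes_gcm_128 ci = {
        .info.version = TLS_1_2_VERSION,
        .info.cipher_type = TLS_CIPHER_AES_GCM_128,
        /* .key, .iv, .salt, .rec_seq must be filled from the handshake */
    };

    if (setsockopt(fd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0) {
        return -1;   /* no kTLS in this kernel */
    }
    return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}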



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-02  7:23           ` Jason Wang
@ 2021-09-02  8:08             ` Leonardo Bras Soares Passos
  0 siblings, 0 replies; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02  8:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng

Thanks for contributing Jason!

On Thu, Sep 2, 2021 at 4:23 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> 在 2021/9/1 下午11:35, Peter Xu 写道:
> > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> >> On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> >>> On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> >>>> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> >>>>> Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> >>>>>
> >>>>> Change the send_write() interface of multifd, allowing it to pass down
> >>>>> flags for qio_channel_write*().
> >>>>>
> >>>>> Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> >>>>> other data being sent at the default copying approach.
> >>>>>
> >>>>> Signed-off-by: Leonardo Bras <leobras@redhat.com>
> >>>>> ---
> >>>>>   migration/multifd-zlib.c | 7 ++++---
> >>>>>   migration/multifd-zstd.c | 7 ++++---
> >>>>>   migration/multifd.c      | 9 ++++++---
> >>>>>   migration/multifd.h      | 3 ++-
> >>>>>   4 files changed, 16 insertions(+), 10 deletions(-)
> >>>>> @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> >>>>>               }
> >>>>>
> >>>>>               if (used) {
> >>>>> -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> >>>>> +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> >>>>> +                                                          &local_err);
> >>>> I don't think it is valid to unconditionally enable this feature due to the
> >>>> resource usage implications
> >>>>
> >>>> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >>>>
> >>>>    "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> >>>>     if the socket option was not set, the socket exceeds its optmem
> >>>>     limit or the user exceeds its ulimit on locked pages."
> >>>>
> >>>> The limit on locked pages is something that looks very likely to be
> >>>> exceeded unless you happen to be running a QEMU config that already
> >>>> implies locked memory (eg PCI assignment)
> >>> Yes it would be great to be a migration capability in parallel to multifd. At
> >>> initial phase if it's easy to be implemented on multi-fd only, we can add a
> >>> dependency between the caps.  In the future we can remove that dependency when
> >>> the code is ready to go without multifd.  Thanks,
> >> Also, I'm wondering how zerocopy support interacts with kernel support
> >> for kTLS and multipath-TCP, both of which we want to be able to use
> >> with migration.
> > Copying Jason Wang for net implications between these features on kernel side
>
>
> Note that the MSG_ZEROCOPY is contributed by Google :)
>
>
> > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).
>
>
> I think they can. Anyway kernel can choose to do datacopy when necessary.
>
> Note that the "zerocopy" is probably not correct here. What's better is
> "Enable MSG_ZEROCOPY" since:
>
> 1) kernel supports various kinds of zerocopy, for TX, it has supported
> sendfile() for many years.
> 2) MSG_ZEROCOPY is only used for TX but not RX
> 3) TCP rx zerocopy is only supported via mmap() which requires some
> extra configurations e.g 4K MTU, driver support for header split etc.

RX would be my next challenge :)

>
> [1] https://www.youtube.com/watch?v=_ZfiQGWFvg0

Thank you for sharing!

Best regards,
Leonardo



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-02  7:22     ` Leonardo Bras Soares Passos
@ 2021-09-02  8:20       ` Daniel P. Berrangé
  2021-09-02  8:52         ` Leonardo Bras Soares Passos
  2021-09-07 18:32       ` Peter Xu
  1 sibling, 1 reply; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-02  8:20 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> Hello Daniel, thanks for the feedback !
> 
> On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > >
> > > Change the send_write() interface of multifd, allowing it to pass down
> > > flags for qio_channel_write*().
> > >
> > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > other data being sent at the default copying approach.
> > >
> > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > ---
> > >  migration/multifd-zlib.c | 7 ++++---
> > >  migration/multifd-zstd.c | 7 ++++---
> > >  migration/multifd.c      | 9 ++++++---
> > >  migration/multifd.h      | 3 ++-
> > >  4 files changed, 16 insertions(+), 10 deletions(-)
> >
> > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > >              }
> > >
> > >              if (used) {
> > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > +                                                          &local_err);
> >
> > I don't think it is valid to unconditionally enable this feature due to the
> > resource usage implications
> >
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >
> >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> >    if the socket option was not set, the socket exceeds its optmem
> >    limit or the user exceeds its ulimit on locked pages."
> 
> You are correct, I unfortunately missed this part in the docs :(
> 
> > The limit on locked pages is something that looks very likely to be
> > exceeded unless you happen to be running a QEMU config that already
> > implies locked memory (eg PCI assignment)
> 
> Do you mean the limit an user has on locking memory?

Yes, by default the limit QEMU sees will be something very small.

> If so, that makes sense. I remember I needed to set the upper limit of locked
> memory for the user before using it, or adding a capability to qemu before.
> 
> Maybe an option would be trying to mlock all guest memory before setting
> zerocopy=on in qemu code. If it fails, we can print an error message and fall
> back to not using zerocopy (following the idea of a new io_async_writev()
> I told you in the previous mail).

Currently ability to lock memory is something that has to be configured
when QEMU starts, and it requires libvirt to grant suitable permissions
to QEMU. Memory locking is generally undesirable because it prevents
memory overcommit. Or rather if you are allowing memory overcommit, then
allowing memory locking is a way to kill your entire host.

I don't think we can unconditionally grant ability to lock arbitrary
guest RAM at startup, just to satisfy a possible desire to use zerocopy
migration later. Granting it at runtime feels questionable as you now
need to track and predict how much locked memory you've allowed, and
also have possible problems with revocation.

Possibly you could unconditionally grant ability to lock a small amount
of guest RAM at startup, but how small can it be, while still making a
useful difference to migration. It would imply we also need to be very
careful with migration to avoid having too large an amount of outstanding
zerocopy requests to exceed the limit.

IOW, the only clear place in which we can use zerocopy, is where we are
already forced to accept the penalty of locked memory at startup. eg when
the guest is using huge pages and no overcommit, or possibly when the guest
is using PCI device assignment, though in the latter I can't remember if
we allow the entirety of guest RAM to be locked or not.

Overall the memory locking needs look like a significant constraint that
will affect ability to use this feature.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 1/3] io: Enable write flags for QIOChannel
  2021-09-01 20:54   ` Eric Blake
@ 2021-09-02  8:26     ` Leonardo Bras Soares Passos
  0 siblings, 0 replies; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02  8:26 UTC (permalink / raw)
  To: Eric Blake
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng

Thanks for the feedback Eric!

On Wed, Sep 1, 2021 at 5:54 PM Eric Blake <eblake@redhat.com> wrote:
>
> On Tue, Aug 31, 2021 at 08:02:37AM -0300, Leonardo Bras wrote:
> > Some syscalls used for writting, such as sendmsg(), accept flags that
> > can modify their behavior, even allowing the usage of features such as
> > MSG_ZEROCOPY.
> >
> > Change qio_channel_write*() interface to allow passing down flags,
> > allowing a more flexible use of IOChannel.
> >
> > At first, it's use is enabled for QIOChannelSocket, but can be easily
> > extended to any other QIOChannel implementation.
>
> As a followup to this patch, I wonder if we can also get performance
> improvements by implementing MSG_MORE, and using that in preference to
> corking/uncorking to better indicate that back-to-back short messages
> may behave better when grouped together over the wire.  At least the
> NBD code could make use of it (going off of my experience with the
> libnbd project demonstrating a performance boost when we added
> MSG_MORE support there).

That's interesting!
We could use this patchset for testing that out, as I believe it's easy
to add those flags to the sendmsg() calls where we want them enabled. A
minimal illustration is below.
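(Short-write handling left out; this just shows a small header hinting the
kernel that the payload follows immediately.)

#include <sys/socket.h>
#include <sys/uio.h>

/* Send a small header with MSG_MORE so the kernel may coalesce it with
 * the payload into a single segment, instead of cork/uncork dances. */
static ssize_t send_hdr_then_payload(int fd, struct iovec *hdr,
                                     struct iovec *payload)
{
    struct msghdr m = { .msg_iov = hdr, .msg_iovlen = 1 };
    ssize_t ret = sendmsg(fd, &m, MSG_MORE);   /* "more data coming" */

    if (ret < 0) {
        return ret;
    }
    m.msg_iov = payload;
    return sendmsg(fd, &m, 0);                 /* completes the message */
}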

>
> >
> > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > ---
> >  chardev/char-io.c                   |  2 +-
> >  hw/remote/mpqemu-link.c             |  2 +-
> >  include/io/channel.h                | 56 ++++++++++++++++++++---------
> >  io/channel-buffer.c                 |  1 +
> >  io/channel-command.c                |  1 +
> >  io/channel-file.c                   |  1 +
> >  io/channel-socket.c                 |  4 ++-
> >  io/channel-tls.c                    |  1 +
> >  io/channel-websock.c                |  1 +
> >  io/channel.c                        | 53 ++++++++++++++-------------
> >  migration/rdma.c                    |  1 +
> >  scsi/pr-manager-helper.c            |  2 +-
> >  tests/unit/test-io-channel-socket.c |  1 +
> >  13 files changed, 81 insertions(+), 45 deletions(-)
> >
> > diff --git a/chardev/char-io.c b/chardev/char-io.c
> > index 8ced184160..4ea7b1ee2a 100644
> > --- a/chardev/char-io.c
> > +++ b/chardev/char-io.c
> > @@ -122,7 +122,7 @@ int io_channel_send_full(QIOChannel *ioc,
> >
> >          ret = qio_channel_writev_full(
> >              ioc, &iov, 1,
> > -            fds, nfds, NULL);
> > +            fds, 0, nfds, NULL);
>
> 0 before nfds here...

Good catch! I will fix that for v2!

>
> >          if (ret == QIO_CHANNEL_ERR_BLOCK) {
> >              if (offset) {
> >                  return offset;
> > diff --git a/hw/remote/mpqemu-link.c b/hw/remote/mpqemu-link.c
> > index 7e841820e5..0d13321ef0 100644
> > --- a/hw/remote/mpqemu-link.c
> > +++ b/hw/remote/mpqemu-link.c
> > @@ -69,7 +69,7 @@ bool mpqemu_msg_send(MPQemuMsg *msg, QIOChannel *ioc, Error **errp)
> >      }
> >
> >      if (!qio_channel_writev_full_all(ioc, send, G_N_ELEMENTS(send),
> > -                                    fds, nfds, errp)) {
> > +                                     fds, nfds, 0, errp)) {
>
> Thanks for fixing the broken indentation.

I kept asking myself whether I was breaking something here :)

>
> ...but after nfds here, so one is wrong; up to this point in a linear
> review, I can't tell which was intended...
>
> > +++ b/include/io/channel.h
> > @@ -104,6 +104,7 @@ struct QIOChannelClass {
> >                           size_t niov,
> >                           int *fds,
> >                           size_t nfds,
> > +                         int flags,
> >                           Error **errp);
>
> ...and finally I see that in general, you wanted to add the argument
> after.  Looks like the change to char-io.c is buggy.

Yeap!

>
> (You can use scripts/git.orderfile as a way to force your patch to
> list the .h file first, to make it easier for linear review).

Nice tip! Just added it to my qemu gitconfig :)

>
> >      ssize_t (*io_readv)(QIOChannel *ioc,
> >                          const struct iovec *iov,
> > @@ -260,6 +261,7 @@ ssize_t qio_channel_writev_full(QIOChannel *ioc,
> >                                  size_t niov,
> >                                  int *fds,
> >                                  size_t nfds,
> > +                                int flags,
> >                                  Error **errp);
> >
> >  /**
> > @@ -325,6 +327,7 @@ int qio_channel_readv_all(QIOChannel *ioc,
> >   * @ioc: the channel object
> >   * @iov: the array of memory regions to write data from
> >   * @niov: the length of the @iov array
> > + * @flags: optional sending flags
> >   * @errp: pointer to a NULL-initialized error object
> >   *
> >   * Write data to the IO channel, reading it from the
> > @@ -339,10 +342,14 @@ int qio_channel_readv_all(QIOChannel *ioc,
> >   *
> >   * Returns: 0 if all bytes were written, or -1 on error
> >   */
> > -int qio_channel_writev_all(QIOChannel *ioc,
> > -                           const struct iovec *iov,
> > -                           size_t niov,
> > -                           Error **erp);
> > +int qio_channel_writev_all_flags(QIOChannel *ioc,
> > +                                 const struct iovec *iov,
> > +                                 size_t niov,
> > +                                 int flags,
> > +                                 Error **errp);
>
> You changed the function name here, but not in the comment beforehand.
>

Will fix this for v2, thanks !

> > +
> > +#define qio_channel_writev_all(ioc, iov, niov, errp) \
> > +    qio_channel_writev_all_flags(ioc, iov, niov, 0, errp)
>
> In most cases, you were merely adding a new function to minimize churn
> to existing callers while making the old name a macro,...
>
> > @@ -853,6 +876,7 @@ int qio_channel_writev_full_all(QIOChannel *ioc,
> >                                  const struct iovec *iov,
> >                                  size_t niov,
> >                                  int *fds, size_t nfds,
> > +                                int flags,
> >                                  Error **errp);
>
> ...but this one you changed in-place.  Any reason?  It might be nice
> to mention how you chose which functions to wrap (to minimize churn to
> existing clients) vs. modify signatures.

It's the first one I changed, TBH.
It just had a few uses, mostly in the same file scope, and a single
use in mpqemu-link.c, so I thought it would be ok to just change it.

But yeah, it makes sense to add a macro for this one as well, and
create another function to keep them all looking the same.

>
> >
> >  #endif /* QIO_CHANNEL_H */
> > diff --git a/io/channel-buffer.c b/io/channel-buffer.c
> > index baa4e2b089..bf52011be2 100644
> > --- a/io/channel-buffer.c
> > +++ b/io/channel-buffer.c
> > @@ -81,6 +81,7 @@ static ssize_t qio_channel_buffer_writev(QIOChannel *ioc,
> >                                           size_t niov,
> >                                           int *fds,
> >                                           size_t nfds,
> > +                                         int flags,
> >                                           Error **errp)
> >  {
> >      QIOChannelBuffer *bioc = QIO_CHANNEL_BUFFER(ioc);
>
> Would it be wise to check that flags only contains values we can honor
> in this (and all other) implementations of qio backends?  Do we need
> some way for a backend to advertise to the core qio code which flags
> it is willing to accept?

That's a good idea: maybe we can do as you suggest below, i.e. choose a set of
features we are willing to support and then translate them depending on the
implementation. Then this would only be a matter of testing against a mask.

>
> > +++ b/io/channel-socket.c
> > @@ -525,6 +525,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> >                                           size_t niov,
> >                                           int *fds,
> >                                           size_t nfds,
> > +                                         int flags,
> >                                           Error **errp)
> >  {
> >      QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > @@ -558,7 +559,7 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> >      }
> >
> >   retry:
> > -    ret = sendmsg(sioc->fd, &msg, 0);
> > +    ret = sendmsg(sioc->fd, &msg, flags);
>
> Because of this line, we are forcing our use of flags to be exactly
> the same set of MSG_* flags understood by sendmsg(), which feels a bit
> fragile.  Wouldn't it be safer to define our own set of QIO_MSG_
> flags, and map those into whatever flag values the underlying backends
> can support?  After all, one thing I learned on libnbd is that
> MSG_MORE is not universally portable, but the goal of qio is to
> abstract away things so that the rest of the code doesn't have to do
> #ifdef tests everywhere, but instead let the qio code deal with it
> (whether to ignore an unsupported flag because it is only an
> optimization hint, or to return an error because we depend on the
> behavior change the flag would cause if supported, or...).  And that
> goes back to my question of whether backends should have a way to
> inform the qio core which flags they can support.

I think you are correct, and having our own QIO_MSG_* flags would make
sense here. That would allow us to filter out incorrect flags easily, and
also document clearly what each implementation supports, via its own mask.
Something like the sketch below.
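
(All the QIO_MSG_* names and the translation helper are made up for
illustration; MSG_ZEROCOPY needs a recent kernel/libc.)

#include <sys/socket.h>
#include "qapi/error.h"

/* Portable write flags exposed by QIOChannel (illustrative names only) */
enum {
    QIO_MSG_ZEROCOPY = 1 << 0,
    QIO_MSG_MORE     = 1 << 1,
};

/*
 * Each implementation would advertise a 'supported' mask; only the
 * socket backend translates the portable flags into sendmsg() MSG_*.
 */
static int qio_to_sendmsg_flags(int qio_flags, int supported, Error **errp)
{
    int sflags = 0;

    if (qio_flags & ~supported) {
        error_setg(errp, "Requested write flags 0x%x not supported",
                   qio_flags & ~supported);
        return -1;
    }
    if (qio_flags & QIO_MSG_ZEROCOPY) {
        sflags |= MSG_ZEROCOPY;
    }
    if (qio_flags & QIO_MSG_MORE) {
        sflags |= MSG_MORE;
    }
    return sflags;
}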

Thanks!

Leonardo Bras

>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-02  6:38     ` Leonardo Bras Soares Passos
@ 2021-09-02  8:47       ` Daniel P. Berrangé
  2021-09-02  9:34         ` Leonardo Bras Soares Passos
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-02  8:47 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> Hello Daniel, thank you for the feedback!
> 
> Comments inline.
> 
> On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > send calls. It does so by avoiding copying user data into kernel buffers.
> > >
> > > To make it work, three steps are needed:
> > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > 3 - Process the socket's error queue, dealing with any error
> >
> > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> >
> > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > from a synchronous call to an asynchronous call.
> 
You are correct. I tried to adapt the io_writev() interface to use MSG_ZEROCOPY
in a somewhat synchronous way, but returning errp (and sometimes closing the
channel because of it) was a poor implementation.
> 
> >
> > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > until an asynchronous completion notification has been received from
> > the socket error queue. These notifications are not required to
> > arrive in-order, even for a TCP stream, because the kernel hangs on
> > to the buffer if a re-transmit is needed.
> >
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >
> >   "Page pinning also changes system call semantics. It temporarily
> >    shares the buffer between process and network stack. Unlike with
> >    copying, the process cannot immediately overwrite the buffer
> >    after system call return without possibly modifying the data in
> >    flight. Kernel integrity is not affected, but a buggy program
> >    can possibly corrupt its own data stream."
> >
> 
From the above piece of documentation, I take it there is no problem in
overwriting the buffer as such, but a corrupt or newer version of the memory
may be sent instead of the original one. I am pointing this out because there
are workloads, like page migration, that would not be impacted: once the
buffer is changed, the page is dirtied and will be re-sent.

The idea of having the buffer overwritten is still seriously worrying
me. I get the idea that in theory the "worst" that should happen is
that we unexpectedly transmit newer guest memory contents. There are
so many edge cases in migration code though that I'm worried there
might be scenarios in which that is still going to cause problems.

The current migration code has a synchronization point at the end of
each iteration of the migration loop. At the end of each iteration,
the source has a guarantee that all pages from the dirty bitmap have
now been sent. With the ZEROCOPY feature we have entirely removed all
synchronization points from the source host wrt memory sending. At
best we know that the pages will have been sent if we get to close()
without seeing errors, unless we're going to explicitly track the
completion messages. The TCP re-transmit semantics are especially
worrying to me, because the re-transmit is liable to send different
data than was in the original lost TCP packet.

Maybe everything is still safe because TCP is a reliable ordered
stream, and thus eventually the dst will get the newest guest
page contents. I'm still very worried that we have code in the
source that implicitly relies on there being some synchronization
points that we've effectively removed.

> > AFAICT, the design added in this patch does not provide any way
> > to honour these requirements around buffer lifetime.
> >
> > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > way. The buffer lifetime requirements imply need for an API
> > design that is fundamentally different for asynchronous usage,
> > with a callback to notify when the write has finished/failed.
> 
> That is a good point.
> Proposing a new optional method like io_async_writev() + an error
> checking mechanism could do the job.
> io_async_writev() could fall-back to io_writev() in cases where it's not
> implemented.
> 
> I am not sure about the error checking yet.
> Options I can see are:
> 1 - A callback, as you suggested, which IIUC would be provided by
> code using the QIOChannel, and would only fix the reported errors,
> leaving the responsibility of checking for errors to the IOChannel code.

A callback is fairly simple, but potentially limits performance. Reading
the kernel docs it seems they intentionally merge notifications to enable
a whole contiguous set to be processed in one go. A callback would
effectively be unmerging them requiring processing one a time. Not
sure how much this matters to our usage.

> 2 - A new method, maybe io_async_errck(), which would be called
> whenever the using code wants to deal with pending errors. It could
> return an array/list of IOVs that need to be re-sent, for example,
> and code using QIOChannel could deal with it however it wants.

Considering that we're using TCP, we get a reliable, ordered data
stream. So there should never be a failure that requires us to
know specific IOVs to re-send - the kernel still handles TCP
re-transmit - we just have to keep the buffer available.
As noted above though, the re-transmit is liable to send different
data than the original transmit, which worries me.

So the only errors we should get from the completion notifications
would be errors that are completely fatal for the TCP stream. So
for errors, we only need to know whether an error has occurred or
not. The second reason for the completion notifications is to
provide synchronization, to know when the buffer can be released
or overwritten. That is the only scenario in which we need to know the
list of IOVs that completed. A sketch of draining those notifications
is below.
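
(Roughly the completion-draining loop the kernel msg_zerocopy documentation
describes, counting ee_info..ee_data ranges rather than individual IOVs;
error handling trimmed.)

#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

/* Read one batch of MSG_ZEROCOPY completions from the error queue and
 * account how many sendmsg() calls have been acknowledged. */
static int drain_zerocopy_completions(int fd, uint32_t *acked)
{
    char control[100];   /* same size as the kernel docs example */
    struct msghdr msg = {
        .msg_control = control,
        .msg_controllen = sizeof(control),
    };
    struct cmsghdr *cm;

    if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
        return errno == EAGAIN ? 0 : -1;  /* nothing pending vs. real error */
    }

    for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        struct sock_extended_err serr;

        memcpy(&serr, CMSG_DATA(cm), sizeof(serr));
        if (serr.ee_errno == 0 && serr.ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
            /* the range [ee_info, ee_data] of sends has completed */
            *acked += serr.ee_data - serr.ee_info + 1;
        }
    }
    return 0;
}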



> > > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> > >                           "Unable to write to socket");
> > >          return -1;
> > >      }
> > > +
> > > +    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
> > > +        sioc->errq_pending += niov;
> > > +        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
> > > +            qio_channel_socket_errq_proc(sioc, errp);
> > > +        }
> > > +    }
> >
> > This silently looses any errors set from upto the final
> > SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set.
> 
> You are right.
> 
> >
> > Also means if you have a series of writes without
> > MSG_ZEROCOPY, it'll delay checking any pending
> > errors.
> 
That's expected... if only sends without MSG_ZEROCOPY are happening,
it means the ones sent with zerocopy can wait. The problem would be
the case above.

Well it depends whether there are synchronization requirements in the
caller. For example, current migration code knows that at the end of
each iteration of sending pages, the data has all actually been
sent. With this new logic, we might have done several more iterations
of the migration loop before the data for the original iteration has
been sent. The writes() for the original iteration may well be sending
the data from the later iterations. Possibly this is ok, but it is
a clear difference in the data that will actually go out on the wire
with this code.

> > I would suggest checkig in close(), but as mentioned
> > earlier, I think the design is flawed because the caller
> > fundamentally needs to know about completion for every
> > single write they make in order to know when the buffer
> > can be released / reused.
> 
> Well, there could be a flush mechanism (maybe in io_sync_errck(),
> activated with a
> parameter flag, or on a different method if callback is preferred):
> In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> syscall after each packet sent, and this means the fd gets a signal after each
> sendmsg() happens, with error or not.
> 
> We could harness this with a poll() and a relatively high timeout:
> - We stop sending packets, and then call poll().
> - Whenever poll() returns 0, it means a timeout happened, and so it
> took too long
> without sendmsg() happening, meaning all the packets are sent.
> - If it returns anything else, we go back to fixing the errors found (re-send)
> 
> The problem may be defining the value of this timeout, but it could be
> called only
> when zerocopy is active.

Maybe we need to check completions at the end of each iteration of the
migration dirtypage loop ?


> > > +static void
> > > +qio_channel_socket_set_zerocopy(QIOChannel *ioc,
> > > +                                bool enabled)
> > > +{
> > > +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > > +    int v = enabled ? 1 : 0;
> > > +    int ret;
> > > +
> > > +    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > > +    if (ret >= 0) {
> > > +        sioc->zerocopy_enabled = true;
> > > +    }
> > > +}
> >
> > Surely we need to tell the caller wether this succeeed, otherwise
> > the later sendmsg() is going to fail with EINVAL on older kernels
> > where MSG_ZEROCOPY is not supported.
> 
> Yeah, that was the idea on sioc->zerocopy_enabled, but in fact it
> should have been
> something like 'sioc->zerocopy_enabled = (ret >= 0) && enabled ', this
> way it would
> reflect zerocopy only if it was enabled and the syscall worked, and not_zerocopy
> otherwise.
> 
> or something like this, if we want it to stick with zerocopy if
> setting it off fails.
> if (ret >= 0) {
>     sioc->zerocopy_enabled = enabled;
> }

Yes, that is a bug fix we need, but actually I was referring
to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY
from 'flags', if zerocopy_enabled is not set, to avoid EINVAL
from sendmsg.

> > > diff --git a/io/channel-tls.c b/io/channel-tls.c
> > > index 4ce890a538..bf44b0f7b0 100644
> > > --- a/io/channel-tls.c
> > > +++ b/io/channel-tls.c
> > > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
> > >      qio_channel_set_delay(tioc->master, enabled);
> > >  }
> > >
> > > +
> > > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
> > > +                                         bool enabled)
> > > +{
> > > +    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
> > > +
> > > +    qio_channel_set_zerocopy(tioc->master, enabled);
> > > +}
> >
> > This is going to be unsafe because gnutls will discard/reuse the
> > buffer for the ciphertext after every write(). I can't see a
> > way to safely enable MSG_ZEROCOPY when TLS is used.
> 
> Yeah, that makes sense.
> It would make more sense to implement KTLS, as IIRC it kind of does
> 'zerocopy', since it saves the encrypted data directly in kernel buffer.

I would hope it is zerocopy, in so much as the kernel can directly
use the userspace pages as the plaintext, and then the ciphertext
in kernel space can be sent directly without further copies.

What I'm not sure about is whether this means it also becomes an effectively
asynchronous API for transmitting data. i.e. does it have the same
constraints about needing to lock pages, and not modify data in the
pages? I've not investigated it in any detail, but I don't recall
that being mentioned, and if it were the case, it would be impossible
to enable that transparently in gnutls as it's a major semantic change.

> We could implement KTLS as io_async_writev() for Channel_TLS, and change this
> flag to async_enabled. If KTLS is not available, it would fall back to
> using gnuTLS on io_writev, just like it would happen in zerocopy.

If gnutls is going to support KTLS in a way we can use, I don't want to
have any of that duplicated in QEMU code. I want to be able to delegate
as much as possible to gnutls, at least so that QEMU isn't in the loop
for any crypto sensitive parts, as that complicates certification
of crypto.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-02  8:20       ` Daniel P. Berrangé
@ 2021-09-02  8:52         ` Leonardo Bras Soares Passos
  2021-09-02  9:20           ` Daniel P. Berrangé
  0 siblings, 1 reply; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02  8:52 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel, thanks for the feedback !
> >
> > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > >
> > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > flags for qio_channel_write*().
> > > >
> > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > other data being sent at the default copying approach.
> > > >
> > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > ---
> > > >  migration/multifd-zlib.c | 7 ++++---
> > > >  migration/multifd-zstd.c | 7 ++++---
> > > >  migration/multifd.c      | 9 ++++++---
> > > >  migration/multifd.h      | 3 ++-
> > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > >
> > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > >              }
> > > >
> > > >              if (used) {
> > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > +                                                          &local_err);
> > >
> > > I don't think it is valid to unconditionally enable this feature due to the
> > > resource usage implications
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > >    if the socket option was not set, the socket exceeds its optmem
> > >    limit or the user exceeds its ulimit on locked pages."
> >
> > You are correct, I unfortunately missed this part in the docs :(
> >
> > > The limit on locked pages is something that looks very likely to be
> > > exceeded unless you happen to be running a QEMU config that already
> > > implies locked memory (eg PCI assignment)
> >
> > Do you mean the limit an user has on locking memory?
>
> Yes, by default limit QEMU sees will be something very small.
>
> > If so, that makes sense. I remember I needed to set the upper limit of locked
> > memory for the user before using it, or adding a capability to qemu before.
> >
> > Maybe an option would be trying to mlock all guest memory before setting
> > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > back to not using zerocopy (following the idea of a new io_async_writev()
> > I told you in the previous mail).
>
> Currently ability to lock memory is something that has to be configured
> when QEMU starts, and it requires libvirt to grant suitable permissions
> to QEMU. Memory locking is generally undesirable because it prevents
> memory overcommit. Or rather if you are allowing memory overcommit, then
> allowing memory locking is a way to kill your entire host.

You mean it's gonna consume too much memory, or something else?

>
> I don't think we can unconditionally grant ability to lock arbitrary
> guest RAM at startup, just to satisfy a possible desire to use zerocopy
> migration later. Granting it at runtime feels questionable as you now
> need to track and predict how much locked memory you've allowed, and
> also have possible problems with revokation.

(I am really new to this, so please forgive me if I am asking dumb or
overly basic questions)

What does revocation mean in this context?
You give the process the ability to lock n MB of memory, and then you take it away?
Why would that happen? Is locked memory a limited resource?

>
> Possibly you could unconditionally grant ability to lock a small amount
> of guest RAM at startup, but how small can it be, while still making a
> useful difference to migration. It would imply we also need to be very
> careful with migration to avoid having too large an amount of outstanding
> zerocopy requests to exceed the limit.

Yeah, having to decide on a value that would be ok to lock is very
complex, given we can migrate with multifd, which can make this value grow
a lot. Unless we only allow a few of those fds to actually use zerocopy.

>
> IOW, the only clear place in which we can use zerocopy, is where we are
> already forced to accept the penalty of locked memory at startup. eg when
> the guest is using huge pages and no overcommit, or possibly when the guest
> is using PCI device assignment,

It would already be something, given that those scenarios are the ones with
the largest number of pages to migrate. But I understand we could give it a try
in other scenarios.

> though in the latter I can't remember if
> we allow entire of guest RAM to be locked or not.

If I recall correctly from a previous discussion, this was the case at least for
PCI passthrough.

>
> Overall the memory locking needs look like a significant constraint that
> will affect ability to use this feature.
>

I agree, there is a lot to take into account.
In any case, those constraints could be checked in the same function as
the setsockopt(), right?
(Setting up configs to improve the chance of zerocopy would probably only
happen at/before qemu startup, right?)

Best regards,
Leonardo



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-02  8:52         ` Leonardo Bras Soares Passos
@ 2021-09-02  9:20           ` Daniel P. Berrangé
  2021-09-02  9:49             ` Leonardo Bras Soares Passos
  2021-09-07 11:17             ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-02  9:20 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > Hello Daniel, thanks for the feedback !
> > >
> > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > >
> > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > flags for qio_channel_write*().
> > > > >
> > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > other data being sent at the default copying approach.
> > > > >
> > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > ---
> > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > >  migration/multifd.c      | 9 ++++++---
> > > > >  migration/multifd.h      | 3 ++-
> > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > >
> > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > >              }
> > > > >
> > > > >              if (used) {
> > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > +                                                          &local_err);
> > > >
> > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > resource usage implications
> > > >
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > >
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > >    if the socket option was not set, the socket exceeds its optmem
> > > >    limit or the user exceeds its ulimit on locked pages."
> > >
> > > You are correct, I unfortunately missed this part in the docs :(
> > >
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > >
> > > Do you mean the limit an user has on locking memory?
> >
> > Yes, by default limit QEMU sees will be something very small.
> >
> > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > memory for the user before using it, or adding a capability to qemu before.
> > >
> > > Maybe an option would be trying to mlock all guest memory before setting
> > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > I told you in the previous mail).
> >
> > Currently ability to lock memory is something that has to be configured
> > when QEMU starts, and it requires libvirt to grant suitable permissions
> > to QEMU. Memory locking is generally undesirable because it prevents
> > memory overcommit. Or rather if you are allowing memory overcommit, then
> > allowing memory locking is a way to kill your entire host.
> 
> You mean it's gonna consume too much memory, or something else?

Essentially yes. 

> > I don't think we can unconditionally grant ability to lock arbitrary
> > guest RAM at startup, just to satisfy a possible desire to use zerocopy
> > migration later. Granting it at runtime feels questionable as you now
> > need to track and predict how much locked memory you've allowed, and
> > also have possible problems with revokation.
> 
> (I am really new to this, so please forgive me if I am asking dumb or
> overly basic questions)
> 
> What does revokation means in this context?
> You give the process hability to lock n MB of memory, and then you take it?
> Why would that happen? Is Locked memory a limited resource?

Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
total.

So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
which exceeds physical RAM. Most of the time this may well be fine
as the VMs don't concurrently need their full RAM allocation, and
worst case they'll get pushed to swap as the kernel re-shares
memory in response to load. So perhaps each VM only needs 20 GB
resident at any time, but over time one VM can burst upto 32 GB
and then 12 GB of it get swapped out later when inactive.

But now consider if we allowed 2 of the VMs to lock memory for
purposes of migration. Those 2 VMs can now pin 64 GB of memory
in the worst case, leaving no free memory for the 3rd VM or
for the OS. This will likely take down the entire host, regardless
of swap availability.

IOW, if you are overcommitting RAM you have to be extremely
careful about allowing any VM to lock memory. If you do decide
to allow memory locking, you need to make sure that the worst
case locked memory amount still leaves enough unlocked memory
for the OS to be able to effectively manage the overcommit
load via swap.  We definitely can't grant memory locking to
VMs at startup in this scenario, and if we grant it at runtime,
we need to be able to revoke it again later.

These overcommit numbers are a bit more extreme than you'd
usually use in the real world, but they illustrate the general
problem. Also bear in mind that QEMU has memory overhead
beyond the guest RAM block, which varies over time, making
accounting quite hard. We have to also assume that QEMU
could have been compromised by a guest breakout, so we
can't assume that migration will play nice - we have to
assume the worst case possible, given the process ulimits.


> > Overall the memory locking needs look like a significant constraint that
> > will affect ability to use this feature.
> >
> 
> I Agree, there is a lot to take in account.
> In any way, those constraints could be checked at the same function as
> the setsockopt() right?

QEMU could possibly check its ulimits to see if it is possible, but it
would be safer to require a migration capability to be explicitly set
by the mgmt app to opt in to zerocopy.
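
For illustration, the kind of ulimit check I mean would be something
along these lines (just a sketch; the helper name is made up and there
is no such QEMU API today):

#include <sys/resource.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch only: does the current RLIMIT_MEMLOCK soft limit allow
 * pinning 'bytes' of guest RAM? Helper name is illustrative. */
static bool memlock_limit_allows(size_t bytes)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_MEMLOCK, &rlim) < 0) {
        return false;
    }
    return rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= bytes;
}

Even if that check passes, though, only the mgmt app can judge whether
locking that much memory is actually acceptable for the host as a whole.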

> (Setting up configs to improve the chance of zerocopy would probably only
> happen at/before qemu starting, right?)

Usually, but you can change ulimits on the fly for a process. I'm just
not sure of semantics if you reduce limits and existing usage exceeds
the reduced value.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-02  8:47       ` Daniel P. Berrangé
@ 2021-09-02  9:34         ` Leonardo Bras Soares Passos
  2021-09-02  9:49           ` Daniel P. Berrangé
  0 siblings, 1 reply; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02  9:34 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> > Hello Daniel, thank you for the feedback!
> >
> > Comments inline.
> >
> > On Tue, Aug 31, 2021 at 9:57 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > send calls. It does so by avoiding copying user data into kernel buffers.
> > > >
> > > > To make it work, three steps are needed:
> > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > 3 - Process the socket's error queue, dealing with any error
> > >
> > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > >
> > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > from a synchronous call to an asynchronous call.
> >
> > You are correct. I tried to adapt io_writev() interface to use MSG_ZEROCOPY in
> > a somehow synchronous way, but returning errp (and sometimes closing the
> > channel because of it) was a poor implementation.
> >
> > >
> > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > until an asynchronous completion notification has been received from
> > > the socket error queue. These notifications are not required to
> > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > to the buffer if a re-transmit is needed.
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "Page pinning also changes system call semantics. It temporarily
> > >    shares the buffer between process and network stack. Unlike with
> > >    copying, the process cannot immediately overwrite the buffer
> > >    after system call return without possibly modifying the data in
> > >    flight. Kernel integrity is not affected, but a buggy program
> > >    can possibly corrupt its own data stream."
> > >
> >
> > By the above piece of documentation, I get there is no problem in
> > overwriting the buffer, but a corrupt, or newer version of the memory may
> > be sent instead of the original one. I am pointing this out because there
> > are workloads like page migration that would not be impacted, given
> > once the buffer is changed, it will dirty the page and it will be re-sent.
>
> The idea of having the buffer overwritten is still seriously worrying
> me. I get the idea that in theory the "worst" that should happen is
> that we unexpectedly transmit newer guest memory contents. There are
> so many edge cases in migration code though that I'm worried there
> might be scenarios in which that is still going to cause problems.

I agree we should keep an eye on that; maybe we should get David Gilbert's
opinion on it.

>
> The current migration code has a synchronization point at the end of
> each iteration of the migration loop. At the end of each iteration,
> the source has a guarantee that all pages from the dirty bitmap have
> now been sent. With the ZEROCOPY feature we have entirely removed all
> synchronization points from the source host wrt memory sending. At
> best we know that the pages will have been sent if we get to close()
> without seeing errors, unless we're going to explicitly track the
> completion messages. The TCP re-transmit semantics are especially
> worrying to me, because the re-transmit is liable to send different
> data than was in the original lost TCP packet.
>
> Maybe everything is still safe because TCP is a reliable ordered
> stream, and thus eventually the dst will get the newest guest
> page contents. I'm still very worried that we have code in the
> source that implicitly relies on there being some synchronization
> points that we've effectively removed.
>
> > > AFAICT, the design added in this patch does not provide any way
> > > to honour these requirements around buffer lifetime.
> > >
> > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > way. The buffer lifetime requirements imply need for an API
> > > design that is fundamentally different for asynchronous usage,
> > > with a callback to notify when the write has finished/failed.
> >
> > That is a good point.
> > Proposing a new optional method like io_async_writev() + an error
> > checking mechanism could do the job.
> > io_async_writev() could fall-back to io_writev() in cases where it's not
> > implemented.
> >
> > I am not sure about the error checking yet.
> > Options I can see are:
> > 1 - A callback, as you suggested, which IIUC would be provided by
> > code using the QIOChannel, and would only fix the reported errors,
> > leaving the responsibility of checking for errors to the IOChannel code.
>
> A callback is fairly simple, but potentially limits performance. Reading
> the kernel docs it seems they intentionally merge notifications to enable
> a whole contiguous set to be processed in one go. A callback would
> effectively be unmerging them requiring processing one a time. Not
> sure how much this matters to our usage.
>
> > 2 - A new method, maybe io_async_errck(), which would be called
> > whenever the using code wants to deal with pending errors. It could
> > return an array/list of IOVs that need to be re-sent, for example,
> > and code using QIOChannel could deal with it however it wants.
>
> Considering that we're using TCP, we get a reliable, ordered data
> stream. So there should never be a failure that requires use to
> know specific IOVs to re-sent - the kernel still handles TCP
> re-transmit - we just have to still have the buffer available.
> As noted above though, the re-transmit is liable to send different
> data than the original transmit, which worries me.
>
> So the only errors we should get from the completion notifications
> would be errors that are completely fatal for the TCP stream. So
> for errors, we only need to know whether an error has ocurred or
> not. The second reason for the completion notifications is for
> providing a synchronization, to know when the buffer can be released
> or overwritten. That is the only scenario we need to know list of
> IOVs that completed.
>
>
>
> > > > @@ -571,6 +623,14 @@ static ssize_t qio_channel_socket_writev(QIOChannel *ioc,
> > > >                           "Unable to write to socket");
> > > >          return -1;
> > > >      }
> > > > +
> > > > +    if ((flags & MSG_ZEROCOPY) && sioc->zerocopy_enabled) {
> > > > +        sioc->errq_pending += niov;
> > > > +        if (sioc->errq_pending > SOCKET_ERRQ_THRESH) {
> > > > +            qio_channel_socket_errq_proc(sioc, errp);
> > > > +        }
> > > > +    }
> > >
> > > This silently looses any errors set from upto the final
> > > SOCKET_ERRQ_THRESH write() calls with MSG_ZEROCOPY set.
> >
> > You are right.
> >
> > >
> > > Also means if you have a series of writes without
> > > MSG_ZEROCOPY, it'll delay checking any pending
> > > errors.
> >
> > That's expected... if there are only happening sends without MSG_ZEROCOPY,
> > it means the ones sent with zerocopy can wait. The problem would be
> > the above case.
>
> Well it depends whether there are synchonization requirements in the
> caller. For example, current migration code knows that at the end of
> each iteration sending pages, that the data has all actally been
> sent. With this new logic, we might have done several more iterations
> of the migration loop before the data for the original iteration has
> been sent. The writes() for the original iteration may well be sending
> the data from the later iterations. Possibly this is ok, but it is
> a clear difference in the data that will actually go out on the wire
> with this code.
>
> > > I would suggest checkig in close(), but as mentioned
> > > earlier, I think the design is flawed because the caller
> > > fundamentally needs to know about completion for every
> > > single write they make in order to know when the buffer
> > > can be released / reused.
> >
> > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > activated with a
> > parameter flag, or on a different method if callback is preferred):
> > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> > syscall after each packet sent, and this means the fd gets a signal after each
> > sendmsg() happens, with error or not.
> >
> > We could harness this with a poll() and a relatively high timeout:
> > - We stop sending packets, and then call poll().
> > - Whenever poll() returns 0, it means a timeout happened, and so it
> > took too long
> > without sendmsg() happening, meaning all the packets are sent.
> > - If it returns anything else, we go back to fixing the errors found (re-send)
> >
> > The problem may be defining the value of this timeout, but it could be
> > called only
> > when zerocopy is active.
>
> Maybe we need to check completions at the end of each iteration of the
> migration dirtypage loop ?

Sorry, I am really new to this, and I still couldn't understand why we would
need to check at the end of each iteration, instead of doing a full check at the
end.

Is that the case of a fatal error on the TCP stream, in which a lot of packets
were reported as 'sent' but we may lose track of which were in fact sent?

Would the poll() approach above be enough for a 'flush()'?
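
Just to make that idea more concrete, here is a rough sketch based on the
example in the kernel msg_zerocopy docs (names are invented and most
error/origin checks are trimmed):

#include <poll.h>
#include <stdint.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

/* sketch: wait until every zerocopy notification up to 'sent' has been
 * read from the socket error queue, or give up when poll() times out */
static int zerocopy_flush(int fd, uint32_t sent, uint32_t *acked)
{
    struct pollfd pfd = { .fd = fd, .events = 0 };
    char control[128];

    while (*acked < sent) {
        struct msghdr msg = { 0 };
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
            /* error queue empty: wait for POLLERR, or time out */
            if (poll(&pfd, 1, 250 /* ms */) <= 0) {
                return -1;
            }
            continue;
        }

        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        struct sock_extended_err *serr =
            (struct sock_extended_err *)CMSG_DATA(cm);

        /* each notification covers a contiguous range of sendmsg() calls */
        *acked += serr->ee_data - serr->ee_info + 1;
    }
    return 0;
}

The timeout value is the part I'm unsure about, as mentioned above.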

>
>
> > > > +static void
> > > > +qio_channel_socket_set_zerocopy(QIOChannel *ioc,
> > > > +                                bool enabled)
> > > > +{
> > > > +    QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
> > > > +    int v = enabled ? 1 : 0;
> > > > +    int ret;
> > > > +
> > > > +    ret = qemu_setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v));
> > > > +    if (ret >= 0) {
> > > > +        sioc->zerocopy_enabled = true;
> > > > +    }
> > > > +}
> > >
> > > Surely we need to tell the caller wether this succeeed, otherwise
> > > the later sendmsg() is going to fail with EINVAL on older kernels
> > > where MSG_ZEROCOPY is not supported.
> >
> > Yeah, that was the idea on sioc->zerocopy_enabled, but in fact it
> > should have been
> > something like 'sioc->zerocopy_enabled = (ret >= 0) && enabled ', this
> > way it would
> > reflect zerocopy only if it was enabled and the syscall worked, and not_zerocopy
> > otherwise.
> >
> > or something like this, if we want it to stick with zerocopy if
> > setting it off fails.
> > if (ret >= 0) {
> >     sioc->zerocopy_enabled = enabled;
> > }
>
> Yes, that is a bug fix we need, but actually I was refering
> to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY
> from 'flags', if zerocopy_enabled is not set, to avoid EINVAL
> from sendmsg.

Agreed, something like io_writev(..., sioc->zerocopy_enabled ? MSG_ZEROCOPY : 0, errp)
should do, right?
(or an io_async_writev() that falls back to io_writev() if
zerocopy is disabled)
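
To make it concrete, this is roughly what I have in mind for both fixes
(just a sketch; the names and signatures are not meant to be final):

/* setter fix: report failure to the caller, and only flip the flag
 * when the setsockopt() actually worked */
static int qio_channel_socket_set_zerocopy(QIOChannelSocket *sioc,
                                           bool enabled, Error **errp)
{
    int v = enabled ? 1 : 0;

    if (setsockopt(sioc->fd, SOL_SOCKET, SO_ZEROCOPY, &v, sizeof(v)) < 0) {
        error_setg_errno(errp, errno, "Unable to set SO_ZEROCOPY");
        return -1;
    }
    sioc->zerocopy_enabled = enabled;
    return 0;
}

/* write side: never request MSG_ZEROCOPY on a socket that couldn't
 * enable it, so older kernels don't see EINVAL from sendmsg() */
static int qio_channel_socket_write_flags(QIOChannelSocket *sioc)
{
    return sioc->zerocopy_enabled ? MSG_ZEROCOPY : 0;
}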

>
> > > > diff --git a/io/channel-tls.c b/io/channel-tls.c
> > > > index 4ce890a538..bf44b0f7b0 100644
> > > > --- a/io/channel-tls.c
> > > > +++ b/io/channel-tls.c
> > > > @@ -350,6 +350,16 @@ static void qio_channel_tls_set_delay(QIOChannel *ioc,
> > > >      qio_channel_set_delay(tioc->master, enabled);
> > > >  }
> > > >
> > > > +
> > > > +static void qio_channel_tls_set_zerocopy(QIOChannel *ioc,
> > > > +                                         bool enabled)
> > > > +{
> > > > +    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(ioc);
> > > > +
> > > > +    qio_channel_set_zerocopy(tioc->master, enabled);
> > > > +}
> > >
> > > This is going to be unsafe because gnutls will discard/reuse the
> > > buffer for the ciphertext after every write(). I can't see a
> > > way to safely enable MSG_ZEROCOPY when TLS is used.
> >
> > Yeah, that makes sense.
> > It would make more sense to implement KTLS, as IIRC it kind of does
> > 'zerocopy', since it saves the encrypted data directly in kernel buffer.
>
> I would hope it is zerocopy, in so much as the kernel can directly
> use the userspace pages as the plaintext, and then the ciphertext
> in kernel space can be sent directly without further copies.
>
> What i'm not sure is whether this means it also becomes an effectively
> asynchronous API for transmitting data. ie does it have the same
> constraints about needing to lock pages, and not modify data in the
> pages? I've not investigated it in any detail, but I don't recall
> that being mentioned, and if it was the case, it would be impossible
> to enable that transparently in gnutls as its a major semantic changed.

I think it somehow waits for the user buffer to be encrypted into the kernel
buffer, and then returns, behaving like a normal sendmsg().
(except if it's using the kernel async crypto API, which I don't think is the case)

>
> > We could implement KTLS as io_async_writev() for Channel_TLS, and change this
> > flag to async_enabled. If KTLS is not available, it would fall back to
> > using gnuTLS on io_writev, just like it would happen in zerocopy.
>
> If gnutls is going to support KTLS in a way we can use, I dont want to
> have any of that duplicated in QEMU code. I want to be able delegate
> as much as possible to gnutls, at least so that QEMU isn't in the loop
> for any crypto sensitive parts, as that complicates certification
> of crypto.

Yeah, that's a very good argument :)
I think it probably makes sense to move from the current callback implementation
to an implementation in which we give gnuTLS the file descriptor.

(I was thinking of implementing a simpler direct gnuTLS version only
for the 'zerocopy' QIOChannel_TLS, but I think it would not make much sense.)



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-02  9:20           ` Daniel P. Berrangé
@ 2021-09-02  9:49             ` Leonardo Bras Soares Passos
  2021-09-02  9:59               ` Daniel P. Berrangé
  2021-09-07 11:17             ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02  9:49 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > Hello Daniel, thanks for the feedback !
> > > >
> > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > >
> > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > flags for qio_channel_write*().
> > > > > >
> > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > other data being sent at the default copying approach.
> > > > > >
> > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > ---
> > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > >  migration/multifd.h      | 3 ++-
> > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > >
> > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > >              }
> > > > > >
> > > > > >              if (used) {
> > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > +                                                          &local_err);
> > > > >
> > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > >    limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > >
> > > Yes, by default limit QEMU sees will be something very small.
> > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > memory for the user before using it, or adding a capability to qemu before.
> > > >
> > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > > I told you in the previous mail).
> > >
> > > Currently ability to lock memory is something that has to be configured
> > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > to QEMU. Memory locking is generally undesirable because it prevents
> > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > allowing memory locking is a way to kill your entire host.
> >
> > You mean it's gonna consume too much memory, or something else?
>
> Essentially yes.

Well, maybe we can check for available memory before doing that,
but maybe it's too much effort.

>
> > > I don't think we can unconditionally grant ability to lock arbitrary
> > > guest RAM at startup, just to satisfy a possible desire to use zerocopy
> > > migration later. Granting it at runtime feels questionable as you now
> > > need to track and predict how much locked memory you've allowed, and
> > > also have possible problems with revokation.
> >
> > (I am really new to this, so please forgive me if I am asking dumb or
> > overly basic questions)
> >
> > What does revokation means in this context?
> > You give the process hability to lock n MB of memory, and then you take it?
> > Why would that happen? Is Locked memory a limited resource?
>
> Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> total.
>
> So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> which exceeds physical RAM. Most of the time this may well be fine
> as the VMs don't concurrently need their full RAM allocation, and
> worst case they'll get pushed to swap as the kernel re-shares
> memory in respose to load. So perhaps each VM only needs 20 GB
> resident at any time, but over time one VM can burst upto 32 GB
> and then 12 GB of it get swapped out later when inactive.
>
> But now consider if we allowed 2 of the VMs to lock memory for
> purposes of migration. Those 2 VMs can now pin 64 GB of memory
> in the worst case, leaving no free memory for the 3rd VM or
> for the OS. This will likely take down the entire host, regardless
> of swap availability.
>
> IOW, if you are overcomitting RAM you have to be extremely
> careful about allowing any VM to lock memory. If you do decide
> to allow memory locking, you need to make sure that the worst
> case locked memory amount still leaves enough unlocked memory
> for the OS to be able to effectively manage the overcommit
> load via swap.  We definitely can't grant memory locking to
> VMs at startup in this scenario, and if we grant it at runtime,
> we need to be able to revoke it again later.
>
> These overcommit numbers are a bit more extreme that you'd
> usually do in real world, but it illustrates the genreal
> problem. Also bear in mind that QEMU has memory overhead
> beyond the guest RAM block, which varies over time, making
> accounting quite hard. We have to also assume that QEMU
> could have been compromised by a guest breakout, so we
> can't assume that migration will play nice - we have to
> assume the worst case possible, given the process ulimits.
>

Yeah, that makes sense. Thanks for this illustration and elucidation!

I assume there is no way of asking the OS to lock memory such that, if there is
no space available, it fails and rolls back the locking.

If there were, VM2 would fail the locking and we could still get
MSG_ZEROCOPY at least in VM1.

But thinking about it further: if both could lock the required memory, and little
was left available for VM3, it could experience a major slowdown, as would the
host OS. Yeah, that would be complex.

>
> > > Overall the memory locking needs look like a significant constraint that
> > > will affect ability to use this feature.
> > >
> >
> > I Agree, there is a lot to take in account.
> > In any way, those constraints could be checked at the same function as
> > the setsockopt() right?
>
> QEMU could possibly check its ulimits to see if it is possible, but it
> would be safer to require a migration capability to be explicitly set
> by the mgmt app to opt-in to zerocopy.

Yeah, that would make sense. Let the mgmt app check system resources
and choose where to request zerocopy. It is also in a position to know whether
that process is allowed to lock enough memory.

>
> > (Setting up configs to improve the chance of zerocopy would probably only
> > happen at/before qemu starting, right?)
>
> Usually, but you can change ulimits on the fly for a process. I'm just
> not sure of semantics if you reduce limits and existing usage exceeds
> the reduced value.

Yeah, that's a good question.

>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Best regards,
Leonardo Bras



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-02  9:34         ` Leonardo Bras Soares Passos
@ 2021-09-02  9:49           ` Daniel P. Berrangé
  2021-09-02 10:19             ` Leonardo Bras Soares Passos
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-02  9:49 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> >
> > > > I would suggest checkig in close(), but as mentioned
> > > > earlier, I think the design is flawed because the caller
> > > > fundamentally needs to know about completion for every
> > > > single write they make in order to know when the buffer
> > > > can be released / reused.
> > >
> > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > activated with a
> > > parameter flag, or on a different method if callback is preferred):
> > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> > > syscall after each packet sent, and this means the fd gets a signal after each
> > > sendmsg() happens, with error or not.
> > >
> > > We could harness this with a poll() and a relatively high timeout:
> > > - We stop sending packets, and then call poll().
> > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > took too long
> > > without sendmsg() happening, meaning all the packets are sent.
> > > - If it returns anything else, we go back to fixing the errors found (re-send)
> > >
> > > The problem may be defining the value of this timeout, but it could be
> > > called only
> > > when zerocopy is active.
> >
> > Maybe we need to check completions at the end of each iteration of the
> > migration dirtypage loop ?
> 
> Sorry, I am really new to this, and I still couldn't understand why would we
> need to check at the end of each iteration, instead of doing a full check at the
> end.

The end of each iteration is an implicit synchronization point in the
current migration code.

For example, we might do 2 iterations of migration pre-copy, and then
switch to post-copy mode. If the data from those 2 iterations hasn't
been sent at the point we switch to post-copy, that is a semantic
change from current behaviour. I don't know if that will have an
problematic effect on the migration process, or not. Checking the
async completions at the end of each iteration though, would ensure
the semantics similar to current semantics, reducing risk of unexpected
problems.


> > > or something like this, if we want it to stick with zerocopy if
> > > setting it off fails.
> > > if (ret >= 0) {
> > >     sioc->zerocopy_enabled = enabled;
> > > }
> >
> > Yes, that is a bug fix we need, but actually I was refering
> > to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY
> > from 'flags', if zerocopy_enabled is not set, to avoid EINVAL
> > from sendmsg.
> 
> Agree, something like io_writev(,, sioc->zerocopy_enabled ?
> MSG_ZEROCOPY : 0, errp)'
> should do, right?
> (or an io_async_writev(), that will fall_back to io_writev() if
> zerocopy is disabled)

Something like that - it depends on what API we end up deciding on

> > > We could implement KTLS as io_async_writev() for Channel_TLS, and change this
> > > flag to async_enabled. If KTLS is not available, it would fall back to
> > > using gnuTLS on io_writev, just like it would happen in zerocopy.
> >
> > If gnutls is going to support KTLS in a way we can use, I dont want to
> > have any of that duplicated in QEMU code. I want to be able delegate
> > as much as possible to gnutls, at least so that QEMU isn't in the loop
> > for any crypto sensitive parts, as that complicates certification
> > of crypto.
> 
> Yeah, that's a very good argument :)
> I think it's probably the case to move from the current callback implementation
> to the implementation in which we give gnuTLS the file descriptor.

That would introduce a coupling between the two QIOChannel impls
though, which has been avoided so far, because we didn't want an assumption
that a QIOChannel == a FD.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-02  9:49             ` Leonardo Bras Soares Passos
@ 2021-09-02  9:59               ` Daniel P. Berrangé
  2021-09-02 10:25                 ` Leonardo Bras Soares Passos
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-02  9:59 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Thu, Sep 02, 2021 at 06:49:06AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > Hello Daniel, thanks for the feedback !
> > > > >
> > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > > >
> > > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > > flags for qio_channel_write*().
> > > > > > >
> > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > > other data being sent at the default copying approach.
> > > > > > >
> > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > > ---
> > > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > > >  migration/multifd.h      | 3 ++-
> > > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > >
> > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > > >              }
> > > > > > >
> > > > > > >              if (used) {
> > > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > > +                                                          &local_err);
> > > > > >
> > > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > > resource usage implications
> > > > > >
> > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > >
> > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > >
> > > > > You are correct, I unfortunately missed this part in the docs :(
> > > > >
> > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > implies locked memory (eg PCI assignment)
> > > > >
> > > > > Do you mean the limit an user has on locking memory?
> > > >
> > > > Yes, by default limit QEMU sees will be something very small.
> > > >
> > > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > > memory for the user before using it, or adding a capability to qemu before.
> > > > >
> > > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > > > I told you in the previous mail).
> > > >
> > > > Currently ability to lock memory is something that has to be configured
> > > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > > to QEMU. Memory locking is generally undesirable because it prevents
> > > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > > allowing memory locking is a way to kill your entire host.
> > >
> > > You mean it's gonna consume too much memory, or something else?
> >
> > Essentially yes.
> 
> Well, maybe we can check for available memory before doing that,
> but maybe it's too much effort.

From a mgmt app POV, we assume QEMU is untrustworthy, so the mgmt
app needs to enforce policy based on the worst case that the
VM configuration allows for.

Checking currently available memory is not viable, because even
if you see 10 GB free, QEMU can't know if that free memory is
there to satisfy other VMs' worst case needs, or its own. QEMU
needs to be explicitly told whether it's OK to use locked memory,
and have an enforcement policy applied to it.


> > Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> > overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> > total.
> >
> > So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> > which exceeds physical RAM. Most of the time this may well be fine
> > as the VMs don't concurrently need their full RAM allocation, and
> > worst case they'll get pushed to swap as the kernel re-shares
> > memory in respose to load. So perhaps each VM only needs 20 GB
> > resident at any time, but over time one VM can burst upto 32 GB
> > and then 12 GB of it get swapped out later when inactive.
> >
> > But now consider if we allowed 2 of the VMs to lock memory for
> > purposes of migration. Those 2 VMs can now pin 64 GB of memory
> > in the worst case, leaving no free memory for the 3rd VM or
> > for the OS. This will likely take down the entire host, regardless
> > of swap availability.
> >
> > IOW, if you are overcomitting RAM you have to be extremely
> > careful about allowing any VM to lock memory. If you do decide
> > to allow memory locking, you need to make sure that the worst
> > case locked memory amount still leaves enough unlocked memory
> > for the OS to be able to effectively manage the overcommit
> > load via swap.  We definitely can't grant memory locking to
> > VMs at startup in this scenario, and if we grant it at runtime,
> > we need to be able to revoke it again later.
> >
> > These overcommit numbers are a bit more extreme that you'd
> > usually do in real world, but it illustrates the genreal
> > problem. Also bear in mind that QEMU has memory overhead
> > beyond the guest RAM block, which varies over time, making
> > accounting quite hard. We have to also assume that QEMU
> > could have been compromised by a guest breakout, so we
> > can't assume that migration will play nice - we have to
> > assume the worst case possible, given the process ulimits.
> >
> 
> Yeah, that makes sense. Thanks for this illustration and elucidation !
> 
> I assume there is no way of asking the OS to lock memory, and if there is
> no space available, it fails and rolls back the locking.

Yes & no. On most Linux configs though it ends up being no, because
instead of getting a nice failure, when host memory pressure occurs,
the OOM Killer wakes up and just culls processes.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-02  9:49           ` Daniel P. Berrangé
@ 2021-09-02 10:19             ` Leonardo Bras Soares Passos
  2021-09-02 10:28               ` Daniel P. Berrangé
  0 siblings, 1 reply; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02 10:19 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> > >
> > > > > I would suggest checkig in close(), but as mentioned
> > > > > earlier, I think the design is flawed because the caller
> > > > > fundamentally needs to know about completion for every
> > > > > single write they make in order to know when the buffer
> > > > > can be released / reused.
> > > >
> > > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > > activated with a
> > > > parameter flag, or on a different method if callback is preferred):
> > > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> > > > syscall after each packet sent, and this means the fd gets a signal after each
> > > > sendmsg() happens, with error or not.
> > > >
> > > > We could harness this with a poll() and a relatively high timeout:
> > > > - We stop sending packets, and then call poll().
> > > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > > took too long
> > > > without sendmsg() happening, meaning all the packets are sent.
> > > > - If it returns anything else, we go back to fixing the errors found (re-send)
> > > >
> > > > The problem may be defining the value of this timeout, but it could be
> > > > called only
> > > > when zerocopy is active.
> > >
> > > Maybe we need to check completions at the end of each iteration of the
> > > migration dirtypage loop ?
> >
> > Sorry, I am really new to this, and I still couldn't understand why would we
> > need to check at the end of each iteration, instead of doing a full check at the
> > end.
>
> The end of each iteration is an implicit synchronization point in the
> current migration code.
>
> For example, we might do 2 iterations of migration pre-copy, and then
> switch to post-copy mode. If the data from those 2 iterations hasn't
> been sent at the point we switch to post-copy, that is a semantic
> change from current behaviour. I don't know if that will have an
> problematic effect on the migration process, or not. Checking the
> async completions at the end of each iteration though, would ensure
> the semantics similar to current semantics, reducing risk of unexpected
> problems.
>

What if we do the 'flush()' before we start post-copy, instead of after each
iteration? Would that be enough?
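
i.e. something like this placement, in pseudo-code (names invented):

/* pseudo-code sketch: keep queueing pages with MSG_ZEROCOPY during
 * pre-copy, and drain the completions only once, right before the
 * switch to post-copy */
while (more_dirty_pages_to_send()) {
    send_dirty_pages_one_iteration();
}
flush_zerocopy_completions();
switch_to_postcopy();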

>
> > > > or something like this, if we want it to stick with zerocopy if
> > > > setting it off fails.
> > > > if (ret >= 0) {
> > > >     sioc->zerocopy_enabled = enabled;
> > > > }
> > >
> > > Yes, that is a bug fix we need, but actually I was refering
> > > to the later sendmsg() call. Surely we need to clear MSG_ZEROCOPY
> > > from 'flags', if zerocopy_enabled is not set, to avoid EINVAL
> > > from sendmsg.
> >
> > Agree, something like io_writev(,, sioc->zerocopy_enabled ?
> > MSG_ZEROCOPY : 0, errp)'
> > should do, right?
> > (or an io_async_writev(), that will fall_back to io_writev() if
> > zerocopy is disabled)
>
> Something like that - depends what API we end up deciding on

Great :)

>
> > > > We could implement KTLS as io_async_writev() for Channel_TLS, and change this
> > > > flag to async_enabled. If KTLS is not available, it would fall back to
> > > > using gnuTLS on io_writev, just like it would happen in zerocopy.
> > >
> > > If gnutls is going to support KTLS in a way we can use, I dont want to
> > > have any of that duplicated in QEMU code. I want to be able delegate
> > > as much as possible to gnutls, at least so that QEMU isn't in the loop
> > > for any crypto sensitive parts, as that complicates certification
> > > of crypto.
> >
> > Yeah, that's a very good argument :)
> > I think it's probably the case to move from the current callback implementation
> > to the implementation in which we give gnuTLS the file descriptor.
>
> That would introduce a coupling  between the two QIOChannel impls
> though, which is avoided so far, because we didn't want an assuption
> that a QIOChannel == a FD.

Interesting, so QIOChannelTLS would necessarily need to have a QIOChannelSocket,
is that right?

Maybe we could add a QIOChannelKTLS, which would use gnuTLS but get to
have its own fd, though that would probably cause a lot of duplicated code.
On the other hand, that way QIOChannelTLS would still be able to
accept any other channel, if desired.

Anyway, it's a complicated decision :)

> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Thank you, I am learning a lot today :)

Best regards,
Leonardo



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-02  9:59               ` Daniel P. Berrangé
@ 2021-09-02 10:25                 ` Leonardo Bras Soares Passos
  0 siblings, 0 replies; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-02 10:25 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Thu, Sep 2, 2021 at 6:59 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 06:49:06AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 6:20 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > > > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > > Hello Daniel, thanks for the feedback !
> > > > > >
> > > > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > > > >
> > > > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > > > flags for qio_channel_write*().
> > > > > > > >
> > > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > > > other data being sent at the default copying approach.
> > > > > > > >
> > > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > > > ---
> > > > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > > > >  migration/multifd.h      | 3 ++-
> > > > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > > >
> > > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > > > >              }
> > > > > > > >
> > > > > > > >              if (used) {
> > > > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > > > +                                                          &local_err);
> > > > > > >
> > > > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > > > resource usage implications
> > > > > > >
> > > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > > >
> > > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > > >
> > > > > > You are correct, I unfortunately missed this part in the docs :(
> > > > > >
> > > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > > implies locked memory (eg PCI assignment)
> > > > > >
> > > > > > Do you mean the limit an user has on locking memory?
> > > > >
> > > > > Yes, by default limit QEMU sees will be something very small.
> > > > >
> > > > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > > > memory for the user before using it, or adding a capability to qemu before.
> > > > > >
> > > > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > > > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > > > > I told you in the previous mail).
> > > > >
> > > > > Currently ability to lock memory is something that has to be configured
> > > > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > > > to QEMU. Memory locking is generally undesirable because it prevents
> > > > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > > > allowing memory locking is a way to kill your entire host.
> > > >
> > > > You mean it's gonna consume too much memory, or something else?
> > >
> > > Essentially yes.
> >
> > Well, maybe we can check for available memory before doing that,
> > but maybe it's too much effort.
>
> From a mgmt app POV, we assume QEMU is untrustworthy, so the mgmt
> app needs to enforce policy based on the worst case that the
> VM configuration allows for.
>
> Checking current available memory is not viable, because even
> if you see 10 GB free, QEMU can't know if that free memory is
> there to satisfy other VMs's worst case needs, or its own. QEMU
> needs to be explicitly told whether its OK to use locked memory,
> and an enforcement policy applied to it.


Yeah, it makes sense to let the mgmt app deal with that and enable/disable
MSG_ZEROCOPY on migration whenever it sees fit.


>
>
> > > Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> > > overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> > > total.
> > >
> > > So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> > > which exceeds physical RAM. Most of the time this may well be fine
> > > as the VMs don't concurrently need their full RAM allocation, and
> > > worst case they'll get pushed to swap as the kernel re-shares
> > > memory in respose to load. So perhaps each VM only needs 20 GB
> > > resident at any time, but over time one VM can burst upto 32 GB
> > > and then 12 GB of it get swapped out later when inactive.
> > >
> > > But now consider if we allowed 2 of the VMs to lock memory for
> > > purposes of migration. Those 2 VMs can now pin 64 GB of memory
> > > in the worst case, leaving no free memory for the 3rd VM or
> > > for the OS. This will likely take down the entire host, regardless
> > > of swap availability.
> > >
> > > IOW, if you are overcomitting RAM you have to be extremely
> > > careful about allowing any VM to lock memory. If you do decide
> > > to allow memory locking, you need to make sure that the worst
> > > case locked memory amount still leaves enough unlocked memory
> > > for the OS to be able to effectively manage the overcommit
> > > load via swap.  We definitely can't grant memory locking to
> > > VMs at startup in this scenario, and if we grant it at runtime,
> > > we need to be able to revoke it again later.
> > >
> > > These overcommit numbers are a bit more extreme that you'd
> > > usually do in real world, but it illustrates the genreal
> > > problem. Also bear in mind that QEMU has memory overhead
> > > beyond the guest RAM block, which varies over time, making
> > > accounting quite hard. We have to also assume that QEMU
> > > could have been compromised by a guest breakout, so we
> > > can't assume that migration will play nice - we have to
> > > assume the worst case possible, given the process ulimits.
> > >
> >
> > Yeah, that makes sense. Thanks for this illustration and elucidation !
> >
> > I assume there is no way of asking the OS to lock memory, and if there is
> > no space available, it fails and rolls back the locking.
>
> Yes & no. On most Linux configs though it ends up being no, because
> instead of getting a nice failure, when host memory pressure occurs,
> the OOM Killer wakes up and just culls processes.

Oh, right, the OOM Killer :)

>
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>

Thanks!

Best regards,
Leonardo



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-02 10:19             ` Leonardo Bras Soares Passos
@ 2021-09-02 10:28               ` Daniel P. Berrangé
  2021-09-07 11:06                 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-02 10:28 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, Peter Xu, qemu-devel,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Thu, Sep 02, 2021 at 07:19:58AM -0300, Leonardo Bras Soares Passos wrote:
> On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote:
> > > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >
> > > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> > > >
> > > > > > I would suggest checkig in close(), but as mentioned
> > > > > > earlier, I think the design is flawed because the caller
> > > > > > fundamentally needs to know about completion for every
> > > > > > single write they make in order to know when the buffer
> > > > > > can be released / reused.
> > > > >
> > > > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > > > activated with a
> > > > > parameter flag, or on a different method if callback is preferred):
> > > > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> > > > > syscall after each packet sent, and this means the fd gets a signal after each
> > > > > sendmsg() happens, with error or not.
> > > > >
> > > > > We could harness this with a poll() and a relatively high timeout:
> > > > > - We stop sending packets, and then call poll().
> > > > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > > > took too long
> > > > > without sendmsg() happening, meaning all the packets are sent.
> > > > > - If it returns anything else, we go back to fixing the errors found (re-send)
> > > > >
> > > > > The problem may be defining the value of this timeout, but it could be
> > > > > called only
> > > > > when zerocopy is active.
> > > >
> > > > Maybe we need to check completions at the end of each iteration of the
> > > > migration dirtypage loop ?
> > >
> > > Sorry, I am really new to this, and I still couldn't understand why would we
> > > need to check at the end of each iteration, instead of doing a full check at the
> > > end.
> >
> > The end of each iteration is an implicit synchronization point in the
> > current migration code.
> >
> > For example, we might do 2 iterations of migration pre-copy, and then
> > switch to post-copy mode. If the data from those 2 iterations hasn't
> > been sent at the point we switch to post-copy, that is a semantic
> > change from current behaviour. I don't know if that will have an
> > problematic effect on the migration process, or not. Checking the
> > async completions at the end of each iteration though, would ensure
> > the semantics similar to current semantics, reducing risk of unexpected
> > problems.
> >
> 
> What if we do the 'flush()' before we start post-copy, instead of after each
> iteration? would that be enough?

Possibly, yes. This really needs David G's input since he understands
the code in way more detail than I do.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-02 10:28               ` Daniel P. Berrangé
@ 2021-09-07 11:06                 ` Dr. David Alan Gilbert
  2021-09-07 18:09                   ` Peter Xu
  2021-09-08 20:25                   ` Leonardo Bras Soares Passos
  0 siblings, 2 replies; 62+ messages in thread
From: Dr. David Alan Gilbert @ 2021-09-07 11:06 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Peter Xu, Leonardo Bras Soares Passos,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Thu, Sep 02, 2021 at 07:19:58AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 6:50 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 06:34:01AM -0300, Leonardo Bras Soares Passos wrote:
> > > > On Thu, Sep 2, 2021 at 5:47 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Thu, Sep 02, 2021 at 03:38:11AM -0300, Leonardo Bras Soares Passos wrote:
> > > > >
> > > > > > > I would suggest checkig in close(), but as mentioned
> > > > > > > earlier, I think the design is flawed because the caller
> > > > > > > fundamentally needs to know about completion for every
> > > > > > > single write they make in order to know when the buffer
> > > > > > > can be released / reused.
> > > > > >
> > > > > > Well, there could be a flush mechanism (maybe in io_sync_errck(),
> > > > > > activated with a
> > > > > > parameter flag, or on a different method if callback is preferred):
> > > > > > In the MSG_ZEROCOPY docs, we can see that the example includes using a poll()
> > > > > > syscall after each packet sent, and this means the fd gets a signal after each
> > > > > > sendmsg() happens, with error or not.
> > > > > >
> > > > > > We could harness this with a poll() and a relatively high timeout:
> > > > > > - We stop sending packets, and then call poll().
> > > > > > - Whenever poll() returns 0, it means a timeout happened, and so it
> > > > > > took too long
> > > > > > without sendmsg() happening, meaning all the packets are sent.
> > > > > > - If it returns anything else, we go back to fixing the errors found (re-send)
> > > > > >
> > > > > > The problem may be defining the value of this timeout, but it could be
> > > > > > called only
> > > > > > when zerocopy is active.
> > > > >
> > > > > Maybe we need to check completions at the end of each iteration of the
> > > > > migration dirtypage loop ?
> > > >
> > > > Sorry, I am really new to this, and I still couldn't understand why would we
> > > > need to check at the end of each iteration, instead of doing a full check at the
> > > > end.
> > >
> > > The end of each iteration is an implicit synchronization point in the
> > > current migration code.
> > >
> > > For example, we might do 2 iterations of migration pre-copy, and then
> > > switch to post-copy mode. If the data from those 2 iterations hasn't
> > > been sent at the point we switch to post-copy, that is a semantic
> > > change from current behaviour. I don't know if that will have an
> > > problematic effect on the migration process, or not. Checking the
> > > async completions at the end of each iteration though, would ensure
> > > the semantics similar to current semantics, reducing risk of unexpected
> > > problems.
> > >
> > 
> > What if we do the 'flush()' before we start post-copy, instead of after each
> > iteration? would that be enough?
> 
> Possibly, yes. This really need David G's input since he understands
> the code in way more detail than me.

Hmm I'm not entirely sure why we have the sync after each iteration;
the case I can think of is that if we're doing async sending, we could have
two versions of the same page in flight (one from each iteration) -
you'd want those to get there in the right order.

Dave

> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-01 15:44           ` Daniel P. Berrangé
  2021-09-01 16:01             ` Peter Xu
  2021-09-02  7:57             ` Leonardo Bras Soares Passos
@ 2021-09-07 11:13             ` Dr. David Alan Gilbert
  2021-09-08 15:26               ` Daniel P. Berrangé
  2 siblings, 1 reply; 62+ messages in thread
From: Dr. David Alan Gilbert @ 2021-09-07 11:13 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Jason Wang, qemu-devel, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Wed, Sep 01, 2021 at 11:35:33AM -0400, Peter Xu wrote:
> > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> > > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > > 
> > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > flags for qio_channel_write*().
> > > > > > 
> > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > other data being sent at the default copying approach.
> > > > > > 
> > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > ---
> > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > >  migration/multifd.h      | 3 ++-
> > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > 
> > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > >              }
> > > > > >  
> > > > > >              if (used) {
> > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > +                                                          &local_err);
> > > > > 
> > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > resource usage implications
> > > > > 
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > 
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> > > > >    if the socket option was not set, the socket exceeds its optmem 
> > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > > 
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > > 
> > > > Yes it would be great to be a migration capability in parallel to multifd. At
> > > > initial phase if it's easy to be implemented on multi-fd only, we can add a
> > > > dependency between the caps.  In the future we can remove that dependency when
> > > > the code is ready to go without multifd.  Thanks,
> > > 
> > > Also, I'm wondering how zerocopy support interacts with kernel support
> > > for kTLS and multipath-TCP, both of which we want to be able to use
> > > with migration.
> > 
> > Copying Jason Wang for net implications between these features on kernel side
> > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).
> > 
> > From the safe side we may want to only enable one of them until we prove
> > they'll work together I guess..
> 
> MPTCP is good when we're network limited for migration
> 
> KTLS will be good when we're CPU limited on AES for migration,
> which is essentially always when TLS is used.
> 
> ZEROCOPY will be good when we're CPU limited for data copy
> on migration, or to reduce the impact on other concurrent
> VMs on the same CPUs.
> 
> Ultimately we woudld benefit from all of them at the same
> time, if it were technically possible todo.

I think last time I spoke to Paolo Abeni there were some interactions
between them; I can't remember what though (I think mptcp and ktls
didn't play well together at the time).

Dave

> > Not a immediate concern as I don't really think any of them is really
> > explicitly supported in qemu.
> 
> QEMU has mptcp support already:
> 
>   commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7
>   Author: Dr. David Alan Gilbert <dgilbert@redhat.com>
>   Date:   Wed Apr 21 12:28:34 2021 +0100
> 
>     sockets: Support multipath TCP
>     
>     Multipath TCP allows combining multiple interfaces/routes into a single
>     socket, with very little work for the user/admin.
>     
>     It's enabled by 'mptcp' on most socket addresses:
>     
>        ./qemu-system-x86_64 -nographic -incoming tcp:0:4444,mptcp
> 
> > KTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> > ZEROCOPY mutual exclusive anyway because at least the userspace TLS code of
> > gnutls won't has a way to maintain the tls buffers used by zerocopy.  So at
> > least we need some knob to detect whether kTLS is enabled in gnutls.
> 
> It isn't possible for gnutls to transparently enable KTLS, because
> GNUTLS doesn't get to see the actual socket directly - it'll need
> some work in QEMU to enable it.  We know MPTCP and KTLS are currently
> mutually exclusive as they both use the same kernel network hooks
> framework.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-02  9:20           ` Daniel P. Berrangé
  2021-09-02  9:49             ` Leonardo Bras Soares Passos
@ 2021-09-07 11:17             ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 62+ messages in thread
From: Dr. David Alan Gilbert @ 2021-09-07 11:17 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Peter Xu, Leonardo Bras Soares Passos,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Thu, Sep 02, 2021 at 05:52:15AM -0300, Leonardo Bras Soares Passos wrote:
> > On Thu, Sep 2, 2021 at 5:21 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > Hello Daniel, thanks for the feedback !
> > > >
> > > > On Tue, Aug 31, 2021 at 10:17 AM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >
> > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > >
> > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > flags for qio_channel_write*().
> > > > > >
> > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > other data being sent at the default copying approach.
> > > > > >
> > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > ---
> > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > >  migration/multifd.h      | 3 ++-
> > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > >
> > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > >              }
> > > > > >
> > > > > >              if (used) {
> > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > +                                                          &local_err);
> > > > >
> > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > >    limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > >
> > > Yes, by default limit QEMU sees will be something very small.
> > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > memory for the user before using it, or adding a capability to qemu before.
> > > >
> > > > Maybe an option would be trying to mlock all guest memory before setting
> > > > zerocopy=on in qemu code. If it fails, we can print an error message and fall
> > > > back to not using zerocopy (following the idea of a new io_async_writev()
> > > > I told you in the previous mail).
> > >
> > > Currently ability to lock memory is something that has to be configured
> > > when QEMU starts, and it requires libvirt to grant suitable permissions
> > > to QEMU. Memory locking is generally undesirable because it prevents
> > > memory overcommit. Or rather if you are allowing memory overcommit, then
> > > allowing memory locking is a way to kill your entire host.
> > 
> > You mean it's gonna consume too much memory, or something else?
> 
> Essentially yes. 
> 
> > > I don't think we can unconditionally grant ability to lock arbitrary
> > > guest RAM at startup, just to satisfy a possible desire to use zerocopy
> > > migration later. Granting it at runtime feels questionable as you now
> > > need to track and predict how much locked memory you've allowed, and
> > > also have possible problems with revokation.
> > 
> > (I am really new to this, so please forgive me if I am asking dumb or
> > overly basic questions)
> > 
> > What does revokation means in this context?
> > You give the process hability to lock n MB of memory, and then you take it?
> > Why would that happen? Is Locked memory a limited resource?
> 
> Consider a VM host with 64 GB of RAM and 64 GB of swap, and an
> overcommit ratio of 1.5. ie we'll run VMs with 64*1.5 GB of RAM
> total.
> 
> So we can run 3 VMs each with 32 GB of RAM, giving 96 GB of usage
> which exceeds physical RAM. Most of the time this may well be fine
> as the VMs don't concurrently need their full RAM allocation, and
> worst case they'll get pushed to swap as the kernel re-shares
> memory in respose to load. So perhaps each VM only needs 20 GB
> resident at any time, but over time one VM can burst upto 32 GB
> and then 12 GB of it get swapped out later when inactive.
> 
> But now consider if we allowed 2 of the VMs to lock memory for
> purposes of migration. Those 2 VMs can now pin 64 GB of memory
> in the worst case, leaving no free memory for the 3rd VM or
> for the OS. This will likely take down the entire host, regardless
> of swap availability.
> 
> IOW, if you are overcomitting RAM you have to be extremely
> careful about allowing any VM to lock memory. If you do decide
> to allow memory locking, you need to make sure that the worst
> case locked memory amount still leaves enough unlocked memory
> for the OS to be able to effectively manage the overcommit
> load via swap.  We definitely can't grant memory locking to
> VMs at startup in this scenario, and if we grant it at runtime,
> we need to be able to revoke it again later.
> 
> These overcommit numbers are a bit more extreme that you'd 
> usually do in real world, but it illustrates the genreal
> problem. Also bear in mind that QEMU has memory overhead
> beyond the guest RAM block, which varies over time, making
> accounting quite hard. We have to also assume that QEMU
> could have been compromised by a guest breakout, so we
> can't assume that migration will play nice - we have to
> assume the worst case possible, given the process ulimits.

We already have the same problem for RDMA (although it has some options for
doing rolling locking in chunks, and recently there's code for use with new
cards that don't need locking).

I think the thing to do here is to offer zerocopy as an option; then people
can decide on the costs etc.

Dave

> 
> > > Overall the memory locking needs look like a significant constraint that
> > > will affect ability to use this feature.
> > >
> > 
> > I Agree, there is a lot to take in account.
> > In any way, those constraints could be checked at the same function as
> > the setsockopt() right?
> 
> QEMU could possibly check its ulimits to see if it is possible, but it
> would be safer to require a migration capability to be explicitly set
> by the mgmt app to opt-in to zerocopy.
> 
> > (Setting up configs to improve the chance of zerocopy would probably only
> > happen at/before qemu starting, right?)
> 
> Usually, but you can change ulimits on the fly for a process. I'm just
> not sure of semantics if you reduce limits and existing usage exceeds
> the reduced value.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-02  6:59       ` Leonardo Bras Soares Passos
@ 2021-09-07 16:44         ` Peter Xu
  2021-09-08 20:13           ` Leonardo Bras Soares Passos
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Xu @ 2021-09-07 16:44 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote:
> I also suggested something like that, but I thought it could be good if we could
> fall back to io_writev() if we didn't have the zerocopy feature (or
> the async feature).
> 
> What do you think?

That fallback looks safe and ok, I'm just not sure whether it'll be of great
help.  E.g. if we provide a QIO API that allows both sync writes and zero-copy
writes (and then we do the fallback when necessary), it means the buffer
implication applies to this API too, so it's easier to me to just detect the
zero copy capability and use one alternative.  Thanks,
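
For illustration only (not the code proposed in this series), a minimal sketch
of how that capability detection could look at the socket level; the function
name is made up, and it assumes a libc/kernel exposing SO_ZEROCOPY
(Linux >= 4.14):

#include <stdbool.h>
#include <sys/socket.h>

/*
 * Try to enable SO_ZEROCOPY on the socket.  On kernels or socket types
 * without MSG_ZEROCOPY support the setsockopt() fails (the exact errno
 * varies), so the caller can quietly stick to plain copying writes.
 */
static bool socket_try_enable_zerocopy(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0;
}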

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-07 11:06                 ` Dr. David Alan Gilbert
@ 2021-09-07 18:09                   ` Peter Xu
  2021-09-08  8:30                     ` Dr. David Alan Gilbert
  2021-09-08 20:25                   ` Leonardo Bras Soares Passos
  1 sibling, 1 reply; 62+ messages in thread
From: Peter Xu @ 2021-09-07 18:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Elena Ufimtseva, John G Johnson, Daniel P. Berrangé,
	qemu-block, Juan Quintela, qemu-devel,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Jagannathan Raman

On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > What if we do the 'flush()' before we start post-copy, instead of after each
> > > iteration? would that be enough?
> > 
> > Possibly, yes. This really need David G's input since he understands
> > the code in way more detail than me.
> 
> Hmm I'm not entirely sure why we have the sync after each iteration;

We don't have that yet, do we?

> the case I can think of is if we're doing async sending, we could have
> two versions of the same page in flight (one from each iteration) -
> you'd want those to get there in the right order.

From the MSG_ZEROCOPY document:

        For protocols that acknowledge data in-order, like TCP, each
        notification can be squashed into the previous one, so that no more
        than one notification is outstanding at any one point.

        Ordered delivery is the common case, but not guaranteed. Notifications
        may arrive out of order on retransmission and socket teardown.

So no matter whether we had a flush() per iteration before, it seems we may
want one when zero copy is enabled?
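
For reference, a minimal sketch (not QEMU code, function name made up) of what
reading those notifications looks like, following the kernel msg_zerocopy
documentation; error handling is trimmed and the printf() is only for
illustration:

#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

/* Drain pending MSG_ZEROCOPY completion notifications from one socket. */
static int drain_zerocopy_completions(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = 0 };   /* POLLERR is always reported */

    if (poll(&pfd, 1, timeout_ms) <= 0 || !(pfd.revents & POLLERR)) {
        return 0;                            /* nothing pending, or timed out */
    }

    for (;;) {
        char control[128];
        struct msghdr msg = {
            .msg_control = control,
            .msg_controllen = sizeof(control),
        };

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
            return errno == EAGAIN ? 0 : -1; /* EAGAIN means the queue is drained */
        }

        for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm;
             cm = CMSG_NXTHDR(&msg, cm)) {
            struct sock_extended_err *serr;

            if (!((cm->cmsg_level == SOL_IP && cm->cmsg_type == IP_RECVERR) ||
                  (cm->cmsg_level == SOL_IPV6 && cm->cmsg_type == IPV6_RECVERR))) {
                continue;
            }
            serr = (struct sock_extended_err *)CMSG_DATA(cm);
            if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
                continue;                    /* not a zerocopy notification */
            }
            /* sendmsg(MSG_ZEROCOPY) calls numbered [ee_info, ee_data] completed */
            printf("sends %u..%u done%s\n", serr->ee_info, serr->ee_data,
                   (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED) ?
                   " (kernel fell back to copying)" : "");
        }
    }
}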

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-02  7:22     ` Leonardo Bras Soares Passos
  2021-09-02  8:20       ` Daniel P. Berrangé
@ 2021-09-07 18:32       ` Peter Xu
  2021-09-08  2:59         ` Jason Wang
  1 sibling, 1 reply; 62+ messages in thread
From: Peter Xu @ 2021-09-07 18:32 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Jason Wang, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > I don't think it is valid to unconditionally enable this feature due to the
> > resource usage implications
> >
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >
> >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> >    if the socket option was not set, the socket exceeds its optmem
> >    limit or the user exceeds its ulimit on locked pages."
> 
> You are correct, I unfortunately missed this part in the docs :(
> 
> > The limit on locked pages is something that looks very likely to be
> > exceeded unless you happen to be running a QEMU config that already
> > implies locked memory (eg PCI assignment)
> 
> Do you mean the limit an user has on locking memory?
> 
> If so, that makes sense. I remember I needed to set the upper limit of locked
> memory for the user before using it, or adding a capability to qemu before.

So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.

The thing is IIUC that's accounting for pinned pages only with either mlock()
(FOLL_MLOCK) or vfio (FOLL_PIN).

I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
__zerocopy_sg_from_iter() -> iov_iter_get_pages().

Say, I think there are plenty of iov_iter_get_pages() callers and normal GUPs;
I think most of them do not need such accounting.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-07 18:32       ` Peter Xu
@ 2021-09-08  2:59         ` Jason Wang
  2021-09-08  3:24           ` Peter Xu
  2021-09-08  8:19           ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 62+ messages in thread
From: Jason Wang @ 2021-09-08  2:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Daniel P. Berrangé,
	qemu-block, Juan Quintela, Dr. David Alan Gilbert, qemu-devel,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Jagannathan Raman

On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > I don't think it is valid to unconditionally enable this feature due to the
> > > resource usage implications
> > >
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > >
> > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > >    if the socket option was not set, the socket exceeds its optmem
> > >    limit or the user exceeds its ulimit on locked pages."
> >
> > You are correct, I unfortunately missed this part in the docs :(
> >
> > > The limit on locked pages is something that looks very likely to be
> > > exceeded unless you happen to be running a QEMU config that already
> > > implies locked memory (eg PCI assignment)
> >
> > Do you mean the limit an user has on locking memory?
> >
> > If so, that makes sense. I remember I needed to set the upper limit of locked
> > memory for the user before using it, or adding a capability to qemu before.
>
> So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
>
> The thing is IIUC that's accounting for pinned pages only with either mlock()
> (FOLL_MLOCK) or vfio (FOLL_PIN).
>
> I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> __zerocopy_sg_from_iter() -> iov_iter_get_pages().

It probably happens here:

E.g.

__ip_append_data()
    msg_zerocopy_realloc()
        mm_account_pinned_pages()

Thanks

>
> Say, I think there're plenty of iov_iter_get_pages() callers and normal GUPs, I
> think most of them do no need such accountings.
>
> --
> Peter Xu
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-08  2:59         ` Jason Wang
@ 2021-09-08  3:24           ` Peter Xu
  2021-09-08  3:26             ` Jason Wang
  2021-09-08  8:19           ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 62+ messages in thread
From: Peter Xu @ 2021-09-08  3:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, John G Johnson, Daniel P. Berrangé,
	qemu-block, Juan Quintela, Dr. David Alan Gilbert, qemu-devel,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Jagannathan Raman

On Wed, Sep 08, 2021 at 10:59:57AM +0800, Jason Wang wrote:
> On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > resource usage implications
> > > >
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > >
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > >    if the socket option was not set, the socket exceeds its optmem
> > > >    limit or the user exceeds its ulimit on locked pages."
> > >
> > > You are correct, I unfortunately missed this part in the docs :(
> > >
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > >
> > > Do you mean the limit an user has on locking memory?
> > >
> > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > memory for the user before using it, or adding a capability to qemu before.
> >
> > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> >
> > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > (FOLL_MLOCK) or vfio (FOLL_PIN).
> >
> > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> 
> It happens probably here:
> 
> E.g
> 
> __ip_append_data()
>     msg_zerocopy_realloc()
>         mm_account_pinned_pages()

Right. :)

But my previous question is more about the reason behind it - I thought that's
a common GUP and it shouldn't rely on locked_vm, because it should still be a
temporary GUP rather than a long-term GUP; IMHO that's how we use locked_vm (we
don't account for temporary GUPs, only long-term ones). IOW, I'm wondering
whether all the rest of the iov_iter_get_pages() callers should check locked_vm
too, and AFAIU they're not doing that right now.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-08  3:24           ` Peter Xu
@ 2021-09-08  3:26             ` Jason Wang
  0 siblings, 0 replies; 62+ messages in thread
From: Jason Wang @ 2021-09-08  3:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Daniel P. Berrangé,
	qemu-block, Juan Quintela, Dr. David Alan Gilbert, qemu-devel,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Jagannathan Raman

On Wed, Sep 8, 2021 at 11:24 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 08, 2021 at 10:59:57AM +0800, Jason Wang wrote:
> > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > >    limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > memory for the user before using it, or adding a capability to qemu before.
> > >
> > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> > >
> > > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > > (FOLL_MLOCK) or vfio (FOLL_PIN).
> > >
> > > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> >
> > It happens probably here:
> >
> > E.g
> >
> > __ip_append_data()
> >     msg_zerocopy_realloc()
> >         mm_account_pinned_pages()
>
> Right. :)
>
> But my previous question is more about the reason behind it - I thought that's
> a common GUP and it shouldn't rely on locked_vm because it should still be
> temporary GUP rather than a long time GUP, IMHO that's how we use locked_vm (we
> don't account for temp GUPs but only longterm ones). IOW, I'm wondering whether
> all the rest of iov_iter_get_pages() callers should check locked_vm too, and
> AFAIU they're not doing that right now..

You are probably right, but I'm not sure if it's too late to fix that
(e.g. it would break existing userspace).

Thanks

>
> Thanks,
>
> --
> Peter Xu
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-08  2:59         ` Jason Wang
  2021-09-08  3:24           ` Peter Xu
@ 2021-09-08  8:19           ` Dr. David Alan Gilbert
  2021-09-08 15:19             ` Peter Xu
  1 sibling, 1 reply; 62+ messages in thread
From: Dr. David Alan Gilbert @ 2021-09-08  8:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, John G Johnson, Daniel P. Berrangé,
	qemu-block, Juan Quintela, qemu-devel, Peter Xu,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Jagannathan Raman

* Jason Wang (jasowang@redhat.com) wrote:
> On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > resource usage implications
> > > >
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > >
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > >    if the socket option was not set, the socket exceeds its optmem
> > > >    limit or the user exceeds its ulimit on locked pages."
> > >
> > > You are correct, I unfortunately missed this part in the docs :(
> > >
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > >
> > > Do you mean the limit an user has on locking memory?
> > >
> > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > memory for the user before using it, or adding a capability to qemu before.
> >
> > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> >
> > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > (FOLL_MLOCK) or vfio (FOLL_PIN).
> >
> > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> 
> It happens probably here:
> 
> E.g
> 
> __ip_append_data()
>     msg_zerocopy_realloc()
>         mm_account_pinned_pages()

When do they get uncounted?  i.e. is it just the data that's in flight
that's marked as pinned?

Dave

> Thanks
> 
> >
> > Say, I think there're plenty of iov_iter_get_pages() callers and normal GUPs, I
> > think most of them do no need such accountings.
> >
> > --
> > Peter Xu
> >
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-07 18:09                   ` Peter Xu
@ 2021-09-08  8:30                     ` Dr. David Alan Gilbert
  2021-09-08 15:24                       ` Peter Xu
  0 siblings, 1 reply; 62+ messages in thread
From: Dr. David Alan Gilbert @ 2021-09-08  8:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Daniel P. Berrangé,
	qemu-block, Juan Quintela, qemu-devel,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Jagannathan Raman

* Peter Xu (peterx@redhat.com) wrote:
> On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > > What if we do the 'flush()' before we start post-copy, instead of after each
> > > > iteration? would that be enough?
> > > 
> > > Possibly, yes. This really need David G's input since he understands
> > > the code in way more detail than me.
> > 
> > Hmm I'm not entirely sure why we have the sync after each iteration;
> 
> We don't have that yet, do we?

I think multifd does; I think it calls multifd_send_sync_main sometimes,
which I *think* corresponds to iterations.

Dave

> > the case I can think of is if we're doing async sending, we could have
> > two versions of the same page in flight (one from each iteration) -
> > you'd want those to get there in the right order.
> 
> From MSG_ZEROCOPY document:
> 
>         For protocols that acknowledge data in-order, like TCP, each
>         notification can be squashed into the previous one, so that no more
>         than one notification is outstanding at any one point.
> 
>         Ordered delivery is the common case, but not guaranteed. Notifications
>         may arrive out of order on retransmission and socket teardown.
> 
> So no matter whether we have a flush() per iteration before, it seems we may
> want one when zero copy enabled?
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-08  8:19           ` Dr. David Alan Gilbert
@ 2021-09-08 15:19             ` Peter Xu
  2021-09-09  1:10               ` Jason Wang
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Xu @ 2021-09-08 15:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Elena Ufimtseva, John G Johnson, Daniel P. Berrangé,
	qemu-block, Juan Quintela, Jason Wang, qemu-devel,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Jagannathan Raman

On Wed, Sep 08, 2021 at 09:19:27AM +0100, Dr. David Alan Gilbert wrote:
> * Jason Wang (jasowang@redhat.com) wrote:
> > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > >    limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > memory for the user before using it, or adding a capability to qemu before.
> > >
> > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> > >
> > > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > > (FOLL_MLOCK) or vfio (FOLL_PIN).
> > >
> > > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> > 
> > It happens probably here:
> > 
> > E.g
> > 
> > __ip_append_data()
> >     msg_zerocopy_realloc()
> >         mm_account_pinned_pages()
> 
> When do they get uncounted?  i.e. is it just the data that's in flight
> that's marked as pinned?

I think so - there's __msg_zerocopy_callback() -> mm_unaccount_pinned_pages()
correspondingly.  Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-08  8:30                     ` Dr. David Alan Gilbert
@ 2021-09-08 15:24                       ` Peter Xu
  2021-09-09  8:49                         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Xu @ 2021-09-08 15:24 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Elena Ufimtseva, John G Johnson, Daniel P. Berrangé,
	qemu-block, Juan Quintela, qemu-devel,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Jagannathan Raman

On Wed, Sep 08, 2021 at 09:30:58AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > > > What if we do the 'flush()' before we start post-copy, instead of after each
> > > > > iteration? would that be enough?
> > > > 
> > > > Possibly, yes. This really need David G's input since he understands
> > > > the code in way more detail than me.
> > > 
> > > Hmm I'm not entirely sure why we have the sync after each iteration;
> > 
> > We don't have that yet, do we?
> 
> I think multifd does; I think it calls multifd_send_sync_main sometimes,
> which I *think* corresponds to iterations.

Oh.. Then we may need to:

  (1) Make that sync work for zero-copy too; say, semaphores may not be enough
      with it - we need to make sure all async buffers are consumed by checking
      the error queue of the sockets (a rough sketch of that follows below),

  (2) When we make zero-copy work without multi-fd, we may want some similar
      sync on the main stream too
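
A very rough sketch of what such a sync point could look like for one zerocopy
socket - the zerocopy_highest_completed() helper is hypothetical (it would
parse the error-queue notifications, as in the earlier sketch, and report the
highest completed send counter), and the kernel numbers MSG_ZEROCOPY sends per
socket starting from zero:

#include <poll.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed helper, not a real API: drain the socket error queue and report
 * the highest completed zerocopy send id seen so far. */
extern bool zerocopy_highest_completed(int fd, uint32_t *highest_done);

/*
 * Block until all 'sends_issued' zerocopy sends on this socket have been
 * acknowledged, i.e. the queued buffers may be reused or modified again.
 * Simplified: ignores out-of-order notifications on retransmit/teardown.
 */
static int zerocopy_sync_point(int fd, uint32_t sends_issued, int timeout_ms)
{
    uint32_t highest_done;

    if (sends_issued == 0) {
        return 0;
    }

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = 0 };  /* wait for POLLERR */

        if (zerocopy_highest_completed(fd, &highest_done) &&
            highest_done >= sends_issued - 1) {
            return 0;                  /* every queued buffer is free again */
        }
        if (poll(&pfd, 1, timeout_ms) <= 0) {
            return -1;                 /* timed out with sends still pending */
        }
    }
}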

Thanks,

> 
> Dave
> 
> > > the case I can think of is if we're doing async sending, we could have
> > > two versions of the same page in flight (one from each iteration) -
> > > you'd want those to get there in the right order.
> > 
> > From MSG_ZEROCOPY document:
> > 
> >         For protocols that acknowledge data in-order, like TCP, each
> >         notification can be squashed into the previous one, so that no more
> >         than one notification is outstanding at any one point.
> > 
> >         Ordered delivery is the common case, but not guaranteed. Notifications
> >         may arrive out of order on retransmission and socket teardown.
> > 
> > So no matter whether we have a flush() per iteration before, it seems we may
> > want one when zero copy enabled?
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
> > 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-07 11:13             ` Dr. David Alan Gilbert
@ 2021-09-08 15:26               ` Daniel P. Berrangé
  0 siblings, 0 replies; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-08 15:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Jason Wang, qemu-devel, Peter Xu, Leonardo Bras,
	Paolo Bonzini, Marc-André Lureau, Fam Zheng

On Tue, Sep 07, 2021 at 12:13:28PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Wed, Sep 01, 2021 at 11:35:33AM -0400, Peter Xu wrote:
> > > On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> > > > On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> > > > > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > > > > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > > > > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > > > > > > 
> > > > > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > > > > flags for qio_channel_write*().
> > > > > > > 
> > > > > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > > > > other data being sent at the default copying approach.
> > > > > > > 
> > > > > > > Signed-off-by: Leonardo Bras <leobras@redhat.com>
> > > > > > > ---
> > > > > > >  migration/multifd-zlib.c | 7 ++++---
> > > > > > >  migration/multifd-zstd.c | 7 ++++---
> > > > > > >  migration/multifd.c      | 9 ++++++---
> > > > > > >  migration/multifd.h      | 3 ++-
> > > > > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > > > > 
> > > > > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > > > > >              }
> > > > > > >  
> > > > > > >              if (used) {
> > > > > > > -                ret = multifd_send_state->ops->send_write(p, used, &local_err);
> > > > > > > +                ret = multifd_send_state->ops->send_write(p, used, MSG_ZEROCOPY,
> > > > > > > +                                                          &local_err);
> > > > > > 
> > > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > > resource usage implications
> > > > > > 
> > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > > 
> > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> > > > > >    if the socket option was not set, the socket exceeds its optmem 
> > > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > > > 
> > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > implies locked memory (eg PCI assignment)
> > > > > 
> > > > > Yes it would be great to be a migration capability in parallel to multifd. At
> > > > > initial phase if it's easy to be implemented on multi-fd only, we can add a
> > > > > dependency between the caps.  In the future we can remove that dependency when
> > > > > the code is ready to go without multifd.  Thanks,
> > > > 
> > > > Also, I'm wondering how zerocopy support interacts with kernel support
> > > > for kTLS and multipath-TCP, both of which we want to be able to use
> > > > with migration.
> > > 
> > > Copying Jason Wang for net implications between these features on kernel side
> > > and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).
> > > 
> > > From the safe side we may want to only enable one of them until we prove
> > > they'll work together I guess..
> > 
> > MPTCP is good when we're network limited for migration
> > 
> > KTLS will be good when we're CPU limited on AES for migration,
> > which is essentially always when TLS is used.
> > 
> > ZEROCOPY will be good when we're CPU limited for data copy
> > on migration, or to reduce the impact on other concurrent
> > VMs on the same CPUs.
> > 
> > Ultimately we woudld benefit from all of them at the same
> > time, if it were technically possible todo.
> 
> I think last time I spoke to Paolo Abeni there were some interactions
> between them; I can't remember what though (I think mptcp and ktls
> didn't play at the time).

MPTCP and KTLS use the same kernel hook in the network layer and
only 1 hook is permitted at a time, making them mutually exclusive
for now. In theory this can be fixed but no one is actively working
on it yet.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-07 16:44         ` Peter Xu
@ 2021-09-08 20:13           ` Leonardo Bras Soares Passos
  2021-09-08 21:04             ` Peter Xu
  0 siblings, 1 reply; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-08 20:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

On Tue, Sep 7, 2021 at 1:44 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote:
> > I also suggested something like that, but I thought it could be good if we could
> > fall back to io_writev() if we didn't have the zerocopy feature (or
> > the async feature).
> >
> > What do you think?
>
> That fallback looks safe and ok, I'm just not sure whether it'll be of great
> help.  E.g. if we provide an QIO api that allows both sync write and zero-copy
> write (then we do the fallback when necessary), it means the buffer implication
> applies too to this api, so it's easier to me to just detect the zero copy
> capability and use one alternative.  Thanks,
>
> --
> Peter Xu
>

I was thinking that this way (async method with fallback) we would allow code
using the QIO API to just try io_async_writev() whenever it seems fit, let the
QIO channel layer decide whether it can actually be used (implementation
provided, features available), and just fall back to io_writev() when it can't
be used.

Best regards,
Leo



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-07 11:06                 ` Dr. David Alan Gilbert
  2021-09-07 18:09                   ` Peter Xu
@ 2021-09-08 20:25                   ` Leonardo Bras Soares Passos
  2021-09-08 21:09                     ` Peter Xu
  1 sibling, 1 reply; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-08 20:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Daniel P. Berrangé,
	Peter Xu, qemu-devel, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> > Possibly, yes. This really need David G's input since he understands
> > the code in way more detail than me.
>
> Hmm I'm not entirely sure why we have the sync after each iteration;
> the case I can think of is if we're doing async sending, we could have
> two versions of the same page in flight (one from each iteration) -
> you'd want those to get there in the right order.
>
> Dave

Well, that's the thing: since we don't copy the buffer with MSG_ZEROCOPY, we
will in fact have the same page in flight twice, rather than two versions,
given that the buffer is sent as-is at transmission time.

To me it looks like there will be no change to the current algorithm, but I am
still a beginner in migration code, and I am probably missing something.

Best regards,
Leo



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-08 20:13           ` Leonardo Bras Soares Passos
@ 2021-09-08 21:04             ` Peter Xu
  0 siblings, 0 replies; 62+ messages in thread
From: Peter Xu @ 2021-09-08 21:04 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

On Wed, Sep 08, 2021 at 05:13:40PM -0300, Leonardo Bras Soares Passos wrote:
> On Tue, Sep 7, 2021 at 1:44 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote:
> > > I also suggested something like that, but I thought it could be good if we could
> > > fall back to io_writev() if we didn't have the zerocopy feature (or
> > > the async feature).
> > >
> > > What do you think?
> >
> > That fallback looks safe and ok, I'm just not sure whether it'll be of great
> > help.  E.g. if we provide an QIO api that allows both sync write and zero-copy
> > write (then we do the fallback when necessary), it means the buffer implication
> > applies too to this api, so it's easier to me to just detect the zero copy
> > capability and use one alternative.  Thanks,
> >
> > --
> > Peter Xu
> >
> 
> I was thinking this way (async method with fallback) we would allow code using
> the QIO api to just try to use the io_async_writev() whenever the code
> seems fit, and
> let the QIO channel layer to decide when it can be used
> (implementation provides,
> features available), and just fallback to io_writev() when it can't be used.

Sure, I think it depends on whether the whole sync/async interface will be the
same; e.g., if the async API needs some callback registered then the interface
won't be the same, and the caller will be aware of that anyway.  Otherwise it
looks fine indeed.  Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-08 20:25                   ` Leonardo Bras Soares Passos
@ 2021-09-08 21:09                     ` Peter Xu
  2021-09-08 21:57                       ` Daniel P. Berrangé
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Xu @ 2021-09-08 21:09 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

On Wed, Sep 08, 2021 at 05:25:50PM -0300, Leonardo Bras Soares Passos wrote:
> On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > > Possibly, yes. This really need David G's input since he understands
> > > the code in way more detail than me.
> >
> > Hmm I'm not entirely sure why we have the sync after each iteration;
> > the case I can think of is if we're doing async sending, we could have
> > two versions of the same page in flight (one from each iteration) -
> > you'd want those to get there in the right order.
> >
> > Dave
> 
> Well, that's the thing: as we don't copy the buffer in MSG_ZEROCOPY,  we will in
> fact have the same page in flight twice, instead of two versions,
> given the buffer is
> sent as it is during transmission.

That's an interesting point, which looks even valid... :)

There can still be two versions depending on when the page is read and fed to
the NICs, as the page can be changing during that window, but as long as the
later-sent page always lands later than the earlier one on the dest node then
it looks ok.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-08 21:09                     ` Peter Xu
@ 2021-09-08 21:57                       ` Daniel P. Berrangé
  2021-09-09  2:05                         ` Peter Xu
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel P. Berrangé @ 2021-09-08 21:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng

On Wed, Sep 08, 2021 at 05:09:33PM -0400, Peter Xu wrote:
> On Wed, Sep 08, 2021 at 05:25:50PM -0300, Leonardo Bras Soares Passos wrote:
> > On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > > > Possibly, yes. This really need David G's input since he understands
> > > > the code in way more detail than me.
> > >
> > > Hmm I'm not entirely sure why we have the sync after each iteration;
> > > the case I can think of is if we're doing async sending, we could have
> > > two versions of the same page in flight (one from each iteration) -
> > > you'd want those to get there in the right order.
> > >
> > > Dave
> > 
> > Well, that's the thing: as we don't copy the buffer in MSG_ZEROCOPY,  we will in
> > fact have the same page in flight twice, instead of two versions,
> > given the buffer is
> > sent as it is during transmission.
> 
> That's an interesting point, which looks even valid... :)
> 
> There can still be two versions depending on when the page is read and feed to
> the NICs as the page can be changing during the window, but as long as the
> latter sent page always lands later than the former page on dest node then it
> looks ok.

The really strange scenario is around TCP re-transmits. The kernel may
transmit page version 1, then we have version 2 written. The kernel now
needs to retransmit a packet due to network loss. The re-transmission will
contain version 2 payload which differs from the originally lost packet's
payload.

IOW, the supposed "reliable" TCP stream is no longer reliable and actually
changes its data retrospectively because we've intentionally violated the
rules the kernel documented for use of MSG_ZEROCOPY.

We think we're probably ok with migration, as we are going to rely on the
fact that we eventually pause the guest to stop page changes during the
final switchover. None the less I really strongly dislike the idea of
not honouring the kernel API contract, despite the potential performance
benefits it brings.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy
  2021-09-08 15:19             ` Peter Xu
@ 2021-09-09  1:10               ` Jason Wang
  0 siblings, 0 replies; 62+ messages in thread
From: Jason Wang @ 2021-09-09  1:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Daniel P. Berrangé,
	qemu-block, Juan Quintela, Dr. David Alan Gilbert, qemu-devel,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Jagannathan Raman

On Wed, Sep 8, 2021 at 11:19 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 08, 2021 at 09:19:27AM +0100, Dr. David Alan Gilbert wrote:
> > * Jason Wang (jasowang@redhat.com) wrote:
> > > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > > > I don't think it is valid to unconditionally enable this feature due to the
> > > > > > resource usage implications
> > > > > >
> > > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > > >
> > > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > > >    if the socket option was not set, the socket exceeds its optmem
> > > > > >    limit or the user exceeds its ulimit on locked pages."
> > > > >
> > > > > You are correct, I unfortunately missed this part in the docs :(
> > > > >
> > > > > > The limit on locked pages is something that looks very likely to be
> > > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > > implies locked memory (eg PCI assignment)
> > > > >
> > > > > Do you mean the limit an user has on locking memory?
> > > > >
> > > > > If so, that makes sense. I remember I needed to set the upper limit of locked
> > > > > memory for the user before using it, or adding a capability to qemu before.
> > > >
> > > > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> > > >
> > > > The thing is IIUC that's accounting for pinned pages only with either mlock()
> > > > (FOLL_MLOCK) or vfio (FOLL_PIN).
> > > >
> > > > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > > > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> > >
> > > It happens probably here:
> > >
> > > E.g
> > >
> > > __ip_append_data()
> > >     msg_zerocopy_realloc()
> > >         mm_account_pinned_pages()
> >
> > When do they get uncounted?  i.e. is it just the data that's in flight
> > that's marked as pinned?
>
> I think so - there's __msg_zerocopy_callback() -> mm_unaccount_pinned_pages()
> correspondingly.  Thanks,

Yes, and actually the memory that can be pinned is limited by the sndbuf of
the TCP socket. So we are fine with the rlimit (e.g. we don't need to pin all
guest pages).
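
As a rough illustration of that bound (an assumption-heavy sketch, not QEMU
code; the function name is made up): compare the socket send buffer, which
roughly caps how much one socket can keep pinned in flight, against the
process's RLIMIT_MEMLOCK:

#include <stdio.h>
#include <sys/resource.h>
#include <sys/socket.h>

static void report_zerocopy_pin_headroom(int fd)
{
    struct rlimit rl;
    int sndbuf = 0;
    socklen_t len = sizeof(sndbuf);

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0 ||
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) != 0) {
        return;
    }

    if (rl.rlim_cur == RLIM_INFINITY) {
        printf("locked-memory limit: unlimited, sndbuf: %d bytes\n", sndbuf);
        return;
    }

    printf("locked-memory limit: %llu bytes, sndbuf: %d bytes\n",
           (unsigned long long)rl.rlim_cur, sndbuf);
    if ((unsigned long long)sndbuf > (unsigned long long)rl.rlim_cur) {
        printf("zerocopy sends could hit ENOBUFS under this memlock limit\n");
    }
}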

Thanks

>
> --
> Peter Xu
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-08 21:57                       ` Daniel P. Berrangé
@ 2021-09-09  2:05                         ` Peter Xu
  2021-09-09  4:58                           ` Leonardo Bras Soares Passos
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Xu @ 2021-09-09  2:05 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng

On Wed, Sep 08, 2021 at 10:57:06PM +0100, Daniel P. Berrangé wrote:
> We think we're probably ok with migration as we are going to rely on the
> face that we eventually pause the guest to stop page changes during the
> final switchover. None the less I really strongly dislike the idea of
> not honouring the kernel API contract, despite the potential performance
> benefits it brings.

Yes understandable, and it does looks tricky. But it's guest page and it's just
by nature how it works to me (sending page happening in parallel with page
being modified).

I think the MSG_ZEROCOPY doc page mentioned that and it's userspace program's
own choice if that happens. So even if it's not by design and not suggested, it
seems not forbidden either.

We can wr-protect the pages (using things like userfaultfd-wp) during sending
and unprotect them when it's done via a qio callback; that'll guarantee the
buffers don't change during sending, but then we gain nothing besides the "API
cleanliness".

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-09  2:05                         ` Peter Xu
@ 2021-09-09  4:58                           ` Leonardo Bras Soares Passos
  2021-09-09 16:40                             ` Peter Xu
  0 siblings, 1 reply; 62+ messages in thread
From: Leonardo Bras Soares Passos @ 2021-09-09  4:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

On Wed, Sep 8, 2021 at 11:05 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Sep 08, 2021 at 10:57:06PM +0100, Daniel P. Berrangé wrote:
> > We think we're probably ok with migration as we are going to rely on the
> > fact that we eventually pause the guest to stop page changes during the
> > final switchover. None the less I really strongly dislike the idea of
> > not honouring the kernel API contract, despite the potential performance
> > benefits it brings.
>
> Yes, understandable, and it does look tricky.  But it's guest pages, and to
> me it's just by nature how it works (sending a page happens in parallel with
> the page being modified).
>
> I think the MSG_ZEROCOPY doc page mentions that, and it's the userspace
> program's own choice if that happens.  So even if it's not by design and not
> suggested, it doesn't seem forbidden either.
>
> We can wr-protect the pages (using things like userfaultfd-wp) during sending
> and unprotect them when it's done via a qio callback; that'll guarantee the
> buffers don't change during sending, but then we gain nothing besides the "API
> cleanliness".
>
> Thanks,
>
> --
> Peter Xu
>

I can't stress enough how inexperienced I am in the QEMU codebase, but based
on the current discussion, and in my humble opinion, it would make sense if
something like an async_writev + async_flush method (or a callback instead)
were offered by QIOChannel, leaving the responsibility for "which flags to
use", "when to lock", and "how / when to flush" to the code using it.

I mean, it's clear how the sync requirements will be very different depending
on what the calling code is sending with it, so it makes sense to me that we
offer the tool and let the caller decide how it should be used (and possibly
deal with any consequences).

Is there anything that could go wrong with the above that I am missing?

About the proposed callback, I am not sure how that would work efficiently.
Could someone help me with that?

FWIW, what I had in mind for a (theoretical) migration setup with
io_async_writev() + io_async_flush() (a rough sketch follows below):
- For guest RAM we can decide not to rw_lock memory on zerocopy,
  because there is no need,
- For headers, we can decide not to use async (use io_writev() instead),
- flush() can happen at each iteration of migration, or every N
  seconds, or at the end.
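
To make the idea concrete, here is a rough interface sketch.  The async
functions, the flag, and the MultiFDPages type are placeholders of mine
(only qio_channel_writev_all() exists today), and the usage flow is just
what I imagine it could look like:

    /* Queue iov for sending.  With the (hypothetical) zerocopy flag the
     * caller must keep the buffers mapped and unchanged until the next
     * qio_channel_async_flush() on this channel returns. */
    ssize_t qio_channel_async_writev(QIOChannel *ioc,
                                     const struct iovec *iov,
                                     size_t niov,
                                     int flags,
                                     Error **errp);

    /* Block until every zerocopy write queued so far has been completed
     * by the kernel (e.g. by draining the socket's error queue). */
    int qio_channel_async_flush(QIOChannel *ioc, Error **errp);

    /* Imagined use in a multifd send path: */
    static void multifd_send_pages_sketch(QIOChannel *ioc,
                                          MultiFDPages *pages,
                                          Error **errp)
    {
        /* Headers: small, reused buffers -> keep the synchronous write. */
        qio_channel_writev_all(ioc, &pages->header_iov, 1, errp);

        /* Guest RAM: large buffers -> queue as zerocopy, without locking,
         * accepting that pages dirtied meanwhile get resent anyway. */
        qio_channel_async_writev(ioc, pages->iov, pages->num,
                                 QIO_CHANNEL_WRITE_FLAG_ZEROCOPY, errp);
    }

    /* ...and once per iteration (or only before the final switchover): */
    /*     qio_channel_async_flush(ioc, errp);                           */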

Thank you for the great discussion :)
Leonardo Bras



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-08 15:24                       ` Peter Xu
@ 2021-09-09  8:49                         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 62+ messages in thread
From: Dr. David Alan Gilbert @ 2021-09-09  8:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: Elena Ufimtseva, John G Johnson, Daniel P. Berrangé,
	qemu-block, Juan Quintela, qemu-devel,
	Leonardo Bras Soares Passos, Paolo Bonzini,
	Marc-André Lureau, Fam Zheng, Jagannathan Raman

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Sep 08, 2021 at 09:30:58AM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > What if we do the 'flush()' before we start post-copy, instead of after each
> > > > > > iteration?  Would that be enough?
> > > > > 
> > > > > Possibly, yes. This really needs David G's input since he understands
> > > > > the code in way more detail than I do.
> > > > 
> > > > Hmm I'm not entirely sure why we have the sync after each iteration;
> > > 
> > > We don't have that yet, do we?
> > 
> > I think multifd does; I think it calls multifd_send_sync_main sometimes,
> > which I *think* corresponds to iterations.
> 
> Oh.. Then we may need to:
> 
>   (1) Make that sync work for zero-copy too; say, semaphores may not be enough
>       with it - we need to make sure all async buffers are consumed by checking
>       the error queue of the sockets,
> 
>   (2) When we make zero-copy work without multi-fd, we may want some similar
>       sync on the main stream too

It might not be worth bothering with (2) - zerocopy fits a lot better
with multifd's data organisation.

Dave

> Thanks,
> 
> > 
> > Dave
> > 
> > > > the case I can think of is if we're doing async sending, we could have
> > > > two versions of the same page in flight (one from each iteration) -
> > > > you'd want those to get there in the right order.
> > > 
> > > From MSG_ZEROCOPY document:
> > > 
> > >         For protocols that acknowledge data in-order, like TCP, each
> > >         notification can be squashed into the previous one, so that no more
> > >         than one notification is outstanding at any one point.
> > > 
> > >         Ordered delivery is the common case, but not guaranteed. Notifications
> > >         may arrive out of order on retransmission and socket teardown.
> > > 
> > > So no matter whether we have a flush() per iteration before, it seems we may
> > > want one when zero copy is enabled?
> > > 
> > > Thanks,
> > > 
> > > -- 
> > > Peter Xu
> > > 
> > -- 
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
> -- 
> Peter Xu
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v1 2/3] io: Add zerocopy and errqueue
  2021-09-09  4:58                           ` Leonardo Bras Soares Passos
@ 2021-09-09 16:40                             ` Peter Xu
  0 siblings, 0 replies; 62+ messages in thread
From: Peter Xu @ 2021-09-09 16:40 UTC (permalink / raw)
  To: Leonardo Bras Soares Passos
  Cc: Elena Ufimtseva, John G Johnson, Jagannathan Raman, qemu-block,
	Juan Quintela, qemu-devel, Daniel P. Berrangé,
	Dr. David Alan Gilbert, Paolo Bonzini, Marc-André Lureau,
	Fam Zheng

On Thu, Sep 09, 2021 at 01:58:39AM -0300, Leonardo Bras Soares Passos wrote:
> FWIW, what I had in mind for a (theoretical) migration setup with
> io_async_writev() + io_async_flush():

One trivial concern is that it's not strictly just "async", because "async"
can happen on any nonblocking fd; here it's more "zerocopy" specific.  But as
long as Dan is okay with it, I'm fine too with starting with "async", as long
as we properly document the required lifecycle of the buffers.
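
(To illustrate, a hypothetical doc-comment sketch -- the function name and
flag below are placeholders, not existing symbols:)

    /*
     * qio_channel_async_writev:
     *
     * With the zerocopy flag set, the call may return before the data has
     * been copied anywhere.  The caller must keep every buffer in @iov
     * mapped and unchanged until a later qio_channel_async_flush() on the
     * same channel returns success; only then may the buffers be reused
     * or freed.  Modifying a buffer earlier may put undefined data on the
     * wire.
     */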

> - For guest RAM we can decide not to rw_lock memory on zerocopy,
>   because there is no need,
> - For headers, we can decide not to use async (use io_writev() instead),
> - flush() can happen at each iteration of migration, or every N
>   seconds, or at the end.

I think we should only need the flush at the end of precopy, and that should
also cover the switch to postcopy.  We could merge this flush() into the
existing per-iteration sync of multi-fd, but it's not strictly necessary,
imho.  We can see which is easier.  The rest looks good to me.
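
For reference, a minimal sketch of what such a flush could do at the socket
level, based purely on the kernel's msg_zerocopy documentation (the helper
and its arguments are hypothetical, error handling is trimmed): drain
SO_EE_ORIGIN_ZEROCOPY notifications from the error queue until every
MSG_ZEROCOPY send queued since the last flush has been acknowledged.

    #include <errno.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <linux/errqueue.h>

    /* Block until 'pending' zerocopy sends have been acknowledged by the
     * kernel through the socket error queue. */
    static int zerocopy_flush(int sockfd, uint32_t pending)
    {
        while (pending > 0) {
            char control[128];
            struct msghdr msg = {
                .msg_control = control,
                .msg_controllen = sizeof(control),
            };

            if (recvmsg(sockfd, &msg, MSG_ERRQUEUE) < 0) {
                if (errno == EAGAIN || errno == EINTR) {
                    continue;   /* nothing yet; a real impl would poll() */
                }
                return -errno;
            }

            for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm;
                 cm = CMSG_NXTHDR(&msg, cm)) {
                struct sock_extended_err *err = (void *)CMSG_DATA(cm);

                if (err->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                    /* [ee_info, ee_data] is the (possibly coalesced) range
                     * of send calls completed by this notification. */
                    uint32_t done = err->ee_data - err->ee_info + 1;
                    pending = done < pending ? pending - done : 0;
                }
            }
        }
        return 0;
    }

Thanks,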

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2021-09-09 16:41 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-31 11:02 [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Leonardo Bras
2021-08-31 11:02 ` [PATCH v1 1/3] io: Enable write flags for QIOChannel Leonardo Bras
2021-09-01 20:54   ` Eric Blake
2021-09-02  8:26     ` Leonardo Bras Soares Passos
2021-08-31 11:02 ` [PATCH v1 2/3] io: Add zerocopy and errqueue Leonardo Bras
2021-08-31 12:57   ` Daniel P. Berrangé
2021-08-31 20:27     ` Peter Xu
2021-09-01  8:50       ` Daniel P. Berrangé
2021-09-01 15:52         ` Peter Xu
2021-09-01 15:59           ` Daniel P. Berrangé
2021-09-02  7:07         ` Leonardo Bras Soares Passos
2021-09-02  6:59       ` Leonardo Bras Soares Passos
2021-09-07 16:44         ` Peter Xu
2021-09-08 20:13           ` Leonardo Bras Soares Passos
2021-09-08 21:04             ` Peter Xu
2021-09-02  6:38     ` Leonardo Bras Soares Passos
2021-09-02  8:47       ` Daniel P. Berrangé
2021-09-02  9:34         ` Leonardo Bras Soares Passos
2021-09-02  9:49           ` Daniel P. Berrangé
2021-09-02 10:19             ` Leonardo Bras Soares Passos
2021-09-02 10:28               ` Daniel P. Berrangé
2021-09-07 11:06                 ` Dr. David Alan Gilbert
2021-09-07 18:09                   ` Peter Xu
2021-09-08  8:30                     ` Dr. David Alan Gilbert
2021-09-08 15:24                       ` Peter Xu
2021-09-09  8:49                         ` Dr. David Alan Gilbert
2021-09-08 20:25                   ` Leonardo Bras Soares Passos
2021-09-08 21:09                     ` Peter Xu
2021-09-08 21:57                       ` Daniel P. Berrangé
2021-09-09  2:05                         ` Peter Xu
2021-09-09  4:58                           ` Leonardo Bras Soares Passos
2021-09-09 16:40                             ` Peter Xu
2021-08-31 11:02 ` [PATCH v1 3/3] migration: multifd: Enable zerocopy Leonardo Bras
2021-08-31 13:16   ` Daniel P. Berrangé
2021-08-31 20:29     ` Peter Xu
2021-09-01  8:53       ` Daniel P. Berrangé
2021-09-01 15:35         ` Peter Xu
2021-09-01 15:44           ` Daniel P. Berrangé
2021-09-01 16:01             ` Peter Xu
2021-09-02  7:57             ` Leonardo Bras Soares Passos
2021-09-07 11:13             ` Dr. David Alan Gilbert
2021-09-08 15:26               ` Daniel P. Berrangé
2021-09-02  7:23           ` Jason Wang
2021-09-02  8:08             ` Leonardo Bras Soares Passos
2021-09-02  7:27       ` Leonardo Bras Soares Passos
2021-09-02  7:22     ` Leonardo Bras Soares Passos
2021-09-02  8:20       ` Daniel P. Berrangé
2021-09-02  8:52         ` Leonardo Bras Soares Passos
2021-09-02  9:20           ` Daniel P. Berrangé
2021-09-02  9:49             ` Leonardo Bras Soares Passos
2021-09-02  9:59               ` Daniel P. Berrangé
2021-09-02 10:25                 ` Leonardo Bras Soares Passos
2021-09-07 11:17             ` Dr. David Alan Gilbert
2021-09-07 18:32       ` Peter Xu
2021-09-08  2:59         ` Jason Wang
2021-09-08  3:24           ` Peter Xu
2021-09-08  3:26             ` Jason Wang
2021-09-08  8:19           ` Dr. David Alan Gilbert
2021-09-08 15:19             ` Peter Xu
2021-09-09  1:10               ` Jason Wang
2021-08-31 21:24 ` [PATCH v1 0/3] QIOChannel flags + multifd zerocopy Peter Xu
2021-09-01 19:21   ` Leonardo Bras Soares Passos
