* [PULL 00/10] migration queue
@ 2020-10-07 15:55 Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 01/10] virtiofsd: Silence gcc warning Dr. David Alan Gilbert (git)
                   ` (10 more replies)
  0 siblings, 11 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-07 15:55 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The following changes since commit f2687fdb7571a444b5af3509574b659d35ddd601:

  Merge remote-tracking branch 'remotes/bonzini-gitlab/tags/for-upstream' into staging (2020-10-06 15:04:10 +0100)

are available in the Git repository at:

  git://github.com/dagrh/qemu.git tags/pull-migration-20201007b

for you to fetch changes up to 1df31b8aca2aa4f83d5827d74700eeb6d711bbdf:

  migration/dirtyrate: present dirty rate only when querying the rate has completed (2020-10-07 16:49:26 +0100)

----------------------------------------------------------------
Migration and virtiofs pull 2020-07-10

Migration:
  Dirtyrate measurement API cleanup
  Postcopy recovery fixes

Virtiofsd:
  Missing qemu_init_exec_dir call
  Support for setting the group on socket creation
  Stop a gcc warning
  Avoid tempdir in sandboxing

----------------------------------------------------------------
Alex Bennée (1):
      tools/virtiofsd: add support for --socket-group

Chuan Zheng (2):
      migration/dirtyrate: record start_time and calc_time while at the measuring state
      migration/dirtyrate: present dirty rate only when querying the rate has completed

Dr. David Alan Gilbert (2):
      virtiofsd: Silence gcc warning
      virtiofsd: Call qemu_init_exec_dir

Peter Xu (4):
      migration: Pass incoming state into qemu_ufd_copy_ioctl()
      migration: Introduce migrate_send_rp_message_req_pages()
      migration: Maintain postcopy faulted addresses
      migration: Sync requested pages after postcopy recovery

Stefan Hajnoczi (1):
      virtiofsd: avoid /proc/self/fd tempdir

 docs/tools/virtiofsd.rst         |  4 +++
 migration/dirtyrate.c            | 16 ++++++-----
 migration/migration.c            | 49 ++++++++++++++++++++++++++++++++--
 migration/migration.h            | 21 ++++++++++++++-
 migration/postcopy-ram.c         | 25 +++++++++++++-----
 migration/savevm.c               | 57 ++++++++++++++++++++++++++++++++++++++++
 migration/trace-events           |  3 +++
 qapi/migration.json              |  8 +++---
 tools/virtiofsd/fuse_i.h         |  1 +
 tools/virtiofsd/fuse_lowlevel.c  |  6 +++++
 tools/virtiofsd/fuse_virtio.c    | 21 +++++++++++++--
 tools/virtiofsd/passthrough_ll.c | 38 ++++++++++-----------------
 12 files changed, 203 insertions(+), 46 deletions(-)




* [PULL 01/10] virtiofsd: Silence gcc warning
  2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
@ 2020-10-07 15:55 ` Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 02/10] tools/virtiofsd: add support for --socket-group Dr. David Alan Gilbert (git)
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-07 15:55 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

GCC worries that fd might be used uninitialised.  In reality it is always
set if fi is set, and only used if fi is set, so it is safe.  Initialise
it to -1 just to keep GCC happy for now.
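
The pattern behind the warning can be sketched in isolation.  This is a
minimal illustration, not the lo_setattr() code: struct file_info and
pick_fd() are hypothetical names invented for the example.  The variable
is written and read under the same condition, but in two separate
branches, which the compiler may fail to correlate.

```c
#include <stddef.h>

/* Hypothetical stand-in for a file-info structure; only one field. */
struct file_info {
    int fh;
};

/*
 * Returns the descriptor to operate on, or -1 when no file info is given.
 * fd is assigned only when fi is non-NULL and read only when fi is
 * non-NULL, so the sentinel is never observed through the second branch.
 */
static int pick_fd(const struct file_info *fi)
{
    int fd = -1;  /* sentinel; avoids a maybe-uninitialized warning */

    if (fi) {
        fd = fi->fh;
    }

    /* ... unrelated work would sit here in real code ... */

    if (fi) {
        return fd;
    }
    return -1;
}
```

Initialising to an invalid descriptor also means that a later bug which
does read fd on the wrong path fails loudly instead of using garbage.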

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20200827153657.111098-2-dgilbert@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 0b229ebd57..36ad46e0c0 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -620,7 +620,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
     struct lo_inode *inode;
     int ifd;
     int res;
-    int fd;
+    int fd = -1;
 
     inode = lo_inode(req, ino);
     if (!inode) {
-- 
2.28.0




* [PULL 02/10] tools/virtiofsd: add support for --socket-group
  2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 01/10] virtiofsd: Silence gcc warning Dr. David Alan Gilbert (git)
@ 2020-10-07 15:55 ` Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 03/10] virtiofsd: Call qemu_init_exec_dir Dr. David Alan Gilbert (git)
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-07 15:55 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: Alex Bennée <alex.bennee@linaro.org>

If you like running QEMU as a normal user (very common for TCG runs)
but have to run virtiofsd as root, you run into connection problems.
Adding an optional --socket-group lets virtiofsd set the group of the
vhost-user socket, so that users can keep using the command line.
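
The group hand-over itself can be sketched as below.  This is a minimal
illustration assuming a helper named set_path_group(), which is not part
of virtiofsd.  It uses only getgrnam() and chown(); note that chown()
returns 0 on success and -1 on failure.

```c
#include <grp.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: give group ownership of `path` to the group named
 * `group`, leaving the owner untouched.  Returns 0 on success and -1 on
 * failure (unknown group or chown() error).
 */
static int set_path_group(const char *path, const char *group)
{
    struct group *g = getgrnam(group);

    if (!g) {
        fprintf(stderr, "unknown group %s\n", group);
        return -1;
    }
    /* (uid_t)-1 tells chown() not to change the owner. */
    if (chown(path, (uid_t)-1, g->gr_gid) == -1) {
        perror("chown");
        return -1;
    }
    return 0;
}
```

A caller would typically invoke this right after bind() has created the
socket file, before any client is expected to connect.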

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Message-Id: <20200925125147.26943-2-alex.bennee@linaro.org>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  dgilbert: Split long line
---
 docs/tools/virtiofsd.rst        |  4 ++++
 tools/virtiofsd/fuse_i.h        |  1 +
 tools/virtiofsd/fuse_lowlevel.c |  6 ++++++
 tools/virtiofsd/fuse_virtio.c   | 21 +++++++++++++++++++--
 4 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
index ae02938a95..7ecee49834 100644
--- a/docs/tools/virtiofsd.rst
+++ b/docs/tools/virtiofsd.rst
@@ -87,6 +87,10 @@ Options
 
   Listen on vhost-user UNIX domain socket at PATH.
 
+.. option:: --socket-group=GROUP
+
+  Set the vhost-user UNIX domain socket gid to GROUP.
+
 .. option:: --fd=FDNUM
 
   Accept connections from vhost-user UNIX domain socket file descriptor FDNUM.
diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h
index 1240828208..492e002181 100644
--- a/tools/virtiofsd/fuse_i.h
+++ b/tools/virtiofsd/fuse_i.h
@@ -68,6 +68,7 @@ struct fuse_session {
     size_t bufsize;
     int error;
     char *vu_socket_path;
+    char *vu_socket_group;
     int   vu_listen_fd;
     int   vu_socketfd;
     struct fv_VuDev *virtio_dev;
diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index 2dd36ec03b..4d1ba2925d 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -2523,6 +2523,7 @@ static const struct fuse_opt fuse_ll_opts[] = {
     LL_OPTION("--debug", debug, 1),
     LL_OPTION("allow_root", deny_others, 1),
     LL_OPTION("--socket-path=%s", vu_socket_path, 0),
+    LL_OPTION("--socket-group=%s", vu_socket_group, 0),
     LL_OPTION("--fd=%d", vu_listen_fd, 0),
     LL_OPTION("--thread-pool-size=%d", thread_pool_size, 0),
     FUSE_OPT_END
@@ -2630,6 +2631,11 @@ struct fuse_session *fuse_session_new(struct fuse_args *args,
                  "fuse: --socket-path and --fd cannot be given together\n");
         goto out4;
     }
+    if (se->vu_socket_group && !se->vu_socket_path) {
+        fuse_log(FUSE_LOG_ERR,
+                 "fuse: --socket-group can only be used with --socket-path\n");
+        goto out4;
+    }
 
     se->bufsize = FUSE_MAX_MAX_PAGES * getpagesize() + FUSE_BUFFER_HEADER_SIZE;
 
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index d5c8e98253..89f537f79b 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -31,6 +31,8 @@
 #include <sys/socket.h>
 #include <sys/types.h>
 #include <sys/un.h>
+#include <sys/types.h>
+#include <grp.h>
 #include <unistd.h>
 
 #include "contrib/libvhost-user/libvhost-user.h"
@@ -924,15 +926,30 @@ static int fv_create_listen_socket(struct fuse_session *se)
 
     /*
      * Unfortunately bind doesn't let you set the mask on the socket,
-     * so set umask to 077 and restore it later.
+     * so set umask appropriately and restore it later.
      */
-    old_umask = umask(0077);
+    if (se->vu_socket_group) {
+        old_umask = umask(S_IROTH | S_IWOTH | S_IXOTH);
+    } else {
+        old_umask = umask(S_IRGRP | S_IWGRP | S_IXGRP |
+                          S_IROTH | S_IWOTH | S_IXOTH);
+    }
     if (bind(listen_sock, (struct sockaddr *)&un, addr_len) == -1) {
         fuse_log(FUSE_LOG_ERR, "vhost socket bind: %m\n");
         close(listen_sock);
         umask(old_umask);
         return -1;
     }
+    if (se->vu_socket_group) {
+        struct group *g = getgrnam(se->vu_socket_group);
+        if (g) {
+            if (chown(se->vu_socket_path, -1, g->gr_gid) == -1) {
+                fuse_log(FUSE_LOG_WARNING,
+                         "vhost socket failed to set group to %s (%d)\n",
+                         se->vu_socket_group, g->gr_gid);
+            }
+        }
+    }
     umask(old_umask);
 
     if (listen(listen_sock, 1) == -1) {
-- 
2.28.0




* [PULL 03/10] virtiofsd: Call qemu_init_exec_dir
  2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 01/10] virtiofsd: Silence gcc warning Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 02/10] tools/virtiofsd: add support for --socket-group Dr. David Alan Gilbert (git)
@ 2020-10-07 15:55 ` Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 04/10] virtiofsd: avoid /proc/self/fd tempdir Dr. David Alan Gilbert (git)
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-07 15:55 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Since fcb4f59c879, qemu_get_local_state_pathname relies on the exec dir
set up by qemu_init_exec_dir, and virtiofsd asserts because we never
set it.  Set it.

Reported-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20201002124015.44820-1-dgilbert@redhat.com>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 36ad46e0c0..477e6ee0b5 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2839,6 +2839,8 @@ int main(int argc, char *argv[])
     /* Don't mask creation mode, kernel already did that */
     umask(0);
 
+    qemu_init_exec_dir(argv[0]);
+
     pthread_mutex_init(&lo.mutex, NULL);
     lo.inodes = g_hash_table_new(lo_key_hash, lo_key_equal);
     lo.root.fd = -1;
-- 
2.28.0




* [PULL 04/10] virtiofsd: avoid /proc/self/fd tempdir
  2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
                   ` (2 preceding siblings ...)
  2020-10-07 15:55 ` [PULL 03/10] virtiofsd: Call qemu_init_exec_dir Dr. David Alan Gilbert (git)
@ 2020-10-07 15:55 ` Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 05/10] migration: Pass incoming state into qemu_ufd_copy_ioctl() Dr. David Alan Gilbert (git)
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-07 15:55 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: Stefan Hajnoczi <stefanha@redhat.com>

In order to prevent /proc/self/fd escapes a temporary directory is
created where /proc/self/fd is bind-mounted. This doesn't work on
read-only file systems.

Avoid the temporary directory by bind-mounting /proc/self/fd over /proc.
This does not affect other processes since we remounted / with MS_REC |
MS_SLAVE. /proc must exist and virtiofsd does not use it so it's safe to
do this.

Path traversal can be tested with the following function:

  static void test_proc_fd_escape(struct lo_data *lo)
  {
      int fd;
      int level = 0;
      ino_t last_ino = 0;

      fd = lo->proc_self_fd;
      for (;;) {
          struct stat st;

          if (fstat(fd, &st) != 0) {
              perror("fstat");
              return;
          }
          if (last_ino && st.st_ino == last_ino) {
              fprintf(stderr, "inode number unchanged, stopping\n");
              return;
          }
          last_ino = st.st_ino;

          fprintf(stderr, "Level %d dev %lu ino %lu\n", level,
                  (unsigned long)st.st_dev,
                  (unsigned long)last_ino);
          fd = openat(fd, "..", O_PATH | O_DIRECTORY | O_NOFOLLOW);
          level++;
      }
  }

Before and after this patch only Level 0 is displayed. Without
/proc/self/fd bind-mount protection it is possible to traverse parent
directories.

Fixes: 397ae982f4df4 ("virtiofsd: jail lo->proc_self_fd")
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Jens Freimann <jfreimann@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20201006095826.59813-1-stefanha@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Tested-by: Jens Freimann <jfreimann@redhat.com>
Reviewed-by: Jens Freimann <jfreimann@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 34 +++++++++++---------------------
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 477e6ee0b5..ff53df4451 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2393,8 +2393,6 @@ static void setup_wait_parent_capabilities(void)
 static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
 {
     pid_t child;
-    char template[] = "virtiofsd-XXXXXX";
-    char *tmpdir;
 
     /*
      * Create a new pid namespace for *child* processes.  We'll have to
@@ -2458,33 +2456,23 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
         exit(1);
     }
 
-    tmpdir = mkdtemp(template);
-    if (!tmpdir) {
-        fuse_log(FUSE_LOG_ERR, "tmpdir(%s): %m\n", template);
-        exit(1);
-    }
-
-    if (mount("/proc/self/fd", tmpdir, NULL, MS_BIND, NULL) < 0) {
-        fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, %s, MS_BIND): %m\n",
-                 tmpdir);
+    /*
+     * We only need /proc/self/fd. Prevent ".." from accessing parent
+     * directories of /proc/self/fd by bind-mounting it over /proc. Since / was
+     * previously remounted with MS_REC | MS_SLAVE this mount change only
+     * affects our process.
+     */
+    if (mount("/proc/self/fd", "/proc", NULL, MS_BIND, NULL) < 0) {
+        fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, MS_BIND): %m\n");
         exit(1);
     }
 
-    /* Now we can get our /proc/self/fd directory file descriptor */
-    lo->proc_self_fd = open(tmpdir, O_PATH);
+    /* Get the /proc (actually /proc/self/fd, see above) file descriptor */
+    lo->proc_self_fd = open("/proc", O_PATH);
     if (lo->proc_self_fd == -1) {
-        fuse_log(FUSE_LOG_ERR, "open(%s, O_PATH): %m\n", tmpdir);
+        fuse_log(FUSE_LOG_ERR, "open(/proc, O_PATH): %m\n");
         exit(1);
     }
-
-    if (umount2(tmpdir, MNT_DETACH) < 0) {
-        fuse_log(FUSE_LOG_ERR, "umount2(%s, MNT_DETACH): %m\n", tmpdir);
-        exit(1);
-    }
-
-    if (rmdir(tmpdir) < 0) {
-        fuse_log(FUSE_LOG_ERR, "rmdir(%s): %m\n", tmpdir);
-    }
 }
 
 /*
-- 
2.28.0




* [PULL 05/10] migration: Pass incoming state into qemu_ufd_copy_ioctl()
  2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
                   ` (3 preceding siblings ...)
  2020-10-07 15:55 ` [PULL 04/10] virtiofsd: avoid /proc/self/fd tempdir Dr. David Alan Gilbert (git)
@ 2020-10-07 15:55 ` Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 06/10] migration: Introduce migrate_send_rp_message_req_pages() Dr. David Alan Gilbert (git)
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-07 15:55 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: Peter Xu <peterx@redhat.com>

It will be used in follow-up patches to access more of its fields.  Meanwhile,
fetch the userfaultfd inside the function.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201002175336.30858-2-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/postcopy-ram.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 0a2f88a87d..722034dc01 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1128,10 +1128,12 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
     return 0;
 }
 
-static int qemu_ufd_copy_ioctl(int userfault_fd, void *host_addr,
+static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
                                void *from_addr, uint64_t pagesize, RAMBlock *rb)
 {
+    int userfault_fd = mis->userfault_fd;
     int ret;
+
     if (from_addr) {
         struct uffdio_copy copy_struct;
         copy_struct.dst = (uint64_t)(uintptr_t)host_addr;
@@ -1185,7 +1187,7 @@ int postcopy_place_page(MigrationIncomingState *mis, void *host, void *from,
      * which would be slightly cheaper, but we'd have to be careful
      * of the order of updating our page state.
      */
-    if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, from, pagesize, rb)) {
+    if (qemu_ufd_copy_ioctl(mis, host, from, pagesize, rb)) {
         int e = errno;
         error_report("%s: %s copy host: %p from: %p (size: %zd)",
                      __func__, strerror(e), host, from, pagesize);
@@ -1212,7 +1214,7 @@ int postcopy_place_page_zero(MigrationIncomingState *mis, void *host,
      * but it's not available for everything (e.g. hugetlbpages)
      */
     if (qemu_ram_is_uf_zeroable(rb)) {
-        if (qemu_ufd_copy_ioctl(mis->userfault_fd, host, NULL, pagesize, rb)) {
+        if (qemu_ufd_copy_ioctl(mis, host, NULL, pagesize, rb)) {
             int e = errno;
             error_report("%s: %s zero host: %p",
                          __func__, strerror(e), host);
-- 
2.28.0




* [PULL 06/10] migration: Introduce migrate_send_rp_message_req_pages()
  2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
                   ` (4 preceding siblings ...)
  2020-10-07 15:55 ` [PULL 05/10] migration: Pass incoming state into qemu_ufd_copy_ioctl() Dr. David Alan Gilbert (git)
@ 2020-10-07 15:55 ` Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 07/10] migration: Maintain postcopy faulted addresses Dr. David Alan Gilbert (git)
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-07 15:55 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: Peter Xu <peterx@redhat.com>

This adds another wrapper layer for sending a page request to the source VM.
The new migrate_send_rp_message_req_pages() will be used elsewhere in coming
patches.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201002175336.30858-3-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/migration.c | 10 ++++++++--
 migration/migration.h |  2 ++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index aca7fdcd0b..b2dac6b39c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -316,8 +316,8 @@ error:
  *   Start: Address offset within the RB
  *   Len: Length in bytes required - must be a multiple of pagesize
  */
-int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb,
-                              ram_addr_t start)
+int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
+                                      RAMBlock *rb, ram_addr_t start)
 {
     uint8_t bufc[12 + 1 + 255]; /* start (8), len (4), rbname up to 256 */
     size_t msglen = 12; /* start + len */
@@ -353,6 +353,12 @@ int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb,
     return migrate_send_rp_message(mis, msg_type, msglen, bufc);
 }
 
+int migrate_send_rp_req_pages(MigrationIncomingState *mis,
+                              RAMBlock *rb, ram_addr_t start)
+{
+    return migrate_send_rp_message_req_pages(mis, rb, start);
+}
+
 static bool migration_colo_enabled;
 bool migration_incoming_colo_enabled(void)
 {
diff --git a/migration/migration.h b/migration/migration.h
index deb411aaad..e853ccf8b1 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -333,6 +333,8 @@ void migrate_send_rp_pong(MigrationIncomingState *mis,
                           uint32_t value);
 int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb,
                               ram_addr_t start);
+int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
+                                      RAMBlock *rb, ram_addr_t start);
 void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
                                  char *block_name);
 void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
-- 
2.28.0




* [PULL 07/10] migration: Maintain postcopy faulted addresses
  2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
                   ` (5 preceding siblings ...)
  2020-10-07 15:55 ` [PULL 06/10] migration: Introduce migrate_send_rp_message_req_pages() Dr. David Alan Gilbert (git)
@ 2020-10-07 15:55 ` Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 08/10] migration: Sync requested pages after postcopy recovery Dr. David Alan Gilbert (git)
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-07 15:55 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: Peter Xu <peterx@redhat.com>

Maintain a list of faulted addresses on the destination host that we're
waiting on.  This is implemented using a GTree rather than a real list so
that, even if plenty of vCPUs/threads are faulting, the lookup is still
fast at O(log(N)) (we do a lookup after placing each page).  It adds a
slight overhead, but that shouldn't be a big problem because in most cases
the requested page list will be short.

Actually we did similar things for postcopy blocktime measurements.  This patch
doesn't reuse that because:

  (1) blocktime measurement covers vcpu threads only, but here we need to
      record all faulted addresses, including those from the main thread and
      external threads (e.g. DPDK via vhost-user).

  (2) blocktime measurement requires UFFD_FEATURE_THREAD_ID, but here we
      don't want to add that extra dependency on the kernel version since it
      is not necessary.  E.g., we don't need to know which thread faulted on
      which page, nor do we care about multiple threads faulting on the same
      page.  We only care about which addresses have faulted and are waiting
      for a page to be copied from the source.

  (3) blocktime measurement is not enabled by default.  However we need this
      by default, especially for postcopy recovery.

Another thing to mention is that this patch introduces a new mutex to serialize
access to the receivedmap and the page_requested tree; however, that
serialization does not cover other procedures like UFFDIO_COPY.
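
Two small building blocks of the tree can be sketched standalone: the
three-way address comparison used as the tree's ordering, and rounding a
faulted address down to its page so that all faults within one page share
a key.  This is a minimal sketch; addr_cmp() and page_align_down() are
illustrative names and the page size is passed in explicitly rather than
queried from QEMU.

```c
#include <assert.h>
#include <stdint.h>

/* Three-way comparison of two addresses: returns -1, 0 or 1. */
static int addr_cmp(uintptr_t a, uintptr_t b)
{
    return (a > b) - (a < b);
}

/*
 * Round an address down to the start of its page.  page_size must be a
 * power of two, so -page_size (computed on an unsigned type) is a mask
 * with the low log2(page_size) bits cleared.
 */
static uint64_t page_align_down(uint64_t addr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return addr & -page_size;
}
```

Comparing with (a > b) - (a < b) rather than a - b avoids truncation when
the difference of two pointer-sized values does not fit in an int.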

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201002175336.30858-4-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/migration.c    | 41 +++++++++++++++++++++++++++++++++++++++-
 migration/migration.h    | 19 ++++++++++++++++++-
 migration/postcopy-ram.c | 17 ++++++++++++++---
 migration/trace-events   |  2 ++
 4 files changed, 74 insertions(+), 5 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index b2dac6b39c..e7d179bffc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -143,6 +143,13 @@ static int migration_maybe_pause(MigrationState *s,
                                  int new_state);
 static void migrate_fd_cancel(MigrationState *s);
 
+static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
+{
+    unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
+
+    return (a > b) - (a < b);
+}
+
 void migration_object_init(void)
 {
     MachineState *ms = MACHINE(qdev_get_machine());
@@ -165,6 +172,8 @@ void migration_object_init(void)
     qemu_event_init(&current_incoming->main_thread_load_event, false);
     qemu_sem_init(&current_incoming->postcopy_pause_sem_dst, 0);
     qemu_sem_init(&current_incoming->postcopy_pause_sem_fault, 0);
+    qemu_mutex_init(&current_incoming->page_request_mutex);
+    current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
 
     if (!migration_object_check(current_migration, &err)) {
         error_report_err(err);
@@ -240,6 +249,11 @@ void migration_incoming_state_destroy(void)
 
     qemu_event_reset(&mis->main_thread_load_event);
 
+    if (mis->page_requested) {
+        g_tree_destroy(mis->page_requested);
+        mis->page_requested = NULL;
+    }
+
     if (mis->socket_address_list) {
         qapi_free_SocketAddressList(mis->socket_address_list);
         mis->socket_address_list = NULL;
@@ -354,8 +368,33 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
 }
 
 int migrate_send_rp_req_pages(MigrationIncomingState *mis,
-                              RAMBlock *rb, ram_addr_t start)
+                              RAMBlock *rb, ram_addr_t start, uint64_t haddr)
 {
+    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_target_page_size()));
+    bool received;
+
+    WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
+        received = ramblock_recv_bitmap_test_byte_offset(rb, start);
+        if (!received && !g_tree_lookup(mis->page_requested, aligned)) {
+            /*
+             * The page has not been received, and it's not yet in the page
+             * request list.  Queue it.  Set the value of element to 1, so that
+             * things like g_tree_lookup() will return TRUE (1) when found.
+             */
+            g_tree_insert(mis->page_requested, aligned, (gpointer)1);
+            mis->page_requested_count++;
+            trace_postcopy_page_req_add(aligned, mis->page_requested_count);
+        }
+    }
+
+    /*
+     * If the page is there, skip sending the message.  We don't even need the
+     * lock because as long as the page arrived, it'll be there forever.
+     */
+    if (received) {
+        return 0;
+    }
+
     return migrate_send_rp_message_req_pages(mis, rb, start);
 }
 
diff --git a/migration/migration.h b/migration/migration.h
index e853ccf8b1..8d2d1ce839 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -104,6 +104,23 @@ struct MigrationIncomingState {
 
     /* List of listening socket addresses  */
     SocketAddressList *socket_address_list;
+
+    /* A tree of pages that we requested to the source VM */
+    GTree *page_requested;
+    /* For debugging purpose only, but would be nice to keep */
+    int page_requested_count;
+    /*
+     * The mutex helps to maintain the requested pages that we sent to the
+     * source, IOW, to guarantee coherence between the page_requested tree
+     * and the per-ramblock receivedmap.  Note! This does not guarantee
+     * consistency of the real page copy procedures (using
+     * UFFDIO_[ZERO]COPY).  E.g., even if one bit in receivedmap is cleared,
+     * UFFDIO_COPY could have happened for that page already.  This is
+     * intended so that the mutex is not held across, and blocked by, slow
+     * operations like UFFDIO_* ioctls.  However this should be enough to
+     * make sure the page_requested tree always contains valid information.
+     */
+    QemuMutex page_request_mutex;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
@@ -332,7 +349,7 @@ void migrate_send_rp_shut(MigrationIncomingState *mis,
 void migrate_send_rp_pong(MigrationIncomingState *mis,
                           uint32_t value);
 int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb,
-                              ram_addr_t start);
+                              ram_addr_t start, uint64_t haddr);
 int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
                                       RAMBlock *rb, ram_addr_t start);
 void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 722034dc01..ca1daf0024 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -684,7 +684,7 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                         qemu_ram_get_idstr(rb), rb_offset);
         return postcopy_wake_shared(pcfd, client_addr, rb);
     }
-    migrate_send_rp_req_pages(mis, rb, aligned_rbo);
+    migrate_send_rp_req_pages(mis, rb, aligned_rbo, client_addr);
     return 0;
 }
 
@@ -979,7 +979,8 @@ retry:
              * Send the request to the source - we want to request one
              * of our host page sizes (which is >= TPS)
              */
-            ret = migrate_send_rp_req_pages(mis, rb, rb_offset);
+            ret = migrate_send_rp_req_pages(mis, rb, rb_offset,
+                                            msg.arg.pagefault.address);
             if (ret) {
                 /* May be network failure, try to wait for recovery */
                 if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
@@ -1149,10 +1150,20 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
         ret = ioctl(userfault_fd, UFFDIO_ZEROPAGE, &zero_struct);
     }
     if (!ret) {
+        qemu_mutex_lock(&mis->page_request_mutex);
         ramblock_recv_bitmap_set_range(rb, host_addr,
                                        pagesize / qemu_target_page_size());
+        /*
+         * If this page resolves a page fault for a previous recorded faulted
+         * address, take a special note to maintain the requested page list.
+         */
+        if (g_tree_lookup(mis->page_requested, host_addr)) {
+            g_tree_remove(mis->page_requested, host_addr);
+            mis->page_requested_count--;
+            trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
+        }
+        qemu_mutex_unlock(&mis->page_request_mutex);
         mark_postcopy_blocktime_end((uintptr_t)host_addr);
-
     }
     return ret;
 }
diff --git a/migration/trace-events b/migration/trace-events
index 338f38b3dd..e4d5eb94ca 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -162,6 +162,7 @@ postcopy_pause_return_path(void) ""
 postcopy_pause_return_path_continued(void) ""
 postcopy_pause_continued(void) ""
 postcopy_start_set_run(void) ""
+postcopy_page_req_add(void *addr, int count) "new page req %p total %d"
 source_return_path_thread_bad_end(void) ""
 source_return_path_thread_end(void) ""
 source_return_path_thread_entry(void) ""
@@ -272,6 +273,7 @@ postcopy_ram_incoming_cleanup_blocktime(uint64_t total) "total blocktime %" PRIu
 postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
 postcopy_request_shared_page_present(const char *sharer, const char *rb, uint64_t rb_offset) "%s already %s offset 0x%"PRIx64
 postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
+postcopy_page_req_del(void *addr, int count) "resolved page req %p total %d"
 
 get_mem_fault_cpu_index(int cpu, uint32_t pid) "cpu: %d, pid: %u"
 
-- 
2.28.0




* [PULL 08/10] migration: Sync requested pages after postcopy recovery
  2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
                   ` (6 preceding siblings ...)
  2020-10-07 15:55 ` [PULL 07/10] migration: Maintain postcopy faulted addresses Dr. David Alan Gilbert (git)
@ 2020-10-07 15:55 ` Dr. David Alan Gilbert (git)
  2020-10-07 15:55 ` [PULL 09/10] migration/dirtyrate: record start_time and calc_time while at the measuring state Dr. David Alan Gilbert (git)
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-07 15:55 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: Peter Xu <peterx@redhat.com>

We synchronize the requested pages right after a postcopy recovery happens.
This helps to synchronize the prioritized pages on source so that the faulted
threads can be served faster.
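
The re-sync idea can be pictured with a minimal, self-contained sketch.  It is
not the code in this patch: the fixed-size array, `resend_pending()` and
`demo_send()` are hypothetical stand-ins for the GTree and the return-path
message helper, and the locking that the patch takes around the walk is left
out.  It only shows that every pending address is offered to the sender again
and that one failing entry does not stop the walk.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64

/* Hypothetical stand-in for the set of still-unanswered page requests. */
struct pending_reqs {
    uintptr_t addr[MAX_PENDING];
    size_t count;
};

/* Sender callback: returns 0 on success, non-zero on failure. */
typedef int (*send_fn)(uintptr_t addr, void *opaque);

/*
 * Offer every pending address to 'send' again.  A failure on one entry is
 * only counted as a miss; the walk always continues with the next entry.
 * Returns the number of requests that were re-sent successfully.
 */
size_t resend_pending(const struct pending_reqs *p, send_fn send, void *opaque)
{
    size_t ok = 0;

    for (size_t i = 0; i < p->count; i++) {
        if (send(p->addr[i], opaque) == 0) {
            ok++;
        }
    }
    return ok;
}

/*
 * Demo sender (an assumption, for illustration only): treats address 0 as
 * illegal and otherwise bumps the int counter passed through 'opaque'.
 */
int demo_send(uintptr_t addr, void *opaque)
{
    if (addr == 0) {
        return -1;
    }
    (*(int *)opaque)++;
    return 0;
}
```

Calling `resend_pending(&p, demo_send, &sent)` on a set holding two valid
addresses and one zero entry reports two successful re-sends.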

Reported-by: Xiaohui Li <xiaohli@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201002175336.30858-5-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/savevm.c     | 57 ++++++++++++++++++++++++++++++++++++++++++
 migration/trace-events |  1 +
 2 files changed, 58 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index d2e141f7b1..33acbba1a4 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2011,6 +2011,49 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis)
     return LOADVM_QUIT;
 }
 
+/* We must be with page_request_mutex held */
+static gboolean postcopy_sync_page_req(gpointer key, gpointer value,
+                                       gpointer data)
+{
+    MigrationIncomingState *mis = data;
+    void *host_addr = (void *) key;
+    ram_addr_t rb_offset;
+    RAMBlock *rb;
+    int ret;
+
+    rb = qemu_ram_block_from_host(host_addr, true, &rb_offset);
+    if (!rb) {
+        /*
+         * This should _never_ happen.  However be nice for a migrating VM to
+         * not crash/assert.  Post an error (note: intended to not use *_once
+         * because we do want to see all the illegal addresses; and this can
+         * never be triggered by the guest so we're safe) and move on next.
+         */
+        error_report("%s: illegal host addr %p", __func__, host_addr);
+        /* Try the next entry */
+        return FALSE;
+    }
+
+    ret = migrate_send_rp_message_req_pages(mis, rb, rb_offset);
+    if (ret) {
+        /* Please refer to above comment. */
+        error_report("%s: send rp message failed for addr %p",
+                     __func__, host_addr);
+        return FALSE;
+    }
+
+    trace_postcopy_page_req_sync(host_addr);
+
+    return FALSE;
+}
+
+static void migrate_send_rp_req_pages_pending(MigrationIncomingState *mis)
+{
+    WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
+        g_tree_foreach(mis->page_requested, postcopy_sync_page_req, mis);
+    }
+}
+
 static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
 {
     if (mis->state != MIGRATION_STATUS_POSTCOPY_RECOVER) {
@@ -2033,6 +2076,20 @@ static int loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
     /* Tell source that "we are ready" */
     migrate_send_rp_resume_ack(mis, MIGRATION_RESUME_ACK_VALUE);
 
+    /*
+     * After a postcopy recovery, the source should have lost the postcopy
+     * queue, or potentially the requested pages could have been lost during
+     * the network down phase.  Let's re-sync with the source VM by re-sending
+     * all the pending pages that we eagerly need, so these threads won't get
+     * blocked too long due to the recovery.
+     *
+     * Without this procedure, the faulted destination VM threads (waiting for
+     * page requests right before the postcopy is interrupted) can keep hanging
+     * until the pages are sent by the source during the background copying of
+     * pages, or another thread faulted on the same address accidentally.
+     */
+    migrate_send_rp_req_pages_pending(mis);
+
     return 0;
 }
 
diff --git a/migration/trace-events b/migration/trace-events
index e4d5eb94ca..0fbfd2da60 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -49,6 +49,7 @@ vmstate_save(const char *idstr, const char *vmsd_name) "%s, %s"
 vmstate_load(const char *idstr, const char *vmsd_name) "%s, %s"
 postcopy_pause_incoming(void) ""
 postcopy_pause_incoming_continued(void) ""
+postcopy_page_req_sync(void *host_addr) "sync page req %p"
 
 # vmstate.c
 vmstate_load_field_error(const char *field, int ret) "field \"%s\" load failed, ret = %d"
-- 
2.28.0




* [PULL 09/10] migration/dirtyrate: record start_time and calc_time while at the measuring state
  2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
                   ` (7 preceding siblings ...)
  2020-10-07 15:55 ` [PULL 08/10] migration: Sync requested pages after postcopy recovery Dr. David Alan Gilbert (git)
@ 2020-10-07 15:55 ` Dr. David Alan Gilbert (git)
  2020-10-07 15:56 ` [PULL 10/10] migration/dirtyrate: present dirty rate only when querying the rate has completed Dr. David Alan Gilbert (git)
  2020-10-08 16:18 ` [PULL 00/10] migration queue Peter Maydell
  10 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-07 15:55 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: Chuan Zheng <zhengchuan@huawei.com>

Let queries report both the start-time and the calc-time while in the measuring
state, so that a caller can determine when to come back for the result.
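
As a rough illustration of how a caller might use the two values, here is a
small sketch.  The struct and function names are hypothetical, and it assumes
both fields are in seconds on the same clock, as the code in this patch
suggests (start_time is the millisecond clock divided by 1000, calc_time is
the sample period in seconds).

```c
#include <stdint.h>

/* Hypothetical mirror of the two reported fields, both in seconds. */
struct dirty_rate_timing {
    int64_t start_time;
    int64_t calc_time;
};

/* Earliest time (seconds) at which a result can be expected. */
int64_t dirty_rate_expected_done(const struct dirty_rate_timing *t)
{
    return t->start_time + t->calc_time;
}

/* Seconds left to wait at time 'now'; 0 once the period has elapsed. */
int64_t dirty_rate_seconds_left(const struct dirty_rate_timing *t, int64_t now)
{
    int64_t done = dirty_rate_expected_done(t);

    return now < done ? done - now : 0;
}
```

A caller that sees status "measuring" could sleep for
`dirty_rate_seconds_left()` seconds before querying again instead of polling
in a tight loop.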

Signed-off-by: Chuan Zheng <zhengchuan@huawei.com>
Message-Id: <1601350938-128320-2-git-send-email-zhengchuan@huawei.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/dirtyrate.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index 68577ef250..40e41e793e 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -83,14 +83,14 @@ static struct DirtyRateInfo *query_dirty_rate_info(void)
     return info;
 }
 
-static void reset_dirtyrate_stat(void)
+static void init_dirtyrate_stat(int64_t start_time, int64_t calc_time)
 {
     DirtyStat.total_dirty_samples = 0;
     DirtyStat.total_sample_count = 0;
     DirtyStat.total_block_mem_MB = 0;
     DirtyStat.dirty_rate = -1;
-    DirtyStat.start_time = 0;
-    DirtyStat.calc_time = 0;
+    DirtyStat.start_time = start_time;
+    DirtyStat.calc_time = calc_time;
 }
 
 static void update_dirtyrate_stat(struct RamblockDirtyInfo *info)
@@ -335,7 +335,6 @@ static void calculate_dirtyrate(struct DirtyRateConfig config)
     int64_t initial_time;
 
     rcu_register_thread();
-    reset_dirtyrate_stat();
     rcu_read_lock();
     initial_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     if (!record_ramblock_hash_info(&block_dinfo, config, &block_count)) {
@@ -365,6 +364,8 @@ void *get_dirtyrate_thread(void *arg)
 {
     struct DirtyRateConfig config = *(struct DirtyRateConfig *)arg;
     int ret;
+    int64_t start_time;
+    int64_t calc_time;
 
     ret = dirtyrate_set_state(&CalculatingState, DIRTY_RATE_STATUS_UNSTARTED,
                               DIRTY_RATE_STATUS_MEASURING);
@@ -373,6 +374,10 @@ void *get_dirtyrate_thread(void *arg)
         return NULL;
     }
 
+    start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) / 1000;
+    calc_time = config.sample_period_seconds;
+    init_dirtyrate_stat(start_time, calc_time);
+
     calculate_dirtyrate(config);
 
     ret = dirtyrate_set_state(&CalculatingState, DIRTY_RATE_STATUS_MEASURING,
-- 
2.28.0




* [PULL 10/10] migration/dirtyrate: present dirty rate only when querying the rate has completed
  2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
                   ` (8 preceding siblings ...)
  2020-10-07 15:55 ` [PULL 09/10] migration/dirtyrate: record start_time and calc_time while at the measuring state Dr. David Alan Gilbert (git)
@ 2020-10-07 15:56 ` Dr. David Alan Gilbert (git)
  2020-10-08 16:18 ` [PULL 00/10] migration queue Peter Maydell
  10 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-07 15:56 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: Chuan Zheng <zhengchuan@huawei.com>

Make the dirty_rate field optional and present the dirty rate only when
querying the rate has completed.
The QMP results are as follows:
@unstarted:
{"return":{"status":"unstarted","start-time":0,"calc-time":0},"id":"libvirt-12"}
@measuring:
{"return":{"status":"measuring","start-time":102931,"calc-time":1},"id":"libvirt-85"}
@measured:
{"return":{"status":"measured","dirty-rate":4,"start-time":150146,"calc-time":1},"id":"libvirt-15"}
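
From a consumer's point of view the change means the rate has to be checked
for presence rather than compared against -1.  A minimal sketch follows; the
struct is a hand-written, hypothetical mirror of the reply (only the
has_dirty_rate/dirty_rate pairing is taken from the code in this patch), and
`dr_get_rate()` is an illustrative helper, not an existing function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the three reported states. */
enum dr_status { DR_UNSTARTED, DR_MEASURING, DR_MEASURED };

/* Hypothetical mirror of the reply; dirty_rate is optional. */
struct dr_info {
    enum dr_status status;
    bool has_dirty_rate;
    int64_t dirty_rate;     /* MB/s, meaningful only if has_dirty_rate */
    int64_t start_time;
    int64_t calc_time;
};

/*
 * Store the rate in *rate_out and return true only when the field is
 * present; otherwise leave *rate_out untouched and return false.
 */
bool dr_get_rate(const struct dr_info *info, int64_t *rate_out)
{
    if (!info->has_dirty_rate) {
        return false;
    }
    *rate_out = info->dirty_rate;
    return true;
}
```

With the replies shown above, only the "measured" one would yield a rate; the
"unstarted" and "measuring" ones would make the helper return false.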

Signed-off-by: Chuan Zheng <zhengchuan@huawei.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <1601350938-128320-3-git-send-email-zhengchuan@huawei.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/dirtyrate.c | 3 +--
 qapi/migration.json   | 8 +++-----
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index 40e41e793e..ab9e1301f6 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -69,9 +69,8 @@ static struct DirtyRateInfo *query_dirty_rate_info(void)
     struct DirtyRateInfo *info = g_malloc0(sizeof(DirtyRateInfo));
 
     if (qatomic_read(&CalculatingState) == DIRTY_RATE_STATUS_MEASURED) {
+        info->has_dirty_rate = true;
         info->dirty_rate = dirty_rate;
-    } else {
-        info->dirty_rate = -1;
     }
 
     info->status = CalculatingState;
diff --git a/qapi/migration.json b/qapi/migration.json
index 7f5e6fd681..974021a5c8 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1743,10 +1743,8 @@
 #
 # Information about current dirty page rate of vm.
 #
-# @dirty-rate: @dirtyrate describing the dirty page rate of vm
-#              in units of MB/s.
-#              If this field returns '-1', it means querying has not
-#              yet started or completed.
+# @dirty-rate: an estimate of the dirty page rate of the VM in units of
+#              MB/s, present only when estimating the rate has completed.
 #
 # @status: status containing dirtyrate query status includes
 #          'unstarted' or 'measuring' or 'measured'
@@ -1759,7 +1757,7 @@
 #
 ##
 { 'struct': 'DirtyRateInfo',
-  'data': {'dirty-rate': 'int64',
+  'data': {'*dirty-rate': 'int64',
            'status': 'DirtyRateStatus',
            'start-time': 'int64',
            'calc-time': 'int64'} }
-- 
2.28.0




* Re: [PULL 00/10] migration queue
  2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
                   ` (9 preceding siblings ...)
  2020-10-07 15:56 ` [PULL 10/10] migration/dirtyrate: present dirty rate only when querying the rate has completed Dr. David Alan Gilbert (git)
@ 2020-10-08 16:18 ` Peter Maydell
  2020-10-08 16:31   ` Dr. David Alan Gilbert
                     ` (2 more replies)
  10 siblings, 3 replies; 18+ messages in thread
From: Peter Maydell @ 2020-10-08 16:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git)
  Cc: Juan Quintela, QEMU Developers, Peter Xu, zhengchuan,
	Stefan Hajnoczi, Alex Bennée

On Wed, 7 Oct 2020 at 17:06, Dr. David Alan Gilbert (git)
<dgilbert@redhat.com> wrote:
>
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> The following changes since commit f2687fdb7571a444b5af3509574b659d35ddd601:
>
>   Merge remote-tracking branch 'remotes/bonzini-gitlab/tags/for-upstream' into staging (2020-10-06 15:04:10 +0100)
>
> are available in the Git repository at:
>
>   git://github.com/dagrh/qemu.git tags/pull-migration-20201007b
>
> for you to fetch changes up to 1df31b8aca2aa4f83d5827d74700eeb6d711bbdf:
>
>   migration/dirtyrate: present dirty rate only when querying the rate has completed (2020-10-07 16:49:26 +0100)
>
> ----------------------------------------------------------------
> Migration and virtiofs pull 2020-07-10
>
> Migration:
>   Dirtyrate measurement API cleanup
>   Postcopy recovery fixes
>
> Virtiofsd:
>   Missing qemu_init_exec_dir call
>   Support for setting the group on socket creation
>   Stop a gcc warning
>   Avoid tempdir in sandboxing

Compile failure, windows crossbuilds:

../../migration/migration.c: In function 'page_request_addr_cmp':
../../migration/migration.c:148:23: error: cast from pointer to
integer of different size [-Werror=pointer-to-int-cast]
     unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
                       ^
../../migration/migration.c:148:47: error: cast from pointer to
integer of different size [-Werror=pointer-to-int-cast]
     unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
                                               ^

thanks
-- PMM



* Re: [PULL 00/10] migration queue
  2020-10-08 16:18 ` [PULL 00/10] migration queue Peter Maydell
@ 2020-10-08 16:31   ` Dr. David Alan Gilbert
  2020-10-08 17:09   ` Eric Blake
  2020-10-08 18:51   ` Peter Xu
  2 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert @ 2020-10-08 16:31 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Juan Quintela, QEMU Developers, Peter Xu, zhengchuan,
	Stefan Hajnoczi, Alex Bennée

* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Wed, 7 Oct 2020 at 17:06, Dr. David Alan Gilbert (git)
> <dgilbert@redhat.com> wrote:
> >
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> > The following changes since commit f2687fdb7571a444b5af3509574b659d35ddd601:
> >
> >   Merge remote-tracking branch 'remotes/bonzini-gitlab/tags/for-upstream' into staging (2020-10-06 15:04:10 +0100)
> >
> > are available in the Git repository at:
> >
> >   git://github.com/dagrh/qemu.git tags/pull-migration-20201007b
> >
> > for you to fetch changes up to 1df31b8aca2aa4f83d5827d74700eeb6d711bbdf:
> >
> >   migration/dirtyrate: present dirty rate only when querying the rate has completed (2020-10-07 16:49:26 +0100)
> >
> > ----------------------------------------------------------------
> > Migration and virtiofs pull 2020-07-10
> >
> > Migration:
> >   Dirtyrate measurement API cleanup
> >   Postcopy recovery fixes
> >
> > Virtiofsd:
> >   Missing qemu_init_exec_dir call
> >   Support for setting the group on socket creation
> >   Stop a gcc warning
> >   Avoid tempdir in sandboxing
> 
> Compile failure, windows crossbuilds:
> 
> ../../migration/migration.c: In function 'page_request_addr_cmp':
> ../../migration/migration.c:148:23: error: cast from pointer to
> integer of different size [-Werror=pointer-to-int-cast]
>      unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
>                        ^
> ../../migration/migration.c:148:47: error: cast from pointer to
> integer of different size [-Werror=pointer-to-int-cast]
>      unsigned long a = (unsigned long) ap, b = (unsigned long) bp;

Sorry about that; I'll see if we can fix it.

Dave

> 
> thanks
> -- PMM
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PULL 00/10] migration queue
  2020-10-08 16:18 ` [PULL 00/10] migration queue Peter Maydell
  2020-10-08 16:31   ` Dr. David Alan Gilbert
@ 2020-10-08 17:09   ` Eric Blake
  2020-10-08 17:43     ` Peter Xu
  2020-10-08 18:51   ` Peter Xu
  2 siblings, 1 reply; 18+ messages in thread
From: Eric Blake @ 2020-10-08 17:09 UTC (permalink / raw)
  To: Peter Maydell, Dr. David Alan Gilbert (git)
  Cc: Juan Quintela, QEMU Developers, Peter Xu, zhengchuan,
	Stefan Hajnoczi, Alex Bennée


[-- Attachment #1.1: Type: text/plain, Size: 779 bytes --]

On 10/8/20 11:18 AM, Peter Maydell wrote:

> 
> Compile failure, windows crossbuilds:
> 
> ../../migration/migration.c: In function 'page_request_addr_cmp':
> ../../migration/migration.c:148:23: error: cast from pointer to
> integer of different size [-Werror=pointer-to-int-cast]
>      unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
>                        ^

'unsigned long' is platform specific; so is uintptr_t, but it may fit
more naturally.  Or maybe you are better off with a specific 32- or
64-bit type, but even so, may need a double cast (first to uintptr_t
then to your real target) to shut up warnings?
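
For reference, a minimal sketch of an address comparator that goes through
uintptr_t.  The name `addr_cmp()` is made up for this example.  The background
is that on LLP64 targets such as 64-bit Windows, unsigned long is 32 bits wide
while pointers are 64 bits, so the direct cast to unsigned long truncates and
triggers the warning above.

```c
#include <stdint.h>

/*
 * Three-way comparison of two addresses.  uintptr_t is wide enough to hold
 * a converted object pointer, so nothing is truncated.  The result is -1, 0
 * or 1, computed without subtracting the two values (a subtraction could
 * overflow or wrap).
 */
int addr_cmp(const void *ap, const void *bp)
{
    uintptr_t a = (uintptr_t)ap, b = (uintptr_t)bp;

    return (a > b) - (a < b);
}
```

The `(a > b) - (a < b)` form keeps the result within the range of int
whatever the width of the address type is.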

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PULL 00/10] migration queue
  2020-10-08 17:09   ` Eric Blake
@ 2020-10-08 17:43     ` Peter Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2020-10-08 17:43 UTC (permalink / raw)
  To: Eric Blake
  Cc: Peter Maydell, Juan Quintela, QEMU Developers,
	Dr. David Alan Gilbert (git),
	zhengchuan, Stefan Hajnoczi, Alex Bennée

On Thu, Oct 08, 2020 at 12:09:15PM -0500, Eric Blake wrote:
> On 10/8/20 11:18 AM, Peter Maydell wrote:
> 
> > 
> > Compile failure, windows crossbuilds:
> > 
> > ../../migration/migration.c: In function 'page_request_addr_cmp':
> > ../../migration/migration.c:148:23: error: cast from pointer to
> > integer of different size [-Werror=pointer-to-int-cast]
> >      unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
> >                        ^
> 
> 'unsigned long' is platform specific; so is uintptr_t, but it may fit
> more naturally.  Or maybe you are better off with a specific 32- or
> 64-bit type, but even so, may need a double cast (first to uintptr_t
> then to your real target) to shut up warnings?

Sorry for that.

When I was initially trying to fix the 32bit build failure I did use a double
cast, but I (obviously, wrongly) thought sizeof(unsigned long) should always be
the same as sizeof(void *), so I explicitly removed it, since at least my 32bit
compile didn't complain.

But Windows/mingw is evidently different there.

I'll find a mingw environment soon and verify.  It will take a bit more time
because docker stopped working on my current host due to cgroup versions, but
it shouldn't be long.

Thanks,

-- 
Peter Xu




* Re: [PULL 00/10] migration queue
  2020-10-08 16:18 ` [PULL 00/10] migration queue Peter Maydell
  2020-10-08 16:31   ` Dr. David Alan Gilbert
  2020-10-08 17:09   ` Eric Blake
@ 2020-10-08 18:51   ` Peter Xu
  2020-10-08 19:09     ` Dr. David Alan Gilbert
  2 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2020-10-08 18:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Maydell, Juan Quintela, Dr. David Alan Gilbert (git),
	QEMU Developers, zhengchuan, Stefan Hajnoczi, Alex Bennée

[-- Attachment #1: Type: text/plain, Size: 1282 bytes --]

On Thu, Oct 08, 2020 at 05:18:14PM +0100, Peter Maydell wrote:
> Compile failure, windows crossbuilds:
> 
> ../../migration/migration.c: In function 'page_request_addr_cmp':
> ../../migration/migration.c:148:23: error: cast from pointer to
> integer of different size [-Werror=pointer-to-int-cast]
>      unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
>                        ^
> ../../migration/migration.c:148:47: error: cast from pointer to
> integer of different size [-Werror=pointer-to-int-cast]
>      unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
>                                                ^

Below squashed into "migration: Maintain postcopy faulted addresses" should fix
the problem:

------8<------
diff --git a/migration/migration.c b/migration/migration.c
index e7d179bffc..d8a5c0de44 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -145,7 +145,7 @@ static void migrate_fd_cancel(MigrationState *s);
 
 static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
 {
-    unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
+    uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
 
     return (a > b) - (a < b);
 }
------8<------

A whole replacement patch attached too.

Thanks!

-- 
Peter Xu

[-- Attachment #2: 0001-migration-Maintain-postcopy-faulted-addresses.patch --]
[-- Type: text/plain, Size: 9841 bytes --]

From 6d5708092466abb2ed683f9ef24021e2edf25872 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 28 Aug 2020 18:02:14 -0400
Subject: [PATCH] migration: Maintain postcopy faulted addresses

Maintain a list of faulted addresses on the destination host that we're
waiting on.  This is implemented using a GTree rather than a real list so that,
even when plenty of vCPUs/threads are faulting, the lookup will still be fast
at O(log(N)) (because we'll do that after placing each page).  It should bring
a slight overhead, but ideally that shouldn't be a big problem simply because
in most cases the requested page list will be short.

Actually we did similar things for postcopy blocktime measurements.  This patch
doesn't reuse that simply because:

  (1) blocktime measurement covers vcpu threads only, but here we need to
      record all faulted addresses, including those from the main thread and
      external threads (e.g., DPDK via vhost-user).

  (2) blocktime measurement requires UFFD_FEATURE_THREAD_ID, but here we
      don't want to add that extra dependency on the kernel version since it
      is not necessary.  E.g., we don't need to know which thread faulted on
      which page, and we don't care about multiple threads faulting on the
      same page.  We only care about which addresses have faulted and are
      therefore waiting for a page to be copied from the source.

  (3) blocktime measurement is not enabled by default.  However we need this by
      default, especially for postcopy recovery.

Another thing to mention is that this patch introduces a new mutex to serialize
the receivedmap and the page_requested tree; however, that serialization does
not cover other procedures like UFFDIO_COPY.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c    | 41 +++++++++++++++++++++++++++++++++++++++-
 migration/migration.h    | 19 ++++++++++++++++++-
 migration/postcopy-ram.c | 17 ++++++++++++++---
 migration/trace-events   |  2 ++
 4 files changed, 74 insertions(+), 5 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index b2dac6b39c..d8a5c0de44 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -143,6 +143,13 @@ static int migration_maybe_pause(MigrationState *s,
                                  int new_state);
 static void migrate_fd_cancel(MigrationState *s);
 
+static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
+{
+    uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
+
+    return (a > b) - (a < b);
+}
+
 void migration_object_init(void)
 {
     MachineState *ms = MACHINE(qdev_get_machine());
@@ -165,6 +172,8 @@ void migration_object_init(void)
     qemu_event_init(&current_incoming->main_thread_load_event, false);
     qemu_sem_init(&current_incoming->postcopy_pause_sem_dst, 0);
     qemu_sem_init(&current_incoming->postcopy_pause_sem_fault, 0);
+    qemu_mutex_init(&current_incoming->page_request_mutex);
+    current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
 
     if (!migration_object_check(current_migration, &err)) {
         error_report_err(err);
@@ -240,6 +249,11 @@ void migration_incoming_state_destroy(void)
 
     qemu_event_reset(&mis->main_thread_load_event);
 
+    if (mis->page_requested) {
+        g_tree_destroy(mis->page_requested);
+        mis->page_requested = NULL;
+    }
+
     if (mis->socket_address_list) {
         qapi_free_SocketAddressList(mis->socket_address_list);
         mis->socket_address_list = NULL;
@@ -354,8 +368,33 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
 }
 
 int migrate_send_rp_req_pages(MigrationIncomingState *mis,
-                              RAMBlock *rb, ram_addr_t start)
+                              RAMBlock *rb, ram_addr_t start, uint64_t haddr)
 {
+    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_target_page_size()));
+    bool received;
+
+    WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
+        received = ramblock_recv_bitmap_test_byte_offset(rb, start);
+        if (!received && !g_tree_lookup(mis->page_requested, aligned)) {
+            /*
+             * The page has not been received, and it's not yet in the page
+             * request list.  Queue it.  Set the value of element to 1, so that
+             * things like g_tree_lookup() will return TRUE (1) when found.
+             */
+            g_tree_insert(mis->page_requested, aligned, (gpointer)1);
+            mis->page_requested_count++;
+            trace_postcopy_page_req_add(aligned, mis->page_requested_count);
+        }
+    }
+
+    /*
+     * If the page is there, skip sending the message.  We don't even need the
+     * lock because as long as the page arrived, it'll be there forever.
+     */
+    if (received) {
+        return 0;
+    }
+
     return migrate_send_rp_message_req_pages(mis, rb, start);
 }
 
diff --git a/migration/migration.h b/migration/migration.h
index e853ccf8b1..8d2d1ce839 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -104,6 +104,23 @@ struct MigrationIncomingState {
 
     /* List of listening socket addresses  */
     SocketAddressList *socket_address_list;
+
+    /* A tree of pages that we requested to the source VM */
+    GTree *page_requested;
+    /* For debugging purpose only, but would be nice to keep */
+    int page_requested_count;
+    /*
+     * The mutex helps to maintain the requested pages that we sent to the
+     * source, IOW, to guarantee coherent between the page_requests tree and
+     * the per-ramblock receivedmap.  Note! This does not guarantee consistency
+     * of the real page copy procedures (using UFFDIO_[ZERO]COPY).  E.g., even
+     * if one bit in receivedmap is cleared, UFFDIO_COPY could have happened
+     * for that page already.  This is intended so that the mutex won't
+     * serialize and blocked by slow operations like UFFDIO_* ioctls.  However
+     * this should be enough to make sure the page_requested tree always
+     * contains valid information.
+     */
+    QemuMutex page_request_mutex;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
@@ -332,7 +349,7 @@ void migrate_send_rp_shut(MigrationIncomingState *mis,
 void migrate_send_rp_pong(MigrationIncomingState *mis,
                           uint32_t value);
 int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb,
-                              ram_addr_t start);
+                              ram_addr_t start, uint64_t haddr);
 int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
                                       RAMBlock *rb, ram_addr_t start);
 void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 722034dc01..ca1daf0024 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -684,7 +684,7 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                         qemu_ram_get_idstr(rb), rb_offset);
         return postcopy_wake_shared(pcfd, client_addr, rb);
     }
-    migrate_send_rp_req_pages(mis, rb, aligned_rbo);
+    migrate_send_rp_req_pages(mis, rb, aligned_rbo, client_addr);
     return 0;
 }
 
@@ -979,7 +979,8 @@ retry:
              * Send the request to the source - we want to request one
              * of our host page sizes (which is >= TPS)
              */
-            ret = migrate_send_rp_req_pages(mis, rb, rb_offset);
+            ret = migrate_send_rp_req_pages(mis, rb, rb_offset,
+                                            msg.arg.pagefault.address);
             if (ret) {
                 /* May be network failure, try to wait for recovery */
                 if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
@@ -1149,10 +1150,20 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
         ret = ioctl(userfault_fd, UFFDIO_ZEROPAGE, &zero_struct);
     }
     if (!ret) {
+        qemu_mutex_lock(&mis->page_request_mutex);
         ramblock_recv_bitmap_set_range(rb, host_addr,
                                        pagesize / qemu_target_page_size());
+        /*
+         * If this page resolves a page fault for a previous recorded faulted
+         * address, take a special note to maintain the requested page list.
+         */
+        if (g_tree_lookup(mis->page_requested, host_addr)) {
+            g_tree_remove(mis->page_requested, host_addr);
+            mis->page_requested_count--;
+            trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
+        }
+        qemu_mutex_unlock(&mis->page_request_mutex);
         mark_postcopy_blocktime_end((uintptr_t)host_addr);
-
     }
     return ret;
 }
diff --git a/migration/trace-events b/migration/trace-events
index 338f38b3dd..e4d5eb94ca 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -162,6 +162,7 @@ postcopy_pause_return_path(void) ""
 postcopy_pause_return_path_continued(void) ""
 postcopy_pause_continued(void) ""
 postcopy_start_set_run(void) ""
+postcopy_page_req_add(void *addr, int count) "new page req %p total %d"
 source_return_path_thread_bad_end(void) ""
 source_return_path_thread_end(void) ""
 source_return_path_thread_entry(void) ""
@@ -272,6 +273,7 @@ postcopy_ram_incoming_cleanup_blocktime(uint64_t total) "total blocktime %" PRIu
 postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
 postcopy_request_shared_page_present(const char *sharer, const char *rb, uint64_t rb_offset) "%s already %s offset 0x%"PRIx64
 postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
+postcopy_page_req_del(void *addr, int count) "resolved page req %p total %d"
 
 get_mem_fault_cpu_index(int cpu, uint32_t pid) "cpu: %d, pid: %u"
 
-- 
2.26.2



* Re: [PULL 00/10] migration queue
  2020-10-08 18:51   ` Peter Xu
@ 2020-10-08 19:09     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert @ 2020-10-08 19:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: Peter Maydell, Juan Quintela, QEMU Developers, zhengchuan,
	Stefan Hajnoczi, Alex Bennée

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Oct 08, 2020 at 05:18:14PM +0100, Peter Maydell wrote:
> > Compile failure, windows crossbuilds:
> > 
> > ../../migration/migration.c: In function 'page_request_addr_cmp':
> > ../../migration/migration.c:148:23: error: cast from pointer to
> > integer of different size [-Werror=pointer-to-int-cast]
> >      unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
> >                        ^
> > ../../migration/migration.c:148:47: error: cast from pointer to
> > integer of different size [-Werror=pointer-to-int-cast]
> >      unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
> >                                                ^
> 
> Below squashed into "migration: Maintain postcopy faulted addresses" should fix
> the problem:
> 
> ------8<------
> diff --git a/migration/migration.c b/migration/migration.c
> index e7d179bffc..d8a5c0de44 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -145,7 +145,7 @@ static void migrate_fd_cancel(MigrationState *s);
>  
>  static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
>  {
> -    unsigned long a = (unsigned long) ap, b = (unsigned long) bp;
> +    uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
>  
>      return (a > b) - (a < b);
>  }
> ------8<------
> 
> A whole replacement patch attached too.
> 
> Thanks!
> 
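
The cast in the fix quoted above matters because on 64-bit Windows targets `unsigned long` is 32 bits wide while pointers are 64 bits, so the original cast would discard the upper half of the address. A minimal standalone sketch of the same three-way comparison follows. It uses plain `const void *` instead of GLib's `gconstpointer` so that it builds without GLib, and `addr_cmp` and `addr_cmp_selfcheck` are hypothetical names for illustration, not functions from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Three-way comparison of two addresses, in the style of a GLib
 * GCompareFunc.  uintptr_t is wide enough to hold any object pointer,
 * so no address bits are lost in the cast.
 */
static int addr_cmp(const void *ap, const void *bp)
{
    uintptr_t a = (uintptr_t)ap, b = (uintptr_t)bp;

    /* Yields -1, 0 or 1 without the overflow risk of computing a - b. */
    return (a > b) - (a < b);
}

/* Small self-check covering the ordering cases, including a high address. */
static void addr_cmp_selfcheck(void)
{
    const void *lo = (const void *)(uintptr_t)1;
    const void *hi = (const void *)UINTPTR_MAX;

    assert(addr_cmp(lo, hi) == -1);
    assert(addr_cmp(hi, lo) == 1);
    assert(addr_cmp(hi, hi) == 0);
}
```

The `(a > b) - (a < b)` form is kept from the patch: subtracting two wide unsigned values and truncating the result to `int` could report the wrong sign, whereas the two comparisons cannot.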


> -- 
> Peter Xu

> From 6d5708092466abb2ed683f9ef24021e2edf25872 Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Fri, 28 Aug 2020 18:02:14 -0400
> Subject: [PATCH] migration: Maintain postcopy faulted addresses
> 
> Maintain a list of faulted addresses on the destination host that we're
> waiting on.  This is implemented using a GTree rather than a real list so
> that, even if plenty of vCPUs/threads are faulting, the lookup will still be
> fast at O(log(N)) (we do that lookup after placing each page).
> It should bring a slight overhead, but ideally that shouldn't be a big problem
> simply because in most cases the requested page list will be short.
> 
> Actually we did similar things for postcopy blocktime measurements.  This patch
> doesn't reuse that simply because:
> 
>   (1) blocktime measurement covers vCPU threads only, but here we need to
>       record all faulted addresses, including those from the main thread and
>       external threads (e.g., DPDK via vhost-user).
> 
>   (2) blocktime measurement requires UFFD_FEATURE_THREAD_ID, but here we
>       don't want to add that extra dependency on the kernel version since it
>       is not necessary.  E.g., we don't need to know which thread faulted on
>       which page, and we don't care about multiple threads faulting on the
>       same page.  We only care about which addresses have faulted and are
>       therefore waiting for a page to be copied from the source.
> 
>   (3) blocktime measurement is not enabled by default.  However, we need this
>       by default, especially for postcopy recovery.
> 
> Another thing to mention is that this patch introduces a new mutex to serialize
> the receivedmap and the page_requested tree; however, that serialization does
> not cover other procedures like UFFDIO_COPY.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Respin with that replacing the original coming up.

Dave
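
For readers following the quoted commit message, the bookkeeping it describes can be sketched roughly as below. This is a simplified, single-threaded illustration and not the code from the patch: it swaps the GTree for a small fixed-size array, leaves out the `page_request_mutex`, and every name in it (`ReqSet`, `req_on_fault`, `req_on_placed`, `MAX_REQS`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REQS 64                 /* hypothetical cap, for the sketch only */

typedef struct {
    uintptr_t addrs[MAX_REQS];      /* page-aligned addresses still awaited */
    size_t count;
} ReqSet;

/* Round an address down to its page; page_size must be a power of two. */
static uintptr_t page_align(uintptr_t addr, uintptr_t page_size)
{
    assert(page_size && (page_size & (page_size - 1)) == 0);
    return addr & ~(page_size - 1);
}

static bool req_find(const ReqSet *s, uintptr_t page, size_t *idx)
{
    for (size_t i = 0; i < s->count; i++) {
        if (s->addrs[i] == page) {
            if (idx) {
                *idx = i;
            }
            return true;
        }
    }
    return false;
}

/*
 * Called on a page fault.  Returns true when a request should be sent to
 * the source.  The page is recorded at most once; an already-received
 * page is neither recorded nor requested.
 */
static bool req_on_fault(ReqSet *s, uintptr_t fault_addr,
                         uintptr_t page_size, bool received)
{
    uintptr_t page = page_align(fault_addr, page_size);

    if (received) {
        return false;
    }
    if (!req_find(s, page, NULL) && s->count < MAX_REQS) {
        s->addrs[s->count++] = page;
    }
    return true;
}

/* Called once a page has been placed: drop it from the pending set. */
static void req_on_placed(ReqSet *s, uintptr_t page)
{
    size_t i;

    if (req_find(s, page, &i)) {
        s->addrs[i] = s->addrs[--s->count];
    }
}
```

In this sketch two faults at 0x1234 and 0x1ff0 with a 0x1000 page size both map to page 0x1000, so the set holds one entry until `req_on_placed()` removes it. The linear scan is only acceptable because the sketch is tiny; the patch uses a tree precisely to keep lookups at O(log(N)) when many threads fault.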

> ---
>  migration/migration.c    | 41 +++++++++++++++++++++++++++++++++++++++-
>  migration/migration.h    | 19 ++++++++++++++++++-
>  migration/postcopy-ram.c | 17 ++++++++++++++---
>  migration/trace-events   |  2 ++
>  4 files changed, 74 insertions(+), 5 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index b2dac6b39c..d8a5c0de44 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -143,6 +143,13 @@ static int migration_maybe_pause(MigrationState *s,
>                                   int new_state);
>  static void migrate_fd_cancel(MigrationState *s);
>  
> +static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
> +{
> +    uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
> +
> +    return (a > b) - (a < b);
> +}
> +
>  void migration_object_init(void)
>  {
>      MachineState *ms = MACHINE(qdev_get_machine());
> @@ -165,6 +172,8 @@ void migration_object_init(void)
>      qemu_event_init(&current_incoming->main_thread_load_event, false);
>      qemu_sem_init(&current_incoming->postcopy_pause_sem_dst, 0);
>      qemu_sem_init(&current_incoming->postcopy_pause_sem_fault, 0);
> +    qemu_mutex_init(&current_incoming->page_request_mutex);
> +    current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
>  
>      if (!migration_object_check(current_migration, &err)) {
>          error_report_err(err);
> @@ -240,6 +249,11 @@ void migration_incoming_state_destroy(void)
>  
>      qemu_event_reset(&mis->main_thread_load_event);
>  
> +    if (mis->page_requested) {
> +        g_tree_destroy(mis->page_requested);
> +        mis->page_requested = NULL;
> +    }
> +
>      if (mis->socket_address_list) {
>          qapi_free_SocketAddressList(mis->socket_address_list);
>          mis->socket_address_list = NULL;
> @@ -354,8 +368,33 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>  }
>  
>  int migrate_send_rp_req_pages(MigrationIncomingState *mis,
> -                              RAMBlock *rb, ram_addr_t start)
> +                              RAMBlock *rb, ram_addr_t start, uint64_t haddr)
>  {
> +    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_target_page_size()));
> +    bool received;
> +
> +    WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
> +        received = ramblock_recv_bitmap_test_byte_offset(rb, start);
> +        if (!received && !g_tree_lookup(mis->page_requested, aligned)) {
> +            /*
> +             * The page has not been received, and it's not yet in the page
> +             * request list.  Queue it.  Set the value of element to 1, so that
> +             * things like g_tree_lookup() will return TRUE (1) when found.
> +             */
> +            g_tree_insert(mis->page_requested, aligned, (gpointer)1);
> +            mis->page_requested_count++;
> +            trace_postcopy_page_req_add(aligned, mis->page_requested_count);
> +        }
> +    }
> +
> +    /*
> +     * If the page is there, skip sending the message.  We don't even need the
> +     * lock because as long as the page arrived, it'll be there forever.
> +     */
> +    if (received) {
> +        return 0;
> +    }
> +
>      return migrate_send_rp_message_req_pages(mis, rb, start);
>  }
>  
> diff --git a/migration/migration.h b/migration/migration.h
> index e853ccf8b1..8d2d1ce839 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -104,6 +104,23 @@ struct MigrationIncomingState {
>  
>      /* List of listening socket addresses  */
>      SocketAddressList *socket_address_list;
> +
> +    /* A tree of pages that we requested to the source VM */
> +    GTree *page_requested;
> +    /* For debugging purpose only, but would be nice to keep */
> +    int page_requested_count;
> +    /*
> +     * The mutex helps to maintain the requested pages that we sent to the
> +     * source, IOW, to guarantee coherent between the page_requests tree and
> +     * the per-ramblock receivedmap.  Note! This does not guarantee consistency
> +     * of the real page copy procedures (using UFFDIO_[ZERO]COPY).  E.g., even
> +     * if one bit in receivedmap is cleared, UFFDIO_COPY could have happened
> +     * for that page already.  This is intended so that the mutex won't
> +     * serialize and blocked by slow operations like UFFDIO_* ioctls.  However
> +     * this should be enough to make sure the page_requested tree always
> +     * contains valid information.
> +     */
> +    QemuMutex page_request_mutex;
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> @@ -332,7 +349,7 @@ void migrate_send_rp_shut(MigrationIncomingState *mis,
>  void migrate_send_rp_pong(MigrationIncomingState *mis,
>                            uint32_t value);
>  int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb,
> -                              ram_addr_t start);
> +                              ram_addr_t start, uint64_t haddr);
>  int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>                                        RAMBlock *rb, ram_addr_t start);
>  void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 722034dc01..ca1daf0024 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -684,7 +684,7 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
>                                          qemu_ram_get_idstr(rb), rb_offset);
>          return postcopy_wake_shared(pcfd, client_addr, rb);
>      }
> -    migrate_send_rp_req_pages(mis, rb, aligned_rbo);
> +    migrate_send_rp_req_pages(mis, rb, aligned_rbo, client_addr);
>      return 0;
>  }
>  
> @@ -979,7 +979,8 @@ retry:
>               * Send the request to the source - we want to request one
>               * of our host page sizes (which is >= TPS)
>               */
> -            ret = migrate_send_rp_req_pages(mis, rb, rb_offset);
> +            ret = migrate_send_rp_req_pages(mis, rb, rb_offset,
> +                                            msg.arg.pagefault.address);
>              if (ret) {
>                  /* May be network failure, try to wait for recovery */
>                  if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
> @@ -1149,10 +1150,20 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
>          ret = ioctl(userfault_fd, UFFDIO_ZEROPAGE, &zero_struct);
>      }
>      if (!ret) {
> +        qemu_mutex_lock(&mis->page_request_mutex);
>          ramblock_recv_bitmap_set_range(rb, host_addr,
>                                         pagesize / qemu_target_page_size());
> +        /*
> +         * If this page resolves a page fault for a previous recorded faulted
> +         * address, take a special note to maintain the requested page list.
> +         */
> +        if (g_tree_lookup(mis->page_requested, host_addr)) {
> +            g_tree_remove(mis->page_requested, host_addr);
> +            mis->page_requested_count--;
> +            trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
> +        }
> +        qemu_mutex_unlock(&mis->page_request_mutex);
>          mark_postcopy_blocktime_end((uintptr_t)host_addr);
> -
>      }
>      return ret;
>  }
> diff --git a/migration/trace-events b/migration/trace-events
> index 338f38b3dd..e4d5eb94ca 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -162,6 +162,7 @@ postcopy_pause_return_path(void) ""
>  postcopy_pause_return_path_continued(void) ""
>  postcopy_pause_continued(void) ""
>  postcopy_start_set_run(void) ""
> +postcopy_page_req_add(void *addr, int count) "new page req %p total %d"
>  source_return_path_thread_bad_end(void) ""
>  source_return_path_thread_end(void) ""
>  source_return_path_thread_entry(void) ""
> @@ -272,6 +273,7 @@ postcopy_ram_incoming_cleanup_blocktime(uint64_t total) "total blocktime %" PRIu
>  postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
>  postcopy_request_shared_page_present(const char *sharer, const char *rb, uint64_t rb_offset) "%s already %s offset 0x%"PRIx64
>  postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
> +postcopy_page_req_del(void *addr, int count) "resolved page req %p total %d"
>  
>  get_mem_fault_cpu_index(int cpu, uint32_t pid) "cpu: %d, pid: %u"
>  
> -- 
> 2.26.2
> 

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* [PULL 10/10] migration/dirtyrate: present dirty rate only when querying the rate has completed
  2020-10-08 19:10 Dr. David Alan Gilbert (git)
@ 2020-10-08 19:10 ` Dr. David Alan Gilbert (git)
  0 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2020-10-08 19:10 UTC (permalink / raw)
  To: qemu-devel, alex.bennee, zhengchuan, stefanha, peterx; +Cc: quintela

From: Chuan Zheng <zhengchuan@huawei.com>

Make the dirty_rate field optional, and present the dirty rate only when
querying the rate has completed.
The QMP results are shown as follows:
@unstarted:
{"return":{"status":"unstarted","start-time":0,"calc-time":0},"id":"libvirt-12"}
@measuring:
{"return":{"status":"measuring","start-time":102931,"calc-time":1},"id":"libvirt-85"}
@measured:
{"return":{"status":"measured","dirty-rate":4,"start-time":150146,"calc-time":1},"id":"libvirt-15"}

Signed-off-by: Chuan Zheng <zhengchuan@huawei.com>
Reviewed-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <1601350938-128320-3-git-send-email-zhengchuan@huawei.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/dirtyrate.c | 3 +--
 qapi/migration.json   | 8 +++-----
 2 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index 40e41e793e..ab9e1301f6 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -69,9 +69,8 @@ static struct DirtyRateInfo *query_dirty_rate_info(void)
     struct DirtyRateInfo *info = g_malloc0(sizeof(DirtyRateInfo));
 
     if (qatomic_read(&CalculatingState) == DIRTY_RATE_STATUS_MEASURED) {
+        info->has_dirty_rate = true;
         info->dirty_rate = dirty_rate;
-    } else {
-        info->dirty_rate = -1;
     }
 
     info->status = CalculatingState;
diff --git a/qapi/migration.json b/qapi/migration.json
index 7f5e6fd681..974021a5c8 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1743,10 +1743,8 @@
 #
 # Information about current dirty page rate of vm.
 #
-# @dirty-rate: @dirtyrate describing the dirty page rate of vm
-#              in units of MB/s.
-#              If this field returns '-1', it means querying has not
-#              yet started or completed.
+# @dirty-rate: an estimate of the dirty page rate of the VM in units of
+#              MB/s, present only when estimating the rate has completed.
 #
 # @status: status containing dirtyrate query status includes
 #          'unstarted' or 'measuring' or 'measured'
@@ -1759,7 +1757,7 @@
 #
 ##
 { 'struct': 'DirtyRateInfo',
-  'data': {'dirty-rate': 'int64',
+  'data': {'*dirty-rate': 'int64',
            'status': 'DirtyRateStatus',
            'start-time': 'int64',
            'calc-time': 'int64'} }
-- 
2.28.0
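
The optional-field behaviour in this patch can be illustrated with a small standalone sketch. It assumes nothing about the QAPI-generated code beyond the `has_dirty_rate` flag visible in the diff; `RateInfo`, `RateStatus`, `make_info` and `format_info` are hypothetical stand-ins, and the hand-written formatter only imitates what a real JSON visitor would emit.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the QAPI-generated types. */
typedef enum { ST_UNSTARTED, ST_MEASURING, ST_MEASURED } RateStatus;

typedef struct {
    bool has_dirty_rate;    /* true only once a measurement has completed */
    int64_t dirty_rate;     /* MB/s; meaningful only if has_dirty_rate */
    RateStatus status;
    int64_t start_time;
    int64_t calc_time;
} RateInfo;

static const char *status_name(RateStatus s)
{
    switch (s) {
    case ST_MEASURING: return "measuring";
    case ST_MEASURED:  return "measured";
    default:           return "unstarted";
    }
}

/* Build the reply; the rate is attached only in the measured state. */
static RateInfo make_info(RateStatus s, int64_t rate,
                          int64_t start, int64_t calc)
{
    RateInfo info;

    memset(&info, 0, sizeof(info));     /* optional field starts absent */
    if (s == ST_MEASURED) {
        info.has_dirty_rate = true;
        info.dirty_rate = rate;
    }
    info.status = s;
    info.start_time = start;
    info.calc_time = calc;
    return info;
}

/* Serialise, omitting "dirty-rate" entirely when it is absent. */
static int format_info(const RateInfo *info, char *buf, size_t len)
{
    int n, m;

    assert(info && buf && len > 0);
    n = snprintf(buf, len, "{\"status\":\"%s\"", status_name(info->status));
    if (n < 0 || (size_t)n >= len) {
        return -1;
    }
    if (info->has_dirty_rate) {
        m = snprintf(buf + n, len - (size_t)n,
                     ",\"dirty-rate\":%" PRId64, info->dirty_rate);
        if (m < 0 || (size_t)m >= len - (size_t)n) {
            return -1;
        }
        n += m;
    }
    m = snprintf(buf + n, len - (size_t)n,
                 ",\"start-time\":%" PRId64 ",\"calc-time\":%" PRId64 "}",
                 info->start_time, info->calc_time);
    if (m < 0 || (size_t)m >= len - (size_t)n) {
        return -1;
    }
    return n + m;
}
```

Dropping the field instead of returning a sentinel such as -1 means a client cannot mistake "not measured yet" for a real rate; it has to check for the key's presence or look at `status`.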




end of thread, other threads:[~2020-10-08 19:21 UTC | newest]

Thread overview: 18+ messages
2020-10-07 15:55 [PULL 00/10] migration queue Dr. David Alan Gilbert (git)
2020-10-07 15:55 ` [PULL 01/10] virtiofsd: Silence gcc warning Dr. David Alan Gilbert (git)
2020-10-07 15:55 ` [PULL 02/10] tools/virtiofsd: add support for --socket-group Dr. David Alan Gilbert (git)
2020-10-07 15:55 ` [PULL 03/10] virtiofsd: Call qemu_init_exec_dir Dr. David Alan Gilbert (git)
2020-10-07 15:55 ` [PULL 04/10] virtiofsd: avoid /proc/self/fd tempdir Dr. David Alan Gilbert (git)
2020-10-07 15:55 ` [PULL 05/10] migration: Pass incoming state into qemu_ufd_copy_ioctl() Dr. David Alan Gilbert (git)
2020-10-07 15:55 ` [PULL 06/10] migration: Introduce migrate_send_rp_message_req_pages() Dr. David Alan Gilbert (git)
2020-10-07 15:55 ` [PULL 07/10] migration: Maintain postcopy faulted addresses Dr. David Alan Gilbert (git)
2020-10-07 15:55 ` [PULL 08/10] migration: Sync requested pages after postcopy recovery Dr. David Alan Gilbert (git)
2020-10-07 15:55 ` [PULL 09/10] migration/dirtyrate: record start_time and calc_time while at the measuring state Dr. David Alan Gilbert (git)
2020-10-07 15:56 ` [PULL 10/10] migration/dirtyrate: present dirty rate only when querying the rate has completed Dr. David Alan Gilbert (git)
2020-10-08 16:18 ` [PULL 00/10] migration queue Peter Maydell
2020-10-08 16:31   ` Dr. David Alan Gilbert
2020-10-08 17:09   ` Eric Blake
2020-10-08 17:43     ` Peter Xu
2020-10-08 18:51   ` Peter Xu
2020-10-08 19:09     ` Dr. David Alan Gilbert
2020-10-08 19:10 Dr. David Alan Gilbert (git)
2020-10-08 19:10 ` [PULL 10/10] migration/dirtyrate: present dirty rate only when querying the rate has completed Dr. David Alan Gilbert (git)
