* [PATCH V3 00/22] Live Update
@ 2021-05-07 12:24 Steve Sistare
  2021-05-07 12:24 ` [PATCH V3 01/22] as_flat_walk Steve Sistare
                   ` (24 more replies)
  0 siblings, 25 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Provide the cprsave and cprload commands for live update.  These save and
restore VM state, with minimal guest pause time, so that qemu may be updated
to a new version in between.

cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
/usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
paused state and waits for the cprload command.

To use the restart mode, qemu must be started with the memfd-alloc option,
which allocates guest ram using memfd_create.  The memfd's are saved to
the environment and kept open across exec, after which they are found from
the environment and re-mmap'd.  Hence guest ram is preserved in place,
albeit with new virtual addresses in the qemu process.  The caller resumes
the guest by calling cprload, which loads state from the file.  If the VM
was running at cprsave time, then VM execution resumes.  cprsave supports
any type of guest image and block device, but the caller must not modify
guest block devices between cprsave and cprload.
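
The fd hand-off described above can be sketched in isolation.  This is a
minimal illustration, not code from the series: the variable name
DEMO_FD_guest_ram and the functions ram_create and ram_reattach are
hypothetical, and the exec step is omitted, so the "new process" half simply
runs later in the same process.  It assumes Linux with glibc 2.27 or later
for memfd_create.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical variable name, for illustration only. */
#define DEMO_FD_VAR "DEMO_FD_guest_ram"

/* Old process: back "guest ram" with a memfd and publish the fd number. */
static void *ram_create(size_t len)
{
    char val[16];
    void *p;
    /* Flags of 0 leave FD_CLOEXEC clear, so the fd would survive exec. */
    int fd = memfd_create("guest-ram", 0);

    if (fd < 0) {
        return NULL;
    }
    snprintf(val, sizeof(val), "%d", fd);
    if (ftruncate(fd, (off_t)len) < 0 || setenv(DEMO_FD_VAR, val, 1) < 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        unsetenv(DEMO_FD_VAR);
        close(fd);
        return NULL;
    }
    return p;
}

/* New process: find the fd in the environment and map the same pages. */
static void *ram_reattach(size_t len)
{
    const char *val = getenv(DEMO_FD_VAR);
    void *p;

    if (!val) {
        return NULL;
    }
    /* The contents are unchanged; only the virtual address may differ. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, atoi(val), 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Writing through the pointer returned by ram_create, calling munmap, and then
calling ram_reattach yields a mapping with the same contents, which is the
property the restart mode relies on.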

The restart mode supports vfio devices by preserving the vfio container,
group, device, and event descriptors across the qemu re-exec, and by
updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
and integrated in Linux kernel 5.12.

For the reboot mode, cprsave saves state and exits qemu, and the caller is
allowed to update the host kernel and system software and reboot.  The
caller resumes the guest by running qemu with the same arguments as the
original process and calling cprload.  To use this mode, guest ram must be
mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.

The reboot mode supports vfio devices if the caller suspends the guest
instead of stopping the VM, such as by issuing guest-suspend-ram to the
qemu guest agent.  The guest drivers' suspend methods flush outstanding
requests and re-initialize the devices, and thus there is no device state
to save and restore.

The first patches add helper functions:

  - as_flat_walk
  - qemu_ram_volatile
  - oslib: qemu_clr_cloexec
  - util: env var helpers
  - machine: memfd-alloc option
  - vl: add helper to request re-exec

The next patches implement cprsave and cprload:

  - cpr
  - cpr: QMP interfaces
  - cpr: HMP interfaces

The next patches add vfio support for the restart mode:

  - pci: export functions for cpr
  - vfio-pci: refactor for cpr
  - vfio-pci: cpr part 1
  - vfio-pci: cpr part 2

The next patches preserve various descriptor-based backend devices across
a cprsave restart:

  - vhost: reset vhost devices upon cprsave
  - hostmem-memfd: cpr support
  - chardev: cpr framework
  - chardev: cpr for simple devices
  - chardev: cpr for pty
  - chardev: cpr for sockets

The last patches add the only-cpr-capable option and housekeeping changes:

  - cpr: only-cpr-capable option
  - cpr: maintainers
  - simplify savevm

Here is an example of updating qemu from v4.2.0 to v4.2.1 using
"cprsave restart".  The software update is performed while the guest is
running to minimize downtime.

window 1				| window 2
					|
# qemu-system-x86_64 ... 		|
QEMU 4.2.0 monitor - type 'help' ...	|
(qemu) info status			|
VM status: running			|
					| # yum update qemu
(qemu) cprsave /tmp/qemu.sav restart	|
QEMU 4.2.1 monitor - type 'help' ...	|
(qemu) info status			|
VM status: paused (prelaunch)		|
(qemu) cprload /tmp/qemu.sav		|
(qemu) info status			|
VM status: running			|


Here is an example of updating the host kernel using "cprsave reboot":

window 1					| window 2
						|
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...		|
(qemu) info status				|
VM status: running				|
						| # yum update kernel-uek
(qemu) cprsave /tmp/qemu.sav reboot		|
						|
# systemctl kexec				|
kexec_core: Starting new kernel			|
...						|
						|
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...		|
(qemu) info status				|
VM status: paused (prelaunch)			|
(qemu) cprload /tmp/qemu.sav			|
(qemu) info status				|
VM status: running				|

Changes from V1 to V2:
  - revert vmstate infrastructure changes
  - refactor cpr functions into new files
  - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to 
    preserve memory.
  - add framework to filter chardev's that support cpr
  - save and restore vfio eventfd's
  - modify cprinfo QMP interface
  - incorporate misc review feedback
  - remove unrelated and unneeded patches
  - refactor all patches into a shorter and easier to review series

Changes from V2 to V3:
  - rebase to qemu 6.0.0
  - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
  - change memfd-alloc to a machine option
  - use existing channel socket function instead of defining new ones
  - close monitor socket during cpr
  - support memory-backend-memfd
  - fix a few unreported bugs

Steve Sistare (18):
  as_flat_walk
  qemu_ram_volatile
  oslib: qemu_clr_cloexec
  util: env var helpers
  machine: memfd-alloc option
  vl: add helper to request re-exec
  cpr
  pci: export functions for cpr
  vfio-pci: refactor for cpr
  vfio-pci: cpr part 1
  vfio-pci: cpr part 2
  hostmem-memfd: cpr support
  chardev: cpr framework
  chardev: cpr for simple devices
  chardev: cpr for pty
  cpr: only-cpr-capable option
  cpr: maintainers
  simplify savevm

Mark Kanda, Steve Sistare (4):
  cpr: QMP interfaces
  cpr: HMP interfaces
  vhost: reset vhost devices upon cprsave
  chardev: cpr for sockets

 MAINTAINERS                   |  11 +++
 backends/hostmem-memfd.c      |  21 +++--
 chardev/char-mux.c            |   1 +
 chardev/char-null.c           |   1 +
 chardev/char-pty.c            |  15 ++-
 chardev/char-serial.c         |   1 +
 chardev/char-socket.c         |  35 +++++++
 chardev/char-stdio.c          |   8 ++
 chardev/char.c                |  41 +++++++-
 gdbstub.c                     |   1 +
 hmp-commands.hx               |  44 +++++++++
 hw/core/machine.c             |  19 ++++
 hw/pci/msi.c                  |   4 +
 hw/pci/msix.c                 |  20 ++--
 hw/pci/pci.c                  |   7 +-
 hw/vfio/common.c              |  68 +++++++++++++-
 hw/vfio/cpr.c                 | 131 ++++++++++++++++++++++++++
 hw/vfio/meson.build           |   1 +
 hw/vfio/pci.c                 | 214 ++++++++++++++++++++++++++++++++++++++----
 hw/vfio/trace-events          |   1 +
 hw/virtio/vhost.c             |  11 +++
 include/chardev/char.h        |   6 ++
 include/exec/memory.h         |  25 +++++
 include/hw/boards.h           |   1 +
 include/hw/pci/msix.h         |   5 +
 include/hw/pci/pci.h          |   2 +
 include/hw/vfio/vfio-common.h |   8 ++
 include/hw/virtio/vhost.h     |   1 +
 include/migration/cpr.h       |  17 ++++
 include/monitor/hmp.h         |   3 +
 include/qemu/env.h            |  23 +++++
 include/qemu/osdep.h          |   1 +
 include/sysemu/runstate.h     |   2 +
 include/sysemu/sysemu.h       |   2 +
 linux-headers/linux/vfio.h    |  27 ++++++
 migration/cpr.c               | 200 +++++++++++++++++++++++++++++++++++++++
 migration/meson.build         |   1 +
 migration/migration.c         |   5 +
 migration/savevm.c            |  21 ++---
 migration/savevm.h            |   2 +
 monitor/hmp-cmds.c            |  48 ++++++++++
 monitor/hmp.c                 |   3 +
 monitor/qmp-cmds.c            |  31 ++++++
 monitor/qmp.c                 |   3 +
 qapi/char.json                |   5 +-
 qapi/cpr.json                 |  76 +++++++++++++++
 qapi/meson.build              |   1 +
 qapi/qapi-schema.json         |   1 +
 qemu-options.hx               |  39 +++++++-
 softmmu/globals.c             |   2 +
 softmmu/memory.c              |  48 ++++++++++
 softmmu/physmem.c             |  49 ++++++++--
 softmmu/runstate.c            |  49 +++++++++-
 softmmu/vl.c                  |  21 ++++-
 stubs/cpr.c                   |   3 +
 stubs/meson.build             |   1 +
 trace-events                  |   1 +
 util/env.c                    |  99 +++++++++++++++++++
 util/meson.build              |   1 +
 util/oslib-posix.c            |   9 ++
 util/oslib-win32.c            |   4 +
 util/qemu-config.c            |   4 +
 62 files changed, 1431 insertions(+), 74 deletions(-)
 create mode 100644 hw/vfio/cpr.c
 create mode 100644 include/migration/cpr.h
 create mode 100644 include/qemu/env.h
 create mode 100644 migration/cpr.c
 create mode 100644 qapi/cpr.json
 create mode 100644 stubs/cpr.c
 create mode 100644 util/env.c

-- 
1.8.3.1




* [PATCH V3 01/22] as_flat_walk
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
@ 2021-05-07 12:24 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 02/22] qemu_ram_volatile Steve Sistare
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add an iterator over the sections of a flattened address space.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/memory.h | 17 +++++++++++++++++
 softmmu/memory.c      | 18 ++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 5728a68..2e5495a 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2003,6 +2003,23 @@ bool memory_region_present(MemoryRegion *container, hwaddr addr);
  */
 bool memory_region_is_mapped(MemoryRegion *mr);
 
+typedef int (*qemu_flat_walk_cb)(MemoryRegionSection *s,
+                                 void *handle,
+                                 Error **errp);
+
+/**
+ * as_flat_walk: walk the ranges in the address space flat view and call @func
+ * for each.  Return 0 on success, else return non-zero with a message in
+ * @errp.
+ *
+ * @as: target address space
+ * @func: callback function
+ * @handle: passed to @func
+ * @errp: passed to @func
+ */
+int as_flat_walk(AddressSpace *as, qemu_flat_walk_cb func,
+                 void *handle, Error **errp);
+
 /**
  * memory_region_find: translate an address/size relative to a
  * MemoryRegion into a #MemoryRegionSection.
diff --git a/softmmu/memory.c b/softmmu/memory.c
index d4493ef..75d7d17 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2570,6 +2570,24 @@ bool memory_region_is_mapped(MemoryRegion *mr)
     return mr->container ? true : false;
 }
 
+int as_flat_walk(AddressSpace *as, qemu_flat_walk_cb func,
+                 void *handle, Error **errp)
+{
+    FlatView *view = address_space_get_flatview(as);
+    FlatRange *fr;
+    int ret;
+
+    FOR_EACH_FLAT_RANGE(fr, view) {
+        MemoryRegionSection section = section_from_flat_range(fr, view);
+        ret = func(&section, handle, errp);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 /* Same as memory_region_find, but it does not add a reference to the
  * returned region.  It must be called from an RCU critical section.
  */
-- 
1.8.3.1




* [PATCH V3 02/22] qemu_ram_volatile
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
  2021-05-07 12:24 ` [PATCH V3 01/22] as_flat_walk Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 03/22] oslib: qemu_clr_cloexec Steve Sistare
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add a function that returns true if any ram_list block represents
volatile memory.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/memory.h |  8 ++++++++
 softmmu/memory.c      | 30 ++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2e5495a..d87c059 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2646,6 +2646,14 @@ bool ram_block_discard_is_disabled(void);
  */
 bool ram_block_discard_is_required(void);
 
+/**
+ * qemu_ram_volatile: return true if any memory regions are writable and not
+ * backed by shared memory.
+ *
+ * @errp: returned error message identifying the bad region.
+ */
+bool qemu_ram_volatile(Error **errp);
+
 #endif
 
 #endif
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 75d7d17..b2d5092 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2725,6 +2725,36 @@ void memory_global_dirty_log_stop(void)
     memory_global_dirty_log_do_stop();
 }
 
+/*
+ * Return true if any memory regions are writable and not backed by shared
+ * memory.
+ */
+bool qemu_ram_volatile(Error **errp)
+{
+    RAMBlock *block;
+    MemoryRegion *mr;
+    bool ret = false;
+
+    rcu_read_lock();
+    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+        mr = block->mr;
+        if (mr &&
+            memory_region_is_ram(mr) &&
+            !memory_region_is_ram_device(mr) &&
+            !memory_region_is_rom(mr) &&
+            (block->fd == -1 || !qemu_ram_is_shared(block))) {
+
+            error_setg(errp, "Memory region %s is volatile",
+                       memory_region_name(mr));
+            ret = true;
+            break;
+        }
+    }
+
+    rcu_read_unlock();
+    return ret;
+}
+
 static void listener_add_address_space(MemoryListener *listener,
                                        AddressSpace *as)
 {
-- 
1.8.3.1




* [PATCH V3 03/22] oslib: qemu_clr_cloexec
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
  2021-05-07 12:24 ` [PATCH V3 01/22] as_flat_walk Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 02/22] qemu_ram_volatile Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 04/22] util: env var helpers Steve Sistare
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Define qemu_clr_cloexec, analogous to qemu_set_cloexec.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/qemu/osdep.h | 1 +
 util/oslib-posix.c   | 9 +++++++++
 util/oslib-win32.c   | 4 ++++
 3 files changed, 14 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index cb2a07e..de06e60 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -577,6 +577,7 @@ static inline void qemu_timersub(const struct timeval *val1,
 #endif
 
 void qemu_set_cloexec(int fd);
+void qemu_clr_cloexec(int fd);
 
 /* Starting on QEMU 2.5, qemu_hw_version() returns "2.5+" by default
  * instead of QEMU_VERSION, so setting hw_version on MachineClass
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 36820fe..ac9229d 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -317,6 +317,15 @@ void qemu_set_cloexec(int fd)
     assert(f != -1);
 }
 
+void qemu_clr_cloexec(int fd)
+{
+    int f;
+    f = fcntl(fd, F_GETFD);
+    assert(f != -1);
+    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
+    assert(f != -1);
+}
+
 /*
  * Creates a pipe with FD_CLOEXEC set on both file descriptors
  */
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index f68b801..b5c53b3 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -251,6 +251,10 @@ void qemu_set_cloexec(int fd)
 {
 }
 
+void qemu_clr_cloexec(int fd)
+{
+}
+
 /* Offset between 1/1/1601 and 1/1/1970 in 100 nanosec units */
 #define _W32_FT_OFFSET (116444736000000000ULL)
 
-- 
1.8.3.1
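
For reference, clearing close-on-exec is a read-modify-write of the
descriptor flags.  The sketch below is a stand-alone variant that returns an
error instead of asserting.  fd_set_cloexec and fd_get_cloexec are
hypothetical names used only for illustration; they are not part of the
patch or of QEMU.

```c
#include <fcntl.h>
#include <stdbool.h>

/* Set or clear FD_CLOEXEC on fd.  Returns 0 on success, -1 with errno set. */
static int fd_set_cloexec(int fd, bool on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}

/* Returns 1 if FD_CLOEXEC is set, 0 if it is clear, -1 on error. */
static int fd_get_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags < 0 ? -1 : !!(flags & FD_CLOEXEC);
}
```

A descriptor that must stay open across exec would be passed through
fd_set_cloexec(fd, false) before the exec call.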




* [PATCH V3 04/22] util: env var helpers
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (2 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 03/22] oslib: qemu_clr_cloexec Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 05/22] machine: memfd-alloc option Steve Sistare
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add functions for saving fd's and other values in the environment via
setenv, and for reading them back via getenv.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/qemu/env.h | 23 +++++++++++++
 util/env.c         | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 util/meson.build   |  1 +
 3 files changed, 123 insertions(+)
 create mode 100644 include/qemu/env.h
 create mode 100644 util/env.c

diff --git a/include/qemu/env.h b/include/qemu/env.h
new file mode 100644
index 0000000..3dad503
--- /dev/null
+++ b/include/qemu/env.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_ENV_H
+#define QEMU_ENV_H
+
+#define FD_PREFIX "QEMU_FD_"
+
+typedef int (*walkenv_cb)(const char *name, const char *val, void *handle);
+
+int getenv_fd(const char *name);
+void setenv_fd(const char *name, int fd);
+void unsetenv_fd(const char *name);
+void unsetenv_fdv(const char *fmt, ...);
+int walkenv(const char *prefix, walkenv_cb cb, void *handle);
+void printenv(void);
+
+#endif
diff --git a/util/env.c b/util/env.c
new file mode 100644
index 0000000..b09ba05
--- /dev/null
+++ b/util/env.c
@@ -0,0 +1,99 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/env.h"
+
+static uint64_t getenv_ulong(const char *prefix, const char *name, bool *found)
+{
+    char var[80], *val;
+    uint64_t res;
+
+    snprintf(var, sizeof(var), "%s%s", prefix, name);
+    val = getenv(var);
+    if (val) {
+        *found = true;
+        res = strtol(val, 0, 10);
+    } else {
+        *found = false;
+        res = 0;
+    }
+    return res;
+}
+
+static void setenv_ulong(const char *prefix, const char *name, uint64_t val)
+{
+    char var[80], val_str[80];
+    snprintf(var, sizeof(var), "%s%s", prefix, name);
+    snprintf(val_str, sizeof(val_str), "%"PRIu64, val);
+    setenv(var, val_str, 1);
+}
+
+static void unsetenv_ulong(const char *prefix, const char *name)
+{
+    char var[80];
+    snprintf(var, sizeof(var), "%s%s", prefix, name);
+    unsetenv(var);
+}
+
+int getenv_fd(const char *name)
+{
+    bool found;
+    int fd = getenv_ulong(FD_PREFIX, name, &found);
+    if (!found) {
+        fd = -1;
+    }
+    return fd;
+}
+
+void setenv_fd(const char *name, int fd)
+{
+    setenv_ulong(FD_PREFIX, name, fd);
+}
+
+void unsetenv_fd(const char *name)
+{
+    unsetenv_ulong(FD_PREFIX, name);
+}
+
+void unsetenv_fdv(const char *fmt, ...)
+{
+    va_list args;
+    char buf[80];
+    va_start(args, fmt);
+    vsnprintf(buf, sizeof(buf), fmt, args);
+    va_end(args);
+}
+
+int walkenv(const char *prefix, walkenv_cb cb, void *handle)
+{
+    char *str, name[128];
+    char **envp = environ;
+    size_t prefix_len = strlen(prefix);
+
+    while (*envp) {
+        str = *envp++;
+        if (!strncmp(str, prefix, prefix_len)) {
+            char *val = strchr(str, '=');
+            str += prefix_len;
+            strncpy(name, str, val - str);
+            name[val - str] = 0;
+            if (cb(name, val + 1, handle)) {
+                return 1;
+            }
+        }
+    }
+    return 0;
+}
+
+void printenv(void)
+{
+    char **ptr = environ;
+    while (*ptr) {
+        puts(*ptr++);
+    }
+}
diff --git a/util/meson.build b/util/meson.build
index 510765c..d2d90cc 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -22,6 +22,7 @@ util_ss.add(files('host-utils.c'))
 util_ss.add(files('bitmap.c', 'bitops.c'))
 util_ss.add(files('fifo8.c'))
 util_ss.add(files('cacheinfo.c', 'cacheflush.c'))
+util_ss.add(files('env.c'))
 util_ss.add(files('error.c', 'qemu-error.c'))
 util_ss.add(files('qemu-print.c'))
 util_ss.add(files('id.c'))
-- 
1.8.3.1
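
The helpers in this patch can be exercised outside QEMU with a stand-alone
sketch such as the one below.  It is an illustration under stated
assumptions, not the posted code: every demo_* name and the DEMO_FD_ prefix
are hypothetical, the parser rejects malformed values rather than trusting
strtol, and the walker skips names that would not fit its buffer.  Note that
unsetenv_fdv as posted formats the name into buf but does not go on to unset
it; the sketch shows one way the formatted name could be passed on.

```c
#include <errno.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern char **environ;

/* Hypothetical prefix; the patch defines FD_PREFIX as "QEMU_FD_". */
#define DEMO_PREFIX "DEMO_FD_"

typedef int (*demo_walk_cb)(const char *name, const char *val, void *opaque);

/* Build "<prefix><name>" in var; return -1 if it does not fit. */
static int demo_var(char *var, size_t size, const char *name)
{
    int n = snprintf(var, size, "%s%s", DEMO_PREFIX, name);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

static int demo_setenv_fd(const char *name, int fd)
{
    char var[128], val[16];

    if (demo_var(var, sizeof(var), name) < 0) {
        return -1;
    }
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(var, val, 1);
}

/* Return the saved fd, or -1 if the variable is missing or malformed. */
static int demo_getenv_fd(const char *name)
{
    char var[128], *end;
    const char *val;
    long v;

    if (demo_var(var, sizeof(var), name) < 0) {
        return -1;
    }
    val = getenv(var);
    if (!val) {
        return -1;
    }
    errno = 0;
    v = strtol(val, &end, 10);
    if (errno || end == val || *end != '\0' || v < 0 || v > INT_MAX) {
        return -1;
    }
    return (int)v;
}

/* printf-style name, e.g. demo_unsetenv_fdv("ram%d", 0). */
static int demo_unsetenv_fdv(const char *fmt, ...)
{
    char name[64], var[128];
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(name, sizeof(name), fmt, ap);
    va_end(ap);
    if (demo_var(var, sizeof(var), name) < 0) {
        return -1;
    }
    return unsetenv(var);
}

/* Call cb for each DEMO_PREFIX variable; stop early if cb returns non-zero. */
static int demo_walkenv(demo_walk_cb cb, void *opaque)
{
    size_t plen = strlen(DEMO_PREFIX);

    for (char **e = environ; *e; e++) {
        const char *eq = strchr(*e, '=');
        char name[128];
        size_t nlen;

        if (!eq || strncmp(*e, DEMO_PREFIX, plen) != 0) {
            continue;
        }
        nlen = (size_t)(eq - *e) - plen;
        if (nlen >= sizeof(name)) {
            continue;               /* skip names that would overflow */
        }
        memcpy(name, *e + plen, nlen);
        name[nlen] = '\0';
        if (cb(name, eq + 1, opaque)) {
            return 1;
        }
    }
    return 0;
}

/* Example callback: count the matching variables. */
static int demo_count_cb(const char *name, const char *val, void *opaque)
{
    (void)name;
    (void)val;
    ++*(int *)opaque;
    return 0;
}
```

The callback passed to demo_walkenv should not modify the environment, since
setenv and unsetenv may reallocate the array being walked.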




* [PATCH V3 05/22] machine: memfd-alloc option
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (3 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 04/22] util: env var helpers Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 06/22] vl: add helper to request re-exec Steve Sistare
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Allocate anonymous memory using memfd_create if the memfd-alloc machine
option is set.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/core/machine.c   | 19 +++++++++++++++++++
 include/hw/boards.h |  1 +
 qemu-options.hx     |  5 +++++
 softmmu/physmem.c   | 41 ++++++++++++++++++++++++++++++++---------
 trace-events        |  1 +
 util/qemu-config.c  |  4 ++++
 6 files changed, 62 insertions(+), 9 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 40def78..3ce5303 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -375,6 +375,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
     ms->mem_merge = value;
 }
 
+static bool machine_get_memfd_alloc(Object *obj, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    return ms->memfd_alloc;
+}
+
+static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    ms->memfd_alloc = value;
+}
+
 static bool machine_get_usb(Object *obj, Error **errp)
 {
     MachineState *ms = MACHINE(obj);
@@ -858,6 +872,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
     object_class_property_set_description(oc, "mem-merge",
         "Enable/disable memory merge support");
 
+    object_class_property_add_bool(oc, "memfd-alloc",
+        machine_get_memfd_alloc, machine_set_memfd_alloc);
+    object_class_property_set_description(oc, "memfd-alloc",
+        "Enable/disable allocating anonymous memory using memfd_create");
+
     object_class_property_add_bool(oc, "usb",
         machine_get_usb, machine_set_usb);
     object_class_property_set_description(oc, "usb",
diff --git a/include/hw/boards.h b/include/hw/boards.h
index ad6c8fd..dceb7f7 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -305,6 +305,7 @@ struct MachineState {
     char *dt_compatible;
     bool dump_guest_core;
     bool mem_merge;
+    bool memfd_alloc;
     bool usb;
     bool usb_disabled;
     char *firmware;
diff --git a/qemu-options.hx b/qemu-options.hx
index fd21002..3392ac0 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
     "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
     "                mem-merge=on|off controls memory merge support (default: on)\n"
+    "                memfd-alloc=on|off controls allocating anonymous memory using memfd_create (default: off)\n"
     "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
     "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
     "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
@@ -76,6 +77,10 @@ SRST
         supported by the host, de-duplicates identical memory pages
         among VMs instances (enabled by default).
 
+    ``memfd-alloc=on|off``
+        Enables or disables allocation of anonymous memory using memfd_create.
+        (disabled by default).
+
     ``aes-key-wrap=on|off``
         Enables or disables AES key wrapping support on s390-ccw hosts.
         This feature controls whether AES wrapping keys will be created
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 85034d9..695aa10 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -67,6 +67,7 @@
 
 #include "qemu/pmem.h"
 
+#include "qemu/memfd.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -1931,35 +1932,57 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
 {
     RAMBlock *block;
     RAMBlock *last_block = NULL;
+    struct MemoryRegion *mr = new_block->mr;
     ram_addr_t old_ram_size, new_ram_size;
     Error *err = NULL;
+    const char *name;
+    void *addr = 0;
+    size_t maxlen;
+    MachineState *ms = MACHINE(qdev_get_machine());
 
     old_ram_size = last_ram_page();
 
     qemu_mutex_lock_ramlist();
-    new_block->offset = find_ram_offset(new_block->max_length);
+    maxlen = new_block->max_length;
+    new_block->offset = find_ram_offset(maxlen);
 
     if (!new_block->host) {
         if (xen_enabled()) {
-            xen_ram_alloc(new_block->offset, new_block->max_length,
-                          new_block->mr, &err);
+            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
             if (err) {
                 error_propagate(errp, err);
                 qemu_mutex_unlock_ramlist();
                 return;
             }
         } else {
-            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
-                                                  &new_block->mr->align,
-                                                  shared);
-            if (!new_block->host) {
+            name = memory_region_name(new_block->mr);
+            if (ms->memfd_alloc) {
+                int mfd = -1;          /* placeholder until next patch */
+                mr->align = QEMU_VMALLOC_ALIGN;
+                if (mfd < 0) {
+                    mfd = qemu_memfd_create(name, maxlen + mr->align,
+                                            0, 0, 0, &err);
+                    if (mfd < 0) {
+                        return;
+                    }
+                }
+                new_block->flags |= RAM_SHARED;
+                addr = file_ram_alloc(new_block, maxlen, mfd,
+                                      false, false, 0, errp);
+                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
+            } else {
+                addr = qemu_anon_ram_alloc(maxlen, &mr->align, shared);
+            }
+
+            if (!addr) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
-                                 memory_region_name(new_block->mr));
+                                 name);
                 qemu_mutex_unlock_ramlist();
                 return;
             }
-            memory_try_enable_merging(new_block->host, new_block->max_length);
+            memory_try_enable_merging(addr, maxlen);
+            new_block->host = addr;
         }
     }
 
diff --git a/trace-events b/trace-events
index ac7cef9..99e8208 100644
--- a/trace-events
+++ b/trace-events
@@ -40,6 +40,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
 # accel/tcg/cputlb.c
 memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
 memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
+anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
 
 # gdbstub.c
 gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
diff --git a/util/qemu-config.c b/util/qemu-config.c
index 670bd6e..135ec3b 100644
--- a/util/qemu-config.c
+++ b/util/qemu-config.c
@@ -205,6 +205,10 @@ static QemuOptsList machine_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "enable/disable memory merge support",
         },{
+            .name = "memfd-alloc",
+            .type = QEMU_OPT_BOOL,
+            .help = "enable/disable memfd_create for anonymous memory",
+        },{
             .name = "usb",
             .type = QEMU_OPT_BOOL,
             .help = "Set on/off to enable/disable usb",
-- 
1.8.3.1




* [PATCH V3 06/22] vl: add helper to request re-exec
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (4 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 05/22] machine: memfd-alloc option Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 14:31   ` Eric Blake
  2021-05-12 16:27   ` Stefan Hajnoczi
  2021-05-07 12:25 ` [PATCH V3 07/22] cpr Steve Sistare
                   ` (18 subsequent siblings)
  24 siblings, 2 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add a qemu_exec_requested() hook that causes the main loop to exit and
re-exec qemu using the same initial arguments.  If /usr/bin/qemu-exec
exists, exec that instead.  This is an optional site-specific trampoline
that may alter the environment before exec'ing the qemu binary.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/sysemu/runstate.h |  1 +
 include/sysemu/sysemu.h   |  1 +
 softmmu/globals.c         |  1 +
 softmmu/runstate.c        | 28 ++++++++++++++++++++++++++++
 softmmu/vl.c              |  1 +
 5 files changed, 32 insertions(+)

diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index a535691..50c84af 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -56,6 +56,7 @@ void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
 void qemu_register_wakeup_notifier(Notifier *notifier);
 void qemu_register_wakeup_support(void);
 void qemu_system_shutdown_request(ShutdownCause reason);
+void qemu_system_exec_request(void);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
 void qemu_register_shutdown_notifier(Notifier *notifier);
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 8fae667..f56058e 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -9,6 +9,7 @@
 /* vl.c */
 
 extern int only_migratable;
+extern char **argv_main;
 extern const char *qemu_name;
 extern QemuUUID qemu_uuid;
 extern bool qemu_uuid_set;
diff --git a/softmmu/globals.c b/softmmu/globals.c
index 7d0fc81..2bb630d 100644
--- a/softmmu/globals.c
+++ b/softmmu/globals.c
@@ -60,6 +60,7 @@ bool boot_strict;
 uint8_t *boot_splash_filedata;
 int only_migratable; /* turn it off unless user states otherwise */
 int icount_align_option;
+char **argv_main;
 
 /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
  * little-endian "wire format" described in the SMBIOS 2.6 specification.
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index ce8977c..bea7513 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -338,6 +338,7 @@ static ShutdownCause reset_requested;
 static ShutdownCause shutdown_requested;
 static int shutdown_signal;
 static pid_t shutdown_pid;
+static int exec_requested;
 static int powerdown_requested;
 static int debug_requested;
 static int suspend_requested;
@@ -367,6 +368,11 @@ static int qemu_shutdown_requested(void)
     return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
 }
 
+static int qemu_exec_requested(void)
+{
+    return qatomic_xchg(&exec_requested, 0);
+}
+
 static void qemu_kill_report(void)
 {
     if (!qtest_driver() && shutdown_signal) {
@@ -625,6 +631,13 @@ void qemu_system_shutdown_request(ShutdownCause reason)
     qemu_notify_event();
 }
 
+void qemu_system_exec_request(void)
+{
+    shutdown_requested = 1;
+    exec_requested = 1;
+    qemu_notify_event();
+}
+
 static void qemu_system_powerdown(void)
 {
     qapi_event_send_powerdown();
@@ -660,6 +673,16 @@ void qemu_system_debug_request(void)
     qemu_notify_event();
 }
 
+static void qemu_exec(void)
+{
+    const char *helper = "/usr/bin/qemu-exec";
+    const char *bin = !access(helper, X_OK) ? helper : argv_main[0];
+
+    execvp(bin, argv_main);
+    error_report("execvp failed, errno %d.", errno);
+    exit(1);
+}
+
 static bool main_loop_should_exit(void)
 {
     RunState r;
@@ -673,6 +696,11 @@ static bool main_loop_should_exit(void)
     }
     request = qemu_shutdown_requested();
     if (request) {
+
+        if (qemu_exec_requested()) {
+            qemu_exec();
+            /* not reached */
+        }
         qemu_kill_report();
         qemu_system_shutdown(request);
         if (shutdown_action == SHUTDOWN_ACTION_PAUSE) {
diff --git a/softmmu/vl.c b/softmmu/vl.c
index aadb526..04ab752 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2662,6 +2662,7 @@ void qemu_init(int argc, char **argv, char **envp)
 
     error_init(argv[0]);
     qemu_init_exec_dir(argv[0]);
+    argv_main = argv;
 
     qemu_init_subsystems();
 
-- 
1.8.3.1




* [PATCH V3 07/22] cpr
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (5 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 06/22] vl: add helper to request re-exec Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-12 16:19   ` Stefan Hajnoczi
  2021-05-07 12:25 ` [PATCH V3 08/22] cpr: QMP interfaces Steve Sistare
                   ` (17 subsequent siblings)
  24 siblings, 1 reply; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Provide the cprsave and cprload functions for live update.  These save and
restore VM state, with minimal guest pause time, so that qemu may be updated
to a new version in between.

cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
/usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
paused state and waits for the cprload command.
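
One way to make the new process come up paused without editing argv is a
one-shot environment flag, which is what the QEMU_START_FREEZE handling in
this patch does.  The sketch below shows only that hand-off; the function
names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Old process, just before exec: ask the next image to start paused. */
int request_start_frozen(void)
{
    return setenv("QEMU_START_FREEZE", "", 1);
}

/*
 * New process, during init: consume the flag so that it does not leak
 * into a later exec.  Returns true if the VM should not autostart.
 */
bool consume_start_frozen(void)
{
    if (!getenv("QEMU_START_FREEZE")) {
        return false;
    }
    unsetenv("QEMU_START_FREEZE");
    return true;
}
```

Because the flag is removed as soon as it is read, a second call in the
same process reports false.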

To use the restart mode, qemu must be started with the memfd-alloc machine
option.  The memfd's are saved to the environment and kept open across exec,
after which they are found from the environment and re-mmap'd.  Hence guest
ram is preserved in place, albeit with new virtual addresses in the qemu
process.  The caller resumes the guest by calling cprload, which loads
state from the file.  If the VM was running at cprsave time, then VM
execution resumes.  cprsave supports any type of guest image and block
device, but the caller must not modify guest block devices between cprsave
and cprload.
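
The descriptor hand-off described above has two parts: clear FD_CLOEXEC so
the descriptor survives exec, and record its number in the environment
under a name the new process can look up.  The following is a rough sketch
under those assumptions; the function names and the variable naming scheme
are hypothetical and do not reproduce the qemu/env.h helpers used by the
patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Clear FD_CLOEXEC so the descriptor stays open across exec. */
int keep_fd_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Record the descriptor number in the environment under 'name'. */
int save_fd_to_env(const char *name, int fd)
{
    char val[16];

    assert(name && fd >= 0);
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(name, val, 1);
}

/* After exec: recover the descriptor number, or -1 if none was saved. */
int find_fd_in_env(const char *name)
{
    const char *val = getenv(name);

    return val ? atoi(val) : -1;
}
```

After exec, the new process would mmap the recovered descriptor again,
which is why guest ram keeps its contents but not its virtual addresses.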

For the reboot mode, cprsave saves state and exits qemu, and the caller is
allowed to update the host kernel and system software and reboot.  The
caller resumes the guest by running qemu with the same arguments as the
original process and calling cprload.  To use this mode, guest ram must be
mapped to a persistent shared memory file such as /dev/dax0.0 or /dev/shm
PKRAM.

The reboot mode supports vfio devices if the caller suspends the guest
instead of stopping the VM, such as by issuing guest-suspend-ram to the
qemu guest agent.  The guest drivers' suspend methods flush outstanding
requests and re-initialize the devices, and thus there is no device state
to save and restore.

The restart mode supports vfio devices in a subsequent patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h   |  17 +++++
 include/sysemu/runstate.h |   1 +
 migration/cpr.c           | 188 ++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build     |   1 +
 migration/savevm.h        |   2 +
 softmmu/physmem.c         |   6 +-
 softmmu/runstate.c        |  21 +++++-
 softmmu/vl.c              |   6 ++
 8 files changed, 240 insertions(+), 2 deletions(-)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr.c

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
new file mode 100644
index 0000000..42dec4e
--- /dev/null
+++ b/include/migration/cpr.h
@@ -0,0 +1,17 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef MIGRATION_CPR_H
+#define MIGRATION_CPR_H
+
+#include "qapi/qapi-types-cpr.h"
+
+bool cpr_active(void);
+void cprsave(const char *file, CprMode mode, Error **errp);
+void cprload(const char *file, Error **errp);
+
+#endif
diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index 50c84af..d69dc2d 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -51,6 +51,7 @@ void qemu_system_reset_request(ShutdownCause reason);
 void qemu_system_suspend_request(void);
 void qemu_register_suspend_notifier(Notifier *notifier);
 bool qemu_wakeup_suspend_enabled(void);
+void qemu_system_start_on_wake_request(void);
 void qemu_system_wakeup_request(WakeupReason reason, Error **errp);
 void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
 void qemu_register_wakeup_notifier(Notifier *notifier);
diff --git a/migration/cpr.c b/migration/cpr.c
new file mode 100644
index 0000000..e0da1cf
--- /dev/null
+++ b/migration/cpr.c
@@ -0,0 +1,188 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "monitor/monitor.h"
+#include "migration.h"
+#include "migration/snapshot.h"
+#include "chardev/char.h"
+#include "migration/misc.h"
+#include "migration/cpr.h"
+#include "migration/global_state.h"
+#include "qemu-file-channel.h"
+#include "qemu-file.h"
+#include "savevm.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qerror.h"
+#include "qemu/error-report.h"
+#include "io/channel-buffer.h"
+#include "io/channel-file.h"
+#include "sysemu/cpu-timers.h"
+#include "sysemu/runstate.h"
+#include "sysemu/runstate-action.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/replay.h"
+#include "sysemu/xen.h"
+#include "hw/vfio/vfio-common.h"
+#include "hw/virtio/vhost.h"
+#include "qemu/env.h"
+
+static int cpr_is_active;
+
+bool cpr_active(void)
+{
+    return cpr_is_active;
+}
+
+QEMUFile *qf_file_open(const char *path, int flags, int mode,
+                              const char *name, Error **errp)
+{
+    QIOChannelFile *fioc;
+    QIOChannel *ioc;
+    QEMUFile *f;
+
+    if (flags & O_RDWR) {
+        error_setg(errp, "qf_file_open %s: O_RDWR not supported", path);
+        return 0;
+    }
+
+    fioc = qio_channel_file_new_path(path, flags, mode, errp);
+    if (!fioc) {
+        return 0;
+    }
+
+    ioc = QIO_CHANNEL(fioc);
+    qio_channel_set_name(ioc, name);
+    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
+                             qemu_fopen_channel_input(ioc);
+    object_unref(OBJECT(fioc));
+    return f;
+}
+
+static int preserve_fd(const char *name, const char *val, void *handle)
+{
+    qemu_clr_cloexec(atoi(val));
+    return 0;
+}
+
+void cprsave(const char *file, CprMode mode, Error **errp)
+{
+    int ret = 0;
+    QEMUFile *f;
+    int saved_vm_running = runstate_is_running();
+    bool restart = (mode == CPR_MODE_RESTART);
+    bool reboot = (mode == CPR_MODE_REBOOT);
+
+    if (reboot && qemu_ram_volatile(errp)) {
+        return;
+    }
+
+    if (restart && xen_enabled()) {
+        error_setg(errp, "xen does not support cprsave restart");
+        return;
+    }
+
+    if (migrate_colo_enabled()) {
+        error_setg(errp, "error: cprsave does not support x-colo");
+        return;
+    }
+
+    if (replay_mode != REPLAY_MODE_NONE) {
+        error_setg(errp, "error: cprsave does not support replay");
+        return;
+    }
+
+    f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, "cprsave", errp);
+    if (!f) {
+        return;
+    }
+
+    ret = global_state_store();
+    if (ret) {
+        error_setg(errp, "Error saving global state");
+        qemu_fclose(f);
+        return;
+    }
+    if (runstate_check(RUN_STATE_SUSPENDED)) {
+        /* Update timers_state before saving.  Suspend did not do so. */
+        cpu_disable_ticks();
+    }
+    vm_stop(RUN_STATE_SAVE_VM);
+
+    cpr_is_active = true;
+    ret = qemu_save_device_state(f);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_setg(errp, QERR_IO_ERROR);
+        goto err;
+    }
+
+    if (ret < 0) {
+        if (!*errp) {
+            error_setg(errp, "Error %d while saving VM state", ret);
+        }
+        goto err;
+    }
+
+    if (reboot) {
+        shutdown_action = SHUTDOWN_ACTION_POWEROFF;
+        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+    } else if (restart) {
+        walkenv(FD_PREFIX, preserve_fd, 0);
+        setenv("QEMU_START_FREEZE", "", 1);
+        qemu_system_exec_request();
+    }
+    goto done;
+
+err:
+    if (saved_vm_running) {
+        vm_start();
+    }
+done:
+    cpr_is_active = false;
+    return;
+}
+
+void cprload(const char *file, Error **errp)
+{
+    QEMUFile *f;
+    int ret;
+    RunState state;
+
+    if (runstate_is_running()) {
+        error_setg(errp, "cprload called for a running VM");
+        return;
+    }
+
+    f = qf_file_open(file, O_RDONLY, 0, "cprload", errp);
+    if (!f) {
+        return;
+    }
+
+    if (qemu_get_be32(f) != QEMU_VM_FILE_MAGIC ||
+        qemu_get_be32(f) != QEMU_VM_FILE_VERSION) {
+        error_setg(errp, "error: %s is not a vmstate file", file);
+        return;
+    }
+
+    ret = qemu_load_device_state(f);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_setg(errp, "Error %d while loading VM state", ret);
+        return;
+    }
+
+    state = global_state_get_runstate();
+    if (state == RUN_STATE_RUNNING) {
+        vm_start();
+    } else {
+        runstate_set(state);
+        if (runstate_check(RUN_STATE_SUSPENDED)) {
+            qemu_system_start_on_wake_request();
+        }
+    }
+}
diff --git a/migration/meson.build b/migration/meson.build
index 3ecedce..c756374 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -15,6 +15,7 @@ softmmu_ss.add(files(
   'channel.c',
   'colo-failover.c',
   'colo.c',
+  'cpr.c',
   'exec.c',
   'fd.c',
   'global_state.c',
diff --git a/migration/savevm.h b/migration/savevm.h
index 6461342..ce5d710 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -67,5 +67,7 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
         bool in_postcopy, bool inactivate_disks);
+QEMUFile *qf_file_open(const char *path, int flags, int mode,
+                       const char *name, Error **errp);
 
 #endif
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 695aa10..b79f408 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -68,6 +68,7 @@
 #include "qemu/pmem.h"
 
 #include "qemu/memfd.h"
+#include "qemu/env.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -1957,7 +1958,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
         } else {
             name = memory_region_name(new_block->mr);
             if (ms->memfd_alloc) {
-                int mfd = -1;          /* placeholder until next patch */
+                int mfd = getenv_fd(name);
                 mr->align = QEMU_VMALLOC_ALIGN;
                 if (mfd < 0) {
                     mfd = qemu_memfd_create(name, maxlen + mr->align,
@@ -1965,7 +1966,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
                     if (mfd < 0) {
                         return;
                     }
+                    setenv_fd(name, mfd);
                 }
+                qemu_clr_cloexec(mfd);
                 new_block->flags |= RAM_SHARED;
                 addr = file_ram_alloc(new_block, maxlen, mfd,
                                       false, false, 0, errp);
@@ -2214,6 +2217,7 @@ void qemu_ram_free(RAMBlock *block)
     }
 
     qemu_mutex_lock_ramlist();
+    unsetenv_fd(memory_region_name(block->mr));
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
     /* Write list before version */
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index bea7513..07952cc 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -115,6 +115,8 @@ static const RunStateTransition runstate_transitions_def[] = {
     { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING },
     { RUN_STATE_PRELAUNCH, RUN_STATE_FINISH_MIGRATE },
     { RUN_STATE_PRELAUNCH, RUN_STATE_INMIGRATE },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_SUSPENDED },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_PAUSED },
 
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PAUSED },
@@ -334,6 +336,7 @@ void vm_state_notify(bool running, RunState state)
     }
 }
 
+static bool start_on_wake_requested;
 static ShutdownCause reset_requested;
 static ShutdownCause shutdown_requested;
 static int shutdown_signal;
@@ -567,6 +570,11 @@ void qemu_register_suspend_notifier(Notifier *notifier)
     notifier_list_add(&suspend_notifiers, notifier);
 }
 
+void qemu_system_start_on_wake_request(void)
+{
+    start_on_wake_requested = true;
+}
+
 void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
 {
     trace_system_wakeup_request(reason);
@@ -579,7 +587,18 @@ void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
     if (!(wakeup_reason_mask & (1 << reason))) {
         return;
     }
-    runstate_set(RUN_STATE_RUNNING);
+
+    /*
+     * Must call vm_start if it has never been called, to invoke the state
+     * change callbacks for the first time.
+     */
+    if (start_on_wake_requested) {
+        start_on_wake_requested = false;
+        vm_start();
+    } else {
+        runstate_set(RUN_STATE_RUNNING);
+    }
+
     wakeup_reason = reason;
     qemu_notify_event();
 }
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 04ab752..4654693 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -3510,6 +3510,12 @@ void qemu_init(int argc, char **argv, char **envp)
      */
     loc_set_none();
 
+    /* Equivalent to -S, but no need for parent to modify argv. */
+    if (getenv("QEMU_START_FREEZE")) {
+        unsetenv("QEMU_START_FREEZE");
+        autostart = 0;
+    }
+
     qemu_validate_options();
     qemu_process_sugar_options();
 
-- 
1.8.3.1




* [PATCH V3 08/22] cpr: QMP interfaces
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (6 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 07/22] cpr Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-06-04 13:59   ` Eric Blake
  2021-05-07 12:25 ` [PATCH V3 09/22] cpr: HMP interfaces Steve Sistare
                   ` (16 subsequent siblings)
  24 siblings, 1 reply; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

cprsave calls cprsave().  Syntax:
  { 'enum': 'CprMode', 'data': [ 'reboot', 'restart' ] }
  { 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'CprMode' } }

cprload calls cprload().  Syntax:
  { 'command': 'cprload', 'data': { 'file': 'str' } }

cprinfo returns a list of supported modes.  Syntax:
  { 'struct': 'CprInfo', 'data': { 'modes': [ 'CprMode' ] } }
  { 'command': 'cprinfo', 'returns': 'CprInfo' }

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 monitor/qmp-cmds.c    | 31 +++++++++++++++++++++
 qapi/cpr.json         | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++
 qapi/meson.build      |  1 +
 qapi/qapi-schema.json |  1 +
 4 files changed, 109 insertions(+)
 create mode 100644 qapi/cpr.json

diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
index f7d64a6..1128604 100644
--- a/monitor/qmp-cmds.c
+++ b/monitor/qmp-cmds.c
@@ -37,9 +37,11 @@
 #include "qapi/qapi-commands-machine.h"
 #include "qapi/qapi-commands-misc.h"
 #include "qapi/qapi-commands-ui.h"
+#include "qapi/qapi-commands-cpr.h"
 #include "qapi/qmp/qerror.h"
 #include "hw/mem/memory-device.h"
 #include "hw/acpi/acpi_dev_interface.h"
+#include "migration/cpr.h"
 
 NameInfo *qmp_query_name(Error **errp)
 {
@@ -153,6 +155,35 @@ void qmp_cont(Error **errp)
     }
 }
 
+CprInfo *qmp_cprinfo(Error **errp)
+{
+    CprInfo *cprinfo;
+    CprModeList *mode, *mode_list = NULL;
+    CprMode i;
+
+    cprinfo = g_malloc0(sizeof(*cprinfo));
+
+    for (i = 0; i < CPR_MODE__MAX; i++) {
+        mode = g_malloc0(sizeof(*mode));
+        mode->value = i;
+        mode->next = mode_list;
+        mode_list = mode;
+    }
+
+    cprinfo->modes = mode_list;
+    return cprinfo;
+}
+
+void qmp_cprsave(const char *file, CprMode mode, Error **errp)
+{
+    cprsave(file, mode, errp);
+}
+
+void qmp_cprload(const char *file, Error **errp)
+{
+    cprload(file, errp);
+}
+
 void qmp_system_wakeup(Error **errp)
 {
     if (!qemu_wakeup_suspend_enabled()) {
diff --git a/qapi/cpr.json b/qapi/cpr.json
new file mode 100644
index 0000000..2d80cca
--- /dev/null
+++ b/qapi/cpr.json
@@ -0,0 +1,76 @@
+# -*- Mode: Python -*-
+#
+# Copyright (c) 2021 Oracle and/or its affiliates.
+#
+# This work is licensed under the terms of the GNU GPL, version 2.
+# See the COPYING file in the top-level directory.
+
+##
+# = CPR
+##
+
+{ 'include': 'common.json' }
+
+##
+# @CprMode:
+#
+# @reboot: checkpoint can be cprload'ed after a host kexec reboot.
+#
+# @restart: checkpoint can be cprload'ed after restarting qemu.
+#
+# Since: 6.0
+##
+{ 'enum': 'CprMode',
+  'data': [ 'reboot', 'restart' ] }
+
+
+##
+# @CprInfo:
+#
+# @modes: @CprMode list
+#
+# Since: 6.0
+##
+{ 'struct': 'CprInfo',
+  'data': { 'modes': [ 'CprMode' ] } }
+
+##
+# @cprinfo:
+#
+# Returns the modes supported by @cprsave.
+#
+# Returns: @CprInfo
+#
+# Since: 6.0
+#
+##
+{ 'command': 'cprinfo',
+  'returns': 'CprInfo' }
+
+##
+# @cprsave:
+#
+# Create a checkpoint of the virtual machine device state in @file.
+# Guest RAM and guest block device blocks are not saved.
+#
+# @file: name of checkpoint file
+# @mode: @CprMode mode
+#
+# Since: 6.0
+##
+{ 'command': 'cprsave',
+  'data': { 'file': 'str',
+            'mode': 'CprMode' } }
+
+##
+# @cprload:
+#
+# Start virtual machine from checkpoint file that was created earlier using
+# the cprsave command.
+#
+# @file: name of checkpoint file
+#
+# Since: 6.0
+##
+{ 'command': 'cprload',
+  'data': { 'file': 'str' } }
diff --git a/qapi/meson.build b/qapi/meson.build
index 376f4ce..7e7c48a 100644
--- a/qapi/meson.build
+++ b/qapi/meson.build
@@ -26,6 +26,7 @@ qapi_all_modules = [
   'common',
   'compat',
   'control',
+  'cpr',
   'crypto',
   'dump',
   'error',
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index 4912b97..001d790 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -77,6 +77,7 @@
 { 'include': 'ui.json' }
 { 'include': 'authz.json' }
 { 'include': 'migration.json' }
+{ 'include': 'cpr.json' }
 { 'include': 'transaction.json' }
 { 'include': 'trace.json' }
 { 'include': 'compat.json' }
-- 
1.8.3.1




* [PATCH V3 09/22] cpr: HMP interfaces
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (7 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 08/22] cpr: QMP interfaces Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 10/22] pci: export functions for cpr Steve Sistare
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

cprsave <file> <mode>
  Call cprsave().
  Arguments:
    file : save vmstate to this file name
    mode : "reboot" or "restart"

cprload <file>
  Call cprload().
  Arguments:
    file : load vmstate from this file name

cprinfo
  Print to stdout a space-delimited list of modes supported by cprsave.
  Arguments: none
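
Since HMP passes the mode as a string, it has to be mapped to a mode value
before cprsave can be called, and cprinfo has to do the reverse.  The
patch relies on the generated QAPI lookup table for this; the standalone
sketch below uses a hypothetical enum and table instead, purely to
illustrate the mapping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the generated mode enum. */
enum demo_mode { DEMO_MODE_REBOOT, DEMO_MODE_RESTART, DEMO_MODE__MAX };

static const char *const demo_mode_names[DEMO_MODE__MAX] = {
    "reboot", "restart",
};

/* Map a mode string to its value; return -1 for NULL or unknown names. */
int demo_mode_parse(const char *s)
{
    if (!s) {
        return -1;
    }
    for (int i = 0; i < DEMO_MODE__MAX; i++) {
        if (strcmp(s, demo_mode_names[i]) == 0) {
            return i;
        }
    }
    return -1;
}

/*
 * Write a space-delimited list of mode names into buf.  Returns the
 * string length, or -1 if buf is too small.
 */
int demo_mode_list(char *buf, size_t len)
{
    size_t n = 0;

    assert(buf && len > 0);
    buf[0] = '\0';
    for (int i = 0; i < DEMO_MODE__MAX; i++) {
        int w = snprintf(buf + n, len - n, "%s%s",
                         i ? " " : "", demo_mode_names[i]);
        if (w < 0 || (size_t)w >= len - n) {
            return -1;
        }
        n += (size_t)w;
    }
    return (int)n;
}
```

Rejecting an unknown string before calling into the save path keeps the
error reporting in the monitor layer.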

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hmp-commands.hx       | 44 ++++++++++++++++++++++++++++++++++++++++++++
 include/monitor/hmp.h |  3 +++
 monitor/hmp-cmds.c    | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 95 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 435c591..5c79c5a 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -351,6 +351,50 @@ SRST
 ERST
 
     {
+        .name       = "cprinfo",
+        .args_type  = "",
+        .params     = "",
+        .help       = "return list of modes supported by cprsave",
+        .cmd        = hmp_cprinfo,
+    },
+
+SRST
+``cprinfo``
+Return a space-delimited list of modes supported by cprsave.
+ERST
+
+    {
+        .name       = "cprsave",
+        .args_type  = "file:s,mode:s",
+        .params     = "file 'restart'|'reboot'",
+        .help       = "create a checkpoint of the VM in file",
+        .cmd        = hmp_cprsave,
+    },
+
+SRST
+``cprsave`` *file* *mode*
+Create a checkpoint of the whole virtual machine and save it in *file*.
+If *mode* is 'reboot', the checkpoint remains valid after a host kexec
+reboot.  Guest ram must be backed by persistent shared memory.
+If *mode* is 'restart', pause the VCPUs, exec /usr/bin/qemu-exec if it
+exists, else exec argv[0], passing all the original command line arguments.
+Guest ram must be allocated with the memfd-alloc machine option.
+ERST
+
+    {
+        .name       = "cprload",
+        .args_type  = "file:s",
+        .params     = "file",
+        .help       = "load VM checkpoint from file",
+        .cmd        = hmp_cprload,
+    },
+
+SRST
+``cprload`` *file*
+Load a virtual machine from checkpoint file *file* and continue VCPUs.
+ERST
+
+    {
         .name       = "delvm",
         .args_type  = "name:s",
         .params     = "tag",
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 605d572..e4ebdf1 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -58,6 +58,9 @@ void hmp_balloon(Monitor *mon, const QDict *qdict);
 void hmp_loadvm(Monitor *mon, const QDict *qdict);
 void hmp_savevm(Monitor *mon, const QDict *qdict);
 void hmp_delvm(Monitor *mon, const QDict *qdict);
+void hmp_cprinfo(Monitor *mon, const QDict *qdict);
+void hmp_cprsave(Monitor *mon, const QDict *qdict);
+void hmp_cprload(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
 void hmp_migrate_incoming(Monitor *mon, const QDict *qdict);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 0ad5b77..e115a23 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -33,6 +33,7 @@
 #include "qapi/qapi-commands-block.h"
 #include "qapi/qapi-commands-char.h"
 #include "qapi/qapi-commands-control.h"
+#include "qapi/qapi-commands-cpr.h"
 #include "qapi/qapi-commands-machine.h"
 #include "qapi/qapi-commands-migration.h"
 #include "qapi/qapi-commands-misc.h"
@@ -1173,6 +1174,53 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
     qapi_free_AnnounceParameters(params);
 }
 
+void hmp_cprinfo(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    CprInfo *cprinfo;
+    CprModeList *mode;
+
+    cprinfo = qmp_cprinfo(&err);
+    if (err) {
+        goto out;
+    }
+
+    for (mode = cprinfo->modes; mode; mode = mode->next) {
+        monitor_printf(mon, "%s ", CprMode_str(mode->value));
+    }
+
+out:
+    hmp_handle_error(mon, err);
+    qapi_free_CprInfo(cprinfo);
+}
+
+void hmp_cprsave(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    const char *mode;
+    int val;
+
+    mode = qdict_get_try_str(qdict, "mode");
+    val = qapi_enum_parse(&CprMode_lookup, mode, -1, &err);
+
+    if (val == -1) {
+        goto out;
+    }
+
+    qmp_cprsave(qdict_get_try_str(qdict, "file"), val, &err);
+
+out:
+    hmp_handle_error(mon, err);
+}
+
+void hmp_cprload(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+
+    qmp_cprload(qdict_get_try_str(qdict, "file"), &err);
+    hmp_handle_error(mon, err);
+}
+
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict)
 {
     qmp_migrate_cancel(NULL);
-- 
1.8.3.1




* [PATCH V3 10/22] pci: export functions for cpr
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (8 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 09/22] cpr: HMP interfaces Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 11/22] vfio-pci: refactor " Steve Sistare
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Export msix_is_pending and msix_init_vector_notifiers for use by cpr.
No functional change.
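
For readers unfamiliar with the pending-bit array that msix_is_pending
consults, the lookup is one bit per vector, eight vectors per byte.  The
standalone sketch below operates on a plain byte array rather than a
PCIDevice; the function names are hypothetical, and the mask layout (bit
vector % 8 within the byte) is assumed here, as the mask helper is not
shown in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Byte of the pending-bit array that holds the bit for 'vector'. */
static uint8_t *pba_byte(uint8_t *pba, unsigned vector)
{
    return pba + vector / 8;
}

/* Mask selecting the bit for 'vector' within its byte (assumed layout). */
static uint8_t pba_mask(unsigned vector)
{
    return (uint8_t)(1u << (vector % 8));
}

/* Nonzero if the vector's pending bit is set. */
int pba_is_pending(uint8_t *pba, unsigned vector)
{
    return *pba_byte(pba, vector) & pba_mask(vector);
}

void pba_set_pending(uint8_t *pba, unsigned vector)
{
    *pba_byte(pba, vector) |= pba_mask(vector);
}

void pba_clr_pending(uint8_t *pba, unsigned vector)
{
    *pba_byte(pba, vector) &= (uint8_t)~pba_mask(vector);
}
```

Vector 9, for example, lands in the second byte of the array, at bit 1.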

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/pci/msix.c         | 20 ++++++++++++++------
 hw/pci/pci.c          |  3 +--
 include/hw/pci/msix.h |  5 +++++
 include/hw/pci/pci.h  |  1 +
 4 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331c..73f4259 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -64,7 +64,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
     return dev->msix_pba + vector / 8;
 }
 
-static int msix_is_pending(PCIDevice *dev, int vector)
+int msix_is_pending(PCIDevice *dev, unsigned int vector)
 {
     return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
 }
@@ -579,6 +579,17 @@ static void msix_unset_notifier_for_vector(PCIDevice *dev, unsigned int vector)
     dev->msix_vector_release_notifier(dev, vector);
 }
 
+void msix_init_vector_notifiers(PCIDevice *dev,
+                                MSIVectorUseNotifier use_notifier,
+                                MSIVectorReleaseNotifier release_notifier,
+                                MSIVectorPollNotifier poll_notifier)
+{
+    assert(use_notifier && release_notifier);
+    dev->msix_vector_use_notifier = use_notifier;
+    dev->msix_vector_release_notifier = release_notifier;
+    dev->msix_vector_poll_notifier = poll_notifier;
+}
+
 int msix_set_vector_notifiers(PCIDevice *dev,
                               MSIVectorUseNotifier use_notifier,
                               MSIVectorReleaseNotifier release_notifier,
@@ -586,11 +597,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
 {
     int vector, ret;
 
-    assert(use_notifier && release_notifier);
-
-    dev->msix_vector_use_notifier = use_notifier;
-    dev->msix_vector_release_notifier = release_notifier;
-    dev->msix_vector_poll_notifier = poll_notifier;
+    msix_init_vector_notifiers(dev, use_notifier, release_notifier,
+                               poll_notifier);
 
     if ((dev->config[dev->msix_cap + MSIX_CONTROL_OFFSET] &
         (MSIX_ENABLE_MASK | MSIX_MASKALL_MASK)) == MSIX_ENABLE_MASK) {
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 8f35e13..e08d981 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -226,7 +226,6 @@ static const TypeInfo pcie_bus_info = {
 };
 
 static PCIBus *pci_find_bus_nr(PCIBus *bus, int bus_num);
-static void pci_update_mappings(PCIDevice *d);
 static void pci_irq_handler(void *opaque, int irq_num, int level);
 static void pci_add_option_rom(PCIDevice *pdev, bool is_default_rom, Error **);
 static void pci_del_option_rom(PCIDevice *pdev);
@@ -1335,7 +1334,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
     return new_addr;
 }
 
-static void pci_update_mappings(PCIDevice *d)
+void pci_update_mappings(PCIDevice *d)
 {
     PCIIORegion *r;
     int i;
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 4c4a60c..46606cf 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
 bool msix_is_masked(PCIDevice *dev, unsigned vector);
 void msix_set_pending(PCIDevice *dev, unsigned vector);
 void msix_clr_pending(PCIDevice *dev, int vector);
+int msix_is_pending(PCIDevice *dev, unsigned vector);
 
 int msix_vector_use(PCIDevice *dev, unsigned vector);
 void msix_vector_unuse(PCIDevice *dev, unsigned vector);
@@ -41,6 +42,10 @@ void msix_notify(PCIDevice *dev, unsigned vector);
 
 void msix_reset(PCIDevice *dev);
 
+void msix_init_vector_notifiers(PCIDevice *dev,
+                                MSIVectorUseNotifier use_notifier,
+                                MSIVectorReleaseNotifier release_notifier,
+                                MSIVectorPollNotifier poll_notifier);
 int msix_set_vector_notifiers(PCIDevice *dev,
                               MSIVectorUseNotifier use_notifier,
                               MSIVectorReleaseNotifier release_notifier,
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6be4e0c..bef3e49 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -902,5 +902,6 @@ extern const VMStateDescription vmstate_pci_device;
 }
 
 MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
+void pci_update_mappings(PCIDevice *d);
 
 #endif
-- 
1.8.3.1




* [PATCH V3 11/22] vfio-pci: refactor for cpr
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (9 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 10/22] pci: export functions for cpr Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-19 22:38   ` Alex Williamson
  2021-05-07 12:25 ` [PATCH V3 12/22] vfio-pci: cpr part 1 Steve Sistare
                   ` (13 subsequent siblings)
  24 siblings, 1 reply; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Export vfio_address_spaces and vfio_listener_skipped_section.
Add optional eventfd arg to vfio_add_kvm_msi_virq.
Refactor vector use into a helper vfio_vector_init.
All for use by cpr in a subsequent patch.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/common.c              |  4 ++--
 hw/vfio/pci.c                 | 36 +++++++++++++++++++++++++-----------
 include/hw/vfio/vfio-common.h |  3 +++
 3 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ae5654f..9220e64 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -42,7 +42,7 @@
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
-static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
+VFIOAddressSpaceList vfio_address_spaces =
     QLIST_HEAD_INITIALIZER(vfio_address_spaces);
 
 #ifdef CONFIG_KVM
@@ -534,7 +534,7 @@ static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
     return -1;
 }
 
-static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
             !memory_region_is_iommu(section->mr)) ||
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5c65aa0..7a4fb6c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -411,7 +411,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
 }
 
 static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
-                                  int vector_n, bool msix)
+                                  int vector_n, bool msix, int eventfd)
 {
     int virq;
 
@@ -419,7 +419,9 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
         return;
     }
 
-    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+    if (eventfd >= 0) {
+        event_notifier_init_fd(&vector->kvm_interrupt, eventfd);
+    } else if (event_notifier_init(&vector->kvm_interrupt, 0)) {
         return;
     }
 
@@ -455,6 +457,22 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
     kvm_irqchip_commit_routes(kvm_state);
 }
 
+static void vfio_vector_init(VFIOPCIDevice *vdev, int nr, int eventfd)
+{
+    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+    PCIDevice *pdev = &vdev->pdev;
+
+    vector->vdev = vdev;
+    vector->virq = -1;
+    if (eventfd >= 0) {
+        event_notifier_init_fd(&vector->interrupt, eventfd);
+    } else if (event_notifier_init(&vector->interrupt, 0)) {
+        error_report("vfio: Error: event_notifier_init failed");
+    }
+    vector->use = true;
+    msix_vector_use(pdev, nr);
+}
+
 static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                                    MSIMessage *msg, IOHandler *handler)
 {
@@ -466,14 +484,10 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
 
     vector = &vdev->msi_vectors[nr];
 
+    vfio_vector_init(vdev, nr, -1);
+
     if (!vector->use) {
-        vector->vdev = vdev;
-        vector->virq = -1;
-        if (event_notifier_init(&vector->interrupt, 0)) {
-            error_report("vfio: Error: event_notifier_init failed");
-        }
-        vector->use = true;
-        msix_vector_use(pdev, nr);
+        vfio_vector_init(vdev, nr, -1);
     }
 
     qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -491,7 +505,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
         }
     } else {
         if (msg) {
-            vfio_add_kvm_msi_virq(vdev, vector, nr, true);
+            vfio_add_kvm_msi_virq(vdev, vector, nr, true, -1);
         }
     }
 
@@ -641,7 +655,7 @@ retry:
          * Attempt to enable route through KVM irqchip,
          * default to userspace handling if unavailable.
          */
-        vfio_add_kvm_msi_virq(vdev, vector, i, false);
+        vfio_add_kvm_msi_virq(vdev, vector, i, false, -1);
     }
 
     /* Set interrupt type prior to possible interrupts */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 6141162..00acb85 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -204,6 +204,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
 extern VFIOGroupList vfio_group_list;
+typedef QLIST_HEAD(, VFIOAddressSpace) VFIOAddressSpaceList;
+extern VFIOAddressSpaceList vfio_address_spaces;
 
 bool vfio_mig_active(void);
 int64_t vfio_mig_bytes_transferred(void);
@@ -222,6 +224,7 @@ struct vfio_info_cap_header *
 vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
+bool vfio_listener_skipped_section(MemoryRegionSection *section);
 
 int vfio_spapr_create_window(VFIOContainer *container,
                              MemoryRegionSection *section,
-- 
1.8.3.1




* [PATCH V3 12/22] vfio-pci: cpr part 1
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (10 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 11/22] vfio-pci: refactor " Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-21 22:24   ` Alex Williamson
  2021-05-07 12:25 ` [PATCH V3 13/22] vfio-pci: cpr part 2 Steve Sistare
                   ` (12 subsequent siblings)
  24 siblings, 1 reply; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Enable vfio-pci devices to be saved and restored across an exec restart
of qemu.

At vfio creation time, save the value of vfio container, group, and device
descriptors in the environment.
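
As an illustration of the idea, here is a minimal sketch of recording a
descriptor number in the environment and finding it again.  The names
sketch_setenv_fd and sketch_getenv_fd and the SKETCH_FD_ prefix are
hypothetical.  The patch calls setenv_fd() and getenv_fd() from qemu/env.h,
whose implementation is not shown here and may differ in naming and error
handling.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define SKETCH_FD_PREFIX "SKETCH_FD_"   /* hypothetical prefix */

/* Build PREFIX+name into key; returns 0, or -1 if it does not fit. */
static int sketch_fd_key(char *key, size_t len, const char *name)
{
    int n = snprintf(key, len, SKETCH_FD_PREFIX "%s", name);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Record fd under the given name; returns 0 on success, -1 on error. */
int sketch_setenv_fd(const char *name, int fd)
{
    char key[256], val[16];

    assert(name != NULL);
    if (sketch_fd_key(key, sizeof(key), name)) {
        return -1;
    }
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(key, val, 1);
}

/* Return the recorded fd, or -1 if the name is absent or malformed. */
int sketch_getenv_fd(const char *name)
{
    char key[256];
    const char *val;
    char *end;
    long fd;

    assert(name != NULL);
    if (sketch_fd_key(key, sizeof(key), name)) {
        return -1;
    }
    val = getenv(key);
    if (!val || !*val) {
        return -1;
    }
    fd = strtol(val, &end, 10);
    if (*end != '\0' || fd < 0 || fd > INT_MAX) {
        return -1;
    }
    return (int)fd;
}
```

A negative return from the lookup is what lets the caller fall back to
opening the descriptor normally, as vfio_get_group() does in this patch.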

In cprsave, suspend the use of virtual addresses in DMA mappings with
VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped at a
different VA after exec.  DMA to already-mapped pages continues.  Save
the msi message area as part of vfio-pci vmstate, save the interrupt and
notifier eventfd's in the environment, and clear the close-on-exec flag
for the vfio descriptors.  The flag is not cleared earlier because the
descriptors should not persist across miscellaneous fork and exec calls
that may be performed during normal operation.
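
A minimal sketch of clearing the close-on-exec flag at save time, assuming
plain POSIX fcntl().  sketch_preserve_fd is a hypothetical name.  The patch
calls walkenv(FD_PREFIX, preserve_fd, 0) in cprsave, and the body of
preserve_fd is not shown here, so this is only one plausible shape for it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>

/*
 * Hypothetical helper: make fd survive exec by clearing FD_CLOEXEC.
 * Returns 0 on success, -1 on error.
 */
int sketch_preserve_fd(int fd)
{
    int flags;

    assert(fd >= 0);

    /* Read the descriptor flags so that other bits are kept intact. */
    flags = fcntl(fd, F_GETFD);
    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}
```

Doing this only just before the exec keeps the descriptors from leaking
into unrelated child processes during normal operation.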

On qemu restart, vfio_realize() finds the descriptor env vars, uses
the descriptors, and notes that the device is being reused.  Device and
iommu state is already configured, so operations in vfio_realize that
would modify the configuration are skipped for a reused device, including
vfio ioctl's and writes to PCI configuration space.  The result is that
vfio_realize constructs qemu data structures that reflect the current
state of the device.  However, the reconstruction is not complete until
cprload is called.  cprload loads the msi data and finds eventfds in the
environment.  It rebuilds vector data structures and attaches the
interrupts to the new KVM instance.  cprload then walks the flattened
ranges of the vfio_address_spaces and calls VFIO_IOMMU_MAP_DMA with
VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
starts the VM and suppresses vfio device reset.
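
Before each flattened range is passed to the kernel, vfio_region_remap()
in this patch rounds the start up and the size down to page boundaries.
The arithmetic can be sketched on its own as below.  The 4 KiB page size,
the struct, and the function name are assumptions for illustration, and
the guard for ranges shorter than the skipped bytes is an addition that
the patch does not have.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL                 /* assumed page size */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

struct sketch_range {
    uint64_t iova;   /* page-aligned start */
    uint64_t skip;   /* bytes dropped at the front */
    uint64_t size;   /* whole pages remaining, in bytes */
};

/* Trim [offset, offset + size) so that it covers whole pages only. */
struct sketch_range sketch_page_trim(uint64_t offset, uint64_t size)
{
    struct sketch_range r;

    /* The mask arithmetic below requires a power-of-two page size. */
    assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0);

    r.iova = (offset + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK;
    r.skip = r.iova - offset;
    /* Added guard: avoid unsigned wrap when the range is tiny. */
    r.size = size > r.skip ? (size - r.skip) & SKETCH_PAGE_MASK : 0;
    return r;
}
```

For example, offset 0x1234 with size 0x3000 yields iova 0x2000 and size
0x2000.  The host virtual address is advanced by the same skip so that
iova and vaddr stay in step.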

This functionality is delivered in two patches for clarity.  Part 2 adds
eventfd and vector support.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/pci/msi.c                  |   4 ++
 hw/pci/pci.c                  |   4 ++
 hw/vfio/common.c              |  59 ++++++++++++++++++-
 hw/vfio/cpr.c                 | 131 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/meson.build           |   1 +
 hw/vfio/pci.c                 |  65 +++++++++++++++++++--
 hw/vfio/trace-events          |   1 +
 include/hw/pci/pci.h          |   1 +
 include/hw/vfio/vfio-common.h |   5 ++
 linux-headers/linux/vfio.h    |  27 +++++++++
 migration/cpr.c               |   7 +++
 11 files changed, 298 insertions(+), 7 deletions(-)
 create mode 100644 hw/vfio/cpr.c

diff --git a/hw/pci/msi.c b/hw/pci/msi.c
index 47d2b0f..39de6a7 100644
--- a/hw/pci/msi.c
+++ b/hw/pci/msi.c
@@ -225,6 +225,10 @@ int msi_init(struct PCIDevice *dev, uint8_t offset,
     dev->msi_cap = config_offset;
     dev->cap_present |= QEMU_PCI_CAP_MSI;
 
+    if (dev->reused) {
+        return 0;
+    }
+
     pci_set_word(dev->config + msi_flags_off(dev), flags);
     pci_set_word(dev->wmask + msi_flags_off(dev),
                  PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE);
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index e08d981..27019ca 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -308,6 +308,10 @@ static void pci_do_device_reset(PCIDevice *dev)
 {
     int r;
 
+    if (dev->reused) {
+        return;
+    }
+
     pci_device_deassert_intx(dev);
     assert(dev->irq_state == 0);
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9220e64..00d07b2 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -31,6 +31,7 @@
 #include "exec/memory.h"
 #include "exec/ram_addr.h"
 #include "hw/hw.h"
+#include "qemu/env.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/range.h"
@@ -440,6 +441,10 @@ static int vfio_dma_unmap(VFIOContainer *container,
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
     }
 
+    if (container->reused) {
+        return 0;
+    }
+
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
         /*
          * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -463,6 +468,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
         return -errno;
     }
 
+    if (unmap.size != size) {
+        warn_report("VFIO_UNMAP_DMA(0x%" HWADDR_PRIx ", 0x" RAM_ADDR_FMT
+                    ") only unmaps 0x%llx", iova, size, unmap.size);
+    }
+
     return 0;
 }
 
@@ -477,6 +487,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         .size = size,
     };
 
+    if (container->reused) {
+        return 0;
+    }
+
     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
     }
@@ -1603,6 +1617,10 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
     if (iommu_type < 0) {
         return iommu_type;
     }
+    if (container->reused) {
+        container->iommu_type = iommu_type;
+        return 0;
+    }
 
     ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
     if (ret) {
@@ -1703,6 +1721,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 {
     VFIOContainer *container;
     int ret, fd;
+    bool reused;
+    char name[40];
     VFIOAddressSpace *space;
 
     space = vfio_get_address_space(as);
@@ -1739,16 +1759,29 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         return ret;
     }
 
+    snprintf(name, sizeof(name), "vfio_container_%d", group->groupid);
+    fd = getenv_fd(name);
+    reused = (fd >= 0);
+
     QLIST_FOREACH(container, &space->containers, next) {
+        if (fd >= 0 && container->fd == fd) {
+            group->container = container;
+            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+            return 0;
+        }
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
             group->container = container;
             QLIST_INSERT_HEAD(&container->group_list, group, container_next);
             vfio_kvm_device_add_group(group);
+            setenv_fd(name, container->fd);
             return 0;
         }
     }
 
-    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
+    if (fd < 0) {
+        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
+    }
+
     if (fd < 0) {
         error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
         ret = -errno;
@@ -1766,6 +1799,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container = g_malloc0(sizeof(*container));
     container->space = space;
     container->fd = fd;
+    container->reused = reused;
     container->error = NULL;
     container->dirty_pages_supported = false;
     QLIST_INIT(&container->giommu_list);
@@ -1893,6 +1927,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }
 
     container->initialized = true;
+    setenv_fd(name, fd);
 
     return 0;
 listener_release_exit:
@@ -1920,6 +1955,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
+    unsetenv_fdv("vfio_container_%d", group->groupid);
 
     /*
      * Explicitly release the listener first before unset container,
@@ -1978,7 +2014,12 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
     group = g_malloc0(sizeof(*group));
 
     snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open_old(path, O_RDWR);
+
+    group->fd = getenv_fd(path);
+    if (group->fd < 0) {
+        group->fd = qemu_open_old(path, O_RDWR);
+    }
+
     if (group->fd < 0) {
         error_setg_errno(errp, errno, "failed to open %s", path);
         goto free_group_exit;
@@ -2012,6 +2053,8 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 
     QLIST_INSERT_HEAD(&vfio_group_list, group, next);
 
+    setenv_fd(path, group->fd);
+
     return group;
 
 close_fd_exit:
@@ -2036,6 +2079,7 @@ void vfio_put_group(VFIOGroup *group)
     vfio_disconnect_container(group);
     QLIST_REMOVE(group, next);
     trace_vfio_put_group(group->fd);
+    unsetenv_fdv("/dev/vfio/%d", group->groupid);
     close(group->fd);
     g_free(group);
 
@@ -2049,8 +2093,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 {
     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     int ret, fd;
+    bool reused;
+
+    fd = getenv_fd(name);
+    reused = (fd >= 0);
+    if (fd < 0) {
+        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    }
 
-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
     if (fd < 0) {
         error_setg_errno(errp, errno, "error getting device from group %d",
                          group->groupid);
@@ -2095,6 +2145,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
     vbasedev->num_irqs = dev_info.num_irqs;
     vbasedev->num_regions = dev_info.num_regions;
     vbasedev->flags = dev_info.flags;
+    vbasedev->reused = reused;
+    setenv_fd(name, fd);
 
     trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
                           dev_info.num_irqs);
@@ -2111,6 +2163,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
     QLIST_REMOVE(vbasedev, next);
     vbasedev->group = NULL;
     trace_vfio_put_base_device(vbasedev->fd);
+    unsetenv_fd(vbasedev->name);
     close(vbasedev->fd);
 }
 
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
new file mode 100644
index 0000000..c5ad9f2
--- /dev/null
+++ b/hw/vfio/cpr.c
@@ -0,0 +1,131 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "hw/vfio/vfio-common.h"
+#include "sysemu/kvm.h"
+#include "qapi/error.h"
+#include "trace.h"
+
+static int
+vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
+        .iova = 0,
+        .size = 0,
+    };
+    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
+        return -errno;
+    }
+    return 0;
+}
+
+static int vfio_dma_map_vaddr(VFIOContainer *container, hwaddr iova,
+                              ram_addr_t size, void *vaddr,
+                              Error **errp)
+{
+    struct vfio_iommu_type1_dma_map map = {
+        .argsz = sizeof(map),
+        .flags = VFIO_DMA_MAP_FLAG_VADDR,
+        .vaddr = (__u64)(uintptr_t)vaddr,
+        .iova = iova,
+        .size = size,
+    };
+    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+        error_setg_errno(errp, errno,
+                         "vfio_dma_map_vaddr(iova 0x%" HWADDR_PRIx ", size 0x"
+                         RAM_ADDR_FMT ", va %p)", iova, size, vaddr);
+        return -errno;
+    }
+    return 0;
+}
+
+static int
+vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
+{
+    MemoryRegion *mr = section->mr;
+    VFIOContainer *container = handle;
+    const char *name = memory_region_name(mr);
+    ram_addr_t size = int128_get64(section->size);
+    hwaddr offset, iova, roundup;
+    void *vaddr;
+
+    if (vfio_listener_skipped_section(section) || memory_region_is_iommu(mr)) {
+        return 0;
+    }
+
+    offset = section->offset_within_address_space;
+    iova = TARGET_PAGE_ALIGN(offset);
+    roundup = iova - offset;
+    size = (size - roundup) & TARGET_PAGE_MASK;
+    vaddr = memory_region_get_ram_ptr(mr) +
+            section->offset_within_region + roundup;
+
+    trace_vfio_region_remap(name, container->fd, iova, iova + size - 1, vaddr);
+    return vfio_dma_map_vaddr(container, iova, size, vaddr, errp);
+}
+
+bool vfio_cpr_capable(VFIOContainer *container, Error **errp)
+{
+    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
+        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
+        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
+                         "or VFIO_UNMAP_ALL");
+        return false;
+    } else {
+        return true;
+    }
+}
+
+int vfio_cprsave(Error **errp)
+{
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            if (!vfio_cpr_capable(container, errp)) {
+                return 1;
+            }
+            if (vfio_dma_unmap_vaddr_all(container, errp)) {
+                return 1;
+            }
+        }
+    }
+    return 0;
+}
+
+int vfio_cprload(Error **errp)
+{
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            if (!vfio_cpr_capable(container, errp)) {
+                return 1;
+            }
+            container->reused = false;
+            if (as_flat_walk(space->as, vfio_region_remap, container, errp)) {
+                return 1;
+            }
+        }
+    }
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vbasedev->reused = false;
+        }
+    }
+    return 0;
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af29..e247b2b 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -5,6 +5,7 @@ vfio_ss.add(files(
   'migration.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
+  'cpr.c',
   'display.c',
   'pci-quirks.c',
   'pci.c',
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7a4fb6c..f7ac9f03 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -29,6 +29,8 @@
 #include "hw/qdev-properties.h"
 #include "hw/qdev-properties-system.h"
 #include "migration/vmstate.h"
+#include "migration/cpr.h"
+#include "qemu/env.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/module.h"
@@ -1612,6 +1614,14 @@ static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled)
     }
 }
 
+static void vfio_config_sync(VFIOPCIDevice *vdev, uint32_t offset, size_t len)
+{
+    if (pread(vdev->vbasedev.fd, vdev->pdev.config + offset, len,
+          vdev->config_offset + offset) != len) {
+        error_report("vfio_config_sync pread failed");
+    }
+}
+
 static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
 {
     VFIOBAR *bar = &vdev->bars[nr];
@@ -1652,6 +1662,7 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
 static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
 {
     VFIOBAR *bar = &vdev->bars[nr];
+    PCIDevice *pdev = &vdev->pdev;
     char *name;
 
     if (!bar->size) {
@@ -1672,7 +1683,10 @@ static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
         }
     }
 
-    pci_register_bar(&vdev->pdev, nr, bar->type, bar->mr);
+    pci_register_bar(pdev, nr, bar->type, bar->mr);
+    if (pdev->reused) {
+        vfio_config_sync(vdev, pci_bar(pdev, nr), 8);
+    }
 }
 
 static void vfio_bars_register(VFIOPCIDevice *vdev)
@@ -2884,6 +2898,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         vfio_put_group(group);
         goto error;
     }
+    pdev->reused = vdev->vbasedev.reused;
 
     vfio_populate_device(vdev, &err);
     if (err) {
@@ -3046,9 +3061,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
                                              vfio_intx_routing_notifier);
         vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
         kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
-        ret = vfio_intx_enable(vdev, errp);
-        if (ret) {
-            goto out_deregister;
+        if (!pdev->reused) {
+            ret = vfio_intx_enable(vdev, errp);
+            if (ret) {
+                goto out_deregister;
+            }
         }
     }
 
@@ -3098,6 +3115,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
 
+    vfio_config_sync(vdev, pdev->msix_cap + PCI_MSIX_FLAGS, 2);
+    if (pdev->reused) {
+        pci_update_mappings(pdev);
+    }
+
     return;
 
 out_deregister:
@@ -3153,6 +3175,10 @@ static void vfio_pci_reset(DeviceState *dev)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(dev);
 
+    if (vdev->pdev.reused) {
+        return;
+    }
+
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
     vfio_pci_pre_reset(vdev);
@@ -3260,6 +3286,36 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static int vfio_pci_post_load(void *opaque, int version_id)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+    bool enabled;
+
+    pdev->reused = false;
+    enabled = pci_get_word(pdev->config + PCI_COMMAND) & PCI_COMMAND_MASTER;
+    memory_region_set_enabled(&pdev->bus_master_enable_region, enabled);
+
+    return 0;
+}
+
+static bool vfio_pci_needed(void *opaque)
+{
+    return cpr_active();
+}
+
+static const VMStateDescription vfio_pci_vmstate = {
+    .name = "vfio-pci",
+    .unmigratable = 1,
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .post_load = vfio_pci_post_load,
+    .needed = vfio_pci_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3267,6 +3323,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     device_class_set_props(dc, vfio_pci_dev_properties);
+    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 079f53a..0f8b166 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
+vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index bef3e49..add7f46 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -360,6 +360,7 @@ struct PCIDevice {
     /* ID of standby device in net_failover pair */
     char *failover_pair_id;
     uint32_t acpi_index;
+    bool reused;
 };
 
 void pci_register_bar(PCIDevice *pci_dev, int region_num,
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 00acb85..b46d850 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -85,6 +85,7 @@ typedef struct VFIOContainer {
     Error *error;
     bool initialized;
     bool dirty_pages_supported;
+    bool reused;
     uint64_t dirty_pgsizes;
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
@@ -124,6 +125,7 @@ typedef struct VFIODevice {
     bool no_mmap;
     bool ram_block_discard_allowed;
     bool enable_migration;
+    bool reused;
     VFIODeviceOps *ops;
     unsigned int num_irqs;
     unsigned int num_regions;
@@ -200,6 +202,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+int vfio_cprsave(Error **errp);
+int vfio_cprload(Error **errp);
+bool vfio_cpr_capable(VFIOContainer *container, Error **errp);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 609099e..bc3a66e 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -46,6 +46,12 @@
  */
 #define VFIO_NOIOMMU_IOMMU		8
 
+/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
+#define VFIO_UNMAP_ALL                        9
+
+/* Supports VFIO DMA map and unmap with the VADDR flag */
+#define VFIO_UPDATE_VADDR              10
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -1074,12 +1080,22 @@ struct vfio_iommu_type1_info_dma_avail {
  *
  * Map process virtual addresses to IO virtual addresses using the
  * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * If flags & VFIO_DMA_MAP_FLAG_VADDR, record the new base vaddr for iova, and
+ * unblock translation of host virtual addresses in the iova range.  The vaddr
+ * must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR.  To
+ * maintain memory consistency within the user application, the updated vaddr
+ * must address the same memory object as originally mapped.  Failure to do so
+ * will result in user memory corruption and/or device misbehavior.  iova and
+ * size must match those in the original MAP_DMA call.  Protection is not
+ * changed, and the READ & WRITE flags must be 0.
  */
 struct vfio_iommu_type1_dma_map {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
 #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+#define VFIO_DMA_MAP_FLAG_VADDR (1 << 2)
 	__u64	vaddr;				/* Process virtual address */
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
@@ -1102,6 +1118,7 @@ struct vfio_bitmap {
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
  * succeed.
+ *
  * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
  * before unmapping IO virtual addresses. When this flag is set, the user must
  * provide a struct vfio_bitmap in data[]. User must provide zero-allocated
@@ -1111,11 +1128,21 @@ struct vfio_bitmap {
  * indicates that the page at that offset from iova is dirty. A Bitmap of the
  * pages in the range of unmapped size is returned in the user-provided
  * vfio_bitmap.data.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
+ * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
+ * virtual addresses in the iova range.  Tasks that attempt to translate an
+ * iova's vaddr will block.  DMA to already-mapped pages continues.  This
+ * cannot be combined with the get-dirty-bitmap flag.
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
+#define VFIO_DMA_UNMAP_FLAG_ALL              (1 << 1)
+#define VFIO_DMA_UNMAP_FLAG_VADDR            (1 << 2)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
 	__u8    data[];
diff --git a/migration/cpr.c b/migration/cpr.c
index e0da1cf..e9a189b 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -132,6 +132,9 @@ void cprsave(const char *file, CprMode mode, Error **errp)
         shutdown_action = SHUTDOWN_ACTION_POWEROFF;
         qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
     } else if (restart) {
+        if (vfio_cprsave(errp)) {
+            goto err;
+        }
         walkenv(FD_PREFIX, preserve_fd, 0);
         setenv("QEMU_START_FREEZE", "", 1);
         qemu_system_exec_request();
@@ -176,6 +179,10 @@ void cprload(const char *file, Error **errp)
         return;
     }
 
+    if (vfio_cprload(errp)) {
+        return;
+    }
+
     state = global_state_get_runstate();
     if (state == RUN_STATE_RUNNING) {
         vm_start();
-- 
1.8.3.1




* [PATCH V3 13/22] vfio-pci: cpr part 2
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (11 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 12/22] vfio-pci: cpr part 1 Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-21 22:24   ` Alex Williamson
  2021-05-07 12:25 ` [PATCH V3 14/22] vhost: reset vhost devices upon cprsave Steve Sistare
                   ` (11 subsequent siblings)
  24 siblings, 1 reply; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Finish cpr for vfio-pci by preserving eventfd's and vector state.
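
The eventfd's are keyed in the environment by device name, a short tag,
and an index, using the "%s_%s_%d" pattern of setenv_event_fd() in this
patch.  A standalone sketch of that naming follows.  The function name and
the device name in the usage note are hypothetical, and the truncation
check is an addition that the patch does not perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "<dev>_<kind>_<nr>" into buf.
 * Returns 0 on success, -1 if the name does not fit in len bytes.
 */
int sketch_event_env_name(char *buf, size_t len, const char *dev,
                          const char *kind, int nr)
{
    int n;

    assert(buf != NULL && dev != NULL && kind != NULL);

    n = snprintf(buf, len, "%s_%s_%d", dev, kind, nr);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With dev "0000:3a:00.0", kind "err", and nr 0, the name is
"0000:3a:00.0_err_0".  The same name is rebuilt after exec to look the
descriptor up again.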

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 108 insertions(+), 2 deletions(-)
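
Note on the naming scheme: setenv_event_fd() and getenv_event_fd() build a
key of the form <device>_<name>_<nr> and pass it to setenv_fd() and
getenv_fd(), which come from elsewhere in the series and are not shown in
this patch.  A minimal stand-alone sketch of that round trip follows.  The
helpers save_fd_env() and load_fd_env() are hypothetical names used only for
illustration, and storing the descriptor as a decimal string is an
assumption, not a copy of the series' implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helpers for illustration; not the series' setenv_fd/getenv_fd. */

/* Build "<dev>_<name>_<nr>", the same key shape used in the patch. */
static void fd_env_name(char *buf, size_t len, const char *dev,
                        const char *name, int nr)
{
    snprintf(buf, len, "%s_%s_%d", dev, name, nr);
}

/* Record fd under the derived key.  Returns 0 on success, -1 on error. */
static int save_fd_env(const char *dev, const char *name, int nr, int fd)
{
    char key[256], val[16];

    if (fd < 0) {
        return -1;              /* nothing to preserve */
    }
    fd_env_name(key, sizeof(key), dev, name, nr);
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(key, val, 1);
}

/* Return the recorded fd, or -1 if no valid entry exists. */
static int load_fd_env(const char *dev, const char *name, int nr)
{
    char key[256];
    const char *val;
    char *end;
    long fd;

    fd_env_name(key, sizeof(key), dev, name, nr);
    val = getenv(key);
    if (!val || !*val) {
        return -1;
    }
    fd = strtol(val, &end, 10);
    if (*end != '\0' || fd < 0) {
        return -1;              /* reject malformed values */
    }
    return (int)fd;
}
```

In this sketch a lookup that finds nothing returns -1, which matches how the
callers in the patch test "fd >= 0" before reusing a descriptor.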

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index f7ac9f03..e983db4 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2661,6 +2661,27 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
     vfio_put_base_device(&vdev->vbasedev);
 }
 
+static void setenv_event_fd(VFIOPCIDevice *vdev, int nr, const char *name,
+                            EventNotifier *ev)
+{
+    char envname[256];
+    int fd = event_notifier_get_fd(ev);
+    const char *vfname = vdev->vbasedev.name;
+
+    if (fd >= 0) {
+        snprintf(envname, sizeof(envname), "%s_%s_%d", vfname, name, nr);
+        setenv_fd(envname, fd);
+    }
+}
+
+static int getenv_event_fd(VFIOPCIDevice *vdev, int nr, const char *name)
+{
+    char envname[256];
+    const char *vfname = vdev->vbasedev.name;
+    snprintf(envname, sizeof(envname), "%s_%s_%d", vfname, name, nr);
+    return getenv_fd(envname);
+}
+
 static void vfio_err_notifier_handler(void *opaque)
 {
     VFIOPCIDevice *vdev = opaque;
@@ -2692,7 +2713,13 @@ static void vfio_err_notifier_handler(void *opaque)
 static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
 {
     Error *err = NULL;
-    int32_t fd;
+    int32_t fd = getenv_event_fd(vdev, 0, "err");
+
+    if (fd >= 0) {
+        event_notifier_init_fd(&vdev->err_notifier, fd);
+        qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
+        return;
+    }
 
     if (!vdev->pci_aer) {
         return;
@@ -2753,7 +2780,14 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
     struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info),
                                       .index = VFIO_PCI_REQ_IRQ_INDEX };
     Error *err = NULL;
-    int32_t fd;
+    int32_t fd = getenv_event_fd(vdev, 0, "req");
+
+    if (fd >= 0) {
+        event_notifier_init_fd(&vdev->req_notifier, fd);
+        qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
+        vdev->req_enabled = true;
+        return;
+    }
 
     if (!(vdev->features & VFIO_FEATURE_ENABLE_REQ)) {
         return;
@@ -3286,12 +3320,82 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static int vfio_pci_pre_save(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    int i;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        if (vector->use) {
+            setenv_event_fd(vdev, i, "interrupt", &vector->interrupt);
+            if (vector->virq >= 0) {
+                setenv_event_fd(vdev, i, "kvm_interrupt",
+                                &vector->kvm_interrupt);
+            }
+        }
+    }
+    setenv_event_fd(vdev, 0, "err", &vdev->err_notifier);
+    setenv_event_fd(vdev, 0, "req", &vdev->req_notifier);
+    return 0;
+}
+
+static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
+{
+    int i, fd;
+    bool pending = false;
+    PCIDevice *pdev = &vdev->pdev;
+
+    vdev->nr_vectors = nr_vectors;
+    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
+    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
+
+    for (i = 0; i < nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+
+        fd = getenv_event_fd(vdev, i, "interrupt");
+        if (fd >= 0) {
+            vfio_vector_init(vdev, i, fd);
+            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+        }
+
+        fd = getenv_event_fd(vdev, i, "kvm_interrupt");
+        if (fd >= 0) {
+            vfio_add_kvm_msi_virq(vdev, vector, i, msix, fd);
+        }
+
+        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
+            set_bit(i, vdev->msix->pending);
+            pending = true;
+        }
+    }
+
+    if (msix) {
+        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
+    }
+}
+
 static int vfio_pci_post_load(void *opaque, int version_id)
 {
     VFIOPCIDevice *vdev = opaque;
     PCIDevice *pdev = &vdev->pdev;
+    int nr_vectors;
     bool enabled;
 
+    if (msix_enabled(pdev)) {
+        nr_vectors = vdev->msix->entries;
+        vfio_claim_vectors(vdev, nr_vectors, true);
+        msix_init_vector_notifiers(pdev, vfio_msix_vector_use,
+                                   vfio_msix_vector_release, NULL);
+
+    } else if (msi_enabled(pdev)) {
+        nr_vectors = msi_nr_vectors_allocated(pdev);
+        vfio_claim_vectors(vdev, nr_vectors, false);
+
+    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+        error_report("vfio_pci_post_load does not support INTX");
+    }
+
     pdev->reused = false;
     enabled = pci_get_word(pdev->config + PCI_COMMAND) & PCI_COMMAND_MASTER;
     memory_region_set_enabled(&pdev->bus_master_enable_region, enabled);
@@ -3310,8 +3414,10 @@ static const VMStateDescription vfio_pci_vmstate = {
     .version_id = 0,
     .minimum_version_id = 0,
     .post_load = vfio_pci_post_load,
+    .pre_save = vfio_pci_pre_save,
     .needed = vfio_pci_needed,
     .fields = (VMStateField[]) {
+        VMSTATE_MSIX(pdev, VFIOPCIDevice),
         VMSTATE_END_OF_LIST()
     }
 };
-- 
1.8.3.1




* [PATCH V3 14/22] vhost: reset vhost devices upon cprsave
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (12 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 13/22] vfio-pci: cpr part 2 Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 15/22] hostmem-memfd: cpr support Steve Sistare
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

A vhost device is implicitly preserved across re-exec because its fd is not
closed, and the value of the fd is specified on the command line for the
new qemu to find.  However, new qemu issues a VHOST_SET_OWNER ioctl,
which fails because the device already has an owner.  To fix, reset the
owner prior to exec.

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/virtio/vhost.c         | 11 +++++++++++
 include/hw/virtio/vhost.h |  1 +
 migration/cpr.c           |  1 +
 3 files changed, 13 insertions(+)
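
Note: the ownership conflict described in the commit message can be modelled
in a few lines.  This is a toy model, not kernel or qemu code: struct
vhost_model and both functions are hypothetical, and the only behaviour
assumed is that claiming ownership of an already-owned device fails with
-EBUSY while a reset clears the owner.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the ownership state of a vhost device. */
struct vhost_model {
    bool has_owner;
};

/* Claim ownership; fails with -EBUSY if the device is already owned. */
static int model_set_owner(struct vhost_model *dev)
{
    if (dev->has_owner) {
        return -EBUSY;
    }
    dev->has_owner = true;
    return 0;
}

/* Drop ownership, as the reset before exec is meant to do. */
static void model_reset_owner(struct vhost_model *dev)
{
    dev->has_owner = false;
}
```

Under these assumptions, a second claim without an intervening reset fails,
which is the situation new qemu would hit after exec, and a reset before
exec lets the claim succeed.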

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e2163a0..8c0c9c3 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1820,6 +1820,17 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     hdev->vdev = NULL;
 }
 
+void vhost_dev_reset_all(void)
+{
+    struct vhost_dev *dev;
+
+    QLIST_FOREACH(dev, &vhost_devices, entry) {
+        if (dev->vhost_ops->vhost_reset_device(dev) < 0) {
+            VHOST_OPS_DEBUG("vhost_reset_device failed");
+        }
+    }
+}
+
 int vhost_net_set_backend(struct vhost_dev *hdev,
                           struct vhost_vring_file *file)
 {
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 4a8bc75..71704d4 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -106,6 +106,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
+void vhost_dev_reset_all(void);
 int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
 
diff --git a/migration/cpr.c b/migration/cpr.c
index e9a189b..3cde26f 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -136,6 +136,7 @@ void cprsave(const char *file, CprMode mode, Error **errp)
             goto err;
         }
         walkenv(FD_PREFIX, preserve_fd, 0);
+        vhost_dev_reset_all();
         setenv("QEMU_START_FREEZE", "", 1);
         qemu_system_exec_request();
     }
-- 
1.8.3.1




* [PATCH V3 15/22] hostmem-memfd: cpr support
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (13 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 14/22] vhost: reset vhost devices upon cprsave Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 16/22] chardev: cpr framework Steve Sistare
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Preserve memory-backend-memfd memory objects during cpr.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 backends/hostmem-memfd.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)
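
Note: the mechanism relied on here is that a memfd keeps its contents for as
long as a descriptor stays open, so a later mmap of the same fd sees the same
pages at a possibly different virtual address.  A minimal Linux-only sketch,
assuming glibc's memfd_create() wrapper is available; the function names
ram_memfd_create(), fd_keep_across_exec() and ram_map() are hypothetical and
are not part of qemu.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an anonymous memory file of the given size; -1 on error. */
static int ram_memfd_create(const char *name, size_t size)
{
    int fd = memfd_create(name, MFD_CLOEXEC);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Clear FD_CLOEXEC so the descriptor stays open across exec. */
static int fd_keep_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Map the memfd shared and writable; NULL on failure. */
static void *ram_map(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

Mapping, writing, unmapping and mapping again yields the data written
earlier, which is the property the restart mode depends on once the fd number
has been handed to the new process.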

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 69b0ae3..3503c89 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -15,6 +15,7 @@
 #include "sysemu/sysemu.h"
 #include "qom/object_interfaces.h"
 #include "qemu/memfd.h"
+#include "qemu/env.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
 #include "qom/object.h"
@@ -36,23 +37,25 @@ static void
 memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
     HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend);
-    char *name;
-    int fd;
+    char *name = host_memory_backend_get_name(backend);
+    int fd = getenv_fd(name);
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
         return;
     }
 
-    fd = qemu_memfd_create(TYPE_MEMORY_BACKEND_MEMFD, backend->size,
-                           m->hugetlb, m->hugetlbsize, m->seal ?
-                           F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL : 0,
-                           errp);
-    if (fd == -1) {
-        return;
+    if (fd < 0) {
+        fd = qemu_memfd_create(TYPE_MEMORY_BACKEND_MEMFD, backend->size,
+                               m->hugetlb, m->hugetlbsize, m->seal ?
+                               F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL : 0,
+                               errp);
+        if (fd == -1) {
+            return;
+        }
+        setenv_fd(name, fd);
     }
 
-    name = host_memory_backend_get_name(backend);
     memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
                                    name, backend->size,
                                    backend->share, fd, 0, errp);
-- 
1.8.3.1




* [PATCH V3 16/22] chardev: cpr framework
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (14 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 15/22] hostmem-memfd: cpr support Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 14:33   ` Eric Blake
  2021-05-07 12:25 ` [PATCH V3 17/22] chardev: cpr for simple devices Steve Sistare
                   ` (8 subsequent siblings)
  24 siblings, 1 reply; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add QEMU_CHAR_FEATURE_CPR for devices that support cpr.
Add the chardev close_on_cpr option for devices that can be closed on cpr
and reopened after exec.
cpr is allowed only if either QEMU_CHAR_FEATURE_CPR or close_on_cpr is set
for all chardevs in the configuration.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char.c         | 41 ++++++++++++++++++++++++++++++++++++++---
 include/chardev/char.h |  5 +++++
 migration/cpr.c        |  3 +++
 qapi/char.json         |  5 ++++-
 qemu-options.hx        | 26 ++++++++++++++++++++++----
 5 files changed, 72 insertions(+), 8 deletions(-)
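
Note: the admission rule in the commit message reduces to "every chardev
either supports cpr or is closed on cpr".  A stand-alone sketch of that
predicate follows.  struct chardev_model and both functions are hypothetical
stand-ins for illustration; the patch itself walks the chardev object tree
with object_child_foreach().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the two per-chardev properties. */
struct chardev_model {
    const char *label;
    bool feature_cpr;   /* backend sets QEMU_CHAR_FEATURE_CPR */
    bool close_on_cpr;  /* user or subsystem asked to close it on cpr */
};

/* Return the first chardev that blocks cpr, or NULL if none does. */
static const struct chardev_model *
first_cpr_blocker(const struct chardev_model *devs, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (!devs[i].feature_cpr && !devs[i].close_on_cpr) {
            return &devs[i];
        }
    }
    return NULL;
}

/* cpr is allowed only if no chardev blocks it. */
static bool all_cpr_capable(const struct chardev_model *devs, size_t n)
{
    return first_cpr_blocker(devs, n) == NULL;
}
```

Returning the blocking device, rather than only a boolean, mirrors the patch's
choice to name the offending chardev in the error message.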

diff --git a/chardev/char.c b/chardev/char.c
index 398f09d..596d4f9 100644
--- a/chardev/char.c
+++ b/chardev/char.c
@@ -37,6 +37,7 @@
 #include "qemu/help_option.h"
 #include "qemu/module.h"
 #include "qemu/option.h"
+#include "qemu/env.h"
 #include "qemu/id.h"
 #include "qemu/coroutine.h"
 #include "qemu/yank.h"
@@ -240,6 +241,9 @@ static void qemu_char_open(Chardev *chr, ChardevBackend *backend,
     ChardevClass *cc = CHARDEV_GET_CLASS(chr);
     /* Any ChardevCommon member would work */
     ChardevCommon *common = backend ? backend->u.null.data : NULL;
+    char fdname[40];
+
+    chr->close_on_cpr = (common && common->close_on_cpr);
 
     if (common && common->has_logfile) {
         int flags = O_WRONLY | O_CREAT;
@@ -249,7 +253,14 @@ static void qemu_char_open(Chardev *chr, ChardevBackend *backend,
         } else {
             flags |= O_TRUNC;
         }
-        chr->logfd = qemu_open_old(common->logfile, flags, 0666);
+        snprintf(fdname, sizeof(fdname), "%s_log", chr->label);
+        chr->logfd = getenv_fd(fdname);
+        if (chr->logfd < 0) {
+            chr->logfd = qemu_open_old(common->logfile, flags, 0666);
+            if (chr->logfd >= 0 && !chr->close_on_cpr) {
+                setenv_fd(fdname, chr->logfd);
+            }
+        }
         if (chr->logfd < 0) {
             error_setg_errno(errp, errno,
                              "Unable to open logfile %s",
@@ -301,11 +312,12 @@ static void char_finalize(Object *obj)
     if (chr->be) {
         chr->be->chr = NULL;
     }
-    g_free(chr->filename);
-    g_free(chr->label);
     if (chr->logfd != -1) {
         close(chr->logfd);
+        unsetenv_fdv("%s_log", chr->label);
     }
+    g_free(chr->filename);
+    g_free(chr->label);
     qemu_mutex_destroy(&chr->chr_write_lock);
 }
 
@@ -505,6 +517,8 @@ void qemu_chr_parse_common(QemuOpts *opts, ChardevCommon *backend)
 
     backend->has_logappend = true;
     backend->logappend = qemu_opt_get_bool(opts, "logappend", false);
+
+    backend->close_on_cpr = qemu_opt_get_bool(opts, "close-on-cpr", false);
 }
 
 static const ChardevClass *char_get_class(const char *driver, Error **errp)
@@ -940,6 +954,9 @@ QemuOptsList qemu_chardev_opts = {
         },{
             .name = "abstract",
             .type = QEMU_OPT_BOOL,
+        },{
+            .name = "close-on-cpr",
+            .type = QEMU_OPT_BOOL,
 #endif
         },
         { /* end of list */ }
@@ -1207,6 +1224,24 @@ GSource *qemu_chr_timeout_add_ms(Chardev *chr, guint ms,
     return source;
 }
 
+static int chr_cpr_capable(Object *obj, void *opaque)
+{
+    Chardev *chr = (Chardev *)obj;
+    Error **errp = opaque;
+
+    if (qemu_chr_has_feature(chr, QEMU_CHAR_FEATURE_CPR) || chr->close_on_cpr) {
+        return 0;
+    }
+    error_setg(errp, "error: chardev %s -> %s is not capable of cpr",
+               chr->label, chr->filename);
+    return 1;
+}
+
+bool qemu_chr_cpr_capable(Error **errp)
+{
+    return !object_child_foreach(get_chardevs_root(), chr_cpr_capable, errp);
+}
+
 void qemu_chr_cleanup(void)
 {
     object_unparent(get_chardevs_root());
diff --git a/include/chardev/char.h b/include/chardev/char.h
index 7c0444f..e488ad1 100644
--- a/include/chardev/char.h
+++ b/include/chardev/char.h
@@ -50,6 +50,8 @@ typedef enum {
     /* Whether the gcontext can be changed after calling
      * qemu_chr_be_update_read_handlers() */
     QEMU_CHAR_FEATURE_GCONTEXT,
+    /* Whether the device supports cpr */
+    QEMU_CHAR_FEATURE_CPR,
 
     QEMU_CHAR_FEATURE_LAST,
 } ChardevFeature;
@@ -67,6 +69,7 @@ struct Chardev {
     int be_open;
     /* used to coordinate the chardev-change special-case: */
     bool handover_yank_instance;
+    bool close_on_cpr;
     GSource *gsource;
     GMainContext *gcontext;
     DECLARE_BITMAP(features, QEMU_CHAR_FEATURE_LAST);
@@ -291,4 +294,6 @@ void resume_mux_open(void);
 /* console.c */
 void qemu_chr_parse_vc(QemuOpts *opts, ChardevBackend *backend, Error **errp);
 
+bool qemu_chr_cpr_capable(Error **errp);
+
 #endif
diff --git a/migration/cpr.c b/migration/cpr.c
index 3cde26f..8dfd5f1 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -132,6 +132,9 @@ void cprsave(const char *file, CprMode mode, Error **errp)
         shutdown_action = SHUTDOWN_ACTION_POWEROFF;
         qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
     } else if (restart) {
+        if (!qemu_chr_cpr_capable(errp)) {
+            goto err;
+        }
         if (vfio_cprsave(errp)) {
             goto err;
         }
diff --git a/qapi/char.json b/qapi/char.json
index 6413970..dea5dad 100644
--- a/qapi/char.json
+++ b/qapi/char.json
@@ -204,12 +204,15 @@
 # @logfile: The name of a logfile to save output
 # @logappend: true to append instead of truncate
 #             (default to false to truncate)
+# @close-on-cpr: if true, close the device's fd on cprsave.  Defaults to
+#                false.  (Since 6.0)
 #
 # Since: 2.6
 ##
 { 'struct': 'ChardevCommon',
   'data': { '*logfile': 'str',
-            '*logappend': 'bool' } }
+            '*logappend': 'bool',
+            '*close-on-cpr': 'bool' } }
 
 ##
 # @ChardevFile:
diff --git a/qemu-options.hx b/qemu-options.hx
index 3392ac0..ef2d24a 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -3071,43 +3071,57 @@ DEFHEADING(Character device options:)
 
 DEF("chardev", HAS_ARG, QEMU_OPTION_chardev,
     "-chardev help\n"
-    "-chardev null,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "-chardev null,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off][,close-on-cpr=on|off]\n"
     "-chardev socket,id=id[,host=host],port=port[,to=to][,ipv4=on|off][,ipv6=on|off][,nodelay=on|off][,reconnect=seconds]\n"
     "         [,server=on|off][,wait=on|off][,telnet=on|off][,websocket=on|off][,reconnect=seconds][,mux=on|off]\n"
-    "         [,logfile=PATH][,logappend=on|off][,tls-creds=ID][,tls-authz=ID] (tcp)\n"
+    "         [,logfile=PATH][,logappend=on|off][,tls-creds=ID][,tls-authz=ID][,close-on-cpr=on|off] (tcp)\n"
     "-chardev socket,id=id,path=path[,server=on|off][,wait=on|off][,telnet=on|off][,websocket=on|off][,reconnect=seconds]\n"
-    "         [,mux=on|off][,logfile=PATH][,logappend=on|off][,abstract=on|off][,tight=on|off] (unix)\n"
+    "         [,mux=on|off][,logfile=PATH][,logappend=on|off][,abstract=on|off][,tight=on|off][,close-on-cpr=on|off] (unix)\n"
     "-chardev udp,id=id[,host=host],port=port[,localaddr=localaddr]\n"
     "         [,localport=localport][,ipv4=on|off][,ipv6=on|off][,mux=on|off]\n"
-    "         [,logfile=PATH][,logappend=on|off]\n"
+    "         [,logfile=PATH][,logappend=on|off][,close-on-cpr=on|off]\n"
     "-chardev msmouse,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
     "-chardev vc,id=id[[,width=width][,height=height]][[,cols=cols][,rows=rows]]\n"
     "         [,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
     "-chardev ringbuf,id=id[,size=size][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
     "-chardev file,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
     "-chardev pipe,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
 #ifdef _WIN32
     "-chardev console,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
     "-chardev serial,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
 #else
     "-chardev pty,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
     "-chardev stdio,id=id[,mux=on|off][,signal=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
 #endif
 #ifdef CONFIG_BRLAPI
     "-chardev braille,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
 #endif
 #if defined(__linux__) || defined(__sun__) || defined(__FreeBSD__) \
         || defined(__NetBSD__) || defined(__OpenBSD__) || defined(__DragonFly__)
     "-chardev serial,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
     "-chardev tty,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
 #endif
 #if defined(__linux__) || defined(__FreeBSD__) || defined(__DragonFly__)
     "-chardev parallel,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
     "-chardev parport,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
 #endif
 #if defined(CONFIG_SPICE)
     "-chardev spicevmc,id=id,name=name[,debug=debug][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
     "-chardev spiceport,id=id,name=name[,debug=debug][,logfile=PATH][,logappend=on|off]\n"
+    "         [,close-on-cpr=on|off]\n"
 #endif
     , QEMU_ARCH_ALL
 )
@@ -3182,6 +3196,10 @@ The general form of a character device option is:
     ``logappend`` option controls whether the log file will be truncated
     or appended to when opened.
 
+    Every backend supports the ``close-on-cpr`` option.  If on, the
+    device's descriptor is closed during cprsave, and reopened after exec.
+    This is useful for devices that do not support cpr.
+
 The available backends are:
 
 ``-chardev null,id=id``
-- 
1.8.3.1




* [PATCH V3 17/22] chardev: cpr for simple devices
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (15 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 16/22] chardev: cpr framework Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 18/22] chardev: cpr for pty Steve Sistare
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Set QEMU_CHAR_FEATURE_CPR for devices that trivially support cpr.
char-stdio is slightly less trivial.  Allow the gdb server by
closing it on exec.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-mux.c     | 1 +
 chardev/char-null.c    | 1 +
 chardev/char-serial.c  | 1 +
 chardev/char-stdio.c   | 8 ++++++++
 gdbstub.c              | 1 +
 include/chardev/char.h | 1 +
 migration/cpr.c        | 1 +
 7 files changed, 14 insertions(+)

diff --git a/chardev/char-mux.c b/chardev/char-mux.c
index 72beef2..af74eaf 100644
--- a/chardev/char-mux.c
+++ b/chardev/char-mux.c
@@ -337,6 +337,7 @@ static void qemu_chr_open_mux(Chardev *chr,
      */
     *be_opened = muxes_opened;
     qemu_chr_fe_init(&d->chr, drv, errp);
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
 }
 
 static void qemu_chr_parse_mux(QemuOpts *opts, ChardevBackend *backend,
diff --git a/chardev/char-null.c b/chardev/char-null.c
index 1c6a290..02acaff 100644
--- a/chardev/char-null.c
+++ b/chardev/char-null.c
@@ -32,6 +32,7 @@ static void null_chr_open(Chardev *chr,
                           Error **errp)
 {
     *be_opened = false;
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
 }
 
 static void char_null_class_init(ObjectClass *oc, void *data)
diff --git a/chardev/char-serial.c b/chardev/char-serial.c
index 7c3d84a..b585085 100644
--- a/chardev/char-serial.c
+++ b/chardev/char-serial.c
@@ -274,6 +274,7 @@ static void qmp_chardev_open_serial(Chardev *chr,
     qemu_set_nonblock(fd);
     tty_serial_init(fd, 115200, 'N', 8, 1);
 
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
     qemu_chr_open_fd(chr, fd, fd);
 }
 #endif /* __linux__ || __sun__ */
diff --git a/chardev/char-stdio.c b/chardev/char-stdio.c
index 403da30..9410c16 100644
--- a/chardev/char-stdio.c
+++ b/chardev/char-stdio.c
@@ -114,9 +114,17 @@ static void qemu_chr_open_stdio(Chardev *chr,
 
     stdio_allow_signal = !opts->has_signal || opts->signal;
     qemu_chr_set_echo_stdio(chr, false);
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
 }
 #endif
 
+void qemu_term_exit(void)
+{
+#ifndef _WIN32
+    term_exit();
+#endif
+}
+
 static void qemu_chr_parse_stdio(QemuOpts *opts, ChardevBackend *backend,
                                  Error **errp)
 {
diff --git a/gdbstub.c b/gdbstub.c
index 054665e..fdbf531 100644
--- a/gdbstub.c
+++ b/gdbstub.c
@@ -3540,6 +3540,7 @@ int gdbserver_start(const char *device)
         mon_chr = gdbserver_state.mon_chr;
         reset_gdbserver_state();
     }
+    mon_chr->close_on_cpr = true;
 
     create_processes(&gdbserver_state);
 
diff --git a/include/chardev/char.h b/include/chardev/char.h
index e488ad1..96e5570 100644
--- a/include/chardev/char.h
+++ b/include/chardev/char.h
@@ -295,5 +295,6 @@ void resume_mux_open(void);
 void qemu_chr_parse_vc(QemuOpts *opts, ChardevBackend *backend, Error **errp);
 
 bool qemu_chr_cpr_capable(Error **errp);
+void qemu_term_exit(void);
 
 #endif
diff --git a/migration/cpr.c b/migration/cpr.c
index 8dfd5f1..a65a671 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -140,6 +140,7 @@ void cprsave(const char *file, CprMode mode, Error **errp)
         }
         walkenv(FD_PREFIX, preserve_fd, 0);
         vhost_dev_reset_all();
+        qemu_term_exit();
         setenv("QEMU_START_FREEZE", "", 1);
         qemu_system_exec_request();
     }
-- 
1.8.3.1




* [PATCH V3 18/22] chardev: cpr for pty
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (16 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 17/22] chardev: cpr for simple devices Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 19/22] chardev: cpr for sockets Steve Sistare
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Save and restore pty descriptors across cprsave and cprload.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-pty.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)
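
Note: the open path in this patch follows a "reuse if preserved, otherwise
open and publish" pattern.  A stand-alone sketch of that control flow
follows.  Everything in it is hypothetical: the DEMO_FD_ key prefix, the
env_get_fd()/env_set_fd() helpers, and open_devnull(), which stands in for
the pty open so that the sketch runs without a terminal.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Look up a preserved fd by label; -1 if absent or malformed. */
static int env_get_fd(const char *label)
{
    char key[128];
    const char *val;
    char *end;
    long fd;

    snprintf(key, sizeof(key), "DEMO_FD_%s", label);
    val = getenv(key);
    if (!val || !*val) {
        return -1;
    }
    fd = strtol(val, &end, 10);
    return (*end == '\0' && fd >= 0) ? (int)fd : -1;
}

/* Publish an fd under the label so a later lookup can find it. */
static void env_set_fd(const char *label, int fd)
{
    char key[128], val[16];

    snprintf(key, sizeof(key), "DEMO_FD_%s", label);
    snprintf(val, sizeof(val), "%d", fd);
    setenv(key, val, 1);
}

/* Stand-in opener for the demo; a real backend would open a pty here. */
static int open_devnull(void)
{
    return open("/dev/null", O_RDWR);
}

/*
 * Reuse a preserved fd if one exists.  Otherwise open a new one and publish
 * it, unless the device is to be closed on cpr.
 */
static int chardev_fd_get_or_open(const char *label, bool close_on_cpr,
                                  int (*open_new)(void), bool *reused)
{
    int fd = env_get_fd(label);

    if (fd >= 0) {
        *reused = true;
        return fd;
    }
    *reused = false;
    fd = open_new();
    if (fd >= 0 && !close_on_cpr) {
        env_set_fd(label, fd);
    }
    return fd;
}
```

With close_on_cpr set, nothing is published, so the next open creates a fresh
descriptor; that is the behaviour the close-on-cpr option is meant to give.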

diff --git a/chardev/char-pty.c b/chardev/char-pty.c
index a2d1e7c..c91151d 100644
--- a/chardev/char-pty.c
+++ b/chardev/char-pty.c
@@ -30,6 +30,7 @@
 #include "qemu/sockets.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
+#include "qemu/env.h"
 #include "qemu/qemu-print.h"
 
 #include "chardev/char-io.h"
@@ -191,6 +192,7 @@ static void char_pty_finalize(Object *obj)
     Chardev *chr = CHARDEV(obj);
     PtyChardev *s = PTY_CHARDEV(obj);
 
+    unsetenv_fd(chr->label);
     pty_chr_state(chr, 0);
     object_unref(OBJECT(s->ioc));
     pty_chr_timer_cancel(s);
@@ -207,19 +209,28 @@ static void char_pty_open(Chardev *chr,
     char pty_name[PATH_MAX];
     char *name;
 
+    master_fd = getenv_fd(chr->label);
+    if (master_fd >= 0) {
+        chr->filename = g_strdup_printf("pty:unknown");
+        goto have_fd;
+    }
+
     master_fd = qemu_openpty_raw(&slave_fd, pty_name);
     if (master_fd < 0) {
         error_setg_errno(errp, errno, "Failed to create PTY");
         return;
     }
-
+    if (!chr->close_on_cpr) {
+        setenv_fd(chr->label, master_fd);
+    }
     close(slave_fd);
     qemu_set_nonblock(master_fd);
-
     chr->filename = g_strdup_printf("pty:%s", pty_name);
     qemu_printf("char device redirected to %s (label %s)\n",
                 pty_name, chr->label);
 
+have_fd:
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
     s = PTY_CHARDEV(chr);
     s->ioc = QIO_CHANNEL(qio_channel_file_new_fd(master_fd));
     name = g_strdup_printf("chardev-pty-%s", chr->label);
-- 
1.8.3.1




* [PATCH V3 19/22] chardev: cpr for sockets
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (17 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 18/22] chardev: cpr for pty Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 20/22] cpr: only-cpr-capable option Steve Sistare
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Save accepted socket fds in the environment before cprsave, and look for
fds in the environment after cprload.  Reject cprsave if a socket enables
the TLS or websocket option.  Allow a monitor socket by closing it on exec.

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-socket.c | 31 +++++++++++++++++++++++++++++++
 monitor/hmp.c         |  3 +++
 monitor/qmp.c         |  3 +++
 3 files changed, 37 insertions(+)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index daa89fe..110f263 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -27,7 +27,9 @@
 #include "io/channel-socket.h"
 #include "io/channel-tls.h"
 #include "io/channel-websock.h"
+#include "qemu/env.h"
 #include "io/net-listener.h"
+#include "qemu/env.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
 #include "qemu/option.h"
@@ -414,6 +416,7 @@ static void tcp_chr_free_connection(Chardev *chr)
     SocketChardev *s = SOCKET_CHARDEV(chr);
     int i;
 
+    unsetenv_fd(chr->label);
     if (s->read_msgfds_num) {
         for (i = 0; i < s->read_msgfds_num; i++) {
             close(s->read_msgfds[i]);
@@ -976,6 +979,10 @@ static void tcp_chr_accept(QIONetListener *listener,
                                QIO_CHANNEL(cioc));
     }
     tcp_chr_new_client(chr, cioc);
+
+    if (s->sioc && !chr->close_on_cpr) {
+        setenv_fd(chr->label, s->sioc->fd);
+    }
 }
 
 
@@ -1231,6 +1238,24 @@ static gboolean socket_reconnect_timeout(gpointer opaque)
     return false;
 }
 
+static void load_char_socket_fd(Chardev *chr, Error **errp)
+{
+    SocketChardev *sockchar = SOCKET_CHARDEV(chr);
+    QIOChannelSocket *sioc;
+    int fd = getenv_fd(chr->label);
+
+    if (fd != -1) {
+        sockchar = SOCKET_CHARDEV(chr);
+        sioc = qio_channel_socket_new_fd(fd, errp);
+        if (sioc) {
+            tcp_chr_accept(sockchar->listener, sioc, chr);
+            object_unref(OBJECT(sioc));
+        } else {
+            error_prepend(errp, "could not restore socket for %s: ",
+                          chr->label);
+        }
+    }
+}
 
 static int qmp_chardev_open_socket_server(Chardev *chr,
                                           bool is_telnet,
@@ -1441,6 +1466,10 @@ static void qmp_chardev_open_socket(Chardev *chr,
     }
     s->registered_yank = true;
 
+    if (!s->tls_creds && !s->is_websock) {
+        qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
+    }
+
     /* be isn't opened until we get a connection */
     *be_opened = false;
 
@@ -1456,6 +1485,8 @@ static void qmp_chardev_open_socket(Chardev *chr,
             return;
         }
     }
+
+    load_char_socket_fd(chr, errp);
 }
 
 static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend *backend,
diff --git a/monitor/hmp.c b/monitor/hmp.c
index 6c0b33a..63700b3 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -1451,4 +1451,7 @@ void monitor_init_hmp(Chardev *chr, bool use_readline, Error **errp)
     qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read, monitor_read,
                              monitor_event, NULL, &mon->common, NULL, true);
     monitor_list_append(&mon->common);
+
+    /* monitor cannot yet be preserved across cpr */
+    chr->close_on_cpr = true;
 }
diff --git a/monitor/qmp.c b/monitor/qmp.c
index 2b0308f..495d68f 100644
--- a/monitor/qmp.c
+++ b/monitor/qmp.c
@@ -531,4 +531,7 @@ void monitor_init_qmp(Chardev *chr, bool pretty, Error **errp)
                                  NULL, &mon->common, NULL, true);
         monitor_list_append(&mon->common);
     }
+
+    /* Monitor cannot yet be preserved across cpr */
+    chr->close_on_cpr = true;
 }
-- 
1.8.3.1
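
Note on the mechanism used by the patch above: tcp_chr_accept() records the
connected socket with setenv_fd(), load_char_socket_fd() recovers it with
getenv_fd() after exec, and tcp_chr_free_connection() drops it with
unsetenv_fd().  Those helpers live in util/env.c, which is not shown in this
message, so the following is only a minimal sketch of the idea under stated
assumptions.  The names save_fd_env() and load_fd_env() are hypothetical, and
the variable name is used verbatim rather than with whatever prefix the real
helpers may apply.

```c
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: make fd survive exec by clearing FD_CLOEXEC, then
 * record its number in the environment under name.
 * Returns 0 on success, -1 on failure.
 */
static int save_fd_env(const char *name, int fd)
{
    char val[16];
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0 || fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0) {
        return -1;
    }
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(name, val, 1);
}

/*
 * Hypothetical helper: return the descriptor recorded under name, or -1 if
 * the variable is unset, malformed, or names a descriptor that is not open.
 */
static int load_fd_env(const char *name)
{
    const char *val = getenv(name);
    char *end;
    long fd;

    if (!val || !*val) {
        return -1;
    }
    errno = 0;
    fd = strtol(val, &end, 10);
    if (errno || *end != '\0' || fd < 0 || fd > INT_MAX) {
        return -1;
    }
    /* F_GETFD fails with EBADF if the descriptor did not survive. */
    if (fcntl((int)fd, F_GETFD) < 0) {
        return -1;
    }
    return (int)fd;
}
```

In this sketch the process would call save_fd_env() before exec and
load_fd_env() early in the new image, treating -1 as "nothing to restore",
which mirrors the fd != -1 check in load_char_socket_fd().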




* [PATCH V3 20/22] cpr: only-cpr-capable option
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (18 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 19/22] chardev: cpr for sockets Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 21/22] cpr: maintainers Steve Sistare
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add the only-cpr-capable option, which causes qemu to exit with an error
if any devices that are not capable of cpr are added.  This guarantees that
a cprsave operation will not fail with an unsupported device error.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-socket.c   |  4 ++++
 hw/vfio/common.c        |  5 +++++
 hw/vfio/pci.c           |  3 +++
 include/sysemu/sysemu.h |  1 +
 migration/migration.c   |  5 +++++
 qemu-options.hx         |  8 ++++++++
 softmmu/globals.c       |  1 +
 softmmu/physmem.c       |  4 ++++
 softmmu/vl.c            | 14 +++++++++++++-
 stubs/cpr.c             |  3 +++
 stubs/meson.build       |  1 +
 11 files changed, 48 insertions(+), 1 deletion(-)
 create mode 100644 stubs/cpr.c

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index 110f263..b8c75ff 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -40,6 +40,7 @@
 
 #include "chardev/char-io.h"
 #include "qom/object.h"
+#include "sysemu/sysemu.h"
 
 /***********************************************************/
 /* TCP Net console */
@@ -1468,6 +1469,9 @@ static void qmp_chardev_open_socket(Chardev *chr,
 
     if (!s->tls_creds && !s->is_websock) {
         qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
+    } else if (only_cpr_capable) {
+        error_setg(errp, "error: socket %s is not cpr capable due to %s option",
+                   chr->label, (s->tls_creds ? "TLS" : "websocket"));
     }
 
     /* be isn't opened until we get a connection */
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 00d07b2..f2f1926 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -37,6 +37,7 @@
 #include "qemu/range.h"
 #include "sysemu/kvm.h"
 #include "sysemu/reset.h"
+#include "sysemu/sysemu.h"
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
@@ -1601,6 +1602,10 @@ static int vfio_get_iommu_type(VFIOContainer *container,
 
     for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
         if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
+            if (only_cpr_capable && !vfio_cpr_capable(container, errp)) {
+                error_prepend(errp, "only-cpr-capable is specified: ");
+                return -EINVAL;
+            }
             return iommu_types[i];
         }
     }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e983db4..908b0e5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -266,6 +266,9 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
 
     if (!pin) {
         return 0;
+    } else if (only_cpr_capable) {
+        error_setg(errp, "INTX is not compatible with -only-cpr-capable");
+        return -1;
     }
 
     vfio_disable_interrupts(vdev);
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index f56058e..05c2d8e 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -9,6 +9,7 @@
 /* vl.c */
 
 extern int only_migratable;
+extern bool only_cpr_capable;
 extern char **argv_main;
 extern const char *qemu_name;
 extern QemuUUID qemu_uuid;
diff --git a/migration/migration.c b/migration/migration.c
index 8ca0341..181c8d5 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1262,6 +1262,11 @@ static bool migrate_caps_check(bool *cap_list,
         }
     }
 
+    if (cap_list[MIGRATION_CAPABILITY_X_COLO] && only_cpr_capable) {
+        error_setg(errp, "x-colo is not compatible with -only-cpr-capable");
+        return false;
+    }
+
     return true;
 }
 
diff --git a/qemu-options.hx b/qemu-options.hx
index ef2d24a..f1b372b 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4257,6 +4257,14 @@ SRST
     an unmigratable state.
 ERST
 
+DEF("only-cpr-capable", 0, QEMU_OPTION_only_cpr_capable, \
+    "-only-cpr-capable    allow only cpr capable devices\n", QEMU_ARCH_ALL)
+SRST
+``-only-cpr-capable``
+    Only allow cpr capable devices, which guarantees that cprsave will not
+    fail with an unsupported device error.
+ERST
+
 DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
     "-nodefaults     don't create default devices\n", QEMU_ARCH_ALL)
 SRST
diff --git a/softmmu/globals.c b/softmmu/globals.c
index 2bb630d..752f119 100644
--- a/softmmu/globals.c
+++ b/softmmu/globals.c
@@ -59,6 +59,7 @@ int boot_menu;
 bool boot_strict;
 uint8_t *boot_splash_filedata;
 int only_migratable; /* turn it off unless user states otherwise */
+bool only_cpr_capable;
 int icount_align_option;
 char **argv_main;
 
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index b79f408..04e3603 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1973,6 +1973,10 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
                 addr = file_ram_alloc(new_block, maxlen, mfd,
                                       false, false, 0, errp);
                 trace_anon_memfd_alloc(name, maxlen, addr, mfd);
+            } else if (only_cpr_capable) {
+                error_setg(errp,
+                    "only-cpr-capable requires -machine memfd-alloc=on");
+                return;
             } else {
                 addr = qemu_anon_ram_alloc(maxlen, &mr->align, shared);
             }
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 4654693..76a14a0 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2589,6 +2589,10 @@ void qmp_x_exit_preconfig(Error **errp)
     qemu_create_cli_devices();
     qemu_machine_creation_done();
 
+    if (only_cpr_capable && !qemu_chr_cpr_capable(errp)) {
+        ;    /* not reached due to error_fatal */
+    }
+
     if (loadvm) {
         Error *local_err = NULL;
         if (!load_snapshot(loadvm, NULL, false, NULL, &local_err)) {
@@ -2598,7 +2602,12 @@ void qmp_x_exit_preconfig(Error **errp)
         }
     }
     if (replay_mode != REPLAY_MODE_NONE) {
-        replay_vmstate_init();
+        if (only_cpr_capable) {
+            error_setg(errp, "replay is not compatible with -only-cpr-capable");
+            /* not reached due to error_fatal */
+        } else {
+            replay_vmstate_init();
+        }
     }
 
     if (incoming) {
@@ -3340,6 +3349,9 @@ void qemu_init(int argc, char **argv, char **envp)
             case QEMU_OPTION_only_migratable:
                 only_migratable = 1;
                 break;
+            case QEMU_OPTION_only_cpr_capable:
+                only_cpr_capable = true;
+                break;
             case QEMU_OPTION_nodefaults:
                 has_defaults = 0;
                 break;
diff --git a/stubs/cpr.c b/stubs/cpr.c
new file mode 100644
index 0000000..aaa189e
--- /dev/null
+++ b/stubs/cpr.c
@@ -0,0 +1,3 @@
+#include "qemu/osdep.h"
+
+bool only_cpr_capable;
diff --git a/stubs/meson.build b/stubs/meson.build
index be6f6d6..2003c77 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -5,6 +5,7 @@ stub_ss.add(files('blk-exp-close-all.c'))
 stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
 stub_ss.add(files('change-state-handler.c'))
 stub_ss.add(files('cmos.c'))
+stub_ss.add(files('cpr.c'))
 stub_ss.add(files('cpu-get-clock.c'))
 stub_ss.add(files('cpus-get-virtual-clock.c'))
 stub_ss.add(files('qemu-timer-notify-cb.c'))
-- 
1.8.3.1




* [PATCH V3 21/22] cpr: maintainers
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (19 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 20/22] cpr: only-cpr-capable option Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 12:25 ` [PATCH V3 22/22] simplify savevm Steve Sistare
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add the maintainers for cpr related files.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 36055f1..b69bbf5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2843,6 +2843,17 @@ F: net/colo*
 F: net/filter-rewriter.c
 F: net/filter-mirror.c
 
+CPR
+M: Steve Sistare <steven.sistare@oracle.com>
+M: Mark Kanda <mark.kanda@oracle.com>
+S: Maintained
+F: hw/vfio/cpr.c
+F: include/migration/cpr.h
+F: migration/cpr.c
+F: qapi/cpr.json
+F: include/qemu/env.h
+F: util/env.c
+
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
 R: Paolo Bonzini <pbonzini@redhat.com>
-- 
1.8.3.1




* [PATCH V3 22/22] simplify savevm
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (20 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 21/22] cpr: maintainers Steve Sistare
@ 2021-05-07 12:25 ` Steve Sistare
  2021-05-07 13:00 ` [PATCH V3 00/22] Live Update no-reply
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 81+ messages in thread
From: Steve Sistare @ 2021-05-07 12:25 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Steve Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Use qf_file_open to simplify a few functions in savevm.c.
No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/savevm.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 52e2d72..d02bce2 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2904,8 +2904,9 @@ bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
 void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
                                 Error **errp)
 {
+    const char *ioc_name = "migration-xen-save-state";
+    int flags = O_WRONLY | O_CREAT | O_TRUNC;
     QEMUFile *f;
-    QIOChannelFile *ioc;
     int saved_vm_running;
     int ret;
 
@@ -2919,14 +2920,10 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
     vm_stop(RUN_STATE_SAVE_VM);
     global_state_store_running();
 
-    ioc = qio_channel_file_new_path(filename, O_WRONLY | O_CREAT | O_TRUNC,
-                                    0660, errp);
-    if (!ioc) {
+    f = qf_file_open(filename, flags, 0660, ioc_name, errp);
+    if (!f) {
         goto the_end;
     }
-    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-save-state");
-    f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
-    object_unref(OBJECT(ioc));
     ret = qemu_save_device_state(f);
     if (ret < 0 || qemu_fclose(f) < 0) {
         error_setg(errp, QERR_IO_ERROR);
@@ -2954,8 +2951,8 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
 
 void qmp_xen_load_devices_state(const char *filename, Error **errp)
 {
+    const char *ioc_name = "migration-xen-load-state";
     QEMUFile *f;
-    QIOChannelFile *ioc;
     int ret;
 
     /* Guest must be paused before loading the device state; the RAM state
@@ -2967,14 +2964,10 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
     }
     vm_stop(RUN_STATE_RESTORE_VM);
 
-    ioc = qio_channel_file_new_path(filename, O_RDONLY | O_BINARY, 0, errp);
-    if (!ioc) {
+    f = qf_file_open(filename, O_RDONLY | O_BINARY, 0, ioc_name, errp);
+    if (!f) {
         return;
     }
-    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-load-state");
-    f = qemu_fopen_channel_input(QIO_CHANNEL(ioc));
-    object_unref(OBJECT(ioc));
-
     ret = qemu_loadvm_state(f);
     qemu_fclose(f);
     if (ret < 0) {
-- 
1.8.3.1




* Re: [PATCH V3 00/22] Live Update
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (21 preceding siblings ...)
  2021-05-07 12:25 ` [PATCH V3 22/22] simplify savevm Steve Sistare
@ 2021-05-07 13:00 ` no-reply
  2021-05-13 20:42   ` Steven Sistare
  2021-05-12 16:42 ` Stefan Hajnoczi
  2021-05-19 16:43 ` [PATCH V3 00/22] Live Update Steven Sistare
  24 siblings, 1 reply; 81+ messages in thread
From: no-reply @ 2021-05-07 13:00 UTC (permalink / raw)
  To: steven.sistare
  Cc: jason.zeng, quintela, armbru, mst, qemu-devel, dgilbert,
	alex.williamson, pbonzini, steven.sistare, stefanha,
	marcandre.lureau, berrange, philmd, alex.bennee

Patchew URL: https://patchew.org/QEMU/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 1620390320-301716-1-git-send-email-steven.sistare@oracle.com
Subject: [PATCH V3 00/22] Live Update

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/1620390320-301716-1-git-send-email-steven.sistare@oracle.com -> patchew/1620390320-301716-1-git-send-email-steven.sistare@oracle.com
 - [tag update]      patchew/20210504124140.1100346-1-linux@roeck-us.net -> patchew/20210504124140.1100346-1-linux@roeck-us.net
 - [tag update]      patchew/20210506185641.284821-1-dgilbert@redhat.com -> patchew/20210506185641.284821-1-dgilbert@redhat.com
 - [tag update]      patchew/20210506193341.140141-1-lvivier@redhat.com -> patchew/20210506193341.140141-1-lvivier@redhat.com
 - [tag update]      patchew/20210506194358.3925-1-peter.maydell@linaro.org -> patchew/20210506194358.3925-1-peter.maydell@linaro.org
Switched to a new branch 'test'
8c778e6 simplify savevm
aca4f09 cpr: maintainers
697f8d0 cpr: only-cpr-capable option
0a8c20e chardev: cpr for sockets
cb270f4 chardev: cpr for pty
279230e chardev: cpr for simple devices
b122cfa chardev: cpr framework
6596676 hostmem-memfd: cpr support
8cb6348 vhost: reset vhost devices upon cprsave
e3ae86d vfio-pci: cpr part 2
02c628d vfio-pci: cpr part 1
d93623c vfio-pci: refactor for cpr
bc63b3e pci: export functions for cpr
2b10bdd cpr: HMP interfaces
29bc20a cpr: QMP interfaces
3f84e6c cpr
0a32588 vl: add helper to request re-exec
466b4cf machine: memfd-alloc option
50c3e84 util: env var helpers
76c3550 oslib: qemu_clr_cloexec
d819bd4 qemu_ram_volatile
c466ddf as_flat_walk

=== OUTPUT BEGIN ===
1/22 Checking commit c466ddfd2209 (as_flat_walk)
2/22 Checking commit d819bd4dcc09 (qemu_ram_volatile)
3/22 Checking commit 76c3550a677b (oslib: qemu_clr_cloexec)
4/22 Checking commit 50c3e84cf5a6 (util: env var helpers)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#19: 
new file mode 100644

ERROR: consider using qemu_strtol in preference to strtol
#72: FILE: util/env.c:20:
+        res = strtol(val, 0, 10);

total: 1 errors, 1 warnings, 129 lines checked

Patch 4/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

5/22 Checking commit 466b4cf4ce8c (machine: memfd-alloc option)
6/22 Checking commit 0a32588de76e (vl: add helper to request re-exec)
7/22 Checking commit 3f84e6c38bd6 (cpr)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#55: 
new file mode 100644

total: 0 errors, 1 warnings, 314 lines checked

Patch 7/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
8/22 Checking commit 29bc20ab5870 (cpr: QMP interfaces)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#81: 
new file mode 100644

total: 0 errors, 1 warnings, 136 lines checked

Patch 8/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
9/22 Checking commit 2b10bdd5edb3 (cpr: HMP interfaces)
10/22 Checking commit bc63b3edc621 (pci: export functions for cpr)
11/22 Checking commit d93623c4da4d (vfio-pci: refactor for cpr)
12/22 Checking commit 02c628d50b57 (vfio-pci: cpr part 1)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#271: 
new file mode 100644

total: 0 errors, 1 warnings, 566 lines checked

Patch 12/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
13/22 Checking commit e3ae86d2076c (vfio-pci: cpr part 2)
14/22 Checking commit 8cb6348c8cff (vhost: reset vhost devices upon cprsave)
15/22 Checking commit 65966769fa93 (hostmem-memfd: cpr support)
16/22 Checking commit b122cfa96106 (chardev: cpr framework)
17/22 Checking commit 279230e03a78 (chardev: cpr for simple devices)
18/22 Checking commit cb270f49693f (chardev: cpr for pty)
19/22 Checking commit 0a8c20e0a8d4 (chardev: cpr for sockets)
20/22 Checking commit 697f8d021f43 (cpr: only-cpr-capable option)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#200: 
new file mode 100644

total: 0 errors, 1 warnings, 133 lines checked

Patch 20/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
21/22 Checking commit aca4f092c865 (cpr: maintainers)
22/22 Checking commit 8c778e6f284c (simplify savevm)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com


* Re: [PATCH V3 06/22] vl: add helper to request re-exec
  2021-05-07 12:25 ` [PATCH V3 06/22] vl: add helper to request re-exec Steve Sistare
@ 2021-05-07 14:31   ` Eric Blake
  2021-05-13 20:19     ` Steven Sistare
  2021-05-12 16:27   ` Stefan Hajnoczi
  1 sibling, 1 reply; 81+ messages in thread
From: Eric Blake @ 2021-05-07 14:31 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 5/7/21 7:25 AM, Steve Sistare wrote:
> Add a qemu_exec_requested() hook that causes the main loop to exit and
> re-exec qemu using the same initial arguments.  If /usr/bin/qemu-exec
> exists, exec that instead.  This is an optional site-specific trampoline
> that may alter the environment before exec'ing the qemu binary.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---

> +static void qemu_exec(void)
> +{
> +    const char *helper = "/usr/bin/qemu-exec";
> +    const char *bin = !access(helper, X_OK) ? helper : argv_main[0];

Reads awkwardly; I would have used '...= access(helper, X_OK) == 0 ?...'

> +
> +    execvp(bin, argv_main);
> +    error_report("execvp failed, errno %d.", errno);

error_report should not be used with a trailing dot.  Also, %d for errno
is awkward, better is:

error_report("execvp failed: %s", strerror(errno));

> +    exit(1);

We aren't consistent about use of EXIT_FAILED.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH V3 16/22] chardev: cpr framework
  2021-05-07 12:25 ` [PATCH V3 16/22] chardev: cpr framework Steve Sistare
@ 2021-05-07 14:33   ` Eric Blake
  2021-05-13 20:19     ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Eric Blake @ 2021-05-07 14:33 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 5/7/21 7:25 AM, Steve Sistare wrote:
> Add QEMU_CHAR_FEATURE_CPR for devices that support cpr.
> Add the chardev close_on_cpr option for devices that can be closed on cpr
> and reopened after exec.
> cpr is allowed only if either QEMU_CHAR_FEATURE_CPR or close_on_cpr is set
> for all chardevs in the configuration.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---

> +++ b/qapi/char.json
> @@ -204,12 +204,15 @@
>  # @logfile: The name of a logfile to save output
>  # @logappend: true to append instead of truncate
>  #             (default to false to truncate)
> +# @close-on-cpr: if true, close device's fd on cprsave. defaults to false.
> +#                since 6.0.

6.1, actually.


> @@ -3182,6 +3196,10 @@ The general form of a character device option is:
>      ``logappend`` option controls whether the log file will be truncated
>      or appended to when opened.
>  
> +    Every backend supports the ``close-on-cpr`` option.  If on, the
> +    devices's descriptor is closed during cprsave, and reopened after exec.

device's

> +    This is useful for devices that do not support cpr.
> +
>  The available backends are:
>  
>  ``-chardev null,id=id``
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH V3 07/22] cpr
  2021-05-07 12:25 ` [PATCH V3 07/22] cpr Steve Sistare
@ 2021-05-12 16:19   ` Stefan Hajnoczi
  2021-05-13 20:21     ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-05-12 16:19 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Fri, May 07, 2021 at 05:25:05AM -0700, Steve Sistare wrote:
> To use the restart mode, qemu must be started with the memfd-alloc machine
> option.  The memfd's are saved to the environment and kept open across exec,
> after which they are found from the environment and re-mmap'd.  Hence guest
> ram is preserved in place, albeit with new virtual addresses in the qemu
> process.  The caller resumes the guest by calling cprload, which loads
> state from the file.  If the VM was running at cprsave time, then VM
> execution resumes.  cprsave supports any type of guest image and block
> device, but the caller must not modify guest block devices between cprsave
> and cprload.

Does QEMU's existing -object memory-backend-file on tmpfs or hugetlbfs
achieve the same thing?


* Re: [PATCH V3 06/22] vl: add helper to request re-exec
  2021-05-07 12:25 ` [PATCH V3 06/22] vl: add helper to request re-exec Steve Sistare
  2021-05-07 14:31   ` Eric Blake
@ 2021-05-12 16:27   ` Stefan Hajnoczi
  2021-05-13 20:20     ` Steven Sistare
  1 sibling, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-05-12 16:27 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Fri, May 07, 2021 at 05:25:04AM -0700, Steve Sistare wrote:
> @@ -660,6 +673,16 @@ void qemu_system_debug_request(void)
>      qemu_notify_event();
>  }
>  
> +static void qemu_exec(void)
> +{
> +    const char *helper = "/usr/bin/qemu-exec";

The network up script is get_relocated_path(CONFIG_SYSCONFDIR
"/qemu-ifup"). For consistency maybe this should use the same path
rather than /usr/bin/.


* Re: [PATCH V3 00/22] Live Update
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (22 preceding siblings ...)
  2021-05-07 13:00 ` [PATCH V3 00/22] Live Update no-reply
@ 2021-05-12 16:42 ` Stefan Hajnoczi
  2021-05-13 20:21   ` Steven Sistare
  2021-05-19 16:43 ` [PATCH V3 00/22] Live Update Steven Sistare
  24 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-05-12 16:42 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> Provide the cprsave and cprload commands for live update.  These save and
> restore VM state, with minimal guest pause time, so that qemu may be updated
> to a new version in between.
> 
> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> paused state and waits for the cprload command.

I think cprsave/cprload could be generalized by using QMP to stash the
file descriptors. The 'getfd' QMP command already exists and QEMU code
already opens fds passed using this mechanism.

I haven't checked but it may be possible to drop some patches by reusing
QEMU's monitor file descriptor passing since the code already knows how
to open from 'getfd' fds.

The reason why using QMP is interesting is because it eliminates the
need for execve(2). QEMU may be unable to execute a program due to
chroot, seccomp, etc.

QMP would enable cprsave/cprload to work both with and without
execve(2).

One tricky thing with this approach might be startup ordering: how to
get fds via the QMP monitor in the new process before processing the
entire command-line.

Stefan
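
For readers unfamiliar with the mechanism behind this suggestion: the QMP
'getfd' command receives a descriptor that the client sends as SCM_RIGHTS
ancillary data on the monitor's Unix socket.  The sketch below shows only
that underlying transfer, not QMP itself.  The function names send_fd() and
recv_fd() are hypothetical and do not come from QEMU.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized for exactly one descriptor, suitably aligned. */
union fd_cmsg {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send fd over the Unix socket sock.  Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    char byte = 0;                     /* at least one data byte is needed */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock.  Returns the new fd, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number referring to the same open file,
so this works between unrelated processes and does not depend on execve(2),
which is the property the suggestion above relies on.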


* Re: [PATCH V3 06/22] vl: add helper to request re-exec
  2021-05-07 14:31   ` Eric Blake
@ 2021-05-13 20:19     ` Steven Sistare
  2021-05-14  8:18       ` Daniel P. Berrangé
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-05-13 20:19 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 5/7/2021 10:31 AM, Eric Blake wrote:
> On 5/7/21 7:25 AM, Steve Sistare wrote:
>> Add a qemu_exec_requested() hook that causes the main loop to exit and
>> re-exec qemu using the same initial arguments.  If /usr/bin/qemu-exec
>> exists, exec that instead.  This is an optional site-specific trampoline
>> that may alter the environment before exec'ing the qemu binary.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
> 
>> +static void qemu_exec(void)
>> +{
>> +    const char *helper = "/usr/bin/qemu-exec";
>> +    const char *bin = !access(helper, X_OK) ? helper : argv_main[0];
> 
> Reads awkwardly; I would have used '...= access(helper, X_OK) == 0 ?...'

Will fix.

>> +
>> +    execvp(bin, argv_main);
>> +    error_report("execvp failed, errno %d.", errno);
> 
> error_report should not be used with a trailing dot.  

Will fix.  I was not sure because I see examples both ways, though no dot prevails.
Perhaps it should be added to the style guide and checkpatch.

> Also, %d for errno is awkward, better is:
> 
> error_report("execvp failed: %s", strerror(errno));

I shy away from strerror because it is not thread safe, but I see qemu uses it
extensively.  Will fix.

> 
>> +    exit(1);
> 
> We aren't consistent about use of EXIT_FAILED.

OK, I will use EXIT_FAILURE.

Thanks for reviewing.

- Steve
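
Taken together, the changes agreed to above could look roughly like the
following.  This is a standalone sketch, not the revised patch:
pick_exec_target() and exec_or_die() are hypothetical names, and
fprintf(stderr, ...) stands in for QEMU's error_report() so that the example
builds outside the QEMU tree.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return helper if it exists and is executable, otherwise fallback. */
static const char *pick_exec_target(const char *helper, const char *fallback)
{
    return access(helper, X_OK) == 0 ? helper : fallback;
}

/* Exec the helper if present, else argv[0]; never returns. */
static void exec_or_die(const char *helper, char **argv)
{
    const char *bin = pick_exec_target(helper, argv[0]);

    execvp(bin, argv);
    /* Reached only if execvp failed: no trailing dot, errno as text. */
    fprintf(stderr, "execvp failed: %s\n", strerror(errno));
    exit(EXIT_FAILURE);
}
```

The explicit "== 0" comparison, the strerror() message, and EXIT_FAILURE
correspond to the three review comments quoted above.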




* Re: [PATCH V3 16/22] chardev: cpr framework
  2021-05-07 14:33   ` Eric Blake
@ 2021-05-13 20:19     ` Steven Sistare
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Sistare @ 2021-05-13 20:19 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 5/7/2021 10:33 AM, Eric Blake wrote:
> On 5/7/21 7:25 AM, Steve Sistare wrote:
>> Add QEMU_CHAR_FEATURE_CPR for devices that support cpr.
>> Add the chardev close_on_cpr option for devices that can be closed on cpr
>> and reopened after exec.
>> cpr is allowed only if either QEMU_CHAR_FEATURE_CPR or close_on_cpr is set
>> for all chardevs in the configuration.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
> 
>> +++ b/qapi/char.json
>> @@ -204,12 +204,15 @@
>>  # @logfile: The name of a logfile to save output
>>  # @logappend: true to append instead of truncate
>>  #             (default to false to truncate)
>> +# @close-on-cpr: if true, close device's fd on cprsave. defaults to false.
>> +#                since 6.0.
> 
> 6.1, actually.
> 
>> @@ -3182,6 +3196,10 @@ The general form of a character device option is:
>>      ``logappend`` option controls whether the log file will be truncated
>>      or appended to when opened.
>>  
>> +    Every backend supports the ``close-on-cpr`` option.  If on, the
>> +    devices's descriptor is closed during cprsave, and reopened after exec.
> 
> device's

Thanks, will fix both - Steve



* Re: [PATCH V3 06/22] vl: add helper to request re-exec
  2021-05-12 16:27   ` Stefan Hajnoczi
@ 2021-05-13 20:20     ` Steven Sistare
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Sistare @ 2021-05-13 20:20 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 5/12/2021 12:27 PM, Stefan Hajnoczi wrote:
> On Fri, May 07, 2021 at 05:25:04AM -0700, Steve Sistare wrote:
>> @@ -660,6 +673,16 @@ void qemu_system_debug_request(void)
>>      qemu_notify_event();
>>  }
>>  
>> +static void qemu_exec(void)
>> +{
>> +    const char *helper = "/usr/bin/qemu-exec";
> 
> The network up script is get_relocated_path(CONFIG_SYSCONFDIR
> "/qemu-ifup"). For consistency maybe this should use the same path
> rather than /usr/bin/.

CONFIG_QEMU_HELPERDIR=/usr/libexec looks like a good choice.
And maybe rename qemu-exec to qemu-exec-helper, analogous to qemu-bridge-helper.

- Steve



* Re: [PATCH V3 07/22] cpr
  2021-05-12 16:19   ` Stefan Hajnoczi
@ 2021-05-13 20:21     ` Steven Sistare
  2021-05-14 11:28       ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-05-13 20:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 5/12/2021 12:19 PM, Stefan Hajnoczi wrote:
> On Fri, May 07, 2021 at 05:25:05AM -0700, Steve Sistare wrote:
>> To use the restart mode, qemu must be started with the memfd-alloc machine
>> option.  The memfd's are saved to the environment and kept open across exec,
>> after which they are found from the environment and re-mmap'd.  Hence guest
>> ram is preserved in place, albeit with new virtual addresses in the qemu
>> process.  The caller resumes the guest by calling cprload, which loads
>> state from the file.  If the VM was running at cprsave time, then VM
>> execution resumes.  cprsave supports any type of guest image and block
>> device, but the caller must not modify guest block devices between cprsave
>> and cprload.
> 
> Does QEMU's existing -object memory-backend-file on tmpfs or hugetlbfs
> achieve the same thing?

Not quite.  Various secondary anonymous memory objects are allocated via ram_block_add
and must be preserved, such as these on x86_64.  
  vga.vram
  pc.ram
  pc.bios
  pc.rom
  vga.rom
  rom@etc/acpi/tables
  rom@etc/table-loader
  rom@etc/acpi/rsdp

Even the read-only areas must be preserved rather than recreated from files in the updated
qemu, as their contents may have changed.

- Steve
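
The mechanism described in the quoted commit message (create a memfd, keep it
open across exec, record it in the environment, find it again and re-mmap it)
can be sketched roughly as below.  This is an illustration under stated
assumptions, not the code from the series: memfd_save(), memfd_restore() and
the environment variable naming are hypothetical, and the exec step itself is
left out so that the fragment can be exercised within one process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a memfd of `size` bytes, clear close-on-exec so it survives exec,
 * and record the descriptor number in environment variable `var`.
 * Returns the descriptor, or -1 on error.
 */
int memfd_save(const char *var, size_t size)
{
    char buf[32];
    int flags;
    int fd;

    assert(var != NULL);
    fd = memfd_create(var, MFD_CLOEXEC);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        goto fail;
    }
    flags = fcntl(fd, F_GETFD);
    if (flags < 0 || fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0) {
        goto fail;
    }
    snprintf(buf, sizeof(buf), "%d", fd);
    if (setenv(var, buf, 1) < 0) {
        goto fail;
    }
    return fd;
fail:
    close(fd);
    return -1;
}

/*
 * Look the descriptor up in the environment and map it shared.  The new
 * mapping has a different virtual address but refers to the same pages.
 * Returns NULL on error.
 */
void *memfd_restore(const char *var, size_t size)
{
    const char *val;
    char *end;
    long fd;
    void *p;

    assert(var != NULL);
    val = getenv(var);
    if (!val) {
        return NULL;
    }
    fd = strtol(val, &end, 10);
    if (end == val || *end != '\0' || fd < 0 || fd > INT_MAX) {
        return NULL;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, (int)fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

In the real flow memfd_save() would run in the old qemu before exec and
memfd_restore() in the new one; because the pages belong to the memfd rather
than to the process, their contents are preserved without being copied.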



* Re: [PATCH V3 00/22] Live Update
  2021-05-12 16:42 ` Stefan Hajnoczi
@ 2021-05-13 20:21   ` Steven Sistare
  2021-05-14 11:53     ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-05-13 20:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>> Provide the cprsave and cprload commands for live update.  These save and
>> restore VM state, with minimal guest pause time, so that qemu may be updated
>> to a new version in between.
>>
>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>> paused state and waits for the cprload command.
> 
> I think cprsave/cprload could be generalized by using QMP to stash the
> file descriptors. The 'getfd' QMP command already exists and QEMU code
> already opens fds passed using this mechanism.
> 
> I haven't checked but it may be possible to drop some patches by reusing
> QEMU's monitor file descriptor passing since the code already knows how
> to open from 'getfd' fds.
> 
> The reason why using QMP is interesting is because it eliminates the
> need for execve(2). QEMU may be unable to execute a program due to
> chroot, seccomp, etc.
> 
> QMP would enable cprsave/cprload to work both with and without
> execve(2).
> 
> One tricky thing with this approach might be startup ordering: how to
> get fds via the QMP monitor in the new process before processing the
> entire command-line.

Early on I experimented with a similar approach.  Old qemu passed descriptors to an
escrow process and exited; new qemu started and retrieved the descriptors from escrow.
vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
I suspect my recent vfio extensions would smooth the rough edges.

However, the main issue is that guest ram must be backed by named shared memory, and
we would need to add code to support shared memory for all the secondary memory objects.
That makes it less interesting for us at this time; we care about updating legacy qemu 
instances with anonymous guest memory.

Having said all that, this would be an interesting project, just not the one I want to 
push now.  In the future we could add a new cprsave mode to support it in a backward
compatible manner.

- Steve



* Re: [PATCH V3 00/22] Live Update
  2021-05-07 13:00 ` [PATCH V3 00/22] Live Update no-reply
@ 2021-05-13 20:42   ` Steven Sistare
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Sistare @ 2021-05-13 20:42 UTC (permalink / raw)
  To: qemu-devel
  Cc: jason.zeng, quintela, philmd, mst, dgilbert, armbru,
	alex.williamson, pbonzini, stefanha, marcandre.lureau, berrange,
	alex.bennee

I will add to MAINTAINERS incrementally instead of at the end to make checkpatch happy.

I will use qemu_strtol even though I thought the message "consider using qemu_strtol"
was giving me a choice.

You can't fight The Man when the man is a robot.

- Steve
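
For context on why checkpatch objects to a bare strtol() call such as the one
flagged in util/env.c: strtol reports failure only through errno and the end
pointer, so a caller that ignores both silently accepts empty strings,
trailing junk, and overflow.  The sketch below shows the checks a stricter
wrapper performs.  It is a standalone illustration with a hypothetical name
(parse_long_strict), not QEMU's qemu_strtol implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse all of `s` as a base-10 long.
 * Returns 0 and stores the value in *result on success,
 * -EINVAL if there are no digits or there is trailing junk,
 * -ERANGE if the value does not fit in a long.
 * *result is left untouched on failure.
 */
int parse_long_strict(const char *s, long *result)
{
    char *end;
    long val;

    assert(result != NULL);
    if (!s) {
        return -EINVAL;
    }
    errno = 0;
    val = strtol(s, &end, 10);
    if (end == s || *end != '\0') {
        return -EINVAL;
    }
    if (errno == ERANGE) {
        return -ERANGE;
    }
    *result = val;
    return 0;
}
```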

On 5/7/2021 9:00 AM, no-reply@patchew.org wrote:
> Patchew URL: https://patchew.org/QEMU/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/
> 
> Hi,
> 
> This series seems to have some coding style problems. See output below for
> more information:
> 
> Type: series
> Message-id: 1620390320-301716-1-git-send-email-steven.sistare@oracle.com
> Subject: [PATCH V3 00/22] Live Update
> 
> === TEST SCRIPT BEGIN ===
> #!/bin/bash
> git rev-parse base > /dev/null || exit 0
> git config --local diff.renamelimit 0
> git config --local diff.renames True
> git config --local diff.algorithm histogram
> ./scripts/checkpatch.pl --mailback base..
> === TEST SCRIPT END ===
> 
> Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
> From https://github.com/patchew-project/qemu
>  * [new tag]         patchew/1620390320-301716-1-git-send-email-steven.sistare@oracle.com -> patchew/1620390320-301716-1-git-send-email-steven.sistare@oracle.com
>  - [tag update]      patchew/20210504124140.1100346-1-linux@roeck-us.net -> patchew/20210504124140.1100346-1-linux@roeck-us.net
>  - [tag update]      patchew/20210506185641.284821-1-dgilbert@redhat.com -> patchew/20210506185641.284821-1-dgilbert@redhat.com
>  - [tag update]      patchew/20210506193341.140141-1-lvivier@redhat.com -> patchew/20210506193341.140141-1-lvivier@redhat.com
>  - [tag update]      patchew/20210506194358.3925-1-peter.maydell@linaro.org -> patchew/20210506194358.3925-1-peter.maydell@linaro.org
> Switched to a new branch 'test'
> 8c778e6 simplify savevm
> aca4f09 cpr: maintainers
> 697f8d0 cpr: only-cpr-capable option
> 0a8c20e chardev: cpr for sockets
> cb270f4 chardev: cpr for pty
> 279230e chardev: cpr for simple devices
> b122cfa chardev: cpr framework
> 6596676 hostmem-memfd: cpr support
> 8cb6348 vhost: reset vhost devices upon cprsave
> e3ae86d vfio-pci: cpr part 2
> 02c628d vfio-pci: cpr part 1
> d93623c vfio-pci: refactor for cpr
> bc63b3e pci: export functions for cpr
> 2b10bdd cpr: HMP interfaces
> 29bc20a cpr: QMP interfaces
> 3f84e6c cpr
> 0a32588 vl: add helper to request re-exec
> 466b4cf machine: memfd-alloc option
> 50c3e84 util: env var helpers
> 76c3550 oslib: qemu_clr_cloexec
> d819bd4 qemu_ram_volatile
> c466ddf as_flat_walk
> 
> === OUTPUT BEGIN ===
> 1/22 Checking commit c466ddfd2209 (as_flat_walk)
> 2/22 Checking commit d819bd4dcc09 (qemu_ram_volatile)
> 3/22 Checking commit 76c3550a677b (oslib: qemu_clr_cloexec)
> 4/22 Checking commit 50c3e84cf5a6 (util: env var helpers)
> Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
> WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
> #19: 
> new file mode 100644
> 
> ERROR: consider using qemu_strtol in preference to strtol
> #72: FILE: util/env.c:20:
> +        res = strtol(val, 0, 10);
> 
> total: 1 errors, 1 warnings, 129 lines checked
> 
> Patch 4/22 has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
> 
> 5/22 Checking commit 466b4cf4ce8c (machine: memfd-alloc option)
> 6/22 Checking commit 0a32588de76e (vl: add helper to request re-exec)
> 7/22 Checking commit 3f84e6c38bd6 (cpr)
> Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
> WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
> #55: 
> new file mode 100644
> 
> total: 0 errors, 1 warnings, 314 lines checked
> 
> Patch 7/22 has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
> 8/22 Checking commit 29bc20ab5870 (cpr: QMP interfaces)
> Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
> WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
> #81: 
> new file mode 100644
> 
> total: 0 errors, 1 warnings, 136 lines checked
> 
> Patch 8/22 has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
> 9/22 Checking commit 2b10bdd5edb3 (cpr: HMP interfaces)
> 10/22 Checking commit bc63b3edc621 (pci: export functions for cpr)
> 11/22 Checking commit d93623c4da4d (vfio-pci: refactor for cpr)
> 12/22 Checking commit 02c628d50b57 (vfio-pci: cpr part 1)
> Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
> WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
> #271: 
> new file mode 100644
> 
> total: 0 errors, 1 warnings, 566 lines checked
> 
> Patch 12/22 has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
> 13/22 Checking commit e3ae86d2076c (vfio-pci: cpr part 2)
> 14/22 Checking commit 8cb6348c8cff (vhost: reset vhost devices upon cprsave)
> 15/22 Checking commit 65966769fa93 (hostmem-memfd: cpr support)
> 16/22 Checking commit b122cfa96106 (chardev: cpr framework)
> 17/22 Checking commit 279230e03a78 (chardev: cpr for simple devices)
> 18/22 Checking commit cb270f49693f (chardev: cpr for pty)
> 19/22 Checking commit 0a8c20e0a8d4 (chardev: cpr for sockets)
> 20/22 Checking commit 697f8d021f43 (cpr: only-cpr-capable option)
> Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
> WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
> #200: 
> new file mode 100644
> 
> total: 0 errors, 1 warnings, 133 lines checked
> 
> Patch 20/22 has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
> 21/22 Checking commit aca4f092c865 (cpr: maintainers)
> 22/22 Checking commit 8c778e6f284c (simplify savevm)
> === OUTPUT END ===
> 
> Test command exited with code: 1
> 
> 
> The full log is available at
> http://patchew.org/logs/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/testing.checkpatch/?type=message.
> ---
> Email generated automatically by Patchew [https://patchew.org/].
> Please send your feedback to patchew-devel@redhat.com
> 



* Re: [PATCH V3 06/22] vl: add helper to request re-exec
  2021-05-13 20:19     ` Steven Sistare
@ 2021-05-14  8:18       ` Daniel P. Berrangé
  0 siblings, 0 replies; 81+ messages in thread
From: Daniel P. Berrangé @ 2021-05-14  8:18 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Jason Zeng, Juan Quintela, Michael S. Tsirkin, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Stefan Hajnoczi,
	Paolo Bonzini, Marc-André Lureau,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On Thu, May 13, 2021 at 04:19:22PM -0400, Steven Sistare wrote:
> On 5/7/2021 10:31 AM, Eric Blake wrote:
> > On 5/7/21 7:25 AM, Steve Sistare wrote:
> >> Add a qemu_exec_requested() hook that causes the main loop to exit and
> >> re-exec qemu using the same initial arguments.  If /usr/bin/qemu-exec
> >> exists, exec that instead.  This is an optional site-specific trampoline
> >> that may alter the environment before exec'ing the qemu binary.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> > 
> >> +static void qemu_exec(void)
> >> +{
> >> +    const char *helper = "/usr/bin/qemu-exec";
> >> +    const char *bin = !access(helper, X_OK) ? helper : argv_main[0];
> > 
> > Reads awkwardly; I would have used '...= access(helper, X_OK) == 0 ?...'
> 
> Will fix.
> 
> >> +
> >> +    execvp(bin, argv_main);
> >> +    error_report("execvp failed, errno %d.", errno);
> > 
> > error_report should not be used with a trailing dot.  
> 
> Will fix.  I was not sure because I see examples both ways, though no dot prevails.
> Perhaps it should be added to the style guide and checkpatch.
> 
> > Also, %d for errno is awkward, better is:
> > 
> > error_report("execvp failed: %s", strerror(errno));
> 
> I shy away from strerror because it is not thread safe, but I see qemu uses it
> extensively.  Will fix.

GLib provides 'g_strerror', which is thread-safe but without
the horrible API of strerror_r.  It works by caching the
errno strings in a static table on demand.  We don't use
it much in QEMU, but I think we ought to.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V3 07/22] cpr
  2021-05-13 20:21     ` Steven Sistare
@ 2021-05-14 11:28       ` Stefan Hajnoczi
  2021-05-14 15:14         ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-05-14 11:28 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster


On Thu, May 13, 2021 at 04:21:02PM -0400, Steven Sistare wrote:
> On 5/12/2021 12:19 PM, Stefan Hajnoczi wrote:
> > On Fri, May 07, 2021 at 05:25:05AM -0700, Steve Sistare wrote:
> >> To use the restart mode, qemu must be started with the memfd-alloc machine
> >> option.  The memfd's are saved to the environment and kept open across exec,
> >> after which they are found from the environment and re-mmap'd.  Hence guest
> >> ram is preserved in place, albeit with new virtual addresses in the qemu
> >> process.  The caller resumes the guest by calling cprload, which loads
> >> state from the file.  If the VM was running at cprsave time, then VM
> >> execution resumes.  cprsave supports any type of guest image and block
> >> device, but the caller must not modify guest block devices between cprsave
> >> and cprload.
> > 
> > Does QEMU's existing -object memory-backend-file on tmpfs or hugetlbfs
> > achieve the same thing?
> 
> Not quite.  Various secondary anonymous memory objects are allocated via ram_block_add
> and must be preserved, such as these on x86_64.  
>   vga.vram
>   pc.ram
>   pc.bios
>   pc.rom
>   vga.rom
>   rom@etc/acpi/tables
>   rom@etc/table-loader
>   rom@etc/acpi/rsdp
> 
> Even the read-only areas must be preserved rather than recreated from files in the updated
> qemu, as their contents may have changed.

Migration knows how to save/load these RAM blocks. Only pc.ram is
significant in size so I'm not sure it's worth special-casing the
others?

Stefan


* Re: [PATCH V3 00/22] Live Update
  2021-05-13 20:21   ` Steven Sistare
@ 2021-05-14 11:53     ` Stefan Hajnoczi
  2021-05-14 15:15       ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-05-14 11:53 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster


On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> > On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >> Provide the cprsave and cprload commands for live update.  These save and
> >> restore VM state, with minimal guest pause time, so that qemu may be updated
> >> to a new version in between.
> >>
> >> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >> paused state and waits for the cprload command.
> > 
> > I think cprsave/cprload could be generalized by using QMP to stash the
> > file descriptors. The 'getfd' QMP command already exists and QEMU code
> > already opens fds passed using this mechanism.
> > 
> > I haven't checked but it may be possible to drop some patches by reusing
> > QEMU's monitor file descriptor passing since the code already knows how
> > to open from 'getfd' fds.
> > 
> > The reason why using QMP is interesting is because it eliminates the
> > need for execve(2). QEMU may be unable to execute a program due to
> > chroot, seccomp, etc.
> > 
> > QMP would enable cprsave/cprload to work both with and without
> > execve(2).
> > 
> > One tricky thing with this approach might be startup ordering: how to
> > get fds via the QMP monitor in the new process before processing the
> > entire command-line.
> 
> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> I suspect my recent vfio extensions would smooth the rough edges.

I wonder about the reason for VFIO's pid limitation, maybe because it
pins pages from the original process?

Is this VFIO pid limitation the main reason why you chose to make QEMU
execve(2) the new binary?

> However, the main issue is that guest ram must be backed by named shared memory, and
> we would need to add code to support shared memory for all the secondary memory objects.
> That makes it less interesting for us at this time; we care about updating legacy qemu 
> instances with anonymous guest memory.

Thanks for explaining this more in the other sub-thread. The secondary
memory objects you mentioned are relatively small so I don't think
saving them in the traditional way is a problem.

Two approaches for zero-copy memory migration fit into QEMU's existing
migration infrastructure:

- Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
  etc) so they are not saved into the savevm file. The existing --object
  memory-backend-file syntax can be used.

- Extending the live migration protocol to detect when file descriptor
  passing is available (i.e. UNIX domain socket migration) and using
  that for memory-backend-* objects that have fds.

Either of these approaches would handle RAM with existing savevm/migrate
commands.

The remaining issue is how to migrate VFIO and other file descriptors
that cannot be reopened by the new process. As mentioned, QEMU already
has file descriptor passing support in the QMP monitor and support for
opening passed file descriptors (see qemu_open_internal(),
monitor_fd_param(), and socket_get_fd()).

The advantage of integrating live update functionality into the existing
savevm/migrate commands is that it will work in more use cases with
less code duplication/maintenance/bitrot prevention than the
special-case cprsave command in this patch series.

Maybe there is a fundamental technical reason why live update needs to
be different from QEMU's existing migration commands but I haven't
figured it out yet.

Stefan
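
The file descriptor passing referred to above rests on SCM_RIGHTS ancillary
messages over a UNIX domain socket.  The sketch below shows only that kernel
mechanism; it is not QEMU's monitor code, and send_fd() and recv_fd() are
hypothetical helper names.  One data byte accompanies each descriptor because
ancillary data must travel with a non-empty payload on stream sockets.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor `fd` over UNIX socket `sock`.  Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;                 /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`.  Returns the new fd or -1. */
int recv_fd(int sock)
{
    char byte;
    int fd;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file
description, which is why this works for objects that cannot simply be
reopened by path.  Whether it is sufficient for vfio descriptors is the open
question in this thread.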


* Re: [PATCH V3 07/22] cpr
  2021-05-14 11:28       ` Stefan Hajnoczi
@ 2021-05-14 15:14         ` Steven Sistare
  2021-05-18 13:42           ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-05-14 15:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 5/14/2021 7:28 AM, Stefan Hajnoczi wrote:
> On Thu, May 13, 2021 at 04:21:02PM -0400, Steven Sistare wrote:
>> On 5/12/2021 12:19 PM, Stefan Hajnoczi wrote:
>>> On Fri, May 07, 2021 at 05:25:05AM -0700, Steve Sistare wrote:
>>>> To use the restart mode, qemu must be started with the memfd-alloc machine
>>>> option.  The memfd's are saved to the environment and kept open across exec,
>>>> after which they are found from the environment and re-mmap'd.  Hence guest
>>>> ram is preserved in place, albeit with new virtual addresses in the qemu
>>>> process.  The caller resumes the guest by calling cprload, which loads
>>>> state from the file.  If the VM was running at cprsave time, then VM
>>>> execution resumes.  cprsave supports any type of guest image and block
>>>> device, but the caller must not modify guest block devices between cprsave
>>>> and cprload.
>>>
>>> Does QEMU's existing -object memory-backend-file on tmpfs or hugetlbfs
>>> achieve the same thing?
>>
>> Not quite.  Various secondary anonymous memory objects are allocated via ram_block_add
>> and must be preserved, such as these on x86_64.  
>>   vga.vram
>>   pc.ram
>>   pc.bios
>>   pc.rom
>>   vga.rom
>>   rom@etc/acpi/tables
>>   rom@etc/table-loader
>>   rom@etc/acpi/rsdp
>>
>> Even the read-only areas must be preserved rather than recreated from files in the updated
>> qemu, as their contents may have changed.
> 
> Migration knows how to save/load these RAM blocks. Only pc.ram is
> significant in size so I'm not sure it's worth special-casing the
> others?

Some of these are mapped for vfio dma as a consequence of the normal memory region callbacks
to consumer code.  We get conflict errors against those existing vfio mappings if the blocks
are recreated and remapped in the new process.  The memfd option is a simple and robust
solution to that issue.

- Steve



* Re: [PATCH V3 00/22] Live Update
  2021-05-14 11:53     ` Stefan Hajnoczi
@ 2021-05-14 15:15       ` Steven Sistare
  2021-05-17 11:40         ` Stefan Hajnoczi
  2021-05-18  9:57         ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 81+ messages in thread
From: Steven Sistare @ 2021-05-14 15:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>> to a new version in between.
>>>>
>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>> paused state and waits for the cprload command.
>>>
>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>> already opens fds passed using this mechanism.
>>>
>>> I haven't checked but it may be possible to drop some patches by reusing
>>> QEMU's monitor file descriptor passing since the code already knows how
>>> to open from 'getfd' fds.
>>>
>>> The reason why using QMP is interesting is because it eliminates the
>>> need for execve(2). QEMU may be unable to execute a program due to
>>> chroot, seccomp, etc.
>>>
>>> QMP would enable cprsave/cprload to work both with and without
>>> execve(2).
>>>
>>> One tricky thing with this approach might be startup ordering: how to
>>> get fds via the QMP monitor in the new process before processing the
>>> entire command-line.
>>
>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>> I suspect my recent vfio extensions would smooth the rough edges.
> 
> I wonder about the reason for VFIO's pid limitation, maybe because it
> pins pages from the original process?

The dma unmap code verifies that the requesting task is the same as the task that mapped
the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
to fix locked memory accounting, which is associated with the mm of the original task.

> Is this VFIO pid limitation the main reason why you chose to make QEMU
> execve(2) the new binary?

That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
I was working against vfio rather than with it.

Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
without exec, I still need the exec option.

>> However, the main issue is that guest ram must be backed by named shared memory, and
>> we would need to add code to support shared memory for all the secondary memory objects.
>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>> instances with anonymous guest memory.
> 
> Thanks for explaining this more in the other sub-thread. The secondary
> memory objects you mentioned are relatively small so I don't think
> saving them in the traditional way is a problem.
> 
> Two approaches for zero-copy memory migration fit into QEMU's existing
> migration infrastructure:
> 
> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>   etc) so they are not saved into the savevm file. The existing --object
>   memory-backend-file syntax can be used.
> 
> - Extending the live migration protocol to detect when file descriptor
>   passing is available (i.e. UNIX domain socket migration) and using
>   that for memory-backend-* objects that have fds.
> 
> Either of these approaches would handle RAM with existing savevm/migrate
> commands.

Yes, but the vfio issues would still need to be solved, and we would need new
command line options to back existing and future secondary memory objects with 
named shared memory.

> The remaining issue is how to migrate VFIO and other file descriptors
> that cannot be reopened by the new process. As mentioned, QEMU already
> has file descriptor passing support in the QMP monitor and support for
> opening passed file descriptors (see qemu_open_internal(),
> monitor_fd_param(), and socket_get_fd()).
> 
> The advantage of integrating live update functionality into the existing
> savevm/migrate commands is that it will work in more use cases with
> less code duplication/maintenance/bitrot prevention than the
> special-case cprsave command in this patch series.
> 
> Maybe there is a fundamental technical reason why live update needs to
> be different from QEMU's existing migration commands but I haven't
> figured it out yet.

vfio and anonymous memory.

Regarding code duplication, I did consider whether to extend the migration
syntax and implementation versus creating something new.  Those functions
handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
use case, and the cpr functions handle state that is n/a for the migration case.
I judged that handling both in the same functions would be less readable and
maintainable.  After feedback during the V1 review, I simplified the cprsave
code by calling qemu_save_device_state, as Xen does, thus eliminating any
interaction with the migration code.

Regarding bit rot, I still need to add a cpr test to the test suite, when the 
review is more complete and folks agree on the final form of the functionality.

I do like the idea of supporting update without exec, but as a future project, 
and not at the expense of dropping update with exec.

- Steve



* Re: [PATCH V3 00/22] Live Update
  2021-05-14 15:15       ` Steven Sistare
@ 2021-05-17 11:40         ` Stefan Hajnoczi
  2021-05-17 19:10           ` Alex Williamson
  2021-05-18  9:57         ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-05-17 11:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Steven Sistare,
	vfio-users, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster


On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>> Provide the cprsave and cprload commands for live update.  These save and
> >>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> >>>> to a new version in between.
> >>>>
> >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >>>> paused state and waits for the cprload command.
> >>>
> >>> I think cprsave/cprload could be generalized by using QMP to stash the
> >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> >>> already opens fds passed using this mechanism.
> >>>
> >>> I haven't checked but it may be possible to drop some patches by reusing
> >>> QEMU's monitor file descriptor passing since the code already knows how
> >>> to open from 'getfd' fds.
> >>>
> >>> The reason why using QMP is interesting is because it eliminates the
> >>> need for execve(2). QEMU may be unable to execute a program due to
> >>> chroot, seccomp, etc.
> >>>
> >>> QMP would enable cprsave/cprload to work both with and without
> >>> execve(2).
> >>>
> >>> One tricky thing with this approach might be startup ordering: how to
> >>> get fds via the QMP monitor in the new process before processing the
> >>> entire command-line.
> >>
> >> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> >> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> >> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> >> I suspect my recent vfio extensions would smooth the rough edges.
> > 
> > I wonder about the reason for VFIO's pid limitation, maybe because it
> > pins pages from the original process?
> 
> The dma unmap code verifies that the requesting task is the same as the task that mapped
> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> to fix locked memory accounting, which is associated with the mm of the original task.
> 
> > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > execve(2) the new binary?
> 
> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> I was working against vfio rather than with it.
> 
> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> without exec, I still need the exec option.
> 
> >> However, the main issue is that guest ram must be backed by named shared memory, and
> >> we would need to add code to support shared memory for all the secondary memory objects.
> >> That makes it less interesting for us at this time; we care about updating legacy qemu 
> >> instances with anonymous guest memory.
> > 
> > Thanks for explaining this more in the other sub-thread. The secondary
> > memory objects you mentioned are relatively small so I don't think
> > saving them in the traditional way is a problem.
> > 
> > Two approaches for zero-copy memory migration fit into QEMU's existing
> > migration infrastructure:
> > 
> > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> >   etc) so they are not saved into the savevm file. The existing --object
> >   memory-backend-file syntax can be used.
> > 
> > - Extending the live migration protocol to detect when file descriptor
> >   passing is available (i.e. UNIX domain socket migration) and using
> >   that for memory-backend-* objects that have fds.
> > 
> > Either of these approaches would handle RAM with existing savevm/migrate
> > commands.
> 
> Yes, but the vfio issues would still need to be solved, and we would need new
> command line options to back existing and future secondary memory objects with 
> named shared memory.
> 
> > The remaining issue is how to migrate VFIO and other file descriptors
> > that cannot be reopened by the new process. As mentioned, QEMU already
> > has file descriptor passing support in the QMP monitor and support for
> > opening passed file descriptors (see qemu_open_internal(),
> > monitor_fd_param(), and socket_get_fd()).
> > 
> > The advantage of integrating live update functionality into the existing
> > savevm/migrate commands is that it will work in more use cases with
> > less code duplication/maintenance/bitrot prevention than the
> > special-case cprsave command in this patch series.
> > 
> > Maybe there is a fundamental technical reason why live update needs to
> > be different from QEMU's existing migration commands but I haven't
> > figured it out yet.
> 
> vfio and anonymous memory.
> 
> Regarding code duplication, I did consider whether to extend the migration
> syntax and implementation versus creating something new.  Those functions
> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> use case, and the cpr functions handle state that is n/a for the migration case.
> I judged that handling both in the same functions would be less readable and
> maintainable.  After feedback during the V1 review, I simplified the cprsave
> code by calling qemu_save_device_state, as Xen does, thus eliminating any
> interaction with the migration code.
> 
> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> review is more complete and folks agree on the final form of the functionality.
> 
> I do like the idea of supporting update without exec, but as a future project, 
> and not at the expense of dropping update with exec.

Alex: We're discussing how to live update QEMU while VFIO devices are
running. This patch series introduces monitor commands that call
execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
file descriptors. This way live update is transparent to VFIO but it
won't work if a sandboxed QEMU process is forbidden to call execve(2).
What are your thoughts on 1) the execve(2) approach and 2) extending
VFIO to allow running devices to be attached to a different process so
that execve(2) is not necessary?

Steven: Do you know if cpr will work with Intel's upcoming Shared
Virtual Addressing? I'm worried that execve(2) may be a short-term
solution that works around VFIO's current limitations but even execve(2)
may stop working in the future as IOMMUs and DMA approaches change.

Stefan


* Re: [PATCH V3 00/22] Live Update
  2021-05-17 11:40         ` Stefan Hajnoczi
@ 2021-05-17 19:10           ` Alex Williamson
  2021-05-18 13:39             ` Stefan Hajnoczi
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2021-05-17 19:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Steven Sistare, vfio-users,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On Mon, 17 May 2021 12:40:43 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
> > On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
> > > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
> > >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
> > >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
> > >>>> Provide the cprsave and cprload commands for live update.  These save and
> > >>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> > >>>> to a new version in between.
> > >>>>
> > >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> > >>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> > >>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> > >>>> paused state and waits for the cprload command.  
> > >>>
> > >>> I think cprsave/cprload could be generalized by using QMP to stash the
> > >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> > >>> already opens fds passed using this mechanism.
> > >>>
> > >>> I haven't checked but it may be possible to drop some patches by reusing
> > >>> QEMU's monitor file descriptor passing since the code already knows how
> > >>> to open from 'getfd' fds.
> > >>>
> > >>> The reason why using QMP is interesting is because it eliminates the
> > >>> need for execve(2). QEMU may be unable to execute a program due to
> > >>> chroot, seccomp, etc.
> > >>>
> > >>> QMP would enable cprsave/cprload to work both with and without
> > >>> execve(2).
> > >>>
> > >>> One tricky thing with this approach might be startup ordering: how to
> > >>> get fds via the QMP monitor in the new process before processing the
> > >>> entire command-line.  
> > >>
> > >> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> > >> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> > >> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> > >> I suspect my recent vfio extensions would smooth the rough edges.  
> > > 
> > > I wonder about the reason for VFIO's pid limitation, maybe because it
> > > pins pages from the original process?  
> > 
> > The dma unmap code verifies that the requesting task is the same as the task that mapped
> > the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> > to fix locked memory accounting, which is associated with the mm of the original task.
> >   
> > > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > > execve(2) the new binary?  
> > 
> > That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> > errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> > but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> > diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> > I was working against vfio rather than with it.
> > 
> > Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> > code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> > without exec, I still need the exec option.
> >   
> > >> However, the main issue is that guest ram must be backed by named shared memory, and
> > >> we would need to add code to support shared memory for all the secondary memory objects.
> > >> That makes it less interesting for us at this time; we care about updating legacy qemu 
> > >> instances with anonymous guest memory.  
> > > 
> > > Thanks for explaining this more in the other sub-thread. The secondary
> > > memory objects you mentioned are relatively small so I don't think
> > > saving them in the traditional way is a problem.
> > > 
> > > Two approaches for zero-copy memory migration fit into QEMU's existing
> > > migration infrastructure:
> > > 
> > > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> > >   etc) so they are not saved into the savevm file. The existing --object
> > >   memory-backend-file syntax can be used.
> > > 
> > > - Extending the live migration protocol to detect when file descriptor
> > >   passing is available (i.e. UNIX domain socket migration) and using
> > >   that for memory-backend-* objects that have fds.
> > > 
> > > Either of these approaches would handle RAM with existing savevm/migrate
> > > commands.  
> > 
> > Yes, but the vfio issues would still need to be solved, and we would need new
> > command line options to back existing and future secondary memory objects with 
> > named shared memory.
> >   
> > > The remaining issue is how to migrate VFIO and other file descriptors
> > > that cannot be reopened by the new process. As mentioned, QEMU already
> > > has file descriptor passing support in the QMP monitor and support for
> > > opening passed file descriptors (see qemu_open_internal(),
> > > monitor_fd_param(), and socket_get_fd()).
> > > 
> > > The advantage of integrating live update functionality into the existing
> > > savevm/migrate commands is that it will work in more use cases with
> > > less code duplication/maintenance/bitrot prevention than the
> > > special-case cprsave command in this patch series.
> > > 
> > > Maybe there is a fundamental technical reason why live update needs to
> > > be different from QEMU's existing migration commands but I haven't
> > > figured it out yet.  
> > 
> > vfio and anonymous memory.
> > 
> > Regarding code duplication, I did consider whether to extend the migration
> > syntax and implementation versus creating something new.  Those functions
> > handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> > use case, and the cpr functions handle state that is n/a for the migration case.
> > I judged that handling both in the same functions would be less readable and
> > maintainable.  After feedback during the V1 review, I simplified the cprsave
> > > code by calling qemu_save_device_state, as Xen does, thus eliminating any
> > interaction with the migration code.
> > 
> > Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> > review is more complete and folks agree on the final form of the functionality.
> > 
> > I do like the idea of supporting update without exec, but as a future project, 
> > and not at the expense of dropping update with exec.  
> 
> Alex: We're discussing how to live update QEMU while VFIO devices are
> running. This patch series introduces monitor commands that call
> execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
> file descriptors. This way live update is transparent to VFIO but it
> won't work if a sandboxed QEMU process is forbidden to call execve(2).
> What are your thoughts on 1) the execve(2) approach and 2) extending
> VFIO to allow running devices to be attached to a different process so
> that execve(2) is not necessary?

Tracking processes is largely to support page pinning.  We need to
support asynchronous page pinning to handle requests from mdev drivers,
and we need to make sure pinned page accounting is tracked to the same
process.  If userspace can "pay" for locked pages from one process on
mapping, then "credit" them to another process on unmap, that seems
fairly exploitable.  We'd need some way to transfer the locked memory
accounting or handle it outside of vfio.  Thanks,

Alex
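
The "pay"/"credit" concern can be made concrete with a toy model.  This
is not kernel code: struct task_acct and both helpers are made up for
illustration, and 'limit' merely stands in for RLIMIT_MEMLOCK expressed
in pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of per-process locked-page accounting; not kernel code. */
struct task_acct {
    long locked;   /* pages currently charged to this task */
    long limit;    /* stand-in for RLIMIT_MEMLOCK, in pages */
};

/* Charge npages to t on DMA map; refuse if the limit would be exceeded. */
bool acct_pin(struct task_acct *t, long npages)
{
    assert(npages >= 0);
    if (t->locked + npages > t->limit) {
        return false;
    }
    t->locked += npages;
    return true;
}

/* Credit npages back to t on DMA unmap. */
void acct_unpin(struct task_acct *t, long npages)
{
    assert(npages >= 0);
    t->locked -= npages;
}
```

Suppose tasks A and B each pin 100 pages against a limit of 100.  If an
unmap of A's mapping is credited to B, B's counter drops to 0 while its
own 100 pages stay pinned, so B can pin another 100 and exceed its
limit, while A stays charged for pages it no longer pins.  Tying unmap
to the task that mapped avoids this.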




* Re: [PATCH V3 00/22] Live Update
  2021-05-14 15:15       ` Steven Sistare
  2021-05-17 11:40         ` Stefan Hajnoczi
@ 2021-05-18  9:57         ` Dr. David Alan Gilbert
  2021-05-18 16:00           ` Steven Sistare
  1 sibling, 1 reply; 81+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-18  9:57 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>> Provide the cprsave and cprload commands for live update.  These save and
> >>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> >>>> to a new version in between.
> >>>>
> >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >>>> paused state and waits for the cprload command.
> >>>
> >>> I think cprsave/cprload could be generalized by using QMP to stash the
> >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> >>> already opens fds passed using this mechanism.
> >>>
> >>> I haven't checked but it may be possible to drop some patches by reusing
> >>> QEMU's monitor file descriptor passing since the code already knows how
> >>> to open from 'getfd' fds.
> >>>
> >>> The reason why using QMP is interesting is because it eliminates the
> >>> need for execve(2). QEMU may be unable to execute a program due to
> >>> chroot, seccomp, etc.
> >>>
> >>> QMP would enable cprsave/cprload to work both with and without
> >>> execve(2).
> >>>
> >>> One tricky thing with this approach might be startup ordering: how to
> >>> get fds via the QMP monitor in the new process before processing the
> >>> entire command-line.
> >>
> >> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> >> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> >> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> >> I suspect my recent vfio extensions would smooth the rough edges.
> > 
> > I wonder about the reason for VFIO's pid limitation, maybe because it
> > pins pages from the original process?
> 
> The dma unmap code verifies that the requesting task is the same as the task that mapped
> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> to fix locked memory accounting, which is associated with the mm of the original task.

> > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > execve(2) the new binary?
> 
> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> I was working against vfio rather than with it.

OK, the weirdness of vfio helps explain a bit about why you're doing it
this way; can you help separate out some of the differences between
restart and reboot for me though:

In 'reboot' mode, where the guest must suspend in its drivers, how
many of these vfio requirements are needed?  I guess the memfd use
for the anonymous areas isn't any use for reboot mode.

You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
vfio still care about the currently-anonymous areas?

> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> without exec, I still need the exec option.

Can you explain what that code injection mechanism is for those of us
who didn't see that?

Dave

> >> However, the main issue is that guest ram must be backed by named shared memory, and
> >> we would need to add code to support shared memory for all the secondary memory objects.
> >> That makes it less interesting for us at this time; we care about updating legacy qemu 
> >> instances with anonymous guest memory.
> > 
> > Thanks for explaining this more in the other sub-thread. The secondary
> > memory objects you mentioned are relatively small so I don't think
> > saving them in the traditional way is a problem.
> > 
> > Two approaches for zero-copy memory migration fit into QEMU's existing
> > migration infrastructure:
> > 
> > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> >   etc) so they are not saved into the savevm file. The existing --object
> >   memory-backend-file syntax can be used.
> > 
> > - Extending the live migration protocol to detect when file descriptor
> >   passing is available (i.e. UNIX domain socket migration) and using
> >   that for memory-backend-* objects that have fds.
> > 
> > Either of these approaches would handle RAM with existing savevm/migrate
> > commands.
> 
> Yes, but the vfio issues would still need to be solved, and we would need new
> command line options to back existing and future secondary memory objects with 
> named shared memory.
> 
> > The remaining issue is how to migrate VFIO and other file descriptors
> > that cannot be reopened by the new process. As mentioned, QEMU already
> > has file descriptor passing support in the QMP monitor and support for
> > opening passed file descriptors (see qemu_open_internal(),
> > monitor_fd_param(), and socket_get_fd()).
> > 
> > The advantage of integrating live update functionality into the existing
> > savevm/migrate commands is that it will work in more use cases with
> > less code duplication/maintenance/bitrot prevention than the
> > special-case cprsave command in this patch series.
> > 
> > Maybe there is a fundamental technical reason why live update needs to
> > be different from QEMU's existing migration commands but I haven't
> > figured it out yet.
> 
> vfio and anonymous memory.
> 
> Regarding code duplication, I did consider whether to extend the migration
> syntax and implementation versus creating something new.  Those functions
> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> use case, and the cpr functions handle state that is n/a for the migration case.
> I judged that handling both in the same functions would be less readable and
> maintainable.  After feedback during the V1 review, I simplified the cprsave
> code by calling qemu_save_device_state, as Xen does, thus eliminating any
> interaction with the migration code.
> 
> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> review is more complete and folks agree on the final form of the functionality.
> 
> I do like the idea of supporting update without exec, but as a future project, 
> and not at the expense of dropping update with exec.
> 
> - Steve
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V3 00/22] Live Update
  2021-05-17 19:10           ` Alex Williamson
@ 2021-05-18 13:39             ` Stefan Hajnoczi
  2021-05-18 15:48               ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-05-18 13:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Steven Sistare, vfio-users,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster


On Mon, May 17, 2021 at 01:10:01PM -0600, Alex Williamson wrote:
> On Mon, 17 May 2021 12:40:43 +0100
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
> > > On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
> > > > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
> > > >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
> > > >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
> > > >>>> Provide the cprsave and cprload commands for live update.  These save and
> > > >>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> > > >>>> to a new version in between.
> > > >>>>
> > > >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> > > >>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> > > >>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> > > >>>> paused state and waits for the cprload command.  
> > > >>>
> > > >>> I think cprsave/cprload could be generalized by using QMP to stash the
> > > >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> > > >>> already opens fds passed using this mechanism.
> > > >>>
> > > >>> I haven't checked but it may be possible to drop some patches by reusing
> > > >>> QEMU's monitor file descriptor passing since the code already knows how
> > > >>> to open from 'getfd' fds.
> > > >>>
> > > >>> The reason why using QMP is interesting is because it eliminates the
> > > >>> need for execve(2). QEMU may be unable to execute a program due to
> > > >>> chroot, seccomp, etc.
> > > >>>
> > > >>> QMP would enable cprsave/cprload to work both with and without
> > > >>> execve(2).
> > > >>>
> > > >>> One tricky thing with this approach might be startup ordering: how to
> > > >>> get fds via the QMP monitor in the new process before processing the
> > > >>> entire command-line.  
> > > >>
> > > >> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> > > >> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> > > >> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> > > >> I suspect my recent vfio extensions would smooth the rough edges.  
> > > > 
> > > > I wonder about the reason for VFIO's pid limitation, maybe because it
> > > > pins pages from the original process?  
> > > 
> > > The dma unmap code verifies that the requesting task is the same as the task that mapped
> > > the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> > > to fix locked memory accounting, which is associated with the mm of the original task.
> > >   
> > > > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > > > execve(2) the new binary?  
> > > 
> > > That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> > > errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> > > but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> > > diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> > > I was working against vfio rather than with it.
> > > 
> > > Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> > > code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> > > without exec, I still need the exec option.
> > >   
> > > >> However, the main issue is that guest ram must be backed by named shared memory, and
> > > >> we would need to add code to support shared memory for all the secondary memory objects.
> > > >> That makes it less interesting for us at this time; we care about updating legacy qemu 
> > > >> instances with anonymous guest memory.  
> > > > 
> > > > Thanks for explaining this more in the other sub-thread. The secondary
> > > > memory objects you mentioned are relatively small so I don't think
> > > > saving them in the traditional way is a problem.
> > > > 
> > > > Two approaches for zero-copy memory migration fit into QEMU's existing
> > > > migration infrastructure:
> > > > 
> > > > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> > > >   etc) so they are not saved into the savevm file. The existing --object
> > > >   memory-backend-file syntax can be used.
> > > > 
> > > > - Extending the live migration protocol to detect when file descriptor
> > > >   passing is available (i.e. UNIX domain socket migration) and using
> > > >   that for memory-backend-* objects that have fds.
> > > > 
> > > > Either of these approaches would handle RAM with existing savevm/migrate
> > > > commands.  
> > > 
> > > Yes, but the vfio issues would still need to be solved, and we would need new
> > > command line options to back existing and future secondary memory objects with 
> > > named shared memory.
> > >   
> > > > The remaining issue is how to migrate VFIO and other file descriptors
> > > > that cannot be reopened by the new process. As mentioned, QEMU already
> > > > has file descriptor passing support in the QMP monitor and support for
> > > > opening passed file descriptors (see qemu_open_internal(),
> > > > monitor_fd_param(), and socket_get_fd()).
> > > > 
> > > > The advantage of integrating live update functionality into the existing
> > > > savevm/migrate commands is that it will work in more use cases with
> > > > less code duplication/maintenance/bitrot prevention than the
> > > > special-case cprsave command in this patch series.
> > > > 
> > > > Maybe there is a fundamental technical reason why live update needs to
> > > > be different from QEMU's existing migration commands but I haven't
> > > > figured it out yet.  
> > > 
> > > vfio and anonymous memory.
> > > 
> > > Regarding code duplication, I did consider whether to extend the migration
> > > syntax and implementation versus creating something new.  Those functions
> > > handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> > > use case, and the cpr functions handle state that is n/a for the migration case.
> > > I judged that handling both in the same functions would be less readable and
> > > maintainable.  After feedback during the V1 review, I simplified the cprsave
> > > > code by calling qemu_save_device_state, as Xen does, thus eliminating any
> > > interaction with the migration code.
> > > 
> > > Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> > > review is more complete and folks agree on the final form of the functionality.
> > > 
> > > I do like the idea of supporting update without exec, but as a future project, 
> > > and not at the expense of dropping update with exec.  
> > 
> > Alex: We're discussing how to live update QEMU while VFIO devices are
> > running. This patch series introduces monitor commands that call
> > execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
> > file descriptors. This way live update is transparent to VFIO but it
> > won't work if a sandboxed QEMU process is forbidden to call execve(2).
> > What are your thoughts on 1) the execve(2) approach and 2) extending
> > VFIO to allow running devices to be attached to a different process so
> > that execve(2) is not necessary?
> 
> Tracking processes is largely to support page pinning; we need to be
> able to support both asynchronous page pinning to handle requests from
> mdev drivers and we need to make sure pinned page accounting is
> tracked to the same process.  If userspace can "pay" for locked pages
> from one process on mapping, then "credit" them to another process on
> unmap, that seems fairly exploitable.  We'd need some way to transfer
> the locked memory accounting or handle it outside of vfio.  Thanks,

Vhost's VHOST_SET_OWNER ioctl is somewhat similar. It's used to
associate the in-kernel vhost device with a userspace process and its
mm.

Would it be possible to add a VFIO_SET_OWNER ioctl that associates the
current process with the vfio_device? Only one process would be the
owner at any given time.

I'm not sure how existing DMA mappings would behave, but this patch
series seems to rely on DMA continuing to work even though there is a
window of time when the execve(2) process doesn't have guest RAM
mmapped. So I guess existing DMA mappings continue to work because the
pages were previously pinned?

Stefan

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 07/22] cpr
  2021-05-14 15:14         ` Steven Sistare
@ 2021-05-18 13:42           ` Stefan Hajnoczi
  0 siblings, 0 replies; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-05-18 13:42 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Fri, May 14, 2021 at 11:14:44AM -0400, Steven Sistare wrote:
> On 5/14/2021 7:28 AM, Stefan Hajnoczi wrote:
> > On Thu, May 13, 2021 at 04:21:02PM -0400, Steven Sistare wrote:
> >> On 5/12/2021 12:19 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 07, 2021 at 05:25:05AM -0700, Steve Sistare wrote:
> >>>> To use the restart mode, qemu must be started with the memfd-alloc machine
> >>>> option.  The memfd's are saved to the environment and kept open across exec,
> >>>> after which they are found from the environment and re-mmap'd.  Hence guest
> >>>> ram is preserved in place, albeit with new virtual addresses in the qemu
> >>>> process.  The caller resumes the guest by calling cprload, which loads
> >>>> state from the file.  If the VM was running at cprsave time, then VM
> >>>> execution resumes.  cprsave supports any type of guest image and block
> >>>> device, but the caller must not modify guest block devices between cprsave
> >>>> and cprload.
> >>>
> >>> Does QEMU's existing -object memory-backend-file on tmpfs or hugetlbfs
> >>> achieve the same thing?
> >>
> >> Not quite.  Various secondary anonymous memory objects are allocated via ram_block_add
> >> and must be preserved, such as these on x86_64.  
> >>   vga.vram
> >>   pc.ram
> >>   pc.bios
> >>   pc.rom
> >>   vga.rom
> >>   rom@etc/acpi/tables
> >>   rom@etc/table-loader
> >>   rom@etc/acpi/rsdp
> >>
> >> Even the read-only areas must be preserved rather than recreated from files in the updated
> >> qemu, as their contents may have changed.
> > 
> > Migration knows how to save/load these RAM blocks. Only pc.ram is
> > significant in size so I'm not sure it's worth special-casing the
> > others?
> 
> Some of these are mapped for vfio dma as a consequence of the normal memory region callback to
> consumers code.  We get conflict errors vs those existing vfio mappings if they are recreated 
> and remapped in the new process.  The memfd option is a simple and robust solution to that issue.

Okay, if the VFIO device DMAs to them then they need to stay alive. Live
migration cannot copy their contents since they could be DMAed to at any
time and we'd copy stale data.

Stefan

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update
  2021-05-18 13:39             ` Stefan Hajnoczi
@ 2021-05-18 15:48               ` Steven Sistare
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Sistare @ 2021-05-18 15:48 UTC (permalink / raw)
  To: Stefan Hajnoczi, Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, vfio-users, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 5/18/2021 9:39 AM, Stefan Hajnoczi wrote:
> On Mon, May 17, 2021 at 01:10:01PM -0600, Alex Williamson wrote:
>> On Mon, 17 May 2021 12:40:43 +0100
>> Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>
>>> On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
>>>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
>>>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
>>>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
>>>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
>>>>>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>>>>>> to a new version in between.
>>>>>>>>
>>>>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>>>>>> paused state and waits for the cprload command.  
>>>>>>>
>>>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>>>> already opens fds passed using this mechanism.
>>>>>>>
>>>>>>> I haven't checked but it may be possible to drop some patches by reusing
>>>>>>> QEMU's monitor file descriptor passing since the code already knows how
>>>>>>> to open from 'getfd' fds.
>>>>>>>
>>>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>>>> chroot, seccomp, etc.
>>>>>>>
>>>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>>>> execve(2).
>>>>>>>
>>>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>>>> get fds via the QMP monitor in the new process before processing the
>>>>>>> entire command-line.  
>>>>>>
>>>>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>>>>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>>>>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>>>>>> I suspect my recent vfio extensions would smooth the rough edges.  
>>>>>
>>>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>>>> pins pages from the original process?  
>>>>
>>>> The dma unmap code verifies that the requesting task is the same as the task that mapped
>>>> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
>>>> to fix locked memory accounting, which is associated with the mm of the original task.
>>>>   
>>>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>>>> execve(2) the new binary?  
>>>>
>>>> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
>>>> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
>>>> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
>>>> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
>>>> I was working against vfio rather than with it.
>>>>
>>>> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
>>>> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
>>>> without exec, I still need the exec option.
>>>>   
>>>>>> However, the main issue is that guest ram must be backed by named shared memory, and
>>>>>> we would need to add code to support shared memory for all the secondary memory objects.
>>>>>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>>>>>> instances with anonymous guest memory.  
>>>>>
>>>>> Thanks for explaining this more in the other sub-thread. The secondary
>>>>> memory objects you mentioned are relatively small so I don't think
>>>>> saving them in the traditional way is a problem.
>>>>>
>>>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>>>> migration infrastructure:
>>>>>
>>>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>>>>   etc) so they are not saved into the savevm file. The existing --object
>>>>>   memory-backend-file syntax can be used.
>>>>>
>>>>> - Extending the live migration protocol to detect when file descriptor
>>>>>   passing is available (i.e. UNIX domain socket migration) and using
>>>>>   that for memory-backend-* objects that have fds.
>>>>>
>>>>> Either of these approaches would handle RAM with existing savevm/migrate
>>>>> commands.  
>>>>
>>>> Yes, but the vfio issues would still need to be solved, and we would need new
>>>> command line options to back existing and future secondary memory objects with 
>>>> named shared memory.
>>>>   
>>>>> The remaining issue is how to migrate VFIO and other file descriptors
>>>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>>>> has file descriptor passing support in the QMP monitor and support for
>>>>> opening passed file descriptors (see qemu_open_internal(),
>>>>> monitor_fd_param(), and socket_get_fd()).
>>>>>
>>>>> The advantage of integrating live update functionality into the existing
>>>>> savevm/migrate commands is that it will work in more use cases with
>>>>> less code duplication/maintenance/bitrot prevention than the
>>>>> special-case cprsave command in this patch series.
>>>>>
>>>>> Maybe there is a fundamental technical reason why live update needs to
>>>>> be different from QEMU's existing migration commands but I haven't
>>>>> figured it out yet.  
>>>>
>>>> vfio and anonymous memory.
>>>>
>>>> Regarding code duplication, I did consider whether to extend the migration
>>>> syntax and implementation versus creating something new.  Those functions
>>>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
>>>> use case, and the cpr functions handle state that is n/a for the migration case.
>>>> I judged that handling both in the same functions would be less readable and
>>>> maintainable.  After feedback during the V1 review, I simplified the cprsave
>>>> code by calling qemu_save_device_state, as Xen does, thus eliminating any
>>>> interaction with the migration code.
>>>>
>>>> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
>>>> review is more complete and folks agree on the final form of the functionality.
>>>>
>>>> I do like the idea of supporting update without exec, but as a future project, 
>>>> and not at the expense of dropping update with exec.  
>>>
>>> Alex: We're discussing how to live update QEMU while VFIO devices are
>>> running. This patch series introduces monitor commands that call
>>> execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
>>> file descriptors. This way live update is transparent to VFIO but it
>>> won't work if a sandboxed QEMU process is forbidden to call execve(2).
>>> What are your thoughts on 1) the execve(2) approach and 2) extending
>>> VFIO to allow running devices to be attached to a different process so
>>> that execve(2) is not necessary?
>>
>> Tracking processes is largely to support page pinning; we need to be
>> able to support both asynchronous page pinning to handle requests from
>> mdev drivers and we need to make sure pinned page accounting is
>> tracked to the same process.  If userspace can "pay" for locked pages
>> from one process on mapping, then "credit" them to another process on
>> unmap, that seems fairly exploitable.  We'd need some way to transfer
>> the locked memory accounting or handle it outside of vfio.  Thanks,
> 
> Vhost's VHOST_SET_OWNER ioctl is somewhat similar. It's used to
> associate the in-kernel vhost device with a userspace process and its
> mm.
> 
> Would it be possible to add a VFIO_SET_OWNER ioctl that associates the
> current process with the vfio_device? Only one process would be the
> owner at any given time.

It is possible, but the implementation would need to invent new hooks in mm
to transfer the locked memory accounting.

> I'm not sure how existing DMA mappings would behave, but this patch
> series seems to rely on DMA continuing to work even though there is a
> window of time when the execve(2) process doesn't have guest RAM
> mmapped. So I guess existing DMA mappings continue to work because the
> pages were previously pinned?

Correct.  And changes to mappings are blocked between the pre-exec process issuing
VFIO_DMA_UNMAP_FLAG_VADDR and the post-exec process issuing VFIO_DMA_MAP_FLAG_VADDR.
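
For illustration, the two ioctl arguments involved can be sketched as follows. This is a Python sketch rather than the series' C code, and the helper names are hypothetical. The numeric values are transcribed from <linux/vfio.h> as of Linux 5.12 and the request numbers assume the usual _IO() encoding on x86 and arm64, so they should be checked against the installed kernel headers before use.

```python
import fcntl
import struct

# Assumed values from <linux/vfio.h>; verify against your kernel headers.
VFIO_IOMMU_MAP_DMA = (ord(";") << 8) | (100 + 13)    # _IO(VFIO_TYPE, VFIO_BASE + 13)
VFIO_IOMMU_UNMAP_DMA = (ord(";") << 8) | (100 + 14)  # _IO(VFIO_TYPE, VFIO_BASE + 14)
VFIO_DMA_MAP_FLAG_VADDR = 1 << 2
VFIO_DMA_UNMAP_FLAG_VADDR = 1 << 2

MAP_FMT = "=IIQQQ"   # argsz, flags, vaddr, iova, size
UNMAP_FMT = "=IIQQ"  # argsz, flags, iova, size


def invalidate_vaddr_arg(iova: int, size: int) -> bytes:
    """Argument for the pre-exec call: drop the vaddr, keep pages pinned."""
    return struct.pack(UNMAP_FMT, struct.calcsize(UNMAP_FMT),
                       VFIO_DMA_UNMAP_FLAG_VADDR, iova, size)


def update_vaddr_arg(new_vaddr: int, iova: int, size: int) -> bytes:
    """Argument for the post-exec call: register the new virtual address."""
    return struct.pack(MAP_FMT, struct.calcsize(MAP_FMT),
                       VFIO_DMA_MAP_FLAG_VADDR, new_vaddr, iova, size)


def invalidate_vaddr(container_fd: int, iova: int, size: int) -> None:
    """Issue the pre-exec ioctl on an open vfio container descriptor."""
    fcntl.ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA,
                invalidate_vaddr_arg(iova, size))


def update_vaddr(container_fd: int, new_vaddr: int, iova: int, size: int) -> None:
    """Issue the post-exec ioctl with the address of the re-mmap'd memory."""
    fcntl.ioctl(container_fd, VFIO_IOMMU_MAP_DMA,
                update_vaddr_arg(new_vaddr, iova, size))
```

The iova and size passed after exec are expected to match the original mapping; only the process virtual address changes, since the memfd is mapped at a new address in the new process.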

- Steve


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update
  2021-05-18  9:57         ` Dr. David Alan Gilbert
@ 2021-05-18 16:00           ` Steven Sistare
  2021-05-18 19:23             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-05-18 16:00 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>>>> to a new version in between.
>>>>>>
>>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>>>> paused state and waits for the cprload command.
>>>>>
>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>> already opens fds passed using this mechanism.
>>>>>
>>>>> I haven't checked but it may be possible to drop some patches by reusing
>>>>> QEMU's monitor file descriptor passing since the code already knows how
>>>>> to open from 'getfd' fds.
>>>>>
>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>> chroot, seccomp, etc.
>>>>>
>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>> execve(2).
>>>>>
>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>> get fds via the QMP monitor in the new process before processing the
>>>>> entire command-line.
>>>>
>>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>>>> I suspect my recent vfio extensions would smooth the rough edges.
>>>
>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>> pins pages from the original process?
>>
>> The dma unmap code verifies that the requesting task is the same as the task that mapped
>> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
>> to fix locked memory accounting, which is associated with the mm of the original task.
> 
>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>> execve(2) the new binary?
>>
>> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
>> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
>> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
>> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
>> I was working against vfio rather than with it.
> 
> OK the weirdness of vfio helps explain a bit about why you're doing it
> this way; can you help separate some difference between restart and
> reboot for me though:
> 
> In 'reboot' mode, where the guest must suspend in its drivers, how
> much of these vfio requirements are needed?  I guess the memfd use
> for the anonymous areas isn't any use for reboot mode.

Correct.  For reboot no special vfio support or fiddling is needed.

> You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> vfio still care about the currently-anonymous areas?

Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.

>> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
>> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
>> without exec, I still need the exec option.
> 
> Can you explain what that code injection mechanism is for those of us
> who didn't see that?

Sure.  Here is slide 12 from the talk.  It relies on madvise(MADV_DOEXEC), which was not
accepted upstream.

-----------------------------------------------------------------------------
Legacy Live Update

 * Update legacy qemu process to latest version
   - Inject code into legacy qemu process to perform cprsave: vmsave.so
     . Access qemu data structures and globals
       - eg ram_list, savevm_state, chardevs, vhost_devices
       - dlopen does not resolve them, must get addresses via symbol lookup.
     . Delete some vmstate handlers, register new ones (eg vfio)
     . Call MADV_DOEXEC on guest memory. Find devices, preserve fd
 * Hot patch a monitor function to dlopen vmsave.so, call entry point
   - write patch to /proc/pid/mem
   - Call the monitor function via monitor socket
 * Send cprload to update qemu
 * vmsave.so has binary dependency on qemu data structures and variables
   - Build vmsave-ver.so per legacy version
   - Indexed by qemu's gcc build-id

-----------------------------------------------------------------------------

- Steve
 
>>>> However, the main issue is that guest ram must be backed by named shared memory, and
>>>> we would need to add code to support shared memory for all the secondary memory objects.
>>>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>>>> instances with anonymous guest memory.
>>>
>>> Thanks for explaining this more in the other sub-thread. The secondary
>>> memory objects you mentioned are relatively small so I don't think
>>> saving them in the traditional way is a problem.
>>>
>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>> migration infrastructure:
>>>
>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>>   etc) so they are not saved into the savevm file. The existing --object
>>>   memory-backend-file syntax can be used.
>>>
>>> - Extending the live migration protocol to detect when file descriptor
>>>   passing is available (i.e. UNIX domain socket migration) and using
>>>   that for memory-backend-* objects that have fds.
>>>
>>> Either of these approaches would handle RAM with existing savevm/migrate
>>> commands.
>>
>> Yes, but the vfio issues would still need to be solved, and we would need new
>> command line options to back existing and future secondary memory objects with 
>> named shared memory.
>>
>>> The remaining issue is how to migrate VFIO and other file descriptors
>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>> has file descriptor passing support in the QMP monitor and support for
>>> opening passed file descriptors (see qemu_open_internal(),
>>> monitor_fd_param(), and socket_get_fd()).
>>>
>>> The advantage of integrating live update functionality into the existing
>>> savevm/migrate commands is that it will work in more use cases with
>>> less code duplication/maintenance/bitrot prevention than the
>>> special-case cprsave command in this patch series.
>>>
>>> Maybe there is a fundamental technical reason why live update needs to
>>> be different from QEMU's existing migration commands but I haven't
>>> figured it out yet.
>>
>> vfio and anonymous memory.
>>
>> Regarding code duplication, I did consider whether to extend the migration
>> syntax and implementation versus creating something new.  Those functions
>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
>> use case, and the cpr functions handle state that is n/a for the migration case.
>> I judged that handling both in the same functions would be less readable and
>> maintainable.  After feedback during the V1 review, I simplified the cprsave
>> code by calling qemu_save_device_state, as Xen does, thus eliminating any
>> interaction with the migration code.
>>
>> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
>> review is more complete and folks agree on the final form of the functionality.
>>
>> I do like the idea of supporting update without exec, but as a future project, 
>> and not at the expense of dropping update with exec.
>>
>> - Steve
>>


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update
  2021-05-18 16:00           ` Steven Sistare
@ 2021-05-18 19:23             ` Dr. David Alan Gilbert
  2021-05-18 20:01               ` Alex Williamson
  2021-05-18 20:14               ` Steven Sistare
  0 siblings, 2 replies; 81+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-18 19:23 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
> > * Steven Sistare (steven.sistare@oracle.com) wrote:
> >> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> >>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>>>> Provide the cprsave and cprload commands for live update.  These save and
> >>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> >>>>>> to a new version in between.
> >>>>>>
> >>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >>>>>> paused state and waits for the cprload command.
> >>>>>
> >>>>> I think cprsave/cprload could be generalized by using QMP to stash the
> >>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> >>>>> already opens fds passed using this mechanism.
> >>>>>
> >>>>> I haven't checked but it may be possible to drop some patches by reusing
> >>>>> QEMU's monitor file descriptor passing since the code already knows how
> >>>>> to open from 'getfd' fds.
> >>>>>
> >>>>> The reason why using QMP is interesting is because it eliminates the
> >>>>> need for execve(2). QEMU may be unable to execute a program due to
> >>>>> chroot, seccomp, etc.
> >>>>>
> >>>>> QMP would enable cprsave/cprload to work both with and without
> >>>>> execve(2).
> >>>>>
> >>>>> One tricky thing with this approach might be startup ordering: how to
> >>>>> get fds via the QMP monitor in the new process before processing the
> >>>>> entire command-line.
> >>>>
> >>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> >>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> >>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> >>>> I suspect my recent vfio extensions would smooth the rough edges.
> >>>
> >>> I wonder about the reason for VFIO's pid limitation, maybe because it
> >>> pins pages from the original process?
> >>
> >> The dma unmap code verifies that the requesting task is the same as the task that mapped
> >> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> >> to fix locked memory accounting, which is associated with the mm of the original task.
> > 
> >>> Is this VFIO pid limitation the main reason why you chose to make QEMU
> >>> execve(2) the new binary?
> >>
> >> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> >> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> >> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> >> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> >> I was working against vfio rather than with it.
> > 
> > OK the weirdness of vfio helps explain a bit about why you're doing it
> > this way; can you help separate some difference between restart and
> > reboot for me though:
> > 
> > In 'reboot' mode, where the guest must suspend in its drivers, how
> > much of these vfio requirements are needed?  I guess the memfd use
> > for the anonymous areas isn't any use for reboot mode.
> 
> Correct.  For reboot no special vfio support or fiddling is needed.
> 
> > You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> > vfio still care about the currently-anonymous areas?
> 
> Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
> are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.

Is it possible with vfio to map it into multiple processes
simultaneously or does it have to only be one at a time?
Are you saying that you have no way to shut off DMA, and thus you can
never know it's safe to terminate the source process?

> >> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> >> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> >> without exec, I still need the exec option.
> > 
> > Can you explain what that code injection mechanism is for those of us
> > who didn't see that?
> 
> Sure.  Here is slide 12 from the talk.  It relies on madvise(MADV_DOEXEC), which was not
> accepted upstream.

In this series, without MADV_DOEXEC, how do you guarantee the same HVA
in source and destination - or is that not necessary?

> -----------------------------------------------------------------------------
> Legacy Live Update
> 
>  * Update legacy qemu process to latest version
>    - Inject code into legacy qemu process to perform cprsave: vmsave.so
>      . Access qemu data structures and globals
>        - eg ram_list, savevm_state, chardevs, vhost_devices
>        - dlopen does not resolve them, must get addresses via symbol lookup.
>      . Delete some vmstate handlers, register new ones (eg vfio)
>      . Call MADV_DOEXEC on guest memory. Find devices, preserve fd
>  * Hot patch a monitor function to dlopen vmsave.so, call entry point
>    - write patch to /proc/pid/mem
>    - Call the monitor function via monitor socket
>  * Send cprload to update qemu
>  * vmsave.so has binary dependency on qemu data structures and variables
>    - Build vmsave-ver.so per legacy version
>    - Indexed by qemu's gcc build-id
> 
> -----------------------------------------------------------------------------

That's hairy!
At that point isn't it easier to recompile a patched qemu against the
original sources and ptrace something in to mmap the new qemu?

Dave

> - Steve
>  
> >>>> However, the main issue is that guest ram must be backed by named shared memory, and
> >>>> we would need to add code to support shared memory for all the secondary memory objects.
> >>>> That makes it less interesting for us at this time; we care about updating legacy qemu 
> >>>> instances with anonymous guest memory.
> >>>
> >>> Thanks for explaining this more in the other sub-thread. The secondary
> >>> memory objects you mentioned are relatively small so I don't think
> >>> saving them in the traditional way is a problem.
> >>>
> >>> Two approaches for zero-copy memory migration fit into QEMU's existing
> >>> migration infrastructure:
> >>>
> >>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> >>>   etc) so they are not saved into the savevm file. The existing --object
> >>>   memory-backend-file syntax can be used.
> >>>
> >>> - Extending the live migration protocol to detect when file descriptor
> >>>   passing is available (i.e. UNIX domain socket migration) and using
> >>>   that for memory-backend-* objects that have fds.
> >>>
> >>> Either of these approaches would handle RAM with existing savevm/migrate
> >>> commands.
> >>
> >> Yes, but the vfio issues would still need to be solved, and we would need new
> >> command line options to back existing and future secondary memory objects with 
> >> named shared memory.
> >>
> >>> The remaining issue is how to migrate VFIO and other file descriptors
> >>> that cannot be reopened by the new process. As mentioned, QEMU already
> >>> has file descriptor passing support in the QMP monitor and support for
> >>> opening passed file descriptors (see qemu_open_internal(),
> >>> monitor_fd_param(), and socket_get_fd()).
> >>>
> >>> The advantage of integrating live update functionality into the existing
> >>> savevm/migrate commands is that it will work in more use cases with
> >>> less code duplication/maintenance/bitrot prevention than the
> >>> special-case cprsave command in this patch series.
> >>>
> >>> Maybe there is a fundamental technical reason why live update needs to
> >>> be different from QEMU's existing migration commands but I haven't
> >>> figured it out yet.
> >>
> >> vfio and anonymous memory.
> >>
> >> Regarding code duplication, I did consider whether to extend the migration
> >> syntax and implementation versus creating something new.  Those functions
> >> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> >> use case, and the cpr functions handle state that is n/a for the migration case.
> >> I judged that handling both in the same functions would be less readable and
> >> maintainable.  After feedback during the V1 review, I simplified the cprsave
> >> code by calling qemu_save_device_state, as Xen does, thus eliminating any
> >> interaction with the migration code.
> >>
> >> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> >> review is more complete and folks agree on the final form of the functionality.
> >>
> >> I do like the idea of supporting update without exec, but as a future project, 
> >> and not at the expense of dropping update with exec.
> >>
> >> - Steve
> >>
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update
  2021-05-18 19:23             ` Dr. David Alan Gilbert
@ 2021-05-18 20:01               ` Alex Williamson
  2021-05-18 20:14               ` Steven Sistare
  1 sibling, 0 replies; 81+ messages in thread
From: Alex Williamson @ 2021-05-18 20:01 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Philippe Mathieu-Daudé,
	Juan Quintela, qemu-devel, Markus Armbruster, Steven Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Alex Bennée

On Tue, 18 May 2021 20:23:25 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Steven Sistare (steven.sistare@oracle.com) wrote:
> > On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:  
> > > * Steven Sistare (steven.sistare@oracle.com) wrote:  
> > >> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
> > >>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
> > >>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
> > >>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
> > >>>>>> Provide the cprsave and cprload commands for live update.  These save and
> > >>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> > >>>>>> to a new version in between.
> > >>>>>>
> > >>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> > >>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> > >>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> > >>>>>> paused state and waits for the cprload command.  
> > >>>>>
> > >>>>> I think cprsave/cprload could be generalized by using QMP to stash the
> > >>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> > >>>>> already opens fds passed using this mechanism.
> > >>>>>
> > >>>>> I haven't checked but it may be possible to drop some patches by reusing
> > >>>>> QEMU's monitor file descriptor passing since the code already knows how
> > >>>>> to open from 'getfd' fds.
> > >>>>>
> > >>>>> The reason why using QMP is interesting is because it eliminates the
> > >>>>> need for execve(2). QEMU may be unable to execute a program due to
> > >>>>> chroot, seccomp, etc.
> > >>>>>
> > >>>>> QMP would enable cprsave/cprload to work both with and without
> > >>>>> execve(2).
> > >>>>>
> > >>>>> One tricky thing with this approach might be startup ordering: how to
> > >>>>> get fds via the QMP monitor in the new process before processing the
> > >>>>> entire command-line.  
> > >>>>
> > >>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> > >>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> > >>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> > >>>> I suspect my recent vfio extensions would smooth the rough edges.  
> > >>>
> > >>> I wonder about the reason for VFIO's pid limitation, maybe because it
> > >>> pins pages from the original process?  
> > >>
> > >> The dma unmap code verifies that the requesting task is the same as the task that mapped
> > >> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> > >> to fix locked memory accounting, which is associated with the mm of the original task.  
> > >   
> > >>> Is this VFIO pid limitation the main reason why you chose to make QEMU
> > >>> execve(2) the new binary?  
> > >>
> > >> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> > >> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> > >> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> > >> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> > >> I was working against vfio rather than with it.  
> > > 
> > > OK the weirdness of vfio helps explain a bit about why you're doing it
> > > this way; can you help separate some difference between restart and
> > > reboot for me though:
> > > 
> > > In 'reboot' mode, where the guest must do suspend in its drivers, how
> > > much of these vfio requirements are needed?  I guess the memfd use
> > > for the anonymous areas isn't any use for reboot mode.  
> > 
> > Correct.  For reboot no special vfio support or fiddling is needed.
> >   
> > > You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> > > vfio still care about the currently-anonymous areas?  
> > 
> > Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
> > are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.  
> 
> Is it possible with vfio to map it into multiple processes
> simultaneously or does it have to only be one at a time?

The IOMMU maps an IOVA to a physical address; what Steve is saying is
that this mapping persists across the restart.  A given IOVA can only map to
a specific physical address, so mapping into multiple processes doesn't
make any sense.  The two processes need to map the same IOVA to the
same HPA, only the HVA is allowed to change.

> Are you saying that you have no way to shut off DMA, and thus you can
> never know it's safe to terminate the source process?

Stopping DMA, ex. disabling PCI bus master, would be not only visible
to the behavior of the device, but likely detrimental.  You'd need
driver or device participation to some extent to make this seamless.

> > >> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> > >> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> > >> without exec, I still need the exec option.  
> > > 
> > > Can you explain what that code injection mechanism is for those of us
> > > who didn't see that?  
> > 
> > > Sure.  Here is slide 12 from the talk.  It relies on madvise(MADV_DOEXEC) which was not
> > accepted upstream.  
> 
> In this series, without MADV_DOEXEC, how do you guarantee the same HVA
> in source and destination - or is that not necessary?

It's not necessary; the HVA is used to establish the IOVA to HPA
mapping for the IOMMU.  We have patches upstream that suspend (block)
that translation for the window when the HVA is invalid and resume when
it becomes valid.  It's expected that the new HVA is equivalent to the
old HVA and that the user can only hurt themselves should they violate
this, ie. they can still only map+pin memory they own, so at worst they
create a bad translation for their own device.  Thanks,

Alex




* Re: [PATCH V3 00/22] Live Update
  2021-05-18 19:23             ` Dr. David Alan Gilbert
  2021-05-18 20:01               ` Alex Williamson
@ 2021-05-18 20:14               ` Steven Sistare
  2021-05-20 13:00                 ` [PATCH V3 00/22] Live Update [reboot] Dr. David Alan Gilbert
  2021-05-20 13:13                 ` [PATCH V3 00/22] Live Update [restart] Dr. David Alan Gilbert
  1 sibling, 2 replies; 81+ messages in thread
From: Steven Sistare @ 2021-05-18 20:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

On 5/18/2021 3:23 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
>>>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>>>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>>>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>>>>>> to a new version in between.
>>>>>>>>
>>>>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>>>>>> paused state and waits for the cprload command.
>>>>>>>
>>>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>>>> already opens fds passed using this mechanism.
>>>>>>>
>>>>>>> I haven't checked but it may be possible to drop some patches by reusing
>>>>>>> QEMU's monitor file descriptor passing since the code already knows how
>>>>>>> to open from 'getfd' fds.
>>>>>>>
>>>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>>>> chroot, seccomp, etc.
>>>>>>>
>>>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>>>> execve(2).
>>>>>>>
>>>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>>>> get fds via the QMP monitor in the new process before processing the
>>>>>>> entire command-line.
>>>>>>
>>>>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>>>>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>>>>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>>>>>> I suspect my recent vfio extensions would smooth the rough edges.
>>>>>
>>>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>>>> pins pages from the original process?
>>>>
>>>> The dma unmap code verifies that the requesting task is the same as the task that mapped
>>>> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
>>>> to fix locked memory accounting, which is associated with the mm of the original task.
>>>
>>>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>>>> execve(2) the new binary?
>>>>
>>>> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
>>>> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
>>>> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
>>>> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
>>>> I was working against vfio rather than with it.
>>>
>>> OK the weirdness of vfio helps explain a bit about why you're doing it
>>> this way; can you help separate some difference between restart and
>>> reboot for me though:
>>>
>>> In 'reboot' mode, where the guest must do suspend in its drivers, how
>>> much of these vfio requirements are needed?  I guess the memfd use
>>> for the anonymous areas isn't any use for reboot mode.
>>
>> Correct.  For reboot no special vfio support or fiddling is needed.
>>
>>> You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
>>> vfio still care about the currently-anonymous areas?
>>
>> Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
>> are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.
> 
> Is it possible with vfio to map it into multiple processes
> simultaneously or does it have to only be one at a time?
> Are you saying that you have no way to shut off DMA, and thus you can
> never know it's safe to terminate the source process?
> 
>>>> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
>>>> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
>>>> without exec, I still need the exec option.
>>>
>>> Can you explain what that code injection mechanism is for those of us
>>> who didn't see that?
>>
>> Sure.  Here is slide 12 from the talk.  It relies on madvise(MADV_DOEXEC) which was not
>> accepted upstream.
> 
> In this series, without MADV_DOEXEC, how do you guarantee the same HVA
> in source and destination - or is that not necessary?

Not necessary.  We can safely change the HVA using the new vfio ioctls.
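
For readers following along, here is a minimal sketch of that two-step
vaddr update, assuming a type1 vfio container fd and kernel headers that
carry the 5.12 definitions.  The function names are hypothetical and are
not taken from this series; only the ioctls, structs, and flags come from
linux/vfio.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Fallbacks for headers older than Linux 5.12 (values as in linux/vfio.h). */
#ifndef VFIO_DMA_UNMAP_FLAG_VADDR
#define VFIO_DMA_UNMAP_FLAG_VADDR (1 << 2)
#endif
#ifndef VFIO_DMA_MAP_FLAG_VADDR
#define VFIO_DMA_MAP_FLAG_VADDR (1 << 2)
#endif

/* Hypothetical helper: request "invalidate the vaddr, keep the pages pinned". */
void vaddr_unmap_args(struct vfio_iommu_type1_dma_unmap *u,
                      uint64_t iova, uint64_t size)
{
    assert(u != NULL);
    memset(u, 0, sizeof(*u));
    u->argsz = sizeof(*u);
    u->flags = VFIO_DMA_UNMAP_FLAG_VADDR;
    u->iova = iova;
    u->size = size;
}

/* Hypothetical helper: supply the new HVA for an existing IOVA range. */
void vaddr_remap_args(struct vfio_iommu_type1_dma_map *m,
                      void *new_hva, uint64_t iova, uint64_t size)
{
    assert(m != NULL);
    memset(m, 0, sizeof(*m));
    m->argsz = sizeof(*m);
    m->flags = VFIO_DMA_MAP_FLAG_VADDR;   /* no READ/WRITE: not a new mapping */
    m->vaddr = (uint64_t)(uintptr_t)new_hva;
    m->iova = iova;
    m->size = size;
}

/* Before exec: the IOVA to HPA translation stays in place. */
int vfio_suspend_vaddr(int container_fd, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_unmap u;

    vaddr_unmap_args(&u, iova, size);
    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &u);
}

/* After exec: tell vfio where the same memory now lives in this process. */
int vfio_resume_vaddr(int container_fd, void *new_hva,
                      uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map m;

    vaddr_remap_args(&m, new_hva, iova, size);
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &m);
}
```

The new HVA must address the same memory object that was originally
mapped, which is why the memfd is kept open across the exec.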

>> -----------------------------------------------------------------------------
>> Legacy Live Update
>>
>>  * Update legacy qemu process to latest version
>>    - Inject code into legacy qemu process to perform cprsave: vmsave.so
>>      . Access qemu data structures and globals
>>        - eg ram_list, savevm_state, chardevs, vhost_devices
>>        - dlopen does not resolve them, must get addresses via symbol lookup.
>>      . Delete some vmstate handlers, register new ones (eg vfio)
>>      . Call MADV_DOEXEC on guest memory. Find devices, preserve fd
>>  * Hot patch a monitor function to dlopen vmsave.so, call entry point
>>    - write patch to /proc/pid/mem
>>    - Call the monitor function via monitor socket
>>  * Send cprload to update qemu
>>  * vmsave.so has binary dependency on qemu data structures and variables
>>    - Build vmsave-ver.so per legacy version
>>    - Indexed by qemu's gcc build-id
>>
>> -----------------------------------------------------------------------------
> 
> That's hairy!
> At that point isn't it easier to recompile a patched qemu against the
> original sources and ptrace something in to mmap the new qemu?

That could work, but safely capturing all the threads and forcing them to jump to the
mmap'd qemu is hard.

- Steve

>>>>>> However, the main issue is that guest ram must be backed by named shared memory, and
>>>>>> we would need to add code to support shared memory for all the secondary memory objects.
>>>>>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>>>>>> instances with anonymous guest memory.
>>>>>
>>>>> Thanks for explaining this more in the other sub-thread. The secondary
>>>>> memory objects you mentioned are relatively small so I don't think
>>>>> saving them in the traditional way is a problem.
>>>>>
>>>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>>>> migration infrastructure:
>>>>>
>>>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>>>>   etc) so they are not saved into the savevm file. The existing --object
>>>>>   memory-backend-file syntax can be used.
>>>>>
>>>>> - Extending the live migration protocol to detect when file descriptor
>>>>>   passing is available (i.e. UNIX domain socket migration) and using
>>>>>   that for memory-backend-* objects that have fds.
>>>>>
>>>>> Either of these approaches would handle RAM with existing savevm/migrate
>>>>> commands.
>>>>
>>>> Yes, but the vfio issues would still need to be solved, and we would need new
>>>> command line options to back existing and future secondary memory objects with 
>>>> named shared memory.
>>>>
>>>>> The remaining issue is how to migrate VFIO and other file descriptors
>>>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>>>> has file descriptor passing support in the QMP monitor and support for
>>>>> opening passed file descriptors (see qemu_open_internal(),
>>>>> monitor_fd_param(), and socket_get_fd()).
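
As background for the descriptor passing discussed above: passing an fd
between processes over a UNIX domain socket is done with SCM_RIGHTS
ancillary data.  The sketch below shows only that underlying mechanism,
under the assumption that a connected AF_UNIX socket is available; the
function names are hypothetical and this is not QEMU's monitor code.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one descriptor over a connected AF_UNIX socket.  Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 0;                       /* at least one data byte is sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;            /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor.  Returns the new fd, or -1 on any failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open
file description, so no exec is needed to hand the descriptor over.  As
noted elsewhere in the thread, that alone does not address the vfio task
ownership check.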
>>>>>
>>>>> The advantage of integrating live update functionality into the existing
>>>>> savevm/migrate commands is that it will work in more use cases with
>>>>> less code duplication/maintenance/bitrot prevention than the
>>>>> special-case cprsave command in this patch series.
>>>>>
>>>>> Maybe there is a fundamental technical reason why live update needs to
>>>>> be different from QEMU's existing migration commands but I haven't
>>>>> figured it out yet.
>>>>
>>>> vfio and anonymous memory.
>>>>
>>>> Regarding code duplication, I did consider whether to extend the migration
>>>> syntax and implementation versus creating something new.  Those functions
>>>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
>>>> use case, and the cpr functions handle state that is n/a for the migration case.
>>>> I judged that handling both in the same functions would be less readable and
>>>> maintainable.  After feedback during the V1 review, I simplified the cprsave
>>>> code by calling qemu_save_device_state, as Xen does, thus eliminating any
>>>> interaction with the migration code.
>>>>
>>>> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
>>>> review is more complete and folks agree on the final form of the functionality.
>>>>
>>>> I do like the idea of supporting update without exec, but as a future project, 
>>>> and not at the expense of dropping update with exec.
>>>>
>>>> - Steve
>>>>
>>



* Re: [PATCH V3 00/22] Live Update
  2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
                   ` (23 preceding siblings ...)
  2021-05-12 16:42 ` Stefan Hajnoczi
@ 2021-05-19 16:43 ` Steven Sistare
  2021-06-02 15:19   ` Steven Sistare
  24 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-05-19 16:43 UTC (permalink / raw)
  To: Michael S. Tsirkin, Marcel Apfelbaum
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

Hi Michael, Marcel,
  I hope you have time to review the pci and vfio-pci related patches in this
series.  They are an essential part of the live update functionality.  The
first 2 patches are straightforward, just exposing functions for use in vfio.
The last 2 patches are more substantial.

  - pci: export functions for cpr
  - vfio-pci: refactor for cpr
  - vfio-pci: cpr part 1
  - vfio-pci: cpr part 2

- Steve

On 5/7/2021 8:24 AM, Steve Sistare wrote:
> Provide the cprsave and cprload commands for live update.  These save and
> restore VM state, with minimal guest pause time, so that qemu may be updated
> to a new version in between.
> 
> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> paused state and waits for the cprload command.
> 
> To use the restart mode, qemu must be started with the memfd-alloc option,
> which allocates guest ram using memfd_create.  The memfd's are saved to
> the environment and kept open across exec, after which they are found from
> the environment and re-mmap'd.  Hence guest ram is preserved in place,
> albeit with new virtual addresses in the qemu process.  The caller resumes
> the guest by calling cprload, which loads state from the file.  If the VM
> was running at cprsave time, then VM execution resumes.  cprsave supports
> any type of guest image and block device, but the caller must not modify
> guest block devices between cprsave and cprload.
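
The memfd mechanism described above can be sketched in a few lines of
standalone C.  This is an illustration only, not code from the series:
the function names and the environment variable name are hypothetical,
and error handling is reduced to the minimum.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a memfd of the given size that stays open across exec. */
int ram_memfd_create(const char *name, size_t size)
{
    int flags;
    int fd = memfd_create(name, 0);      /* MFD_CLOEXEC deliberately unset */

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    /* Clear FD_CLOEXEC explicitly, in case the fd came from elsewhere. */
    flags = fcntl(fd, F_GETFD);
    if (flags < 0 || fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Record the fd number in the environment so the exec'd binary finds it. */
int ram_fd_save(const char *var, int fd)
{
    char buf[16];

    snprintf(buf, sizeof(buf), "%d", fd);
    return setenv(var, buf, 1);
}

/* After exec: look the fd up and map the same pages at a new address. */
void *ram_fd_remap(const char *var, size_t size)
{
    const char *val = getenv(var);
    char *end;
    long fd;
    void *p;

    if (!val) {
        return NULL;
    }
    fd = strtol(val, &end, 10);
    if (end == val || *end != '\0' || fd < 0) {
        return NULL;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, (int)fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

In the real flow ram_fd_remap would run in the post-exec process.  The
contents are preserved because both mappings are MAP_SHARED views of the
same memfd; only the virtual address differs.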
> 
> The restart mode supports vfio devices by preserving the vfio container,
> group, device, and event descriptors across the qemu re-exec, and by
> updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
> VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
> and integrated in Linux kernel 5.12.
> 
> For the reboot mode, cprsave saves state and exits qemu, and the caller is
> allowed to update the host kernel and system software and reboot.  The
> caller resumes the guest by running qemu with the same arguments as the
> original process and calling cprload.  To use this mode, guest ram must be
> mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
> PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.
> 
> The reboot mode supports vfio devices if the caller suspends the guest
> instead of stopping the VM, such as by issuing guest-suspend-ram to the
> qemu guest agent.  The guest drivers' suspend methods flush outstanding
> requests and re-initialize the devices, and thus there is no device state
> to save and restore.
> 
> The first patches add helper functions:
> 
>   - as_flat_walk
>   - qemu_ram_volatile
>   - oslib: qemu_clr_cloexec
>   - util: env var helpers
>   - machine: memfd-alloc option
>   - vl: add helper to request re-exec
> 
> The next patches implement cprsave and cprload:
> 
>   - cpr
>   - cpr: QMP interfaces
>   - cpr: HMP interfaces
> 
> The next patches add vfio support for the restart mode:
> 
>   - pci: export functions for cpr
>   - vfio-pci: refactor for cpr
>   - vfio-pci: cpr part 1
>   - vfio-pci: cpr part 2
> 
> The next patches preserve various descriptor-based backend devices across
> a cprsave restart:
> 
>   - vhost: reset vhost devices upon cprsave
>   - hostmem-memfd: cpr support
>   - chardev: cpr framework
>   - chardev: cpr for simple devices
>   - chardev: cpr for pty
>   - chardev: cpr for sockets
>   - cpr: only-cpr-capable option
>   - cpr: maintainers
>   - simplify savevm
> 
> Here is an example of updating qemu from v4.2.0 to v4.2.1 using
> "cprsave restart".  The software update is performed while the guest is
> running to minimize downtime.
> 
> window 1				| window 2
> 					|
> # qemu-system-x86_64 ... 		|
> QEMU 4.2.0 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: running			|
> 					| # yum update qemu
> (qemu) cprsave /tmp/qemu.sav restart	|
> QEMU 4.2.1 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: paused (prelaunch)		|
> (qemu) cprload /tmp/qemu.sav		|
> (qemu) info status			|
> VM status: running			|
> 
> 
> Here is an example of updating the host kernel using "cprsave reboot"
> 
> window 1					| window 2
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: running				|
> 						| # yum update kernel-uek
> (qemu) cprsave /tmp/qemu.sav reboot		|
> 						|
> # systemctl kexec				|
> kexec_core: Starting new kernel			|
> ...						|
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: paused (prelaunch)			|
> (qemu) cprload /tmp/qemu.sav			|
> (qemu) info status				|
> VM status: running				|
> 
> Changes from V1 to V2:
>   - revert vmstate infrastructure changes
>   - refactor cpr functions into new files
>   - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to 
>     preserve memory.
>   - add framework to filter chardev's that support cpr
>   - save and restore vfio eventfd's
>   - modify cprinfo QMP interface
>   - incorporate misc review feedback
>   - remove unrelated and unneeded patches
>   - refactor all patches into a shorter and easier to review series
> 
> Changes from V2 to V3:
>   - rebase to qemu 6.0.0
>   - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
>   - change memfd-alloc to a machine option
>   - use existing channel socket function instead of defining new ones
>   - close monitor socket during cpr
>   - support memory-backend-memfd
>   - fix a few unreported bugs
> 
> Steve Sistare (18):
>   as_flat_walk
>   qemu_ram_volatile
>   oslib: qemu_clr_cloexec
>   util: env var helpers
>   machine: memfd-alloc option
>   vl: add helper to request re-exec
>   cpr
>   pci: export functions for cpr
>   vfio-pci: refactor for cpr
>   vfio-pci: cpr part 1
>   vfio-pci: cpr part 2
>   hostmem-memfd: cpr support
>   chardev: cpr framework
>   chardev: cpr for simple devices
>   chardev: cpr for pty
>   cpr: only-cpr-capable option
>   cpr: maintainers
>   simplify savevm
> 
> Mark Kanda, Steve Sistare (4):
>   cpr: QMP interfaces
>   cpr: HMP interfaces
>   vhost: reset vhost devices upon cprsave
>   chardev: cpr for sockets
> 
>  MAINTAINERS                   |  11 +++
>  backends/hostmem-memfd.c      |  21 +++--
>  chardev/char-mux.c            |   1 +
>  chardev/char-null.c           |   1 +
>  chardev/char-pty.c            |  15 ++-
>  chardev/char-serial.c         |   1 +
>  chardev/char-socket.c         |  35 +++++++
>  chardev/char-stdio.c          |   8 ++
>  chardev/char.c                |  41 +++++++-
>  gdbstub.c                     |   1 +
>  hmp-commands.hx               |  44 +++++++++
>  hw/core/machine.c             |  19 ++++
>  hw/pci/msi.c                  |   4 +
>  hw/pci/msix.c                 |  20 ++--
>  hw/pci/pci.c                  |   7 +-
>  hw/vfio/common.c              |  68 +++++++++++++-
>  hw/vfio/cpr.c                 | 131 ++++++++++++++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 | 214 ++++++++++++++++++++++++++++++++++++++----
>  hw/vfio/trace-events          |   1 +
>  hw/virtio/vhost.c             |  11 +++
>  include/chardev/char.h        |   6 ++
>  include/exec/memory.h         |  25 +++++
>  include/hw/boards.h           |   1 +
>  include/hw/pci/msix.h         |   5 +
>  include/hw/pci/pci.h          |   2 +
>  include/hw/vfio/vfio-common.h |   8 ++
>  include/hw/virtio/vhost.h     |   1 +
>  include/migration/cpr.h       |  17 ++++
>  include/monitor/hmp.h         |   3 +
>  include/qemu/env.h            |  23 +++++
>  include/qemu/osdep.h          |   1 +
>  include/sysemu/runstate.h     |   2 +
>  include/sysemu/sysemu.h       |   2 +
>  linux-headers/linux/vfio.h    |  27 ++++++
>  migration/cpr.c               | 200 +++++++++++++++++++++++++++++++++++++++
>  migration/meson.build         |   1 +
>  migration/migration.c         |   5 +
>  migration/savevm.c            |  21 ++---
>  migration/savevm.h            |   2 +
>  monitor/hmp-cmds.c            |  48 ++++++++++
>  monitor/hmp.c                 |   3 +
>  monitor/qmp-cmds.c            |  31 ++++++
>  monitor/qmp.c                 |   3 +
>  qapi/char.json                |   5 +-
>  qapi/cpr.json                 |  76 +++++++++++++++
>  qapi/meson.build              |   1 +
>  qapi/qapi-schema.json         |   1 +
>  qemu-options.hx               |  39 +++++++-
>  softmmu/globals.c             |   2 +
>  softmmu/memory.c              |  48 ++++++++++
>  softmmu/physmem.c             |  49 ++++++++--
>  softmmu/runstate.c            |  49 +++++++++-
>  softmmu/vl.c                  |  21 ++++-
>  stubs/cpr.c                   |   3 +
>  stubs/meson.build             |   1 +
>  trace-events                  |   1 +
>  util/env.c                    |  99 +++++++++++++++++++
>  util/meson.build              |   1 +
>  util/oslib-posix.c            |   9 ++
>  util/oslib-win32.c            |   4 +
>  util/qemu-config.c            |   4 +
>  62 files changed, 1431 insertions(+), 74 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
>  create mode 100644 include/migration/cpr.h
>  create mode 100644 include/qemu/env.h
>  create mode 100644 migration/cpr.c
>  create mode 100644 qapi/cpr.json
>  create mode 100644 stubs/cpr.c
>  create mode 100644 util/env.c
> 



* Re: [PATCH V3 11/22] vfio-pci: refactor for cpr
  2021-05-07 12:25 ` [PATCH V3 11/22] vfio-pci: refactor " Steve Sistare
@ 2021-05-19 22:38   ` Alex Williamson
  2021-05-21 13:33     ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2021-05-19 22:38 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Fri,  7 May 2021 05:25:09 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:

> Export vfio_address_spaces and vfio_listener_skipped_section.
> Add optional eventfd arg to vfio_add_kvm_msi_virq.
> Refactor vector use into a helper vfio_vector_init.
> All for use by cpr in a subsequent patch.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/vfio/common.c              |  4 ++--
>  hw/vfio/pci.c                 | 36 +++++++++++++++++++++++++-----------
>  include/hw/vfio/vfio-common.h |  3 +++
>  3 files changed, 30 insertions(+), 13 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index ae5654f..9220e64 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -42,7 +42,7 @@
>  
>  VFIOGroupList vfio_group_list =
>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> -static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
> +VFIOAddressSpaceList vfio_address_spaces =
>      QLIST_HEAD_INITIALIZER(vfio_address_spaces);
>  
>  #ifdef CONFIG_KVM
> @@ -534,7 +534,7 @@ static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
>      return -1;
>  }
>  
> -static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> +bool vfio_listener_skipped_section(MemoryRegionSection *section)
>  {
>      return (!memory_region_is_ram(section->mr) &&
>              !memory_region_is_iommu(section->mr)) ||
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 5c65aa0..7a4fb6c 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -411,7 +411,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
>  }
>  
>  static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> -                                  int vector_n, bool msix)
> +                                  int vector_n, bool msix, int eventfd)
>  {
>      int virq;
>  
> @@ -419,7 +419,9 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>          return;
>      }
>  
> -    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
> +    if (eventfd >= 0) {
> +        event_notifier_init_fd(&vector->kvm_interrupt, eventfd);
> +    } else if (event_notifier_init(&vector->kvm_interrupt, 0)) {
>          return;
>      }

This seems very obfuscated.  The "active" arg of event_notifier_init()
just seems to preload the eventfd with a signal.  What does that have
to do with an eventfd arg to this function?  What if the first branch
returns failure?
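
To make the "adopt an inherited eventfd or create a new one" pattern
concrete, here is a standalone sketch using plain eventfd(2) rather than
QEMU's EventNotifier.  The struct and function names are hypothetical.
It shows one possible shape in which both branches report failure the
same way; it is not a proposal for the actual patch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical stand-in for an event notifier: just the descriptor. */
struct demo_notifier {
    int fd;
};

/*
 * If inherited_fd >= 0, adopt it (e.g. a descriptor preserved across exec);
 * otherwise create a fresh eventfd with an initial count of 0.
 * Returns 0 on success or -errno on failure, for either branch.
 */
int demo_notifier_init(struct demo_notifier *n, int inherited_fd)
{
    if (inherited_fd >= 0) {
        if (fcntl(inherited_fd, F_GETFD) < 0) {
            return -errno;               /* stale or closed descriptor */
        }
        n->fd = inherited_fd;
        return 0;
    }
    n->fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    return n->fd < 0 ? -errno : 0;
}
```

Adopting a descriptor keeps whatever count is already pending on it, so
a signal that arrived before the handover is not lost.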

>  
> @@ -455,6 +457,22 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>      kvm_irqchip_commit_routes(kvm_state);
>  }
>  
> +static void vfio_vector_init(VFIOPCIDevice *vdev, int nr, int eventfd)
> +{
> +    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    vector->vdev = vdev;
> +    vector->virq = -1;
> +    if (eventfd >= 0) {
> +        event_notifier_init_fd(&vector->interrupt, eventfd);
> +    } else if (event_notifier_init(&vector->interrupt, 0)) {
> +        error_report("vfio: Error: event_notifier_init failed");
> +    }

Gak, here's that same pattern.

> +    vector->use = true;
> +    msix_vector_use(pdev, nr);
> +}
> +
>  static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>                                     MSIMessage *msg, IOHandler *handler)
>  {
> @@ -466,14 +484,10 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>  
>      vector = &vdev->msi_vectors[nr];
>  
> +    vfio_vector_init(vdev, nr, -1);
> +
>      if (!vector->use) {
> -        vector->vdev = vdev;
> -        vector->virq = -1;
> -        if (event_notifier_init(&vector->interrupt, 0)) {
> -            error_report("vfio: Error: event_notifier_init failed");
> -        }
> -        vector->use = true;
> -        msix_vector_use(pdev, nr);
> +        vfio_vector_init(vdev, nr, -1);
>      }

Huh?  That's not at all "no functional change".  Also the branch is
entirely dead code now.
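
For comparison, the pre-patch code initialized a vector only on first
use.  A toy model of that guard, with hypothetical names and none of the
real vfio state, looks like this; a refactor that keeps the guard and
drops the unconditional call would preserve that behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, much reduced stand-in for VFIOMSIVector. */
struct demo_vector {
    bool use;
    int virq;
    int init_calls;      /* counts how often the helper ran */
};

/* Plays the role of the extracted helper: one-time setup of a vector. */
void demo_vector_init(struct demo_vector *v)
{
    v->virq = -1;
    v->use = true;
    v->init_calls++;
}

/* The helper runs only for a vector that is not yet in use. */
void demo_vector_do_use(struct demo_vector *v)
{
    if (!v->use) {
        demo_vector_init(v);
    }
}
```

With the unconditional call in front of the guard, the helper would run
on every invocation and reset virq each time, which is the functional
change pointed out above.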

>  
>      qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
> @@ -491,7 +505,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>          }
>      } else {
>          if (msg) {
> -            vfio_add_kvm_msi_virq(vdev, vector, nr, true);
> +            vfio_add_kvm_msi_virq(vdev, vector, nr, true, -1);
>          }
>      }
>  
> @@ -641,7 +655,7 @@ retry:
>           * Attempt to enable route through KVM irqchip,
>           * default to userspace handling if unavailable.
>           */
> -        vfio_add_kvm_msi_virq(vdev, vector, i, false);
> +        vfio_add_kvm_msi_virq(vdev, vector, i, false, -1);
>      }

And then we're not really passing an eventfd anyway :-\  I'm so
confused...

Thanks,
Alex

>  
>      /* Set interrupt type prior to possible interrupts */
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 6141162..00acb85 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -204,6 +204,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>  extern VFIOGroupList vfio_group_list;
> +typedef QLIST_HEAD(, VFIOAddressSpace) VFIOAddressSpaceList;
> +extern VFIOAddressSpaceList vfio_address_spaces;
>  
>  bool vfio_mig_active(void);
>  int64_t vfio_mig_bytes_transferred(void);
> @@ -222,6 +224,7 @@ struct vfio_info_cap_header *
>  vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
> +bool vfio_listener_skipped_section(MemoryRegionSection *section);
>  
>  int vfio_spapr_create_window(VFIOContainer *container,
>                               MemoryRegionSection *section,



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [reboot]
  2021-05-18 20:14               ` Steven Sistare
@ 2021-05-20 13:00                 ` Dr. David Alan Gilbert
  2021-05-21 14:55                   ` Steven Sistare
  2021-05-20 13:13                 ` [PATCH V3 00/22] Live Update [restart] Dr. David Alan Gilbert
  1 sibling, 1 reply; 81+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-20 13:00 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Hi Steven,
  I'd like to split the discussion into reboot and restart,
so I can make sure I understand them individually.

So, reboot mode:
Can you explain which parts of this series are needed for reboot mode?
I've managed to do a kexec-based reboot with the current qemu, albeit
with no vfio devices.  My current understanding is that for doing
reboot with vfio we just need some way of getting migrate to send the
metadata associated with vfio devices if the guest is in S3.

Is there something I'm missing and which you have in this series?

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [restart]
  2021-05-18 20:14               ` Steven Sistare
  2021-05-20 13:00                 ` [PATCH V3 00/22] Live Update [reboot] Dr. David Alan Gilbert
@ 2021-05-20 13:13                 ` Dr. David Alan Gilbert
  2021-05-21 14:56                   ` Steven Sistare
  1 sibling, 1 reply; 81+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-20 13:13 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

On the 'restart' branch of questions; can you explain,
other than the passing of the fd's, why the outgoing side of
qemu's 'migrate exec:' doesn't work for you?

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 11/22] vfio-pci: refactor for cpr
  2021-05-19 22:38   ` Alex Williamson
@ 2021-05-21 13:33     ` Steven Sistare
  2021-05-21 21:07       ` Alex Williamson
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-05-21 13:33 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 5/19/2021 6:38 PM, Alex Williamson wrote:
> On Fri,  7 May 2021 05:25:09 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Export vfio_address_spaces and vfio_listener_skipped_section.
>> Add optional eventfd arg to vfio_add_kvm_msi_virq.
>> Refactor vector use into a helper vfio_vector_init.
>> All for use by cpr in a subsequent patch.  No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  hw/vfio/common.c              |  4 ++--
>>  hw/vfio/pci.c                 | 36 +++++++++++++++++++++++++-----------
>>  include/hw/vfio/vfio-common.h |  3 +++
>>  3 files changed, 30 insertions(+), 13 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index ae5654f..9220e64 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -42,7 +42,7 @@
>>  
>>  VFIOGroupList vfio_group_list =
>>      QLIST_HEAD_INITIALIZER(vfio_group_list);
>> -static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
>> +VFIOAddressSpaceList vfio_address_spaces =
>>      QLIST_HEAD_INITIALIZER(vfio_address_spaces);
>>  
>>  #ifdef CONFIG_KVM
>> @@ -534,7 +534,7 @@ static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
>>      return -1;
>>  }
>>  
>> -static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>> +bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>  {
>>      return (!memory_region_is_ram(section->mr) &&
>>              !memory_region_is_iommu(section->mr)) ||
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 5c65aa0..7a4fb6c 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -411,7 +411,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
>>  }
>>  
>>  static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>> -                                  int vector_n, bool msix)
>> +                                  int vector_n, bool msix, int eventfd)
>>  {
>>      int virq;
>>  
>> @@ -419,7 +419,9 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>>          return;
>>      }
>>  
>> -    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
>> +    if (eventfd >= 0) {
>> +        event_notifier_init_fd(&vector->kvm_interrupt, eventfd);
>> +    } else if (event_notifier_init(&vector->kvm_interrupt, 0)) {
>>          return;
>>      }
> 
> This seems very obfuscated.  The "active" arg of event_notifier_init()
> just seems to preload the eventfd with a signal.  What does that have
> to do with an eventfd arg to this function?  What if the first branch
> returns failure?

Perhaps you mis-read the code?  The function called in the first branch is different than
the function called in the second branch.  And event_notifier_init_fd is void and never fails.

Eschew obfuscation.

Gesundheit.

>> @@ -455,6 +457,22 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>>      kvm_irqchip_commit_routes(kvm_state);
>>  }
>>  
>> +static void vfio_vector_init(VFIOPCIDevice *vdev, int nr, int eventfd)
>> +{
>> +    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    vector->vdev = vdev;
>> +    vector->virq = -1;
>> +    if (eventfd >= 0) {
>> +        event_notifier_init_fd(&vector->interrupt, eventfd);
>> +    } else if (event_notifier_init(&vector->interrupt, 0)) {
>> +        error_report("vfio: Error: event_notifier_init failed");
>> +    }
> 
> Gak, here's that same pattern.
> 
>> +    vector->use = true;
>> +    msix_vector_use(pdev, nr);
>> +}
>> +
>>  static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>                                     MSIMessage *msg, IOHandler *handler)
>>  {
>> @@ -466,14 +484,10 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>  
>>      vector = &vdev->msi_vectors[nr];
>>  
>> +    vfio_vector_init(vdev, nr, -1);
>> +
>>      if (!vector->use) {
>> -        vector->vdev = vdev;
>> -        vector->virq = -1;
>> -        if (event_notifier_init(&vector->interrupt, 0)) {
>> -            error_report("vfio: Error: event_notifier_init failed");
>> -        }
>> -        vector->use = true;
>> -        msix_vector_use(pdev, nr);
>> +        vfio_vector_init(vdev, nr, -1);
>>      }
> 
> Huh?  That's not at all "no functional change".  Also the branch is
> entirely dead code now.

Good catch, thank you.  This is a rebase error.  The unconditional call to vfio_vector_init
should not be there.  With that fix, we have:

    if (!vector->use) {
        vfio_vector_init(vdev, nr, -1);
    }

and there is no functional change; the actions performed in vfio_vector_init are identical to 
those deleted here.
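
To make the difference concrete, here is a minimal model of the two variants.
It is only a sketch: Vector, vector_init, do_use_as_posted and do_use_fixed are
hypothetical stand-ins, not the vfio code, and the counter exists purely to
make the extra initialization visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a vector: only the "use" flag and a call counter. */
typedef struct {
    bool use;
    int init_calls;
} Vector;

/* Stand-in for vfio_vector_init(): counts calls and marks the vector used. */
static void vector_init(Vector *v)
{
    assert(v != NULL);
    v->init_calls++;
    v->use = true;
}

/* As posted: the stray unconditional call runs on every invocation. */
static void do_use_as_posted(Vector *v)
{
    vector_init(v);
    if (!v->use) {
        /* Never reached: vector_init() has just set v->use. */
        vector_init(v);
    }
}

/* With the rebase error removed: init happens once, on first use. */
static void do_use_fixed(Vector *v)
{
    if (!v->use) {
        vector_init(v);
    }
}
```

Calling each variant twice on a zeroed Vector initializes it twice in the
as-posted form and once in the fixed form, which is the functional change
pointed out above.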

>>      qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
>> @@ -491,7 +505,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>          }
>>      } else {
>>          if (msg) {
>> -            vfio_add_kvm_msi_virq(vdev, vector, nr, true);
>> +            vfio_add_kvm_msi_virq(vdev, vector, nr, true, -1);
>>          }
>>      }
>>  
>> @@ -641,7 +655,7 @@ retry:
>>           * Attempt to enable route through KVM irqchip,
>>           * default to userspace handling if unavailable.
>>           */
>> -        vfio_add_kvm_msi_virq(vdev, vector, i, false);
>> +        vfio_add_kvm_msi_virq(vdev, vector, i, false, -1);
>>      }
> 
> And then we're not really passing an eventfd anyway :-\  I'm so
> confused...

This patch just adds the eventfd arg.  The next few patches pass valid eventfd's from the
cpr code paths.

- Steve

>>      /* Set interrupt type prior to possible interrupts */
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 6141162..00acb85 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -204,6 +204,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>>  extern const MemoryRegionOps vfio_region_ops;
>>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>>  extern VFIOGroupList vfio_group_list;
>> +typedef QLIST_HEAD(, VFIOAddressSpace) VFIOAddressSpaceList;
>> +extern VFIOAddressSpaceList vfio_address_spaces;
>>  
>>  bool vfio_mig_active(void);
>>  int64_t vfio_mig_bytes_transferred(void);
>> @@ -222,6 +224,7 @@ struct vfio_info_cap_header *
>>  vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>>  #endif
>>  extern const MemoryListener vfio_prereg_listener;
>> +bool vfio_listener_skipped_section(MemoryRegionSection *section);
>>  
>>  int vfio_spapr_create_window(VFIOContainer *container,
>>                               MemoryRegionSection *section,
> 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [reboot]
  2021-05-20 13:00                 ` [PATCH V3 00/22] Live Update [reboot] Dr. David Alan Gilbert
@ 2021-05-21 14:55                   ` Steven Sistare
  2021-06-15 19:14                     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-05-21 14:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
> Hi Steven,
>   I'd like to split the discussion into reboot and restart,
> so I can make sure I understand them individually.
> 
> So, reboot mode:
> Can you explain which parts of this series are needed for reboot mode?
> I've managed to do a kexec-based reboot with the current qemu, albeit
> with no vfio devices.  My current understanding is that for doing
> reboot with vfio we just need some way of getting migrate to send the
> metadata associated with vfio devices if the guest is in S3.
> 
> Is there something I'm missing and which you have in this series?

You are correct, this series has little special code for reboot mode, but it does allow
reboot and restart to be handled similarly, which simplifies the management layer because 
the same calls are performed for each mode. 

For vfio in reboot mode, prior to sending cprload, the manager sends the guest-suspend-ram
command to the qemu guest agent. This flushes requests and brings the guest device to a 
reset state, so there is no vfio metadata to save.  Reboot mode does not call vfio_cprsave.

There are a few unique patches to support reboot mode.  One is qemu_ram_volatile, which
is a sanity check that the writable ram blocks are backed by some form of shared memory.
Plus there are a few fragments in the "cpr" patch that handle the suspended state that
is induced by guest-suspend-ram.  See qemu_system_start_on_wake_request() and instances
of RUN_STATE_SUSPENDED in migration/cpr.c.

- Steve


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [restart]
  2021-05-20 13:13                 ` [PATCH V3 00/22] Live Update [restart] Dr. David Alan Gilbert
@ 2021-05-21 14:56                   ` Steven Sistare
  2021-05-24 10:39                     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-05-21 14:56 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> On the 'restart' branch of questions; can you explain,
> other than the passing of the fd's, why the outgoing side of
> qemu's 'migrate exec:' doesn't work for you?

I'm not sure what I should describe.  Can you be more specific?
Do you mean: can we add the cpr specific bits to the migrate exec code?

- Steve


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 11/22] vfio-pci: refactor for cpr
  2021-05-21 13:33     ` Steven Sistare
@ 2021-05-21 21:07       ` Alex Williamson
  2021-05-21 21:18         ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2021-05-21 21:07 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Fri, 21 May 2021 09:33:13 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 5/19/2021 6:38 PM, Alex Williamson wrote:
> > On Fri,  7 May 2021 05:25:09 -0700
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> >   
> >> Export vfio_address_spaces and vfio_listener_skipped_section.
> >> Add optional eventfd arg to vfio_add_kvm_msi_virq.
> >> Refactor vector use into a helper vfio_vector_init.
> >> All for use by cpr in a subsequent patch.  No functional change.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  hw/vfio/common.c              |  4 ++--
> >>  hw/vfio/pci.c                 | 36 +++++++++++++++++++++++++-----------
> >>  include/hw/vfio/vfio-common.h |  3 +++
> >>  3 files changed, 30 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index ae5654f..9220e64 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -42,7 +42,7 @@
> >>  
> >>  VFIOGroupList vfio_group_list =
> >>      QLIST_HEAD_INITIALIZER(vfio_group_list);
> >> -static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
> >> +VFIOAddressSpaceList vfio_address_spaces =
> >>      QLIST_HEAD_INITIALIZER(vfio_address_spaces);
> >>  
> >>  #ifdef CONFIG_KVM
> >> @@ -534,7 +534,7 @@ static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
> >>      return -1;
> >>  }
> >>  
> >> -static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >> +bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >>  {
> >>      return (!memory_region_is_ram(section->mr) &&
> >>              !memory_region_is_iommu(section->mr)) ||
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index 5c65aa0..7a4fb6c 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -411,7 +411,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
> >>  }
> >>  
> >>  static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> >> -                                  int vector_n, bool msix)
> >> +                                  int vector_n, bool msix, int eventfd)
> >>  {
> >>      int virq;
> >>  
> >> @@ -419,7 +419,9 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
> >>          return;
> >>      }
> >>  
> >> -    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
> >> +    if (eventfd >= 0) {
> >> +        event_notifier_init_fd(&vector->kvm_interrupt, eventfd);
> >> +    } else if (event_notifier_init(&vector->kvm_interrupt, 0)) {
> >>          return;
> >>      }  
> > 
> > This seems very obfuscated.  The "active" arg of event_notifier_init()
> > just seems to preload the eventfd with a signal.  What does that have
> > to do with an eventfd arg to this function?  What if the first branch
> > returns failure?  
> 
> Perhaps you mis-read the code?  The function called in the first branch is different than
> the function called in the second branch.  And event_notifier_init_fd is void and never fails.
> 
> Eschew obfuscation.
> 
> Gesundheit.

D'oh!  I looked at that so many times trying to figure out what I was
missing and still didn't spot the "_fd" on the first function.  The
fact that @active is an int used as a bool in the non-fd version didn't
help.  Maybe we need our own wrapper just to spread the code out a
bit...

/* Create new or reuse existing eventfd */
static int vfio_event_notifier_init(EventNotifier *e, int fd)
{
    if (fd < 0) {
        return event_notifier_init(e, 0);
    }

    event_notifier_init_fd(e, fd);
    return 0;
}

Or I should just use bigger fonts, but that's somehow more apparent to
me and can be reused below.
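
For reference, the create-or-reuse idea can be tried outside of qemu with a
plain Linux eventfd.  This is a sketch under assumptions, not the qemu
implementation: Notifier, notifier_init, notifier_init_fd, notifier_cleanup and
notifier_init_or_reuse are hypothetical stand-ins that only model the
descriptor, and the eventfd flags are chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical stand-in for an event notifier; only the fd is modeled. */
typedef struct {
    int fd;
} Notifier;

/* Stand-in for the "create" path: makes a new eventfd and may fail. */
static int notifier_init(Notifier *e, int active)
{
    int fd;

    assert(e != NULL);
    fd = eventfd(active ? 1 : 0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (fd < 0) {
        return -1;
    }
    e->fd = fd;
    return 0;
}

/* Stand-in for the "_fd" path: adopts an existing descriptor, cannot fail. */
static void notifier_init_fd(Notifier *e, int fd)
{
    assert(e != NULL);
    e->fd = fd;
}

/* Release the descriptor held by the notifier. */
static void notifier_cleanup(Notifier *e)
{
    assert(e != NULL);
    close(e->fd);
    e->fd = -1;
}

/* Create a new eventfd when fd < 0, otherwise reuse the one passed in. */
static int notifier_init_or_reuse(Notifier *e, int fd)
{
    if (fd < 0) {
        return notifier_init(e, 0);
    }

    notifier_init_fd(e, fd);
    return 0;
}
```

Passing -1 takes the create path, and passing a live descriptor takes the reuse
path, so a caller only ever checks one return value.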

> >> @@ -455,6 +457,22 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
> >>      kvm_irqchip_commit_routes(kvm_state);
> >>  }
> >>  
> >> +static void vfio_vector_init(VFIOPCIDevice *vdev, int nr, int eventfd)
> >> +{
> >> +    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +
> >> +    vector->vdev = vdev;
> >> +    vector->virq = -1;
> >> +    if (eventfd >= 0) {
> >> +        event_notifier_init_fd(&vector->interrupt, eventfd);
> >> +    } else if (event_notifier_init(&vector->interrupt, 0)) {
> >> +        error_report("vfio: Error: event_notifier_init failed");
> >> +    }  
> > 
> > Gak, here's that same pattern.
> >   
> >> +    vector->use = true;
> >> +    msix_vector_use(pdev, nr);
> >> +}
> >> +
> >>  static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
> >>                                     MSIMessage *msg, IOHandler *handler)
> >>  {
> >> @@ -466,14 +484,10 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
> >>  
> >>      vector = &vdev->msi_vectors[nr];
> >>  
> >> +    vfio_vector_init(vdev, nr, -1);
> >> +
> >>      if (!vector->use) {
> >> -        vector->vdev = vdev;
> >> -        vector->virq = -1;
> >> -        if (event_notifier_init(&vector->interrupt, 0)) {
> >> -            error_report("vfio: Error: event_notifier_init failed");
> >> -        }
> >> -        vector->use = true;
> >> -        msix_vector_use(pdev, nr);
> >> +        vfio_vector_init(vdev, nr, -1);
> >>      }  
> > 
> > Huh?  That's not at all "no functional change".  Also the branch is
> > entirely dead code now.  
> 
> Good catch, thank you.  This is a rebase error.  The unconditional call to vfio_vector_init
> should not be there.  With that fix, we have:
> 
>     if (!vector->use) {
>         vfio_vector_init(vdev, nr, -1);
>     }
> 
> and there is no functional change; the actions performed in vfio_vector_init are identical to 
> those deleted here.

Yup.

> >>      qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
> >> @@ -491,7 +505,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
> >>          }
> >>      } else {
> >>          if (msg) {
> >> -            vfio_add_kvm_msi_virq(vdev, vector, nr, true);
> >> +            vfio_add_kvm_msi_virq(vdev, vector, nr, true, -1);
> >>          }
> >>      }
> >>  
> >> @@ -641,7 +655,7 @@ retry:
> >>           * Attempt to enable route through KVM irqchip,
> >>           * default to userspace handling if unavailable.
> >>           */
> >> -        vfio_add_kvm_msi_virq(vdev, vector, i, false);
> >> +        vfio_add_kvm_msi_virq(vdev, vector, i, false, -1);
> >>      }  
> > 
> > And then we're not really passing an eventfd anyway :-\  I'm so
> > confused...  
> 
> This patch just adds the eventfd arg.  The next few patches pass valid eventfd's from the
> cpr code paths.

Yeah, I couldn't put the pieces together though after repeatedly
misreading eventfd being used as a bool in event_notifier_init(), even
though -1 here should have clued me in too.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 11/22] vfio-pci: refactor for cpr
  2021-05-21 21:07       ` Alex Williamson
@ 2021-05-21 21:18         ` Steven Sistare
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Sistare @ 2021-05-21 21:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 5/21/2021 5:07 PM, Alex Williamson wrote:
> On Fri, 21 May 2021 09:33:13 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 5/19/2021 6:38 PM, Alex Williamson wrote:
>>> On Fri,  7 May 2021 05:25:09 -0700
>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>   
>>>> Export vfio_address_spaces and vfio_listener_skipped_section.
>>>> Add optional eventfd arg to vfio_add_kvm_msi_virq.
>>>> Refactor vector use into a helper vfio_vector_init.
>>>> All for use by cpr in a subsequent patch.  No functional change.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>  hw/vfio/common.c              |  4 ++--
>>>>  hw/vfio/pci.c                 | 36 +++++++++++++++++++++++++-----------
>>>>  include/hw/vfio/vfio-common.h |  3 +++
>>>>  3 files changed, 30 insertions(+), 13 deletions(-)
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index ae5654f..9220e64 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -42,7 +42,7 @@
>>>>  
>>>>  VFIOGroupList vfio_group_list =
>>>>      QLIST_HEAD_INITIALIZER(vfio_group_list);
>>>> -static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
>>>> +VFIOAddressSpaceList vfio_address_spaces =
>>>>      QLIST_HEAD_INITIALIZER(vfio_address_spaces);
>>>>  
>>>>  #ifdef CONFIG_KVM
>>>> @@ -534,7 +534,7 @@ static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
>>>>      return -1;
>>>>  }
>>>>  
>>>> -static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>> +bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>  {
>>>>      return (!memory_region_is_ram(section->mr) &&
>>>>              !memory_region_is_iommu(section->mr)) ||
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 5c65aa0..7a4fb6c 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -411,7 +411,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
>>>>  }
>>>>  
>>>>  static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>>>> -                                  int vector_n, bool msix)
>>>> +                                  int vector_n, bool msix, int eventfd)
>>>>  {
>>>>      int virq;
>>>>  
>>>> @@ -419,7 +419,9 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
>>>>          return;
>>>>      }
>>>>  
>>>> -    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
>>>> +    if (eventfd >= 0) {
>>>> +        event_notifier_init_fd(&vector->kvm_interrupt, eventfd);
>>>> +    } else if (event_notifier_init(&vector->kvm_interrupt, 0)) {
>>>>          return;
>>>>      }  
>>>
>>> This seems very obfuscated.  The "active" arg of event_notifier_init()
>>> just seems to preload the eventfd with a signal.  What does that have
>>> to do with an eventfd arg to this function?  What if the first branch
>>> returns failure?  
>>
>> Perhaps you mis-read the code?  The function called in the first branch is different than
>> the function called in the second branch.  And event_notifier_init_fd is void and never fails.
>>
>> Eschew obfuscation.
>>
>> Gesundheit.
> 
> D'oh!  I looked at that so many times trying to figure out what I was
> missing and still didn't spot the "_fd" on the first function.  The
> fact that @active is an int used as a bool in the non-fd version didn't
> help.  Maybe we need our own wrapper just to spread the code out a
> bit...
> 
> /* Create new or reuse existing eventfd */
> static int vfio_event_notifier_init(EventNotifier *e, int fd)
> {
>     if (fd < 0) {
>         return event_notifier_init(e, 0);
>     }
> 
>     event_notifier_init_fd(e, fd);
>     return 0;
> }

Will do, for both here and below - Steve

> Or I should just use bigger fonts, but that's somehow more apparent to
> me and can be reused below.
> 
>>>> @@ -455,6 +457,22 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
>>>>      kvm_irqchip_commit_routes(kvm_state);
>>>>  }
>>>>  
>>>> +static void vfio_vector_init(VFIOPCIDevice *vdev, int nr, int eventfd)
>>>> +{
>>>> +    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>> +
>>>> +    vector->vdev = vdev;
>>>> +    vector->virq = -1;
>>>> +    if (eventfd >= 0) {
>>>> +        event_notifier_init_fd(&vector->interrupt, eventfd);
>>>> +    } else if (event_notifier_init(&vector->interrupt, 0)) {
>>>> +        error_report("vfio: Error: event_notifier_init failed");
>>>> +    }  
>>>
>>> Gak, here's that same pattern.
>>>   
>>>> +    vector->use = true;
>>>> +    msix_vector_use(pdev, nr);
>>>> +}
>>>> +
>>>>  static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>>>                                     MSIMessage *msg, IOHandler *handler)
>>>>  {
>>>> @@ -466,14 +484,10 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>>>  
>>>>      vector = &vdev->msi_vectors[nr];
>>>>  
>>>> +    vfio_vector_init(vdev, nr, -1);
>>>> +
>>>>      if (!vector->use) {
>>>> -        vector->vdev = vdev;
>>>> -        vector->virq = -1;
>>>> -        if (event_notifier_init(&vector->interrupt, 0)) {
>>>> -            error_report("vfio: Error: event_notifier_init failed");
>>>> -        }
>>>> -        vector->use = true;
>>>> -        msix_vector_use(pdev, nr);
>>>> +        vfio_vector_init(vdev, nr, -1);
>>>>      }  
>>>
>>> Huh?  That's not at all "no functional change".  Also the branch is
>>> entirely dead code now.  
>>
>> Good catch, thank you.  This is a rebase error.  The unconditional call to vfio_vector_init
>> should not be there.  With that fix, we have:
>>
>>     if (!vector->use) {
>>         vfio_vector_init(vdev, nr, -1);
>>     }
>>
>> and there is no functional change; the actions performed in vfio_vector_init are identical to 
>> those deleted here.
> 
> Yup.
> 
>>>>      qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
>>>> @@ -491,7 +505,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>>>          }
>>>>      } else {
>>>>          if (msg) {
>>>> -            vfio_add_kvm_msi_virq(vdev, vector, nr, true);
>>>> +            vfio_add_kvm_msi_virq(vdev, vector, nr, true, -1);
>>>>          }
>>>>      }
>>>>  
>>>> @@ -641,7 +655,7 @@ retry:
>>>>           * Attempt to enable route through KVM irqchip,
>>>>           * default to userspace handling if unavailable.
>>>>           */
>>>> -        vfio_add_kvm_msi_virq(vdev, vector, i, false);
>>>> +        vfio_add_kvm_msi_virq(vdev, vector, i, false, -1);
>>>>      }  
>>>
>>> And then we're not really passing an eventfd anyway :-\  I'm so
>>> confused...  
>>
>> This patch just adds the eventfd arg.  The next few patches pass valid eventfd's from the
>> cpr code paths.
> 
> Yeah, I couldn't put the pieces together though after repeatedly
> misreading eventfd being used as a bool in event_notifier_init(), even
> though -1 here should have clued me in too.  Thanks,
> 
> Alex
> 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 12/22] vfio-pci: cpr part 1
  2021-05-07 12:25 ` [PATCH V3 12/22] vfio-pci: cpr part 1 Steve Sistare
@ 2021-05-21 22:24   ` Alex Williamson
  2021-05-24 18:29     ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2021-05-21 22:24 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Fri,  7 May 2021 05:25:10 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:

> Enable vfio-pci devices to be saved and restored across an exec restart
> of qemu.
> 
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in the environment.
> 
> In cprsave, suspend the use of virtual addresses in DMA mappings with
> VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped at a
> different VA after exec.  DMA to already-mapped pages continues.  Save
> the msi message area as part of vfio-pci vmstate, save the interrupt and
> notifier eventfd's in the environment, and clear the close-on-exec flag
> for the vfio descriptors.  The flag is not cleared earlier because the
> descriptors should not persist across miscellaneous fork and exec calls
> that may be performed during normal operation.
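
The descriptor handling described in the two paragraphs above can be sketched
with ordinary POSIX calls.  This is an illustration only: save_fd_env,
find_fd_env and clear_cloexec are hypothetical names, not the helpers this
series adds, and the env var encoding (a decimal fd number) is an assumption.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: record a descriptor number under an env var name. */
static int save_fd_env(const char *name, int fd)
{
    char val[16];

    assert(name != NULL && fd >= 0);
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(name, val, 1);
}

/* Hypothetical helper: return the saved descriptor, or -1 if there is none. */
static int find_fd_env(const char *name)
{
    const char *val = getenv(name);
    char *end;
    long fd;

    if (val == NULL || *val == '\0') {
        return -1;
    }
    fd = strtol(val, &end, 10);
    if (*end != '\0' || fd < 0) {
        return -1;
    }
    return (int)fd;
}

/* Clear FD_CLOEXEC so the descriptor stays open across a later exec. */
static int clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}
```

Keeping the flag set until just before the exec, as the commit message
describes, means any unrelated fork and exec done earlier does not inherit the
descriptors.
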
> 
> On qemu restart, vfio_realize() finds the descriptor env vars, uses
> the descriptors, and notes that the device is being reused.  Device and
> iommu state is already configured, so operations in vfio_realize that
> would modify the configuration are skipped for a reused device, including
> vfio ioctl's and writes to PCI configuration space.  The result is that
> vfio_realize constructs qemu data structures that reflect the current
> state of the device.  However, the reconstruction is not complete until
> cprload is called. cprload loads the msi data and finds eventfds in the
> environment.  It rebuilds vector data structures and attaches the
> interrupts to the new KVM instance.  cprload then walks the flattened
> ranges of the vfio_address_spaces and calls VFIO_DMA_MAP_FLAG_VADDR to
> inform the kernel of the new VA's.  Lastly, it starts the VM and suppresses
> vfio device reset.
> 
> This functionality is delivered by 2 patches for clarity.  Part 2 adds
> eventfd and vector support.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/pci/msi.c                  |   4 ++
>  hw/pci/pci.c                  |   4 ++
>  hw/vfio/common.c              |  59 ++++++++++++++++++-
>  hw/vfio/cpr.c                 | 131 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 |  65 +++++++++++++++++++--
>  hw/vfio/trace-events          |   1 +
>  include/hw/pci/pci.h          |   1 +
>  include/hw/vfio/vfio-common.h |   5 ++
>  linux-headers/linux/vfio.h    |  27 +++++++++
>  migration/cpr.c               |   7 +++
>  11 files changed, 298 insertions(+), 7 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
> 
> diff --git a/hw/pci/msi.c b/hw/pci/msi.c
> index 47d2b0f..39de6a7 100644
> --- a/hw/pci/msi.c
> +++ b/hw/pci/msi.c
> @@ -225,6 +225,10 @@ int msi_init(struct PCIDevice *dev, uint8_t offset,
>      dev->msi_cap = config_offset;
>      dev->cap_present |= QEMU_PCI_CAP_MSI;
>  
> +    if (dev->reused) {
> +        return 0;
> +    }
> +
>      pci_set_word(dev->config + msi_flags_off(dev), flags);
>      pci_set_word(dev->wmask + msi_flags_off(dev),
>                   PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE);
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index e08d981..27019ca 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -308,6 +308,10 @@ static void pci_do_device_reset(PCIDevice *dev)
>  {
>      int r;
>  
> +    if (dev->reused) {
> +        return;
> +    }
> +
>      pci_device_deassert_intx(dev);
>      assert(dev->irq_state == 0);
>  
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 9220e64..00d07b2 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -31,6 +31,7 @@
>  #include "exec/memory.h"
>  #include "exec/ram_addr.h"
>  #include "hw/hw.h"
> +#include "qemu/env.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/range.h"
> @@ -440,6 +441,10 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>      }
>  
> +    if (container->reused) {
> +        return 0;
> +    }
> +
>      while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>          /*
>           * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
> @@ -463,6 +468,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          return -errno;
>      }
>  
> +    if (unmap.size != size) {
> +        warn_report("VFIO_UNMAP_DMA(0x%lx, 0x%lx) only unmaps 0x%llx",
> +                     iova, size, unmap.size);
> +    }
> +
>      return 0;
>  }
>  
> @@ -477,6 +487,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          .size = size,
>      };
>  
> +    if (container->reused) {
> +        return 0;
> +    }
> +
>      if (!readonly) {
>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>      }
> @@ -1603,6 +1617,10 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>      if (iommu_type < 0) {
>          return iommu_type;
>      }
> +    if (container->reused) {
> +        container->iommu_type = iommu_type;
> +        return 0;
> +    }
>  
>      ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
>      if (ret) {
> @@ -1703,6 +1721,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  {
>      VFIOContainer *container;
>      int ret, fd;
> +    bool reused;
> +    char name[40];
>      VFIOAddressSpace *space;
>  
>      space = vfio_get_address_space(as);
> @@ -1739,16 +1759,29 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>          return ret;
>      }
>  
> +    snprintf(name, sizeof(name), "vfio_container_%d", group->groupid);

For more clarity, maybe "vfio_container_for_group_%d"?

> +    fd = getenv_fd(name);
> +    reused = (fd >= 0);
> +
>      QLIST_FOREACH(container, &space->containers, next) {
> +        if (fd >= 0 && container->fd == fd) {

Test @reused rather than @fd?  I'm not sure the first half of this test
is even needed though, <0 should never match container->fd, right?

> +            group->container = container;
> +            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +            return 0;
> +        }

This looks unnecessarily sensitive to the order of containers in the
list, if the fd doesn't match above we try to set a new container below?
It seems like you only want to create a new container object if none of
the existing ones match.

There's also a lot of duplication that seems like it could be combined:

if (container->fd == fd || (!reused && !ioctl(...))) {
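
To make the combined test concrete, here is a minimal, untested sketch of
the lookup as a standalone model.  It is not the QEMU code: struct
container, try_attach() and find_container() are hypothetical stand-ins,
and the accepts_group flag only models whether VFIO_GROUP_SET_CONTAINER
would succeed for that container.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for VFIOContainer; not the QEMU type. */
struct container {
    int fd;
    bool accepts_group;   /* models VFIO_GROUP_SET_CONTAINER succeeding */
    struct container *next;
};

/* Models the ioctl: returns 0 on success, -1 on failure. */
static int try_attach(const struct container *c)
{
    return c->accepts_group ? 0 : -1;
}

/*
 * Return the existing container the group should join, or NULL if the
 * caller has to create a new one.  inherited_fd < 0 means no descriptor
 * was inherited across exec.
 */
static struct container *find_container(struct container *head,
                                        int inherited_fd)
{
    bool reused = inherited_fd >= 0;
    struct container *c;

    for (c = head; c; c = c->next) {
        /*
         * A valid container fd is never negative, so the first half
         * cannot match when nothing was inherited.  In the reuse case
         * the attach is never attempted.
         */
        if (c->fd == inherited_fd || (!reused && try_attach(c) == 0)) {
            return c;
        }
    }
    return NULL;
}
```

With this shape the whole list is scanned for the inherited fd before a
new container object is considered, so the result no longer depends on
list order.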

>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>              group->container = container;
>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>              vfio_kvm_device_add_group(group);

Why is this kvm device setup missing in the reuse case?


if (!reused) {
> +            setenv_fd(name, container->fd);
}

>              return 0;
>          }
>      }
>  
> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> +    if (fd < 0) {

if (!reused)?

> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> +    }
> +
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>          ret = -errno;
> @@ -1766,6 +1799,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      container = g_malloc0(sizeof(*container));
>      container->space = space;
>      container->fd = fd;
> +    container->reused = reused;
>      container->error = NULL;
>      container->dirty_pages_supported = false;
>      QLIST_INIT(&container->giommu_list);
> @@ -1893,6 +1927,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      container->initialized = true;
> +    setenv_fd(name, fd);

Maybe we don't need the test around the previous setenv_fd if we can
overwrite existing env values, which would seem to be the case for a
restart here.

>  
>      return 0;
>  listener_release_exit:
> @@ -1920,6 +1955,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  
>      QLIST_REMOVE(group, container_next);
>      group->container = NULL;
> +    unsetenv_fdv("vfio_container_%d", group->groupid);
>  
>      /*
>       * Explicitly release the listener first before unset container,
> @@ -1978,7 +2014,12 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      group = g_malloc0(sizeof(*group));
>  
>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open_old(path, O_RDWR);
> +
> +    group->fd = getenv_fd(path);
> +    if (group->fd < 0) {
> +        group->fd = qemu_open_old(path, O_RDWR);
> +    }
> +
>      if (group->fd < 0) {
>          error_setg_errno(errp, errno, "failed to open %s", path);
>          goto free_group_exit;
> @@ -2012,6 +2053,8 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  
>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>  
> +    setenv_fd(path, group->fd);
> +
>      return group;
>  
>  close_fd_exit:
> @@ -2036,6 +2079,7 @@ void vfio_put_group(VFIOGroup *group)
>      vfio_disconnect_container(group);
>      QLIST_REMOVE(group, next);
>      trace_vfio_put_group(group->fd);
> +    unsetenv_fdv("/dev/vfio/%d", group->groupid);
>      close(group->fd);
>      g_free(group);
>  
> @@ -2049,8 +2093,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>  {
>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>      int ret, fd;
> +    bool reused;
> +
> +    fd = getenv_fd(name);
> +    reused = (fd >= 0);
> +    if (fd < 0) {

if (!reused) ?

> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    }
>  
> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "error getting device from group %d",
>                           group->groupid);
> @@ -2095,6 +2145,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>      vbasedev->num_irqs = dev_info.num_irqs;
>      vbasedev->num_regions = dev_info.num_regions;
>      vbasedev->flags = dev_info.flags;
> +    vbasedev->reused = reused;
> +    setenv_fd(name, fd);
>  
>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>                            dev_info.num_irqs);
> @@ -2111,6 +2163,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>      QLIST_REMOVE(vbasedev, next);
>      vbasedev->group = NULL;
>      trace_vfio_put_base_device(vbasedev->fd);
> +    unsetenv_fd(vbasedev->name);
>      close(vbasedev->fd);
>  }
>  
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> new file mode 100644
> index 0000000..c5ad9f2
> --- /dev/null
> +++ b/hw/vfio/cpr.c
> @@ -0,0 +1,131 @@
> +/*
> + * Copyright (c) 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +#include "hw/vfio/vfio-common.h"
> +#include "sysemu/kvm.h"
> +#include "qapi/error.h"
> +#include "trace.h"
> +
> +static int
> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> +{
> +    struct vfio_iommu_type1_dma_unmap unmap = {
> +        .argsz = sizeof(unmap),
> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
> +        .iova = 0,
> +        .size = 0,
> +    };
> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> +        return -errno;
> +    }
> +    return 0;
> +}
> +
> +static int vfio_dma_map_vaddr(VFIOContainer *container, hwaddr iova,
> +                              ram_addr_t size, void *vaddr,
> +                              Error **errp)
> +{
> +    struct vfio_iommu_type1_dma_map map = {
> +        .argsz = sizeof(map),
> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
> +        .vaddr = (__u64)(uintptr_t)vaddr,
> +        .iova = iova,
> +        .size = size,
> +    };
> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +        error_setg_errno(errp, errno,
> +                         "vfio_dma_map_vaddr(iova %lu, size %ld, va %p)",
> +                         iova, size, vaddr);
> +        return -errno;
> +    }
> +    return 0;
> +}
> +
> +static int
> +vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
> +{
> +    MemoryRegion *mr = section->mr;
> +    VFIOContainer *container = handle;
> +    const char *name = memory_region_name(mr);
> +    ram_addr_t size = int128_get64(section->size);
> +    hwaddr offset, iova, roundup;
> +    void *vaddr;
> +
> +    if (vfio_listener_skipped_section(section) || memory_region_is_iommu(mr)) {
> +        return 0;
> +    }
> +
> +    offset = section->offset_within_address_space;
> +    iova = TARGET_PAGE_ALIGN(offset);
> +    roundup = iova - offset;
> +    size = (size - roundup) & TARGET_PAGE_MASK;
> +    vaddr = memory_region_get_ram_ptr(mr) +
> +            section->offset_within_region + roundup;
> +
> +    trace_vfio_region_remap(name, container->fd, iova, iova + size - 1, vaddr);
> +    return vfio_dma_map_vaddr(container, iova, size, vaddr, errp);
> +}
> +
> +bool vfio_cpr_capable(VFIOContainer *container, Error **errp)
> +{
> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
> +                         "or VFIO_UNMAP_ALL");
> +        return false;
> +    } else {
> +        return true;
> +    }
> +}
> +
> +int vfio_cprsave(Error **errp)
> +{
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_cpr_capable(container, errp)) {
> +                return 1;
> +            }
> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
> +                return 1;
> +            }
> +        }
> +    }


Seems like you'd want to test that all containers are capable before
unmapping any vaddrs.  I also hope we'll find an unwind somewhere that
remaps vaddrs should any fail.
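
A rough sketch of the ordering meant here, as a standalone model rather
than a patch.  Everything in it is hypothetical: struct container only
records flags, unmap_vaddr() and remap_vaddr() stand in for the
VFIO_DMA_UNMAP_FLAG_VADDR and VFIO_DMA_MAP_FLAG_VADDR ioctls, and a real
unwind would have to re-walk the address space to supply each vaddr.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one container; not the QEMU VFIOContainer. */
struct container {
    bool capable;       /* stands in for vfio_cpr_capable() */
    bool unmap_fails;   /* forces the modelled unmap to fail */
    bool vaddr_valid;   /* true while the kernel holds a usable vaddr */
};

/* Models VFIO_DMA_UNMAP_FLAG_VADDR: 0 on success, -1 on failure. */
static int unmap_vaddr(struct container *c)
{
    if (c->unmap_fails) {
        return -1;
    }
    c->vaddr_valid = false;
    return 0;
}

/* Models re-registering the vaddr with VFIO_DMA_MAP_FLAG_VADDR. */
static void remap_vaddr(struct container *c)
{
    c->vaddr_valid = true;
}

static int save_all(struct container *c, size_t n)
{
    size_t i;

    /* Pass 1: refuse before touching anything if any container lacks support. */
    for (i = 0; i < n; i++) {
        if (!c[i].capable) {
            return -1;
        }
    }

    /* Pass 2: invalidate vaddrs, restoring the earlier ones on failure. */
    for (i = 0; i < n; i++) {
        if (unmap_vaddr(&c[i])) {
            while (i-- > 0) {
                remap_vaddr(&c[i]);
            }
            return -1;
        }
    }
    return 0;
}
```

Either every container ends up with its vaddrs invalidated, or none does
and the VM can keep running in the old process.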

> +    return 0;
> +}
> +
> +int vfio_cprload(Error **errp)
> +{
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_cpr_capable(container, errp)) {
> +                return 1;
> +            }
> +            container->reused = false;
> +            if (as_flat_walk(space->as, vfio_region_remap, container, errp)) {
> +                return 1;
> +            }
> +        }
> +    }

What state are we in if any of these fail?

> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vbasedev->reused = false;
> +        }
> +    }
> +    return 0;
> +}
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index da9af29..e247b2b 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>    'migration.c',
>  ))
>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> +  'cpr.c',
>    'display.c',
>    'pci-quirks.c',
>    'pci.c',
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 7a4fb6c..f7ac9f03 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -29,6 +29,8 @@
>  #include "hw/qdev-properties.h"
>  #include "hw/qdev-properties-system.h"
>  #include "migration/vmstate.h"
> +#include "migration/cpr.h"
> +#include "qemu/env.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/module.h"
> @@ -1612,6 +1614,14 @@ static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled)
>      }
>  }
>  
> +static void vfio_config_sync(VFIOPCIDevice *vdev, uint32_t offset, size_t len)
> +{
> +    if (pread(vdev->vbasedev.fd, vdev->pdev.config + offset, len,
> +          vdev->config_offset + offset) != len) {
> +        error_report("vfio_config_sync pread failed");
> +    }
> +}
> +
>  static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
>  {
>      VFIOBAR *bar = &vdev->bars[nr];
> @@ -1652,6 +1662,7 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>  static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>  {
>      VFIOBAR *bar = &vdev->bars[nr];
> +    PCIDevice *pdev = &vdev->pdev;
>      char *name;
>  
>      if (!bar->size) {
> @@ -1672,7 +1683,10 @@ static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>          }
>      }
>  
> -    pci_register_bar(&vdev->pdev, nr, bar->type, bar->mr);
> +    pci_register_bar(pdev, nr, bar->type, bar->mr);
> +    if (pdev->reused) {
> +        vfio_config_sync(vdev, pci_bar(pdev, nr), 8);

Assuming 64-bit BARs?  This might be the first case where we actually
rely on the kernel BAR values, IIRC we usually use QEMU's emulation.
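
For illustration only, the sync length could be derived from the BAR's
low flag bits instead of hard-coding 8.  This is a standalone sketch:
bar_sync_len() and the BAR_* names are made up here, with values that
mirror the standard PCI BAR encoding (bit 0 selects I/O space, bits 2:1
give the memory type, 10b meaning 64-bit).

```c
#include <stddef.h>
#include <stdint.h>

/* Local names for the standard PCI BAR flag bits. */
#define BAR_SPACE_IO       0x1u   /* bit 0: 1 = I/O space, 0 = memory */
#define BAR_MEM_TYPE_MASK  0x6u   /* bits 2:1: memory decoder type */
#define BAR_MEM_TYPE_64    0x4u   /* 10b: 64-bit, uses two BAR slots */

/* Number of config-space bytes that make up the BAR starting at bar_lo. */
static size_t bar_sync_len(uint32_t bar_lo)
{
    if (bar_lo & BAR_SPACE_IO) {
        return 4;   /* I/O BARs are always a single dword */
    }
    return (bar_lo & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64 ? 8 : 4;
}
```

A 32-bit BAR would then sync only its own dword rather than also
overwriting whatever follows it in config space.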

> +    }
>  }
>  
>  static void vfio_bars_register(VFIOPCIDevice *vdev)
> @@ -2884,6 +2898,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          vfio_put_group(group);
>          goto error;
>      }
> +    pdev->reused = vdev->vbasedev.reused;
>  
>      vfio_populate_device(vdev, &err);
>      if (err) {
> @@ -3046,9 +3061,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>                                               vfio_intx_routing_notifier);
>          vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
>          kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
> -        ret = vfio_intx_enable(vdev, errp);
> -        if (ret) {
> -            goto out_deregister;
> +        if (!pdev->reused) {
> +            ret = vfio_intx_enable(vdev, errp);
> +            if (ret) {
> +                goto out_deregister;
> +            }
>          }
>      }
>  
> @@ -3098,6 +3115,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn_quirk(vdev);
>  
> +    vfio_config_sync(vdev, pdev->msix_cap + PCI_MSIX_FLAGS, 2);
> +    if (pdev->reused) {
> +        pci_update_mappings(pdev);
> +    }
> +

Are the msix flag sync and mapping update related?  They seem
independent to me.  A blank line and comment would be helpful.  I
expect we'd need to call msix_enabled() somewhere for the msix flag
sync to be effective.

Is there an assumption here of msi-x only support or is it not needed
for msi or intx?

>      return;
>  
>  out_deregister:
> @@ -3153,6 +3175,10 @@ static void vfio_pci_reset(DeviceState *dev)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>  
> +    if (vdev->pdev.reused) {
> +        return;
> +    }
> +
>      trace_vfio_pci_reset(vdev->vbasedev.name);
>  
>      vfio_pci_pre_reset(vdev);
> @@ -3260,6 +3286,36 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +static int vfio_pci_post_load(void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +    bool enabled;
> +
> +    pdev->reused = false;
> +    enabled = pci_get_word(pdev->config + PCI_COMMAND) & PCI_COMMAND_MASTER;
> +    memory_region_set_enabled(&pdev->bus_master_enable_region, enabled);
> +
> +    return 0;
> +}
> +
> +static bool vfio_pci_needed(void *opaque)
> +{
> +    return cpr_active();
> +}
> +
> +static const VMStateDescription vfio_pci_vmstate = {
> +    .name = "vfio-pci",
> +    .unmigratable = 1,
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .post_load = vfio_pci_post_load,
> +    .needed = vfio_pci_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -3267,6 +3323,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  
>      dc->reset = vfio_pci_reset;
>      device_class_set_props(dc, vfio_pci_dev_properties);
> +    dc->vmsd = &vfio_pci_vmstate;
>      dc->desc = "VFIO-based PCI device assignment";
>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>      pdc->realize = vfio_realize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 079f53a..0f8b166 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  vfio_dma_unmap_overflow_workaround(void) ""
> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>  
>  # platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index bef3e49..add7f46 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -360,6 +360,7 @@ struct PCIDevice {
>      /* ID of standby device in net_failover pair */
>      char *failover_pair_id;
>      uint32_t acpi_index;
> +    bool reused;
>  };
>  
>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 00acb85..b46d850 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -85,6 +85,7 @@ typedef struct VFIOContainer {
>      Error *error;
>      bool initialized;
>      bool dirty_pages_supported;
> +    bool reused;
>      uint64_t dirty_pgsizes;
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
> @@ -124,6 +125,7 @@ typedef struct VFIODevice {
>      bool no_mmap;
>      bool ram_block_discard_allowed;
>      bool enable_migration;
> +    bool reused;
>      VFIODeviceOps *ops;
>      unsigned int num_irqs;
>      unsigned int num_regions;
> @@ -200,6 +202,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>  void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
> +int vfio_cprsave(Error **errp);
> +int vfio_cprload(Error **errp);
> +bool vfio_cpr_capable(VFIOContainer *container, Error **errp);
>  
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 609099e..bc3a66e 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -46,6 +46,12 @@
>   */
>  #define VFIO_NOIOMMU_IOMMU		8
>  
> +/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
> +#define VFIO_UNMAP_ALL                        9
> +
> +/* Supports VFIO DMA map and unmap with the VADDR flag */
> +#define VFIO_UPDATE_VADDR              10
> +
>  /*
>   * The IOCTL interface is designed for extensibility by embedding the
>   * structure length (argsz) and flags into structures passed between
> @@ -1074,12 +1080,22 @@ struct vfio_iommu_type1_info_dma_avail {
>   *
>   * Map process virtual addresses to IO virtual addresses using the
>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
> + *
> + * If flags & VFIO_DMA_MAP_FLAG_VADDR, record the new base vaddr for iova, and
> + * unblock translation of host virtual addresses in the iova range.  The vaddr
> + * must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR.  To
> + * maintain memory consistency within the user application, the updated vaddr
> + * must address the same memory object as originally mapped.  Failure to do so
> + * will result in user memory corruption and/or device misbehavior.  iova and
> + * size must match those in the original MAP_DMA call.  Protection is not
> + * changed, and the READ & WRITE flags must be 0.
>   */
>  struct vfio_iommu_type1_dma_map {
>  	__u32	argsz;
>  	__u32	flags;
>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
> +#define VFIO_DMA_MAP_FLAG_VADDR (1 << 2)
>  	__u64	vaddr;				/* Process virtual address */
>  	__u64	iova;				/* IO virtual address */
>  	__u64	size;				/* Size of mapping (bytes) */
> @@ -1102,6 +1118,7 @@ struct vfio_bitmap {
>   * field.  No guarantee is made to the user that arbitrary unmaps of iova
>   * or size different from those used in the original mapping call will
>   * succeed.
> + *
>   * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
>   * before unmapping IO virtual addresses. When this flag is set, the user must
>   * provide a struct vfio_bitmap in data[]. User must provide zero-allocated
> @@ -1111,11 +1128,21 @@ struct vfio_bitmap {
>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
>   * pages in the range of unmapped size is returned in the user-provided
>   * vfio_bitmap.data.
> + *
> + * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
> + * must be 0.  This cannot be combined with the get-dirty-bitmap flag.
> + *
> + * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
> + * virtual addresses in the iova range.  Tasks that attempt to translate an
> + * iova's vaddr will block.  DMA to already-mapped pages continues.  This
> + * cannot be combined with the get-dirty-bitmap flag.
>   */
>  struct vfio_iommu_type1_dma_unmap {
>  	__u32	argsz;
>  	__u32	flags;
>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> +#define VFIO_DMA_UNMAP_FLAG_ALL              (1 << 1)
> +#define VFIO_DMA_UNMAP_FLAG_VADDR            (1 << 2)
>  	__u64	iova;				/* IO virtual address */
>  	__u64	size;				/* Size of mapping (bytes) */
>  	__u8    data[];
> diff --git a/migration/cpr.c b/migration/cpr.c
> index e0da1cf..e9a189b 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -132,6 +132,9 @@ void cprsave(const char *file, CprMode mode, Error **errp)
>          shutdown_action = SHUTDOWN_ACTION_POWEROFF;
>          qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
>      } else if (restart) {
> +        if (vfio_cprsave(errp)) {
> +            goto err;
> +        }
>          walkenv(FD_PREFIX, preserve_fd, 0);
>          setenv("QEMU_START_FREEZE", "", 1);
>          qemu_system_exec_request();
> @@ -176,6 +179,10 @@ void cprload(const char *file, Error **errp)
>          return;
>      }
>  
> +    if (vfio_cprload(errp)) {
> +        return;
> +    }
> +
>      state = global_state_get_runstate();
>      if (state == RUN_STATE_RUNNING) {
>          vm_start();

I didn't find that unwind I was hoping for or anywhere that the msix
flags come into play.  Thanks,

Alex




* Re: [PATCH V3 13/22] vfio-pci: cpr part 2
  2021-05-07 12:25 ` [PATCH V3 13/22] vfio-pci: cpr part 2 Steve Sistare
@ 2021-05-21 22:24   ` Alex Williamson
  2021-05-24 18:31     ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2021-05-21 22:24 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Fri,  7 May 2021 05:25:11 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:

> Finish cpr for vfio-pci by preserving eventfd's and vector state.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/vfio/pci.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 108 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index f7ac9f03..e983db4 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2661,6 +2661,27 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
>      vfio_put_base_device(&vdev->vbasedev);
>  }
>  
> +static void setenv_event_fd(VFIOPCIDevice *vdev, int nr, const char *name,
> +                            EventNotifier *ev)
> +{
> +    char envname[256];
> +    int fd = event_notifier_get_fd(ev);
> +    const char *vfname = vdev->vbasedev.name;
> +
> +    if (fd >= 0) {
> +        snprintf(envname, sizeof(envname), "%s_%s_%d", vfname, name, nr);
> +        setenv_fd(envname, fd);
> +    }
> +}
> +
> +static int getenv_event_fd(VFIOPCIDevice *vdev, int nr, const char *name)
> +{
> +    char envname[256];
> +    const char *vfname = vdev->vbasedev.name;
> +    snprintf(envname, sizeof(envname), "%s_%s_%d", vfname, name, nr);
> +    return getenv_fd(envname);
> +}
> +
>  static void vfio_err_notifier_handler(void *opaque)
>  {
>      VFIOPCIDevice *vdev = opaque;
> @@ -2692,7 +2713,13 @@ static void vfio_err_notifier_handler(void *opaque)
>  static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>  {
>      Error *err = NULL;
> -    int32_t fd;
> +    int32_t fd = getenv_event_fd(vdev, 0, "err");

Arg order should match the actual env names, device name, interrupt
name, interrupt number.

> +
> +    if (fd >= 0) {
> +        event_notifier_init_fd(&vdev->err_notifier, fd);
> +        qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
> +        return;
> +    }
>  
>      if (!vdev->pci_aer) {
>          return;
> @@ -2753,7 +2780,14 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>      struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info),
>                                        .index = VFIO_PCI_REQ_IRQ_INDEX };
>      Error *err = NULL;
> -    int32_t fd;
> +    int32_t fd = getenv_event_fd(vdev, 0, "req");
> +
> +    if (fd >= 0) {
> +        event_notifier_init_fd(&vdev->req_notifier, fd);
> +        qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
> +        vdev->req_enabled = true;
> +        return;
> +    }
>  
>      if (!(vdev->features & VFIO_FEATURE_ENABLE_REQ)) {
>          return;
> @@ -3286,12 +3320,82 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +static int vfio_pci_pre_save(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    int i;
> +
> +    for (i = 0; i < vdev->nr_vectors; i++) {
> +        VFIOMSIVector *vector = &vdev->msi_vectors[i];
> +        if (vector->use) {
> +            setenv_event_fd(vdev, i, "interrupt", &vector->interrupt);
> +            if (vector->virq >= 0) {
> +                setenv_event_fd(vdev, i, "kvm_interrupt",
> +                                &vector->kvm_interrupt);
> +            }
> +        }
> +    }
> +    setenv_event_fd(vdev, 0, "err", &vdev->err_notifier);
> +    setenv_event_fd(vdev, 0, "req", &vdev->req_notifier);
> +    return 0;
> +}
> +
> +static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
> +{
> +    int i, fd;
> +    bool pending = false;
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    vdev->nr_vectors = nr_vectors;
> +    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
> +    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
> +
> +    for (i = 0; i < nr_vectors; i++) {
> +        VFIOMSIVector *vector = &vdev->msi_vectors[i];
> +
> +        fd = getenv_event_fd(vdev, i, "interrupt");
> +        if (fd >= 0) {
> +            vfio_vector_init(vdev, i, fd);
> +            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
> +        }
> +
> +        fd = getenv_event_fd(vdev, i, "kvm_interrupt");
> +        if (fd >= 0) {
> +            vfio_add_kvm_msi_virq(vdev, vector, i, msix, fd);
> +        }
> +
> +        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
> +            set_bit(i, vdev->msix->pending);
> +            pending = true;
> +        }
> +    }
> +
> +    if (msix) {
> +        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
> +    }
> +}
> +
>  static int vfio_pci_post_load(void *opaque, int version_id)
>  {
>      VFIOPCIDevice *vdev = opaque;
>      PCIDevice *pdev = &vdev->pdev;
> +    int nr_vectors;
>      bool enabled;
>  
> +    if (msix_enabled(pdev)) {
> +        nr_vectors = vdev->msix->entries;
> +        vfio_claim_vectors(vdev, nr_vectors, true);
> +        msix_init_vector_notifiers(pdev, vfio_msix_vector_use,
> +                                   vfio_msix_vector_release, NULL);
> +
> +    } else if (msi_enabled(pdev)) {
> +        nr_vectors = msi_nr_vectors_allocated(pdev);
> +        vfio_claim_vectors(vdev, nr_vectors, false);
> +
> +    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
> +        error_report("vfio_pci_post_load does not support INTX");
> +    }

Why?  Is post-load where we really want to find this out?  Thanks,

Alex

> +
>      pdev->reused = false;
>      enabled = pci_get_word(pdev->config + PCI_COMMAND) & PCI_COMMAND_MASTER;
>      memory_region_set_enabled(&pdev->bus_master_enable_region, enabled);
> @@ -3310,8 +3414,10 @@ static const VMStateDescription vfio_pci_vmstate = {
>      .version_id = 0,
>      .minimum_version_id = 0,
>      .post_load = vfio_pci_post_load,
> +    .pre_save = vfio_pci_pre_save,
>      .needed = vfio_pci_needed,
>      .fields = (VMStateField[]) {
> +        VMSTATE_MSIX(pdev, VFIOPCIDevice),
>          VMSTATE_END_OF_LIST()
>      }
>  };




* Re: [PATCH V3 00/22] Live Update [restart]
  2021-05-21 14:56                   ` Steven Sistare
@ 2021-05-24 10:39                     ` Dr. David Alan Gilbert
  2021-06-02 13:51                       ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-24 10:39 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> > On the 'restart' branch of questions; can you explain,
> > other than the passing of the fd's, why the outgoing side of
> > qemu's 'migrate exec:' doesn't work for you?
> 
> I'm not sure what I should describe.  Can you be more specific?
> Do you mean: can we add the cpr specific bits to the migrate exec code?

Yes; if possible I'd prefer to just keep the one exec mechanism.
It has an advantage of letting you specify the new command line; that
avoids the problems I'd pointed out with needing to change the command
line if a hotplug had happened.  It also means we only need one chunk of
exec code.

Dave

> - Steve
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V3 12/22] vfio-pci: cpr part 1
  2021-05-21 22:24   ` Alex Williamson
@ 2021-05-24 18:29     ` Steven Sistare
  2021-06-11 18:15       ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-05-24 18:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 5/21/2021 6:24 PM, Alex Williamson wrote:
> On Fri,  7 May 2021 05:25:10 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Enable vfio-pci devices to be saved and restored across an exec restart
>> of qemu.
>>
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in the environment.
>>
>> In cprsave, suspend the use of virtual addresses in DMA mappings with
>> VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped at a
>> different VA after exec.  DMA to already-mapped pages continues.  Save
>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>> notifier eventfd's in the environment, and clear the close-on-exec flag
>> for the vfio descriptors.  The flag is not cleared earlier because the
>> descriptors should not persist across miscellaneous fork and exec calls
>> that may be performed during normal operation.
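
To illustrate the close-on-exec step described above, here is a minimal
sketch, assuming the descriptors are ordinary fds whose FD_CLOEXEC flag is
cleared just before the exec.  The demo_* helpers are hypothetical and are
not taken from the patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: clear FD_CLOEXEC so that fd stays open across exec. */
static int demo_clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Hypothetical helper: 1 if FD_CLOEXEC is set, 0 if clear, -1 on error. */
static int demo_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags < 0 ? -1 : !!(flags & FD_CLOEXEC);
}
```

Leaving the flag set until save time is what keeps the descriptors from
leaking into unrelated helper processes during normal operation.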
>>
>> On qemu restart, vfio_realize() finds the descriptor env vars, uses
>> the descriptors, and notes that the device is being reused.  Device and
>> iommu state is already configured, so operations in vfio_realize that
>> would modify the configuration are skipped for a reused device, including
>> vfio ioctl's and writes to PCI configuration space.  The result is that
>> vfio_realize constructs qemu data structures that reflect the current
>> state of the device.  However, the reconstruction is not complete until
>> cprload is called. cprload loads the msi data and finds eventfds in the
>> environment.  It rebuilds vector data structures and attaches the
>> interrupts to the new KVM instance.  cprload then walks the flattened
>> ranges of the vfio_address_spaces and calls VFIO_DMA_MAP_FLAG_VADDR to
>> inform the kernel of the new VA's.  Lastly, it starts the VM and suppresses
>> vfio device reset.
>>
>> This functionality is delivered by 2 patches for clarity.  Part 2 adds
>> eventfd and vector support.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  hw/pci/msi.c                  |   4 ++
>>  hw/pci/pci.c                  |   4 ++
>>  hw/vfio/common.c              |  59 ++++++++++++++++++-
>>  hw/vfio/cpr.c                 | 131 ++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/meson.build           |   1 +
>>  hw/vfio/pci.c                 |  65 +++++++++++++++++++--
>>  hw/vfio/trace-events          |   1 +
>>  include/hw/pci/pci.h          |   1 +
>>  include/hw/vfio/vfio-common.h |   5 ++
>>  linux-headers/linux/vfio.h    |  27 +++++++++
>>  migration/cpr.c               |   7 +++
>>  11 files changed, 298 insertions(+), 7 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>
>> diff --git a/hw/pci/msi.c b/hw/pci/msi.c
>> index 47d2b0f..39de6a7 100644
>> --- a/hw/pci/msi.c
>> +++ b/hw/pci/msi.c
>> @@ -225,6 +225,10 @@ int msi_init(struct PCIDevice *dev, uint8_t offset,
>>      dev->msi_cap = config_offset;
>>      dev->cap_present |= QEMU_PCI_CAP_MSI;
>>  
>> +    if (dev->reused) {
>> +        return 0;
>> +    }
>> +
>>      pci_set_word(dev->config + msi_flags_off(dev), flags);
>>      pci_set_word(dev->wmask + msi_flags_off(dev),
>>                   PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE);
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index e08d981..27019ca 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -308,6 +308,10 @@ static void pci_do_device_reset(PCIDevice *dev)
>>  {
>>      int r;
>>  
>> +    if (dev->reused) {
>> +        return;
>> +    }
>> +
>>      pci_device_deassert_intx(dev);
>>      assert(dev->irq_state == 0);
>>  
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 9220e64..00d07b2 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -31,6 +31,7 @@
>>  #include "exec/memory.h"
>>  #include "exec/ram_addr.h"
>>  #include "hw/hw.h"
>> +#include "qemu/env.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/range.h"
>> @@ -440,6 +441,10 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>>      }
>>  
>> +    if (container->reused) {
>> +        return 0;
>> +    }
>> +
>>      while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>>          /*
>>           * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
>> @@ -463,6 +468,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>          return -errno;
>>      }
>>  
>> +    if (unmap.size != size) {
>> +        warn_report("VFIO_UNMAP_DMA(0x%lx, 0x%lx) only unmaps 0x%llx",
>> +                     iova, size, unmap.size);
>> +    }
>> +
>>      return 0;
>>  }
>>  
>> @@ -477,6 +487,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>          .size = size,
>>      };
>>  
>> +    if (container->reused) {
>> +        return 0;
>> +    }
>> +
>>      if (!readonly) {
>>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>>      }
>> @@ -1603,6 +1617,10 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>>      if (iommu_type < 0) {
>>          return iommu_type;
>>      }
>> +    if (container->reused) {
>> +        container->iommu_type = iommu_type;
>> +        return 0;
>> +    }
>>  
>>      ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
>>      if (ret) {
>> @@ -1703,6 +1721,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>  {
>>      VFIOContainer *container;
>>      int ret, fd;
>> +    bool reused;
>> +    char name[40];
>>      VFIOAddressSpace *space;
>>  
>>      space = vfio_get_address_space(as);
>> @@ -1739,16 +1759,29 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>          return ret;
>>      }
>>  
>> +    snprintf(name, sizeof(name), "vfio_container_%d", group->groupid);
> 
> For more clarity, maybe "vfio_container_for_group_%d"?

OK, that is clearer.

>> +    fd = getenv_fd(name);
>> +    reused = (fd >= 0);
>> +
>>      QLIST_FOREACH(container, &space->containers, next) {
>> +        if (fd >= 0 && container->fd == fd) {
> 
> Test @reused rather than @fd?  I'm not sure the first half of this test
> is even needed though, <0 should never match container->fd, right?

OK, I will drop the first test.

>> +            group->container = container;
>> +            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> +            return 0;
>> +        }
> 
> This looks unnecessarily sensitive to the order of containers in the
> list, if the fd doesn't match above we try to set a new container below?
> It seems like you only want to create a new container object if none of
> the existing ones match.
> 
> There's also a lot of duplication that seems like it could be combined
> 
> if (container->fd == fd || (!reused && !ioctl(...))) {

OK, I will rewrite to avoid depending on creation order, and de-dup:

    QLIST_FOREACH(container, &space->containers, next) {
        if (container->fd == fd ||
            !ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
            break;
        }
    }

    if (container) {
        group->container = container;
        QLIST_INSERT_HEAD(&container->group_list, group, container_next);
        if (!reused) {
            vfio_kvm_device_add_group(group);
            setenv_fd(name, container->fd);
        }
        return 0;
    }


>>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>>              group->container = container;
>>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>              vfio_kvm_device_add_group(group);
> 
> Why is this kvm device setup missing in the reuse case?

vfio_kvm_device_add_group() only issues ioctls, and those were already issued when the
device was initially created.

> if (!reused) {
>> +            setenv_fd(name, container->fd);
> }
> 
>>              return 0;
>>          }
>>      }
>>  
>> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>> +    if (fd < 0) {
> 
> if (!reused)?

OK.

>> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>> +    }
>> +
>>      if (fd < 0) {
>>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>>          ret = -errno;
>> @@ -1766,6 +1799,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      container = g_malloc0(sizeof(*container));
>>      container->space = space;
>>      container->fd = fd;
>> +    container->reused = reused;
>>      container->error = NULL;
>>      container->dirty_pages_supported = false;
>>      QLIST_INIT(&container->giommu_list);
>> @@ -1893,6 +1927,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      }
>>  
>>      container->initialized = true;
>> +    setenv_fd(name, fd);
> 
> Maybe we don't need the test around the previous setenv_fd if we can
> overwrite existing env values, which would seem to be the case for a
> restart here.

Yes, setenv_fd overwrites, so I omitted the test for clarity, at a cost of a few cycles.
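
For readers without the earlier patches at hand, the overwrite behavior can
be sketched as follows.  This is an assumption-laden stand-in, not the
series' implementation: demo_setenv_fd and demo_getenv_fd are hypothetical
names, and storing the fd as decimal text is assumed for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for setenv_fd(): store fd as decimal text. */
static int demo_setenv_fd(const char *name, int fd)
{
    char val[16];

    assert(name != NULL);
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(name, val, 1);    /* 1 = overwrite any existing value */
}

/* Hypothetical stand-in for getenv_fd(): the fd, or -1 if unset or invalid. */
static int demo_getenv_fd(const char *name)
{
    const char *val;
    char *end;
    long fd;

    assert(name != NULL);
    val = getenv(name);
    if (!val || !*val) {
        return -1;
    }
    fd = strtol(val, &end, 10);
    if (*end != '\0' || fd < 0 || fd > INT_MAX) {
        return -1;
    }
    return (int)fd;
}
```

Because setenv() is called with overwrite set, storing the same name again
after a restart simply replaces the old value.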

>>      return 0;
>>  listener_release_exit:
>> @@ -1920,6 +1955,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>  
>>      QLIST_REMOVE(group, container_next);
>>      group->container = NULL;
>> +    unsetenv_fdv("vfio_container_%d", group->groupid);
>>  
>>      /*
>>       * Explicitly release the listener first before unset container,
>> @@ -1978,7 +2014,12 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>      group = g_malloc0(sizeof(*group));
>>  
>>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>> -    group->fd = qemu_open_old(path, O_RDWR);
>> +
>> +    group->fd = getenv_fd(path);
>> +    if (group->fd < 0) {
>> +        group->fd = qemu_open_old(path, O_RDWR);
>> +    }
>> +
>>      if (group->fd < 0) {
>>          error_setg_errno(errp, errno, "failed to open %s", path);
>>          goto free_group_exit;
>> @@ -2012,6 +2053,8 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>  
>>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>>  
>> +    setenv_fd(path, group->fd);
>> +
>>      return group;
>>  
>>  close_fd_exit:
>> @@ -2036,6 +2079,7 @@ void vfio_put_group(VFIOGroup *group)
>>      vfio_disconnect_container(group);
>>      QLIST_REMOVE(group, next);
>>      trace_vfio_put_group(group->fd);
>> +    unsetenv_fdv("/dev/vfio/%d", group->groupid);
>>      close(group->fd);
>>      g_free(group);
>>  
>> @@ -2049,8 +2093,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>>  {
>>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>      int ret, fd;
>> +    bool reused;
>> +
>> +    fd = getenv_fd(name);
>> +    reused = (fd >= 0);
>> +    if (fd < 0) {
> 
> if (!reused) ?

OK.

>> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>> +    }
>>  
>> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>>      if (fd < 0) {
>>          error_setg_errno(errp, errno, "error getting device from group %d",
>>                           group->groupid);
>> @@ -2095,6 +2145,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>>      vbasedev->num_irqs = dev_info.num_irqs;
>>      vbasedev->num_regions = dev_info.num_regions;
>>      vbasedev->flags = dev_info.flags;
>> +    vbasedev->reused = reused;
>> +    setenv_fd(name, fd);
>>  
>>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>>                            dev_info.num_irqs);
>> @@ -2111,6 +2163,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>>      QLIST_REMOVE(vbasedev, next);
>>      vbasedev->group = NULL;
>>      trace_vfio_put_base_device(vbasedev->fd);
>> +    unsetenv_fd(vbasedev->name);
>>      close(vbasedev->fd);
>>  }
>>  
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> new file mode 100644
>> index 0000000..c5ad9f2
>> --- /dev/null
>> +++ b/hw/vfio/cpr.c
>> @@ -0,0 +1,131 @@
>> +/*
>> + * Copyright (c) 2021 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +#include "hw/vfio/vfio-common.h"
>> +#include "sysemu/kvm.h"
>> +#include "qapi/error.h"
>> +#include "trace.h"
>> +
>> +static int
>> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>> +{
>> +    struct vfio_iommu_type1_dma_unmap unmap = {
>> +        .argsz = sizeof(unmap),
>> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
>> +        .iova = 0,
>> +        .size = 0,
>> +    };
>> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
>> +        return -errno;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int vfio_dma_map_vaddr(VFIOContainer *container, hwaddr iova,
>> +                              ram_addr_t size, void *vaddr,
>> +                              Error **errp)
>> +{
>> +    struct vfio_iommu_type1_dma_map map = {
>> +        .argsz = sizeof(map),
>> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
>> +        .vaddr = (__u64)(uintptr_t)vaddr,
>> +        .iova = iova,
>> +        .size = size,
>> +    };
>> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
>> +        error_setg_errno(errp, errno,
>> +                         "vfio_dma_map_vaddr(iova %lu, size %ld, va %p)",
>> +                         iova, size, vaddr);
>> +        return -errno;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int
>> +vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
>> +{
>> +    MemoryRegion *mr = section->mr;
>> +    VFIOContainer *container = handle;
>> +    const char *name = memory_region_name(mr);
>> +    ram_addr_t size = int128_get64(section->size);
>> +    hwaddr offset, iova, roundup;
>> +    void *vaddr;
>> +
>> +    if (vfio_listener_skipped_section(section) || memory_region_is_iommu(mr)) {
>> +        return 0;
>> +    }
>> +
>> +    offset = section->offset_within_address_space;
>> +    iova = TARGET_PAGE_ALIGN(offset);
>> +    roundup = iova - offset;
>> +    size = (size - roundup) & TARGET_PAGE_MASK;
>> +    vaddr = memory_region_get_ram_ptr(mr) +
>> +            section->offset_within_region + roundup;
>> +
>> +    trace_vfio_region_remap(name, container->fd, iova, iova + size - 1, vaddr);
>> +    return vfio_dma_map_vaddr(container, iova, size, vaddr, errp);
>> +}
>> +
>> +bool vfio_cpr_capable(VFIOContainer *container, Error **errp)
>> +{
>> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
>> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
>> +                         "or VFIO_UNMAP_ALL");
>> +        return false;
>> +    } else {
>> +        return true;
>> +    }
>> +}
>> +
>> +int vfio_cprsave(Error **errp)
>> +{
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (!vfio_cpr_capable(container, errp)) {
>> +                return 1;
>> +            }
>> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
>> +                return 1;
>> +            }
>> +        }
>> +    }
> 
> Seems like you'd want to test that all containers are capable before
> unmapping any vaddrs.  I also hope we'll find an unwind somewhere that
> remaps vaddrs should any fail.

That is verified earlier if one runs qemu with the -only-cpr-capable option:

  vfio_get_iommu_type()
      for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
          if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
              if (only_cpr_capable && !vfio_cpr_capable(container, errp)) {
                  error_prepend(errp, "only-cpr-capable is specified: ");
                  return -EINVAL;

But I will check here as well in case only-cpr-capable was not specified.

I will add code to unwind and remap.
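
The shape I have in mind is roughly the following.  It is only a sketch
against a mock container type, so every demo_* name and field is
hypothetical; the real code would call vfio_cpr_capable() and the vaddr
unmap/map ioctls instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mock of a container; the real VFIOContainer is much richer. */
typedef struct DemoContainer {
    bool capable;       /* stands in for vfio_cpr_capable() */
    bool unmap_fails;   /* forces the mock unmap to fail */
    bool vaddr_valid;   /* true while the kernel holds a usable vaddr */
} DemoContainer;

static int demo_unmap_vaddr(DemoContainer *c)
{
    if (c->unmap_fails) {
        return -1;
    }
    c->vaddr_valid = false;
    return 0;
}

static void demo_remap_vaddr(DemoContainer *c)
{
    c->vaddr_valid = true;
}

static int demo_cprsave(DemoContainer *list, size_t n)
{
    size_t i;

    assert(list != NULL || n == 0);

    /* Pass 1: verify every container before changing any state. */
    for (i = 0; i < n; i++) {
        if (!list[i].capable) {
            return -1;
        }
    }

    /* Pass 2: suspend vaddrs; on failure, restore those already suspended. */
    for (i = 0; i < n; i++) {
        if (demo_unmap_vaddr(&list[i])) {
            while (i-- > 0) {
                demo_remap_vaddr(&list[i]);
            }
            return -1;
        }
    }
    return 0;
}
```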

>> +    return 0;
>> +}
>> +
>> +int vfio_cprload(Error **errp)
>> +{
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (!vfio_cpr_capable(container, errp)) {
>> +                return 1;
>> +            }
>> +            container->reused = false;
>> +            if (as_flat_walk(space->as, vfio_region_remap, container, errp)) {
>> +                return 1;
>> +            }
>> +        }
>> +    }
> 
> What state are we in if any of these fail?

The guest has not resumed, but we cannot recover.  Since we verified vfio_cpr_capable in 
cprsave, vfio_cprload should never fail, sans bugs.

>> +    QLIST_FOREACH(group, &vfio_group_list, next) {
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            vbasedev->reused = false;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index da9af29..e247b2b 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>>    'migration.c',
>>  ))
>>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>> +  'cpr.c',
>>    'display.c',
>>    'pci-quirks.c',
>>    'pci.c',
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 7a4fb6c..f7ac9f03 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -29,6 +29,8 @@
>>  #include "hw/qdev-properties.h"
>>  #include "hw/qdev-properties-system.h"
>>  #include "migration/vmstate.h"
>> +#include "migration/cpr.h"
>> +#include "qemu/env.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/module.h"
>> @@ -1612,6 +1614,14 @@ static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled)
>>      }
>>  }
>>  
>> +static void vfio_config_sync(VFIOPCIDevice *vdev, uint32_t offset, size_t len)
>> +{
>> +    if (pread(vdev->vbasedev.fd, vdev->pdev.config + offset, len,
>> +          vdev->config_offset + offset) != len) {
>> +        error_report("vfio_config_sync pread failed");
>> +    }
>> +}
>> +
>>  static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
>>  {
>>      VFIOBAR *bar = &vdev->bars[nr];
>> @@ -1652,6 +1662,7 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>>  static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>  {
>>      VFIOBAR *bar = &vdev->bars[nr];
>> +    PCIDevice *pdev = &vdev->pdev;
>>      char *name;
>>  
>>      if (!bar->size) {
>> @@ -1672,7 +1683,10 @@ static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>          }
>>      }
>>  
>> -    pci_register_bar(&vdev->pdev, nr, bar->type, bar->mr);
>> +    pci_register_bar(pdev, nr, bar->type, bar->mr);
>> +    if (pdev->reused) {
>> +        vfio_config_sync(vdev, pci_bar(pdev, nr), 8);
> 
> Assuming 64-bit BARs?  This might be the first case where we actually
> rely on the kernel BAR values, IIRC we usually use QEMU's emulation.

No assumptions.  vfio_config_sync() preads a piece of config space using a single
system call, copying directly to the qemu buffer, not looking at words or calling any
action functions.
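
As a standalone illustration of that pattern (a sketch only; demo_config_sync
is a hypothetical stand-in, and any seekable fd can play the role of the
device fd):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for vfio_config_sync(): copy len bytes found at
 * region_offset + offset in fd into config[offset..] with one pread().
 * Returns 0 on success, -1 on an error or short read.
 */
static int demo_config_sync(int fd, uint8_t *config, off_t region_offset,
                            uint32_t offset, size_t len)
{
    ssize_t n;

    assert(config != NULL);
    n = pread(fd, config + offset, len, region_offset + offset);
    if (n != (ssize_t)len) {
        fprintf(stderr, "demo_config_sync pread failed\n");
        return -1;
    }
    return 0;
}
```

The bytes are copied verbatim, so the width of the field being synced does
not matter to the helper.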

>> +    }
>>  }
>>  
>>  static void vfio_bars_register(VFIOPCIDevice *vdev)
>> @@ -2884,6 +2898,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>          vfio_put_group(group);
>>          goto error;
>>      }
>> +    pdev->reused = vdev->vbasedev.reused;
>>  
>>      vfio_populate_device(vdev, &err);
>>      if (err) {
>> @@ -3046,9 +3061,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>                                               vfio_intx_routing_notifier);
>>          vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
>>          kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
>> -        ret = vfio_intx_enable(vdev, errp);
>> -        if (ret) {
>> -            goto out_deregister;
>> +        if (!pdev->reused) {
>> +            ret = vfio_intx_enable(vdev, errp);
>> +            if (ret) {
>> +                goto out_deregister;
>> +            }
>>          }
>>      }
>>  
>> @@ -3098,6 +3115,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>      vfio_register_req_notifier(vdev);
>>      vfio_setup_resetfn_quirk(vdev);
>>  
>> +    vfio_config_sync(vdev, pdev->msix_cap + PCI_MSIX_FLAGS, 2);
>> +    if (pdev->reused) {
>> +        pci_update_mappings(pdev);
>> +    }
>> +
> 
> Are the msix flag sync and mapping update related?  They seem
> independent to me.  A blank line and comment would be helpful.

OK.

> I expect we'd need to call msix_enabled() somewhere for the msix flag
> sync to be effective.

Yes, vfio_pci_post_load in cpr part 2 calls msix_enabled.

> Is there an assumption here of msi-x only support or is it not needed
> for msi or intx?

The code supports msi-x and msi.  However, I should only be sync'ing PCI_MSIX_FLAGS
if pdev->cap_present & QEMU_PCI_CAP_MSIX.  And, I am missing a sync for PCI_MSI_FLAGS.
I'll fix that.

> [...]
> 
> I didn't find that unwind I was hoping for or anywhere that the msix
> flags come into play.  Thanks,

Let me know if I have not adequately answered those questions.

- Steve



* Re: [PATCH V3 13/22] vfio-pci: cpr part 2
  2021-05-21 22:24   ` Alex Williamson
@ 2021-05-24 18:31     ` Steven Sistare
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Sistare @ 2021-05-24 18:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 5/21/2021 6:24 PM, Alex Williamson wrote:
> On Fri,  7 May 2021 05:25:11 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Finish cpr for vfio-pci by preserving eventfd's and vector state.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  hw/vfio/pci.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 108 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index f7ac9f03..e983db4 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2661,6 +2661,27 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
>>      vfio_put_base_device(&vdev->vbasedev);
>>  }
>>  
>> +static void setenv_event_fd(VFIOPCIDevice *vdev, int nr, const char *name,
>> +                            EventNotifier *ev)
>> +{
>> +    char envname[256];
>> +    int fd = event_notifier_get_fd(ev);
>> +    const char *vfname = vdev->vbasedev.name;
>> +
>> +    if (fd >= 0) {
>> +        snprintf(envname, sizeof(envname), "%s_%s_%d", vfname, name, nr);
>> +        setenv_fd(envname, fd);
>> +    }
>> +}
>> +
>> +static int getenv_event_fd(VFIOPCIDevice *vdev, int nr, const char *name)
>> +{
>> +    char envname[256];
>> +    const char *vfname = vdev->vbasedev.name;
>> +    snprintf(envname, sizeof(envname), "%s_%s_%d", vfname, name, nr);
>> +    return getenv_fd(envname);
>> +}
>> +
>>  static void vfio_err_notifier_handler(void *opaque)
>>  {
>>      VFIOPCIDevice *vdev = opaque;
>> @@ -2692,7 +2713,13 @@ static void vfio_err_notifier_handler(void *opaque)
>>  static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>>  {
>>      Error *err = NULL;
>> -    int32_t fd;
>> +    int32_t fd = getenv_event_fd(vdev, 0, "err");
> 
> Arg order should match the actual env names, device name, interrupt
> name, interrupt number.

I am happy to swap the interrupt name and interrupt number arguments, here and in
setenv_event_fd().  However, I pass vdev so the getenv_event_fd() caller does not need to
generate the env var name, and the details of the name are confined to
{get,set}env_event_fd.  I could pass the vdev name instead, but IMO it would be uglier at
every call site, e.g.:

    fd = getenv_event_fd(vdev->vbasedev.name, "err", 0);
vs
    fd = getenv_event_fd(vdev, "err", 0);

I could rename the functions so they do not imply argument similarity with getenv_fd:
  getenv_event_fd --> load_event_fd
  setenv_event_fd --> save_event_fd
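
Either way, the name construction stays in one place.  A minimal sketch with
the arguments in the same order as the fields of the name; demo_event_fd_name
is hypothetical, and the '=' check is my own addition since environment
variable names cannot contain that character:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "<device>_<interrupt name>_<number>".
 * Returns 0 on success, -1 if the name does not fit in buf or would
 * contain '=', which is not allowed in an environment variable name.
 */
static int demo_event_fd_name(char *buf, size_t len, const char *devname,
                              const char *irqname, int nr)
{
    int n;

    assert(buf != NULL && devname != NULL && irqname != NULL);
    if (strchr(devname, '=') || strchr(irqname, '=')) {
        return -1;
    }
    n = snprintf(buf, len, "%s_%s_%d", devname, irqname, nr);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A device name such as "0000:3a:00.0" (a made-up example) with "err" and 0
yields "0000:3a:00.0_err_0".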

>> +    if (fd >= 0) {
>> +        event_notifier_init_fd(&vdev->err_notifier, fd);
>> +        qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
>> +        return;
>> +    }
>>  
>>      if (!vdev->pci_aer) {
>>          return;
>> @@ -2753,7 +2780,14 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>>      struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info),
>>                                        .index = VFIO_PCI_REQ_IRQ_INDEX };
>>      Error *err = NULL;
>> -    int32_t fd;
>> +    int32_t fd = getenv_event_fd(vdev, 0, "req");
>> +
>> +    if (fd >= 0) {
>> +        event_notifier_init_fd(&vdev->req_notifier, fd);
>> +        qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
>> +        vdev->req_enabled = true;
>> +        return;
>> +    }
>>  
>>      if (!(vdev->features & VFIO_FEATURE_ENABLE_REQ)) {
>>          return;
>> @@ -3286,12 +3320,82 @@ static Property vfio_pci_dev_properties[] = {
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> +static int vfio_pci_pre_save(void *opaque)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    int i;
>> +
>> +    for (i = 0; i < vdev->nr_vectors; i++) {
>> +        VFIOMSIVector *vector = &vdev->msi_vectors[i];
>> +        if (vector->use) {
>> +            setenv_event_fd(vdev, i, "interrupt", &vector->interrupt);
>> +            if (vector->virq >= 0) {
>> +                setenv_event_fd(vdev, i, "kvm_interrupt",
>> +                                &vector->kvm_interrupt);
>> +            }
>> +        }
>> +    }
>> +    setenv_event_fd(vdev, 0, "err", &vdev->err_notifier);
>> +    setenv_event_fd(vdev, 0, "req", &vdev->req_notifier);
>> +    return 0;
>> +}
>> +
>> +static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
>> +{
>> +    int i, fd;
>> +    bool pending = false;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    vdev->nr_vectors = nr_vectors;
>> +    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
>> +    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
>> +
>> +    for (i = 0; i < nr_vectors; i++) {
>> +        VFIOMSIVector *vector = &vdev->msi_vectors[i];
>> +
>> +        fd = getenv_event_fd(vdev, i, "interrupt");
>> +        if (fd >= 0) {
>> +            vfio_vector_init(vdev, i, fd);
>> +            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
>> +        }
>> +
>> +        fd = getenv_event_fd(vdev, i, "kvm_interrupt");
>> +        if (fd >= 0) {
>> +            vfio_add_kvm_msi_virq(vdev, vector, i, msix, fd);
>> +        }
>> +
>> +        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
>> +            set_bit(i, vdev->msix->pending);
>> +            pending = true;
>> +        }
>> +    }
>> +
>> +    if (msix) {
>> +        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
>> +    }
>> +}
>> +
>>  static int vfio_pci_post_load(void *opaque, int version_id)
>>  {
>>      VFIOPCIDevice *vdev = opaque;
>>      PCIDevice *pdev = &vdev->pdev;
>> +    int nr_vectors;
>>      bool enabled;
>>  
>> +    if (msix_enabled(pdev)) {
>> +        nr_vectors = vdev->msix->entries;
>> +        vfio_claim_vectors(vdev, nr_vectors, true);
>> +        msix_init_vector_notifiers(pdev, vfio_msix_vector_use,
>> +                                   vfio_msix_vector_release, NULL);
>> +
>> +    } else if (msi_enabled(pdev)) {
>> +        nr_vectors = msi_nr_vectors_allocated(pdev);
>> +        vfio_claim_vectors(vdev, nr_vectors, false);
>> +
>> +    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
>> +        error_report("vfio_pci_post_load does not support INTX");
>> +    }
> 
> Why?  Is post-load where we really want to find this out?  Thanks,

This is also checked at VM creation time if only-cpr-capable is specified.
I could also check it in vfio_pci_pre_save.

- Steve

>> +
>>      pdev->reused = false;
>>      enabled = pci_get_word(pdev->config + PCI_COMMAND) & PCI_COMMAND_MASTER;
>>      memory_region_set_enabled(&pdev->bus_master_enable_region, enabled);
>> @@ -3310,8 +3414,10 @@ static const VMStateDescription vfio_pci_vmstate = {
>>      .version_id = 0,
>>      .minimum_version_id = 0,
>>      .post_load = vfio_pci_post_load,
>> +    .pre_save = vfio_pci_pre_save,
>>      .needed = vfio_pci_needed,
>>      .fields = (VMStateField[]) {
>> +        VMSTATE_MSIX(pdev, VFIOPCIDevice),
>>          VMSTATE_END_OF_LIST()
>>      }
>>  };
> 



* Re: [PATCH V3 00/22] Live Update [restart]
  2021-05-24 10:39                     ` Dr. David Alan Gilbert
@ 2021-06-02 13:51                       ` Steven Sistare
  2021-06-03 19:36                         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-06-02 13:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>> On the 'restart' branch of questions; can you explain,
>>> other than the passing of the fd's, why the outgoing side of
>>> qemu's 'migrate exec:' doesn't work for you?
>>
>> I'm not sure what I should describe.  Can you be more specific?
>> Do you mean: can we add the cpr specific bits to the migrate exec code?
> 
> Yes; if possible I'd prefer to just keep the one exec mechanism.
> It has an advantage of letting you specify the new command line; that
> avoids the problems I'd pointed out with needing to change the command
> line if a hotplug had happened.  It also means we only need one chunk of
> exec code.

How/where would you specify a new command line?  Are you picturing the usual migration
setup where you start a second qemu with its own arguments, plus a migrate_incoming
option or command?  That does not work for live update restart; the old qemu must exec
the new qemu.  Or something else?

We could shoehorn cpr restart into the migrate exec path by defining a new migration 
capability that the client would set before issuing the migrate command.  However, the
implementation code would be sprinkled with conditionals to suppress migrate-only bits
and call cpr-only bits.  IMO that would be less maintainable than having a separate
cprsave function.  Currently cprsave does not duplicate any migration functionality.
cprsave calls qemu_save_device_state() which is used by xen.
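
For reference, the exec step itself is small.  Per the cover letter, cprsave
execs the qemu binary, or /usr/bin/qemu-exec if it exists, with the same
argv.  A rough sketch, where the demo_* helpers are hypothetical and the use
of access() and execvp() is my assumption rather than the patch's code:

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: prefer wrapper if it is executable, else self. */
static const char *demo_pick_exec_path(const char *wrapper, const char *self)
{
    assert(self != NULL);
    return (wrapper && access(wrapper, X_OK) == 0) ? wrapper : self;
}

/* Hypothetical helper: re-exec with unchanged argv; returns only on failure. */
static int demo_reexec(const char *wrapper, char **argv)
{
    assert(argv != NULL && argv[0] != NULL);
    execvp(demo_pick_exec_path(wrapper, argv[0]), argv);
    return -1;
}
```

Nothing here takes a new command line, which is the point of contention
above.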

- Steve




* Re: [PATCH V3 00/22] Live Update
  2021-05-19 16:43 ` [PATCH V3 00/22] Live Update Steven Sistare
@ 2021-06-02 15:19   ` Steven Sistare
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Sistare @ 2021-06-02 15:19 UTC (permalink / raw)
  To: Michael S. Tsirkin, Marcel Apfelbaum
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

Hi Michael,
  Alex has reviewed the vfio-pci patches.  If you could give me a thumbs-up or a 
needs-work on "pci: export functions for cpr", I would appreciate it.  Thanks!

[PATCH V3 10/22] pci: export functions for cpr
https://lore.kernel.org/qemu-devel/1620390320-301716-11-git-send-email-steven.sistare@oracle.com

- Steve

On 5/19/2021 12:43 PM, Steven Sistare wrote:
> Hi Michael, Marcel,
>   I hope you have time to review the pci and vfio-pci related patches in this
> series.  They are an essential part of the live update functionality.  The
> first 2 patches are straightforward, just exposing functions for use in vfio.
> The last 2 patches are more substantial.
> 
>   - pci: export functions for cpr
>   - vfio-pci: refactor for cpr
>   - vfio-pci: cpr part 1
>   - vfio-pci: cpr part 2
> 
> - Steve
> 
> On 5/7/2021 8:24 AM, Steve Sistare wrote:
>> Provide the cprsave and cprload commands for live update.  These save and
>> restore VM state, with minimal guest pause time, so that qemu may be updated
>> to a new version in between.
>>
>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>> paused state and waits for the cprload command.
>>
>> To use the restart mode, qemu must be started with the memfd-alloc option,
>> which allocates guest ram using memfd_create.  The memfd's are saved to
>> the environment and kept open across exec, after which they are found from
>> the environment and re-mmap'd.  Hence guest ram is preserved in place,
>> albeit with new virtual addresses in the qemu process.  The caller resumes
>> the guest by calling cprload, which loads state from the file.  If the VM
>> was running at cprsave time, then VM execution resumes.  cprsave supports
>> any type of guest image and block device, but the caller must not modify
>> guest block devices between cprsave and cprload.
>>
>> The restart mode supports vfio devices by preserving the vfio container,
>> group, device, and event descriptors across the qemu re-exec, and by
>> updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
>> VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
>> and integrated in Linux kernel 5.12.
>>
>> For the reboot mode, cprsave saves state and exits qemu, and the caller is
>> allowed to update the host kernel and system software and reboot.  The
>> caller resumes the guest by running qemu with the same arguments as the
>> original process and calling cprload.  To use this mode, guest ram must be
>> mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
>> PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.
>>
>> The reboot mode supports vfio devices if the caller suspends the guest
>> instead of stopping the VM, such as by issuing guest-suspend-ram to the
>> qemu guest agent.  The guest drivers' suspend methods flush outstanding
>> requests and re-initialize the devices, and thus there is no device state
>> to save and restore.
>>
>> The first patches add helper functions:
>>
>>   - as_flat_walk
>>   - qemu_ram_volatile
>>   - oslib: qemu_clr_cloexec
>>   - util: env var helpers
>>   - machine: memfd-alloc option
>>   - vl: add helper to request re-exec
>>
>> The next patches implement cprsave and cprload:
>>
>>   - cpr
>>   - cpr: QMP interfaces
>>   - cpr: HMP interfaces
>>
>> The next patches add vfio support for the restart mode:
>>
>>   - pci: export functions for cpr
>>   - vfio-pci: refactor for cpr
>>   - vfio-pci: cpr part 1
>>   - vfio-pci: cpr part 2
>>
>> The next patches preserve various descriptor-based backend devices across
>> a cprsave restart:
>>
>>   - vhost: reset vhost devices upon cprsave
>>   - hostmem-memfd: cpr support
>>   - chardev: cpr framework
>>   - chardev: cpr for simple devices
>>   - chardev: cpr for pty
>>   - chardev: cpr for sockets
>>   - cpr: only-cpr-capable option
>>   - cpr: maintainers
>>   - simplify savevm
>>
>> Here is an example of updating qemu from v4.2.0 to v4.2.1 using
>> "cprsave restart".  The software update is performed while the guest is
>> running to minimize downtime.
>>
>> window 1				| window 2
>> 					|
>> # qemu-system-x86_64 ... 		|
>> QEMU 4.2.0 monitor - type 'help' ...	|
>> (qemu) info status			|
>> VM status: running			|
>> 					| # yum update qemu
>> (qemu) cprsave /tmp/qemu.sav restart	|
>> QEMU 4.2.1 monitor - type 'help' ...	|
>> (qemu) info status			|
>> VM status: paused (prelaunch)		|
>> (qemu) cprload /tmp/qemu.sav		|
>> (qemu) info status			|
>> VM status: running			|
>>
>>
>> Here is an example of updating the host kernel using "cprsave reboot"
>>
>> window 1					| window 2
>> 						|
>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...		|
>> (qemu) info status				|
>> VM status: running				|
>> 						| # yum update kernel-uek
>> (qemu) cprsave /tmp/qemu.sav reboot		|
>> 						|
>> # systemctl kexec				|
>> kexec_core: Starting new kernel			|
>> ...						|
>> 						|
>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...		|
>> (qemu) info status				|
>> VM status: paused (prelaunch)			|
>> (qemu) cprload /tmp/qemu.sav			|
>> (qemu) info status				|
>> VM status: running				|
>>
>> Changes from V1 to V2:
>>   - revert vmstate infrastructure changes
>>   - refactor cpr functions into new files
>>   - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to 
>>     preserve memory.
>>   - add framework to filter chardev's that support cpr
>>   - save and restore vfio eventfd's
>>   - modify cprinfo QMP interface
>>   - incorporate misc review feedback
>>   - remove unrelated and unneeded patches
>>   - refactor all patches into a shorter and easier to review series
>>
>> Changes from V2 to V3:
>>   - rebase to qemu 6.0.0
>>   - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
>>   - change memfd-alloc to a machine option
>>   - use existing channel socket function instead of defining new ones
>>   - close monitor socket during cpr
>>   - support memory-backend-memfd
>>   - fix a few unreported bugs
>>
>> Steve Sistare (18):
>>   as_flat_walk
>>   qemu_ram_volatile
>>   oslib: qemu_clr_cloexec
>>   util: env var helpers
>>   machine: memfd-alloc option
>>   vl: add helper to request re-exec
>>   cpr
>>   pci: export functions for cpr
>>   vfio-pci: refactor for cpr
>>   vfio-pci: cpr part 1
>>   vfio-pci: cpr part 2
>>   hostmem-memfd: cpr support
>>   chardev: cpr framework
>>   chardev: cpr for simple devices
>>   chardev: cpr for pty
>>   cpr: only-cpr-capable option
>>   cpr: maintainers
>>   simplify savevm
>>
>> Mark Kanda, Steve Sistare (4):
>>   cpr: QMP interfaces
>>   cpr: HMP interfaces
>>   vhost: reset vhost devices upon cprsave
>>   chardev: cpr for sockets
>>
>>  MAINTAINERS                   |  11 +++
>>  backends/hostmem-memfd.c      |  21 +++--
>>  chardev/char-mux.c            |   1 +
>>  chardev/char-null.c           |   1 +
>>  chardev/char-pty.c            |  15 ++-
>>  chardev/char-serial.c         |   1 +
>>  chardev/char-socket.c         |  35 +++++++
>>  chardev/char-stdio.c          |   8 ++
>>  chardev/char.c                |  41 +++++++-
>>  gdbstub.c                     |   1 +
>>  hmp-commands.hx               |  44 +++++++++
>>  hw/core/machine.c             |  19 ++++
>>  hw/pci/msi.c                  |   4 +
>>  hw/pci/msix.c                 |  20 ++--
>>  hw/pci/pci.c                  |   7 +-
>>  hw/vfio/common.c              |  68 +++++++++++++-
>>  hw/vfio/cpr.c                 | 131 ++++++++++++++++++++++++++
>>  hw/vfio/meson.build           |   1 +
>>  hw/vfio/pci.c                 | 214 ++++++++++++++++++++++++++++++++++++++----
>>  hw/vfio/trace-events          |   1 +
>>  hw/virtio/vhost.c             |  11 +++
>>  include/chardev/char.h        |   6 ++
>>  include/exec/memory.h         |  25 +++++
>>  include/hw/boards.h           |   1 +
>>  include/hw/pci/msix.h         |   5 +
>>  include/hw/pci/pci.h          |   2 +
>>  include/hw/vfio/vfio-common.h |   8 ++
>>  include/hw/virtio/vhost.h     |   1 +
>>  include/migration/cpr.h       |  17 ++++
>>  include/monitor/hmp.h         |   3 +
>>  include/qemu/env.h            |  23 +++++
>>  include/qemu/osdep.h          |   1 +
>>  include/sysemu/runstate.h     |   2 +
>>  include/sysemu/sysemu.h       |   2 +
>>  linux-headers/linux/vfio.h    |  27 ++++++
>>  migration/cpr.c               | 200 +++++++++++++++++++++++++++++++++++++++
>>  migration/meson.build         |   1 +
>>  migration/migration.c         |   5 +
>>  migration/savevm.c            |  21 ++---
>>  migration/savevm.h            |   2 +
>>  monitor/hmp-cmds.c            |  48 ++++++++++
>>  monitor/hmp.c                 |   3 +
>>  monitor/qmp-cmds.c            |  31 ++++++
>>  monitor/qmp.c                 |   3 +
>>  qapi/char.json                |   5 +-
>>  qapi/cpr.json                 |  76 +++++++++++++++
>>  qapi/meson.build              |   1 +
>>  qapi/qapi-schema.json         |   1 +
>>  qemu-options.hx               |  39 +++++++-
>>  softmmu/globals.c             |   2 +
>>  softmmu/memory.c              |  48 ++++++++++
>>  softmmu/physmem.c             |  49 ++++++++--
>>  softmmu/runstate.c            |  49 +++++++++-
>>  softmmu/vl.c                  |  21 ++++-
>>  stubs/cpr.c                   |   3 +
>>  stubs/meson.build             |   1 +
>>  trace-events                  |   1 +
>>  util/env.c                    |  99 +++++++++++++++++++
>>  util/meson.build              |   1 +
>>  util/oslib-posix.c            |   9 ++
>>  util/oslib-win32.c            |   4 +
>>  util/qemu-config.c            |   4 +
>>  62 files changed, 1431 insertions(+), 74 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>  create mode 100644 include/migration/cpr.h
>>  create mode 100644 include/qemu/env.h
>>  create mode 100644 migration/cpr.c
>>  create mode 100644 qapi/cpr.json
>>  create mode 100644 stubs/cpr.c
>>  create mode 100644 util/env.c
>>


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [restart]
  2021-06-02 13:51                       ` Steven Sistare
@ 2021-06-03 19:36                         ` Dr. David Alan Gilbert
  2021-06-03 20:44                           ` Daniel P. Berrangé
  2021-06-07 18:08                           ` [PATCH V3 00/22] Live Update [restart] : code replication Steven Sistare
  0 siblings, 2 replies; 81+ messages in thread
From: Dr. David Alan Gilbert @ 2021-06-03 19:36 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> > * Steven Sistare (steven.sistare@oracle.com) wrote:
> >> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> >>> On the 'restart' branch of questions; can you explain,
> >>> other than the passing of the fd's, why the outgoing side of
> >>> qemu's 'migrate exec:' doesn't work for you?
> >>
> >> I'm not sure what I should describe.  Can you be more specific?
> >> Do you mean: can we add the cpr specific bits to the migrate exec code?
> > 
> > Yes; if possible I'd prefer to just keep the one exec mechanism.
> > It has an advantage of letting you specify the new command line; that
> > avoids the problems I'd pointed out with needing to change the command
> > line if a hotplug had happened.  It also means we only need one chunk of
> > exec code.
> 
> How/where would you specify a new command line?  Are you picturing the usual migration
> setup where you start a second qemu with its own arguments, plus a migrate_incoming
> option or command?  That does not work for live update restart; the old qemu must exec
> the new qemu.  Or something else?

The existing migration path allows an exec - originally intended to exec
something like a compressor or a store to file rather than a real
migration; i.e. you can do:

  migrate "exec:gzip > mig"

and that will save the migration stream to a compressed file called mig.
Now, I *think* we can already do:

  migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
(That's probably cleaner via the QMP interface).

I'm not quite sure what I want in the incoming there, but that is
already the source execing the destination qemu - although I think we'd
probably need to check if that's actually via an intermediary.

> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
> capability that the client would set before issuing the migrate command.  However, the
> implementation code would be sprinkled with conditionals to suppress migrate-only bits
> and call cpr-only bits.  IMO that would be less maintainable than having a separate
> cprsave function.  Currently cprsave does not duplicate any migration functionality.
> cprsave calls qemu_save_device_state() which is used by xen.

To me it feels like cprsave in particular is replicating more code.

It's also jumping through hoops in places to avoid changing the
commandline;  that's going to cause more pain for a lot of people - not
just because it's hacks all over for that, but because a lot of people
are actually going to need to change the commandline even in a cpr like
case (e.g. due to hotplug or changing something else about the
environment, like auth data or route to storage or networking that
changed).

There are hooks for early parameter parsing, so if we need to add extra
commandline args we can; but for example the case of QEMU_START_FREEZE
to add -S just isn't needed as soon as you let go of the idea of needing
an identical commandline.

Dave

> - Steve
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [restart]
  2021-06-03 19:36                         ` Dr. David Alan Gilbert
@ 2021-06-03 20:44                           ` Daniel P. Berrangé
  2021-06-07 16:40                             ` [PATCH V3 00/22] Live Update [restart] : exec Steven Sistare
  2021-06-07 18:08                           ` [PATCH V3 00/22] Live Update [restart] : code replication Steven Sistare
  1 sibling, 1 reply; 81+ messages in thread
From: Daniel P. Berrangé @ 2021-06-03 20:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Jason Zeng, Michael S. Tsirkin, Philippe Mathieu-Daudé,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Steven Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Alex Bennée

On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
> > On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> > > * Steven Sistare (steven.sistare@oracle.com) wrote:
> > >> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> > >>> On the 'restart' branch of questions; can you explain,
> > >>> other than the passing of the fd's, why the outgoing side of
> > >>> qemu's 'migrate exec:' doesn't work for you?
> > >>
> > >> I'm not sure what I should describe.  Can you be more specific?
> > >> Do you mean: can we add the cpr specific bits to the migrate exec code?
> > > 
> > > Yes; if possible I'd prefer to just keep the one exec mechanism.
> > > It has an advantage of letting you specify the new command line; that
> > > avoids the problems I'd pointed out with needing to change the command
> > > line if a hotplug had happened.  It also means we only need one chunk of
> > > exec code.
> > 
> > How/where would you specify a new command line?  Are you picturing the usual migration
> > setup where you start a second qemu with its own arguments, plus a migrate_incoming
> > option or command?  That does not work for live update restart; the old qemu must exec
> > the new qemu.  Or something else?
> 
> The existing migration path allows an exec - originally intended to exec
> something like a compressor or a store to file rather than a real
> migration; i.e. you can do:
> 
>   migrate "exec:gzip > mig"
> 
> and that will save the migration stream to a compressed file called mig.
> Now, I *think* we can already do:
> 
>   migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
> (That's probably cleaner via the QMP interface).
> 
> I'm not quite sure what I want in the incoming there, but that is
> already the source execing the destination qemu - although I think we'd
> probably need to check if that's actually via an intermediary.

I don't think you can directly exec qemu in that way, because the
source QEMU migration code is going to wait for completion of the
QEMU you exec'd and that'll never come on success. So you'll end
up with both QEMU's running forever. If you pass the daemonize
option to the new QEMU then it will immediately detach itself,
and the source QEMU will think the migration command has finished
or failed.
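
To make the waiting behaviour concrete, here is a minimal sketch of
streaming data into an exec'd shell command and then waiting for it.
It is not QEMU's exec: transport code; popen()/pclose() stand in for it,
and stream_to_command() is a hypothetical name used only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/wait.h>

/*
 * Stream a buffer into a shell command's stdin, then wait for the command.
 * Returns the command's exit status, or -1 on failure.
 * Illustration only: popen() is a stand-in for the real transport.
 */
static int stream_to_command(const char *cmd, const void *buf, size_t len)
{
    assert(cmd != NULL && buf != NULL);

    FILE *pipe_out = popen(cmd, "w");   /* runs the command via /bin/sh -c */
    if (!pipe_out) {
        return -1;
    }

    size_t written = fwrite(buf, 1, len, pipe_out);

    /*
     * pclose() blocks until the child exits.  A child that never exits,
     * such as a long-running qemu, would keep the caller waiting here.
     */
    int status = pclose(pipe_out);

    if (written != len || status == -1 || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```

With a command such as "gzip > mig" the child exits once its stdin is
closed, so the wait returns promptly.  A directly exec'd qemu would not
exit, which is the problem described above.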

I think you can probably do it if you use a wrapper script though.
The wrapper would have to fork QEMU in the background, and then the
wrapper would have to monitor the new QEMU to see when the incoming
migration has finished/aborted, at which point the wrapper can
exit, so the source QEMU sees a successful cleanup of the exec'd
command. </hand waving>

> > We could shoehorn cpr restart into the migrate exec path by defining a new migration 
> > capability that the client would set before issuing the migrate command.  However, the
> > implementation code would be sprinkled with conditionals to suppress migrate-only bits
> > and call cpr-only bits.  IMO that would be less maintainable than having a separate
> > cprsave function.  Currently cprsave does not duplicate any migration functionality.
> > cprsave calls qemu_save_device_state() which is used by xen.
> 
> To me it feels like cprsave in particular is replicating more code.
> 
> It's also jumping through hoops in places to avoid changing the
> commandline;  that's going to cause more pain for a lot of people - not
> just because it's hacks all over for that, but because a lot of people
> are actually going to need to change the commandline even in a cpr like
> case (e.g. due to hotplug or changing something else about the
> environment, like auth data or route to storage or networking that
> changed).

Management apps that already support migration, will almost certainly
know how to start up a new QEMU with a different command line that
takes account of hotplugged/unplugged devices. IOW avoiding changing
the command line only really addresses the simple case, and the hard
case is likely already solved for purposes of handling regular live
migration.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 08/22] cpr: QMP interfaces
  2021-05-07 12:25 ` [PATCH V3 08/22] cpr: QMP interfaces Steve Sistare
@ 2021-06-04 13:59   ` Eric Blake
  2021-06-07 17:19     ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Eric Blake @ 2021-06-04 13:59 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On Fri, May 07, 2021 at 05:25:06AM -0700, Steve Sistare wrote:
> cprsave calls cprsave().  Syntax:
>   { 'enum': 'CprMode', 'data': [ 'reboot', 'restart' ] }
>   { 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'CprMode' } }
> 
> cprload calls cprload().  Syntax:
>   { 'command': 'cprload', 'data': { 'file': 'str' } }
> 
> cprinfo returns a list of supported modes.  Syntax:
>   { 'struct': 'CprInfo', 'data': { 'modes': [ 'CprMode' ] } }
>   { 'command': 'cprinfo', 'returns': 'CprInfo' }
> 
> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---

> +++ b/qapi/cpr.json

> +##
> +# @CprMode:
> +#
> +# @reboot: checkpoint can be cprload'ed after a host kexec reboot.
> +#
> +# @restart: checkpoint can be cprload'ed after restarting qemu.
> +#
> +# Since: 6.0

We've missed 6.0; this and all other since tags should mention 6.1.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [restart] : exec
  2021-06-03 20:44                           ` Daniel P. Berrangé
@ 2021-06-07 16:40                             ` Steven Sistare
  2021-06-14 14:31                               ` Steven Sistare
  2021-06-15 19:05                               ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 81+ messages in thread
From: Steven Sistare @ 2021-06-07 16:40 UTC (permalink / raw)
  To: Daniel P. Berrangé, Dr. David Alan Gilbert
  Cc: Jason Zeng, Michael S. Tsirkin, Alex Bennée, Juan Quintela,
	qemu-devel, Eric Blake, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 6/3/2021 4:44 PM, Daniel P. Berrangé wrote:
> On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
>>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>>>>> On the 'restart' branch of questions; can you explain,
>>>>>> other than the passing of the fd's, why the outgoing side of
>>>>>> qemu's 'migrate exec:' doesn't work for you?
>>>>>
>>>>> I'm not sure what I should describe.  Can you be more specific?
>>>>> Do you mean: can we add the cpr specific bits to the migrate exec code?
>>>>
>>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
>>>> It has an advantage of letting you specify the new command line; that
>>>> avoids the problems I'd pointed out with needing to change the command
>>>> line if a hotplug had happened.  It also means we only need one chunk of
>>>> exec code.
>>>
>>> How/where would you specify a new command line?  Are you picturing the usual migration
>>> setup where you start a second qemu with its own arguments, plus a migrate_incoming
>>> option or command?  That does not work for live update restart; the old qemu must exec
>>> the new qemu.  Or something else?
>>
>> The existing migration path allows an exec - originally intended to exec
>> something like a compressor or a store to file rather than a real
>> migration; i.e. you can do:
>>
>>   migrate "exec:gzip > mig"
>>
>> and that will save the migration stream to a compressed file called mig.
>> Now, I *think* we can already do:
>>
>>   migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
>> (That's probably cleaner via the QMP interface).
>>
>> I'm not quite sure what I want in the incoming there, but that is
>> already the source execing the destination qemu - although I think we'd
>> probably need to check if that's actually via an intermediary.
> 
> I don't think you can directly exec qemu in that way, because the
> source QEMU migration code is going to wait for completion of the
> QEMU you exec'd and that'll never come on success. So you'll end
> up with both QEMU's running forever. If you pass the daemonize
> option to the new QEMU then it will immediately detach itself,
> and the source QEMU will think the migration command has finished
> or failed.
> 
> I think you can probably do it if you use a wrapper script though.
> The wrapper would have to fork QEMU in the background, and then the
> wrapper would have to monitor the new QEMU to see when the incoming
> migration has finished/aborted, at which point the wrapper can
> exit, so the source QEMU sees a successful cleanup of the exec'd
> command. </hand waving>

cpr restart does not work for any scheme that involves the old qemu process co-existing with
the new qemu process.  To preserve descriptors and anonymous memory, cpr restart requires 
that old qemu directly execs new qemu.  Not fork-exec.  Same pid.

So responding to Dave's comment, "keep the one exec mechanism", that is not possible.
We still need the qemu_exec_requested mechanism to cause a direct exec after state is
saved.
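
As a rough illustration of why a direct exec preserves descriptors, here
is a sketch of the bookkeeping involved: clear close-on-exec on a
descriptor, record its number in the environment, and look it up again
after exec.  The series has its own helpers for this (qemu_clr_cloexec and
the env var helpers); the function names and the DEMO_MEMFD variable used
in the note below are hypothetical and are not taken from the patches.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Clear FD_CLOEXEC so the descriptor stays open across exec(). */
static int fd_keep_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Before exec: record the descriptor number under an environment name. */
static int fd_save_env(const char *name, int fd)
{
    char value[16];

    assert(name != NULL && fd >= 0);
    snprintf(value, sizeof(value), "%d", fd);
    return setenv(name, value, 1);
}

/* After exec: recover the descriptor number, or -1 if it is unusable. */
static int fd_load_env(const char *name)
{
    assert(name != NULL);

    const char *value = getenv(name);
    if (!value) {
        return -1;
    }

    char *end;
    long fd = strtol(value, &end, 10);
    if (end == value || *end != '\0' || fd < 0 || fd > INT_MAX) {
        return -1;
    }
    /* Reject numbers that do not name an open descriptor. */
    if (fcntl((int)fd, F_GETFD) < 0) {
        return -1;
    }
    return (int)fd;
}
```

The old process would call fd_keep_across_exec() and
fd_save_env("DEMO_MEMFD", fd) and then exec the new binary in place.
Because the pid and descriptor table are unchanged, the new image can call
fd_load_env("DEMO_MEMFD") and mmap the same memory again.  A fork-exec'd
child would not help, since the parent would still hold the originals.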

>>> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
>>> capability that the client would set before issuing the migrate command.  However, the
>>> implementation code would be sprinkled with conditionals to suppress migrate-only bits
>>> and call cpr-only bits.  IMO that would be less maintainable than having a separate
>>> cprsave function.  Currently cprsave does not duplicate any migration functionality.
>>> cprsave calls qemu_save_device_state() which is used by xen.
>>
>> To me it feels like cprsave in particular is replicating more code.
>>
>> It's also jumping through hoops in places to avoid changing the
>> commandline;  that's going to cause more pain for a lot of people - not
>> just because it's hacks all over for that, but because a lot of people
>> are actually going to need to change the commandline even in a cpr like
>> case (e.g. due to hotplug or changing something else about the
>> environment, like auth data or route to storage or networking that
>> changed).
> 
> Management apps that already support migration, will almost certainly
> know how to start up a new QEMU with a different command line that
> takes account of hotplugged/unplugged devices. IOW avoiding changing
> the command line only really addresses the simple case, and the hard
> case is likely already solved for purposes of handling regular live
> migration. 

Agreed, with the caveat that for cpr, the management app must communicate the new arguments
to the qemu-exec trampoline, rather than passing the args on the command line to a new 
qemu process.

>> There are hooks for early parameter parsing, so if we need to add extra
>> commandline args we can; but for example the case of QEMU_START_FREEZE
>> to add -S just isn't needed as soon as you let go of the idea of needing
>> an identical commandline.

I'll delete QEMU_START_FREEZE.  

I still need to preserve argv_main and pass it to the qemu-exec trampoline, though, as
the args contain identifying information that the management app needs to modify the
arguments based on the instance's hot plug history.

Or, here is another possibility.  We could redefine cprsave to leave the VM in a
stopped state, and add a cprstart command to be called subsequently that performs 
the exec.  It takes a single string argument: a command plus arguments to exec.  
The command may be qemu or a trampoline like qemu-exec.  I like that the trampoline
name is no longer hardcoded.  The management app can derive new qemu args for the
instances as it would with migration, and pass them to the command, instead of passing
them to qemu-exec via some side channel.  cprload finishes the job and does not change.
I already like this scheme better.
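
To sketch what "a single string argument: a command plus arguments"
could mean in practice, the fragment below splits such a string into an
argument vector and execs it in place.  This is only an illustration under
simple assumptions: it splits on spaces and tabs with no quoting support,
and split_command(), exec_command(), and MAX_ARGS are hypothetical names,
not part of the series.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <unistd.h>

#define MAX_ARGS 64

/*
 * Split cmd in place on spaces and tabs into args_out[], NULL-terminated.
 * Returns the argument count, or -1 if there are too many arguments.
 * Quoting and escaping are not handled.
 */
static int split_command(char *cmd, char *args_out[MAX_ARGS])
{
    int count = 0;
    char *save = NULL;

    assert(cmd != NULL && args_out != NULL);
    for (char *tok = strtok_r(cmd, " \t", &save); tok != NULL;
         tok = strtok_r(NULL, " \t", &save)) {
        if (count == MAX_ARGS - 1) {
            return -1;
        }
        args_out[count++] = tok;
    }
    args_out[count] = NULL;
    return count;
}

/*
 * Replace the current process image with the given command.
 * Returns -1 if the string is empty, too long, or the exec fails;
 * on success it does not return.
 */
static int exec_command(char *cmd)
{
    char *argv[MAX_ARGS];

    if (split_command(cmd, argv) <= 0) {
        return -1;
    }
    execvp(argv[0], argv);   /* same pid, so preserved descriptors survive */
    return -1;
}
```

With this shape the management app supplies the full command, so the
command could name qemu directly or a trampoline such as qemu-exec.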

- Steve


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 08/22] cpr: QMP interfaces
  2021-06-04 13:59   ` Eric Blake
@ 2021-06-07 17:19     ` Steven Sistare
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Sistare @ 2021-06-07 17:19 UTC (permalink / raw)
  To: Eric Blake
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On 6/4/2021 9:59 AM, Eric Blake wrote:
> On Fri, May 07, 2021 at 05:25:06AM -0700, Steve Sistare wrote:
>> cprsave calls cprsave().  Syntax:
>>   { 'enum': 'CprMode', 'data': [ 'reboot', 'restart' ] }
>>   { 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'CprMode' } }
>>
>> cprload calls cprload().  Syntax:
>>   { 'command': 'cprload', 'data': { 'file': 'str' } }
>>
>> cprinfo returns a list of supported modes.  Syntax:
>>   { 'struct': 'CprInfo', 'data': { 'modes': [ 'CprMode' ] } }
>>   { 'command': 'cprinfo', 'returns': 'CprInfo' }
>>
>> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
> 
>> +++ b/qapi/cpr.json
> 
>> +##
>> +# @CprMode:
>> +#
>> +# @reboot: checkpoint can be cprload'ed after a host kexec reboot.
>> +#
>> +# @restart: checkpoint can be cprload'ed after restarting qemu.
>> +#
>> +# Since: 6.0
> 
> We've missed 6.0; this and all other since tags should mention 6.1.

Yes, thanks.  You caught a different instance in a previous email and I did a global search 
and replace in my workspace.

- Steve

 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [restart] : code replication
  2021-06-03 19:36                         ` Dr. David Alan Gilbert
  2021-06-03 20:44                           ` Daniel P. Berrangé
@ 2021-06-07 18:08                           ` Steven Sistare
  2021-06-14 14:33                             ` Steven Sistare
  1 sibling, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-06-07 18:08 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

[-- Attachment #1: Type: text/plain, Size: 2402 bytes --]

On 6/3/2021 3:36 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>>>> On the 'restart' branch of questions; can you explain,
>>>>> other than the passing of the fd's, why the outgoing side of
>>>>> qemu's 'migrate exec:' doesn't work for you?
>>>>
>>>> I'm not sure what I should describe.  Can you be more specific?
>>>> Do you mean: can we add the cpr specific bits to the migrate exec code?
>>>
>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
>>> It has an advantage of letting you specify the new command line; that
>>> avoids the problems I'd pointed out with needing to change the command
>>> line if a hotplug had happened.  It also means we only need one chunk of
>>> exec code.
>>
>> [...]
> 
> I'm not quite sure what I want in the incoming there, but that is
> already the source execing the destination qemu - although I think we'd
> probably need to check if that's actually via an intermediary.
> 
>> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
>> capability that the client would set before issuing the migrate command.  However, the
>> implementation code would be sprinkled with conditionals to suppress migrate-only bits
>> and call cpr-only bits.  IMO that would be less maintainable than having a separate
>> cprsave function.  Currently cprsave does not duplicate any migration functionality.
>> cprsave calls qemu_save_device_state() which is used by xen.
> 
> To me it feels like cprsave in particular is replicating more code. 

In the attached file I annotated lines of code that have some overlap
with migration code actions.  They include vm control, global_state_store,
and vmstate save, and cover 18 lines of 78 total.  I did not include the
body of qf_file_open because it is also called by xen.

The migration code adds capabilities, parameters, state, status, info,
precopy, postcopy, dirty bitmap, events, notifiers, 6 channel types,
blockers, pause+resume, return path, request-reply commands, throttling, colo,
blocks, phases, iteration, and threading, implemented by 20000+ lines of code.
To me it seems wrong to throw cpr into that mix to avoid adding tens of lines 
of similar code.

- Steve

[-- Attachment #2: cprsave.txt --]
[-- Type: text/plain, Size: 2107 bytes --]

  void cprsave(const char *file, CprMode mode, Error **errp)
  {
      int ret = 0;
**    QEMUFile *f;
**    int saved_vm_running = runstate_is_running();
      bool restart = (mode == CPR_MODE_RESTART);
      bool reboot = (mode == CPR_MODE_REBOOT);
  
      if (reboot && qemu_ram_volatile(errp)) {
          return;
      }
  
      if (restart && xen_enabled()) {
          error_setg(errp, "xen does not support cprsave restart");
          return;
      }
  
      if (migrate_colo_enabled()) {
          error_setg(errp, "error: cprsave does not support x-colo");
          return;
      }
  
      if (replay_mode != REPLAY_MODE_NONE) {
          error_setg(errp, "error: cprsave does not support replay");
          return;
      }
  
      f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, "cprsave", errp);
      if (!f) {
          return;
      }
  
**    ret = global_state_store();
**    if (ret) {
**        error_setg(errp, "Error saving global state");
**        qemu_fclose(f);
**        return;
**    }
      if (runstate_check(RUN_STATE_SUSPENDED)) {
          /* Update timers_state before saving.  Suspend did not do so. */
          cpu_disable_ticks();
      }
**    vm_stop(RUN_STATE_SAVE_VM);
  
      cpr_is_active = true;
**    ret = qemu_save_device_state(f);
**    qemu_fclose(f);
**    if (ret < 0) {
**        error_setg(errp, "Error %d while saving VM state", ret);
**        goto err;
**    }
  
      if (reboot) {
          shutdown_action = SHUTDOWN_ACTION_POWEROFF;
          qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
      } else if (restart) {
          if (!qemu_chr_cpr_capable(errp)) {
              goto err;
          }
          if (vfio_cprsave(errp)) {
              goto err;
          }
          walkenv(FD_PREFIX, preserve_fd, 0);
          vhost_dev_reset_all();
          qemu_term_exit();
          setenv("QEMU_START_FREEZE", "", 1);
          qemu_system_exec_request();
      }
      goto done;
  
  err:
**    if (saved_vm_running) {
**        vm_start();
**    }
  done:
      cpr_is_active = false;
      return;
  }
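
The walkenv(FD_PREFIX, preserve_fd, 0) step in the restart branch relies on
descriptors staying open across exec and being found again through the
environment, as the cover letter describes.  A minimal sketch of that general
technique follows.  The function names and the CPR_DEMO_FD variable are
hypothetical, and this is not the preserve_fd implementation from the series.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Clear FD_CLOEXEC so fd stays open across exec.  Returns 0, or -1 on error. */
int keep_fd_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Record fd under an environment variable so the new image can find it. */
int save_fd_in_env(const char *name, int fd)
{
    char val[16];

    snprintf(val, sizeof(val), "%d", fd);
    return setenv(name, val, 1);
}

/* Look up a descriptor saved by save_fd_in_env.  Returns -1 if absent. */
int load_fd_from_env(const char *name)
{
    const char *val = getenv(name);

    return val ? atoi(val) : -1;
}
```

The descriptor number is only meaningful after a direct exec in the same
process; a fork-exec'd child would not share the anonymous memory behind it.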

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 12/22] vfio-pci: cpr part 1
  2021-05-24 18:29     ` Steven Sistare
@ 2021-06-11 18:15       ` Steven Sistare
  2021-06-11 19:43         ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-06-11 18:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 5/24/2021 2:29 PM, Steven Sistare wrote:
> On 5/21/2021 6:24 PM, Alex Williamson wrote:
>> On Fri,  7 May 2021 05:25:10 -0700
>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>>>[...]
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index 7a4fb6c..f7ac9f03 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -29,6 +29,8 @@
>>>  #include "hw/qdev-properties.h"
>>>  #include "hw/qdev-properties-system.h"
>>>  #include "migration/vmstate.h"
>>> +#include "migration/cpr.h"
>>> +#include "qemu/env.h"
>>>  #include "qemu/error-report.h"
>>>  #include "qemu/main-loop.h"
>>>  #include "qemu/module.h"
>>> @@ -1612,6 +1614,14 @@ static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled)
>>>      }
>>>  }
>>>  
>>> +static void vfio_config_sync(VFIOPCIDevice *vdev, uint32_t offset, size_t len)
>>> +{
>>> +    if (pread(vdev->vbasedev.fd, vdev->pdev.config + offset, len,
>>> +          vdev->config_offset + offset) != len) {
>>> +        error_report("vfio_config_sync pread failed");
>>> +    }
>>> +}
>>> +
>>>  static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
>>>  {
>>>      VFIOBAR *bar = &vdev->bars[nr];
>>> @@ -1652,6 +1662,7 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>>>  static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>>  {
>>>      VFIOBAR *bar = &vdev->bars[nr];
>>> +    PCIDevice *pdev = &vdev->pdev;
>>>      char *name;
>>>  
>>>      if (!bar->size) {
>>> @@ -1672,7 +1683,10 @@ static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>>          }
>>>      }
>>>  
>>> -    pci_register_bar(&vdev->pdev, nr, bar->type, bar->mr);
>>> +    pci_register_bar(pdev, nr, bar->type, bar->mr);
>>> +    if (pdev->reused) {
>>> +        vfio_config_sync(vdev, pci_bar(pdev, nr), 8);
>>
>> Assuming 64-bit BARs?  This might be the first case where we actually
>> rely on the kernel BAR values, IIRC we usually use QEMU's emulation.
> 
> No assumptions.  vfio_config_sync() preads a piece of config space using a single
> system call, copying directly to the qemu buffer, not looking at words or calling any
> action functions.
> 
>[...] 
>>> @@ -3098,6 +3115,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>      vfio_register_req_notifier(vdev);
>>>      vfio_setup_resetfn_quirk(vdev);
>>>  
>>> +    vfio_config_sync(vdev, pdev->msix_cap + PCI_MSIX_FLAGS, 2);
>>> +    if (pdev->reused) {
>>> +        pci_update_mappings(pdev);
>>> +    }
>>> +
>>
>> Are the msix flag sync and mapping update related?  They seem
>> independent to me.  A blank line and comment would be helpful.
> 
> OK.
> 
>> I expect we'd need to call msix_enabled() somewhere for the msix flag
>> sync to be effective.
> 
> Yes, vfio_pci_post_load in cpr part 2 calls msix_enabled.
> 
>> Is there an assumption here of msi-x only support or is it not needed
>> for msi or intx?
> 
> The code supports msi-x and msi.  However, I should only be sync'ing PCI_MSIX_FLAGS
> if pdev->cap_present & QEMU_PCI_CAP_MSIX.  And, I am missing a sync for PCI_MSI_FLAGS.
> I'll fix that.

Hi Alex, FYI, I am making more changes here.  The calls to vfio_config_sync fix pdev->config[]
words that are initialized during vfio_realize(), by pread'ing from the live kernel config.
However, it makes more sense to suppress the undesired re-initialization, rather than undo
the damage later.  Thus I will add a few more 'if (!pdev->reused)' guards in msix and pci bar
init functions, and delete vfio_config_sync.

Most of the config is preserved in the kernel across restart.  However, the bits that are
purely emulated (indicated by the emulated_config_bits mask) may be rejected when they 
are written through to the kernel, and thus are currently lost on restart.  I need to save 
pdev->config[] in the vmstate file, and in vfio_pci_post_load, merge it with the kernel 
config using emulated_config_bits.
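
To illustrate the merge step, here is a minimal sketch, not code from the
series.  The helper name cpr_merge_config and the flat byte arrays are
assumptions for illustration; I am assuming a set bit in the mask means the
bit is emulated and so must come from the saved copy.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: config[] holds the live kernel config on entry.
 * For each byte, take the emulated bits from the saved vmstate copy and
 * keep every other bit from the kernel copy.
 */
void cpr_merge_config(uint8_t *config, const uint8_t *saved,
                      const uint8_t *emulated_bits, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        config[i] = (uint8_t)((saved[i] & emulated_bits[i]) |
                              (config[i] & ~emulated_bits[i]));
    }
}
```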

Sound sane?

- Steve


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 12/22] vfio-pci: cpr part 1
  2021-06-11 18:15       ` Steven Sistare
@ 2021-06-11 19:43         ` Steven Sistare
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Sistare @ 2021-06-11 19:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Dr. David Alan Gilbert, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 6/11/2021 2:15 PM, Steven Sistare wrote:
> On 5/24/2021 2:29 PM, Steven Sistare wrote:
>> On 5/21/2021 6:24 PM, Alex Williamson wrote:
>>> On Fri,  7 May 2021 05:25:10 -0700
>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>
>>>> [...]
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 7a4fb6c..f7ac9f03 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -29,6 +29,8 @@
>>>>  #include "hw/qdev-properties.h"
>>>>  #include "hw/qdev-properties-system.h"
>>>>  #include "migration/vmstate.h"
>>>> +#include "migration/cpr.h"
>>>> +#include "qemu/env.h"
>>>>  #include "qemu/error-report.h"
>>>>  #include "qemu/main-loop.h"
>>>>  #include "qemu/module.h"
>>>> @@ -1612,6 +1614,14 @@ static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled)
>>>>      }
>>>>  }
>>>>  
>>>> +static void vfio_config_sync(VFIOPCIDevice *vdev, uint32_t offset, size_t len)
>>>> +{
>>>> +    if (pread(vdev->vbasedev.fd, vdev->pdev.config + offset, len,
>>>> +          vdev->config_offset + offset) != len) {
>>>> +        error_report("vfio_config_sync pread failed");
>>>> +    }
>>>> +}
>>>> +
>>>>  static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
>>>>  {
>>>>      VFIOBAR *bar = &vdev->bars[nr];
>>>> @@ -1652,6 +1662,7 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
>>>>  static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>>>  {
>>>>      VFIOBAR *bar = &vdev->bars[nr];
>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>>      char *name;
>>>>  
>>>>      if (!bar->size) {
>>>> @@ -1672,7 +1683,10 @@ static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
>>>>          }
>>>>      }
>>>>  
>>>> -    pci_register_bar(&vdev->pdev, nr, bar->type, bar->mr);
>>>> +    pci_register_bar(pdev, nr, bar->type, bar->mr);
>>>> +    if (pdev->reused) {
>>>> +        vfio_config_sync(vdev, pci_bar(pdev, nr), 8);
>>>
>>> Assuming 64-bit BARs?  This might be the first case where we actually
>>> rely on the kernel BAR values, IIRC we usually use QEMU's emulation.
>>
>> No assumptions.  vfio_config_sync() preads a piece of config space using a single
>> system call, copying directly to the qemu buffer, not looking at words or calling any
>> action functions.
>>
>> [...] 
>>>> @@ -3098,6 +3115,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>>>      vfio_register_req_notifier(vdev);
>>>>      vfio_setup_resetfn_quirk(vdev);
>>>>  
>>>> +    vfio_config_sync(vdev, pdev->msix_cap + PCI_MSIX_FLAGS, 2);
>>>> +    if (pdev->reused) {
>>>> +        pci_update_mappings(pdev);
>>>> +    }
>>>> +
>>>
>>> Are the msix flag sync and mapping update related?  They seem
>>> independent to me.  A blank line and comment would be helpful.
>>
>> OK.
>>
>>> I expect we'd need to call msix_enabled() somewhere for the msix flag
>>> sync to be effective.
>>
>> Yes, vfio_pci_post_load in cpr part 2 calls msix_enabled.
>>
>>> Is there an assumption here of msi-x only support or is it not needed
>>> for msi or intx?
>>
>> The code supports msi-x and msi.  However, I should only be sync'ing PCI_MSIX_FLAGS
>> if pdev->cap_present & QEMU_PCI_CAP_MSIX.  And, I am missing a sync for PCI_MSI_FLAGS.
>> I'll fix that.
> 
> Hi Alex, FYI, I am making more changes here.  The calls to vfio_config_sync fix pdev->config[]
> words that are initialized during vfio_realize(), by pread'ing from the live kernel config.
> However, it makes more sense to suppress the undesired re-initialization, rather than undo
> the damage later.  Thus I will add a few more 'if (!pdev->reused)' guards in msix and pci bar
> init functions, and delete vfio_config_sync.
> 
> Most of the config is preserved in the kernel across restart.  However, the bits that are
> purely emulated (indicated by the emulated_config_bits mask) may be rejected when they 
> are written through to the kernel, and thus are currently lost on restart.  I need to save 
> pdev->config[] in the vmstate file, and in vfio_pci_post_load, merge it with the kernel 
> config using emulated_config_bits.
> 
> Sound sane?

Furthermore, there is no need to check reused and suppress initialization of msix and pci bar, 
as the vmstate loader fixes them up.

- Steve


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [restart] : exec
  2021-06-07 16:40                             ` [PATCH V3 00/22] Live Update [restart] : exec Steven Sistare
@ 2021-06-14 14:31                               ` Steven Sistare
  2021-06-14 14:36                                 ` Daniel P. Berrangé
  2021-06-15 19:05                               ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-06-14 14:31 UTC (permalink / raw)
  To: Daniel P. Berrangé, Dr. David Alan Gilbert
  Cc: Jason Zeng, Michael S. Tsirkin, Alex Bennée, Juan Quintela,
	qemu-devel, Eric Blake, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 6/7/2021 12:40 PM, Steven Sistare wrote:
> On 6/3/2021 4:44 PM, Daniel P. Berrangé wrote:
>> On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
>>>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>>>>>> On the 'restart' branch of questions; can you explain,
>>>>>>> other than the passing of the fd's, why the outgoing side of
>>>>>>> qemu's 'migrate exec:' doesn't work for you?
>>>>>>
>>>>>> I'm not sure what I should describe.  Can you be more specific?
>>>>>> Do you mean: can we add the cpr specific bits to the migrate exec code?
>>>>>
>>>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
>>>>> It has an advantage of letting you specify the new command line; that
>>>>> avoids the problems I'd pointed out with needing to change the command
>>>>> line if a hotplug had happened.  It also means we only need one chunk of
>>>>> exec code.
>>>>
>>>> How/where would you specify a new command line?  Are you picturing the usual migration
>>>> setup where you start a second qemu with its own arguments, plus a migrate_incoming
>>>> option or command?  That does not work for live update restart; the old qemu must exec
>>>> the new qemu.  Or something else?
>>>
>>> The existing migration path allows an exec - originally intended to exec
>>> something like a compressor or a store to file rather than a real
>>> migration; i.e. you can do:
>>>
>>>   migrate "exec:gzip > mig"
>>>
>>> and that will save the migration stream to a compressed file called mig.
>>> Now, I *think* we can already do:
>>>
>>>   migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
>>> (That's probably cleaner via the QMP interface).
>>>
>>> I'm not quite sure what I want in the incoming there, but that is
>>> already the source execing the destination qemu - although I think we'd
>>> probably need to check if that's actually via an intermediary.
>>
>> I don't think you can directly exec qemu in that way, because the
>> source QEMU migration code is going to wait for completion of the
>> QEMU you exec'd and that'll never come on success. So you'll end
>> up with both QEMU's running forever. If you pass the daemonize
>> option to the new QEMU then it will immediately detach itself,
>> and the source QEMU will think the migration command has finished
>> or failed.
>>
>> I think you can probably do it if you use a wrapper script though.
>> The wrapper would have to fork QEMU in the backend, and then the
>> wrapper would have to monitor the new QEMU to see when the incoming
>> migration has finished/aborted, at which point the wrapper can
>> exit, so the source QEMU sees a successful cleanup of the exec'd
>> command. </hand waving>
> 
> cpr restart does not work for any scheme that involves the old qemu process co-existing with
> the new qemu process.  To preserve descriptors and anonymous memory, cpr restart requires 
> that old qemu directly execs new qemu.  Not fork-exec.  Same pid.
> 
> So responding to Dave's comment, "keep the one exec mechanism", that is not possible.
> We still need the qemu_exec_requested mechanism to cause a direct exec after state is
> saved.
> 
>>>> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
>>>> capability that the client would set before issuing the migrate command.  However, the
>>>> implementation code would be sprinkled with conditionals to suppress migrate-only bits
>>>> and call cpr-only bits.  IMO that would be less maintainable than having a separate
>>>> cprsave function.  Currently cprsave does not duplicate any migration functionality.
>>>> cprsave calls qemu_save_device_state() which is used by xen.
>>>
>>> To me it feels like cprsave in particular is replicating more code.
>>>
>>> It's also jumping through hoops in places to avoid changing the
>>> commandline;  that's going to cause more pain for a lot of people - not
>>> just because it's hacks all over for that, but because a lot of people
>>> are actually going to need to change the commandline even in a cpr like
>>> case (e.g. due to hotplug or changing something else about the
>>> environment, like auth data or route to storage or networking that
>>> changed).
>>
>> Management apps that already support migration, will almost certainly
>> know how to start up a new QEMU with a different command line that
>> takes account of hotplugged/unplugged devices. IOW avoiding changing
>> the command line only really addresses the simple case, and the hard
>> case is likely already solved for purposes of handling regular live
>> migration. 
> 
> Agreed, with the caveat that for cpr, the management app must communicate the new arguments
> to the qemu-exec trampoline, rather than passing the args on the command line to a new 
> qemu process.
> 
>>> There are hooks for early parameter parsing, so if we need to add extra
>>> commandline args we can; but for example the case of QEMU_START_FREEZE
>>> to add -S just isn't needed as soon as you let go of the idea of needing
>>> an identical commandline.
> 
> I'll delete QEMU_START_FREEZE.  
> 
> I still need to preserve argv_main and pass it to the qemu-exec trampoline, though, as 
> the args contain identifying information that the management app needs to modify the 
> arguments based on the instance's hot plug history.
> 
> Or, here is another possibility.  We could redefine cprsave to leave the VM in a
> stopped state, and add a cprstart command to be called subsequently that performs 
> the exec.  It takes a single string argument: a command plus arguments to exec.  
> The command may be qemu or a trampoline like qemu-exec.  I like that the trampoline
> name is no longer hardcoded.  The management app can derive new qemu args for the
> instances as it would with migration, and pass them to the command, instead of passing
> them to qemu-exec via some side channel.  cprload finishes the job and does not change.
> I already like this scheme better.

Or, pass argv as an additional parameter to cprsave.

Daniel, David, do you like passing argv to cprsave or a new cprstart command better than the 
current scheme?  I am ready to send V4 of the series after we resolve this and the question of
whether or not to fold cpr into the migration command.

- Steve


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [restart] : code replication
  2021-06-07 18:08                           ` [PATCH V3 00/22] Live Update [restart] : code replication Steven Sistare
@ 2021-06-14 14:33                             ` Steven Sistare
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Sistare @ 2021-06-14 14:33 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 6/7/2021 2:08 PM, Steven Sistare wrote:
> On 6/3/2021 3:36 PM, Dr. David Alan Gilbert wrote:
>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
>>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>>>>> On the 'restart' branch of questions; can you explain,
>>>>>> other than the passing of the fd's, why the outgoing side of
>>>>>> qemu's 'migrate exec:' doesn't work for you?
>>>>>
>>>>> I'm not sure what I should describe.  Can you be more specific?
>>>>> Do you mean: can we add the cpr specific bits to the migrate exec code?
>>>>
>>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
>>>> It has an advantage of letting you specify the new command line; that
>>>> avoids the problems I'd pointed out with needing to change the command
>>>> line if a hotplug had happened.  It also means we only need one chunk of
>>>> exec code.
>>>
>>> [...]
>>
>> I'm not quite sure what I want in the incoming there, but that is
>> already the source execing the destination qemu - although I think we'd
>> probably need to check if that's actually via an intermediary.
>>
>>> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
>>> capability that the client would set before issuing the migrate command.  However, the
>>> implementation code would be sprinkled with conditionals to suppress migrate-only bits
>>> and call cpr-only bits.  IMO that would be less maintainable than having a separate
>>> cprsave function.  Currently cprsave does not duplicate any migration functionality.
>>> cprsave calls qemu_save_device_state() which is used by xen.
>>
>> To me it feels like cprsave in particular is replicating more code. 
> 
> In the attached file I annotated lines of code that have some overlap
> with migration code actions.  They include vm control, global_state_store,
> and vmstate save, and cover 18 lines of 78 total.  I did not include the
> body of qf_file_open because it is also called by xen.
> 
> The migration code adds capabilities, parameters, state, status, info,
> precopy, postcopy, dirty bitmap, events, notifiers, 6 channel types,
> blockers, pause+resume, return path, request-reply commands, throttling, colo,
> blocks, phases, iteration, and threading, implemented by 20000+ lines of code.
> To me it seems wrong to throw cpr into that mix to avoid adding tens of lines 
> of similar code.

Hi David, what is your decision, will you accept separate cpr commands?
One last point is that Xen made a similar choice, adding the xen-save-devices-state
command which calls qemu_save_device_state instead of migration_thread.

- Steve


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [restart] : exec
  2021-06-14 14:31                               ` Steven Sistare
@ 2021-06-14 14:36                                 ` Daniel P. Berrangé
  0 siblings, 0 replies; 81+ messages in thread
From: Daniel P. Berrangé @ 2021-06-14 14:36 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Jason Zeng, Michael S. Tsirkin, Philippe Mathieu-Daudé,
	Juan Quintela, Dr. David Alan Gilbert, Eric Blake, qemu-devel,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Alex Bennée, Markus Armbruster

On Mon, Jun 14, 2021 at 10:31:32AM -0400, Steven Sistare wrote:
> On 6/7/2021 12:40 PM, Steven Sistare wrote:
> > On 6/3/2021 4:44 PM, Daniel P. Berrangé wrote:
> >> On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
> >>> * Steven Sistare (steven.sistare@oracle.com) wrote:
> >>>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> >>>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
> >>>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> >>>>>>> On the 'restart' branch of questions; can you explain,
> >>>>>>> other than the passing of the fd's, why the outgoing side of
> >>>>>>> qemu's 'migrate exec:' doesn't work for you?
> >>>>>>
> >>>>>> I'm not sure what I should describe.  Can you be more specific?
> >>>>>> Do you mean: can we add the cpr specific bits to the migrate exec code?
> >>>>>
> >>>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
> >>>>> It has an advantage of letting you specify the new command line; that
> >>>>> avoids the problems I'd pointed out with needing to change the command
> >>>>> line if a hotplug had happened.  It also means we only need one chunk of
> >>>>> exec code.
> >>>>
> >>>> How/where would you specify a new command line?  Are you picturing the usual migration
> >>>> setup where you start a second qemu with its own arguments, plus a migrate_incoming
> >>>> option or command?  That does not work for live update restart; the old qemu must exec
> >>>> the new qemu.  Or something else?
> >>>
> >>> The existing migration path allows an exec - originally intended to exec
> >>> something like a compressor or a store to file rather than a real
> >>> migration; i.e. you can do:
> >>>
> >>>   migrate "exec:gzip > mig"
> >>>
> >>> and that will save the migration stream to a compressed file called mig.
> >>> Now, I *think* we can already do:
> >>>
> >>>   migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
> >>> (That's probably cleaner via the QMP interface).
> >>>
> >>> I'm not quite sure what I want in the incoming there, but that is
> >>> already the source execing the destination qemu - although I think we'd
> >>> probably need to check if that's actually via an intermediary.
> >>
> >> I don't think you can directly exec qemu in that way, because the
> >> source QEMU migration code is going to wait for completion of the
> >> QEMU you exec'd and that'll never come on success. So you'll end
> >> up with both QEMU's running forever. If you pass the daemonize
> >> option to the new QEMU then it will immediately detach itself,
> >> and the source QEMU will think the migration command has finished
> >> or failed.
> >>
> >> I think you can probably do it if you use a wrapper script though.
> >> The wrapper would have to fork QEMU in the backend, and then the
> >> wrapper would have to monitor the new QEMU to see when the incoming
> >> migration has finished/aborted, at which point the wrapper can
> >> exit, so the source QEMU sees a successful cleanup of the exec'd
> >> command. </hand waving>
> > 
> > cpr restart does not work for any scheme that involves the old qemu process co-existing with
> > the new qemu process.  To preserve descriptors and anonymous memory, cpr restart requires 
> > that old qemu directly execs new qemu.  Not fork-exec.  Same pid.
> > 
> > So responding to Dave's comment, "keep the one exec mechanism", that is not possible.
> > We still need the qemu_exec_requested mechanism to cause a direct exec after state is
> > saved.
> > 
> >>>> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
> >>>> capability that the client would set before issuing the migrate command.  However, the
> >>>> implementation code would be sprinkled with conditionals to suppress migrate-only bits
> >>>> and call cpr-only bits.  IMO that would be less maintainable than having a separate
> >>>> cprsave function.  Currently cprsave does not duplicate any migration functionality.
> >>>> cprsave calls qemu_save_device_state() which is used by xen.
> >>>
> >>> To me it feels like cprsave in particular is replicating more code.
> >>>
> >>> It's also jumping through hoops in places to avoid changing the
> >>> commandline;  that's going to cause more pain for a lot of people - not
> >>> just because it's hacks all over for that, but because a lot of people
> >>> are actually going to need to change the commandline even in a cpr like
> >>> case (e.g. due to hotplug or changing something else about the
> >>> environment, like auth data or route to storage or networking that
> >>> changed).
> >>
> >> Management apps that already support migration, will almost certainly
> >> know how to start up a new QEMU with a different command line that
> >> takes account of hotplugged/unplugged devices. IOW avoiding changing
> >> the command line only really addresses the simple case, and the hard
> >> case is likely already solved for purposes of handling regular live
> >> migration. 
> > 
> > Agreed, with the caveat that for cpr, the management app must communicate the new arguments
> > to the qemu-exec trampoline, rather than passing the args on the command line to a new 
> > qemu process.
> > 
> >>> There are hooks for early parameter parsing, so if we need to add extra
> >>> commandline args we can; but for example the case of QEMU_START_FREEZE
> >>> to add -S just isn't needed as soon as you let go of the idea of needing
> >>> an identical commandline.
> > 
> > I'll delete QEMU_START_FREEZE.  
> > 
> > I still need to preserve argv_main and pass it to the qemu-exec trampoline, though, as 
> > the args contain identifying information that the management app needs to modify the 
> > arguments based on the instance's hot plug history.
> > 
> > Or, here is another possibility.  We could redefine cprsave to leave the VM in a
> > stopped state, and add a cprstart command to be called subsequently that performs 
> > the exec.  It takes a single string argument: a command plus arguments to exec.  
> > The command may be qemu or a trampoline like qemu-exec.  I like that the trampoline
> > name is no longer hardcoded.  The management app can derive new qemu args for the
> > instances as it would with migration, and pass them to the command, instead of passing
> > them to qemu-exec via some side channel.  cprload finishes the job and does not change.
> > I already like this scheme better.
> 
> Or, pass argv as an additional parameter to cprsave.
> 
> Daniel, David, do you like passing argv to cprsave or a new cprstart command better than the 
> current scheme?  I am ready to send V4 of the series after we resolve this and the question of
> whether or not to fold cpr into the migration command.

I don't really have a strong opinion on this either way.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH V3 00/22] Live Update [restart] : exec
  2021-06-07 16:40                             ` [PATCH V3 00/22] Live Update [restart] : exec Steven Sistare
  2021-06-14 14:31                               ` Steven Sistare
@ 2021-06-15 19:05                               ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 81+ messages in thread
From: Dr. David Alan Gilbert @ 2021-06-15 19:05 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrangé,
	Michael S. Tsirkin, Jason Zeng, Philippe Mathieu-Daudé,
	Juan Quintela, qemu-devel, Eric Blake, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Alex Bennée

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 6/3/2021 4:44 PM, Daniel P. Berrangé wrote:
> > On Thu, Jun 03, 2021 at 08:36:42PM +0100, Dr. David Alan Gilbert wrote:
> >> * Steven Sistare (steven.sistare@oracle.com) wrote:
> >>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> >>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
> >>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> >>>>>> On the 'restart' branch of questions; can you explain,
> >>>>>> other than the passing of the fd's, why the outgoing side of
> >>>>>> qemu's 'migrate exec:' doesn't work for you?
> >>>>>
> >>>>> I'm not sure what I should describe.  Can you be more specific?
> >>>>> Do you mean: can we add the cpr specific bits to the migrate exec code?
> >>>>
> >>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
> >>>> It has an advantage of letting you specify the new command line; that
> >>>> avoids the problems I'd pointed out with needing to change the command
> >>>> line if a hotplug had happened.  It also means we only need one chunk of
> >>>> exec code.
> >>>
> >>> How/where would you specify a new command line?  Are you picturing the usual migration
> >>> setup where you start a second qemu with its own arguments, plus a migrate_incoming
> >>> option or command?  That does not work for live update restart; the old qemu must exec
> >>> the new qemu.  Or something else?
> >>
> >> The existing migration path allows an exec - originally intended to exec
> >> something like a compressor or a store to file rather than a real
> >> migration; i.e. you can do:
> >>
> >>   migrate "exec:gzip > mig"
> >>
> >> and that will save the migration stream to a compressed file called mig.
> >> Now, I *think* we can already do:
> >>
> >>   migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
> >> (That's probably cleaner via the QMP interface).
> >>
> >> I'm not quite sure what I want in the incoming there, but that is
> >> already the source execing the destination qemu - although I think we'd
> >> probably need to check if that's actually via an intermediary.
> > 
> > I don't think you can directly exec qemu in that way, because the
> > source QEMU migration code is going to wait for completion of the
> > QEMU you exec'd and that'll never come on success. So you'll end
> > up with both QEMU's running forever. If you pass the daemonize
> > option to the new QEMU then it will immediately detach itself,
> > and the source QEMU will think the migration command has finished
> > or failed.
> > 
> > I think you can probably do it if you use a wrapper script though.
> > The wrapper would have to fork QEMU in the background, and then the
> > wrapper would have to monitor the new QEMU to see when the incoming
> > migration has finished/aborted, at which point the wrapper can
> > exit, so the source QEMU sees a successful cleanup of the exec'd
> > command. </hand waving>
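
The wrapper idea above can be sketched roughly as follows.  This is only an
illustration in Python, not code from the series: run_wrapper, wait_for_incoming
and the query_status callable are hypothetical names, and the status strings are
assumed to mirror what QMP query-migrate reports.  The wrapper starts the new
process in the background, polls until the incoming migration ends, and only
then returns an exit code for the source QEMU to see.

```python
import subprocess
import sys
import time
from typing import Callable, List


def wait_for_incoming(query_status: Callable[[], str],
                      timeout: float = 30.0,
                      interval: float = 0.1) -> int:
    """Poll until the incoming migration ends; return a shell-style exit code."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = query_status()          # hypothetical: e.g. ask the new QEMU via QMP
        if status == "completed":
            return 0                     # source sees a clean exit of the exec'd command
        if status in ("failed", "cancelled"):
            return 1
        time.sleep(interval)
    return 1                             # treat a timeout as failure


def run_wrapper(argv: List[str], query_status: Callable[[], str],
                timeout: float = 30.0, interval: float = 0.1) -> int:
    """Start the new process in the background, then monitor it."""
    subprocess.Popen(argv)               # not waited on; it outlives the wrapper
    return wait_for_incoming(query_status, timeout, interval)


if __name__ == "__main__":
    # Stand-in child and canned statuses, purely for demonstration.
    statuses = iter(["setup", "active", "completed"])
    code = run_wrapper([sys.executable, "-c", "pass"],
                       lambda: next(statuses), interval=0)
    print("wrapper exit code:", code)
```

Note that this keeps two processes alive at once, which is exactly what the
reply below rules out for cpr restart.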
> 
> cpr restart does not work for any scheme that involves the old qemu process co-existing with
> the new qemu process.  To preserve descriptors and anonymous memory, cpr restart requires 
> that old qemu directly execs new qemu.  Not fork-exec.  Same pid.
> 
> So responding to Dave's comment, "keep the one exec mechanism", that is not possible.
> We still need the qemu_exec_requested mechanism to cause a direct exec after state is
> saved.
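
To make the "same pid, no fork" constraint concrete, here is a small sketch in
Python rather than QEMU's C.  It is not the series' implementation: the DEMO_*
environment variable names and run_demo are made up for illustration, and it
assumes Linux with Python 3.8 or later for os.memfd_create.  The "old" program
creates a memfd, clears close-on-exec, records the fd number in the environment
and execs in place; the "new" program finds the fd from the environment and
mmaps it again.  The outer subprocess exists only so the result can be captured.

```python
import os
import subprocess
import sys

# "Old" process: back some memory with a memfd, keep it open across exec,
# publish the fd number in the environment, then exec directly (no fork).
OLD = r'''
import os, sys
fd = os.memfd_create("guest-ram")          # anonymous memory reachable via an fd
os.write(fd, b"guest-ram-contents")
os.set_inheritable(fd, True)               # clear FD_CLOEXEC so exec keeps it open
os.environ["DEMO_MEMFD"] = str(fd)         # hypothetical variable names
os.environ["DEMO_OLD_PID"] = str(os.getpid())
os.execv(sys.executable, [sys.executable, "-c", os.environ["DEMO_NEW_CODE"]])
'''

# "New" process: same pid, new address space; re-mmap the preserved memory.
NEW = r'''
import mmap, os
fd = int(os.environ["DEMO_MEMFD"])
size = os.fstat(fd).st_size
with mmap.mmap(fd, size) as ram:           # new virtual address, same pages
    data = ram[:].decode()
same_pid = os.getpid() == int(os.environ["DEMO_OLD_PID"])
print(f"same_pid={same_pid} data={data}")
'''


def run_demo() -> str:
    """Run the old program in a child so its post-exec output can be captured."""
    env = dict(os.environ, DEMO_NEW_CODE=NEW)
    result = subprocess.run([sys.executable, "-c", OLD], env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    print(run_demo())
```

With fork-exec the child would have a different pid.  Here the process image
is replaced in place, so the contents written before the exec are still
readable through the inherited descriptor afterwards.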

OK, note that if you can find any way to make kernel changes to avoid this
kexec, life is going to get *much* better; starting a separate qemu at
the management layer would be so much easier.

> >>> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
> >>> capability that the client would set before issuing the migrate command.  However, the
> >>> implementation code would be sprinkled with conditionals to suppress migrate-only bits
> >>> and call cpr-only bits.  IMO that would be less maintainable than having a separate
> >>> cprsave function.  Currently cprsave does not duplicate any migration functionality.
> >>> cprsave calls qemu_save_device_state() which is used by xen.
> >>
> >> To me it feels like cprsave in particular is replicating more code.
> >>
> >> It's also jumping through hoops in places to avoid changing the
> >> commandline;  that's going to cause more pain for a lot of people - not
> >> just because it's hacks all over for that, but because a lot of people
> >> are actually going to need to change the commandline even in a cpr like
> >> case (e.g. due to hotplug or changing something else about the
> >> environment, like auth data or route to storage or networking that
> >> changed).
> > 
> > Management apps that already support migration, will almost certainly
> > know how to start up a new QEMU with a different command line that
> > takes account of hotplugged/unplugged devices. IOW avoiding changing
> > the command line only really addresses the simple case, and the hard
> > case is likely already solved for purposes of handling regular live
> > migration. 
> 
> Agreed, with the caveat that for cpr, the management app must communicate the new arguments
> to the qemu-exec trampoline, rather than passing the args on the command line to a new 
> qemu process.
> 
> >> There are hooks for early parameter parsing, so if we need to add extra
> >> commandline args we can; but for example the case of QEMU_START_FREEZE
> >> to add -S just isn't needed as soon as you let go of the idea of needing
> >> an identical commandline.
> 
> I'll delete QEMU_START_FREEZE.  
> 
> I still need to preserve argv_main and pass it to the qemu-exec trampoline, though, as 
> the args contain identifying information that the management app needs to modify the 
> arguments based on the instance's hot plug history.
> 
> Or, here is another possibility.  We could redefine cprsave to leave the VM in a
> stopped state, and add a cprstart command to be called subsequently that performs 
> the exec.  It takes a single string argument: a command plus arguments to exec.  
> The command may be qemu or a trampoline like qemu-exec.  I like that the trampoline
> name is no longer hardcoded.  The management app can derive new qemu args for the
> instances as it would with migration, and pass them to the command, instead of passing
> them to qemu-exec via some side channel.  cprload finishes the job and does not change.
> I already like this scheme better.
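
As a rough illustration of the proposed cprstart argument (a single string
holding a command plus its arguments), the sketch below splits such a string
into an argv suitable for exec.  It is in Python and purely hypothetical:
cprstart_argv is an invented helper, and the real command would be implemented
in QEMU's C code with whatever quoting rules the final interface defines.

```python
import shlex
from typing import List


def cprstart_argv(command: str) -> List[str]:
    """Split a 'command plus arguments' string into an argv list for exec."""
    argv = shlex.split(command)            # shell-like quoting, no shell is run
    if not argv:
        raise ValueError("cprstart needs a command to exec")
    return argv


if __name__ == "__main__":
    # The management app derives the new arguments, as it would for migration.
    argv = cprstart_argv("qemu-exec -name 'vm 1' -S")
    print(argv)
    # A real implementation would now replace the process image, e.g.:
    #   os.execvp(argv[0], argv)
```

The point of the design is visible in the call site: the trampoline name and
the new arguments arrive together in one string, so no side channel is needed.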

Right, that's sounding better; now the other benefit you get is you
don't need to play with environment variables; you can define a command
line option that takes all the extra data it needs.

Dave

> - Steve
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V3 00/22] Live Update [reboot]
  2021-05-21 14:55                   ` Steven Sistare
@ 2021-06-15 19:14                     ` Dr. David Alan Gilbert
  2021-06-24 15:05                       ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Dr. David Alan Gilbert @ 2021-06-15 19:14 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
> > Hi Steven,
> >   I'd like to split the discussion into reboot and restart,
> > so I can make sure I understand them individually.
> > 
> > So reboot mode;
> > Can you explain which parts of this series are needed for reboot mode;
> > I've managed to do a kexec based reboot on qemu with the current qemu -
> > albeit with no vfio devices, my current understanding is that for doing
> > reboot with vfio we just need some way of getting migrate to send the
> > metadata associated with vfio devices if the guest is in S3.
> > 
> > Is there something I'm missing and which you have in this series?
> 
> You are correct, this series has little special code for reboot mode, but it does allow
> reboot and restart to be handled similarly, which simplifies the management layer because 
> the same calls are performed for each mode. 
> 
> For vfio in reboot mode, prior to sending cprload, the manager sends the guest-suspend-ram
> command to the qemu guest agent. This flushes requests and brings the guest device to a 
> reset state, so there is no vfio metadata to save.  Reboot mode does not call vfio_cprsave.
> 
> There are a few unique patches to support reboot mode.  One is qemu_ram_volatile, which
> is a sanity check that the writable ram blocks are backed by some form of shared memory.
> Plus there are a few fragments in the "cpr" patch that handle the suspended state that
> is induced by guest-suspend-ram.  See qemu_system_start_on_wake_request() and instances
> of RUN_STATE_SUSPENDED in migration/cpr.c
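
The kind of sanity check described for qemu_ram_volatile can be pictured as
below.  This is a Python sketch under stated assumptions, not the patch itself:
RamBlock here is a hypothetical stand-in with only the fields needed for the
example, and volatile_blocks is an invented name.  A writable block that is not
backed by shared memory would lose its contents across the reboot, so it is
reported.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RamBlock:
    """Hypothetical stand-in for a guest ram block (illustrative fields only)."""
    name: str
    writable: bool
    shared: bool      # backed by some form of shared memory


def volatile_blocks(blocks: Iterable[RamBlock]) -> List[str]:
    """Return names of writable blocks whose contents would not be preserved."""
    return [b.name for b in blocks if b.writable and not b.shared]


if __name__ == "__main__":
    blocks = [
        RamBlock("pc.ram", writable=True, shared=True),
        RamBlock("pc.rom", writable=False, shared=False),
        RamBlock("vga.vram", writable=True, shared=False),
    ]
    bad = volatile_blocks(blocks)
    if bad:
        print("cannot save, volatile ram blocks:", ", ".join(bad))
```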

Could you split off the 'reboot' part separately, then we can review
that and perhaps get it in first? It should be a relatively small patch
set - it'll get things moving in the right direction.

The guest-suspend-ram stuff seems reasonable as an idea; let's just try
and avoid doing it all via environment variables though; make it proper
command line options.

Dave

> - Steve
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V3 00/22] Live Update [reboot]
  2021-06-15 19:14                     ` Dr. David Alan Gilbert
@ 2021-06-24 15:05                       ` Steven Sistare
  2021-07-06 17:31                         ` Steven Sistare
  0 siblings, 1 reply; 81+ messages in thread
From: Steven Sistare @ 2021-06-24 15:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 6/15/2021 3:14 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
>>> Hi Steven,
>>>   I'd like to split the discussion into reboot and restart,
>>> so I can make sure I understand them individually.
>>>
>>> So reboot mode;
>>> Can you explain which parts of this series are needed for reboot mode;
>>> I've managed to do a kexec based reboot on qemu with the current qemu -
>>> albeit with no vfio devices, my current understanding is that for doing
>>> reboot with vfio we just need some way of getting migrate to send the
>>> metadata associated with vfio devices if the guest is in S3.
>>>
>>> Is there something I'm missing and which you have in this series?
>>
>> You are correct, this series has little special code for reboot mode, but it does allow
>> reboot and restart to be handled similarly, which simplifies the management layer because 
>> the same calls are performed for each mode. 
>>
>> For vfio in reboot mode, prior to sending cprload, the manager sends the guest-suspend-ram
>> command to the qemu guest agent. This flushes requests and brings the guest device to a 
>> reset state, so there is no vfio metadata to save.  Reboot mode does not call vfio_cprsave.
>>
>> There are a few unique patches to support reboot mode.  One is qemu_ram_volatile, which
>> is a sanity check that the writable ram blocks are backed by some form of shared memory.
>> Plus there are a few fragments in the "cpr" patch that handle the suspended state that
>> is induced by guest-suspend-ram.  See qemu_system_start_on_wake_request() and instances
>> of RUN_STATE_SUSPENDED in migration/cpr.c
> 
> Could you split off the 'reboot' part separately, then we can review
> that and perhaps get it in first? It should be a relatively small patch
> set - it'll get things moving in the right direction.
> 
> The guest-suspend-ram stuff seems reasonable as an idea; let's just try
> and avoid doing it all via environment variables though; make it proper
> command line options.

How about I delete reboot mode and the mode argument instead?  Having two modes is causing no 
end of confusion, and my primary business need is for restart mode.

- Steve



* Re: [PATCH V3 00/22] Live Update [reboot]
  2021-06-24 15:05                       ` Steven Sistare
@ 2021-07-06 17:31                         ` Steven Sistare
  0 siblings, 0 replies; 81+ messages in thread
From: Steven Sistare @ 2021-07-06 17:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 6/24/2021 11:05 AM, Steven Sistare wrote:
> On 6/15/2021 3:14 PM, Dr. David Alan Gilbert wrote:
>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>> On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
>>>> Hi Steven,
>>>>   I'd like to split the discussion into reboot and restart,
>>>> so I can make sure I understand them individually.
>>>>
>>>> So reboot mode;
>>>> Can you explain which parts of this series are needed for reboot mode;
>>>> I've managed to do a kexec based reboot on qemu with the current qemu -
>>>> albeit with no vfio devices, my current understanding is that for doing
>>>> reboot with vfio we just need some way of getting migrate to send the
>>>> metadata associated with vfio devices if the guest is in S3.
>>>>
>>>> Is there something I'm missing and which you have in this series?
>>>
>>> You are correct, this series has little special code for reboot mode, but it does allow
>>> reboot and restart to be handled similarly, which simplifies the management layer because 
>>> the same calls are performed for each mode. 
>>>
>>> For vfio in reboot mode, prior to sending cprload, the manager sends the guest-suspend-ram
>>> command to the qemu guest agent. This flushes requests and brings the guest device to a 
>>> reset state, so there is no vfio metadata to save.  Reboot mode does not call vfio_cprsave.
>>>
>>> There are a few unique patches to support reboot mode.  One is qemu_ram_volatile, which
>>> is a sanity check that the writable ram blocks are backed by some form of shared memory.
>>> Plus there are a few fragments in the "cpr" patch that handle the suspended state that
>>> is induced by guest-suspend-ram.  See qemu_system_start_on_wake_request() and instances
>>> of RUN_STATE_SUSPENDED in migration/cpr.c
>>
>> Could you split off the 'reboot' part separately, then we can review
>> that and perhaps get it in first? It should be a relatively small patch
>> set - it'll get things moving in the right direction.
>>
>> The guest-suspend-ram stuff seems reasonable as an idea; let's just try
>> and avoid doing it all via environment variables though; make it proper
>> command line options.
> 
> How about I delete reboot mode and the mode argument instead?  Having two modes is causing no 
> end of confusion, and my primary business need is for restart mode.

I just posted V4 of the patch series, refactoring reboot mode into the first 4 patches.

- Steve



Thread overview: 81+ messages
2021-05-07 12:24 [PATCH V3 00/22] Live Update Steve Sistare
2021-05-07 12:24 ` [PATCH V3 01/22] as_flat_walk Steve Sistare
2021-05-07 12:25 ` [PATCH V3 02/22] qemu_ram_volatile Steve Sistare
2021-05-07 12:25 ` [PATCH V3 03/22] oslib: qemu_clr_cloexec Steve Sistare
2021-05-07 12:25 ` [PATCH V3 04/22] util: env var helpers Steve Sistare
2021-05-07 12:25 ` [PATCH V3 05/22] machine: memfd-alloc option Steve Sistare
2021-05-07 12:25 ` [PATCH V3 06/22] vl: add helper to request re-exec Steve Sistare
2021-05-07 14:31   ` Eric Blake
2021-05-13 20:19     ` Steven Sistare
2021-05-14  8:18       ` Daniel P. Berrangé
2021-05-12 16:27   ` Stefan Hajnoczi
2021-05-13 20:20     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 07/22] cpr Steve Sistare
2021-05-12 16:19   ` Stefan Hajnoczi
2021-05-13 20:21     ` Steven Sistare
2021-05-14 11:28       ` Stefan Hajnoczi
2021-05-14 15:14         ` Steven Sistare
2021-05-18 13:42           ` Stefan Hajnoczi
2021-05-07 12:25 ` [PATCH V3 08/22] cpr: QMP interfaces Steve Sistare
2021-06-04 13:59   ` Eric Blake
2021-06-07 17:19     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 09/22] cpr: HMP interfaces Steve Sistare
2021-05-07 12:25 ` [PATCH V3 10/22] pci: export functions for cpr Steve Sistare
2021-05-07 12:25 ` [PATCH V3 11/22] vfio-pci: refactor " Steve Sistare
2021-05-19 22:38   ` Alex Williamson
2021-05-21 13:33     ` Steven Sistare
2021-05-21 21:07       ` Alex Williamson
2021-05-21 21:18         ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 12/22] vfio-pci: cpr part 1 Steve Sistare
2021-05-21 22:24   ` Alex Williamson
2021-05-24 18:29     ` Steven Sistare
2021-06-11 18:15       ` Steven Sistare
2021-06-11 19:43         ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 13/22] vfio-pci: cpr part 2 Steve Sistare
2021-05-21 22:24   ` Alex Williamson
2021-05-24 18:31     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 14/22] vhost: reset vhost devices upon cprsave Steve Sistare
2021-05-07 12:25 ` [PATCH V3 15/22] hostmem-memfd: cpr support Steve Sistare
2021-05-07 12:25 ` [PATCH V3 16/22] chardev: cpr framework Steve Sistare
2021-05-07 14:33   ` Eric Blake
2021-05-13 20:19     ` Steven Sistare
2021-05-07 12:25 ` [PATCH V3 17/22] chardev: cpr for simple devices Steve Sistare
2021-05-07 12:25 ` [PATCH V3 18/22] chardev: cpr for pty Steve Sistare
2021-05-07 12:25 ` [PATCH V3 19/22] chardev: cpr for sockets Steve Sistare
2021-05-07 12:25 ` [PATCH V3 20/22] cpr: only-cpr-capable option Steve Sistare
2021-05-07 12:25 ` [PATCH V3 21/22] cpr: maintainers Steve Sistare
2021-05-07 12:25 ` [PATCH V3 22/22] simplify savevm Steve Sistare
2021-05-07 13:00 ` [PATCH V3 00/22] Live Update no-reply
2021-05-13 20:42   ` Steven Sistare
2021-05-12 16:42 ` Stefan Hajnoczi
2021-05-13 20:21   ` Steven Sistare
2021-05-14 11:53     ` Stefan Hajnoczi
2021-05-14 15:15       ` Steven Sistare
2021-05-17 11:40         ` Stefan Hajnoczi
2021-05-17 19:10           ` Alex Williamson
2021-05-18 13:39             ` Stefan Hajnoczi
2021-05-18 15:48               ` Steven Sistare
2021-05-18  9:57         ` Dr. David Alan Gilbert
2021-05-18 16:00           ` Steven Sistare
2021-05-18 19:23             ` Dr. David Alan Gilbert
2021-05-18 20:01               ` Alex Williamson
2021-05-18 20:14               ` Steven Sistare
2021-05-20 13:00                 ` [PATCH V3 00/22] Live Update [reboot] Dr. David Alan Gilbert
2021-05-21 14:55                   ` Steven Sistare
2021-06-15 19:14                     ` Dr. David Alan Gilbert
2021-06-24 15:05                       ` Steven Sistare
2021-07-06 17:31                         ` Steven Sistare
2021-05-20 13:13                 ` [PATCH V3 00/22] Live Update [restart] Dr. David Alan Gilbert
2021-05-21 14:56                   ` Steven Sistare
2021-05-24 10:39                     ` Dr. David Alan Gilbert
2021-06-02 13:51                       ` Steven Sistare
2021-06-03 19:36                         ` Dr. David Alan Gilbert
2021-06-03 20:44                           ` Daniel P. Berrangé
2021-06-07 16:40                             ` [PATCH V3 00/22] Live Update [restart] : exec Steven Sistare
2021-06-14 14:31                               ` Steven Sistare
2021-06-14 14:36                                 ` Daniel P. Berrangé
2021-06-15 19:05                               ` Dr. David Alan Gilbert
2021-06-07 18:08                           ` [PATCH V3 00/22] Live Update [restart] : code replication Steven Sistare
2021-06-14 14:33                             ` Steven Sistare
2021-05-19 16:43 ` [PATCH V3 00/22] Live Update Steven Sistare
2021-06-02 15:19   ` Steven Sistare
