* [PATCH V6 00/27] Live Update
@ 2021-08-06 21:43 Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 01/27] memory: qemu_check_ram_volatile Steve Sistare
                   ` (28 more replies)
  0 siblings, 29 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Provide the cpr-save, cpr-exec, and cpr-load commands for live update.
These save and restore VM state, with minimal guest pause time, so that
qemu may be updated to a new version in between.

cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
any type of guest image and block device, but the caller must not modify
guest block devices between cpr-save and cpr-load.  It supports two modes:
reboot and restart.

In reboot mode, the caller invokes cpr-save and then terminates qemu.
The caller may then update the host kernel and system software and reboot.
The caller resumes the guest by running qemu with the same arguments as the
original process and invoking cpr-load.  To use this mode, guest ram must be
mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.

The reboot mode supports vfio devices if the caller first suspends the
guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
guest drivers' suspend methods flush outstanding requests and re-initialize
the devices, and thus there is no device state to save and restore.

Restart mode preserves the guest VM across a restart of the qemu process.
After cpr-save, the caller passes qemu command-line arguments to cpr-exec,
which directly exec's the new qemu binary.  The arguments must include -S
so new qemu starts in a paused state and waits for the cpr-load command.
The restart mode supports vfio devices by preserving the vfio container,
group, device, and event descriptors across the qemu re-exec, and by
updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
and integrated in Linux kernel 5.12.

To use the restart mode, qemu must be started with the memfd-alloc option,
which allocates guest ram using memfd_create.  The memfd's are saved to
the environment and kept open across exec, after which they are found from
the environment and re-mmap'd.  Hence guest ram is preserved in place,
albeit with new virtual addresses in the qemu process.
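
The descriptor handoff described above can be sketched in a few lines of
C.  This is a minimal illustration, not the code from the series: the
helper names and the DEMO_RAM_FD variable are hypothetical, and the exec
itself is only marked by a comment so the sketch stays in one process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Clear FD_CLOEXEC so the descriptor stays open across exec(). */
int keep_fd_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Record the descriptor number in the environment (name is hypothetical). */
int save_fd_to_env(const char *name, int fd)
{
    char value[16];

    assert(fd >= 0);
    snprintf(value, sizeof(value), "%d", fd);
    return setenv(name, value, 1);
}

/* Find the descriptor again, as the new process image would after exec(). */
int load_fd_from_env(const char *name)
{
    const char *value = getenv(name);

    return value ? atoi(value) : -1;
}

/*
 * Create a memfd standing in for guest ram, publish its descriptor, then
 * find it and map it a second time.  Returns 0 if the second mapping sees
 * the data written through the first one at a different virtual address.
 */
int demo_preserve_ram(void)
{
    const size_t size = 4096;
    const char *env_name = "DEMO_RAM_FD";
    char *old_map, *new_map;
    int fd, found, ret = -1;

    fd = memfd_create("demo-ram", MFD_CLOEXEC);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, size) < 0) {
        goto out_close;
    }
    old_map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (old_map == MAP_FAILED) {
        goto out_close;
    }
    strcpy(old_map, "guest data");

    if (keep_fd_across_exec(fd) < 0 || save_fd_to_env(env_name, fd) < 0) {
        goto out_unmap;
    }

    /* A real update would exec() the new binary at this point. */

    found = load_fd_from_env(env_name);
    new_map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, found, 0);
    if (new_map == MAP_FAILED) {
        goto out_unmap;
    }
    if (new_map != old_map && strcmp(new_map, "guest data") == 0) {
        ret = 0;
    }
    munmap(new_map, size);
out_unmap:
    munmap(old_map, size);
out_close:
    close(fd);
    return ret;
}
```

Because the mapping is MAP_SHARED on the memfd, the pages belong to the
file rather than to the process, which is why a second mmap of the same
descriptor sees the same contents.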

The caller resumes the guest by invoking cpr-load, which loads state from
the file. If the VM was running at cpr-save time, then VM execution resumes.
If the VM was suspended at cpr-save time (reboot mode), then the caller must
issue a system_wakeup command to resume.

The first patches add reboot mode:
  - memory: qemu_check_ram_volatile
  - migration: fix populate_vfio_info
  - migration: qemu file wrappers
  - migration: simplify savevm
  - vl: start on wakeup request
  - cpr: reboot mode
  - cpr: reboot HMP interfaces

The next patches add restart mode:
  - memory: flat section iterator
  - oslib: qemu_clear_cloexec
  - machine: memfd-alloc option
  - qapi: list utility functions
  - vl: helper to request re-exec
  - cpr: preserve extra state
  - cpr: restart mode
  - cpr: restart HMP interfaces
  - hostmem-memfd: cpr for memory-backend-memfd

The next patches add vfio support for restart mode:
  - pci: export functions for cpr
  - vfio-pci: refactor for cpr
  - vfio-pci: cpr part 1 (fd and dma)
  - vfio-pci: cpr part 2 (msi)
  - vfio-pci: cpr part 3 (intx)

The next patches preserve various descriptor-based backend devices across
cpr-exec:
  - vhost: reset vhost devices for cpr
  - chardev: cpr framework
  - chardev: cpr for simple devices
  - chardev: cpr for pty
  - chardev: cpr for sockets
  - cpr: only-cpr-capable option

Here is an example of updating qemu from v4.2.0 to v4.2.1 using
restart mode.  The software update is performed while the guest is
running to minimize downtime.

window 1                                        | window 2
                                                |
# qemu-system-x86_64 ...                        |
QEMU 4.2.0 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: running                              |
                                                | # yum update qemu
(qemu) cpr-save /tmp/qemu.sav restart           |
(qemu) cpr-exec qemu-system-x86_64 -S ...       |
QEMU 4.2.1 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: paused (prelaunch)                   |
(qemu) cpr-load /tmp/qemu.sav                   |
(qemu) info status                              |
VM status: running                              |


Here is an example of updating the host kernel using reboot mode.

window 1                                        | window 2
                                                |
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: running                              |
                                                | # yum update kernel-uek
(qemu) cpr-save /tmp/qemu.sav reboot            |
(qemu) quit                                     |
                                                |
# systemctl kexec                               |
kexec_core: Starting new kernel                 |
...                                             |
                                                |
# qemu-system-x86_64 -S mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: paused (prelaunch)                   |
(qemu) cpr-load /tmp/qemu.sav                   |
(qemu) info status                              |
VM status: running                              |

Changes from V1 to V2:
  - revert vmstate infrastructure changes
  - refactor cpr functions into new files
  - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to
    preserve memory.
  - add framework to filter chardev's that support cpr
  - save and restore vfio eventfd's
  - modify cprinfo QMP interface
  - incorporate misc review feedback
  - remove unrelated and unneeded patches
  - refactor all patches into a shorter and easier to review series

Changes from V2 to V3:
  - rebase to qemu 6.0.0
  - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
  - change memfd-alloc to a machine option
  - Use qio_channel_socket_new_fd instead of adding qio_channel_socket_new_fd
  - close monitor socket during cpr
  - fix a few unreported bugs
  - support memory-backend-memfd

Changes from V3 to V4:
  - split reboot mode into separate patches
  - add cprexec command
  - delete QEMU_START_FREEZE, argv_main, and /usr/bin/qemu-exec
  - add more checks for vfio and cpr compatibility, and recover after errors
  - save vfio pci config in vmstate
  - rename {setenv,getenv}_event_fd to {save,load}_event_fd
  - use qemu_strtol
  - change 6.0 references to 6.1
  - use strerror(), use EXIT_FAILURE, remove period from error messages
  - distribute MAINTAINERS additions to each patch

Changes from V4 to V5:
  - rebase to master

Changes from V5 to V6:
  vfio:
  - delete redundant bus_master_enable_region in vfio_pci_post_load
  - delete unmap.size warning
  - fix phys_config memory leak
  - add INTX support
  - add vfio_named_notifier_init() helper
  Other:
  - 6.1 -> 6.2
  - rename file -> filename in qapi
  - delete cprinfo.  qapi introspection serves the same purpose.
  - rename cprsave, cprexec, cprload -> cpr-save, cpr-exec, cpr-load
  - improve documentation in qapi/cpr.json
  - rename qemu_ram_volatile -> qemu_check_ram_volatile, and use
    qemu_ram_foreach_block
  - rename handle -> opaque
  - use ERRP_GUARD
  - use g_autoptr and g_autofree, and glib allocation functions
  - conform to error conventions for bool and int function return values
    and function names.
  - remove word "error" in error messages
  - rename as_flat_walk and its callback, and add comments.
  - rename qemu_clr_cloexec -> qemu_clear_cloexec
  - rename close-on-cpr -> reopen-on-cpr
  - add strList utility functions
  - factor out start on wakeup request to a separate patch
  - delete unnecessary layer (cprsave etc) and squash QMP patches
  - conditionally compile for CONFIG_VFIO

Steve Sistare (24):
  memory: qemu_check_ram_volatile
  migration: fix populate_vfio_info
  migration: qemu file wrappers
  migration: simplify savevm
  vl: start on wakeup request
  cpr: reboot mode
  memory: flat section iterator
  oslib: qemu_clear_cloexec
  machine: memfd-alloc option
  qapi: list utility functions
  vl: helper to request re-exec
  cpr: preserve extra state
  cpr: restart mode
  cpr: restart HMP interfaces
  hostmem-memfd: cpr for memory-backend-memfd
  pci: export functions for cpr
  vfio-pci: refactor for cpr
  vfio-pci: cpr part 1 (fd and dma)
  vfio-pci: cpr part 2 (msi)
  vfio-pci: cpr part 3 (intx)
  chardev: cpr framework
  chardev: cpr for simple devices
  chardev: cpr for pty
  cpr: only-cpr-capable option

Mark Kanda, Steve Sistare (3):
  cpr: reboot HMP interfaces
  vhost: reset vhost devices for cpr
  chardev: cpr for sockets

 MAINTAINERS                   |  12 ++
 backends/hostmem-memfd.c      |  21 +--
 chardev/char-mux.c            |   1 +
 chardev/char-null.c           |   1 +
 chardev/char-pty.c            |  14 +-
 chardev/char-serial.c         |   1 +
 chardev/char-socket.c         |  36 +++++
 chardev/char-stdio.c          |   8 ++
 chardev/char.c                |  43 +++++-
 gdbstub.c                     |   1 +
 hmp-commands.hx               |  50 +++++++
 hw/core/machine.c             |  19 +++
 hw/pci/msix.c                 |  20 ++-
 hw/pci/pci.c                  |   7 +-
 hw/vfio/common.c              |  79 +++++++++--
 hw/vfio/cpr.c                 | 160 ++++++++++++++++++++++
 hw/vfio/meson.build           |   1 +
 hw/vfio/pci.c                 | 301 +++++++++++++++++++++++++++++++++++++++---
 hw/vfio/trace-events          |   1 +
 hw/virtio/vhost.c             |  11 ++
 include/chardev/char.h        |   6 +
 include/exec/memory.h         |  39 ++++++
 include/hw/boards.h           |   1 +
 include/hw/pci/msix.h         |   5 +
 include/hw/pci/pci.h          |   2 +
 include/hw/vfio/vfio-common.h |   8 ++
 include/hw/virtio/vhost.h     |   1 +
 include/migration/cpr.h       |  31 +++++
 include/monitor/hmp.h         |   3 +
 include/qapi/util.h           |  28 ++++
 include/qemu/osdep.h          |   1 +
 include/sysemu/runstate.h     |   2 +
 include/sysemu/sysemu.h       |   1 +
 linux-headers/linux/vfio.h    |   6 +
 migration/cpr-state.c         | 215 ++++++++++++++++++++++++++++++
 migration/cpr.c               | 176 ++++++++++++++++++++++++
 migration/meson.build         |   2 +
 migration/migration.c         |   5 +
 migration/qemu-file-channel.c |  36 +++++
 migration/qemu-file-channel.h |   6 +
 migration/savevm.c            |  21 +--
 migration/target.c            |  24 +++-
 migration/trace-events        |   5 +
 monitor/hmp-cmds.c            |  68 ++++++----
 monitor/hmp.c                 |   3 +
 monitor/qmp.c                 |   3 +
 qapi/char.json                |   7 +-
 qapi/cpr.json                 |  76 +++++++++++
 qapi/meson.build              |   1 +
 qapi/qapi-schema.json         |   1 +
 qapi/qapi-util.c              |  37 ++++++
 qemu-options.hx               |  40 +++++-
 softmmu/globals.c             |   1 +
 softmmu/memory.c              |  46 +++++++
 softmmu/physmem.c             |  55 ++++++--
 softmmu/runstate.c            |  38 +++++-
 softmmu/vl.c                  |  18 ++-
 stubs/cpr-state.c             |  15 +++
 stubs/cpr.c                   |   3 +
 stubs/meson.build             |   2 +
 trace-events                  |   1 +
 util/oslib-posix.c            |   9 ++
 util/oslib-win32.c            |   4 +
 util/qemu-config.c            |   4 +
 64 files changed, 1732 insertions(+), 111 deletions(-)
 create mode 100644 hw/vfio/cpr.c
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr-state.c
 create mode 100644 migration/cpr.c
 create mode 100644 qapi/cpr.json
 create mode 100644 stubs/cpr-state.c
 create mode 100644 stubs/cpr.c

-- 
1.8.3.1




* [PATCH V6 01/27] memory: qemu_check_ram_volatile
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 02/27] migration: fix populate_vfio_info Steve Sistare
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add a function that returns an error if any ram_list block represents
volatile memory.
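
The callback pattern used here (visit each block, stop at the first
non-zero return, and hand the offending name back through an opaque
pointer) can be sketched on its own.  The DemoBlock type and the demo_*
functions below are hypothetical stand-ins, not QEMU's RAMBlock API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the RAMBlock properties the check looks at. */
typedef struct DemoBlock {
    const char *name;
    bool is_rom;      /* read-only contents need no preservation */
    bool is_shared;   /* mapped shared rather than private */
    int fd;           /* backing descriptor, or -1 for anonymous memory */
} DemoBlock;

typedef int (*DemoBlockFn)(const DemoBlock *block, void *opaque);

/* Visit blocks in order; stop at the first non-zero return and return it. */
int demo_foreach_block(const DemoBlock *blocks, size_t count,
                       DemoBlockFn fn, void *opaque)
{
    assert(fn != NULL);
    for (size_t i = 0; i < count; i++) {
        int ret = fn(&blocks[i], opaque);

        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Writable and not backed by a shared file: its contents would be lost. */
int demo_check_volatile(const DemoBlock *block, void *opaque)
{
    if (!block->is_rom && (block->fd == -1 || !block->is_shared)) {
        *(const char **)opaque = block->name;
        return -1;
    }
    return 0;
}

/* Return the name of the first volatile block, or NULL if there is none. */
const char *demo_first_volatile(const DemoBlock *blocks, size_t count)
{
    const char *name = NULL;

    demo_foreach_block(blocks, count, demo_check_volatile, &name);
    return name;
}
```

Stopping at the first hit keeps the error message specific: the caller
learns which region to fix rather than only that one exists.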

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/memory.h |  8 ++++++++
 softmmu/memory.c      | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index c3d417d..0e6d364 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2925,6 +2925,14 @@ bool ram_block_discard_is_disabled(void);
  */
 bool ram_block_discard_is_required(void);
 
+/**
+ * qemu_check_ram_volatile: return -1 if any memory regions are writable and
+ * not backed by shared memory, else return 0.
+ *
+ * @errp: returned error message identifying the first volatile region found.
+ */
+int qemu_check_ram_volatile(Error **errp);
+
 #endif
 
 #endif
diff --git a/softmmu/memory.c b/softmmu/memory.c
index bfedaf9..e143692 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2809,6 +2809,32 @@ void memory_global_dirty_log_stop(void)
     memory_global_dirty_log_do_stop();
 }
 
+static int check_volatile(RAMBlock *rb, void *opaque)
+{
+    MemoryRegion *mr = rb->mr;
+
+    if (mr &&
+        memory_region_is_ram(mr) &&
+        !memory_region_is_ram_device(mr) &&
+        !memory_region_is_rom(mr) &&
+        (rb->fd == -1 || !qemu_ram_is_shared(rb))) {
+        *(const char **)opaque = memory_region_name(mr);
+        return -1;
+    }
+    return 0;
+}
+
+int qemu_check_ram_volatile(Error **errp)
+{
+    const char *name = NULL;
+
+    if (qemu_ram_foreach_block(check_volatile, &name)) {
+        error_setg(errp, "Memory region %s is volatile", name);
+        return -1;
+    }
+    return 0;
+}
+
 static void listener_add_address_space(MemoryListener *listener,
                                        AddressSpace *as)
 {
-- 
1.8.3.1




* [PATCH V6 02/27] migration: fix populate_vfio_info
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 01/27] memory: qemu_check_ram_volatile Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 03/27] migration: qemu file wrappers Steve Sistare
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Include CONFIG_DEVICES so that populate_vfio_info is instantiated for
CONFIG_VFIO.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/target.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/migration/target.c b/migration/target.c
index 907ebf0..4390bf0 100644
--- a/migration/target.c
+++ b/migration/target.c
@@ -8,18 +8,22 @@
 #include "qemu/osdep.h"
 #include "qapi/qapi-types-migration.h"
 #include "migration.h"
+#include CONFIG_DEVICES
 
 #ifdef CONFIG_VFIO
+
 #include "hw/vfio/vfio-common.h"
-#endif
 
 void populate_vfio_info(MigrationInfo *info)
 {
-#ifdef CONFIG_VFIO
     if (vfio_mig_active()) {
         info->has_vfio = true;
         info->vfio = g_malloc0(sizeof(*info->vfio));
         info->vfio->transferred = vfio_mig_bytes_transferred();
     }
-#endif
 }
+#else
+
+void populate_vfio_info(MigrationInfo *info) {}
+
+#endif /* CONFIG_VFIO */
-- 
1.8.3.1




* [PATCH V6 03/27] migration: qemu file wrappers
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 01/27] memory: qemu_check_ram_volatile Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 02/27] migration: fix populate_vfio_info Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 04/27] migration: simplify savevm Steve Sistare
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add qemu_file_open and qemu_fd_open to create QEMUFile objects for unix
files and file descriptors.
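
A QEMUFile moves data in one direction only, so the wrapper has to reject
read-write opens and pick the input or output side from the access mode.
The sketch below shows that decision with plain POSIX calls.  The name
demo_open_one_way is hypothetical, and the sketch masks with O_ACCMODE,
which is an assumption of this example rather than what the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Open path for reading or for writing, never both.  On success return the
 * descriptor and report the direction in *is_output.  On failure return -1
 * with errno set.
 */
int demo_open_one_way(const char *path, int flags, mode_t mode,
                      bool *is_output)
{
    int access = flags & O_ACCMODE;   /* isolate the access-mode bits */
    int fd;

    assert(path != NULL && is_output != NULL);
    if (access == O_RDWR) {
        errno = EINVAL;
        return -1;
    }

    fd = open(path, flags, mode);
    if (fd >= 0) {
        *is_output = (access == O_WRONLY);
    }
    return fd;
}
```

Masking with O_ACCMODE avoids depending on the numeric values of
O_RDONLY, O_WRONLY, and O_RDWR, which POSIX does not fix.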

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/qemu-file-channel.c | 36 ++++++++++++++++++++++++++++++++++++
 migration/qemu-file-channel.h |  6 ++++++
 2 files changed, 42 insertions(+)

diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
index bb5a575..afb16d7 100644
--- a/migration/qemu-file-channel.c
+++ b/migration/qemu-file-channel.c
@@ -27,8 +27,10 @@
 #include "qemu-file.h"
 #include "io/channel-socket.h"
 #include "io/channel-tls.h"
+#include "io/channel-file.h"
 #include "qemu/iov.h"
 #include "qemu/yank.h"
+#include "qapi/error.h"
 #include "yank_functions.h"
 
 
@@ -192,3 +194,37 @@ QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
     object_ref(OBJECT(ioc));
     return qemu_fopen_ops(ioc, &channel_output_ops, true);
 }
+
+QEMUFile *qemu_file_open(const char *path, int flags, int mode,
+                         const char *name, Error **errp)
+{
+    g_autoptr(QIOChannelFile) fioc = NULL;
+    QIOChannel *ioc;
+    QEMUFile *f;
+
+    if (flags & O_RDWR) {
+        error_setg(errp, "qemu_file_open %s: O_RDWR not supported", path);
+        return NULL;
+    }
+
+    fioc = qio_channel_file_new_path(path, flags, mode, errp);
+    if (!fioc) {
+        return NULL;
+    }
+
+    ioc = QIO_CHANNEL(fioc);
+    qio_channel_set_name(ioc, name);
+    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
+                             qemu_fopen_channel_input(ioc);
+    return f;
+}
+
+QEMUFile *qemu_fd_open(int fd, bool writable, const char *name)
+{
+    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+    QIOChannel *ioc = QIO_CHANNEL(fioc);
+    QEMUFile *f = writable ? qemu_fopen_channel_output(ioc) :
+                             qemu_fopen_channel_input(ioc);
+    qio_channel_set_name(ioc, name);
+    return f;
+}
diff --git a/migration/qemu-file-channel.h b/migration/qemu-file-channel.h
index 0028a09..324ae2d 100644
--- a/migration/qemu-file-channel.h
+++ b/migration/qemu-file-channel.h
@@ -29,4 +29,10 @@
 
 QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc);
 QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc);
+
+QEMUFile *qemu_file_open(const char *path, int flags, int mode,
+                         const char *name, Error **errp);
+
+QEMUFile *qemu_fd_open(int fd, bool writable, const char *name);
+
 #endif
-- 
1.8.3.1




* [PATCH V6 04/27] migration: simplify savevm
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (2 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 03/27] migration: qemu file wrappers Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 05/27] vl: start on wakeup request Steve Sistare
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Use qemu_file_open to simplify a few functions in savevm.c.
No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/savevm.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 7b7b64b..bdd6ef8 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2908,8 +2908,9 @@ bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
 void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
                                 Error **errp)
 {
+    const char *ioc_name = "migration-xen-save-state";
+    int flags = O_WRONLY | O_CREAT | O_TRUNC;
     QEMUFile *f;
-    QIOChannelFile *ioc;
     int saved_vm_running;
     int ret;
 
@@ -2923,14 +2924,10 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
     vm_stop(RUN_STATE_SAVE_VM);
     global_state_store_running();
 
-    ioc = qio_channel_file_new_path(filename, O_WRONLY | O_CREAT | O_TRUNC,
-                                    0660, errp);
-    if (!ioc) {
+    f = qemu_file_open(filename, flags, 0660, ioc_name, errp);
+    if (!f) {
         goto the_end;
     }
-    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-save-state");
-    f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
-    object_unref(OBJECT(ioc));
     ret = qemu_save_device_state(f);
     if (ret < 0 || qemu_fclose(f) < 0) {
         error_setg(errp, QERR_IO_ERROR);
@@ -2958,8 +2955,8 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
 
 void qmp_xen_load_devices_state(const char *filename, Error **errp)
 {
+    const char *ioc_name = "migration-xen-load-state";
     QEMUFile *f;
-    QIOChannelFile *ioc;
     int ret;
 
     /* Guest must be paused before loading the device state; the RAM state
@@ -2971,14 +2968,10 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
     }
     vm_stop(RUN_STATE_RESTORE_VM);
 
-    ioc = qio_channel_file_new_path(filename, O_RDONLY | O_BINARY, 0, errp);
-    if (!ioc) {
+    f = qemu_file_open(filename, O_RDONLY | O_BINARY, 0, ioc_name, errp);
+    if (!f) {
         return;
     }
-    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-load-state");
-    f = qemu_fopen_channel_input(QIO_CHANNEL(ioc));
-    object_unref(OBJECT(ioc));
-
     ret = qemu_loadvm_state(f);
     qemu_fclose(f);
     if (ret < 0) {
-- 
1.8.3.1




* [PATCH V6 05/27] vl: start on wakeup request
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (3 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 04/27] migration: simplify savevm Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 06/27] cpr: reboot mode Steve Sistare
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

If qemu starts and loads a VM in the suspended state, then a later wakeup
request will set the state to running, which is not sufficient to initialize
the vm, as vm_start was never called during this invocation of qemu.  See
qemu_system_wakeup_request().

Define the qemu_system_start_on_wakeup_request() hook, which sets the
start_on_wakeup_requested flag so that vm_start() is called when the next
wakeup request is processed.
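
The behavior amounts to a one-shot flag that upgrades the next wakeup from
a bare state change to a full start.  The model below is a simplified
sketch with hypothetical demo_* names.  The full_starts counter stands in
for the extra work that vm_start() does beyond setting the run state.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { DEMO_SUSPENDED, DEMO_RUNNING } DemoRunState;

typedef struct DemoVM {
    DemoRunState state;
    bool start_on_wakeup;   /* one-shot request, cleared when honored */
    int full_starts;        /* how many times the full start path ran */
} DemoVM;

/* Full start path: does more than flip the state in a real VM. */
void demo_vm_start(DemoVM *vm)
{
    vm->full_starts++;
    vm->state = DEMO_RUNNING;
}

/* Ask that the next wakeup perform a full start. */
void demo_request_start_on_wakeup(DemoVM *vm)
{
    vm->start_on_wakeup = true;
}

/* Process a wakeup request for a suspended VM. */
void demo_wakeup(DemoVM *vm)
{
    assert(vm->state == DEMO_SUSPENDED);
    if (vm->start_on_wakeup) {
        vm->start_on_wakeup = false;
        demo_vm_start(vm);
    } else {
        vm->state = DEMO_RUNNING;
    }
}
```

Clearing the flag when it is honored means a later suspend and wakeup in
the same process takes the ordinary path again.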

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/sysemu/runstate.h |  1 +
 softmmu/runstate.c        | 17 ++++++++++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index a535691..b655c7b 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -51,6 +51,7 @@ void qemu_system_reset_request(ShutdownCause reason);
 void qemu_system_suspend_request(void);
 void qemu_register_suspend_notifier(Notifier *notifier);
 bool qemu_wakeup_suspend_enabled(void);
+void qemu_system_start_on_wakeup_request(void);
 void qemu_system_wakeup_request(WakeupReason reason, Error **errp);
 void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
 void qemu_register_wakeup_notifier(Notifier *notifier);
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 10d9b73..3d344c9 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -115,6 +115,8 @@ static const RunStateTransition runstate_transitions_def[] = {
     { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING },
     { RUN_STATE_PRELAUNCH, RUN_STATE_FINISH_MIGRATE },
     { RUN_STATE_PRELAUNCH, RUN_STATE_INMIGRATE },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_SUSPENDED },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_PAUSED },
 
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PAUSED },
@@ -335,6 +337,7 @@ void vm_state_notify(bool running, RunState state)
     }
 }
 
+static bool start_on_wakeup_requested;
 static ShutdownCause reset_requested;
 static ShutdownCause shutdown_requested;
 static int shutdown_signal;
@@ -562,6 +565,11 @@ void qemu_register_suspend_notifier(Notifier *notifier)
     notifier_list_add(&suspend_notifiers, notifier);
 }
 
+void qemu_system_start_on_wakeup_request(void)
+{
+    start_on_wakeup_requested = true;
+}
+
 void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
 {
     trace_system_wakeup_request(reason);
@@ -574,7 +582,14 @@ void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
     if (!(wakeup_reason_mask & (1 << reason))) {
         return;
     }
-    runstate_set(RUN_STATE_RUNNING);
+
+    if (start_on_wakeup_requested) {
+        start_on_wakeup_requested = false;
+        vm_start();
+    } else {
+        runstate_set(RUN_STATE_RUNNING);
+    }
+
     wakeup_reason = reason;
     qemu_notify_event();
 }
-- 
1.8.3.1




* [PATCH V6 06/27] cpr: reboot mode
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (4 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 05/27] vl: start on wakeup request Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 07/27] cpr: reboot HMP interfaces Steve Sistare
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Provide the cpr-save and cpr-load functions for live update.  These save and
restore VM state, with minimal guest pause time, so that qemu may be updated
to a new version in between.

cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
any type of guest image and block device, but the caller must not modify
guest block devices between cpr-save and cpr-load.

cpr-save supports several modes, the first of which is reboot. In this mode,
the caller invokes cpr-save and then terminates qemu.  The caller may then
update the host kernel and system software and reboot.  The caller resumes
the guest by running qemu with the same arguments as the original process
and invoking cpr-load.  To use this mode, guest ram must be mapped to a
persistent shared memory file such as /dev/dax0.0 or /dev/shm PKRAM.

The reboot mode supports vfio devices if the caller first suspends the
guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
guest drivers' suspend methods flush outstanding requests and re-initialize
the devices, and thus there is no device state to save and restore.

cpr-load loads state from the file.  If the VM was running at cpr-save time,
then VM execution resumes.  If the VM was suspended at cpr-save time, then
the caller must issue a system_wakeup command to resume.

cpr-save syntax:
  { 'enum': 'CprMode', 'data': [ 'reboot' ] }
  { 'command': 'cpr-save', 'data': { 'filename': 'str', 'mode': 'CprMode' }}

cpr-load syntax:
  { 'command': 'cpr-load', 'data': { 'filename': 'str' } }

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS             |   8 +++
 include/migration/cpr.h |  17 ++++++
 migration/cpr.c         | 136 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build   |   1 +
 qapi/cpr.json           |  56 ++++++++++++++++++++
 qapi/meson.build        |   1 +
 qapi/qapi-schema.json   |   1 +
 7 files changed, 220 insertions(+)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr.c
 create mode 100644 qapi/cpr.json

diff --git a/MAINTAINERS b/MAINTAINERS
index 37b1a8e..2611ca6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2900,6 +2900,14 @@ F: net/colo*
 F: net/filter-rewriter.c
 F: net/filter-mirror.c
 
+CPR
+M: Steve Sistare <steven.sistare@oracle.com>
+M: Mark Kanda <mark.kanda@oracle.com>
+S: Maintained
+F: include/migration/cpr.h
+F: migration/cpr.c
+F: qapi/cpr.json
+
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
 R: Paolo Bonzini <pbonzini@redhat.com>
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
new file mode 100644
index 0000000..a76429a
--- /dev/null
+++ b/include/migration/cpr.h
@@ -0,0 +1,17 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef MIGRATION_CPR_H
+#define MIGRATION_CPR_H
+
+#include "qapi/qapi-types-cpr.h"
+
+#define CPR_MODE_NONE ((CprMode)(-1))
+
+CprMode cpr_mode(void);
+
+#endif
diff --git a/migration/cpr.c b/migration/cpr.c
new file mode 100644
index 0000000..1ec903f
--- /dev/null
+++ b/migration/cpr.c
@@ -0,0 +1,136 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "exec/memory.h"
+#include "io/channel-buffer.h"
+#include "io/channel-file.h"
+#include "migration.h"
+#include "migration/cpr.h"
+#include "migration/global_state.h"
+#include "migration/misc.h"
+#include "migration/snapshot.h"
+#include "qapi/error.h"
+#include "qapi/qapi-commands-cpr.h"
+#include "qapi/qmp/qerror.h"
+#include "qemu-file-channel.h"
+#include "qemu-file.h"
+#include "savevm.h"
+#include "sysemu/cpu-timers.h"
+#include "sysemu/replay.h"
+#include "sysemu/runstate.h"
+#include "sysemu/runstate-action.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/xen.h"
+
+static CprMode cpr_active_mode = CPR_MODE_NONE;
+
+CprMode cpr_mode(void)
+{
+    return cpr_active_mode;
+}
+
+void qmp_cpr_save(const char *filename, CprMode mode, Error **errp)
+{
+    int ret;
+    QEMUFile *f;
+    int flags = O_CREAT | O_WRONLY | O_TRUNC;
+    int saved_vm_running = runstate_is_running();
+
+    if (qemu_check_ram_volatile(errp)) {
+        return;
+    }
+
+    if (migrate_colo_enabled()) {
+        error_setg(errp, "cpr-save does not support x-colo");
+        return;
+    }
+
+    if (replay_mode != REPLAY_MODE_NONE) {
+        error_setg(errp, "cpr-save does not support replay");
+        return;
+    }
+
+    if (global_state_store()) {
+        error_setg(errp, "Error saving global state");
+        return;
+    }
+
+    f = qemu_file_open(filename, flags, 0600, "cpr-save", errp);
+    if (!f) {
+        return;
+    }
+
+    if (runstate_check(RUN_STATE_SUSPENDED)) {
+        /* Update timers_state before saving.  Suspend did not do so. */
+        cpu_disable_ticks();
+    }
+    vm_stop(RUN_STATE_SAVE_VM);
+
+    cpr_active_mode = mode;
+    ret = qemu_save_device_state(f);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_setg(errp, "Error %d while saving VM state", ret);
+        goto err;
+    }
+
+    return;
+
+err:
+    if (saved_vm_running) {
+        vm_start();
+    }
+    cpr_active_mode = CPR_MODE_NONE;
+}
+
+void qmp_cpr_load(const char *filename, Error **errp)
+{
+    QEMUFile *f;
+    int ret;
+    RunState state;
+
+    if (runstate_is_running()) {
+        error_setg(errp, "cpr-load called for a running VM");
+        return;
+    }
+
+    f = qemu_file_open(filename, O_RDONLY, 0, "cpr-load", errp);
+    if (!f) {
+        return;
+    }
+
+    if (qemu_get_be32(f) != QEMU_VM_FILE_MAGIC ||
+        qemu_get_be32(f) != QEMU_VM_FILE_VERSION) {
+        error_setg(errp, "%s is not a vmstate file", filename);
+        qemu_fclose(f);
+        return;
+    }
+
+    cpr_active_mode = CPR_MODE_REBOOT;  /* generalized in a later patch */
+
+    ret = qemu_load_device_state(f);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_setg(errp, "Error %d while loading VM state", ret);
+        goto out;
+    }
+
+    state = global_state_get_runstate();
+    if (state == RUN_STATE_RUNNING) {
+        vm_start();
+    } else {
+        runstate_set(state);
+        if (runstate_check(RUN_STATE_SUSPENDED)) {
+            /* Force vm_start to be called later. */
+            qemu_system_start_on_wakeup_request();
+        }
+    }
+
+out:
+    cpr_active_mode = CPR_MODE_NONE;
+}
diff --git a/migration/meson.build b/migration/meson.build
index f8714dc..fd59281 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -15,6 +15,7 @@ softmmu_ss.add(files(
   'channel.c',
   'colo-failover.c',
   'colo.c',
+  'cpr.c',
   'exec.c',
   'fd.c',
   'global_state.c',
diff --git a/qapi/cpr.json b/qapi/cpr.json
new file mode 100644
index 0000000..2edd08e
--- /dev/null
+++ b/qapi/cpr.json
@@ -0,0 +1,56 @@
+# -*- Mode: Python -*-
+#
+# Copyright (c) 2021 Oracle and/or its affiliates.
+#
+# This work is licensed under the terms of the GNU GPL, version 2.
+# See the COPYING file in the top-level directory.
+
+##
+# = CPR - CheckPoint and Restart
+##
+
+{ 'include': 'common.json' }
+
+##
+# @CprMode:
+#
+# @reboot: checkpoint can be cpr-load'ed after a host kexec reboot.
+#
+# Since: 6.2
+##
+{ 'enum': 'CprMode',
+  'data': [ 'reboot' ] }
+
+##
+# @cpr-save:
+#
+# Create a checkpoint of the virtual machine device state in @filename.
+# Unlike snapshot-save, this command completes synchronously, saves state
+# to an ordinary file, and does not save guest RAM or guest block device
+# blocks.  The caller must not modify guest block devices between cpr-save
+# and cpr-load.
+#
+# For reboot mode, all guest RAM objects must be non-volatile across reboot,
+# and created with the share=on parameter.
+#
+# @filename: name of checkpoint file
+# @mode: @CprMode mode
+#
+# Since: 6.2
+##
+{ 'command': 'cpr-save',
+  'data': { 'filename': 'str',
+            'mode': 'CprMode' } }
+
+##
+# @cpr-load:
+#
+# Start virtual machine from checkpoint file that was created earlier using
+# the cpr-save command.
+#
+# @filename: name of checkpoint file
+#
+# Since: 6.2
+##
+{ 'command': 'cpr-load',
+  'data': { 'filename': 'str' } }
diff --git a/qapi/meson.build b/qapi/meson.build
index c356a38..73ece6a 100644
--- a/qapi/meson.build
+++ b/qapi/meson.build
@@ -27,6 +27,7 @@ qapi_all_modules = [
   'common',
   'compat',
   'control',
+  'cpr',
   'crypto',
   'dump',
   'error',
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index 4912b97..001d790 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -77,6 +77,7 @@
 { 'include': 'ui.json' }
 { 'include': 'authz.json' }
 { 'include': 'migration.json' }
+{ 'include': 'cpr.json' }
 { 'include': 'transaction.json' }
 { 'include': 'trace.json' }
 { 'include': 'compat.json' }
-- 
1.8.3.1




* [PATCH V6 07/27] cpr: reboot HMP interfaces
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (5 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 06/27] cpr: reboot mode Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 08/27] memory: flat section iterator Steve Sistare
                   ` (21 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

cpr-save <filename> <mode>
  Call qmp_cpr_save().
  Arguments:
    filename : save vmstate to filename
    mode: must be "reboot"

cpr-load <filename>
  Call qmp_cpr_load().
  Arguments:
    filename : load vmstate from filename
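
The HMP handler receives the mode as a string and must map it to a CprMode
value before calling qmp_cpr_save().  The standalone sketch below models that
lookup.  It is only an illustration: Mode, mode_names, and mode_parse are
hypothetical names, not QEMU symbols, and the real qapi_enum_parse() also
sets an Error for an unknown name, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the single-valued CprMode enum. */
typedef enum { MODE_REBOOT, MODE__MAX } Mode;

static const char *const mode_names[MODE__MAX] = {
    [MODE_REBOOT] = "reboot",
};

/* Return the enum value named by buf, or def if buf is NULL or unknown. */
static int mode_parse(const char *buf, int def)
{
    if (!buf) {
        return def;
    }
    for (int i = 0; i < MODE__MAX; i++) {
        if (strcmp(buf, mode_names[i]) == 0) {
            return i;
        }
    }
    return def;
}
```

With a default of -1, as hmp_cpr_save() uses, any string other than "reboot"
yields -1 and the command is rejected before any state is saved.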

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hmp-commands.hx       | 31 +++++++++++++++++++++++++++++++
 include/monitor/hmp.h |  2 ++
 monitor/hmp-cmds.c    | 28 ++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 8e45bce..0a45c59 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -351,6 +351,37 @@ SRST
 ERST
 
     {
+        .name       = "cpr-save",
+        .args_type  = "filename:s,mode:s",
+        .params     = "filename 'reboot'",
+        .help       = "create a checkpoint of the VM in file",
+        .cmd        = hmp_cpr_save,
+    },
+
+SRST
+``cpr-save`` *filename* *mode*
+Pause the VCPUs,
+create a checkpoint of the whole virtual machine, and save it in *filename*.
+If *mode* is 'reboot', the checkpoint remains valid after a host kexec
+reboot, and guest ram must be backed by persistent shared memory.  To
+resume from the checkpoint, issue the quit command, reboot the system,
+and issue the cpr-load command.
+ERST
+
+    {
+        .name       = "cpr-load",
+        .args_type  = "filename:s",
+        .params     = "filename",
+        .help       = "load VM checkpoint from file",
+        .cmd        = hmp_cpr_load,
+    },
+
+SRST
+``cpr-load`` *filename*
+Load a virtual machine from checkpoint file *filename* and continue VCPUs.
+ERST
+
+    {
         .name       = "delvm",
         .args_type  = "name:s",
         .params     = "tag",
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 3baa105..01b5df8 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -58,6 +58,8 @@ void hmp_balloon(Monitor *mon, const QDict *qdict);
 void hmp_loadvm(Monitor *mon, const QDict *qdict);
 void hmp_savevm(Monitor *mon, const QDict *qdict);
 void hmp_delvm(Monitor *mon, const QDict *qdict);
+void hmp_cpr_save(Monitor *mon, const QDict *qdict);
+void hmp_cpr_load(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
 void hmp_migrate_incoming(Monitor *mon, const QDict *qdict);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index e00255f..6aed6ac 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -33,6 +33,7 @@
 #include "qapi/qapi-commands-block.h"
 #include "qapi/qapi-commands-char.h"
 #include "qapi/qapi-commands-control.h"
+#include "qapi/qapi-commands-cpr.h"
 #include "qapi/qapi-commands-machine.h"
 #include "qapi/qapi-commands-migration.h"
 #include "qapi/qapi-commands-misc.h"
@@ -1177,6 +1178,33 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
     qapi_free_AnnounceParameters(params);
 }
 
+void hmp_cpr_save(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    const char *mode;
+    int val;
+
+    mode = qdict_get_try_str(qdict, "mode");
+    val = qapi_enum_parse(&CprMode_lookup, mode, -1, &err);
+
+    if (val == -1) {
+        goto out;
+    }
+
+    qmp_cpr_save(qdict_get_try_str(qdict, "filename"), val, &err);
+
+out:
+    hmp_handle_error(mon, err);
+}
+
+void hmp_cpr_load(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+
+    qmp_cpr_load(qdict_get_try_str(qdict, "filename"), &err);
+    hmp_handle_error(mon, err);
+}
+
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict)
 {
     qmp_migrate_cancel(NULL);
-- 
1.8.3.1




* [PATCH V6 08/27] memory: flat section iterator
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (6 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 07/27] cpr: reboot HMP interfaces Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 09/27] oslib: qemu_clear_cloexec Steve Sistare
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add an iterator over the sections of a flattened address space.
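
The callback contract can be illustrated with a small standalone model.
This is only a sketch: Range, range_cb, and for_each_range are hypothetical
stand-ins for MemoryRegionSection, memory_region_section_cb, and
address_space_flat_for_each_section(), and the Error ** argument is left out.
It shows that a non-zero return from the callback stops the walk and is
handed back to the caller unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MemoryRegionSection. */
typedef struct Range {
    unsigned long start;
    unsigned long size;
} Range;

/* Return 0 to continue the walk, non-zero to stop it. */
typedef int (*range_cb)(const Range *r, void *opaque);

/* Call func for each range; propagate the first non-zero result. */
static int for_each_range(const Range *ranges, size_t n,
                          range_cb func, void *opaque)
{
    assert(func != NULL);

    for (size_t i = 0; i < n; i++) {
        int ret = func(&ranges[i], opaque);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Example callback: sum the sizes, and fail with -1 on an empty range. */
static int sum_sizes(const Range *r, void *opaque)
{
    unsigned long *total = opaque;

    if (r->size == 0) {
        return -1;
    }
    *total += r->size;
    return 0;
}
```

A caller passes its accumulator through opaque, for example
for_each_range(ranges, n, sum_sizes, &total), and treats any non-zero
result as failure.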

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/memory.h | 31 +++++++++++++++++++++++++++++++
 softmmu/memory.c      | 20 ++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0e6d364..2bb6772 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2286,6 +2286,37 @@ void memory_region_set_ram_discard_manager(MemoryRegion *mr,
                                            RamDiscardManager *rdm);
 
 /**
+ * memory_region_section_cb: callback for address_space_flat_for_each_section()
+ *
+ * @s: MemoryRegionSection of the range
+ * @opaque: data pointer passed to address_space_flat_for_each_section()
+ * @errp: error message, returned to the address_space_flat_for_each_section
+ *        caller.
+ *
+ * Returns: non-zero to stop the iteration, and 0 to continue.  The same
+ * non-zero value is returned to the address_space_flat_for_each_section caller.
+ */
+
+typedef int (*memory_region_section_cb)(MemoryRegionSection *s,
+                                        void *opaque,
+                                        Error **errp);
+
+/**
+ * address_space_flat_for_each_section: walk the ranges in the address space
+ * flat view and call @func for each.  Return 0 on success, else return non-zero
+ * with a message in @errp.
+ *
+ * @as: target address space
+ * @func: callback function
+ * @opaque: passed to @func
+ * @errp: passed to @func
+ */
+int address_space_flat_for_each_section(AddressSpace *as,
+                                        memory_region_section_cb func,
+                                        void *opaque,
+                                        Error **errp);
+
+/**
  * memory_region_find: translate an address/size relative to a
  * MemoryRegion into a #MemoryRegionSection.
  *
diff --git a/softmmu/memory.c b/softmmu/memory.c
index e143692..45952fc 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2645,6 +2645,26 @@ bool memory_region_is_mapped(MemoryRegion *mr)
     return mr->container ? true : false;
 }
 
+int address_space_flat_for_each_section(AddressSpace *as,
+                                        memory_region_section_cb func,
+                                        void *opaque,
+                                        Error **errp)
+{
+    FlatView *view = address_space_get_flatview(as);
+    FlatRange *fr;
+    int ret;
+
+    FOR_EACH_FLAT_RANGE(fr, view) {
+        MemoryRegionSection section = section_from_flat_range(fr, view);
+        ret = func(&section, opaque, errp);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 /* Same as memory_region_find, but it does not add a reference to the
  * returned region.  It must be called from an RCU critical section.
  */
-- 
1.8.3.1




* [PATCH V6 09/27] oslib: qemu_clear_cloexec
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (7 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 08/27] memory: flat section iterator Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 10/27] machine: memfd-alloc option Steve Sistare
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Define qemu_clear_cloexec, analogous to qemu_set_cloexec.
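
The effect can be checked outside QEMU with plain fcntl() calls.  The sketch
below is a standalone illustration, not the QEMU code: set_cloexec,
clear_cloexec, and is_cloexec are hypothetical helper names.  A descriptor
with FD_CLOEXEC cleared stays open in the new program image after exec().

```c
#include <assert.h>
#include <fcntl.h>

/* Set FD_CLOEXEC, in the manner of qemu_set_cloexec(). */
static void set_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);
    assert(f != -1);
    f = fcntl(fd, F_SETFD, f | FD_CLOEXEC);
    assert(f != -1);
}

/* Clear FD_CLOEXEC so that fd stays open across exec(). */
static void clear_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);
    assert(f != -1);
    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
    assert(f != -1);
}

/* Return non-zero if FD_CLOEXEC is currently set on fd. */
static int is_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);
    assert(f != -1);
    return (f & FD_CLOEXEC) != 0;
}
```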

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/qemu/osdep.h | 1 +
 util/oslib-posix.c   | 9 +++++++++
 util/oslib-win32.c   | 4 ++++
 3 files changed, 14 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 60718fc..1ad7714 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -637,6 +637,7 @@ static inline void qemu_timersub(const struct timeval *val1,
 #endif
 
 void qemu_set_cloexec(int fd);
+void qemu_clear_cloexec(int fd);
 
 /* Starting on QEMU 2.5, qemu_hw_version() returns "2.5+" by default
  * instead of QEMU_VERSION, so setting hw_version on MachineClass
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index e8bdb02..7913334 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -309,6 +309,15 @@ void qemu_set_cloexec(int fd)
     assert(f != -1);
 }
 
+void qemu_clear_cloexec(int fd)
+{
+    int f;
+    f = fcntl(fd, F_GETFD);
+    assert(f != -1);
+    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
+    assert(f != -1);
+}
+
 /*
  * Creates a pipe with FD_CLOEXEC set on both file descriptors
  */
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index af559ef..acc3e06 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -265,6 +265,10 @@ void qemu_set_cloexec(int fd)
 {
 }
 
+void qemu_clear_cloexec(int fd)
+{
+}
+
 /* Offset between 1/1/1601 and 1/1/1970 in 100 nanosec units */
 #define _W32_FT_OFFSET (116444736000000000ULL)
 
-- 
1.8.3.1




* [PATCH V6 10/27] machine: memfd-alloc option
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (8 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 09/27] oslib: qemu_clear_cloexec Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 11/27] qapi: list utility functions Steve Sistare
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Allocate anonymous memory using memfd_create if the memfd-alloc machine
option is set.
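
For readers unfamiliar with memfd_create, the following standalone sketch
shows the underlying system calls.  It is not the QEMU code path: memfd_alloc
is a hypothetical helper, it assumes Linux with a glibc that exposes
memfd_create(), and it skips the alignment and flag handling that
qemu_memfd_create() and file_ram_alloc() perform.  The point is that the
memory is named by a file descriptor, so a second shared mapping of that
descriptor sees the same pages.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a memfd of the given size and map it shared.
 * Returns the mapping, or NULL on failure.  On success, *fd_out
 * receives the descriptor that backs the mapping.
 */
static void *memfd_alloc(const char *name, size_t size, int *fd_out)
{
    int fd = memfd_create(name, 0);
    void *addr;

    if (fd < 0) {
        return NULL;
    }
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return NULL;
    }
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    assert(fd_out != NULL);
    *fd_out = fd;
    return addr;
}
```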

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/core/machine.c   | 19 +++++++++++++++++++
 include/hw/boards.h |  1 +
 qemu-options.hx     |  6 ++++++
 softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
 softmmu/vl.c        |  1 +
 trace-events        |  1 +
 util/qemu-config.c  |  4 ++++
 7 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 943974d..5d76265 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -385,6 +385,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
     ms->mem_merge = value;
 }
 
+static bool machine_get_memfd_alloc(Object *obj, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    return ms->memfd_alloc;
+}
+
+static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    ms->memfd_alloc = value;
+}
+
 static bool machine_get_usb(Object *obj, Error **errp)
 {
     MachineState *ms = MACHINE(obj);
@@ -919,6 +933,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
     object_class_property_set_description(oc, "mem-merge",
         "Enable/disable memory merge support");
 
+    object_class_property_add_bool(oc, "memfd-alloc",
+        machine_get_memfd_alloc, machine_set_memfd_alloc);
+    object_class_property_set_description(oc, "memfd-alloc",
+        "Enable/disable allocating anonymous memory using memfd_create");
+
     object_class_property_add_bool(oc, "usb",
         machine_get_usb, machine_set_usb);
     object_class_property_set_description(oc, "usb",
diff --git a/include/hw/boards.h b/include/hw/boards.h
index accd6ef..299e1ca 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -305,6 +305,7 @@ struct MachineState {
     char *dt_compatible;
     bool dump_guest_core;
     bool mem_merge;
+    bool memfd_alloc;
     bool usb;
     bool usb_disabled;
     char *firmware;
diff --git a/qemu-options.hx b/qemu-options.hx
index 83aa59a..05e206c 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
     "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
     "                mem-merge=on|off controls memory merge support (default: on)\n"
+    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
     "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
     "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
     "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
@@ -76,6 +77,11 @@ SRST
         supported by the host, de-duplicates identical memory pages
         among VMs instances (enabled by default).
 
+    ``memfd-alloc=on|off``
+        Enables or disables allocation of anonymous guest RAM using
+        memfd_create.  Any associated memory-backend objects are created with
+        share=on.  The memfd-alloc default is off.
+
     ``aes-key-wrap=on|off``
         Enables or disables AES key wrapping support on s390-ccw hosts.
         This feature controls whether AES wrapping keys will be created
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 3c1912a..d11455f 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -41,6 +41,7 @@
 #include "qemu/config-file.h"
 #include "qemu/error-report.h"
 #include "qemu/qemu-print.h"
+#include "qemu/memfd.h"
 #include "exec/memory.h"
 #include "exec/ioport.h"
 #include "sysemu/dma.h"
@@ -1960,35 +1961,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
     const bool shared = qemu_ram_is_shared(new_block);
     RAMBlock *block;
     RAMBlock *last_block = NULL;
+    struct MemoryRegion *mr = new_block->mr;
     ram_addr_t old_ram_size, new_ram_size;
     Error *err = NULL;
+    const char *name;
+    void *addr = 0;
+    size_t maxlen;
+    MachineState *ms = MACHINE(qdev_get_machine());
 
     old_ram_size = last_ram_page();
 
     qemu_mutex_lock_ramlist();
-    new_block->offset = find_ram_offset(new_block->max_length);
+    maxlen = new_block->max_length;
+    new_block->offset = find_ram_offset(maxlen);
 
     if (!new_block->host) {
         if (xen_enabled()) {
-            xen_ram_alloc(new_block->offset, new_block->max_length,
-                          new_block->mr, &err);
+            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
             if (err) {
                 error_propagate(errp, err);
                 qemu_mutex_unlock_ramlist();
                 return;
             }
         } else {
-            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
-                                                  &new_block->mr->align,
-                                                  shared, noreserve);
-            if (!new_block->host) {
+            name = memory_region_name(mr);
+            if (ms->memfd_alloc) {
+                Object *parent = &mr->parent_obj;
+                int mfd = -1;          /* placeholder until next patch */
+                mr->align = QEMU_VMALLOC_ALIGN;
+                if (mfd < 0) {
+                    mfd = qemu_memfd_create(name, maxlen + mr->align,
+                                            0, 0, 0, &err);
+                    if (mfd < 0) {
+                        return;
+                    }
+                }
+                qemu_set_cloexec(mfd);
+                /* The memory backend already set its desired flags. */
+                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
+                    new_block->flags |= RAM_SHARED;
+                }
+                addr = file_ram_alloc(new_block, maxlen, mfd,
+                                      false, false, 0, errp);
+                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
+            } else {
+                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
+                                           shared, noreserve);
+            }
+
+            if (!addr) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
-                                 memory_region_name(new_block->mr));
+                                 name);
                 qemu_mutex_unlock_ramlist();
                 return;
             }
-            memory_try_enable_merging(new_block->host, new_block->max_length);
+            memory_try_enable_merging(addr, maxlen);
+            new_block->host = addr;
         }
     }
 
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 5ca11e7..cb72ca2 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2406,6 +2406,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
         object_property_set_str(obj, "mem-path", path, &error_fatal);
     }
     object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
+    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
     object_property_add_child(object_get_objects_root(), mc->default_ram_id,
                               obj);
     /* Ensure backend's memory region name is equal to mc->default_ram_id */
diff --git a/trace-events b/trace-events
index c4cca29..a42c7c5 100644
--- a/trace-events
+++ b/trace-events
@@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
 # accel/tcg/cputlb.c
 memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
 memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
+anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
 
 # gdbstub.c
 gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
diff --git a/util/qemu-config.c b/util/qemu-config.c
index 436ab63..3606e5c 100644
--- a/util/qemu-config.c
+++ b/util/qemu-config.c
@@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "enable/disable memory merge support",
         },{
+            .name = "memfd-alloc",
+            .type = QEMU_OPT_BOOL,
+            .help = "enable/disable memfd_create for anonymous memory",
+        },{
             .name = "usb",
             .type = QEMU_OPT_BOOL,
             .help = "Set on/off to enable/disable usb",
-- 
1.8.3.1




* [PATCH V6 11/27] qapi: list utility functions
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (9 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 10/27] machine: memfd-alloc option Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 12/27] vl: helper to request re-exec Steve Sistare
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Generalize strList_from_comma_list() to take any delimiter character, rename
it to strList_from_string(), and move it to qapi/qapi-util.c.  Also add
strv_from_strList() and QAPI_LIST_LENGTH().

No functional change.
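
The splitting and conversion logic can be modelled without glib or QAPI.
The sketch below is an illustration under stated assumptions: StrNode,
dup_n, list_from_string, list_length, and strv_from_list are hypothetical
stand-ins for strList, g_strndup(), strList_from_string(),
QAPI_LIST_LENGTH(), and strv_from_strList(), and allocation failure is
handled with assert() rather than glib's abort-on-failure allocators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the QAPI strList type. */
typedef struct StrNode {
    struct StrNode *next;
    char *value;
} StrNode;

/* Copy the first len bytes of s into a new NUL-terminated string. */
static char *dup_n(const char *s, size_t len)
{
    char *p = malloc(len + 1);

    assert(p != NULL);
    memcpy(p, s, len);
    p[len] = '\0';
    return p;
}

/*
 * Split in at each delim, which is assumed to be non-NUL.
 * A NULL or empty input yields NULL.
 */
static StrNode *list_from_string(const char *in, char delim)
{
    StrNode *res = NULL;
    StrNode **tail = &res;

    while (in && in[0]) {
        const char *next = strchr(in, delim);
        StrNode *node = calloc(1, sizeof(*node));

        assert(node != NULL);
        if (next) {
            node->value = dup_n(in, next - in);
            in = next + 1;          /* skip the delimiter */
        } else {
            node->value = dup_n(in, strlen(in));
            in = NULL;
        }
        *tail = node;
        tail = &node->next;
    }
    return res;
}

/* Count the elements of the list. */
static size_t list_length(const StrNode *list)
{
    size_t len = 0;

    for (; list != NULL; list = list->next) {
        len++;
    }
    return len;
}

/* Build a NULL-terminated argv whose strings are fresh copies. */
static char **strv_from_list(const StrNode *list)
{
    char **argv = malloc((list_length(list) + 1) * sizeof(char *));
    size_t i = 0;

    assert(argv != NULL);
    for (; list != NULL; list = list->next) {
        argv[i++] = dup_n(list->value, strlen(list->value));
    }
    argv[i] = NULL;
    return argv;
}
```

Note the edge cases that follow from the loop: an empty token between two
delimiters is kept as an empty string, while a trailing delimiter does not
produce a final empty element.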

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/qapi/util.h | 28 ++++++++++++++++++++++++++++
 monitor/hmp-cmds.c  | 29 ++---------------------------
 qapi/qapi-util.c    | 37 +++++++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+), 27 deletions(-)

diff --git a/include/qapi/util.h b/include/qapi/util.h
index d7bfb30..83cc4d7 100644
--- a/include/qapi/util.h
+++ b/include/qapi/util.h
@@ -16,6 +16,8 @@ typedef struct QEnumLookup {
     int size;
 } QEnumLookup;
 
+struct strList;
+
 const char *qapi_enum_lookup(const QEnumLookup *lookup, int val);
 int qapi_enum_parse(const QEnumLookup *lookup, const char *buf,
                     int def, Error **errp);
@@ -25,6 +27,19 @@ bool qapi_bool_parse(const char *name, const char *value, bool *obj,
 int parse_qapi_name(const char *name, bool complete);
 
 /*
+ * Produce and return a NULL-terminated array of strings from @args.
+ * All strings are g_strdup'd.
+ */
+char **strv_from_strList(const struct strList *args);
+
+/*
+ * Produce a strList from the character delimited string @in.
+ * All strings are g_strdup'd.
+ * A NULL or empty input string returns NULL.
+ */
+struct strList *strList_from_string(const char *in, char delim);
+
+/*
  * For any GenericList @list, insert @element at the front.
  *
  * Note that this macro evaluates @element exactly once, so it is safe
@@ -50,4 +65,17 @@ int parse_qapi_name(const char *name, bool complete);
     (tail) = &(*(tail))->next; \
 } while (0)
 
+/*
+ * For any GenericList @list, return its length.
+ */
+#define QAPI_LIST_LENGTH(list) \
+    ({ \
+        int len = 0; \
+        typeof(list) elem; \
+        for (elem = list; elem != NULL; elem = elem->next) { \
+            len++; \
+        } \
+        len; \
+    })
+
 #endif
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 6aed6ac..da91a0a 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -43,6 +43,7 @@
 #include "qapi/qapi-commands-run-state.h"
 #include "qapi/qapi-commands-tpm.h"
 #include "qapi/qapi-commands-ui.h"
+#include "qapi/util.h"
 #include "qapi/qapi-visit-net.h"
 #include "qapi/qapi-visit-migration.h"
 #include "qapi/qmp/qdict.h"
@@ -70,32 +71,6 @@ void hmp_handle_error(Monitor *mon, Error *err)
     }
 }
 
-/*
- * Produce a strList from a comma separated list.
- * A NULL or empty input string return NULL.
- */
-static strList *strList_from_comma_list(const char *in)
-{
-    strList *res = NULL;
-    strList **tail = &res;
-
-    while (in && in[0]) {
-        char *comma = strchr(in, ',');
-        char *value;
-
-        if (comma) {
-            value = g_strndup(in, comma - in);
-            in = comma + 1; /* skip the , */
-        } else {
-            value = g_strdup(in);
-            in = NULL;
-        }
-        QAPI_LIST_APPEND(tail, value);
-    }
-
-    return res;
-}
-
 void hmp_info_name(Monitor *mon, const QDict *qdict)
 {
     NameInfo *info;
@@ -1170,7 +1145,7 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
                                             migrate_announce_params());
 
     qapi_free_strList(params->interfaces);
-    params->interfaces = strList_from_comma_list(interfaces_str);
+    params->interfaces = strList_from_string(interfaces_str, ',');
     params->has_interfaces = params->interfaces != NULL;
     params->id = g_strdup(id);
     params->has_id = !!params->id;
diff --git a/qapi/qapi-util.c b/qapi/qapi-util.c
index 3c24bb3..1edbd04 100644
--- a/qapi/qapi-util.c
+++ b/qapi/qapi-util.c
@@ -14,6 +14,7 @@
 #include "qapi/error.h"
 #include "qemu/ctype.h"
 #include "qapi/qmp/qerror.h"
+#include "qapi/qapi-builtin-types.h"
 
 const char *qapi_enum_lookup(const QEnumLookup *lookup, int val)
 {
@@ -109,3 +110,39 @@ int parse_qapi_name(const char *str, bool complete)
     }
     return p - str;
 }
+
+char **strv_from_strList(const strList *args)
+{
+    const strList *arg;
+    int i = 0;
+    char **argv = g_malloc((QAPI_LIST_LENGTH(args) + 1) * sizeof(char *));
+
+    for (arg = args; arg != NULL; arg = arg->next) {
+        argv[i++] = g_strdup(arg->value);
+    }
+    argv[i] = NULL;
+
+    return argv;
+}
+
+strList *strList_from_string(const char *in, char delim)
+{
+    strList *res = NULL;
+    strList **tail = &res;
+
+    while (in && in[0]) {
+        char *next = strchr(in, delim);
+        char *value;
+
+        if (next) {
+            value = g_strndup(in, next - in);
+            in = next + 1; /* skip the delim */
+        } else {
+            value = g_strdup(in);
+            in = NULL;
+        }
+        QAPI_LIST_APPEND(tail, value);
+    }
+
+    return res;
+}
-- 
1.8.3.1




* [PATCH V6 12/27] vl: helper to request re-exec
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (10 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 11/27] qapi: list utility functions Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 13/27] cpr: preserve extra state Steve Sistare
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add a qemu_system_exec_request() hook that causes the main loop to exit and
re-exec qemu using the specified arguments.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/sysemu/runstate.h |  1 +
 softmmu/runstate.c        | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+)
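
A minimal sketch, not part of the patch, of what the hook above amounts to:
flatten a linked list of strings into a NULL-terminated argv, then call
execvp(), which returns only on failure.  It is plain C without glib.
StrNode, argv_from_list(), free_argv() and reexec() are hypothetical names
standing in for strList, strv_from_strList(), g_strfreev() and the exec step
in main_loop_should_exit().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for QAPI's strList. */
typedef struct StrNode {
    struct StrNode *next;
    const char *value;
} StrNode;

/* Free a NULL-terminated array of strings and the array itself. */
void free_argv(char **argv)
{
    size_t i;

    for (i = 0; argv && argv[i]; i++) {
        free(argv[i]);
    }
    free(argv);
}

/* Copy the list into a NULL-terminated argv; returns NULL if out of memory. */
char **argv_from_list(const StrNode *args)
{
    const StrNode *arg;
    size_t n = 0, i = 0;
    char **argv;

    for (arg = args; arg; arg = arg->next) {
        n++;
    }
    argv = calloc(n + 1, sizeof(char *));   /* last slot stays NULL */
    if (!argv) {
        return NULL;
    }
    for (arg = args; arg; arg = arg->next, i++) {
        size_t len = strlen(arg->value) + 1;

        argv[i] = malloc(len);
        if (!argv[i]) {
            free_argv(argv);
            return NULL;
        }
        memcpy(argv[i], arg->value, len);
    }
    return argv;
}

/* Replace the current process image; reaching the return means exec failed. */
int reexec(char **argv)
{
    if (!argv || !argv[0]) {
        return -EINVAL;
    }
    execvp(argv[0], argv);
    return -errno;
}
```

Because a successful execvp() never returns, everything after the call is
error handling.  That matches the hunk in main_loop_should_exit(), which
reports the failure and frees exec_argv before continuing with the normal
shutdown path.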

diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index b655c7b..198211b 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -57,6 +57,7 @@ void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
 void qemu_register_wakeup_notifier(Notifier *notifier);
 void qemu_register_wakeup_support(void);
 void qemu_system_shutdown_request(ShutdownCause reason);
+void qemu_system_exec_request(const strList *args);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
 void qemu_register_shutdown_notifier(Notifier *notifier);
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 3d344c9..309a4bf 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -38,6 +38,7 @@
 #include "monitor/monitor.h"
 #include "net/net.h"
 #include "net/vhost_net.h"
+#include "qapi/util.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-run-state.h"
 #include "qapi/qapi-events-run-state.h"
@@ -355,6 +356,7 @@ static NotifierList wakeup_notifiers =
 static NotifierList shutdown_notifiers =
     NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
 static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
+static char **exec_argv;
 
 ShutdownCause qemu_shutdown_requested_get(void)
 {
@@ -371,6 +373,11 @@ static int qemu_shutdown_requested(void)
     return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
 }
 
+static int qemu_exec_requested(void)
+{
+    return exec_argv != NULL;
+}
+
 static void qemu_kill_report(void)
 {
     if (!qtest_driver() && shutdown_signal) {
@@ -641,6 +648,13 @@ void qemu_system_shutdown_request(ShutdownCause reason)
     qemu_notify_event();
 }
 
+void qemu_system_exec_request(const strList *args)
+{
+    exec_argv = strv_from_strList(args);
+    shutdown_requested = 1;
+    qemu_notify_event();
+}
+
 static void qemu_system_powerdown(void)
 {
     qapi_event_send_powerdown();
@@ -689,6 +703,13 @@ static bool main_loop_should_exit(void)
     }
     request = qemu_shutdown_requested();
     if (request) {
+
+        if (qemu_exec_requested()) {
+            execvp(exec_argv[0], exec_argv);
+            error_report("execvp %s failed: %s", exec_argv[0], strerror(errno));
+            g_strfreev(exec_argv);
+            exec_argv = NULL;
+        }
         qemu_kill_report();
         qemu_system_shutdown(request);
         if (shutdown_action == SHUTDOWN_ACTION_PAUSE) {
-- 
1.8.3.1




* [PATCH V6 13/27] cpr: preserve extra state
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (11 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 12/27] vl: helper to request re-exec Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 14/27] cpr: restart mode Steve Sistare
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

cpr must save state that is needed after qemu is restarted, when devices are
realized.  Thus the extra state cannot be saved in the cpr-load vmstate file,
as objects must already exist before that file can be loaded.  Instead,
define auxiliary state structures and vmstate descriptions, not associated
with any registered object, and serialize the aux state to a memfd file.
Deserialize after qemu restarts, before devices are realized.

Currently file descriptors comprise the only such state, but more could
be added in the future.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS             |   2 +
 include/migration/cpr.h |  11 +++
 migration/cpr-state.c   | 215 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build   |   1 +
 migration/trace-events  |   5 ++
 stubs/cpr-state.c       |  15 ++++
 stubs/meson.build       |   1 +
 7 files changed, 250 insertions(+)
 create mode 100644 migration/cpr-state.c
 create mode 100644 stubs/cpr-state.c
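
A minimal sketch, not part of the patch, of the (name, id) -> fd bookkeeping
that cpr_save_fd(), cpr_find_fd() and cpr_delete_fd() provide.  It uses a
plain singly linked list rather than QLIST and omits the vmstate
serialization.  FdEntry and the fd_registry_*() names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical registry entry: one file descriptor keyed by (name, id). */
typedef struct FdEntry {
    struct FdEntry *next;
    char *name;
    int id;
    int fd;
} FdEntry;

static FdEntry *fd_list;

/* Return the link that points at the matching entry, or NULL if none. */
static FdEntry **find_link(const char *name, int id)
{
    FdEntry **link;

    for (link = &fd_list; *link; link = &(*link)->next) {
        if (strcmp((*link)->name, name) == 0 && (*link)->id == id) {
            return link;
        }
    }
    return NULL;
}

/* Record fd under (name, id); returns 0, or -1 if out of memory. */
int fd_registry_save(const char *name, int id, int fd)
{
    size_t len = strlen(name) + 1;
    FdEntry *e = malloc(sizeof(*e));

    if (!e) {
        return -1;
    }
    e->name = malloc(len);
    if (!e->name) {
        free(e);
        return -1;
    }
    memcpy(e->name, name, len);
    e->id = id;
    e->fd = fd;
    e->next = fd_list;      /* insert at head, as QLIST_INSERT_HEAD does */
    fd_list = e;
    return 0;
}

/* Return the fd recorded under (name, id), or -1 if there is none. */
int fd_registry_find(const char *name, int id)
{
    FdEntry **link = find_link(name, id);

    return link ? (*link)->fd : -1;
}

/* Forget (name, id); the descriptor itself is left open. */
void fd_registry_delete(const char *name, int id)
{
    FdEntry **link = find_link(name, id);

    if (link) {
        FdEntry *e = *link;

        *link = e->next;
        free(e->name);
        free(e);
    }
}
```

The -1 result from a failed lookup is what lets a caller fall back to
creating a fresh descriptor: look up first, create and save only on a miss.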

diff --git a/MAINTAINERS b/MAINTAINERS
index 2611ca6..a9d2ed8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2907,6 +2907,8 @@ S: Maintained
 F: include/migration/cpr.h
 F: migration/cpr.c
 F: qapi/cpr.json
+F: migration/cpr-state.c
+F: stubs/cpr-state.c
 
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index a76429a..83f69c9 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -14,4 +14,15 @@
 
 CprMode cpr_mode(void);
 
+typedef int (*cpr_walk_fd_cb)(const char *name, int id, int fd, void *opaque);
+
+void cpr_save_fd(const char *name, int id, int fd);
+void cpr_delete_fd(const char *name, int id);
+int cpr_find_fd(const char *name, int id);
+int cpr_walk_fd(cpr_walk_fd_cb cb, void *handle);
+int cpr_state_save(Error **errp);
+int cpr_state_load(Error **errp);
+CprMode cpr_state_mode(void);
+void cpr_state_print(void);
+
 #endif
diff --git a/migration/cpr-state.c b/migration/cpr-state.c
new file mode 100644
index 0000000..003b449
--- /dev/null
+++ b/migration/cpr-state.c
@@ -0,0 +1,215 @@
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qemu/queue.h"
+#include "qemu/memfd.h"
+#include "qapi/error.h"
+#include "migration/vmstate.h"
+#include "migration/cpr.h"
+#include "migration/qemu-file.h"
+#include "migration/qemu-file-channel.h"
+#include "trace.h"
+
+/*************************************************************************/
+/* cpr state container for all information to be saved. */
+
+typedef QLIST_HEAD(CprNameList, CprName) CprNameList;
+
+typedef struct CprState {
+    CprMode mode;
+    CprNameList fds;            /* list of CprFd */
+} CprState;
+
+static CprState cpr_state;
+
+/*************************************************************************/
+/* Generic list of names. */
+
+typedef struct CprName {
+    char *name;
+    unsigned int namelen;
+    int id;
+    QLIST_ENTRY(CprName) next;
+} CprName;
+
+static const VMStateDescription vmstate_cpr_name = {
+    .name = "cpr name",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(namelen, CprName),
+        VMSTATE_VBUFFER_ALLOC_UINT32(name, CprName, 0, NULL, namelen),
+        VMSTATE_INT32(id, CprName),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static void
+add_name(CprNameList *head, const char *name, int id, CprName *elem)
+{
+    elem->name = g_strdup(name);
+    elem->namelen = strlen(name) + 1;
+    elem->id = id;
+    QLIST_INSERT_HEAD(head, elem, next);
+}
+
+static CprName *find_name(CprNameList *head, const char *name, int id)
+{
+    CprName *elem;
+
+    QLIST_FOREACH(elem, head, next) {
+        if (!strcmp(elem->name, name) && elem->id == id) {
+            return elem;
+        }
+    }
+    return NULL;
+}
+
+static void delete_name(CprNameList *head, const char *name, int id)
+{
+    CprName *elem = find_name(head, name, id);
+
+    if (elem) {
+        QLIST_REMOVE(elem, next);
+        g_free(elem->name);
+        g_free(elem);
+    }
+}
+
+/****************************************************************************/
+/* Lists of named things.  The first field of each entry must be a CprName. */
+
+typedef struct CprFd {
+    CprName name;               /* must be first */
+    int fd;
+} CprFd;
+
+static const VMStateDescription vmstate_cpr_fd = {
+    .name = "cpr fd",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_STRUCT(name, CprFd, 1, vmstate_cpr_name, CprName),
+        VMSTATE_INT32(fd, CprFd),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+#define CPR_FD(elem)        ((CprFd *)(elem))
+#define CPR_FD_FD(elem)     (CPR_FD(elem)->fd)
+
+void cpr_save_fd(const char *name, int id, int fd)
+{
+    CprFd *elem = g_new0(CprFd, 1);
+
+    trace_cpr_save_fd(name, id, fd);
+    elem->fd = fd;
+    add_name(&cpr_state.fds, name, id, &elem->name);
+}
+
+void cpr_delete_fd(const char *name, int id)
+{
+    trace_cpr_delete_fd(name, id);
+    delete_name(&cpr_state.fds, name, id);
+}
+
+int cpr_find_fd(const char *name, int id)
+{
+    CprName *elem = find_name(&cpr_state.fds, name, id);
+    int fd = elem ? CPR_FD_FD(elem) : -1;
+
+    trace_cpr_find_fd(name, id, fd);
+    return fd;
+}
+
+int cpr_walk_fd(cpr_walk_fd_cb cb, void *opaque)
+{
+    CprName *elem;
+
+    QLIST_FOREACH(elem, &cpr_state.fds, next) {
+        if (cb(elem->name, elem->id, CPR_FD_FD(elem), opaque)) {
+            return 1;
+        }
+    }
+    return 0;
+}
+
+/*************************************************************************/
+/* cpr state container interface and implementation. */
+
+#define CPR_STATE_NAME "QEMU_CPR_STATE"
+
+static const VMStateDescription vmstate_cpr_state = {
+    .name = CPR_STATE_NAME,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(mode, CprState),
+        VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, name.next),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+int cpr_state_save(Error **errp)
+{
+    int ret, mfd;
+    QEMUFile *f;
+    char val[16];
+
+    mfd = memfd_create(CPR_STATE_NAME, 0);
+    if (mfd < 0) {
+        error_setg_errno(errp, errno, "memfd_create failed");
+        return -1;
+    }
+    qemu_clear_cloexec(mfd);
+    f = qemu_fd_open(mfd, true, CPR_STATE_NAME);
+
+    ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
+    if (ret) {
+        error_setg(errp, "vmstate_save_state error %d", ret);
+        return ret;
+    }
+
+    /* Do not close f, as mfd must remain open. */
+    qemu_fflush(f);
+    lseek(mfd, 0, SEEK_SET);
+
+    /* Remember mfd for post-exec cpr_state_load */
+    snprintf(val, sizeof(val), "%d", mfd);
+    g_setenv(CPR_STATE_NAME, val, 1);
+
+    return 0;
+}
+
+int cpr_state_load(Error **errp)
+{
+    int ret, mfd;
+    QEMUFile *f;
+    const char *val = g_getenv(CPR_STATE_NAME);
+
+    if (!val) {
+        return 0;
+    }
+    g_unsetenv(CPR_STATE_NAME);
+    if (qemu_strtoi(val, NULL, 10, &mfd)) {
+        error_setg(errp, "Bad %s env value %s", CPR_STATE_NAME, val);
+        return 1;
+    }
+    f = qemu_fd_open(mfd, false, CPR_STATE_NAME);
+    ret = vmstate_load_state(f, &vmstate_cpr_state, &cpr_state, 1);
+    qemu_fclose(f);
+    return ret;
+}
+
+CprMode cpr_state_mode(void)
+{
+    return cpr_state.mode;
+}
+
+void cpr_state_print(void)
+{
+    CprName *elem;
+
+    QLIST_FOREACH(elem, &cpr_state.fds, next) {
+        printf("%s %d : %d\n", elem->name, elem->id, CPR_FD_FD(elem));
+    }
+}
diff --git a/migration/meson.build b/migration/meson.build
index fd59281..b79d02c 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -16,6 +16,7 @@ softmmu_ss.add(files(
   'colo-failover.c',
   'colo.c',
   'cpr.c',
+  'cpr-state.c',
   'exec.c',
   'fd.c',
   'global_state.c',
diff --git a/migration/trace-events b/migration/trace-events
index a1c0f03..e3149b6 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -310,6 +310,11 @@ colo_receive_message(const char *msg) "Receive '%s' message"
 # colo-failover.c
 colo_failover_set_state(const char *new_state) "new state %s"
 
+# cpr-state.c
+cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
+cpr_delete_fd(const char *name, int id) "%s, id %d"
+cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
+
 # block-dirty-bitmap.c
 send_bitmap_header_enter(void) ""
 send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "flags: 0x%x, start_sector: %" PRIu64 ", nr_sectors: %" PRIu32 ", data_size: %" PRIu64
diff --git a/stubs/cpr-state.c b/stubs/cpr-state.c
new file mode 100644
index 0000000..24a9057
--- /dev/null
+++ b/stubs/cpr-state.c
@@ -0,0 +1,15 @@
+#include "qemu/osdep.h"
+#include "migration/cpr.h"
+
+void cpr_save_fd(const char *name, int id, int fd)
+{
+}
+
+void cpr_delete_fd(const char *name, int id)
+{
+}
+
+int cpr_find_fd(const char *name, int id)
+{
+    return -1;
+}
diff --git a/stubs/meson.build b/stubs/meson.build
index d3fa864..2748508 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -5,6 +5,7 @@ stub_ss.add(files('blk-exp-close-all.c'))
 stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
 stub_ss.add(files('change-state-handler.c'))
 stub_ss.add(files('cmos.c'))
+stub_ss.add(files('cpr-state.c'))
 stub_ss.add(files('cpu-get-clock.c'))
 stub_ss.add(files('cpus-get-virtual-clock.c'))
 stub_ss.add(files('qemu-timer-notify-cb.c'))
-- 
1.8.3.1




* [PATCH V6 14/27] cpr: restart mode
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (12 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 13/27] cpr: preserve extra state Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 15/27] cpr: restart HMP interfaces Steve Sistare
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Provide the cpr-save restart mode, which preserves the guest VM across a
restart of the qemu process.  After cpr-save, the caller passes qemu
command-line arguments to cpr-exec, which directly exec's the new qemu
binary.  The arguments must include -S so new qemu starts in a paused state.
The caller resumes the guest by calling cpr-load.

To use the restart mode, all guest RAM objects must be shared.  The
share=on property is required for memory created with an explicit -object
option.  The memfd-alloc machine property is required for memory that is
implicitly created.  The memfd descriptors are recorded in special cpr state
and kept open across exec.  After exec, the new qemu retrieves them from that
state and re-mmap's them.  Hence guest RAM is preserved in place, albeit with
new virtual addresses in the qemu process.

Support for vfio devices and explicit memory-backend-memfd objects in
restart mode is added in subsequent patches.

cpr-exec syntax:
  { 'command': 'cpr-exec', 'data': { 'argv': [ 'str' ] } }

Add the restart mode:
  { 'enum': 'CprMode', 'data': [ 'reboot', 'restart' ] }

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/cpr.c   | 30 +++++++++++++++++++++++++++++-
 qapi/cpr.json     | 22 +++++++++++++++++++++-
 softmmu/physmem.c |  5 ++++-
 softmmu/vl.c      |  3 +++
 4 files changed, 57 insertions(+), 3 deletions(-)
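
A minimal sketch, not part of the patch, of the mechanism the commit message
describes: keep a descriptor open across exec, hand its number to the new
process, and map the same pages again at a new address.  Two simplifications
are assumed.  An unlinked mkstemp() file stands in for the memfd so the code
builds on any POSIX system, and the descriptor number travels directly in an
environment variable instead of inside the serialized cpr state.
EXAMPLE_RAM_FD, keep_fd_across_exec(), ram_create() and ram_reattach() are
hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HANDOFF_ENV "EXAMPLE_RAM_FD"    /* hypothetical variable name */

/* Clear FD_CLOEXEC so that exec does not close the descriptor. */
int keep_fd_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Before exec: create shared backing storage, map it, publish the fd. */
void *ram_create(size_t len, int *fd_out)
{
    char path[] = "/tmp/cpr-example-XXXXXX";
    char val[16];
    void *p;
    int fd = mkstemp(path);

    if (fd < 0) {
        return NULL;
    }
    unlink(path);                       /* nameless from here on */
    if (ftruncate(fd, (off_t)len) < 0 || keep_fd_across_exec(fd) < 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    snprintf(val, sizeof(val), "%d", fd);
    setenv(HANDOFF_ENV, val, 1);
    *fd_out = fd;
    return p;
}

/* After exec: recover the fd number and map the same pages again. */
void *ram_reattach(size_t len)
{
    const char *val = getenv(HANDOFF_ENV);
    char *end;
    long fd;
    void *p;

    if (!val || *val == '\0') {
        return NULL;
    }
    fd = strtol(val, &end, 10);
    if (*end != '\0' || fd < 0) {
        return NULL;
    }
    unsetenv(HANDOFF_ENV);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, (int)fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

MAP_SHARED is the essential part: a private mapping would keep modified
pages in the process rather than in the file, so they would not be visible
through a second mapping after exec.  This is the reason restart mode
requires shared guest RAM.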

diff --git a/migration/cpr.c b/migration/cpr.c
index 1ec903f..72a5f4b 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -88,6 +88,34 @@ err:
     cpr_active_mode = CPR_MODE_NONE;
 }
 
+static int preserve_fd(const char *name, int id, int fd, void *opaque)
+{
+    qemu_clear_cloexec(fd);
+    return 0;
+}
+
+void qmp_cpr_exec(strList *args, Error **errp)
+{
+    if (xen_enabled()) {
+        error_setg(errp, "xen does not support cpr-exec");
+        return;
+    }
+    if (!runstate_check(RUN_STATE_SAVE_VM)) {
+        error_setg(errp, "runstate is not save-vm");
+        return;
+    }
+    if (cpr_active_mode != CPR_MODE_RESTART) {
+        error_setg(errp, "cpr-exec requires cpr-save with restart mode");
+        return;
+    }
+
+    cpr_walk_fd(preserve_fd, 0);
+    if (cpr_state_save(errp)) {
+        return;
+    }
+    qemu_system_exec_request(args);
+}
+
 void qmp_cpr_load(const char *filename, Error **errp)
 {
     QEMUFile *f;
@@ -111,7 +139,7 @@ void qmp_cpr_load(const char *filename, Error **errp)
         return;
     }
 
-    cpr_active_mode = CPR_MODE_REBOOT;  /* generalized in a later patch */
+    cpr_active_mode = cpr_state_mode();
 
     ret = qemu_load_device_state(f);
     qemu_fclose(f);
diff --git a/qapi/cpr.json b/qapi/cpr.json
index 2edd08e..56be0e5 100644
--- a/qapi/cpr.json
+++ b/qapi/cpr.json
@@ -15,11 +15,12 @@
 # @CprMode:
 #
 # @reboot: checkpoint can be cpr-load'ed after a host kexec reboot.
+# @restart: checkpoint can be cpr-load'ed after restarting qemu.
 #
 # Since: 6.2
 ##
 { 'enum': 'CprMode',
-  'data': [ 'reboot' ] }
+  'data': [ 'reboot', 'restart' ] }
 
 ##
 # @cpr-save:
@@ -33,6 +34,11 @@
 # For reboot mode, all guest RAM objects must be non-volatile across reboot,
 # and created with the share=on parameter.
 #
+# For restart mode, all guest RAM objects must be shared.  The share=on
+# property is required for memory created with an explicit -object option,
+# and the memfd-alloc machine property is required for memory that is
+# implicitly created.
+#
 # @filename: name of checkpoint file
 # @mode: @CprMode mode
 #
@@ -43,6 +49,20 @@
             'mode': 'CprMode' } }
 
 ##
+# @cpr-exec:
+#
+# exec() a command and replace the qemu process.  The PID remains the same.
+# @argv[0] should be the path of a new qemu binary, or a prefix command that
+# in turn exec's the new qemu binary.  Must be called after cpr-save restart.
+#
+# @argv: arguments to be passed to exec().
+#
+# Since: 6.2
+##
+{ 'command': 'cpr-exec',
+  'data': { 'argv': [ 'str' ] } }
+
+##
 # @cpr-load:
 #
 # Start virtual machine from checkpoint file that was created earlier using
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index d11455f..2e14314 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -65,6 +65,7 @@
 
 #include "qemu/pmem.h"
 
+#include "migration/cpr.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -1987,7 +1988,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
             name = memory_region_name(mr);
             if (ms->memfd_alloc) {
                 Object *parent = &mr->parent_obj;
-                int mfd = -1;          /* placeholder until next patch */
+                int mfd = cpr_find_fd(name, 0);
                 mr->align = QEMU_VMALLOC_ALIGN;
                 if (mfd < 0) {
                     mfd = qemu_memfd_create(name, maxlen + mr->align,
@@ -1995,6 +1996,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                     if (mfd < 0) {
                         return;
                     }
+                    cpr_save_fd(name, 0, mfd);
                 }
                 qemu_set_cloexec(mfd);
                 /* The memory backend already set its desired flags. */
@@ -2251,6 +2253,7 @@ void qemu_ram_free(RAMBlock *block)
     }
 
     qemu_mutex_lock_ramlist();
+    cpr_delete_fd(memory_region_name(block->mr), 0);
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
     /* Write list before version */
diff --git a/softmmu/vl.c b/softmmu/vl.c
index cb72ca2..924e8f9 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -76,6 +76,7 @@
 #include "hw/i386/pc.h"
 #include "migration/misc.h"
 #include "migration/snapshot.h"
+#include "migration/cpr.h"
 #include "sysemu/tpm.h"
 #include "sysemu/dma.h"
 #include "hw/audio/soundhw.h"
@@ -3614,6 +3615,8 @@ void qemu_init(int argc, char **argv, char **envp)
     qemu_validate_options(machine_opts_dict);
     qemu_process_sugar_options();
 
+    cpr_state_load(&error_fatal);
+
     /*
      * These options affect everything else and should be processed
      * before daemonizing.
-- 
1.8.3.1




* [PATCH V6 15/27] cpr: restart HMP interfaces
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (13 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 14/27] cpr: restart mode Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 16/27] hostmem-memfd: cpr for memory-backend-memfd Steve Sistare
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

cpr-save <filename> <mode>
  mode may be "restart"

cpr-exec <command>
  Call qmp_cpr_exec().
  Arguments:
    command : command line to execute, with space-separated arguments

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hmp-commands.hx       | 21 ++++++++++++++++++++-
 include/monitor/hmp.h |  1 +
 monitor/hmp-cmds.c    | 11 +++++++++++
 3 files changed, 32 insertions(+), 1 deletion(-)
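
A minimal sketch, not part of the patch, of how a space-separated command
string is split when the delimiter is a single character, which is what
hmp_cpr_exec() gets from strList_from_string(command, ' ').  It follows the
same loop in plain C and returns an array instead of a strList.
split_on_delim() and free_strv() are hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

/* Free a NULL-terminated array of strings and the array itself. */
void free_strv(char **v)
{
    size_t i;

    for (i = 0; v && v[i]; i++) {
        free(v[i]);
    }
    free(v);
}

/*
 * Split 'in' at every 'delim' into a NULL-terminated array of new strings.
 * Two adjacent delimiters yield an empty string; a trailing delimiter
 * yields nothing.  Returns NULL only if memory allocation fails.
 */
char **split_on_delim(const char *in, char delim)
{
    size_t cap = 1, n = 0;
    const char *p;
    char **out;

    /* There can be at most one more piece than there are delimiters. */
    for (p = in; p && *p; p++) {
        if (*p == delim) {
            cap++;
        }
    }
    out = calloc(cap + 1, sizeof(char *));
    if (!out) {
        return NULL;
    }

    while (in && in[0]) {
        /* A NUL delimiter can never match, so treat it as "no split". */
        const char *next = delim ? strchr(in, delim) : NULL;
        size_t len = next ? (size_t)(next - in) : strlen(in);

        out[n] = malloc(len + 1);
        if (!out[n]) {
            free_strv(out);
            return NULL;
        }
        memcpy(out[n], in, len);
        out[n][len] = '\0';
        n++;
        in = next ? next + 1 : NULL;    /* skip the delimiter */
    }
    return out;
}
```

Splitting this way does no quoting or escaping, so an argument that itself
contains a space cannot be expressed, and a doubled space produces an empty
argument.  Callers that need such arguments would presumably use the QMP
command, which takes the argv list directly.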

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 0a45c59..9541871 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -353,7 +353,7 @@ ERST
     {
         .name       = "cpr-save",
         .args_type  = "filename:s,mode:s",
-        .params     = "filename 'reboot'",
+        .params     = "filename 'reboot'|'restart'",
         .help       = "create a checkpoint of the VM in file",
         .cmd        = hmp_cpr_save,
     },
@@ -366,6 +366,25 @@ If *mode* is 'reboot', the checkpoint remains valid after a host kexec
 reboot, and guest ram must be backed by persistent shared memory.  To
 resume from the checkpoint, issue the quit command, reboot the system,
 and issue the cpr-load command.
+
+If *mode* is 'restart', the checkpoint remains valid after restarting qemu
+using a subsequent cpr-exec.  All guest RAM objects must be shared.  The
+share=on property is required for memory created with an explicit -object
+option, and the memfd-alloc machine property is required for memory that is
+implicitly created.  To resume from the checkpoint, issue the cpr-load command.
+ERST
+
+    {
+        .name       = "cpr-exec",
+        .args_type  = "command:S",
+        .params     = "command",
+        .help       = "Restart qemu by directly exec'ing command",
+        .cmd        = hmp_cpr_exec,
+    },
+
+SRST
+``cpr-exec`` *command*
+Restart qemu by directly exec'ing *command*, replacing the qemu process.
 ERST
 
     {
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 01b5df8..90f18fd 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -59,6 +59,7 @@ void hmp_loadvm(Monitor *mon, const QDict *qdict);
 void hmp_savevm(Monitor *mon, const QDict *qdict);
 void hmp_delvm(Monitor *mon, const QDict *qdict);
 void hmp_cpr_save(Monitor *mon, const QDict *qdict);
+void hmp_cpr_exec(Monitor *mon, const QDict *qdict);
 void hmp_cpr_load(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index da91a0a..99f75a1 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1172,6 +1172,17 @@ out:
     hmp_handle_error(mon, err);
 }
 
+void hmp_cpr_exec(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    const char *command = qdict_get_try_str(qdict, "command");
+    strList *args = strList_from_string(command, ' ');
+
+    qmp_cpr_exec(args, &err);
+    qapi_free_strList(args);
+    hmp_handle_error(mon, err);
+}
+
 void hmp_cpr_load(Monitor *mon, const QDict *qdict)
 {
     Error *err = NULL;
-- 
1.8.3.1




* [PATCH V6 16/27] hostmem-memfd: cpr for memory-backend-memfd
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (14 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 15/27] cpr: restart HMP interfaces Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 17/27] pci: export functions for cpr Steve Sistare
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Preserve memory-backend-memfd memory objects during cpr.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 backends/hostmem-memfd.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 3fc85c3..5097a05 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -14,6 +14,7 @@
 #include "sysemu/hostmem.h"
 #include "qom/object_interfaces.h"
 #include "qemu/memfd.h"
+#include "migration/cpr.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
 #include "qom/object.h"
@@ -36,23 +37,25 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
     HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend);
     uint32_t ram_flags;
-    char *name;
-    int fd;
+    char *name = host_memory_backend_get_name(backend);
+    int fd = cpr_find_fd(name, 0);
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
         return;
     }
 
-    fd = qemu_memfd_create(TYPE_MEMORY_BACKEND_MEMFD, backend->size,
-                           m->hugetlb, m->hugetlbsize, m->seal ?
-                           F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL : 0,
-                           errp);
-    if (fd == -1) {
-        return;
+    if (fd < 0) {
+        fd = qemu_memfd_create(TYPE_MEMORY_BACKEND_MEMFD, backend->size,
+                               m->hugetlb, m->hugetlbsize, m->seal ?
+                               F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL : 0,
+                               errp);
+        if (fd == -1) {
+            return;
+        }
+        cpr_save_fd(name, 0, fd);
     }
 
-    name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
-- 
1.8.3.1




* [PATCH V6 17/27] pci: export functions for cpr
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (15 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 16/27] hostmem-memfd: cpr for memory-backend-memfd Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 18/27] vfio-pci: refactor " Steve Sistare
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Export msix_is_pending, msix_init_vector_notifiers, and pci_update_mappings
for use by cpr.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/pci/msix.c         | 20 ++++++++++++++------
 hw/pci/pci.c          |  3 +--
 include/hw/pci/msix.h |  5 +++++
 include/hw/pci/pci.h  |  1 +
 4 files changed, 21 insertions(+), 8 deletions(-)
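
A minimal sketch, not part of the patch, of the pending-bit lookup behind
msix_is_pending(): one bit per vector, packed eight to a byte.  The byte
selection (vector / 8) is visible in the hunk below.  The mask (bit
vector % 8 within that byte) is an assumption, since msix_pending_mask() is
not shown.  All function names here are hypothetical.

```c
#include <stdint.h>

/* Byte of the pending-bit array that holds the bit for 'vector'. */
static uint8_t *pending_byte(uint8_t *pba, unsigned int vector)
{
    return pba + vector / 8;
}

/* Mask selecting the bit for 'vector' within its byte (assumed layout). */
static uint8_t pending_mask(unsigned int vector)
{
    return (uint8_t)(1u << (vector % 8));
}

/* Nonzero if the vector is pending; the value is the masked bit, not 1. */
int vector_is_pending(uint8_t *pba, unsigned int vector)
{
    return *pending_byte(pba, vector) & pending_mask(vector);
}

void vector_set_pending(uint8_t *pba, unsigned int vector)
{
    *pending_byte(pba, vector) |= pending_mask(vector);
}

void vector_clr_pending(uint8_t *pba, unsigned int vector)
{
    *pending_byte(pba, vector) &= (uint8_t)~pending_mask(vector);
}
```

Taking the vector as unsigned, as the exported prototype now does, keeps the
division and modulo well defined for every input.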

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331c..73f4259 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -64,7 +64,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
     return dev->msix_pba + vector / 8;
 }
 
-static int msix_is_pending(PCIDevice *dev, int vector)
+int msix_is_pending(PCIDevice *dev, unsigned int vector)
 {
     return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
 }
@@ -579,6 +579,17 @@ static void msix_unset_notifier_for_vector(PCIDevice *dev, unsigned int vector)
     dev->msix_vector_release_notifier(dev, vector);
 }
 
+void msix_init_vector_notifiers(PCIDevice *dev,
+                                MSIVectorUseNotifier use_notifier,
+                                MSIVectorReleaseNotifier release_notifier,
+                                MSIVectorPollNotifier poll_notifier)
+{
+    assert(use_notifier && release_notifier);
+    dev->msix_vector_use_notifier = use_notifier;
+    dev->msix_vector_release_notifier = release_notifier;
+    dev->msix_vector_poll_notifier = poll_notifier;
+}
+
 int msix_set_vector_notifiers(PCIDevice *dev,
                               MSIVectorUseNotifier use_notifier,
                               MSIVectorReleaseNotifier release_notifier,
@@ -586,11 +597,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
 {
     int vector, ret;
 
-    assert(use_notifier && release_notifier);
-
-    dev->msix_vector_use_notifier = use_notifier;
-    dev->msix_vector_release_notifier = release_notifier;
-    dev->msix_vector_poll_notifier = poll_notifier;
+    msix_init_vector_notifiers(dev, use_notifier, release_notifier,
+                               poll_notifier);
 
     if ((dev->config[dev->msix_cap + MSIX_CONTROL_OFFSET] &
         (MSIX_ENABLE_MASK | MSIX_MASKALL_MASK)) == MSIX_ENABLE_MASK) {
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 23d2ae2..59408a3 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -225,7 +225,6 @@ static const TypeInfo pcie_bus_info = {
 };
 
 static PCIBus *pci_find_bus_nr(PCIBus *bus, int bus_num);
-static void pci_update_mappings(PCIDevice *d);
 static void pci_irq_handler(void *opaque, int irq_num, int level);
 static void pci_add_option_rom(PCIDevice *pdev, bool is_default_rom, Error **);
 static void pci_del_option_rom(PCIDevice *pdev);
@@ -1366,7 +1365,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
     return new_addr;
 }
 
-static void pci_update_mappings(PCIDevice *d)
+void pci_update_mappings(PCIDevice *d)
 {
     PCIIORegion *r;
     int i;
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 4c4a60c..46606cf 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
 bool msix_is_masked(PCIDevice *dev, unsigned vector);
 void msix_set_pending(PCIDevice *dev, unsigned vector);
 void msix_clr_pending(PCIDevice *dev, int vector);
+int msix_is_pending(PCIDevice *dev, unsigned vector);
 
 int msix_vector_use(PCIDevice *dev, unsigned vector);
 void msix_vector_unuse(PCIDevice *dev, unsigned vector);
@@ -41,6 +42,10 @@ void msix_notify(PCIDevice *dev, unsigned vector);
 
 void msix_reset(PCIDevice *dev);
 
+void msix_init_vector_notifiers(PCIDevice *dev,
+                                MSIVectorUseNotifier use_notifier,
+                                MSIVectorReleaseNotifier release_notifier,
+                                MSIVectorPollNotifier poll_notifier);
 int msix_set_vector_notifiers(PCIDevice *dev,
                               MSIVectorUseNotifier use_notifier,
                               MSIVectorReleaseNotifier release_notifier,
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index d0f4266..bf5be06 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -904,5 +904,6 @@ extern const VMStateDescription vmstate_pci_device;
 }
 
 MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
+void pci_update_mappings(PCIDevice *d);
 
 #endif
-- 
1.8.3.1




* [PATCH V6 18/27] vfio-pci: refactor for cpr
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (16 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 17/27] pci: export functions for cpr Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-10 16:53   ` Alex Williamson
  2021-08-06 21:43 ` [PATCH V6 19/27] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
                   ` (10 subsequent siblings)
  28 siblings, 1 reply; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Export vfio_address_spaces and vfio_listener_skipped_section.
Add optional name arg to vfio_add_kvm_msi_virq.
Add vfio_named_notifier_init, which for now always creates a new eventfd.
Refactor vector use into a helper vfio_vector_init.
All for use by cpr in a subsequent patch.  No functional change.
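
The "create new or reuse existing eventfd" idea behind
vfio_named_notifier_init can be sketched in isolation.  The following is a
minimal, Linux-only illustration that uses plain eventfd(2) rather than
qemu's EventNotifier.  demo_notifier_init and its saved_fd parameter are
hypothetical names for illustration, not part of this series:

```c
#include <assert.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper, not from the patch: adopt saved_fd if the caller
 * has one (saved_fd >= 0), otherwise create a new eventfd.
 * Returns 0 and stores the descriptor in *fd_out, or -1 on failure.
 */
static int demo_notifier_init(int saved_fd, int *fd_out)
{
    assert(fd_out != NULL);

    if (saved_fd >= 0) {
        *fd_out = saved_fd;     /* reuse: the kernel object is unchanged */
        return 0;
    }

    *fd_out = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    return *fd_out >= 0 ? 0 : -1;
}
```

In this patch the saved descriptor is always -1, so only the "create" branch
is taken; part 2 of the series replaces the placeholder with a lookup in cpr
state.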

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/common.c              |  4 ++--
 hw/vfio/pci.c                 | 50 +++++++++++++++++++++++++++++++------------
 include/hw/vfio/vfio-common.h |  3 +++
 3 files changed, 41 insertions(+), 16 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 8728d4d..7918c0d 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -43,7 +43,7 @@
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
-static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
+VFIOAddressSpaceList vfio_address_spaces =
     QLIST_HEAD_INITIALIZER(vfio_address_spaces);
 
 #ifdef CONFIG_KVM
@@ -558,7 +558,7 @@ static int vfio_host_win_del(VFIOContainer *container, hwaddr min_iova,
     return -1;
 }
 
-static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+bool vfio_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
             !memory_region_is_iommu(section->mr)) ||
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e1ea1d8..e8e371e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -48,6 +48,20 @@
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 
+/* Create new or reuse existing eventfd */
+static int vfio_named_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
+                                    const char *name, int nr)
+{
+    int fd = -1;   /* placeholder until a subsequent patch */
+
+    if (fd >= 0) {
+        event_notifier_init_fd(e, fd);
+        return 0;
+    } else {
+        return event_notifier_init(e, 0);
+    }
+}
+
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
  * also be a huge overhead.  We try to get the best of both worlds by
@@ -410,7 +424,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
 }
 
 static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
-                                  int vector_n, bool msix)
+                                  const char *name, int nr, bool msix)
 {
     int virq;
 
@@ -418,11 +432,11 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
         return;
     }
 
-    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+    if (vfio_named_notifier_init(vdev, &vector->kvm_interrupt, name, nr)) {
         return;
     }
 
-    virq = kvm_irqchip_add_msi_route(kvm_state, vector_n, &vdev->pdev);
+    virq = kvm_irqchip_add_msi_route(kvm_state, nr, &vdev->pdev);
     if (virq < 0) {
         event_notifier_cleanup(&vector->kvm_interrupt);
         return;
@@ -454,6 +468,20 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
     kvm_irqchip_commit_routes(kvm_state);
 }
 
+static void vfio_vector_init(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+    PCIDevice *pdev = &vdev->pdev;
+
+    vector->vdev = vdev;
+    vector->virq = -1;
+    if (vfio_named_notifier_init(vdev, &vector->interrupt, name, nr)) {
+        error_report("vfio: Error: event_notifier_init failed");
+    }
+    vector->use = true;
+    msix_vector_use(pdev, nr);
+}
+
 static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                                    MSIMessage *msg, IOHandler *handler)
 {
@@ -466,13 +494,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     vector = &vdev->msi_vectors[nr];
 
     if (!vector->use) {
-        vector->vdev = vdev;
-        vector->virq = -1;
-        if (event_notifier_init(&vector->interrupt, 0)) {
-            error_report("vfio: Error: event_notifier_init failed");
-        }
-        vector->use = true;
-        msix_vector_use(pdev, nr);
+        vfio_vector_init(vdev, NULL, nr);
     }
 
     qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -490,7 +512,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
         }
     } else {
         if (msg) {
-            vfio_add_kvm_msi_virq(vdev, vector, nr, true);
+            vfio_add_kvm_msi_virq(vdev, vector, NULL, nr, true);
         }
     }
 
@@ -640,7 +662,7 @@ retry:
          * Attempt to enable route through KVM irqchip,
          * default to userspace handling if unavailable.
          */
-        vfio_add_kvm_msi_virq(vdev, vector, i, false);
+        vfio_add_kvm_msi_virq(vdev, vector, NULL, i, false);
     }
 
     /* Set interrupt type prior to possible interrupts */
@@ -2677,7 +2699,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (event_notifier_init(&vdev->err_notifier, 0)) {
+    if (vfio_named_notifier_init(vdev, &vdev->err_notifier, "err", 0)) {
         error_report("vfio: Unable to init event notifier for error detection");
         vdev->pci_aer = false;
         return;
@@ -2743,7 +2765,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (event_notifier_init(&vdev->req_notifier, 0)) {
+    if (vfio_named_notifier_init(vdev, &vdev->req_notifier, "req", 0)) {
         error_report("vfio: Unable to init event notifier for device request");
         return;
     }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8af11b0..cb04cc6 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -216,6 +216,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
 extern VFIOGroupList vfio_group_list;
+typedef QLIST_HEAD(, VFIOAddressSpace) VFIOAddressSpaceList;
+extern VFIOAddressSpaceList vfio_address_spaces;
 
 bool vfio_mig_active(void);
 int64_t vfio_mig_bytes_transferred(void);
@@ -234,6 +236,7 @@ struct vfio_info_cap_header *
 vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
+bool vfio_listener_skipped_section(MemoryRegionSection *section);
 
 int vfio_spapr_create_window(VFIOContainer *container,
                              MemoryRegionSection *section,
-- 
1.8.3.1




* [PATCH V6 19/27] vfio-pci: cpr part 1 (fd and dma)
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (17 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 18/27] vfio-pci: refactor " Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-10 17:06   ` Alex Williamson
  2021-08-06 21:43 ` [PATCH V6 20/27] vfio-pci: cpr part 2 (msi) Steve Sistare
                   ` (9 subsequent siblings)
  28 siblings, 1 reply; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Enable vfio-pci devices to be saved and restored across an exec restart
of qemu.

At vfio creation time, save the value of vfio container, group, and device
descriptors in cpr state.
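
As a rough mental model of the save/find/delete calls used throughout this
patch, cpr state can be pictured as a table keyed by (name, id).  The sketch
below is an assumption made for illustration only: the demo_* names, the
fixed-size table, and the overwrite-on-duplicate behavior are hypothetical
and do not describe the real cpr_save_fd, cpr_find_fd, or cpr_delete_fd.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_FDS 16

/* Hypothetical stand-in for cpr state: a small (name, id) -> fd table. */
struct demo_fd_entry {
    char name[64];
    int id;
    int fd;
    int used;
};

static struct demo_fd_entry demo_fds[DEMO_MAX_FDS];

/* Return the slot holding (name, id), or NULL if there is none. */
static struct demo_fd_entry *demo_lookup(const char *name, int id)
{
    for (int i = 0; i < DEMO_MAX_FDS; i++) {
        struct demo_fd_entry *e = &demo_fds[i];

        if (e->used && e->id == id && !strcmp(e->name, name)) {
            return e;
        }
    }
    return NULL;
}

/*
 * Record fd under (name, id), overwriting any previous value.
 * Returns 0, or -1 if the name is too long or the table is full.
 */
static int demo_save_fd(const char *name, int id, int fd)
{
    struct demo_fd_entry *e;

    assert(name != NULL && fd >= 0);
    if (strlen(name) >= sizeof(demo_fds[0].name)) {
        return -1;
    }
    e = demo_lookup(name, id);
    if (e) {
        e->fd = fd;
        return 0;
    }
    for (int i = 0; i < DEMO_MAX_FDS; i++) {
        e = &demo_fds[i];
        if (!e->used) {
            strcpy(e->name, name);
            e->id = id;
            e->fd = fd;
            e->used = 1;
            return 0;
        }
    }
    return -1;
}

/* Return the fd saved under (name, id), or -1 if nothing was saved. */
static int demo_find_fd(const char *name, int id)
{
    struct demo_fd_entry *e = demo_lookup(name, id);

    return e ? e->fd : -1;
}

/* Forget (name, id); a no-op if it was never saved. */
static void demo_delete_fd(const char *name, int id)
{
    struct demo_fd_entry *e = demo_lookup(name, id);

    if (e) {
        e->used = 0;
    }
}
```

A caller in the style of vfio_get_group would first try demo_find_fd, open
the file only when that returns -1, and save the descriptor only when it was
newly opened.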

In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
at a different VA after exec.  DMA to already-mapped pages continues.  Save
the msi message area as part of vfio-pci vmstate, save the interrupt and
notifier eventfd's in cpr state, and clear the close-on-exec flag for the
vfio descriptors.  The flag is not cleared earlier because the descriptors
should not persist across miscellaneous fork and exec calls that may be
performed during normal operation.
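
Clearing the close-on-exec flag just before exec can be sketched with plain
fcntl(2).  This is an illustration under assumptions, not the body of the
preserve_fd callback that qmp_cpr_exec passes to cpr_walk_fd (that body is
not part of this patch); demo_preserve_fd is a hypothetical name.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: clear FD_CLOEXEC so that fd stays open across
 * execv().  Returns 0 on success, -1 on error with errno set by fcntl.
 */
static int demo_preserve_fd(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    /* Keep any other descriptor flags; drop only close-on-exec. */
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}
```

Doing this only on the exec path keeps the descriptors close-on-exec during
normal operation, which matches the rationale given above.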

On qemu restart, vfio_realize() finds the saved descriptors in cpr state,
uses them, and notes that the device is being reused.  Device and
iommu state is already configured, so operations in vfio_realize that
would modify the configuration are skipped for a reused device, including
vfio ioctl's and writes to PCI configuration space.  The result is that
vfio_realize constructs qemu data structures that reflect the current
state of the device.  However, the reconstruction is not complete until
cpr-load is called.  cpr-load loads the msi data and finds eventfds in cpr
state.  It rebuilds vector data structures and attaches the interrupts to
the new KVM instance.  cpr-load then walks the flattened ranges of the
vfio_address_spaces and calls VFIO_IOMMU_MAP_DMA with
VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
starts the VM and suppresses vfio device reset.

This functionality is delivered by 3 patches for clarity.  Part 1 handles
device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
support.  Part 3 adds INTX support.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS                   |   1 +
 hw/pci/pci.c                  |   4 ++
 hw/vfio/common.c              |  69 ++++++++++++++++--
 hw/vfio/cpr.c                 | 160 ++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/meson.build           |   1 +
 hw/vfio/pci.c                 |  57 +++++++++++++++
 hw/vfio/trace-events          |   1 +
 include/hw/pci/pci.h          |   1 +
 include/hw/vfio/vfio-common.h |   5 ++
 include/migration/cpr.h       |   3 +
 linux-headers/linux/vfio.h    |   6 ++
 migration/cpr.c               |  10 ++-
 migration/target.c            |  14 ++++
 13 files changed, 325 insertions(+), 7 deletions(-)
 create mode 100644 hw/vfio/cpr.c

diff --git a/MAINTAINERS b/MAINTAINERS
index a9d2ed8..3132965 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2904,6 +2904,7 @@ CPR
 M: Steve Sistare <steven.sistare@oracle.com>
 M: Mark Kanda <mark.kanda@oracle.com>
 S: Maintained
+F: hw/vfio/cpr.c
 F: include/migration/cpr.h
 F: migration/cpr.c
 F: qapi/cpr.json
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 59408a3..b9c6ca1 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -307,6 +307,10 @@ static void pci_do_device_reset(PCIDevice *dev)
 {
     int r;
 
+    if (dev->reused) {
+        return;
+    }
+
     pci_device_deassert_intx(dev);
     assert(dev->irq_state == 0);
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7918c0d..872a1ac 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -31,6 +31,7 @@
 #include "exec/memory.h"
 #include "exec/ram_addr.h"
 #include "hw/hw.h"
+#include "migration/cpr.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/range.h"
@@ -464,6 +465,10 @@ static int vfio_dma_unmap(VFIOContainer *container,
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
     }
 
+    if (container->reused) {
+        return 0;
+    }
+
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
         /*
          * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -501,6 +506,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         .size = size,
     };
 
+    if (container->reused) {
+        return 0;
+    }
+
     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
     }
@@ -1872,6 +1881,10 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
     if (iommu_type < 0) {
         return iommu_type;
     }
+    if (container->reused) {
+        container->iommu_type = iommu_type;
+        return 0;
+    }
 
     ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
     if (ret) {
@@ -1972,6 +1985,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 {
     VFIOContainer *container;
     int ret, fd;
+    bool reused;
     VFIOAddressSpace *space;
 
     space = vfio_get_address_space(as);
@@ -2007,7 +2021,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
      * details once we know which type of IOMMU we are using.
      */
 
+    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
+    reused = (fd >= 0);
+
     QLIST_FOREACH(container, &space->containers, next) {
+        if (container->fd == fd) {
+            break;
+        }
         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
             ret = vfio_ram_block_discard_disable(container, true);
             if (ret) {
@@ -2020,14 +2040,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
                 }
                 return ret;
             }
-            group->container = container;
-            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+            break;
+        }
+    }
+
+    if (container) {
+        group->container = container;
+        QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+        if (!reused) {
             vfio_kvm_device_add_group(group);
-            return 0;
+            cpr_save_fd("vfio_container_for_group", group->groupid,
+                        container->fd);
         }
+        return 0;
+    }
+
+    if (!reused) {
+        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
     }
 
-    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
     if (fd < 0) {
         error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
         ret = -errno;
@@ -2045,6 +2076,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container = g_malloc0(sizeof(*container));
     container->space = space;
     container->fd = fd;
+    container->reused = reused;
     container->error = NULL;
     container->dirty_pages_supported = false;
     container->dma_max_mappings = 0;
@@ -2183,6 +2215,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }
 
     container->initialized = true;
+    cpr_save_fd("vfio_container_for_group", group->groupid, fd);
 
     return 0;
 listener_release_exit:
@@ -2212,6 +2245,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
+    cpr_delete_fd("vfio_container_for_group", group->groupid);
 
     /*
      * Explicitly release the listener first before unset container,
@@ -2253,6 +2287,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
     VFIOGroup *group;
     char path[32];
     struct vfio_group_status status = { .argsz = sizeof(status) };
+    bool reused;
 
     QLIST_FOREACH(group, &vfio_group_list, next) {
         if (group->groupid == groupid) {
@@ -2270,7 +2305,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
     group = g_malloc0(sizeof(*group));
 
     snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open_old(path, O_RDWR);
+
+    group->fd = cpr_find_fd("vfio_group", groupid);
+    reused = (group->fd >= 0);
+    if (!reused) {
+        group->fd = qemu_open_old(path, O_RDWR);
+    }
+
     if (group->fd < 0) {
         error_setg_errno(errp, errno, "failed to open %s", path);
         goto free_group_exit;
@@ -2304,6 +2345,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 
     QLIST_INSERT_HEAD(&vfio_group_list, group, next);
 
+    if (!reused) {
+        cpr_save_fd("vfio_group", groupid, group->fd);
+    }
+
     return group;
 
 close_fd_exit:
@@ -2328,6 +2373,7 @@ void vfio_put_group(VFIOGroup *group)
     vfio_disconnect_container(group);
     QLIST_REMOVE(group, next);
     trace_vfio_put_group(group->fd);
+    cpr_delete_fd("vfio_group", group->groupid);
     close(group->fd);
     g_free(group);
 
@@ -2341,8 +2387,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 {
     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     int ret, fd;
+    bool reused;
+
+    fd = cpr_find_fd(name, 0);
+    reused = (fd >= 0);
+    if (!reused) {
+        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    }
 
-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
     if (fd < 0) {
         error_setg_errno(errp, errno, "error getting device from group %d",
                          group->groupid);
@@ -2387,6 +2439,10 @@ int vfio_get_device(VFIOGroup *group, const char *name,
     vbasedev->num_irqs = dev_info.num_irqs;
     vbasedev->num_regions = dev_info.num_regions;
     vbasedev->flags = dev_info.flags;
+    vbasedev->reused = reused;
+    if (!reused) {
+        cpr_save_fd(name, 0, fd);
+    }
 
     trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
                           dev_info.num_irqs);
@@ -2403,6 +2459,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
     QLIST_REMOVE(vbasedev, next);
     vbasedev->group = NULL;
     trace_vfio_put_base_device(vbasedev->fd);
+    cpr_delete_fd(vbasedev->name, 0);
     close(vbasedev->fd);
 }
 
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
new file mode 100644
index 0000000..0981d31
--- /dev/null
+++ b/hw/vfio/cpr.c
@@ -0,0 +1,160 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "hw/vfio/vfio-common.h"
+#include "sysemu/kvm.h"
+#include "qapi/error.h"
+#include "trace.h"
+
+static int
+vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
+        .iova = 0,
+        .size = 0,
+    };
+    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
+        return -errno;
+    }
+    return 0;
+}
+
+static int vfio_dma_map_vaddr(VFIOContainer *container, hwaddr iova,
+                              ram_addr_t size, void *vaddr,
+                              Error **errp)
+{
+    struct vfio_iommu_type1_dma_map map = {
+        .argsz = sizeof(map),
+        .flags = VFIO_DMA_MAP_FLAG_VADDR,
+        .vaddr = (__u64)(uintptr_t)vaddr,
+        .iova = iova,
+        .size = size,
+    };
+    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+        error_setg_errno(errp, errno,
+                         "vfio_dma_map_vaddr(iova %lu, size %ld, va %p)",
+                         iova, size, vaddr);
+        return -errno;
+    }
+    return 0;
+}
+
+static int
+vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
+{
+    MemoryRegion *mr = section->mr;
+    VFIOContainer *container = handle;
+    const char *name = memory_region_name(mr);
+    ram_addr_t size = int128_get64(section->size);
+    hwaddr offset, iova, roundup;
+    void *vaddr;
+
+    if (vfio_listener_skipped_section(section) || memory_region_is_iommu(mr)) {
+        return 0;
+    }
+
+    offset = section->offset_within_address_space;
+    iova = REAL_HOST_PAGE_ALIGN(offset);
+    roundup = iova - offset;
+    size -= roundup;
+    size = REAL_HOST_PAGE_ALIGN(size);
+    vaddr = memory_region_get_ram_ptr(mr) +
+            section->offset_within_region + roundup;
+
+    trace_vfio_region_remap(name, container->fd, iova, iova + size - 1, vaddr);
+    return vfio_dma_map_vaddr(container, iova, size, vaddr, errp);
+}
+
+bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
+{
+    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
+        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
+        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
+                         "or VFIO_UNMAP_ALL");
+        return false;
+    } else {
+        return true;
+    }
+}
+
+int vfio_cpr_save(Error **errp)
+{
+    ERRP_GUARD();
+    VFIOAddressSpace *space, *last_space;
+    VFIOContainer *container, *last_container;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            if (!vfio_is_cpr_capable(container, errp)) {
+                return -1;
+            }
+        }
+    }
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            if (vfio_dma_unmap_vaddr_all(container, errp)) {
+                goto unwind;
+            }
+        }
+    }
+    return 0;
+
+unwind:
+    last_space = space;
+    last_container = container;
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            Error *err = NULL;
+
+            if (space == last_space && container == last_container) {
+                break;
+            }
+            if (address_space_flat_for_each_section(space->as,
+                                                    vfio_region_remap,
+                                                    container, &err)) {
+                error_prepend(errp, "%s", error_get_pretty(err));
+                error_free(err);
+            }
+        }
+    }
+    return -1;
+}
+
+int vfio_cpr_load(Error **errp)
+{
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            if (!vfio_is_cpr_capable(container, errp)) {
+                return -1;
+            }
+            container->reused = false;
+            if (address_space_flat_for_each_section(space->as,
+                                                    vfio_region_remap,
+                                                    container, errp)) {
+                return -1;
+            }
+        }
+    }
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vbasedev->reused = false;
+        }
+    }
+    return 0;
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af29..e247b2b 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -5,6 +5,7 @@ vfio_ss.add(files(
   'migration.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
+  'cpr.c',
   'display.c',
   'pci-quirks.c',
   'pci.c',
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e8e371e..64e2557 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -29,6 +29,7 @@
 #include "hw/qdev-properties.h"
 #include "hw/qdev-properties-system.h"
 #include "migration/vmstate.h"
+#include "migration/cpr.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/module.h"
@@ -2899,6 +2900,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         vfio_put_group(group);
         goto error;
     }
+    pdev->reused = vdev->vbasedev.reused;
 
     vfio_populate_device(vdev, &err);
     if (err) {
@@ -3168,6 +3170,10 @@ static void vfio_pci_reset(DeviceState *dev)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(dev);
 
+    if (vdev->pdev.reused) {
+        return;
+    }
+
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
     vfio_pci_pre_reset(vdev);
@@ -3275,6 +3281,56 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void vfio_merge_config(VFIOPCIDevice *vdev)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    int size = MIN(pci_config_size(pdev), vdev->config_size);
+    g_autofree uint8_t *phys_config = g_malloc(size);
+    uint32_t mask;
+    int ret, i;
+
+    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
+    if (ret < size) {
+        ret = ret < 0 ? errno : EFAULT;
+        error_report("failed to read device config space: %s", strerror(ret));
+        return;
+    }
+
+    for (i = 0; i < size; i++) {
+        mask = vdev->emulated_config_bits[i];
+        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
+    }
+}
+
+static int vfio_pci_post_load(void *opaque, int version_id)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+
+    vfio_merge_config(vdev);
+
+    pdev->reused = false;
+
+    return 0;
+}
+
+static bool vfio_pci_needed(void *opaque)
+{
+    return cpr_mode() == CPR_MODE_RESTART;
+}
+
+static const VMStateDescription vfio_pci_vmstate = {
+    .name = "vfio-pci",
+    .unmigratable = 1,
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .post_load = vfio_pci_post_load,
+    .needed = vfio_pci_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3282,6 +3338,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     device_class_set_props(dc, vfio_pci_dev_properties);
+    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 0ef1b5f..63dd0fe 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
+vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index bf5be06..f079423 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -360,6 +360,7 @@ struct PCIDevice {
     /* ID of standby device in net_failover pair */
     char *failover_pair_id;
     uint32_t acpi_index;
+    bool reused;
 };
 
 void pci_register_bar(PCIDevice *pci_dev, int region_num,
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index cb04cc6..0766cc4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -85,6 +85,7 @@ typedef struct VFIOContainer {
     Error *error;
     bool initialized;
     bool dirty_pages_supported;
+    bool reused;
     uint64_t dirty_pgsizes;
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
@@ -136,6 +137,7 @@ typedef struct VFIODevice {
     bool no_mmap;
     bool ram_block_discard_allowed;
     bool enable_migration;
+    bool reused;
     VFIODeviceOps *ops;
     unsigned int num_irqs;
     unsigned int num_regions;
@@ -212,6 +214,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+int vfio_cpr_save(Error **errp);
+int vfio_cpr_load(Error **errp);
+bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 83f69c9..e9b987f 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -25,4 +25,7 @@ int cpr_state_load(Error **errp);
 CprMode cpr_state_mode(void);
 void cpr_state_print(void);
 
+int cpr_vfio_save(Error **errp);
+int cpr_vfio_load(Error **errp);
+
 #endif
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index e680594..48a02c0 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -52,6 +52,12 @@
 /* Supports the vaddr flag for DMA map and unmap */
 #define VFIO_UPDATE_VADDR		10
 
+/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
+#define VFIO_UNMAP_ALL                        9
+
+/* Supports VFIO DMA map and unmap with the VADDR flag */
+#define VFIO_UPDATE_VADDR              10
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
diff --git a/migration/cpr.c b/migration/cpr.c
index 72a5f4b..16f11bd 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -7,6 +7,7 @@
 
 #include "qemu/osdep.h"
 #include "exec/memory.h"
+#include "hw/vfio/vfio-common.h"
 #include "io/channel-buffer.h"
 #include "io/channel-file.h"
 #include "migration.h"
@@ -108,7 +109,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
         error_setg(errp, "cpr-exec requires cpr-save with restart mode");
         return;
     }
-
+    if (cpr_vfio_save(errp)) {
+        return;
+    }
     cpr_walk_fd(preserve_fd, 0);
     if (cpr_state_save(errp)) {
         return;
@@ -148,6 +151,11 @@ void qmp_cpr_load(const char *filename, Error **errp)
         goto out;
     }
 
+    if (cpr_active_mode == CPR_MODE_RESTART &&
+        cpr_vfio_load(errp)) {
+        goto out;
+    }
+
     state = global_state_get_runstate();
     if (state == RUN_STATE_RUNNING) {
         vm_start();
diff --git a/migration/target.c b/migration/target.c
index 4390bf0..984bc9e 100644
--- a/migration/target.c
+++ b/migration/target.c
@@ -8,6 +8,7 @@
 #include "qemu/osdep.h"
 #include "qapi/qapi-types-migration.h"
 #include "migration.h"
+#include "migration/cpr.h"
 #include CONFIG_DEVICES
 
 #ifdef CONFIG_VFIO
@@ -22,8 +23,21 @@ void populate_vfio_info(MigrationInfo *info)
         info->vfio->transferred = vfio_mig_bytes_transferred();
     }
 }
+
+int cpr_vfio_save(Error **errp)
+{
+    return vfio_cpr_save(errp);
+}
+
+int cpr_vfio_load(Error **errp)
+{
+    return vfio_cpr_load(errp);
+}
+
 #else
 
 void populate_vfio_info(MigrationInfo *info) {}
+int cpr_vfio_save(Error **errp) { return 0; }
+int cpr_vfio_load(Error **errp) { return 0; }
 
 #endif /* CONFIG_VFIO */
-- 
1.8.3.1




* [PATCH V6 20/27] vfio-pci: cpr part 2 (msi)
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (18 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 19/27] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 21/27] vfio-pci: cpr part 3 (intx) Steve Sistare
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Finish cpr for vfio-pci MSI/MSI-X devices by preserving eventfd's and
vector state.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 107 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 64e2557..1cee52a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -49,11 +49,31 @@
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 
+static void save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+                          EventNotifier *ev)
+{
+    int fd = event_notifier_get_fd(ev);
+
+    if (fd >= 0) {
+        g_autofree char *fdname =
+            g_strdup_printf("%s_%s", vdev->vbasedev.name, name);
+        cpr_save_fd(fdname, nr, fd);
+    }
+}
+
+static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname =
+        g_strdup_printf("%s_%s", vdev->vbasedev.name, name);
+    int fd = cpr_find_fd(fdname, nr);
+    return fd;
+}
+
 /* Create new or reuse existing eventfd */
 static int vfio_named_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
                                     const char *name, int nr)
 {
-    int fd = -1;   /* placeholder until a subsequent patch */
+    int fd = name ? load_event_fd(vdev, name, nr) : -1;
 
     if (fd >= 0) {
         event_notifier_init_fd(e, fd);
@@ -2709,6 +2729,10 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->err_notifier);
     qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
 
+    if (vdev->pdev.reused) {
+        return;
+    }
+
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -2774,6 +2798,11 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->req_notifier);
     qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
 
+    if (vdev->pdev.reused) {
+        vdev->req_enabled = true;
+        return;
+    }
+
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
                            VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -3302,13 +3331,87 @@ static void vfio_merge_config(VFIOPCIDevice *vdev)
     }
 }
 
+static int vfio_pci_pre_save(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+    int i;
+
+    if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+        assert(0);      /* completed in a subsequent patch */
+    }
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+        if (vector->use) {
+            save_event_fd(vdev, "interrupt", i, &vector->interrupt);
+            if (vector->virq >= 0) {
+                save_event_fd(vdev, "kvm_interrupt", i,
+                                &vector->kvm_interrupt);
+            }
+        }
+    }
+    save_event_fd(vdev, "err", 0, &vdev->err_notifier);
+    save_event_fd(vdev, "req", 0, &vdev->req_notifier);
+    return 0;
+}
+
+static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
+{
+    int i, fd;
+    bool pending = false;
+    PCIDevice *pdev = &vdev->pdev;
+
+    vdev->nr_vectors = nr_vectors;
+    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
+    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
+
+    for (i = 0; i < nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+
+        fd = load_event_fd(vdev, "interrupt", i);
+        if (fd >= 0) {
+            vfio_vector_init(vdev, "interrupt", i);
+            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+        }
+
+        if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
+            vfio_add_kvm_msi_virq(vdev, vector, "kvm_interrupt", i, msix);
+        }
+
+        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
+            set_bit(i, vdev->msix->pending);
+            pending = true;
+        }
+    }
+
+    if (msix) {
+        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
+    }
+}
+
 static int vfio_pci_post_load(void *opaque, int version_id)
 {
     VFIOPCIDevice *vdev = opaque;
     PCIDevice *pdev = &vdev->pdev;
+    int nr_vectors;
 
     vfio_merge_config(vdev);
 
+    if (msix_enabled(pdev)) {
+        nr_vectors = vdev->msix->entries;
+        vfio_claim_vectors(vdev, nr_vectors, true);
+        msix_init_vector_notifiers(pdev, vfio_msix_vector_use,
+                                   vfio_msix_vector_release, NULL);
+
+    } else if (msi_enabled(pdev)) {
+        nr_vectors = msi_nr_vectors_allocated(pdev);
+        vfio_claim_vectors(vdev, nr_vectors, false);
+
+    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+        assert(0);      /* completed in a subsequent patch */
+    }
+
     pdev->reused = false;
 
     return 0;
@@ -3325,8 +3428,11 @@ static const VMStateDescription vfio_pci_vmstate = {
     .version_id = 0,
     .minimum_version_id = 0,
     .post_load = vfio_pci_post_load,
+    .pre_save = vfio_pci_pre_save,
     .needed = vfio_pci_needed,
     .fields = (VMStateField[]) {
+        VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+        VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
         VMSTATE_END_OF_LIST()
     }
 };
-- 
1.8.3.1




* [PATCH V6 21/27] vfio-pci: cpr part 3 (intx)
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (19 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 20/27] vfio-pci: cpr part 2 (msi) Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2022-03-29 11:03   ` Fam Zheng
  2021-08-06 21:43 ` [PATCH V6 22/27] vhost: reset vhost devices for cpr Steve Sistare
                   ` (7 subsequent siblings)
  28 siblings, 1 reply; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Preserve vfio INTX state across cpr restart.  Preserve VFIOINTx fields as
follows:
  pin : Recover this from the vfio config in kernel space.
  interrupt : Preserve its eventfd descriptor across exec.
  unmask : Ditto.
  route.irq : This could perhaps be recovered in vfio_pci_post_load by
    calling pci_device_route_intx_to_irq(pin), whose implementation reads
    config space for a bridge device such as ich9.  However, there is no
    guarantee that the bridge vmstate is read before vfio vmstate.  Rather
    than fiddling with MigrationPriority for vmstate handlers, explicitly
    save route.irq in vfio vmstate.
  pending : Save in vfio vmstate.
  mmap_timeout, mmap_timer : Re-initialize.
  kvm_accel : Re-initialize.

In vfio_realize, defer calling vfio_intx_enable until the vmstate
is available, in vfio_pci_post_load.  Modify vfio_intx_enable and
vfio_intx_kvm_enable to skip vfio initialization, but still perform
kvm initialization.
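
The split between vfio and kvm initialization can be pictured with a small
model.  This is a sketch of the control flow only: toy_intx_steps and
toy_intx_enable are hypothetical names, the "reused" flag stands for
pdev->reused, and "kvm_usable" stands for the checks that decide whether
the kvm irqfd path applies at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Which setup steps ran; a toy record, not a QEMU type. */
struct toy_intx_steps {
    bool vfio_disable_interrupts;  /* quiesce the device's interrupts */
    bool vfio_set_irq_signaling;   /* attach the eventfd in the vfio driver */
    bool kvm_irqfd;                /* register the irqfd pair with KVM */
};

static struct toy_intx_steps toy_intx_enable(bool reused, bool kvm_usable)
{
    struct toy_intx_steps s = { false, false, false };

    if (!reused) {
        s.vfio_disable_interrupts = true;
    }

    /* In both cases the interrupt eventfd is obtained here (new or preserved). */

    if (!reused) {
        /* The kernel side is still configured when the device is reused. */
        s.vfio_set_irq_signaling = true;
    }

    /* KVM state belongs to the new process, so it is always set up again. */
    s.kvm_irqfd = kvm_usable;
    return s;
}
```

In this model a reused device skips both vfio steps but still ends up with
the kvm irqfd registered, which is the behavior described above.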

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 83 insertions(+), 11 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1cee52a..7e59f4f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -145,14 +145,45 @@ static void vfio_intx_eoi(VFIODevice *vbasedev)
     vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
 }
 
+#ifdef CONFIG_KVM
+static bool vfio_no_kvm_intx(VFIOPCIDevice *vdev)
+{
+    return vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
+           vdev->intx.route.mode != PCI_INTX_ENABLED ||
+           !kvm_resamplefds_enabled();
+}
+#endif
+
+static void vfio_intx_reenable_kvm(VFIOPCIDevice *vdev, Error **errp)
+{
+#ifdef CONFIG_KVM
+    if (vfio_no_kvm_intx(vdev)) {
+        return;
+    }
+
+    if (vfio_named_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
+        error_setg(errp, "vfio_named_notifier_init failed");
+        return;
+    }
+
+    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
+                                           &vdev->intx.interrupt,
+                                           &vdev->intx.unmask,
+                                           vdev->intx.route.irq)) {
+        error_setg_errno(errp, errno, "failed to setup resample irqfd");
+        return;
+    }
+
+    vdev->intx.kvm_accel = true;
+#endif
+}
+
 static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
 {
 #ifdef CONFIG_KVM
     int irq_fd = event_notifier_get_fd(&vdev->intx.interrupt);
 
-    if (vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
-        vdev->intx.route.mode != PCI_INTX_ENABLED ||
-        !kvm_resamplefds_enabled()) {
+    if (vfio_no_kvm_intx(vdev)) {
         return;
     }
 
@@ -300,7 +331,9 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
         return 0;
     }
 
-    vfio_disable_interrupts(vdev);
+    if (!vdev->pdev.reused) {
+        vfio_disable_interrupts(vdev);
+    }
 
     vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
     pci_config_set_interrupt_pin(vdev->pdev.config, pin);
@@ -316,7 +349,8 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     }
 #endif
 
-    ret = event_notifier_init(&vdev->intx.interrupt, 0);
+    ret = vfio_named_notifier_init(vdev, &vdev->intx.interrupt,
+                                   "intx-interrupt", 0);
     if (ret) {
         error_setg_errno(errp, -ret, "event_notifier_init failed");
         return ret;
@@ -324,6 +358,11 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
 
+    if (vdev->pdev.reused) {
+        vfio_intx_reenable_kvm(vdev, &err);
+        goto finish;
+    }
+
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
@@ -336,6 +375,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
         warn_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
     }
 
+finish:
     vdev->interrupt = VFIO_INT_INTx;
 
     trace_vfio_intx_enable(vdev->vbasedev.name);
@@ -3092,9 +3132,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
                                              vfio_intx_routing_notifier);
         vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
         kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
-        ret = vfio_intx_enable(vdev, errp);
-        if (ret) {
-            goto out_deregister;
+
+        /* Wait until cpr-load reads intx routing data to enable */
+        if (!pdev->reused) {
+            ret = vfio_intx_enable(vdev, errp);
+            if (ret) {
+                goto out_deregister;
+            }
         }
     }
 
@@ -3338,7 +3382,8 @@ static int vfio_pci_pre_save(void *opaque)
     int i;
 
     if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
-        assert(0);      /* completed in a subsequent patch */
+        save_event_fd(vdev, "intx-interrupt", 0, &vdev->intx.interrupt);
+        save_event_fd(vdev, "intx-unmask", 0, &vdev->intx.unmask);
     }
 
     for (i = 0; i < vdev->nr_vectors; i++) {
@@ -3395,6 +3440,7 @@ static int vfio_pci_post_load(void *opaque, int version_id)
     VFIOPCIDevice *vdev = opaque;
     PCIDevice *pdev = &vdev->pdev;
     int nr_vectors;
+    int ret = 0;
 
     vfio_merge_config(vdev);
 
@@ -3409,12 +3455,37 @@ static int vfio_pci_post_load(void *opaque, int version_id)
         vfio_claim_vectors(vdev, nr_vectors, false);
 
     } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
-        assert(0);      /* completed in a subsequent patch */
+        Error *err = NULL;
+        ret = vfio_intx_enable(vdev, &err);
+        if (ret) {
+            error_report_err(err);
+        }
     }
 
     pdev->reused = false;
 
-    return 0;
+    return ret;
+}
+
+static const VMStateDescription vfio_intx_vmstate = {
+    .name = "vfio-intx",
+    .unmigratable = 1,
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .fields = (VMStateField[]) {
+        VMSTATE_BOOL(pending, VFIOINTx),
+        VMSTATE_UINT32(route.mode, VFIOINTx),
+        VMSTATE_INT32(route.irq, VFIOINTx),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+#define VMSTATE_VFIO_INTX(_field, _state) {                         \
+    .name       = (stringify(_field)),                              \
+    .size       = sizeof(VFIOINTx),                                 \
+    .vmsd       = &vfio_intx_vmstate,                               \
+    .flags      = VMS_STRUCT,                                       \
+    .offset     = vmstate_offset_value(_state, _field, VFIOINTx),   \
 }
 
 static bool vfio_pci_needed(void *opaque)
@@ -3433,6 +3504,7 @@ static const VMStateDescription vfio_pci_vmstate = {
     .fields = (VMStateField[]) {
         VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
         VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
+        VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
         VMSTATE_END_OF_LIST()
     }
 };
-- 
1.8.3.1




* [PATCH V6 22/27] vhost: reset vhost devices for cpr
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (20 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 21/27] vfio-pci: cpr part 3 (intx) Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 23/27] chardev: cpr framework Steve Sistare
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

A vhost device is implicitly preserved across re-exec because its fd is not
closed, and the value of the fd is specified on the command line for the
new qemu to find.  However, new qemu issues a VHOST_SET_OWNER ioctl,
which fails because the device already has an owner.  To fix this, reset
the owner prior to exec.
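
The ownership conflict can be modeled in a few lines.  This is a toy model
with hypothetical toy_* names, not the kernel or QEMU code; it assumes only
that claiming ownership of a device that already has an owner is rejected
with EBUSY, and that a reset clears the recorded owner.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of the owner state kept for one vhost device. */
struct toy_vhost_dev {
    bool has_owner;
};

/* Claim the device; fails with -EBUSY if it is already owned. */
static int toy_set_owner(struct toy_vhost_dev *d)
{
    if (d->has_owner) {
        return -EBUSY;
    }
    d->has_owner = true;
    return 0;
}

/* Drop ownership, as old qemu does just before exec. */
static void toy_reset_owner(struct toy_vhost_dev *d)
{
    d->has_owner = false;
}
```

Old qemu claims the device at startup, so without the reset the claim made
by new qemu on the same open descriptor is the second one and is refused.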

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/virtio/vhost.c         | 11 +++++++++++
 include/hw/virtio/vhost.h |  1 +
 migration/cpr.c           |  2 ++
 3 files changed, 14 insertions(+)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e8f85a5..3934178 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1832,6 +1832,17 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     hdev->vdev = NULL;
 }
 
+void vhost_dev_reset_all(void)
+{
+    struct vhost_dev *dev;
+
+    QLIST_FOREACH(dev, &vhost_devices, entry) {
+        if (dev->vhost_ops->vhost_reset_device(dev) < 0) {
+            VHOST_OPS_DEBUG("vhost_reset_device failed");
+        }
+    }
+}
+
 int vhost_net_set_backend(struct vhost_dev *hdev,
                           struct vhost_vring_file *file)
 {
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 045d0fd..facdfc2 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -108,6 +108,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
+void vhost_dev_reset_all(void);
 int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
 
diff --git a/migration/cpr.c b/migration/cpr.c
index 16f11bd..fd37d98 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -8,6 +8,7 @@
 #include "qemu/osdep.h"
 #include "exec/memory.h"
 #include "hw/vfio/vfio-common.h"
+#include "hw/virtio/vhost.h"
 #include "io/channel-buffer.h"
 #include "io/channel-file.h"
 #include "migration.h"
@@ -116,6 +117,7 @@ void qmp_cpr_exec(strList *args, Error **errp)
     if (cpr_state_save(errp)) {
         return;
     }
+    vhost_dev_reset_all();
     qemu_system_exec_request(args);
 }
 
-- 
1.8.3.1




* [PATCH V6 23/27] chardev: cpr framework
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (21 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 22/27] vhost: reset vhost devices for cpr Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 24/27] chardev: cpr for simple devices Steve Sistare
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add QEMU_CHAR_FEATURE_CPR for devices that support cpr.
Add the chardev reopen-on-cpr option for devices that can be closed on cpr
and reopened after exec.
cpr is allowed only if either QEMU_CHAR_FEATURE_CPR or reopen-on-cpr is set
for all chardevs in the configuration.
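
The rule amounts to a scan that stops at the first chardev with neither
property.  A minimal sketch, using a hypothetical toy_chardev struct in
place of Chardev and a plain array in place of the chardevs container:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_chardev {
    const char *label;
    bool feature_cpr;    /* backend declares that it survives cpr */
    bool reopen_on_cpr;  /* user asked for close and reopen instead */
};

/* Return NULL if every chardev passes, else the first one that blocks cpr. */
static const struct toy_chardev *
toy_first_cpr_blocker(const struct toy_chardev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!devs[i].feature_cpr && !devs[i].reopen_on_cpr) {
            return &devs[i];
        }
    }
    return NULL;
}
```

Returning the offending entry rather than a boolean mirrors the error
message in the patch, which names the chardev that is not capable of cpr.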

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char.c         | 43 ++++++++++++++++++++++++++++++++++++++++---
 include/chardev/char.h |  5 +++++
 migration/cpr.c        |  1 +
 qapi/char.json         |  7 ++++++-
 qemu-options.hx        | 26 ++++++++++++++++++++++----
 5 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/chardev/char.c b/chardev/char.c
index 4595a8d..013afdd 100644
--- a/chardev/char.c
+++ b/chardev/char.c
@@ -36,6 +36,7 @@
 #include "qemu/help_option.h"
 #include "qemu/module.h"
 #include "qemu/option.h"
+#include "migration/cpr.h"
 #include "qemu/id.h"
 #include "qemu/coroutine.h"
 #include "qemu/yank.h"
@@ -240,7 +241,10 @@ static void qemu_char_open(Chardev *chr, ChardevBackend *backend,
     /* Any ChardevCommon member would work */
     ChardevCommon *common = backend ? backend->u.null.data : NULL;
 
+    chr->reopen_on_cpr = (common && common->reopen_on_cpr);
+
     if (common && common->has_logfile) {
+        g_autofree char *fdname = g_strdup_printf("%s_log", chr->label);
         int flags = O_WRONLY | O_CREAT;
         if (common->has_logappend &&
             common->logappend) {
@@ -248,7 +252,13 @@ static void qemu_char_open(Chardev *chr, ChardevBackend *backend,
         } else {
             flags |= O_TRUNC;
         }
-        chr->logfd = qemu_open_old(common->logfile, flags, 0666);
+        chr->logfd = cpr_find_fd(fdname, 0);
+        if (chr->logfd < 0) {
+            chr->logfd = qemu_open_old(common->logfile, flags, 0666);
+            if (!chr->reopen_on_cpr) {
+                cpr_save_fd(fdname, 0, chr->logfd);
+            }
+        }
         if (chr->logfd < 0) {
             error_setg_errno(errp, errno,
                              "Unable to open logfile %s",
@@ -300,11 +310,13 @@ static void char_finalize(Object *obj)
     if (chr->be) {
         chr->be->chr = NULL;
     }
-    g_free(chr->filename);
-    g_free(chr->label);
     if (chr->logfd != -1) {
+        g_autofree char *fdname = g_strdup_printf("%s_log", chr->label);
+        cpr_delete_fd(fdname, 0);
         close(chr->logfd);
     }
+    g_free(chr->filename);
+    g_free(chr->label);
     qemu_mutex_destroy(&chr->chr_write_lock);
 }
 
@@ -504,6 +516,8 @@ void qemu_chr_parse_common(QemuOpts *opts, ChardevCommon *backend)
 
     backend->has_logappend = true;
     backend->logappend = qemu_opt_get_bool(opts, "logappend", false);
+
+    backend->reopen_on_cpr = qemu_opt_get_bool(opts, "reopen-on-cpr", false);
 }
 
 static const ChardevClass *char_get_class(const char *driver, Error **errp)
@@ -945,6 +959,9 @@ QemuOptsList qemu_chardev_opts = {
         },{
             .name = "abstract",
             .type = QEMU_OPT_BOOL,
+        },{
+            .name = "reopen-on-cpr",
+            .type = QEMU_OPT_BOOL,
 #endif
         },
         { /* end of list */ }
@@ -1220,6 +1237,26 @@ GSource *qemu_chr_timeout_add_ms(Chardev *chr, guint ms,
     return source;
 }
 
+static int chr_cpr_capable(Object *obj, void *opaque)
+{
+    Chardev *chr = (Chardev *)obj;
+    Error **errp = opaque;
+
+    if (qemu_chr_has_feature(chr, QEMU_CHAR_FEATURE_CPR) ||
+        chr->reopen_on_cpr) {
+        return 0;
+    }
+    error_setg(errp,
+               "chardev %s -> %s is not capable of cpr. See reopen-on-cpr",
+               chr->label, chr->filename);
+    return -1;
+}
+
+bool qemu_chr_is_cpr_capable(Error **errp)
+{
+    return !object_child_foreach(get_chardevs_root(), chr_cpr_capable, errp);
+}
+
 void qemu_chr_cleanup(void)
 {
     object_unparent(get_chardevs_root());
diff --git a/include/chardev/char.h b/include/chardev/char.h
index 7c0444f..3fa3528 100644
--- a/include/chardev/char.h
+++ b/include/chardev/char.h
@@ -50,6 +50,8 @@ typedef enum {
     /* Whether the gcontext can be changed after calling
      * qemu_chr_be_update_read_handlers() */
     QEMU_CHAR_FEATURE_GCONTEXT,
+    /* Whether the device supports cpr */
+    QEMU_CHAR_FEATURE_CPR,
 
     QEMU_CHAR_FEATURE_LAST,
 } ChardevFeature;
@@ -67,6 +69,7 @@ struct Chardev {
     int be_open;
     /* used to coordinate the chardev-change special-case: */
     bool handover_yank_instance;
+    bool reopen_on_cpr;
     GSource *gsource;
     GMainContext *gcontext;
     DECLARE_BITMAP(features, QEMU_CHAR_FEATURE_LAST);
@@ -291,4 +294,6 @@ void resume_mux_open(void);
 /* console.c */
 void qemu_chr_parse_vc(QemuOpts *opts, ChardevBackend *backend, Error **errp);
 
+bool qemu_chr_is_cpr_capable(Error **errp);
+
 #endif
diff --git a/migration/cpr.c b/migration/cpr.c
index fd37d98..62b2d51 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -6,6 +6,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "chardev/char.h"
 #include "exec/memory.h"
 #include "hw/vfio/vfio-common.h"
 #include "hw/virtio/vhost.h"
diff --git a/qapi/char.json b/qapi/char.json
index adf2685..41475dc 100644
--- a/qapi/char.json
+++ b/qapi/char.json
@@ -204,12 +204,17 @@
 # @logfile: The name of a logfile to save output
 # @logappend: true to append instead of truncate
 #             (default to false to truncate)
+# @reopen-on-cpr: if true, close the device's fd on cpr-save and reopen it
+#                 after cpr-exec. Set this to allow CPR on a device that does
+#                 not support QEMU_CHAR_FEATURE_CPR. Defaults to false.
+#                 (Since 6.2)
 #
 # Since: 2.6
 ##
 { 'struct': 'ChardevCommon',
   'data': { '*logfile': 'str',
-            '*logappend': 'bool' } }
+            '*logappend': 'bool',
+            '*reopen-on-cpr': 'bool' } }
 
 ##
 # @ChardevFile:
diff --git a/qemu-options.hx b/qemu-options.hx
index 05e206c..3f0c974 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -3185,43 +3185,57 @@ DEFHEADING(Character device options:)
 
 DEF("chardev", HAS_ARG, QEMU_OPTION_chardev,
     "-chardev help\n"
-    "-chardev null,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "-chardev null,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off][,reopen-on-cpr=on|off]\n"
     "-chardev socket,id=id[,host=host],port=port[,to=to][,ipv4=on|off][,ipv6=on|off][,nodelay=on|off][,reconnect=seconds]\n"
     "         [,server=on|off][,wait=on|off][,telnet=on|off][,websocket=on|off][,reconnect=seconds][,mux=on|off]\n"
-    "         [,logfile=PATH][,logappend=on|off][,tls-creds=ID][,tls-authz=ID] (tcp)\n"
+    "         [,logfile=PATH][,logappend=on|off][,tls-creds=ID][,tls-authz=ID][,reopen-on-cpr=on|off] (tcp)\n"
     "-chardev socket,id=id,path=path[,server=on|off][,wait=on|off][,telnet=on|off][,websocket=on|off][,reconnect=seconds]\n"
-    "         [,mux=on|off][,logfile=PATH][,logappend=on|off][,abstract=on|off][,tight=on|off] (unix)\n"
+    "         [,mux=on|off][,logfile=PATH][,logappend=on|off][,abstract=on|off][,tight=on|off][,reopen-on-cpr=on|off] (unix)\n"
     "-chardev udp,id=id[,host=host],port=port[,localaddr=localaddr]\n"
     "         [,localport=localport][,ipv4=on|off][,ipv6=on|off][,mux=on|off]\n"
-    "         [,logfile=PATH][,logappend=on|off]\n"
+    "         [,logfile=PATH][,logappend=on|off][,reopen-on-cpr=on|off]\n"
     "-chardev msmouse,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev vc,id=id[[,width=width][,height=height]][[,cols=cols][,rows=rows]]\n"
     "         [,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev ringbuf,id=id[,size=size][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev file,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev pipe,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #ifdef _WIN32
     "-chardev console,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
     "-chardev serial,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
 #else
     "-chardev pty,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev stdio,id=id[,mux=on|off][,signal=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #ifdef CONFIG_BRLAPI
     "-chardev braille,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #if defined(__linux__) || defined(__sun__) || defined(__FreeBSD__) \
         || defined(__NetBSD__) || defined(__OpenBSD__) || defined(__DragonFly__)
     "-chardev serial,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev tty,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #if defined(__linux__) || defined(__FreeBSD__) || defined(__DragonFly__)
     "-chardev parallel,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev parport,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #if defined(CONFIG_SPICE)
     "-chardev spicevmc,id=id,name=name[,debug=debug][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev spiceport,id=id,name=name[,debug=debug][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
     , QEMU_ARCH_ALL
 )
@@ -3296,6 +3310,10 @@ The general form of a character device option is:
     ``logappend`` option controls whether the log file will be truncated
     or appended to when opened.
 
+    Every backend supports the ``reopen-on-cpr`` option.  If on, the
+    device's descriptor is closed during cpr-save, and reopened after exec.
+    This is useful for devices that do not support cpr.
+
 The available backends are:
 
 ``-chardev null,id=id``
-- 
1.8.3.1




* [PATCH V6 24/27] chardev: cpr for simple devices
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (22 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 23/27] chardev: cpr framework Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:43 ` [PATCH V6 25/27] chardev: cpr for pty Steve Sistare
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Set QEMU_CHAR_FEATURE_CPR for devices that trivially support cpr.
char-stdio is slightly less trivial: call qemu_term_exit before exec.
Allow the gdb server by marking its chardev reopen-on-cpr, so that it is
closed on exec.
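
For readers unfamiliar with why stdio needs extra care, the general
pattern is to record the terminal attributes once and put them back before
handing the terminal to another program.  The sketch below uses only the
POSIX termios calls; the toy_* names are hypothetical, and the body of
term_exit is not shown in this patch, so this is an assumption about the
kind of work involved rather than a copy of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <termios.h>
#include <unistd.h>

struct toy_term_state {
    bool saved;
    struct termios attrs;
};

/* Record fd's terminal attributes; returns -1 if fd is not a terminal. */
static int toy_term_save(int fd, struct toy_term_state *st)
{
    st->saved = false;
    if (tcgetattr(fd, &st->attrs) < 0) {
        return -1;
    }
    st->saved = true;
    return 0;
}

/* Put the recorded attributes back; a no-op if nothing was recorded. */
static int toy_term_restore(int fd, const struct toy_term_state *st)
{
    if (!st->saved) {
        return 0;
    }
    return tcsetattr(fd, TCSANOW, &st->attrs);
}
```

A caller would save the state before switching the terminal to raw mode
and call the restore function just before exec, so that the next program
starts from the original settings.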

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-mux.c     | 1 +
 chardev/char-null.c    | 1 +
 chardev/char-serial.c  | 1 +
 chardev/char-stdio.c   | 8 ++++++++
 gdbstub.c              | 1 +
 include/chardev/char.h | 1 +
 migration/cpr.c        | 1 +
 7 files changed, 14 insertions(+)

diff --git a/chardev/char-mux.c b/chardev/char-mux.c
index 5baf419..bf7bad9 100644
--- a/chardev/char-mux.c
+++ b/chardev/char-mux.c
@@ -336,6 +336,7 @@ static void qemu_chr_open_mux(Chardev *chr,
      */
     *be_opened = muxes_opened;
     qemu_chr_fe_init(&d->chr, drv, errp);
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
 }
 
 static void qemu_chr_parse_mux(QemuOpts *opts, ChardevBackend *backend,
diff --git a/chardev/char-null.c b/chardev/char-null.c
index 1c6a290..02acaff 100644
--- a/chardev/char-null.c
+++ b/chardev/char-null.c
@@ -32,6 +32,7 @@ static void null_chr_open(Chardev *chr,
                           Error **errp)
 {
     *be_opened = false;
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
 }
 
 static void char_null_class_init(ObjectClass *oc, void *data)
diff --git a/chardev/char-serial.c b/chardev/char-serial.c
index 7c3d84a..b585085 100644
--- a/chardev/char-serial.c
+++ b/chardev/char-serial.c
@@ -274,6 +274,7 @@ static void qmp_chardev_open_serial(Chardev *chr,
     qemu_set_nonblock(fd);
     tty_serial_init(fd, 115200, 'N', 8, 1);
 
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
     qemu_chr_open_fd(chr, fd, fd);
 }
 #endif /* __linux__ || __sun__ */
diff --git a/chardev/char-stdio.c b/chardev/char-stdio.c
index 403da30..9410c16 100644
--- a/chardev/char-stdio.c
+++ b/chardev/char-stdio.c
@@ -114,9 +114,17 @@ static void qemu_chr_open_stdio(Chardev *chr,
 
     stdio_allow_signal = !opts->has_signal || opts->signal;
     qemu_chr_set_echo_stdio(chr, false);
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
 }
 #endif
 
+void qemu_term_exit(void)
+{
+#ifndef _WIN32
+    term_exit();
+#endif
+}
+
 static void qemu_chr_parse_stdio(QemuOpts *opts, ChardevBackend *backend,
                                  Error **errp)
 {
diff --git a/gdbstub.c b/gdbstub.c
index 52bde5b..5210a3f 100644
--- a/gdbstub.c
+++ b/gdbstub.c
@@ -3534,6 +3534,7 @@ int gdbserver_start(const char *device)
         mon_chr = gdbserver_state.mon_chr;
         reset_gdbserver_state();
     }
+    mon_chr->reopen_on_cpr = true;
 
     create_processes(&gdbserver_state);
 
diff --git a/include/chardev/char.h b/include/chardev/char.h
index 3fa3528..187c665 100644
--- a/include/chardev/char.h
+++ b/include/chardev/char.h
@@ -295,5 +295,6 @@ void resume_mux_open(void);
 void qemu_chr_parse_vc(QemuOpts *opts, ChardevBackend *backend, Error **errp);
 
 bool qemu_chr_is_cpr_capable(Error **errp);
+void qemu_term_exit(void);
 
 #endif
diff --git a/migration/cpr.c b/migration/cpr.c
index 62b2d51..d14bc5a 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -119,6 +119,7 @@ void qmp_cpr_exec(strList *args, Error **errp)
         return;
     }
     vhost_dev_reset_all();
+    qemu_term_exit();
     qemu_system_exec_request(args);
 }
 
-- 
1.8.3.1




* [PATCH V6 25/27] chardev: cpr for pty
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (23 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 24/27] chardev: cpr for simple devices Steve Sistare
@ 2021-08-06 21:43 ` Steve Sistare
  2021-08-06 21:44 ` [PATCH V6 26/27] chardev: cpr for sockets Steve Sistare
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Save and restore pty descriptors across cpr-save and cpr-load.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-pty.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/chardev/char-pty.c b/chardev/char-pty.c
index a2d1e7c..721cee9 100644
--- a/chardev/char-pty.c
+++ b/chardev/char-pty.c
@@ -30,6 +30,7 @@
 #include "qemu/sockets.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
+#include "migration/cpr.h"
 #include "qemu/qemu-print.h"
 
 #include "chardev/char-io.h"
@@ -191,6 +192,7 @@ static void char_pty_finalize(Object *obj)
     Chardev *chr = CHARDEV(obj);
     PtyChardev *s = PTY_CHARDEV(obj);
 
+    cpr_delete_fd(chr->label, 0);
     pty_chr_state(chr, 0);
     object_unref(OBJECT(s->ioc));
     pty_chr_timer_cancel(s);
@@ -207,12 +209,20 @@ static void char_pty_open(Chardev *chr,
     char pty_name[PATH_MAX];
     char *name;
 
+    master_fd = cpr_find_fd(chr->label, 0);
+    if (master_fd >= 0) {
+        chr->filename = g_strdup_printf("pty:unknown");
+        goto have_fd;
+    }
+
     master_fd = qemu_openpty_raw(&slave_fd, pty_name);
     if (master_fd < 0) {
         error_setg_errno(errp, errno, "Failed to create PTY");
         return;
     }
-
+    if (!chr->reopen_on_cpr) {
+        cpr_save_fd(chr->label, 0, master_fd);
+    }
     close(slave_fd);
     qemu_set_nonblock(master_fd);
 
@@ -220,6 +230,8 @@ static void char_pty_open(Chardev *chr,
     qemu_printf("char device redirected to %s (label %s)\n",
                 pty_name, chr->label);
 
+have_fd:
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
     s = PTY_CHARDEV(chr);
     s->ioc = QIO_CHANNEL(qio_channel_file_new_fd(master_fd));
     name = g_strdup_printf("chardev-pty-%s", chr->label);
-- 
1.8.3.1
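
The pty patch above follows a find-or-create pattern: look up a preserved
descriptor by {label, id}, reuse it if present, otherwise open a new one and
record it.  A minimal sketch of that pattern follows.  The toy_* names and the
fixed-size table are hypothetical and only stand in for cpr_find_fd,
cpr_save_fd, and cpr_delete_fd; this is not the QEMU implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the cpr fd table; not QEMU code. */
#define TOY_MAX_FDS 16
#define TOY_NAME_LEN 64

struct toy_fd_entry {
    char name[TOY_NAME_LEN];
    int id;
    int fd;
    int used;
};

static struct toy_fd_entry toy_fds[TOY_MAX_FDS];

/* Return the saved fd for {name, id}, or -1 if none was saved. */
static int toy_find_fd(const char *name, int id)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (toy_fds[i].used && toy_fds[i].id == id &&
            strcmp(toy_fds[i].name, name) == 0) {
            return toy_fds[i].fd;
        }
    }
    return -1;
}

/* Record fd under {name, id}; return 0 on success, -1 if the table is full. */
static int toy_save_fd(const char *name, int id, int fd)
{
    assert(fd >= 0);
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (!toy_fds[i].used) {
            snprintf(toy_fds[i].name, TOY_NAME_LEN, "%s", name);
            toy_fds[i].id = id;
            toy_fds[i].fd = fd;
            toy_fds[i].used = 1;
            return 0;
        }
    }
    return -1;
}

/* Forget the entry for {name, id}, if any (the finalize path). */
static void toy_delete_fd(const char *name, int id)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (toy_fds[i].used && toy_fds[i].id == id &&
            strcmp(toy_fds[i].name, name) == 0) {
            toy_fds[i].used = 0;
        }
    }
}

/*
 * Open-time pattern: reuse a preserved descriptor if one exists;
 * otherwise take the freshly created one and remember it.
 * fresh_fd stands in for the result of opening a new pty.
 */
static int toy_open(const char *label, int fresh_fd)
{
    int fd = toy_find_fd(label, 0);

    if (fd >= 0) {
        return fd;
    }
    toy_save_fd(label, 0, fresh_fd);
    return fresh_fd;
}
```

In this sketch a second toy_open() with the same label returns the first
descriptor, which mirrors how char_pty_open skips qemu_openpty_raw when
cpr_find_fd succeeds.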




* [PATCH V6 26/27] chardev: cpr for sockets
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (24 preceding siblings ...)
  2021-08-06 21:43 ` [PATCH V6 25/27] chardev: cpr for pty Steve Sistare
@ 2021-08-06 21:44 ` Steve Sistare
  2021-08-06 21:44 ` [PATCH V6 27/27] cpr: only-cpr-capable option Steve Sistare
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Save accepted socket fds before cpr-save, and look for them after cpr-load.
Reject cpr-exec if a socket enables the TLS or websocket option.  Allow a
monitor socket by closing it on exec.

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-socket.c | 32 ++++++++++++++++++++++++++++++++
 monitor/hmp.c         |  3 +++
 monitor/qmp.c         |  3 +++
 3 files changed, 38 insertions(+)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index c43668c..f6d00d8 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -27,6 +27,7 @@
 #include "io/channel-socket.h"
 #include "io/channel-tls.h"
 #include "io/channel-websock.h"
+#include "migration/cpr.h"
 #include "io/net-listener.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
@@ -414,6 +415,7 @@ static void tcp_chr_free_connection(Chardev *chr)
     SocketChardev *s = SOCKET_CHARDEV(chr);
     int i;
 
+    cpr_delete_fd(chr->label, 0);
     if (s->read_msgfds_num) {
         for (i = 0; i < s->read_msgfds_num; i++) {
             close(s->read_msgfds[i]);
@@ -976,6 +978,10 @@ static void tcp_chr_accept(QIONetListener *listener,
                                QIO_CHANNEL(cioc));
     }
     tcp_chr_new_client(chr, cioc);
+
+    if (s->sioc && !chr->reopen_on_cpr) {
+        cpr_save_fd(chr->label, 0, s->sioc->fd);
+    }
 }
 
 
@@ -1231,6 +1237,26 @@ static gboolean socket_reconnect_timeout(gpointer opaque)
     return false;
 }
 
+static int load_char_socket_fd(Chardev *chr, Error **errp)
+{
+    SocketChardev *sockchar = SOCKET_CHARDEV(chr);
+    QIOChannelSocket *sioc;
+    const char *label = chr->label;
+    int fd = cpr_find_fd(label, 0);
+
+    if (fd != -1) {
+        sockchar = SOCKET_CHARDEV(chr);
+        sioc = qio_channel_socket_new_fd(fd, errp);
+        if (sioc) {
+            tcp_chr_accept(sockchar->listener, sioc, chr);
+            object_unref(OBJECT(sioc));
+        } else {
+            error_setg(errp, "could not restore socket for %s", label);
+            return -1;
+        }
+    }
+    return 0;
+}
 
 static int qmp_chardev_open_socket_server(Chardev *chr,
                                           bool is_telnet,
@@ -1435,6 +1461,10 @@ static void qmp_chardev_open_socket(Chardev *chr,
     }
     s->registered_yank = true;
 
+    if (!s->tls_creds && !s->is_websock) {
+        qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
+    }
+
     /* be isn't opened until we get a connection */
     *be_opened = false;
 
@@ -1450,6 +1480,8 @@ static void qmp_chardev_open_socket(Chardev *chr,
             return;
         }
     }
+
+    load_char_socket_fd(chr, errp);
 }
 
 static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend *backend,
diff --git a/monitor/hmp.c b/monitor/hmp.c
index d50c312..993df18 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -1458,4 +1458,7 @@ void monitor_init_hmp(Chardev *chr, bool use_readline, Error **errp)
     qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read, monitor_read,
                              monitor_event, NULL, &mon->common, NULL, true);
     monitor_list_append(&mon->common);
+
+    /* Monitor cannot yet be preserved across cpr */
+    chr->reopen_on_cpr = true;
 }
diff --git a/monitor/qmp.c b/monitor/qmp.c
index 092c527..0043459 100644
--- a/monitor/qmp.c
+++ b/monitor/qmp.c
@@ -535,4 +535,7 @@ void monitor_init_qmp(Chardev *chr, bool pretty, Error **errp)
                                  NULL, &mon->common, NULL, true);
         monitor_list_append(&mon->common);
     }
+
+    /* Monitor cannot yet be preserved across cpr */
+    chr->reopen_on_cpr = true;
 }
-- 
1.8.3.1




* [PATCH V6 27/27] cpr: only-cpr-capable option
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (25 preceding siblings ...)
  2021-08-06 21:44 ` [PATCH V6 26/27] chardev: cpr for sockets Steve Sistare
@ 2021-08-06 21:44 ` Steve Sistare
  2021-08-09 16:02 ` [PATCH V6 00/27] Live Update Steven Sistare
  2021-08-21  8:54 ` Zheng Chuan
  28 siblings, 0 replies; 44+ messages in thread
From: Steve Sistare @ 2021-08-06 21:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add the only-cpr-capable option, which causes qemu to exit with an error
if any devices that are not capable of cpr are added.  This guarantees that
a cpr-exec operation will not fail with an unsupported device error.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS             |  1 +
 chardev/char-socket.c   |  4 ++++
 hw/vfio/common.c        |  6 ++++++
 include/sysemu/sysemu.h |  1 +
 migration/migration.c   |  5 +++++
 qemu-options.hx         |  8 ++++++++
 softmmu/globals.c       |  1 +
 softmmu/physmem.c       |  5 +++++
 softmmu/vl.c            | 14 +++++++++++++-
 stubs/cpr.c             |  3 +++
 stubs/meson.build       |  1 +
 11 files changed, 48 insertions(+), 1 deletion(-)
 create mode 100644 stubs/cpr.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3132965..1cc0f73 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2910,6 +2910,7 @@ F: migration/cpr.c
 F: qapi/cpr.json
 F: migration/cpr-state.c
 F: stubs/cpr-state.c
+F: stubs/cpr.c
 
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index f6d00d8..a6ffb93 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -39,6 +39,7 @@
 
 #include "chardev/char-io.h"
 #include "qom/object.h"
+#include "sysemu/sysemu.h"
 
 /***********************************************************/
 /* TCP Net console */
@@ -1463,6 +1464,9 @@ static void qmp_chardev_open_socket(Chardev *chr,
 
     if (!s->tls_creds && !s->is_websock) {
         qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
+    } else if (only_cpr_capable) {
+        error_setg(errp, "socket %s is not cpr capable due to %s option",
+                   chr->label, (s->tls_creds ? "TLS" : "websocket"));
     }
 
     /* be isn't opened until we get a connection */
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 872a1ac..2f8f982 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -38,6 +38,7 @@
 #include "sysemu/kvm.h"
 #include "sysemu/reset.h"
 #include "sysemu/runstate.h"
+#include "sysemu/sysemu.h"
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
@@ -1859,12 +1860,17 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
 static int vfio_get_iommu_type(VFIOContainer *container,
                                Error **errp)
 {
+    ERRP_GUARD();
     int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
                           VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
     int i;
 
     for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
         if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
+            if (only_cpr_capable && !vfio_is_cpr_capable(container, errp)) {
+                error_prepend(errp, "only-cpr-capable is specified: ");
+                return -EINVAL;
+            }
             return iommu_types[i];
         }
     }
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 8fae667..6241c20 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -9,6 +9,7 @@
 /* vl.c */
 
 extern int only_migratable;
+extern bool only_cpr_capable;
 extern const char *qemu_name;
 extern QemuUUID qemu_uuid;
 extern bool qemu_uuid_set;
diff --git a/migration/migration.c b/migration/migration.c
index 041b845..3556f01 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1249,6 +1249,11 @@ static bool migrate_caps_check(bool *cap_list,
         }
     }
 
+    if (cap_list[MIGRATION_CAPABILITY_X_COLO] && only_cpr_capable) {
+        error_setg(errp, "x-colo is not compatible with -only-cpr-capable");
+        return false;
+    }
+
     return true;
 }
 
diff --git a/qemu-options.hx b/qemu-options.hx
index 3f0c974..c47af4c 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4374,6 +4374,14 @@ SRST
     an unmigratable state.
 ERST
 
+DEF("only-cpr-capable", 0, QEMU_OPTION_only_cpr_capable, \
+    "-only-cpr-capable    allow only cpr capable devices\n", QEMU_ARCH_ALL)
+SRST
+``-only-cpr-capable``
+    Only allow cpr capable devices, which guarantees that cpr-save and
+    cpr-exec will not fail with an unsupported device error.
+ERST
+
 DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
     "-nodefaults     don't create default devices\n", QEMU_ARCH_ALL)
 SRST
diff --git a/softmmu/globals.c b/softmmu/globals.c
index 7d0fc81..a18fd8d 100644
--- a/softmmu/globals.c
+++ b/softmmu/globals.c
@@ -59,6 +59,7 @@ int boot_menu;
 bool boot_strict;
 uint8_t *boot_splash_filedata;
 int only_migratable; /* turn it off unless user states otherwise */
+bool only_cpr_capable;
 int icount_align_option;
 
 /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 2e14314..8db8a6d 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -47,6 +47,7 @@
 #include "sysemu/dma.h"
 #include "sysemu/hostmem.h"
 #include "sysemu/hw_accel.h"
+#include "sysemu/sysemu.h"
 #include "sysemu/xen-mapcache.h"
 #include "trace/trace-root.h"
 
@@ -2006,6 +2007,10 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                 addr = file_ram_alloc(new_block, maxlen, mfd,
                                       false, false, 0, errp);
                 trace_anon_memfd_alloc(name, maxlen, addr, mfd);
+            } else if (only_cpr_capable) {
+                error_setg(errp,
+                    "only-cpr-capable requires -machine memfd-alloc=on");
+                return;
             } else {
                 addr = qemu_anon_ram_alloc(maxlen, &mr->align,
                                            shared, noreserve);
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 924e8f9..7c638d8 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2695,6 +2695,10 @@ void qmp_x_exit_preconfig(Error **errp)
     qemu_create_cli_devices();
     qemu_machine_creation_done();
 
+    if (only_cpr_capable && !qemu_chr_is_cpr_capable(errp)) {
+        ;    /* not reached due to error_fatal */
+    }
+
     if (loadvm) {
         Error *local_err = NULL;
         if (!load_snapshot(loadvm, NULL, false, NULL, &local_err)) {
@@ -2704,7 +2708,12 @@ void qmp_x_exit_preconfig(Error **errp)
         }
     }
     if (replay_mode != REPLAY_MODE_NONE) {
-        replay_vmstate_init();
+        if (only_cpr_capable) {
+            error_setg(errp, "replay is not compatible with -only-cpr-capable");
+            /* not reached due to error_fatal */
+        } else {
+            replay_vmstate_init();
+        }
     }
 
     if (incoming) {
@@ -3446,6 +3455,9 @@ void qemu_init(int argc, char **argv, char **envp)
             case QEMU_OPTION_only_migratable:
                 only_migratable = 1;
                 break;
+            case QEMU_OPTION_only_cpr_capable:
+                only_cpr_capable = true;
+                break;
             case QEMU_OPTION_nodefaults:
                 has_defaults = 0;
                 break;
diff --git a/stubs/cpr.c b/stubs/cpr.c
new file mode 100644
index 0000000..aaa189e
--- /dev/null
+++ b/stubs/cpr.c
@@ -0,0 +1,3 @@
+#include "qemu/osdep.h"
+
+bool only_cpr_capable;
diff --git a/stubs/meson.build b/stubs/meson.build
index 2748508..dd9e51f 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -5,6 +5,7 @@ stub_ss.add(files('blk-exp-close-all.c'))
 stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
 stub_ss.add(files('change-state-handler.c'))
 stub_ss.add(files('cmos.c'))
+stub_ss.add(files('cpr.c'))
 stub_ss.add(files('cpr-state.c'))
 stub_ss.add(files('cpu-get-clock.c'))
 stub_ss.add(files('cpus-get-virtual-clock.c'))
-- 
1.8.3.1
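
The startup check added by this patch can be pictured as a gate that runs
only when the option is set and fails on the first device that cannot be
preserved.  The sketch below is a simplified illustration under stated
assumptions: struct toy_dev and the toy_* functions are hypothetical, and
treating a reopen-on-cpr device as acceptable is an assumption about the
chardev framework patch, not something shown in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical device record; QEMU's Chardev is far richer. */
struct toy_dev {
    const char *label;
    bool cpr_capable;    /* analogous to having QEMU_CHAR_FEATURE_CPR set */
    bool reopen_on_cpr;  /* closed and reopened instead of preserved */
};

/*
 * Return true if every device can survive cpr.  On failure, write a
 * message naming the first offending device into err.
 */
static bool toy_all_cpr_capable(const struct toy_dev *devs, size_t n,
                                char *err, size_t errlen)
{
    assert(err != NULL && errlen > 0);
    for (size_t i = 0; i < n; i++) {
        if (!devs[i].cpr_capable && !devs[i].reopen_on_cpr) {
            snprintf(err, errlen, "chardev %s is not cpr capable",
                     devs[i].label);
            return false;
        }
    }
    return true;
}

/* The gate: without the option, nothing is checked at startup. */
static bool toy_startup_ok(bool only_cpr_capable,
                           const struct toy_dev *devs, size_t n,
                           char *err, size_t errlen)
{
    if (!only_cpr_capable) {
        return true;
    }
    return toy_all_cpr_capable(devs, n, err, errlen);
}
```

Checking at startup moves the failure from cpr-exec time, when the guest is
already paused, to launch time, which is the guarantee the commit message
describes.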




* Re: [PATCH V6 00/27] Live Update
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (26 preceding siblings ...)
  2021-08-06 21:44 ` [PATCH V6 27/27] cpr: only-cpr-capable option Steve Sistare
@ 2021-08-09 16:02 ` Steven Sistare
  2021-08-21  8:54 ` Zheng Chuan
  28 siblings, 0 replies; 44+ messages in thread
From: Steven Sistare @ 2021-08-09 16:02 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

I forgot to mention in the changes list: I added a new mechanism to save fd values,
in lieu of the environment.  See [PATCH V6 13/27] cpr: preserve extra state

- Steve

On 8/6/2021 5:43 PM, Steve Sistare wrote:
> Provide the cpr-save, cpr-exec, and cpr-load commands for live update.
> These save and restore VM state, with minimal guest pause time, so that
> qemu may be updated to a new version in between.
> 
> cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
> any type of guest image and block device, but the caller must not modify
> guest block devices between cpr-save and cpr-load.  It supports two modes:
> reboot and restart.
> 
> In reboot mode, the caller invokes cpr-save and then terminates qemu.
> The caller may then update the host kernel and system software and reboot.
> The caller resumes the guest by running qemu with the same arguments as the
> original process and invoking cpr-load.  To use this mode, guest ram must be
> mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
> PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.
> 
> The reboot mode supports vfio devices if the caller first suspends the
> guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
> guest drivers' suspend methods flush outstanding requests and re-initialize
> the devices, and thus there is no device state to save and restore.
> 
> Restart mode preserves the guest VM across a restart of the qemu process.
> After cpr-save, the caller passes qemu command-line arguments to cpr-exec,
> which directly exec's the new qemu binary.  The arguments must include -S
> so new qemu starts in a paused state and waits for the cpr-load command.
> The restart mode supports vfio devices by preserving the vfio container,
> group, device, and event descriptors across the qemu re-exec, and by
> updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
> VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
> and integrated in Linux kernel 5.12.
> 
> To use the restart mode, qemu must be started with the memfd-alloc option,
> which allocates guest ram using memfd_create.  The memfd's are saved to
> the environment and kept open across exec, after which they are found from
> the environment and re-mmap'd.  Hence guest ram is preserved in place,
> albeit with new virtual addresses in the qemu process.
> 
> The caller resumes the guest by invoking cpr-load, which loads state from
> the file. If the VM was running at cpr-save time, then VM execution resumes.
> If the VM was suspended at cpr-save time (reboot mode), then the caller must
> issue a system_wakeup command to resume.
> 
> The first patches add reboot mode:
>   - memory: qemu_check_ram_volatile
>   - migration: fix populate_vfio_info
>   - migration: qemu file wrappers
>   - migration: simplify savevm
>   - vl: start on wakeup request
>   - cpr: reboot mode
>   - cpr: reboot HMP interfaces
> 
> The next patches add restart mode:
>   - memory: flat section iterator
>   - oslib: qemu_clear_cloexec
>   - machine: memfd-alloc option
>   - qapi: list utility functions
>   - vl: helper to request re-exec
>   - cpr: preserve extra state
>   - cpr: restart mode
>   - cpr: restart HMP interfaces
>   - hostmem-memfd: cpr for memory-backend-memfd
> 
> The next patches add vfio support for restart mode:
>   - pci: export functions for cpr
>   - vfio-pci: refactor for cpr
>   - vfio-pci: cpr part 1 (fd and dma)
>   - vfio-pci: cpr part 2 (msi)
>   - vfio-pci: cpr part 3 (intx)
> 
> The next patches preserve various descriptor-based backend devices across
> cpr-exec:
>   - vhost: reset vhost devices for cpr
>   - chardev: cpr framework
>   - chardev: cpr for simple devices
>   - chardev: cpr for pty
>   - chardev: cpr for sockets
>   - cpr: only-cpr-capable option
> 
> Here is an example of updating qemu from v4.2.0 to v4.2.1 using
> restart mode.  The software update is performed while the guest is
> running to minimize downtime.
> 
> window 1                                        | window 2
>                                                 |
> # qemu-system-x86_64 ...                        |
> QEMU 4.2.0 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: running                              |
>                                                 | # yum update qemu
> (qemu) cpr-save /tmp/qemu.sav restart           |
> (qemu) cpr-exec qemu-system-x86_64 -S ...       |
> QEMU 4.2.1 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: paused (prelaunch)                   |
> (qemu) cpr-load /tmp/qemu.sav                   |
> (qemu) info status                              |
> VM status: running                              |
> 
> 
> Here is an example of updating the host kernel using reboot mode.
> 
> window 1                                        | window 2
>                                                 |
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: running                              |
>                                                 | # yum update kernel-uek
> (qemu) cpr-save /tmp/qemu.sav reboot            |
> (qemu) quit                                     |
>                                                 |
> # systemctl kexec                               |
> kexec_core: Starting new kernel                 |
> ...                                             |
>                                                 |
> # qemu-system-x86_64 -S mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: paused (prelaunch)                   |
> (qemu) cpr-load /tmp/qemu.sav                   |
> (qemu) info status                              |
> VM status: running                              |
> 
> Changes from V1 to V2:
>   - revert vmstate infrastructure changes
>   - refactor cpr functions into new files
>   - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to
>     preserve memory.
>   - add framework to filter chardev's that support cpr
>   - save and restore vfio eventfd's
>   - modify cprinfo QMP interface
>   - incorporate misc review feedback
>   - remove unrelated and unneeded patches
>   - refactor all patches into a shorter and easier to review series
> 
> Changes from V2 to V3:
>   - rebase to qemu 6.0.0
>   - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
>   - change memfd-alloc to a machine option
>   - use the existing qio_channel_socket_new_fd instead of adding a new helper
>   - close monitor socket during cpr
>   - fix a few unreported bugs
>   - support memory-backend-memfd
> 
> Changes from V3 to V4:
>   - split reboot mode into separate patches
>   - add cprexec command
>   - delete QEMU_START_FREEZE, argv_main, and /usr/bin/qemu-exec
>   - add more checks for vfio and cpr compatibility, and recover after errors
>   - save vfio pci config in vmstate
>   - rename {setenv,getenv}_event_fd to {save,load}_event_fd
>   - use qemu_strtol
>   - change 6.0 references to 6.1
>   - use strerror(), use EXIT_FAILURE, remove period from error messages
>   - distribute MAINTAINERS additions to each patch
> 
> Changes from V4 to V5:
>   - rebase to master
> 
> Changes from V5 to V6:
>   vfio:
>   - delete redundant bus_master_enable_region in vfio_pci_post_load
>   - delete unmap.size warning
>   - fix phys_config memory leak
>   - add INTX support
>   - add vfio_named_notifier_init() helper
>   Other:
>   - 6.1 -> 6.2
>   - rename file -> filename in qapi
>   - delete cprinfo.  qapi introspection serves the same purpose.
>   - rename cprsave, cprexec, cprload -> cpr-save, cpr-exec, cpr-load
>   - improve documentation in qapi/cpr.json
>   - rename qemu_ram_volatile -> qemu_ram_check_volatile, and use
>     qemu_ram_foreach_block
>   - rename handle -> opaque
>   - use ERRP_GUARD
>   - use g_autoptr and g_autofree, and glib allocation functions
>   - conform to error conventions for bool and int function return values
>     and function names.
>   - remove word "error" in error messages
>   - rename as_flat_walk and its callback, and add comments.
>   - rename qemu_clr_cloexec -> qemu_clear_cloexec
>   - rename close-on-cpr -> reopen-on-cpr
>   - add strList utility functions
>   - factor out start on wakeup request to a separate patch
>   - deleted unnecessary layer (cprsave etc) and squashed QMP patches
>   - conditionally compile for CONFIG_VFIO
> 
> Steve Sistare (24):
>   memory: qemu_check_ram_volatile
>   migration: fix populate_vfio_info
>   migration: qemu file wrappers
>   migration: simplify savevm
>   vl: start on wakeup request
>   cpr: reboot mode
>   memory: flat section iterator
>   oslib: qemu_clear_cloexec
>   machine: memfd-alloc option
>   qapi: list utility functions
>   vl: helper to request re-exec
>   cpr: preserve extra state
>   cpr: restart mode
>   cpr: restart HMP interfaces
>   hostmem-memfd: cpr for memory-backend-memfd
>   pci: export functions for cpr
>   vfio-pci: refactor for cpr
>   vfio-pci: cpr part 1 (fd and dma)
>   vfio-pci: cpr part 2 (msi)
>   vfio-pci: cpr part 3 (intx)
>   chardev: cpr framework
>   chardev: cpr for simple devices
>   chardev: cpr for pty
>   cpr: only-cpr-capable option
> 
> Mark Kanda, Steve Sistare (3):
>   cpr: reboot HMP interfaces
>   vhost: reset vhost devices for cpr
>   chardev: cpr for sockets
> 
>  MAINTAINERS                   |  12 ++
>  backends/hostmem-memfd.c      |  21 +--
>  chardev/char-mux.c            |   1 +
>  chardev/char-null.c           |   1 +
>  chardev/char-pty.c            |  14 +-
>  chardev/char-serial.c         |   1 +
>  chardev/char-socket.c         |  36 +++++
>  chardev/char-stdio.c          |   8 ++
>  chardev/char.c                |  43 +++++-
>  gdbstub.c                     |   1 +
>  hmp-commands.hx               |  50 +++++++
>  hw/core/machine.c             |  19 +++
>  hw/pci/msix.c                 |  20 ++-
>  hw/pci/pci.c                  |   7 +-
>  hw/vfio/common.c              |  79 +++++++++--
>  hw/vfio/cpr.c                 | 160 ++++++++++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 | 301 +++++++++++++++++++++++++++++++++++++++---
>  hw/vfio/trace-events          |   1 +
>  hw/virtio/vhost.c             |  11 ++
>  include/chardev/char.h        |   6 +
>  include/exec/memory.h         |  39 ++++++
>  include/hw/boards.h           |   1 +
>  include/hw/pci/msix.h         |   5 +
>  include/hw/pci/pci.h          |   2 +
>  include/hw/vfio/vfio-common.h |   8 ++
>  include/hw/virtio/vhost.h     |   1 +
>  include/migration/cpr.h       |  31 +++++
>  include/monitor/hmp.h         |   3 +
>  include/qapi/util.h           |  28 ++++
>  include/qemu/osdep.h          |   1 +
>  include/sysemu/runstate.h     |   2 +
>  include/sysemu/sysemu.h       |   1 +
>  linux-headers/linux/vfio.h    |   6 +
>  migration/cpr-state.c         | 215 ++++++++++++++++++++++++++++++
>  migration/cpr.c               | 176 ++++++++++++++++++++++++
>  migration/meson.build         |   2 +
>  migration/migration.c         |   5 +
>  migration/qemu-file-channel.c |  36 +++++
>  migration/qemu-file-channel.h |   6 +
>  migration/savevm.c            |  21 +--
>  migration/target.c            |  24 +++-
>  migration/trace-events        |   5 +
>  monitor/hmp-cmds.c            |  68 ++++++----
>  monitor/hmp.c                 |   3 +
>  monitor/qmp.c                 |   3 +
>  qapi/char.json                |   7 +-
>  qapi/cpr.json                 |  76 +++++++++++
>  qapi/meson.build              |   1 +
>  qapi/qapi-schema.json         |   1 +
>  qapi/qapi-util.c              |  37 ++++++
>  qemu-options.hx               |  40 +++++-
>  softmmu/globals.c             |   1 +
>  softmmu/memory.c              |  46 +++++++
>  softmmu/physmem.c             |  55 ++++++--
>  softmmu/runstate.c            |  38 +++++-
>  softmmu/vl.c                  |  18 ++-
>  stubs/cpr-state.c             |  15 +++
>  stubs/cpr.c                   |   3 +
>  stubs/meson.build             |   2 +
>  trace-events                  |   1 +
>  util/oslib-posix.c            |   9 ++
>  util/oslib-win32.c            |   4 +
>  util/qemu-config.c            |   4 +
>  64 files changed, 1732 insertions(+), 111 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
>  create mode 100644 include/migration/cpr.h
>  create mode 100644 migration/cpr-state.c
>  create mode 100644 migration/cpr.c
>  create mode 100644 qapi/cpr.json
>  create mode 100644 stubs/cpr-state.c
>  create mode 100644 stubs/cpr.c
> 
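
The cover letter quoted above relies on descriptors staying open across exec,
which requires clearing their close-on-exec flag just before the re-exec.  A
minimal sketch of that step with standard POSIX fcntl calls follows.  The
toy_* helper names are hypothetical and are not the qemu_clear_cloexec
implementation from the series.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Return 1 if FD_CLOEXEC is set on fd, 0 if it is clear, -1 on error. */
static int toy_get_cloexec(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags < 0) {
        return -1;
    }
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Set (on != 0) or clear (on == 0) FD_CLOEXEC; return 0 or -1 on error. */
static int toy_set_cloexec(int fd, int on)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags < 0) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}
```

Keeping the flag set during normal operation and clearing it only on the
exec path avoids leaking descriptors into unrelated child processes.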



* Re: [PATCH V6 18/27] vfio-pci: refactor for cpr
  2021-08-06 21:43 ` [PATCH V6 18/27] vfio-pci: refactor " Steve Sistare
@ 2021-08-10 16:53   ` Alex Williamson
  2021-08-23 16:52     ` Steven Sistare
  0 siblings, 1 reply; 44+ messages in thread
From: Alex Williamson @ 2021-08-10 16:53 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Dr. David Alan Gilbert, Zheng Chuan, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On Fri,  6 Aug 2021 14:43:52 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:

> Export vfio_address_spaces and vfio_listener_skipped_section.
> Add optional name arg to vfio_add_kvm_msi_virq.
> Refactor vector use into a helper vfio_vector_init.
> All for use by cpr in a subsequent patch.  No functional change.

Why is the name arg optional?  It seems really inconsistent to me that
everything other than MSI/X uses this with a name, but MSI/X use NULL
and in an entirely separate pre-save step we then iterate through all
the {event,irq}fds to save them.  If we asked for a named notifier,
shouldn't we go ahead and save it under that name at that time?  ie.

static int vfio_named_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
                                    const char *name, int nr)
{
    int ret, fd = load_event_fd(vdev, name, nr);

    if (fd >= 0) {
        event_notifier_init_fd(e, fd);
    } else {
        ret = event_notifier_init(e, 0);
        if (ret) {
            return ret;
        }
        save_event_fd(vdev, name, nr, e);
    }
    return 0;
}

Are we not doing this to avoid runtime overhead?

In the process, maybe we can use more descriptive names than
"interrupt", ex. "msi" or "msix".

It also feels a bit forced to me that the entire fd saving uses {name,
id} but vfio is the only caller that makes use of a non-zero id.
Should we instead just wrap all the calls from vfio to append the id to
the name so the common code can just use strcmp()?  Thanks,

Alex
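
The last suggestion above, folding the id into the name so that common code
can compare keys with strcmp(), could look roughly like the following.  This
is only a sketch: the "name.id" key format and the toy_* helpers are
hypothetical and do not come from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build "name.id" into buf; return 0 on success, -1 if it does not fit. */
static int toy_make_key(char *buf, size_t len, const char *name, int id)
{
    int n;

    assert(buf != NULL && len > 0);
    n = snprintf(buf, len, "%s.%d", name, id);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Look up key in parallel arrays of keys and fds using plain strcmp().
 * Return the matching fd, or -1 if the key is not present.
 */
static int toy_lookup(const char *const *keys, const int *fds, size_t n,
                      const char *key)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(keys[i], key) == 0) {
            return fds[i];
        }
    }
    return -1;
}
```

With a wrapper like toy_make_key() on the vfio side, callers that never use
a non-zero id would pass a plain name and the common table would not need an
id field.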




* Re: [PATCH V6 19/27] vfio-pci: cpr part 1 (fd and dma)
  2021-08-06 21:43 ` [PATCH V6 19/27] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
@ 2021-08-10 17:06   ` Alex Williamson
  2021-08-23 19:43     ` Steven Sistare
  2021-11-10  7:48     ` Zheng Chuan
  0 siblings, 2 replies; 44+ messages in thread
From: Alex Williamson @ 2021-08-10 17:06 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Dr. David Alan Gilbert, Zheng Chuan, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On Fri,  6 Aug 2021 14:43:53 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:

> Enable vfio-pci devices to be saved and restored across an exec restart
> of qemu.
> 
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in cpr state.
> 
> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
> at a different VA after exec.  DMA to already-mapped pages continues.  Save
> the msi message area as part of vfio-pci vmstate, save the interrupt and
> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
> vfio descriptors.  The flag is not cleared earlier because the descriptors
> should not persist across miscellaneous fork and exec calls that may be
> performed during normal operation.
> 
> On qemu restart, vfio_realize() finds the descriptor env vars, uses
> the descriptors, and notes that the device is being reused.  Device and
> iommu state is already configured, so operations in vfio_realize that
> would modify the configuration are skipped for a reused device, including
> vfio ioctl's and writes to PCI configuration space.  The result is that
> vfio_realize constructs qemu data structures that reflect the current
> state of the device.  However, the reconstruction is not complete until
> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
> state.  It rebuilds vector data structures and attaches the interrupts to
> the new KVM instance.  cpr-load then walks the flattened ranges of the
> vfio_address_spaces and calls VFIO_DMA_MAP_FLAG_VADDR to inform the kernel
> of the new VA's.  Lastly, it starts the VM and suppresses vfio device reset.
> 
> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> support.  Part 3 adds INTX support.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  MAINTAINERS                   |   1 +
>  hw/pci/pci.c                  |   4 ++
>  hw/vfio/common.c              |  69 ++++++++++++++++--
>  hw/vfio/cpr.c                 | 160 ++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 |  57 +++++++++++++++
>  hw/vfio/trace-events          |   1 +
>  include/hw/pci/pci.h          |   1 +
>  include/hw/vfio/vfio-common.h |   5 ++
>  include/migration/cpr.h       |   3 +
>  linux-headers/linux/vfio.h    |   6 ++
>  migration/cpr.c               |  10 ++-
>  migration/target.c            |  14 ++++
>  13 files changed, 325 insertions(+), 7 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index a9d2ed8..3132965 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2904,6 +2904,7 @@ CPR
>  M: Steve Sistare <steven.sistare@oracle.com>
>  M: Mark Kanda <mark.kanda@oracle.com>
>  S: Maintained
> +F: hw/vfio/cpr.c
>  F: include/migration/cpr.h
>  F: migration/cpr.c
>  F: qapi/cpr.json
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 59408a3..b9c6ca1 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -307,6 +307,10 @@ static void pci_do_device_reset(PCIDevice *dev)
>  {
>      int r;
>  
> +    if (dev->reused) {
> +        return;
> +    }
> +
>      pci_device_deassert_intx(dev);
>      assert(dev->irq_state == 0);
>  
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 7918c0d..872a1ac 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -31,6 +31,7 @@
>  #include "exec/memory.h"
>  #include "exec/ram_addr.h"
>  #include "hw/hw.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/range.h"
> @@ -464,6 +465,10 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>      }
>  
> +    if (container->reused) {
> +        return 0;
> +    }
> +
>      while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>          /*
>           * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
> @@ -501,6 +506,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          .size = size,
>      };
>  
> +    if (container->reused) {
> +        return 0;
> +    }
> +
>      if (!readonly) {
>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>      }
> @@ -1872,6 +1881,10 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>      if (iommu_type < 0) {
>          return iommu_type;
>      }
> +    if (container->reused) {
> +        container->iommu_type = iommu_type;
> +        return 0;
> +    }
>  

I'd like to see more comments throughout, but particularly where we're
dumping out of functions for reused containers, groups, and devices.
For instance, in map/unmap we're assuming we'll reach the same IOMMU
mapping state we had previously.  How do we validate that?  Why can't we
only set vaddr in the mapping path rather than skipping it for a later
pass at the flatmap?  Do we actually see unmaps?  Is deferring listener
registration an alternate option?  Which specific reset path are we
trying to defer?  Why are VFIOPCIDevices the only PCIDevices that set
reused?  There are also some assumptions about the iommu_type that could
use further description, etc.

>      ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
>      if (ret) {
> @@ -1972,6 +1985,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  {
>      VFIOContainer *container;
>      int ret, fd;
> +    bool reused;
>      VFIOAddressSpace *space;
>  
>      space = vfio_get_address_space(as);
> @@ -2007,7 +2021,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       * details once we know which type of IOMMU we are using.
>       */
>  
> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
> +    reused = (fd >= 0);
> +
>      QLIST_FOREACH(container, &space->containers, next) {
> +        if (container->fd == fd) {
> +            break;
> +        }
>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {


Letting the reused case call this ioctl feels a little sloppy.  I'm
assuming we've tested this in a vIOMMU config or other setups where
we'd actually have multiple containers and we're relying on the ioctl
failing, but why call it at all if we already know the group is
attached to a container?
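
A minimal sketch of what skipping the ioctl could look like, using toy
stand-ins rather than the real VFIO types.  toy_attach() is a
hypothetical placeholder for the VFIO_GROUP_SET_CONTAINER ioctl (its
even/odd result is arbitrary), and saved_fd plays the role of the value
returned by cpr_find_fd():

```c
#include <assert.h>
#include <stddef.h>

struct toy_container {
    int fd;
};

/* Counts calls so a caller can confirm the attach path was skipped. */
static int toy_attach_calls;

/* Placeholder for the SET_CONTAINER ioctl: 0 on success, -1 on failure. */
static int toy_attach(int group_fd, int container_fd)
{
    (void)group_fd;
    toy_attach_calls++;
    return container_fd % 2 == 0 ? 0 : -1;  /* arbitrary: even fds accept */
}

/*
 * If saved_fd >= 0 the group is already attached, so only match by fd
 * and never try to attach.  Otherwise probe each container in turn.
 */
static struct toy_container *
toy_pick_container(struct toy_container *c, size_t n, int group_fd,
                   int saved_fd)
{
    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (saved_fd >= 0) {
            if (c[i].fd == saved_fd) {
                return &c[i];
            }
            continue;
        }
        if (toy_attach(group_fd, c[i].fd) == 0) {
            return &c[i];
        }
    }
    return NULL;
}
```

The reused path then no longer depends on the ioctl failing for
containers the group is not attached to.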


>              ret = vfio_ram_block_discard_disable(container, true);
>              if (ret) {
> @@ -2020,14 +2040,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>                  }
>                  return ret;
>              }
> -            group->container = container;
> -            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +            break;
> +        }
> +    }
> +
> +    if (container) {
> +        group->container = container;
> +        QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +        if (!reused) {
>              vfio_kvm_device_add_group(group);
> -            return 0;
> +            cpr_save_fd("vfio_container_for_group", group->groupid,
> +                        container->fd);
>          }
> +        return 0;
> +    }
> +
> +    if (!reused) {
> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>      }
>  
> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>          ret = -errno;
> @@ -2045,6 +2076,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      container = g_malloc0(sizeof(*container));
>      container->space = space;
>      container->fd = fd;
> +    container->reused = reused;
>      container->error = NULL;
>      container->dirty_pages_supported = false;
>      container->dma_max_mappings = 0;
> @@ -2183,6 +2215,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      container->initialized = true;
> +    cpr_save_fd("vfio_container_for_group", group->groupid, fd);
>  
>      return 0;
>  listener_release_exit:
> @@ -2212,6 +2245,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  
>      QLIST_REMOVE(group, container_next);
>      group->container = NULL;
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>  
>      /*
>       * Explicitly release the listener first before unset container,
> @@ -2253,6 +2287,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      VFIOGroup *group;
>      char path[32];
>      struct vfio_group_status status = { .argsz = sizeof(status) };
> +    bool reused;
>  
>      QLIST_FOREACH(group, &vfio_group_list, next) {
>          if (group->groupid == groupid) {
> @@ -2270,7 +2305,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      group = g_malloc0(sizeof(*group));
>  
>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open_old(path, O_RDWR);
> +
> +    group->fd = cpr_find_fd("vfio_group", groupid);
> +    reused = (group->fd >= 0);
> +    if (!reused) {
> +        group->fd = qemu_open_old(path, O_RDWR);
> +    }
> +
>      if (group->fd < 0) {
>          error_setg_errno(errp, errno, "failed to open %s", path);
>          goto free_group_exit;
> @@ -2304,6 +2345,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  
>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>  
> +    if (!reused) {
> +        cpr_save_fd("vfio_group", groupid, group->fd);
> +    }
> +
>      return group;
>  
>  close_fd_exit:
> @@ -2328,6 +2373,7 @@ void vfio_put_group(VFIOGroup *group)
>      vfio_disconnect_container(group);
>      QLIST_REMOVE(group, next);
>      trace_vfio_put_group(group->fd);
> +    cpr_delete_fd("vfio_group", group->groupid);
>      close(group->fd);
>      g_free(group);
>  
> @@ -2341,8 +2387,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>  {
>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>      int ret, fd;
> +    bool reused;
> +
> +    fd = cpr_find_fd(name, 0);
> +    reused = (fd >= 0);
> +    if (!reused) {
> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    }
>  
> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "error getting device from group %d",
>                           group->groupid);
> @@ -2387,6 +2439,10 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>      vbasedev->num_irqs = dev_info.num_irqs;
>      vbasedev->num_regions = dev_info.num_regions;
>      vbasedev->flags = dev_info.flags;
> +    vbasedev->reused = reused;
> +    if (!reused) {
> +        cpr_save_fd(name, 0, fd);
> +    }
>  
>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>                            dev_info.num_irqs);
> @@ -2403,6 +2459,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>      QLIST_REMOVE(vbasedev, next);
>      vbasedev->group = NULL;
>      trace_vfio_put_base_device(vbasedev->fd);
> +    cpr_delete_fd(vbasedev->name, 0);
>      close(vbasedev->fd);
>  }
>  
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> new file mode 100644
> index 0000000..0981d31
> --- /dev/null
> +++ b/hw/vfio/cpr.c
> @@ -0,0 +1,160 @@
> +/*
> + * Copyright (c) 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +#include "hw/vfio/vfio-common.h"
> +#include "sysemu/kvm.h"
> +#include "qapi/error.h"
> +#include "trace.h"
> +
> +static int
> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> +{
> +    struct vfio_iommu_type1_dma_unmap unmap = {
> +        .argsz = sizeof(unmap),
> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
> +        .iova = 0,
> +        .size = 0,
> +    };
> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> +        return -errno;
> +    }
> +    return 0;
> +}
> +
> +static int vfio_dma_map_vaddr(VFIOContainer *container, hwaddr iova,
> +                              ram_addr_t size, void *vaddr,
> +                              Error **errp)
> +{
> +    struct vfio_iommu_type1_dma_map map = {
> +        .argsz = sizeof(map),
> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
> +        .vaddr = (__u64)(uintptr_t)vaddr,
> +        .iova = iova,
> +        .size = size,
> +    };
> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +        error_setg_errno(errp, errno,
> +                         "vfio_dma_map_vaddr(iova %lu, size %ld, va %p)",
> +                         iova, size, vaddr);
> +        return -errno;
> +    }
> +    return 0;
> +}
> +
> +static int
> +vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
> +{
> +    MemoryRegion *mr = section->mr;
> +    VFIOContainer *container = handle;
> +    const char *name = memory_region_name(mr);
> +    ram_addr_t size = int128_get64(section->size);
> +    hwaddr offset, iova, roundup;
> +    void *vaddr;
> +
> +    if (vfio_listener_skipped_section(section) || memory_region_is_iommu(mr)) {

A comment reminding us why we're also skipping iommu regions would be
useful.  It's not clear to me why this needs to happen separately from
the listener.  There's a sufficient degree of magic here that I'm
afraid it's going to get broken too easily if it's left to me trying to
remember how it's supposed to work.

> +        return 0;
> +    }
> +
> +    offset = section->offset_within_address_space;
> +    iova = REAL_HOST_PAGE_ALIGN(offset);
> +    roundup = iova - offset;
> +    size -= roundup;
> +    size = REAL_HOST_PAGE_ALIGN(size);
> +    vaddr = memory_region_get_ram_ptr(mr) +
> +            section->offset_within_region + roundup;
> +
> +    trace_vfio_region_remap(name, container->fd, iova, iova + size - 1, vaddr);
> +    return vfio_dma_map_vaddr(container, iova, size, vaddr, errp);
> +}
> +
> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> +{
> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
> +                         "or VFIO_UNMAP_ALL");
> +        return false;
> +    } else {
> +        return true;
> +    }
> +}
> +
> +int vfio_cpr_save(Error **errp)
> +{
> +    ERRP_GUARD();
> +    VFIOAddressSpace *space, *last_space;
> +    VFIOContainer *container, *last_container;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_is_cpr_capable(container, errp)) {
> +                return -1;
> +            }
> +        }
> +    }
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
> +                goto unwind;
> +            }
> +        }
> +    }
> +    return 0;
> +
> +unwind:
> +    last_space = space;
> +    last_container = container;
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            Error *err;
> +
> +            if (space == last_space && container == last_container) {
> +                break;
> +            }

Isn't it sufficient to only test the container?  I think we'd be in
trouble if we found a container on multiple address space lists.  Too
bad we don't have a continue_reverse foreach or it might be trivial to
convert to a qtailq. 
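
A minimal sketch of an unwind that tests only the container, using toy
arrays in place of the QLISTs.  All toy_* names are hypothetical.  Since
a break inside the inner loop leaves only that loop, the sketch returns
instead, so containers in later address spaces are not touched:

```c
#include <assert.h>
#include <stddef.h>

struct toy_container {
    int id;
    int remapped;
};

struct toy_space {
    struct toy_container *containers;
    size_t n;
};

/*
 * Remap every container visited before `failed`, in list order, and
 * return how many were remapped.  Comparing the container pointer is
 * enough as long as a container lives on exactly one space's list.
 */
static int toy_unwind(struct toy_space *spaces, size_t nspaces,
                      const struct toy_container *failed)
{
    int count = 0;

    assert(spaces != NULL || nspaces == 0);
    for (size_t s = 0; s < nspaces; s++) {
        for (size_t c = 0; c < spaces[s].n; c++) {
            struct toy_container *cur = &spaces[s].containers[c];

            if (cur == failed) {
                return count;   /* leaves both loops */
            }
            cur->remapped = 1;
            count++;
        }
    }
    return count;
}
```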

> +            if (address_space_flat_for_each_section(space->as,
> +                                                    vfio_region_remap,
> +                                                    container, &err)) {
> +                error_prepend(errp, "%s", error_get_pretty(err));
> +                error_free(err);
> +            }
> +        }
> +    }
> +    return -1;
> +}
> +
> +int vfio_cpr_load(Error **errp)
> +{
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_is_cpr_capable(container, errp)) {
> +                return -1;
> +            }
> +            container->reused = false;
> +            if (address_space_flat_for_each_section(space->as,
> +                                                    vfio_region_remap,
> +                                                    container, errp)) {
> +                return -1;
> +            }
> +        }
> +    }
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vbasedev->reused = false;
> +        }
> +    }

The above is a bit disjoint between group/device and space/container,
how about walking container->group_list rather than the global group
list?
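
A minimal sketch of that walk, with toy singly-linked lists standing in
for the QLISTs.  The toy_* types and toy_clear_reused() are hypothetical
and only illustrate the traversal order (container, then its groups,
then each group's devices):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_device {
    bool reused;
    struct toy_device *next;
};

struct toy_group {
    struct toy_device *devices;
    struct toy_group *next;
};

struct toy_container {
    bool reused;
    struct toy_group *groups;
};

/*
 * Clear the reused flag on a container and on every device reachable
 * through its own group list.  Returns the number of devices cleared.
 */
static int toy_clear_reused(struct toy_container *container)
{
    int cleared = 0;

    assert(container != NULL);
    container->reused = false;
    for (struct toy_group *g = container->groups; g; g = g->next) {
        for (struct toy_device *d = g->devices; d; d = d->next) {
            d->reused = false;
            cleared++;
        }
    }
    return cleared;
}
```

Called once per container inside the existing space/container loop,
this keeps the container and its devices in step.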

> +    return 0;
> +}
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index da9af29..e247b2b 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>    'migration.c',
>  ))
>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> +  'cpr.c',
>    'display.c',
>    'pci-quirks.c',
>    'pci.c',
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index e8e371e..64e2557 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -29,6 +29,7 @@
>  #include "hw/qdev-properties.h"
>  #include "hw/qdev-properties-system.h"
>  #include "migration/vmstate.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/module.h"
> @@ -2899,6 +2900,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          vfio_put_group(group);
>          goto error;
>      }
> +    pdev->reused = vdev->vbasedev.reused;
>  
>      vfio_populate_device(vdev, &err);
>      if (err) {
> @@ -3168,6 +3170,10 @@ static void vfio_pci_reset(DeviceState *dev)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>  
> +    if (vdev->pdev.reused) {
> +        return;
> +    }

Why are we the only ones using PCIDevice.reused and why are we testing
that rather than VFIOPCIDevice.reused above?  These have different
lifecycles and the difference is too subtle, esp. w/o comments.

> +
>      trace_vfio_pci_reset(vdev->vbasedev.name);
>  
>      vfio_pci_pre_reset(vdev);
> @@ -3275,6 +3281,56 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +static void vfio_merge_config(VFIOPCIDevice *vdev)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    g_autofree uint8_t *phys_config = g_malloc(size);
> +    uint32_t mask;
> +    int ret, i;
> +
> +    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
> +    if (ret < size) {
> +        ret = ret < 0 ? errno : EFAULT;
> +        error_report("failed to read device config space: %s", strerror(ret));
> +        return;
> +    }
> +
> +    for (i = 0; i < size; i++) {
> +        mask = vdev->emulated_config_bits[i];
> +        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
> +    }
> +}
> +
> +static int vfio_pci_post_load(void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    vfio_merge_config(vdev);
> +
> +    pdev->reused = false;
> +
> +    return 0;
> +}
> +
> +static bool vfio_pci_needed(void *opaque)
> +{
> +    return cpr_mode() == CPR_MODE_RESTART;
> +}
> +
> +static const VMStateDescription vfio_pci_vmstate = {
> +    .name = "vfio-pci",
> +    .unmigratable = 1,


Doesn't this break the experimental (for now) migration support?


> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .post_load = vfio_pci_post_load,
> +    .needed = vfio_pci_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -3282,6 +3338,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  
>      dc->reset = vfio_pci_reset;
>      device_class_set_props(dc, vfio_pci_dev_properties);
> +    dc->vmsd = &vfio_pci_vmstate;
>      dc->desc = "VFIO-based PCI device assignment";
>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>      pdc->realize = vfio_realize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0ef1b5f..63dd0fe 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  vfio_dma_unmap_overflow_workaround(void) ""
> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>  
>  # platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index bf5be06..f079423 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -360,6 +360,7 @@ struct PCIDevice {
>      /* ID of standby device in net_failover pair */
>      char *failover_pair_id;
>      uint32_t acpi_index;
> +    bool reused;
>  };
>  
>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index cb04cc6..0766cc4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -85,6 +85,7 @@ typedef struct VFIOContainer {
>      Error *error;
>      bool initialized;
>      bool dirty_pages_supported;
> +    bool reused;
>      uint64_t dirty_pgsizes;
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
> @@ -136,6 +137,7 @@ typedef struct VFIODevice {
>      bool no_mmap;
>      bool ram_block_discard_allowed;
>      bool enable_migration;
> +    bool reused;
>      VFIODeviceOps *ops;
>      unsigned int num_irqs;
>      unsigned int num_regions;
> @@ -212,6 +214,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>  void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
> +int vfio_cpr_save(Error **errp);
> +int vfio_cpr_load(Error **errp);
> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp);
>  
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index 83f69c9..e9b987f 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -25,4 +25,7 @@ int cpr_state_load(Error **errp);
>  CprMode cpr_state_mode(void);
>  void cpr_state_print(void);
>  
> +int cpr_vfio_save(Error **errp);
> +int cpr_vfio_load(Error **errp);
> +
>  #endif
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index e680594..48a02c0 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -52,6 +52,12 @@
>  /* Supports the vaddr flag for DMA map and unmap */
>  #define VFIO_UPDATE_VADDR		10
           ^^^^^^^^^^^^^^^^^

It's already there.  Thanks,

Alex

>  
> +/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
> +#define VFIO_UNMAP_ALL                        9
> +
> +/* Supports VFIO DMA map and unmap with the VADDR flag */
> +#define VFIO_UPDATE_VADDR              10
> +
>  /*
>   * The IOCTL interface is designed for extensibility by embedding the
>   * structure length (argsz) and flags into structures passed between
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 72a5f4b..16f11bd 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -7,6 +7,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "exec/memory.h"
> +#include "hw/vfio/vfio-common.h"
>  #include "io/channel-buffer.h"
>  #include "io/channel-file.h"
>  #include "migration.h"
> @@ -108,7 +109,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>          return;
>      }
> -
> +    if (cpr_vfio_save(errp)) {
> +        return;
> +    }
>      cpr_walk_fd(preserve_fd, 0);
>      if (cpr_state_save(errp)) {
>          return;
> @@ -148,6 +151,11 @@ void qmp_cpr_load(const char *filename, Error **errp)
>          goto out;
>      }
>  
> +    if (cpr_active_mode == CPR_MODE_RESTART &&
> +        cpr_vfio_load(errp)) {
> +        goto out;
> +    }
> +
>      state = global_state_get_runstate();
>      if (state == RUN_STATE_RUNNING) {
>          vm_start();
> diff --git a/migration/target.c b/migration/target.c
> index 4390bf0..984bc9e 100644
> --- a/migration/target.c
> +++ b/migration/target.c
> @@ -8,6 +8,7 @@
>  #include "qemu/osdep.h"
>  #include "qapi/qapi-types-migration.h"
>  #include "migration.h"
> +#include "migration/cpr.h"
>  #include CONFIG_DEVICES
>  
>  #ifdef CONFIG_VFIO
> @@ -22,8 +23,21 @@ void populate_vfio_info(MigrationInfo *info)
>          info->vfio->transferred = vfio_mig_bytes_transferred();
>      }
>  }
> +
> +int cpr_vfio_save(Error **errp)
> +{
> +    return vfio_cpr_save(errp);
> +}
> +
> +int cpr_vfio_load(Error **errp)
> +{
> +    return vfio_cpr_load(errp);
> +}
> +
>  #else
>  
>  void populate_vfio_info(MigrationInfo *info) {}
> +int cpr_vfio_save(Error **errp) { return 0; }
> +int cpr_vfio_load(Error **errp) { return 0; }
>  
>  #endif /* CONFIG_VFIO */




* Re: [PATCH V6 00/27] Live Update
  2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
                   ` (27 preceding siblings ...)
  2021-08-09 16:02 ` [PATCH V6 00/27] Live Update Steven Sistare
@ 2021-08-21  8:54 ` Zheng Chuan
  2021-08-23 21:36   ` Steven Sistare
  28 siblings, 1 reply; 44+ messages in thread
From: Zheng Chuan @ 2021-08-21  8:54 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Hi Steve,

It seems the VM gets stuck after cpr-load in an AArch64 environment.

My AArch64 environment and test steps:
1. linux kernel: 5.14-rc6
2. QEMU version: v6.1.0-rc2 (with your patchset applied), configured with `../configure --target-list=aarch64-softmmu --disable-werror --enable-kvm`
3. Steps to live update:
# ./build/aarch64-softmmu/qemu-system-aarch64 -machine virt,accel=kvm,gic-version=3,memfd-alloc=on -nodefaults -cpu host -m 2G -smp 1 -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,readonly=on
-drive file=<path/to/vm.qcow2>,format=qcow2,if=none,id=drive_image1
-device virtio-blk-pci,id=image1,drive=drive_image1 -vnc :10 -device
virtio-gpu,id=video0 -device piix3-usb-uhci,id=usb -device
usb-tablet,id=input0,bus=usb.0,port=1 -device
usb-kbd,id=input1,bus=usb.0,port=2 -monitor stdio
(qemu) cpr-save /tmp/qemu.save restart
(qemu) cpr-exec ./build/aarch64-softmmu/qemu-system-aarch64 -machine virt,accel=kvm,gic-version=3,memfd-alloc=on -nodefaults -cpu host -m 2G -smp 1 -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,readonly=on
-drive file=<path/to/vm.qcow2>,format=qcow2,if=none,id=drive_image1
-device virtio-blk-pci,id=image1,drive=drive_image1 -vnc :10 -device
virtio-gpu,id=video0 -device piix3-usb-uhci,id=usb -device
usb-tablet,id=input0,bus=usb.0,port=1 -device
usb-kbd,id=input1,bus=usb.0,port=2 -monitor stdio -S
(qemu) QEMU 6.0.92 monitor - type 'help' for more information
(qemu) cpr-load /tmp/qemu.save

Am I missing something?

On 2021/8/7 5:43, Steve Sistare wrote:
> Provide the cpr-save, cpr-exec, and cpr-load commands for live update.
> These save and restore VM state, with minimal guest pause time, so that
> qemu may be updated to a new version in between.
> 
> cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
> any type of guest image and block device, but the caller must not modify
> guest block devices between cpr-save and cpr-load.  It supports two modes:
> reboot and restart.
> 
> In reboot mode, the caller invokes cpr-save and then terminates qemu.
> The caller may then update the host kernel and system software and reboot.
> The caller resumes the guest by running qemu with the same arguments as the
> original process and invoking cpr-load.  To use this mode, guest ram must be
> mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
> PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.
> 
> The reboot mode supports vfio devices if the caller first suspends the
> guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
> guest drivers' suspend methods flush outstanding requests and re-initialize
> the devices, and thus there is no device state to save and restore.
> 
> Restart mode preserves the guest VM across a restart of the qemu process.
> After cpr-save, the caller passes qemu command-line arguments to cpr-exec,
> which directly exec's the new qemu binary.  The arguments must include -S
> so new qemu starts in a paused state and waits for the cpr-load command.
> The restart mode supports vfio devices by preserving the vfio container,
> group, device, and event descriptors across the qemu re-exec, and by
> updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
> VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
> and integrated in Linux kernel 5.12.
> 
> To use the restart mode, qemu must be started with the memfd-alloc option,
> which allocates guest ram using memfd_create.  The memfd's are saved to
> the environment and kept open across exec, after which they are found from
> the environment and re-mmap'd.  Hence guest ram is preserved in place,
> albeit with new virtual addresses in the qemu process.
> 
> The caller resumes the guest by invoking cpr-load, which loads state from
> the file. If the VM was running at cpr-save time, then VM execution resumes.
> If the VM was suspended at cpr-save time (reboot mode), then the caller must
> issue a system_wakeup command to resume.
> 
> The first patches add reboot mode:
>   - memory: qemu_check_ram_volatile
>   - migration: fix populate_vfio_info
>   - migration: qemu file wrappers
>   - migration: simplify savevm
>   - vl: start on wakeup request
>   - cpr: reboot mode
>   - cpr: reboot HMP interfaces
> 
> The next patches add restart mode:
>   - memory: flat section iterator
>   - oslib: qemu_clear_cloexec
>   - machine: memfd-alloc option
>   - qapi: list utility functions
>   - vl: helper to request re-exec
>   - cpr: preserve extra state
>   - cpr: restart mode
>   - cpr: restart HMP interfaces
>   - hostmem-memfd: cpr for memory-backend-memfd
> 
> The next patches add vfio support for restart mode:
>   - pci: export functions for cpr
>   - vfio-pci: refactor for cpr
>   - vfio-pci: cpr part 1 (fd and dma)
>   - vfio-pci: cpr part 2 (msi)
>   - vfio-pci: cpr part 3 (intx)
> 
> The next patches preserve various descriptor-based backend devices across
> cpr-exec:
>   - vhost: reset vhost devices for cpr
>   - chardev: cpr framework
>   - chardev: cpr for simple devices
>   - chardev: cpr for pty
>   - chardev: cpr for sockets
>   - cpr: only-cpr-capable option
> 
> Here is an example of updating qemu from v4.2.0 to v4.2.1 using
> restart mode.  The software update is performed while the guest is
> running to minimize downtime.
> 
> window 1                                        | window 2
>                                                 |
> # qemu-system-x86_64 ...                        |
> QEMU 4.2.0 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: running                              |
>                                                 | # yum update qemu
> (qemu) cpr-save /tmp/qemu.sav restart           |
> (qemu) cpr-exec qemu-system-x86_64 -S ...       |
> QEMU 4.2.1 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: paused (prelaunch)                   |
> (qemu) cpr-load /tmp/qemu.sav                   |
> (qemu) info status                              |
> VM status: running                              |
> 
> 
> Here is an example of updating the host kernel using reboot mode.
> 
> window 1                                        | window 2
>                                                 |
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: running                              |
>                                                 | # yum update kernel-uek
> (qemu) cpr-save /tmp/qemu.sav reboot            |
> (qemu) quit                                     |
>                                                 |
> # systemctl kexec                               |
> kexec_core: Starting new kernel                 |
> ...                                             |
>                                                 |
> # qemu-system-x86_64 -S mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: paused (prelaunch)                   |
> (qemu) cpr-load /tmp/qemu.sav                   |
> (qemu) info status                              |
> VM status: running                              |
> 
> Changes from V1 to V2:
>   - revert vmstate infrastructure changes
>   - refactor cpr functions into new files
>   - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to
>     preserve memory.
>   - add framework to filter chardev's that support cpr
>   - save and restore vfio eventfd's
>   - modify cprinfo QMP interface
>   - incorporate misc review feedback
>   - remove unrelated and unneeded patches
>   - refactor all patches into a shorter and easier to review series
> 
> Changes from V2 to V3:
>   - rebase to qemu 6.0.0
>   - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
>   - change memfd-alloc to a machine option
>   - Use qio_channel_socket_new_fd instead of adding qio_channel_socket_new_fd
>   - close monitor socket during cpr
>   - fix a few unreported bugs
>   - support memory-backend-memfd
> 
> Changes from V3 to V4:
>   - split reboot mode into separate patches
>   - add cprexec command
>   - delete QEMU_START_FREEZE, argv_main, and /usr/bin/qemu-exec
>   - add more checks for vfio and cpr compatibility, and recover after errors
>   - save vfio pci config in vmstate
>   - rename {setenv,getenv}_event_fd to {save,load}_event_fd
>   - use qemu_strtol
>   - change 6.0 references to 6.1
>   - use strerror(), use EXIT_FAILURE, remove period from error messages
>   - distribute MAINTAINERS additions to each patch
> 
> Changes from V4 to V5:
>   - rebase to master
> 
> Changes from V5 to V6:
>   vfio:
>   - delete redundant bus_master_enable_region in vfio_pci_post_load
>   - delete unmap.size warning
>   - fix phys_config memory leak
>   - add INTX support
>   - add vfio_named_notifier_init() helper
>   Other:
>   - 6.1 -> 6.2
>   - rename file -> filename in qapi
>   - delete cprinfo.  qapi introspection serves the same purpose.
>   - rename cprsave, cprexec, cprload -> cpr-save, cpr-exec, cpr-load
>   - improve documentation in qapi/cpr.json
>   - rename qemu_ram_volatile -> qemu_ram_check_volatile, and use
>     qemu_ram_foreach_block
>   - rename handle -> opaque
>   - use ERRP_GUARD
>   - use g_autoptr and g_autofree, and glib allocation functions
>   - conform to error conventions for bool and int function return values
>     and function names.
>   - remove word "error" in error messages
>   - rename as_flat_walk and its callback, and add comments.
>   - rename qemu_clr_cloexec -> qemu_clear_cloexec
>   - rename close-on-cpr -> reopen-on-cpr
>   - add strList utility functions
>   - factor out start on wakeup request to a separate patch
>   - deleted unnecessary layer (cprsave etc) and squashed QMP patches
>   - conditionally compile for CONFIG_VFIO
> 
> Steve Sistare (24):
>   memory: qemu_check_ram_volatile
>   migration: fix populate_vfio_info
>   migration: qemu file wrappers
>   migration: simplify savevm
>   vl: start on wakeup request
>   cpr: reboot mode
>   memory: flat section iterator
>   oslib: qemu_clear_cloexec
>   machine: memfd-alloc option
>   qapi: list utility functions
>   vl: helper to request re-exec
>   cpr: preserve extra state
>   cpr: restart mode
>   cpr: restart HMP interfaces
>   hostmem-memfd: cpr for memory-backend-memfd
>   pci: export functions for cpr
>   vfio-pci: refactor for cpr
>   vfio-pci: cpr part 1 (fd and dma)
>   vfio-pci: cpr part 2 (msi)
>   vfio-pci: cpr part 3 (intx)
>   chardev: cpr framework
>   chardev: cpr for simple devices
>   chardev: cpr for pty
>   cpr: only-cpr-capable option
> 
> Mark Kanda, Steve Sistare (3):
>   cpr: reboot HMP interfaces
>   vhost: reset vhost devices for cpr
>   chardev: cpr for sockets
> 
>  MAINTAINERS                   |  12 ++
>  backends/hostmem-memfd.c      |  21 +--
>  chardev/char-mux.c            |   1 +
>  chardev/char-null.c           |   1 +
>  chardev/char-pty.c            |  14 +-
>  chardev/char-serial.c         |   1 +
>  chardev/char-socket.c         |  36 +++++
>  chardev/char-stdio.c          |   8 ++
>  chardev/char.c                |  43 +++++-
>  gdbstub.c                     |   1 +
>  hmp-commands.hx               |  50 +++++++
>  hw/core/machine.c             |  19 +++
>  hw/pci/msix.c                 |  20 ++-
>  hw/pci/pci.c                  |   7 +-
>  hw/vfio/common.c              |  79 +++++++++--
>  hw/vfio/cpr.c                 | 160 ++++++++++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 | 301 +++++++++++++++++++++++++++++++++++++++---
>  hw/vfio/trace-events          |   1 +
>  hw/virtio/vhost.c             |  11 ++
>  include/chardev/char.h        |   6 +
>  include/exec/memory.h         |  39 ++++++
>  include/hw/boards.h           |   1 +
>  include/hw/pci/msix.h         |   5 +
>  include/hw/pci/pci.h          |   2 +
>  include/hw/vfio/vfio-common.h |   8 ++
>  include/hw/virtio/vhost.h     |   1 +
>  include/migration/cpr.h       |  31 +++++
>  include/monitor/hmp.h         |   3 +
>  include/qapi/util.h           |  28 ++++
>  include/qemu/osdep.h          |   1 +
>  include/sysemu/runstate.h     |   2 +
>  include/sysemu/sysemu.h       |   1 +
>  linux-headers/linux/vfio.h    |   6 +
>  migration/cpr-state.c         | 215 ++++++++++++++++++++++++++++++
>  migration/cpr.c               | 176 ++++++++++++++++++++++++
>  migration/meson.build         |   2 +
>  migration/migration.c         |   5 +
>  migration/qemu-file-channel.c |  36 +++++
>  migration/qemu-file-channel.h |   6 +
>  migration/savevm.c            |  21 +--
>  migration/target.c            |  24 +++-
>  migration/trace-events        |   5 +
>  monitor/hmp-cmds.c            |  68 ++++++----
>  monitor/hmp.c                 |   3 +
>  monitor/qmp.c                 |   3 +
>  qapi/char.json                |   7 +-
>  qapi/cpr.json                 |  76 +++++++++++
>  qapi/meson.build              |   1 +
>  qapi/qapi-schema.json         |   1 +
>  qapi/qapi-util.c              |  37 ++++++
>  qemu-options.hx               |  40 +++++-
>  softmmu/globals.c             |   1 +
>  softmmu/memory.c              |  46 +++++++
>  softmmu/physmem.c             |  55 ++++++--
>  softmmu/runstate.c            |  38 +++++-
>  softmmu/vl.c                  |  18 ++-
>  stubs/cpr-state.c             |  15 +++
>  stubs/cpr.c                   |   3 +
>  stubs/meson.build             |   2 +
>  trace-events                  |   1 +
>  util/oslib-posix.c            |   9 ++
>  util/oslib-win32.c            |   4 +
>  util/qemu-config.c            |   4 +
>  64 files changed, 1732 insertions(+), 111 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
>  create mode 100644 include/migration/cpr.h
>  create mode 100644 migration/cpr-state.c
>  create mode 100644 migration/cpr.c
>  create mode 100644 qapi/cpr.json
>  create mode 100644 stubs/cpr-state.c
>  create mode 100644 stubs/cpr.c
> 

-- 
Regards.
Chuan



* Re: [PATCH V6 18/27] vfio-pci: refactor for cpr
  2021-08-10 16:53   ` Alex Williamson
@ 2021-08-23 16:52     ` Steven Sistare
  0 siblings, 0 replies; 44+ messages in thread
From: Steven Sistare @ 2021-08-23 16:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Dr. David Alan Gilbert, Zheng Chuan, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

Thanks for reviewing, and sorry for the delayed response; I just returned from vacation.

On 8/10/2021 12:53 PM, Alex Williamson wrote:
> On Fri,  6 Aug 2021 14:43:52 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Export vfio_address_spaces and vfio_listener_skipped_section.
>> Add optional name arg to vfio_add_kvm_msi_virq.
>> Refactor vector use into a helper vfio_vector_init.
>> All for use by cpr in a subsequent patch.  No functional change.
> 
> Why is the name arg optional?  It seems really inconsistent to me that
> everything other than MSI/X uses this with a name, but MSI/X use NULL
> and in an entirely separate pre-save step we then iterate through all
> the {event,irq}fds to save them.  If we asked for a named notifier,
> shouldn't we go ahead and save it under that name at that time?  ie.
> 
> static int vfio_named_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>                                     const char *name, int nr)
> {
>     int ret, fd = load_event_fd(vdev, name, nr);
> 
>     if (fd >= 0) {
>         event_notifier_init_fd(e, fd);
>     } else {
>         ret = event_notifier_init(e, 0);
>         if (ret) {
>             return ret;
>         }
>         save_event_fd(vdev, name, nr, e);
>     }
>     return 0;
> }
> 
> Are we not doing this to avoid runtime overhead?

OK, I will delete the pre-save function and align the life-cycle of the fd and the event
notifier. (Currently the vfio-pci code does not call cpr_delete_fd.)

> In the process, maybe we can use more descriptive names than
> "interrupt", ex. "msi" or "msix".

I chose "interrupt" and "kvm_interrupt" to match the names of the corresponding 
VFIOMSIVector fields.  Ditto for intx-interrupt, intx-unmask, err, and req, with
minor differences.

> It also feels a bit forced to me that the entire fd saving uses {name,
> id} but vfio is the only caller that makes use of a non-zero id.
> Should we instead just wrap all the calls from vfio to append the id to
> the name so the common code can just use strcmp()?  Thanks,

I liked the simplification in the vfio code, but I will remove the id if you prefer,
and add g_autoptr and g_strdup_printf to each call site.
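A minimal sketch of that alternative, in which the caller folds the id into the name so the common code compares plain strings. The example_* names and the fixed-size table are hypothetical, and snprintf stands in for g_strdup_printf so the sketch needs no glib.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define EXAMPLE_MAX_FDS  16
#define EXAMPLE_NAME_LEN 64

typedef struct ExampleFdEntry {
    char name[EXAMPLE_NAME_LEN];
    int fd;
} ExampleFdEntry;

static ExampleFdEntry example_fds[EXAMPLE_MAX_FDS];
static int example_nr_fds;

/* Common code: descriptors are keyed by name only. */
static int example_save_fd(const char *name, int fd)
{
    assert(name != NULL);
    if (example_nr_fds == EXAMPLE_MAX_FDS ||
        strlen(name) >= EXAMPLE_NAME_LEN) {
        return -1;
    }
    strcpy(example_fds[example_nr_fds].name, name);
    example_fds[example_nr_fds].fd = fd;
    example_nr_fds++;
    return 0;
}

/* Common code: lookup is a plain strcmp, with no separate id. */
static int example_find_fd(const char *name)
{
    for (int i = 0; i < example_nr_fds; i++) {
        if (!strcmp(example_fds[i].name, name)) {
            return example_fds[i].fd;
        }
    }
    return -1;
}

/* Caller-side helper: build "<name>-<nr>" into key; returns 0 or -1. */
static int example_vector_key(char *key, size_t len, const char *name, int nr)
{
    int n = snprintf(key, len, "%s-%d", name, nr);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

static int example_save_vector_fd(const char *name, int nr, int fd)
{
    char key[EXAMPLE_NAME_LEN];

    if (example_vector_key(key, sizeof(key), name, nr)) {
        return -1;
    }
    return example_save_fd(key, fd);
}

static int example_find_vector_fd(const char *name, int nr)
{
    char key[EXAMPLE_NAME_LEN];

    if (example_vector_key(key, sizeof(key), name, nr)) {
        return -1;
    }
    return example_find_fd(key);
}
```

The cost is a string format at each call site. The benefit is that callers with a single descriptor no longer pass a meaningless zero id.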

- Steve




* Re: [PATCH V6 19/27] vfio-pci: cpr part 1 (fd and dma)
  2021-08-10 17:06   ` Alex Williamson
@ 2021-08-23 19:43     ` Steven Sistare
  2021-11-10  7:48     ` Zheng Chuan
  1 sibling, 0 replies; 44+ messages in thread
From: Steven Sistare @ 2021-08-23 19:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Dr. David Alan Gilbert, Zheng Chuan, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On 8/10/2021 1:06 PM, Alex Williamson wrote:
> On Fri,  6 Aug 2021 14:43:53 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Enable vfio-pci devices to be saved and restored across an exec restart
>> of qemu.
>>
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in cpr state.
>>
>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
>> at a different VA after exec.  DMA to already-mapped pages continues.  Save
>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>> vfio descriptors.  The flag is not cleared earlier because the descriptors
>> should not persist across miscellaneous fork and exec calls that may be
>> performed during normal operation.
>>
>> On qemu restart, vfio_realize() finds the descriptor env vars, uses
>> the descriptors, and notes that the device is being reused.  Device and
>> iommu state is already configured, so operations in vfio_realize that
>> would modify the configuration are skipped for a reused device, including
>> vfio ioctl's and writes to PCI configuration space.  The result is that
>> vfio_realize constructs qemu data structures that reflect the current
>> state of the device.  However, the reconstruction is not complete until
>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>> state.  It rebuilds vector data structures and attaches the interrupts to
>> the new KVM instance.  cpr-load then walks the flattened ranges of the
>> vfio_address_spaces and calls VFIO_DMA_MAP_FLAG_VADDR to inform the kernel
>> of the new VA's.  Lastly, it starts the VM and suppresses vfio device reset.
>>
>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
>> support.  Part 3 adds INTX support.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  MAINTAINERS                   |   1 +
>>  hw/pci/pci.c                  |   4 ++
>>  hw/vfio/common.c              |  69 ++++++++++++++++--
>>  hw/vfio/cpr.c                 | 160 ++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/meson.build           |   1 +
>>  hw/vfio/pci.c                 |  57 +++++++++++++++
>>  hw/vfio/trace-events          |   1 +
>>  include/hw/pci/pci.h          |   1 +
>>  include/hw/vfio/vfio-common.h |   5 ++
>>  include/migration/cpr.h       |   3 +
>>  linux-headers/linux/vfio.h    |   6 ++
>>  migration/cpr.c               |  10 ++-
>>  migration/target.c            |  14 ++++
>>  13 files changed, 325 insertions(+), 7 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index a9d2ed8..3132965 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2904,6 +2904,7 @@ CPR
>>  M: Steve Sistare <steven.sistare@oracle.com>
>>  M: Mark Kanda <mark.kanda@oracle.com>
>>  S: Maintained
>> +F: hw/vfio/cpr.c
>>  F: include/migration/cpr.h
>>  F: migration/cpr.c
>>  F: qapi/cpr.json
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 59408a3..b9c6ca1 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -307,6 +307,10 @@ static void pci_do_device_reset(PCIDevice *dev)
>>  {
>>      int r;
>>  
>> +    if (dev->reused) {
>> +        return;
>> +    }
>> +
>>      pci_device_deassert_intx(dev);
>>      assert(dev->irq_state == 0);
>>  
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 7918c0d..872a1ac 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -31,6 +31,7 @@
>>  #include "exec/memory.h"
>>  #include "exec/ram_addr.h"
>>  #include "hw/hw.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/range.h"
>> @@ -464,6 +465,10 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>>      }
>>  
>> +    if (container->reused) {
>> +        return 0;
>> +    }
>> +
>>      while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>>          /*
>>           * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
>> @@ -501,6 +506,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>          .size = size,
>>      };
>>  
>> +    if (container->reused) {
>> +        return 0;
>> +    }
>> +
>>      if (!readonly) {
>>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>>      }
>> @@ -1872,6 +1881,10 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>>      if (iommu_type < 0) {
>>          return iommu_type;
>>      }
>> +    if (container->reused) {
>> +        container->iommu_type = iommu_type;
>> +        return 0;
>> +    }
>>  
> 
> I'd like to see more comments throughout, but particularly where we're
> dumping out of functions for reused containers, groups, and devices.
> For instance map/unmap we're assuming we'll reach the same IOMMU
> mapping state we had previously, how do we validate that, why can't we
> only set vaddr in the mapping path rather than skipping it for a later
> pass at the flatmap, do we actually see unmaps, is deferring listener
> registration an alternate option, which specific reset path are we
> trying to defer, why are VFIOPCIDevices the only PCIDevices that set
> reused, there are some assumptions about the iommu_type that could use
> further description, etc.

Will do.

>>      ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
>>      if (ret) {
>> @@ -1972,6 +1985,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>  {
>>      VFIOContainer *container;
>>      int ret, fd;
>> +    bool reused;
>>      VFIOAddressSpace *space;
>>  
>>      space = vfio_get_address_space(as);
>> @@ -2007,7 +2021,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>       * details once we know which type of IOMMU we are using.
>>       */
>>  
>> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>> +    reused = (fd >= 0);
>> +
>>      QLIST_FOREACH(container, &space->containers, next) {
>> +        if (container->fd == fd) {
>> +            break;
>> +        }
>>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> 
> Letting the reused case call this ioctl feels a little sloppy.  I'm
> assuming we've tested this in a vIOMMU config or other setups where
> we'd actually have multiple containers and we're relying on the ioctl
> failing, but why call it at all if we already know the group is
> attached to a container.

Good find, this was unintentional.  Will fix.

>>[...]
>> +++ b/hw/vfio/cpr.c
>> @@ -0,0 +1,160 @@
>> +/*
>> + * Copyright (c) 2021 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +#include "hw/vfio/vfio-common.h"
>> +#include "sysemu/kvm.h"
>> +#include "qapi/error.h"
>> +#include "trace.h"
>> +
>> +static int
>> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>> +{
>> +    struct vfio_iommu_type1_dma_unmap unmap = {
>> +        .argsz = sizeof(unmap),
>> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
>> +        .iova = 0,
>> +        .size = 0,
>> +    };
>> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
>> +        return -errno;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int vfio_dma_map_vaddr(VFIOContainer *container, hwaddr iova,
>> +                              ram_addr_t size, void *vaddr,
>> +                              Error **errp)
>> +{
>> +    struct vfio_iommu_type1_dma_map map = {
>> +        .argsz = sizeof(map),
>> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
>> +        .vaddr = (__u64)(uintptr_t)vaddr,
>> +        .iova = iova,
>> +        .size = size,
>> +    };
>> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
>> +        error_setg_errno(errp, errno,
>> +                         "vfio_dma_map_vaddr(iova %lu, size %ld, va %p)",
>> +                         iova, size, vaddr);
>> +        return -errno;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int
>> +vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
>> +{
>> +    MemoryRegion *mr = section->mr;
>> +    VFIOContainer *container = handle;
>> +    const char *name = memory_region_name(mr);
>> +    ram_addr_t size = int128_get64(section->size);
>> +    hwaddr offset, iova, roundup;
>> +    void *vaddr;
>> +
>> +    if (vfio_listener_skipped_section(section) || memory_region_is_iommu(mr)) {
> 
> A comment reminding us why we're also skipping iommu regions would be
> useful.  

OK.

> It's not clear to me why this needs to happen separately from
> the listener.  

After vfio_realize registers the address space listener, subsequent device realizations
can modify the address space.  After all realizations, qemu's picture of the address space
matches the kernel's picture of mapped dma ranges, and it is safe to call vfio_region_remap.
However, the intermediate qemu picture after each realization may not match the kernel's
picture, so we cannot allow the listener to issue incremental ioctl's.
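As a toy model of that ordering (not the vfio code): the listener path does nothing while the container is marked reused, and a single pass over the final section list runs at load time. All types and names below are hypothetical, and the counters stand in for the ioctl's that would be issued to the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ExampleSection {
    unsigned long iova;
    unsigned long size;
} ExampleSection;

typedef struct ExampleContainer {
    bool reused;            /* true from exec until cpr-load */
    int nr_map_calls;       /* stands in for ordinary DMA map ioctl's */
    int nr_vaddr_updates;   /* stands in for vaddr-update ioctl's */
} ExampleContainer;

/* Listener path: runs as each device realization changes the view. */
static int example_listener_region_add(ExampleContainer *c,
                                       const ExampleSection *s)
{
    assert(c != NULL && s != NULL);
    if (c->reused) {
        /* The kernel already holds the mappings; do nothing yet. */
        return 0;
    }
    c->nr_map_calls++;
    return 0;
}

/* Load path: one pass over the final, complete list of sections. */
static int example_cpr_load(ExampleContainer *c,
                            const ExampleSection *sections, size_t n)
{
    c->reused = false;
    for (size_t i = 0; i < n; i++) {
        (void)sections[i];
        c->nr_vaddr_updates++;
    }
    return 0;
}
```

In this model, an intermediate view seen by the listener never reaches the kernel. Only the view that exists after all realizations does.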

> There's a sufficient degree of magic here that I'm
> afraid it's going to get broken too easily if it's left to me trying to
> remember how it's supposed to work.
>> +        return 0;
>> +    }
>> +
>> +    offset = section->offset_within_address_space;
>> +    iova = REAL_HOST_PAGE_ALIGN(offset);
>> +    roundup = iova - offset;
>> +    size -= roundup;
>> +    size = REAL_HOST_PAGE_ALIGN(size);
>> +    vaddr = memory_region_get_ram_ptr(mr) +
>> +            section->offset_within_region + roundup;
>> +
>> +    trace_vfio_region_remap(name, container->fd, iova, iova + size - 1, vaddr);
>> +    return vfio_dma_map_vaddr(container, iova, size, vaddr, errp);
>> +}
>> +
>> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
>> +{
>> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
>> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
>> +                         "or VFIO_UNMAP_ALL");
>> +        return false;
>> +    } else {
>> +        return true;
>> +    }
>> +}
>> +
>> +int vfio_cpr_save(Error **errp)
>> +{
>> +    ERRP_GUARD();
>> +    VFIOAddressSpace *space, *last_space;
>> +    VFIOContainer *container, *last_container;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (!vfio_is_cpr_capable(container, errp)) {
>> +                return -1;
>> +            }
>> +        }
>> +    }
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
>> +                goto unwind;
>> +            }
>> +        }
>> +    }
>> +    return 0;
>> +
>> +unwind:
>> +    last_space = space;
>> +    last_container = container;
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            Error *err;
>> +
>> +            if (space == last_space && container == last_container) {
>> +                break;
>> +            }
> 
> Isn't it sufficient to only test the container?  I think we'd be in
> trouble if we found a container on multiple address space lists.  Too

OK.

> bad we don't have a continue_reverse foreach or it might be trivial to
> convert to a qtailq. 
> 
>> +            if (address_space_flat_for_each_section(space->as,
>> +                                                    vfio_region_remap,
>> +                                                    container, &err)) {
>> +                error_prepend(errp, "%s", error_get_pretty(err));
>> +                error_free(err);
>> +            }
>> +        }
>> +    }
>> +    return -1;
>> +}
>> +
>> +int vfio_cpr_load(Error **errp)
>> +{
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (!vfio_is_cpr_capable(container, errp)) {
>> +                return -1;
>> +            }
>> +            container->reused = false;
>> +            if (address_space_flat_for_each_section(space->as,
>> +                                                    vfio_region_remap,
>> +                                                    container, errp)) {
>> +                return -1;
>> +            }
>> +        }
>> +    }
>> +    QLIST_FOREACH(group, &vfio_group_list, next) {
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            vbasedev->reused = false;
>> +        }
>> +    }
> 
> The above is a bit disjoint between group/device and space/container,
> how about walking container->group_list rather than the global group
> list?

OK.

>> +    return 0;
>> +}
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index da9af29..e247b2b 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>>    'migration.c',
>>  ))
>>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>> +  'cpr.c',
>>    'display.c',
>>    'pci-quirks.c',
>>    'pci.c',
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index e8e371e..64e2557 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -29,6 +29,7 @@
>>  #include "hw/qdev-properties.h"
>>  #include "hw/qdev-properties-system.h"
>>  #include "migration/vmstate.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/module.h"
>> @@ -2899,6 +2900,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>          vfio_put_group(group);
>>          goto error;
>>      }
>> +    pdev->reused = vdev->vbasedev.reused;
>>  
>>      vfio_populate_device(vdev, &err);
>>      if (err) {
>> @@ -3168,6 +3170,10 @@ static void vfio_pci_reset(DeviceState *dev)
>>  {
>>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>>  
>> +    if (vdev->pdev.reused) {
>> +        return;
>> +    }
> 
> Why are we the only ones using PCIDevice.reused and why are we testing
> that rather than VFIOPCIDevice.reused above?  These have different
> lifecycles and the difference is too subtle, esp. w/o comments.

PCIDevice.reused is referenced in hw/pci/pci.c:pci_do_device_reset().

I'll add a reused field to VFIOPCIDevice for use in the vfio specific code.

>> +
>>      trace_vfio_pci_reset(vdev->vbasedev.name);
>>  
>>      vfio_pci_pre_reset(vdev);
>> @@ -3275,6 +3281,56 @@ static Property vfio_pci_dev_properties[] = {
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> +static void vfio_merge_config(VFIOPCIDevice *vdev)
>> +{
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
>> +    g_autofree uint8_t *phys_config = g_malloc(size);
>> +    uint32_t mask;
>> +    int ret, i;
>> +
>> +    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
>> +    if (ret < size) {
>> +        ret = ret < 0 ? errno : EFAULT;
>> +        error_report("failed to read device config space: %s", strerror(ret));
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < size; i++) {
>> +        mask = vdev->emulated_config_bits[i];
>> +        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
>> +    }
>> +}
>> +
>> +static int vfio_pci_post_load(void *opaque, int version_id)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    vfio_merge_config(vdev);
>> +
>> +    pdev->reused = false;
>> +
>> +    return 0;
>> +}
>> +
>> +static bool vfio_pci_needed(void *opaque)
>> +{
>> +    return cpr_mode() == CPR_MODE_RESTART;
>> +}
>> +
>> +static const VMStateDescription vfio_pci_vmstate = {
>> +    .name = "vfio-pci",
>> +    .unmigratable = 1,
> 
> Doesn't this break the experimental (for now) migration support?

Yes, thanks.  I will delete this line and add a comment.  vfio_pci_needed() guarantees
that this handler is only used for cpr.

> 
>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .post_load = vfio_pci_post_load,
>> +    .needed = vfio_pci_needed,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>  {
>>      DeviceClass *dc = DEVICE_CLASS(klass);
>> @@ -3282,6 +3338,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>  
>>      dc->reset = vfio_pci_reset;
>>      device_class_set_props(dc, vfio_pci_dev_properties);
>> +    dc->vmsd = &vfio_pci_vmstate;
>>      dc->desc = "VFIO-based PCI device assignment";
>>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>>      pdc->realize = vfio_realize;
>> [...]
>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>> index e680594..48a02c0 100644
>> --- a/linux-headers/linux/vfio.h
>> +++ b/linux-headers/linux/vfio.h
>> @@ -52,6 +52,12 @@
>>  /* Supports the vaddr flag for DMA map and unmap */
>>  #define VFIO_UPDATE_VADDR		10
>            ^^^^^^^^^^^^^^^^^
> 
> It's already there.  Thanks,

Thanks, I will delete that and the dup'd VFIO_UNMAP_ALL.

- Steve

>> +/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
>> +#define VFIO_UNMAP_ALL                        9
>> +
>> +/* Supports VFIO DMA map and unmap with the VADDR flag */
>> +#define VFIO_UPDATE_VADDR              10
>> +



* Re: [PATCH V6 00/27] Live Update
  2021-08-21  8:54 ` Zheng Chuan
@ 2021-08-23 21:36   ` Steven Sistare
  2021-08-24  9:36     ` Zheng Chuan
  0 siblings, 1 reply; 44+ messages in thread
From: Steven Sistare @ 2021-08-23 21:36 UTC (permalink / raw)
  To: Zheng Chuan, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Hi Zheng, testing aarch64 is on our todo list. We will run this case and try to 
reproduce the failure.  Thanks for the report.

- Steve

On 8/21/2021 4:54 AM, Zheng Chuan wrote:
> Hi Steve,
> 
> It seems the VM gets stuck after cpr-load in an AArch64 environment.
> 
> My AArch64 environment and test steps:
> 1. linux kernel: 5.14-rc6
> 2. QEMU version: v6.1.0-rc2 (with your patchset applied), configured with `../configure --target-list=aarch64-softmmu --disable-werror --enable-kvm`
> 4. Steps to live update:
> # ./build/aarch64-softmmu/qemu-system-aarch64 -machine virt,accel=kvm,gic-version=3,memfd-alloc=on -nodefaults -cpu host -m 2G -smp 1 -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,readonly=on
> -drive file=<path/to/vm.qcow2>,format=qcow2,if=none,id=drive_image1
> -device virtio-blk-pci,id=image1,drive=drive_image1 -vnc :10 -device
> virtio-gpu,id=video0 -device piix3-usb-uhci,id=usb -device
> usb-tablet,id=input0,bus=usb.0,port=1 -device
> usb-kbd,id=input1,bus=usb.0,port=2 -monitor stdio
> (qemu) cpr-save /tmp/qemu.save restart
> (qemu) cpr-exec ./build/aarch64-softmmu/qemu-system-aarch64 -machine virt,accel=kvm,gic-version=3,memfd-alloc=on -nodefaults -cpu host -m 2G -smp 1 -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,readonly=on
> -drive file=<path/to/vm.qcow2>,format=qcow2,if=none,id=drive_image1
> -device virtio-blk-pci,id=image1,drive=drive_image1 -vnc :10 -device
> virtio-gpu,id=video0 -device piix3-usb-uhci,id=usb -device
> usb-tablet,id=input0,bus=usb.0,port=1 -device
> usb-kbd,id=input1,bus=usb.0,port=2 -monitor stdio -S
> (qemu) QEMU 6.0.92 monitor - type 'help' for more information
> (qemu) cpr-load /tmp/qemu.save
> 
> Am I missing something?
> 
> On 2021/8/7 5:43, Steve Sistare wrote:
>> Provide the cpr-save, cpr-exec, and cpr-load commands for live update.
>> These save and restore VM state, with minimal guest pause time, so that
>> qemu may be updated to a new version in between.
>>
>> cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
>> any type of guest image and block device, but the caller must not modify
>> guest block devices between cpr-save and cpr-load.  It supports two modes:
>> reboot and restart.
>>
>> In reboot mode, the caller invokes cpr-save and then terminates qemu.
>> The caller may then update the host kernel and system software and reboot.
>> The caller resumes the guest by running qemu with the same arguments as the
>> original process and invoking cpr-load.  To use this mode, guest ram must be
>> mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
>> PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.
>>
>> The reboot mode supports vfio devices if the caller first suspends the
>> guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
>> guest drivers' suspend methods flush outstanding requests and re-initialize
>> the devices, and thus there is no device state to save and restore.
>>
>> Restart mode preserves the guest VM across a restart of the qemu process.
>> After cpr-save, the caller passes qemu command-line arguments to cpr-exec,
>> which directly exec's the new qemu binary.  The arguments must include -S
>> so new qemu starts in a paused state and waits for the cpr-load command.
>> The restart mode supports vfio devices by preserving the vfio container,
>> group, device, and event descriptors across the qemu re-exec, and by
>> updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
>> VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
>> and integrated in Linux kernel 5.12.
>>
>> To use the restart mode, qemu must be started with the memfd-alloc option,
>> which allocates guest ram using memfd_create.  The memfd's are saved to
>> the environment and kept open across exec, after which they are found from
>> the environment and re-mmap'd.  Hence guest ram is preserved in place,
>> albeit with new virtual addresses in the qemu process.
>>
>> The caller resumes the guest by invoking cpr-load, which loads state from
>> the file. If the VM was running at cpr-save time, then VM execution resumes.
>> If the VM was suspended at cpr-save time (reboot mode), then the caller must
>> issue a system_wakeup command to resume.
>>
>> The first patches add reboot mode:
>>   - memory: qemu_check_ram_volatile
>>   - migration: fix populate_vfio_info
>>   - migration: qemu file wrappers
>>   - migration: simplify savevm
>>   - vl: start on wakeup request
>>   - cpr: reboot mode
>>   - cpr: reboot HMP interfaces
>>
>> The next patches add restart mode:
>>   - memory: flat section iterator
>>   - oslib: qemu_clear_cloexec
>>   - machine: memfd-alloc option
>>   - qapi: list utility functions
>>   - vl: helper to request re-exec
>>   - cpr: preserve extra state
>>   - cpr: restart mode
>>   - cpr: restart HMP interfaces
>>   - hostmem-memfd: cpr for memory-backend-memfd
>>
>> The next patches add vfio support for restart mode:
>>   - pci: export functions for cpr
>>   - vfio-pci: refactor for cpr
>>   - vfio-pci: cpr part 1 (fd and dma)
>>   - vfio-pci: cpr part 2 (msi)
>>   - vfio-pci: cpr part 3 (intx)
>>
>> The next patches preserve various descriptor-based backend devices across
>> cpr-exec:
>>   - vhost: reset vhost devices for cpr
>>   - chardev: cpr framework
>>   - chardev: cpr for simple devices
>>   - chardev: cpr for pty
>>   - chardev: cpr for sockets
>>   - cpr: only-cpr-capable option
>>
>> Here is an example of updating qemu from v4.2.0 to v4.2.1 using
>> restart mode.  The software update is performed while the guest is
>> running to minimize downtime.
>>
>> window 1                                        | window 2
>>                                                 |
>> # qemu-system-x86_64 ...                        |
>> QEMU 4.2.0 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: running                              |
>>                                                 | # yum update qemu
>> (qemu) cpr-save /tmp/qemu.sav restart           |
>> (qemu) cpr-exec qemu-system-x86_64 -S ...       |
>> QEMU 4.2.1 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: paused (prelaunch)                   |
>> (qemu) cpr-load /tmp/qemu.sav                   |
>> (qemu) info status                              |
>> VM status: running                              |
>>
>>
>> Here is an example of updating the host kernel using reboot mode.
>>
>> window 1                                        | window 2
>>                                                 |
>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: running                              |
>>                                                 | # yum update kernel-uek
>> (qemu) cpr-save /tmp/qemu.sav reboot            |
>> (qemu) quit                                     |
>>                                                 |
>> # systemctl kexec                               |
>> kexec_core: Starting new kernel                 |
>> ...                                             |
>>                                                 |
>> # qemu-system-x86_64 -S mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: paused (prelaunch)                   |
>> (qemu) cpr-load /tmp/qemu.sav                   |
>> (qemu) info status                              |
>> VM status: running                              |
>>
>> Changes from V1 to V2:
>>   - revert vmstate infrastructure changes
>>   - refactor cpr functions into new files
>>   - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to
>>     preserve memory.
>>   - add framework to filter chardev's that support cpr
>>   - save and restore vfio eventfd's
>>   - modify cprinfo QMP interface
>>   - incorporate misc review feedback
>>   - remove unrelated and unneeded patches
>>   - refactor all patches into a shorter and easier to review series
>>
>> Changes from V2 to V3:
>>   - rebase to qemu 6.0.0
>>   - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
>>   - change memfd-alloc to a machine option
>>   - Use qio_channel_socket_new_fd instead of adding qio_channel_socket_new_fd
>>   - close monitor socket during cpr
>>   - fix a few unreported bugs
>>   - support memory-backend-memfd
>>
>> Changes from V3 to V4:
>>   - split reboot mode into separate patches
>>   - add cprexec command
>>   - delete QEMU_START_FREEZE, argv_main, and /usr/bin/qemu-exec
>>   - add more checks for vfio and cpr compatibility, and recover after errors
>>   - save vfio pci config in vmstate
>>   - rename {setenv,getenv}_event_fd to {save,load}_event_fd
>>   - use qemu_strtol
>>   - change 6.0 references to 6.1
>>   - use strerror(), use EXIT_FAILURE, remove period from error messages
>>   - distribute MAINTAINERS additions to each patch
>>
>> Changes from V4 to V5:
>>   - rebase to master
>>
>> Changes from V5 to V6:
>>   vfio:
>>   - delete redundant bus_master_enable_region in vfio_pci_post_load
>>   - delete unmap.size warning
>>   - fix phys_config memory leak
>>   - add INTX support
>>   - add vfio_named_notifier_init() helper
>>   Other:
>>   - 6.1 -> 6.2
>>   - rename file -> filename in qapi
>>   - delete cprinfo.  qapi introspection serves the same purpose.
>>   - rename cprsave, cprexec, cprload -> cpr-save, cpr-exec, cpr-load
>>   - improve documentation in qapi/cpr.json
>>   - rename qemu_ram_volatile -> qemu_ram_check_volatile, and use
>>     qemu_ram_foreach_block
>>   - rename handle -> opaque
>>   - use ERRP_GUARD
>>   - use g_autoptr and g_autofree, and glib allocation functions
>>   - conform to error conventions for bool and int function return values
>>     and function names.
>>   - remove word "error" in error messages
>>   - rename as_flat_walk and its callback, and add comments.
>>   - rename qemu_clr_cloexec -> qemu_clear_cloexec
>>   - rename close-on-cpr -> reopen-on-cpr
>>   - add strList utility functions
>>   - factor out start on wakeup request to a separate patch
>>   - deleted unnecessary layer (cprsave etc) and squashed QMP patches
>>   - conditionally compile for CONFIG_VFIO
>>
>> Steve Sistare (24):
>>   memory: qemu_check_ram_volatile
>>   migration: fix populate_vfio_info
>>   migration: qemu file wrappers
>>   migration: simplify savevm
>>   vl: start on wakeup request
>>   cpr: reboot mode
>>   memory: flat section iterator
>>   oslib: qemu_clear_cloexec
>>   machine: memfd-alloc option
>>   qapi: list utility functions
>>   vl: helper to request re-exec
>>   cpr: preserve extra state
>>   cpr: restart mode
>>   cpr: restart HMP interfaces
>>   hostmem-memfd: cpr for memory-backend-memfd
>>   pci: export functions for cpr
>>   vfio-pci: refactor for cpr
>>   vfio-pci: cpr part 1 (fd and dma)
>>   vfio-pci: cpr part 2 (msi)
>>   vfio-pci: cpr part 3 (intx)
>>   chardev: cpr framework
>>   chardev: cpr for simple devices
>>   chardev: cpr for pty
>>   cpr: only-cpr-capable option
>>
>> Mark Kanda, Steve Sistare (3):
>>   cpr: reboot HMP interfaces
>>   vhost: reset vhost devices for cpr
>>   chardev: cpr for sockets
>>
>>  MAINTAINERS                   |  12 ++
>>  backends/hostmem-memfd.c      |  21 +--
>>  chardev/char-mux.c            |   1 +
>>  chardev/char-null.c           |   1 +
>>  chardev/char-pty.c            |  14 +-
>>  chardev/char-serial.c         |   1 +
>>  chardev/char-socket.c         |  36 +++++
>>  chardev/char-stdio.c          |   8 ++
>>  chardev/char.c                |  43 +++++-
>>  gdbstub.c                     |   1 +
>>  hmp-commands.hx               |  50 +++++++
>>  hw/core/machine.c             |  19 +++
>>  hw/pci/msix.c                 |  20 ++-
>>  hw/pci/pci.c                  |   7 +-
>>  hw/vfio/common.c              |  79 +++++++++--
>>  hw/vfio/cpr.c                 | 160 ++++++++++++++++++++++
>>  hw/vfio/meson.build           |   1 +
>>  hw/vfio/pci.c                 | 301 +++++++++++++++++++++++++++++++++++++++---
>>  hw/vfio/trace-events          |   1 +
>>  hw/virtio/vhost.c             |  11 ++
>>  include/chardev/char.h        |   6 +
>>  include/exec/memory.h         |  39 ++++++
>>  include/hw/boards.h           |   1 +
>>  include/hw/pci/msix.h         |   5 +
>>  include/hw/pci/pci.h          |   2 +
>>  include/hw/vfio/vfio-common.h |   8 ++
>>  include/hw/virtio/vhost.h     |   1 +
>>  include/migration/cpr.h       |  31 +++++
>>  include/monitor/hmp.h         |   3 +
>>  include/qapi/util.h           |  28 ++++
>>  include/qemu/osdep.h          |   1 +
>>  include/sysemu/runstate.h     |   2 +
>>  include/sysemu/sysemu.h       |   1 +
>>  linux-headers/linux/vfio.h    |   6 +
>>  migration/cpr-state.c         | 215 ++++++++++++++++++++++++++++++
>>  migration/cpr.c               | 176 ++++++++++++++++++++++++
>>  migration/meson.build         |   2 +
>>  migration/migration.c         |   5 +
>>  migration/qemu-file-channel.c |  36 +++++
>>  migration/qemu-file-channel.h |   6 +
>>  migration/savevm.c            |  21 +--
>>  migration/target.c            |  24 +++-
>>  migration/trace-events        |   5 +
>>  monitor/hmp-cmds.c            |  68 ++++++----
>>  monitor/hmp.c                 |   3 +
>>  monitor/qmp.c                 |   3 +
>>  qapi/char.json                |   7 +-
>>  qapi/cpr.json                 |  76 +++++++++++
>>  qapi/meson.build              |   1 +
>>  qapi/qapi-schema.json         |   1 +
>>  qapi/qapi-util.c              |  37 ++++++
>>  qemu-options.hx               |  40 +++++-
>>  softmmu/globals.c             |   1 +
>>  softmmu/memory.c              |  46 +++++++
>>  softmmu/physmem.c             |  55 ++++++--
>>  softmmu/runstate.c            |  38 +++++-
>>  softmmu/vl.c                  |  18 ++-
>>  stubs/cpr-state.c             |  15 +++
>>  stubs/cpr.c                   |   3 +
>>  stubs/meson.build             |   2 +
>>  trace-events                  |   1 +
>>  util/oslib-posix.c            |   9 ++
>>  util/oslib-win32.c            |   4 +
>>  util/qemu-config.c            |   4 +
>>  64 files changed, 1732 insertions(+), 111 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>  create mode 100644 include/migration/cpr.h
>>  create mode 100644 migration/cpr-state.c
>>  create mode 100644 migration/cpr.c
>>  create mode 100644 qapi/cpr.json
>>  create mode 100644 stubs/cpr-state.c
>>  create mode 100644 stubs/cpr.c
>>
> 



* Re: [PATCH V6 00/27] Live Update
  2021-08-23 21:36   ` Steven Sistare
@ 2021-08-24  9:36     ` Zheng Chuan
  2021-08-31 21:15       ` Steven Sistare
  0 siblings, 1 reply; 44+ messages in thread
From: Zheng Chuan @ 2021-08-24  9:36 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Hi, Steve.

I think I have found the problem: rom_reset() during cpr-exec writes the dtb into mach-virt.ram, which corrupts guest memory.
I also found that on x86 the ACPI memory region is changed during rom_reset(). Maybe we should keep this consistent with migration and skip the ROM writes in rom_reset(), as incoming migration does.
Here is a draft patch (it also fixes the cpr mode not being saved in the cpr state):

diff --git a/hw/core/loader.c b/hw/core/loader.c
index 5b34869a5417..1dcf0be1492f 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -50,6 +50,7 @@
 #include "hw/hw.h"
 #include "disas/disas.h"
 #include "migration/vmstate.h"
+#include "migration/cpr.h"
 #include "monitor/monitor.h"
 #include "sysemu/reset.h"
 #include "sysemu/sysemu.h"
@@ -1128,7 +1129,7 @@ static void rom_reset(void *unused)
          * the data in during the next incoming migration in all cases.  Note
          * that some of those RAMs can actually be modified by the guest.
          */
-        if (runstate_check(RUN_STATE_INMIGRATE)) {
+        if (runstate_check(RUN_STATE_INMIGRATE) || cpr_is_active()) {
             if (rom->data && rom->isrom) {
                 /*
                  * Free it so that a rom_reset after migration doesn't
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index e9b987f54319..0b7d7e9f6bf0 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -20,9 +20,11 @@ void cpr_save_fd(const char *name, int id, int fd);
 void cpr_delete_fd(const char *name, int id);
 int cpr_find_fd(const char *name, int id);
 int cpr_walk_fd(cpr_walk_fd_cb cb, void *handle);
-int cpr_state_save(Error **errp);
+int cpr_state_save(CprMode mode, Error **errp);
 int cpr_state_load(Error **errp);
 CprMode cpr_state_mode(void);
+void cpr_state_clear(void);
+bool cpr_is_active(void);
 void cpr_state_print(void);

 int cpr_vfio_save(Error **errp);
diff --git a/migration/cpr-state.c b/migration/cpr-state.c
index 003b449bbcf8..4ac08539d932 100644
--- a/migration/cpr-state.c
+++ b/migration/cpr-state.c
@@ -19,7 +19,7 @@ typedef struct CprState {
     CprNameList fds;            /* list of CprFd */
 } CprState;

-static CprState cpr_state;
+static CprState cpr_state = { .mode = CPR_MODE_NONE };

 /*************************************************************************/
 /* Generic list of names. */
@@ -149,7 +149,7 @@ static const VMStateDescription vmstate_cpr_state = {
     }
 };

-int cpr_state_save(Error **errp)
+int cpr_state_save(CprMode mode, Error **errp)
 {
     int ret, mfd;
     QEMUFile *f;
@@ -163,9 +163,11 @@ int cpr_state_save(Error **errp)
     qemu_clear_cloexec(mfd);
     f = qemu_fd_open(mfd, true, CPR_STATE_NAME);

+    cpr_state.mode = mode;
     ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
     if (ret) {
         error_setg(errp, "vmstate_save_state error %d", ret);
+        cpr_state.mode = CPR_MODE_NONE;
         return ret;
     }

@@ -205,6 +207,16 @@ CprMode cpr_state_mode(void)
     return cpr_state.mode;
 }

+void cpr_state_clear(void)
+{
+    cpr_state.mode = CPR_MODE_NONE;
+}
+
+bool cpr_is_active(void)
+{
+    return cpr_state.mode != CPR_MODE_NONE;
+}
+
 void cpr_state_print(void)
 {
     CprName *elem;
diff --git a/migration/cpr.c b/migration/cpr.c
index d14bc5ad2678..97b2293c01e8 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -115,7 +115,7 @@ void qmp_cpr_exec(strList *args, Error **errp)
         return;
     }
     cpr_walk_fd(preserve_fd, 0);
-    if (cpr_state_save(errp)) {
+    if (cpr_state_save(cpr_active_mode, errp)) {
         return;
     }
     vhost_dev_reset_all();
@@ -173,4 +173,5 @@ void qmp_cpr_load(const char *filename, Error **errp)

 out:
     cpr_active_mode = CPR_MODE_NONE;
+    cpr_state_clear();
 }
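
For readers following the draft, the intended behaviour of the new mode flag
can be modelled in a few lines of standalone C.  This is only a sketch under
stated assumptions, not QEMU code: every sketch_* name and the SketchMode enum
are hypothetical stand-ins for CprMode, cpr_state_save(), cpr_state_clear(),
cpr_is_active() and the guard in rom_reset(), and the model stays in a single
process, so it does not show how the mode reaches the new qemu after exec.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for CprMode; not the QEMU definition. */
typedef enum {
    SKETCH_MODE_NONE,
    SKETCH_MODE_REBOOT,
    SKETCH_MODE_RESTART
} SketchMode;

/* Hypothetical stand-in for the mode field of the static cpr_state. */
static SketchMode sketch_mode = SKETCH_MODE_NONE;

/*
 * Models cpr_state_save() from the draft: record the mode before saving,
 * and roll it back to NONE if the save fails.  The save itself is
 * simulated by the save_fails argument.
 */
static int sketch_state_save(SketchMode mode, bool save_fails)
{
    assert(mode != SKETCH_MODE_NONE);
    sketch_mode = mode;
    if (save_fails) {
        sketch_mode = SKETCH_MODE_NONE;
        return -1;
    }
    return 0;
}

/* Models cpr_state_clear(): called once cpr-load has finished. */
static void sketch_state_clear(void)
{
    sketch_mode = SKETCH_MODE_NONE;
}

/* Models cpr_is_active(): true while a mode is recorded. */
static bool sketch_is_active(void)
{
    return sketch_mode != SKETCH_MODE_NONE;
}

/*
 * Models the guard added to rom_reset(): returns true when ROM blobs
 * would be written into guest memory, false when the write is skipped
 * because of incoming migration or an active cpr operation.
 */
static bool sketch_rom_reset_writes(bool in_migrate)
{
    return !(in_migrate || sketch_is_active());
}
```

With no migration in progress, sketch_rom_reset_writes(false) is true before a
save, false between a successful sketch_state_save() and sketch_state_clear(),
and true again afterwards.  A failed save leaves the flag inactive.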



-- 
Regards.
Chuan



* Re: [PATCH V6 00/27] Live Update
  2021-08-24  9:36     ` Zheng Chuan
@ 2021-08-31 21:15       ` Steven Sistare
  2021-10-27  6:16         ` Zheng Chuan
  0 siblings, 1 reply; 44+ messages in thread
From: Steven Sistare @ 2021-08-31 21:15 UTC (permalink / raw)
  To: Zheng Chuan, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 8/24/2021 5:36 AM, Zheng Chuan wrote:
> Hi, Steve.
> 
> I think I have found the problem: rom_reset() during cpr-exec writes the dtb into mach-virt.ram, which causes the memory corruption.
> On x86, I also found that the ACPI memory region changes during rom_reset(). Maybe we should keep this consistent and skip rom_reset() like migration does.
> Here is the draft patch (it also fixes the problem of the cpr state mode not being saved):

Hi Chuan, thank you very much for debugging the problem.  rom_reset() is a great find.
I also noticed the mode bug and have a fix ready.  I will add similar fixes to V7.

- Steve
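
For readers following along, the mode handling in the draft below can be modeled in isolation.  This is only a sketch, not QEMU code: the `_sketch` names, the `write_state` callback (standing in for vmstate_save_state()) and `rom_reset_would_copy()` are made up for illustration, and the enum values are assumed from the two modes named in the cover letter.  It shows the mode being recorded before the state is written, rolled back if the write fails, and consulted by a rom_reset()-style guard.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed stand-ins for the CprMode values; not copied from the series. */
typedef enum {
    CPR_MODE_NONE,
    CPR_MODE_REBOOT,
    CPR_MODE_RESTART,
} CprMode;

/* Models the single "mode" field of the CprState struct in the draft. */
static CprMode cpr_mode = CPR_MODE_NONE;

/*
 * Record the mode before writing state, and roll it back if the write
 * fails.  write_state is a hypothetical callback standing in for
 * vmstate_save_state(); it returns 0 on success.
 */
static int cpr_state_save_sketch(CprMode mode, int (*write_state)(void))
{
    int ret;

    assert(mode != CPR_MODE_NONE);
    cpr_mode = mode;
    ret = write_state();
    if (ret) {
        cpr_mode = CPR_MODE_NONE;
    }
    return ret;
}

static void cpr_state_clear_sketch(void)
{
    cpr_mode = CPR_MODE_NONE;
}

static bool cpr_is_active_sketch(void)
{
    return cpr_mode != CPR_MODE_NONE;
}

/*
 * Models the guard in rom_reset(): returns true when ROM contents would
 * be copied back into guest memory, false when the copy is skipped.
 * incoming_migration stands in for runstate_check(RUN_STATE_INMIGRATE).
 */
static bool rom_reset_would_copy(bool incoming_migration)
{
    if (incoming_migration || cpr_is_active_sketch()) {
        return false;
    }
    return true;
}

/* Hypothetical writers used to exercise the success and failure paths. */
static int write_ok(void)
{
    return 0;
}

static int write_fail(void)
{
    return -1;
}
```

With the mode set, the guard skips the ROM copy just as it does for an incoming migration; once the mode is cleared (as the draft does at the end of cpr-load), a later reset copies the ROM data again.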

> diff --git a/hw/core/loader.c b/hw/core/loader.c
> index 5b34869a5417..1dcf0be1492f 100644
> --- a/hw/core/loader.c
> +++ b/hw/core/loader.c
> @@ -50,6 +50,7 @@
>  #include "hw/hw.h"
>  #include "disas/disas.h"
>  #include "migration/vmstate.h"
> +#include "migration/cpr.h"
>  #include "monitor/monitor.h"
>  #include "sysemu/reset.h"
>  #include "sysemu/sysemu.h"
> @@ -1128,7 +1129,7 @@ static void rom_reset(void *unused)
>           * the data in during the next incoming migration in all cases.  Note
>           * that some of those RAMs can actually be modified by the guest.
>           */
> -        if (runstate_check(RUN_STATE_INMIGRATE)) {
> +        if (runstate_check(RUN_STATE_INMIGRATE) || cpr_is_active()) {
>              if (rom->data && rom->isrom) {
>                  /*
>                   * Free it so that a rom_reset after migration doesn't
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index e9b987f54319..0b7d7e9f6bf0 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -20,9 +20,11 @@ void cpr_save_fd(const char *name, int id, int fd);
>  void cpr_delete_fd(const char *name, int id);
>  int cpr_find_fd(const char *name, int id);
>  int cpr_walk_fd(cpr_walk_fd_cb cb, void *handle);
> -int cpr_state_save(Error **errp);
> +int cpr_state_save(CprMode mode, Error **errp);
>  int cpr_state_load(Error **errp);
>  CprMode cpr_state_mode(void);
> +void cpr_state_clear(void);
> +bool cpr_is_active(void);
>  void cpr_state_print(void);
> 
>  int cpr_vfio_save(Error **errp);
> diff --git a/migration/cpr-state.c b/migration/cpr-state.c
> index 003b449bbcf8..4ac08539d932 100644
> --- a/migration/cpr-state.c
> +++ b/migration/cpr-state.c
> @@ -19,7 +19,7 @@ typedef struct CprState {
>      CprNameList fds;            /* list of CprFd */
>  } CprState;
> 
> -static CprState cpr_state;
> +static CprState cpr_state = { .mode = CPR_MODE_NONE };
> 
>  /*************************************************************************/
>  /* Generic list of names. */
> @@ -149,7 +149,7 @@ static const VMStateDescription vmstate_cpr_state = {
>      }
>  };
> 
> -int cpr_state_save(Error **errp)
> +int cpr_state_save(CprMode mode, Error **errp)
>  {
>      int ret, mfd;
>      QEMUFile *f;
> @@ -163,9 +163,11 @@ int cpr_state_save(Error **errp)
>      qemu_clear_cloexec(mfd);
>      f = qemu_fd_open(mfd, true, CPR_STATE_NAME);
> 
> +    cpr_state.mode = mode;
>      ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
>      if (ret) {
>          error_setg(errp, "vmstate_save_state error %d", ret);
> +        cpr_state.mode = CPR_MODE_NONE;
>          return ret;
>      }
> 
> @@ -205,6 +207,16 @@ CprMode cpr_state_mode(void)
>      return cpr_state.mode;
>  }
> 
> +void cpr_state_clear(void)
> +{
> +    cpr_state.mode = CPR_MODE_NONE;
> +}
> +
> +bool cpr_is_active(void)
> +{
> +    return cpr_state.mode != CPR_MODE_NONE;
> +}
> +
>  void cpr_state_print(void)
>  {
>      CprName *elem;
> diff --git a/migration/cpr.c b/migration/cpr.c
> index d14bc5ad2678..97b2293c01e8 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -115,7 +115,7 @@ void qmp_cpr_exec(strList *args, Error **errp)
>          return;
>      }
>      cpr_walk_fd(preserve_fd, 0);
> -    if (cpr_state_save(errp)) {
> +    if (cpr_state_save(cpr_active_mode, errp)) {
>          return;
>      }
>      vhost_dev_reset_all();
> @@ -173,4 +173,5 @@ void qmp_cpr_load(const char *filename, Error **errp)
> 
>  out:
>      cpr_active_mode = CPR_MODE_NONE;
> +    cpr_state_clear();
>  }
> 
> 
> On 2021/8/24 5:36, Steven Sistare wrote:
>> Hi Zheng, testing aarch64 is on our todo list. We will run this case and try to 
>> reproduce the failure.  Thanks for the report.
>>
>> - Steve
>>
>> On 8/21/2021 4:54 AM, Zheng Chuan wrote:
>>> Hi, Steve
>>>
>>> It seems the VM gets stuck after cpr-load in an AArch64 environment.
>>>
>>> My AArch64 environment and test steps:
>>> 1. linux kernel: 5.14-rc6
>>> 2. QEMU version: v6.1.0-rc2 with your patchset applied, configured with `../configure --target-list=aarch64-softmmu --disable-werror --enable-kvm`
>>> 4. Steps to live update:
>>> # ./build/aarch64-softmmu/qemu-system-aarch64 -machine virt,accel=kvm,gic-version=3,memfd-alloc=on -nodefaults -cpu host -m 2G -smp 1 -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,readonly=on
>>> -drive file=<path/to/vm.qcow2>,format=qcow2,if=none,id=drive_image1
>>> -device virtio-blk-pci,id=image1,drive=drive_image1 -vnc :10 -device
>>> virtio-gpu,id=video0 -device piix3-usb-uhci,id=usb -device
>>> usb-tablet,id=input0,bus=usb.0,port=1 -device
>>> usb-kbd,id=input1,bus=usb.0,port=2 -monitor stdio
>>> (qemu) cpr-save /tmp/qemu.save restart
>>> (qemu) cpr-exec ./build/aarch64-softmmu/qemu-system-aarch64 -machine virt,accel=kvm,gic-version=3,memfd-alloc=on -nodefaults -cpu host -m 2G -smp 1 -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,readonly=on
>>> -drive file=<path/to/vm.qcow2>,format=qcow2,if=none,id=drive_image1
>>> -device virtio-blk-pci,id=image1,drive=drive_image1 -vnc :10 -device
>>> virtio-gpu,id=video0 -device piix3-usb-uhci,id=usb -device
>>> usb-tablet,id=input0,bus=usb.0,port=1 -device
>>> usb-kbd,id=input1,bus=usb.0,port=2 -monitor stdio -S
>>> (qemu) QEMU 6.0.92 monitor - type 'help' for more information
>>> (qemu) cpr-load /tmp/qemu.save
>>>
>>> Did I miss something?
>>>
>>> On 2021/8/7 5:43, Steve Sistare wrote:
>>>> Provide the cpr-save, cpr-exec, and cpr-load commands for live update.
>>>> These save and restore VM state, with minimal guest pause time, so that
>>>> qemu may be updated to a new version in between.
>>>>
>>>> cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
>>>> any type of guest image and block device, but the caller must not modify
>>>> guest block devices between cpr-save and cpr-load.  It supports two modes:
>>>> reboot and restart.
>>>>
>>>> In reboot mode, the caller invokes cpr-save and then terminates qemu.
>>>> The caller may then update the host kernel and system software and reboot.
>>>> The caller resumes the guest by running qemu with the same arguments as the
>>>> original process and invoking cpr-load.  To use this mode, guest ram must be
>>>> mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
>>>> PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.
>>>>
>>>> The reboot mode supports vfio devices if the caller first suspends the
>>>> guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
>>>> guest drivers' suspend methods flush outstanding requests and re-initialize
>>>> the devices, and thus there is no device state to save and restore.
>>>>
>>>> Restart mode preserves the guest VM across a restart of the qemu process.
>>>> After cpr-save, the caller passes qemu command-line arguments to cpr-exec,
>>>> which directly exec's the new qemu binary.  The arguments must include -S
>>>> so new qemu starts in a paused state and waits for the cpr-load command.
>>>> The restart mode supports vfio devices by preserving the vfio container,
>>>> group, device, and event descriptors across the qemu re-exec, and by
>>>> updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
>>>> VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
>>>> and integrated in Linux kernel 5.12.
>>>>
>>>> To use the restart mode, qemu must be started with the memfd-alloc option,
>>>> which allocates guest ram using memfd_create.  The memfd's are saved to
>>>> the environment and kept open across exec, after which they are found from
>>>> the environment and re-mmap'd.  Hence guest ram is preserved in place,
>>>> albeit with new virtual addresses in the qemu process.
>>>>
>>>> The caller resumes the guest by invoking cpr-load, which loads state from
>>>> the file. If the VM was running at cpr-save time, then VM execution resumes.
>>>> If the VM was suspended at cpr-save time (reboot mode), then the caller must
>>>> issue a system_wakeup command to resume.
>>>>
>>>> The first patches add reboot mode:
>>>>   - memory: qemu_check_ram_volatile
>>>>   - migration: fix populate_vfio_info
>>>>   - migration: qemu file wrappers
>>>>   - migration: simplify savevm
>>>>   - vl: start on wakeup request
>>>>   - cpr: reboot mode
>>>>   - cpr: reboot HMP interfaces
>>>>
>>>> The next patches add restart mode:
>>>>   - memory: flat section iterator
>>>>   - oslib: qemu_clear_cloexec
>>>>   - machine: memfd-alloc option
>>>>   - qapi: list utility functions
>>>>   - vl: helper to request re-exec
>>>>   - cpr: preserve extra state
>>>>   - cpr: restart mode
>>>>   - cpr: restart HMP interfaces
>>>>   - hostmem-memfd: cpr for memory-backend-memfd
>>>>
>>>> The next patches add vfio support for restart mode:
>>>>   - pci: export functions for cpr
>>>>   - vfio-pci: refactor for cpr
>>>>   - vfio-pci: cpr part 1 (fd and dma)
>>>>   - vfio-pci: cpr part 2 (msi)
>>>>   - vfio-pci: cpr part 3 (intx)
>>>>
>>>> The next patches preserve various descriptor-based backend devices across
>>>> cpr-exec:
>>>>   - vhost: reset vhost devices for cpr
>>>>   - chardev: cpr framework
>>>>   - chardev: cpr for simple devices
>>>>   - chardev: cpr for pty
>>>>   - chardev: cpr for sockets
>>>>   - cpr: only-cpr-capable option
>>>>
>>>> Here is an example of updating qemu from v4.2.0 to v4.2.1 using
>>>> restart mode.  The software update is performed while the guest is
>>>> running to minimize downtime.
>>>>
>>>> window 1                                        | window 2
>>>>                                                 |
>>>> # qemu-system-x86_64 ...                        |
>>>> QEMU 4.2.0 monitor - type 'help' ...            |
>>>> (qemu) info status                              |
>>>> VM status: running                              |
>>>>                                                 | # yum update qemu
>>>> (qemu) cpr-save /tmp/qemu.sav restart           |
>>>> (qemu) cpr-exec qemu-system-x86_64 -S ...       |
>>>> QEMU 4.2.1 monitor - type 'help' ...            |
>>>> (qemu) info status                              |
>>>> VM status: paused (prelaunch)                   |
>>>> (qemu) cpr-load /tmp/qemu.sav                   |
>>>> (qemu) info status                              |
>>>> VM status: running                              |
>>>>
>>>>
>>>> Here is an example of updating the host kernel using reboot mode.
>>>>
>>>> window 1                                        | window 2
>>>>                                                 |
>>>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>>>> QEMU 4.2.1 monitor - type 'help' ...            |
>>>> (qemu) info status                              |
>>>> VM status: running                              |
>>>>                                                 | # yum update kernel-uek
>>>> (qemu) cpr-save /tmp/qemu.sav reboot            |
>>>> (qemu) quit                                     |
>>>>                                                 |
>>>> # systemctl kexec                               |
>>>> kexec_core: Starting new kernel                 |
>>>> ...                                             |
>>>>                                                 |
>>>> # qemu-system-x86_64 -S mem-path=/dev/dax0.0 ...|
>>>> QEMU 4.2.1 monitor - type 'help' ...            |
>>>> (qemu) info status                              |
>>>> VM status: paused (prelaunch)                   |
>>>> (qemu) cpr-load /tmp/qemu.sav                   |
>>>> (qemu) info status                              |
>>>> VM status: running                              |
>>>>
>>>> Changes from V1 to V2:
>>>>   - revert vmstate infrastructure changes
>>>>   - refactor cpr functions into new files
>>>>   - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to
>>>>     preserve memory.
>>>>   - add framework to filter chardev's that support cpr
>>>>   - save and restore vfio eventfd's
>>>>   - modify cprinfo QMP interface
>>>>   - incorporate misc review feedback
>>>>   - remove unrelated and unneeded patches
>>>>   - refactor all patches into a shorter and easier to review series
>>>>
>>>> Changes from V2 to V3:
>>>>   - rebase to qemu 6.0.0
>>>>   - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
>>>>   - change memfd-alloc to a machine option
>>>>   - Use qio_channel_socket_new_fd instead of adding qio_channel_socket_new_fd
>>>>   - close monitor socket during cpr
>>>>   - fix a few unreported bugs
>>>>   - support memory-backend-memfd
>>>>
>>>> Changes from V3 to V4:
>>>>   - split reboot mode into separate patches
>>>>   - add cprexec command
>>>>   - delete QEMU_START_FREEZE, argv_main, and /usr/bin/qemu-exec
>>>>   - add more checks for vfio and cpr compatibility, and recover after errors
>>>>   - save vfio pci config in vmstate
>>>>   - rename {setenv,getenv}_event_fd to {save,load}_event_fd
>>>>   - use qemu_strtol
>>>>   - change 6.0 references to 6.1
>>>>   - use strerror(), use EXIT_FAILURE, remove period from error messages
>>>>   - distribute MAINTAINERS additions to each patch
>>>>
>>>> Changes from V4 to V5:
>>>>   - rebase to master
>>>>
>>>> Changes from V5 to V6:
>>>>   vfio:
>>>>   - delete redundant bus_master_enable_region in vfio_pci_post_load
>>>>   - delete unmap.size warning
>>>>   - fix phys_config memory leak
>>>>   - add INTX support
>>>>   - add vfio_named_notifier_init() helper
>>>>   Other:
>>>>   - 6.1 -> 6.2
>>>>   - rename file -> filename in qapi
>>>>   - delete cprinfo.  qapi introspection serves the same purpose.
>>>>   - rename cprsave, cprexec, cprload -> cpr-save, cpr-exec, cpr-load
>>>>   - improve documentation in qapi/cpr.json
>>>>   - rename qemu_ram_volatile -> qemu_ram_check_volatile, and use
>>>>     qemu_ram_foreach_block
>>>>   - rename handle -> opaque
>>>>   - use ERRP_GUARD
>>>>   - use g_autoptr and g_autofree, and glib allocation functions
>>>>   - conform to error conventions for bool and int function return values
>>>>     and function names.
>>>>   - remove word "error" in error messages
>>>>   - rename as_flat_walk and its callback, and add comments.
>>>>   - rename qemu_clr_cloexec -> qemu_clear_cloexec
>>>>   - rename close-on-cpr -> reopen-on-cpr
>>>>   - add strList utility functions
>>>>   - factor out start on wakeup request to a separate patch
>>>>   - deleted unnecessary layer (cprsave etc) and squashed QMP patches
>>>>   - conditionally compile for CONFIG_VFIO
>>>>
>>>> Steve Sistare (24):
>>>>   memory: qemu_check_ram_volatile
>>>>   migration: fix populate_vfio_info
>>>>   migration: qemu file wrappers
>>>>   migration: simplify savevm
>>>>   vl: start on wakeup request
>>>>   cpr: reboot mode
>>>>   memory: flat section iterator
>>>>   oslib: qemu_clear_cloexec
>>>>   machine: memfd-alloc option
>>>>   qapi: list utility functions
>>>>   vl: helper to request re-exec
>>>>   cpr: preserve extra state
>>>>   cpr: restart mode
>>>>   cpr: restart HMP interfaces
>>>>   hostmem-memfd: cpr for memory-backend-memfd
>>>>   pci: export functions for cpr
>>>>   vfio-pci: refactor for cpr
>>>>   vfio-pci: cpr part 1 (fd and dma)
>>>>   vfio-pci: cpr part 2 (msi)
>>>>   vfio-pci: cpr part 3 (intx)
>>>>   chardev: cpr framework
>>>>   chardev: cpr for simple devices
>>>>   chardev: cpr for pty
>>>>   cpr: only-cpr-capable option
>>>>
>>>> Mark Kanda, Steve Sistare (3):
>>>>   cpr: reboot HMP interfaces
>>>>   vhost: reset vhost devices for cpr
>>>>   chardev: cpr for sockets
>>>>
>>>>  MAINTAINERS                   |  12 ++
>>>>  backends/hostmem-memfd.c      |  21 +--
>>>>  chardev/char-mux.c            |   1 +
>>>>  chardev/char-null.c           |   1 +
>>>>  chardev/char-pty.c            |  14 +-
>>>>  chardev/char-serial.c         |   1 +
>>>>  chardev/char-socket.c         |  36 +++++
>>>>  chardev/char-stdio.c          |   8 ++
>>>>  chardev/char.c                |  43 +++++-
>>>>  gdbstub.c                     |   1 +
>>>>  hmp-commands.hx               |  50 +++++++
>>>>  hw/core/machine.c             |  19 +++
>>>>  hw/pci/msix.c                 |  20 ++-
>>>>  hw/pci/pci.c                  |   7 +-
>>>>  hw/vfio/common.c              |  79 +++++++++--
>>>>  hw/vfio/cpr.c                 | 160 ++++++++++++++++++++++
>>>>  hw/vfio/meson.build           |   1 +
>>>>  hw/vfio/pci.c                 | 301 +++++++++++++++++++++++++++++++++++++++---
>>>>  hw/vfio/trace-events          |   1 +
>>>>  hw/virtio/vhost.c             |  11 ++
>>>>  include/chardev/char.h        |   6 +
>>>>  include/exec/memory.h         |  39 ++++++
>>>>  include/hw/boards.h           |   1 +
>>>>  include/hw/pci/msix.h         |   5 +
>>>>  include/hw/pci/pci.h          |   2 +
>>>>  include/hw/vfio/vfio-common.h |   8 ++
>>>>  include/hw/virtio/vhost.h     |   1 +
>>>>  include/migration/cpr.h       |  31 +++++
>>>>  include/monitor/hmp.h         |   3 +
>>>>  include/qapi/util.h           |  28 ++++
>>>>  include/qemu/osdep.h          |   1 +
>>>>  include/sysemu/runstate.h     |   2 +
>>>>  include/sysemu/sysemu.h       |   1 +
>>>>  linux-headers/linux/vfio.h    |   6 +
>>>>  migration/cpr-state.c         | 215 ++++++++++++++++++++++++++++++
>>>>  migration/cpr.c               | 176 ++++++++++++++++++++++++
>>>>  migration/meson.build         |   2 +
>>>>  migration/migration.c         |   5 +
>>>>  migration/qemu-file-channel.c |  36 +++++
>>>>  migration/qemu-file-channel.h |   6 +
>>>>  migration/savevm.c            |  21 +--
>>>>  migration/target.c            |  24 +++-
>>>>  migration/trace-events        |   5 +
>>>>  monitor/hmp-cmds.c            |  68 ++++++----
>>>>  monitor/hmp.c                 |   3 +
>>>>  monitor/qmp.c                 |   3 +
>>>>  qapi/char.json                |   7 +-
>>>>  qapi/cpr.json                 |  76 +++++++++++
>>>>  qapi/meson.build              |   1 +
>>>>  qapi/qapi-schema.json         |   1 +
>>>>  qapi/qapi-util.c              |  37 ++++++
>>>>  qemu-options.hx               |  40 +++++-
>>>>  softmmu/globals.c             |   1 +
>>>>  softmmu/memory.c              |  46 +++++++
>>>>  softmmu/physmem.c             |  55 ++++++--
>>>>  softmmu/runstate.c            |  38 +++++-
>>>>  softmmu/vl.c                  |  18 ++-
>>>>  stubs/cpr-state.c             |  15 +++
>>>>  stubs/cpr.c                   |   3 +
>>>>  stubs/meson.build             |   2 +
>>>>  trace-events                  |   1 +
>>>>  util/oslib-posix.c            |   9 ++
>>>>  util/oslib-win32.c            |   4 +
>>>>  util/qemu-config.c            |   4 +
>>>>  64 files changed, 1732 insertions(+), 111 deletions(-)
>>>>  create mode 100644 hw/vfio/cpr.c
>>>>  create mode 100644 include/migration/cpr.h
>>>>  create mode 100644 migration/cpr-state.c
>>>>  create mode 100644 migration/cpr.c
>>>>  create mode 100644 qapi/cpr.json
>>>>  create mode 100644 stubs/cpr-state.c
>>>>  create mode 100644 stubs/cpr.c
>>>>
>>>
>> .
>>
> 
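
The restart-mode paragraph quoted above (memfd's saved to the environment, kept open across exec, then re-mmap'd) can be sketched as follows.  This is an assumption-laden illustration rather than the series' implementation: the function names and the environment variable name passed in are hypothetical, and the caller is expected to run the second function in the process image that follows exec.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Clear FD_CLOEXEC so the descriptor stays open across exec. */
static int clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/*
 * Before exec: create a memfd of the given size, make it inheritable,
 * and publish its number in the environment under env_name (a
 * hypothetical variable name).  Returns the fd, or -1 on error.
 */
static int ram_create_and_publish(const char *env_name, size_t size)
{
    char val[32];
    int fd = memfd_create(env_name, MFD_CLOEXEC);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0 || clear_cloexec(fd) < 0) {
        close(fd);
        return -1;
    }
    snprintf(val, sizeof(val), "%d", fd);
    if (setenv(env_name, val, 1) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * After exec: look up the fd number in the environment and map it.
 * Returns MAP_FAILED if the variable is missing or malformed, or if
 * mmap itself fails.
 */
static void *ram_find_and_map(const char *env_name, size_t size)
{
    const char *val = getenv(env_name);
    char *end;
    long fd;

    if (!val) {
        return MAP_FAILED;
    }
    fd = strtol(val, &end, 10);
    if (end == val || *end != '\0' || fd < 0) {
        return MAP_FAILED;
    }
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, (int)fd, 0);
}
```

Mapping the same descriptor a second time yields a different virtual address backed by the same pages, which is the sense in which guest ram is preserved in place but at new addresses in the new process.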



* Re: [PATCH V6 00/27] Live Update
  2021-08-31 21:15       ` Steven Sistare
@ 2021-10-27  6:16         ` Zheng Chuan
  2021-10-27 12:25           ` Steven Sistare
  0 siblings, 1 reply; 44+ messages in thread
From: Zheng Chuan @ 2021-10-27  6:16 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Alex Williamson, Xiexiangyou,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Hi, Steve.
Any updates for this series?

On 2021/9/1 5:15, Steven Sistare wrote:
> On 8/24/2021 5:36 AM, Zheng Chuan wrote:
>> Hi, Steve.
>>
>> I think I have found the problem: rom_reset() during cpr-exec writes the dtb into mach-virt.ram, which causes the memory corruption.
>> On x86, I also found that the ACPI memory region changes during rom_reset(). Maybe we should keep this consistent and skip rom_reset() like migration does.
>> Here is the draft patch (it also fixes the problem of the cpr state mode not being saved):
> 
> Hi Chuan, thank you very much for debugging the problem.  rom_reset() is a great find.
> I also noticed the mode bug and have a fix ready.  I will add similar fixes to V7.
> 
> - Steve
> 
>> diff --git a/hw/core/loader.c b/hw/core/loader.c
>> index 5b34869a5417..1dcf0be1492f 100644
>> --- a/hw/core/loader.c
>> +++ b/hw/core/loader.c
>> @@ -50,6 +50,7 @@
>>  #include "hw/hw.h"
>>  #include "disas/disas.h"
>>  #include "migration/vmstate.h"
>> +#include "migration/cpr.h"
>>  #include "monitor/monitor.h"
>>  #include "sysemu/reset.h"
>>  #include "sysemu/sysemu.h"
>> @@ -1128,7 +1129,7 @@ static void rom_reset(void *unused)
>>           * the data in during the next incoming migration in all cases.  Note
>>           * that some of those RAMs can actually be modified by the guest.
>>           */
>> -        if (runstate_check(RUN_STATE_INMIGRATE)) {
>> +        if (runstate_check(RUN_STATE_INMIGRATE) || cpr_is_active()) {
>>              if (rom->data && rom->isrom) {
>>                  /*
>>                   * Free it so that a rom_reset after migration doesn't
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index e9b987f54319..0b7d7e9f6bf0 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -20,9 +20,11 @@ void cpr_save_fd(const char *name, int id, int fd);
>>  void cpr_delete_fd(const char *name, int id);
>>  int cpr_find_fd(const char *name, int id);
>>  int cpr_walk_fd(cpr_walk_fd_cb cb, void *handle);
>> -int cpr_state_save(Error **errp);
>> +int cpr_state_save(CprMode mode, Error **errp);
>>  int cpr_state_load(Error **errp);
>>  CprMode cpr_state_mode(void);
>> +void cpr_state_clear(void);
>> +bool cpr_is_active(void);
>>  void cpr_state_print(void);
>>
>>  int cpr_vfio_save(Error **errp);
>> diff --git a/migration/cpr-state.c b/migration/cpr-state.c
>> index 003b449bbcf8..4ac08539d932 100644
>> --- a/migration/cpr-state.c
>> +++ b/migration/cpr-state.c
>> @@ -19,7 +19,7 @@ typedef struct CprState {
>>      CprNameList fds;            /* list of CprFd */
>>  } CprState;
>>
>> -static CprState cpr_state;
>> +static CprState cpr_state = { .mode = CPR_MODE_NONE };
>>
>>  /*************************************************************************/
>>  /* Generic list of names. */
>> @@ -149,7 +149,7 @@ static const VMStateDescription vmstate_cpr_state = {
>>      }
>>  };
>>
>> -int cpr_state_save(Error **errp)
>> +int cpr_state_save(CprMode mode, Error **errp)
>>  {
>>      int ret, mfd;
>>      QEMUFile *f;
>> @@ -163,9 +163,11 @@ int cpr_state_save(Error **errp)
>>      qemu_clear_cloexec(mfd);
>>      f = qemu_fd_open(mfd, true, CPR_STATE_NAME);
>>
>> +    cpr_state.mode = mode;
>>      ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
>>      if (ret) {
>>          error_setg(errp, "vmstate_save_state error %d", ret);
>> +        cpr_state.mode = CPR_MODE_NONE;
>>          return ret;
>>      }
>>
>> @@ -205,6 +207,16 @@ CprMode cpr_state_mode(void)
>>      return cpr_state.mode;
>>  }
>>
>> +void cpr_state_clear(void)
>> +{
>> +    cpr_state.mode = CPR_MODE_NONE;
>> +}
>> +
>> +bool cpr_is_active(void)
>> +{
>> +    return cpr_state.mode != CPR_MODE_NONE;
>> +}
>> +
>>  void cpr_state_print(void)
>>  {
>>      CprName *elem;
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index d14bc5ad2678..97b2293c01e8 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -115,7 +115,7 @@ void qmp_cpr_exec(strList *args, Error **errp)
>>          return;
>>      }
>>      cpr_walk_fd(preserve_fd, 0);
>> -    if (cpr_state_save(errp)) {
>> +    if (cpr_state_save(cpr_active_mode, errp)) {
>>          return;
>>      }
>>      vhost_dev_reset_all();
>> @@ -173,4 +173,5 @@ void qmp_cpr_load(const char *filename, Error **errp)
>>
>>  out:
>>      cpr_active_mode = CPR_MODE_NONE;
>> +    cpr_state_clear();
>>  }
>>
>>
>> On 2021/8/24 5:36, Steven Sistare wrote:
>>> Hi Zheng, testing aarch64 is on our todo list. We will run this case and try to 
>>> reproduce the failure.  Thanks for the report.
>>>
>>> - Steve
>>>
>>> On 8/21/2021 4:54 AM, Zheng Chuan wrote:
>>>> Hi, Steve
>>>>
>>>> It seems the VM gets stuck after cpr-load in an AArch64 environment.
>>>>
>>>> My AArch64 environment and test steps:
>>>> 1. linux kernel: 5.14-rc6
>>>> 2. QEMU version: v6.1.0-rc2 with your patchset applied, configured with `../configure --target-list=aarch64-softmmu --disable-werror --enable-kvm`
>>>> 4. Steps to live update:
>>>> # ./build/aarch64-softmmu/qemu-system-aarch64 -machine virt,accel=kvm,gic-version=3,memfd-alloc=on -nodefaults -cpu host -m 2G -smp 1 -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,readonly=on
>>>> -drive file=<path/to/vm.qcow2>,format=qcow2,if=none,id=drive_image1
>>>> -device virtio-blk-pci,id=image1,drive=drive_image1 -vnc :10 -device
>>>> virtio-gpu,id=video0 -device piix3-usb-uhci,id=usb -device
>>>> usb-tablet,id=input0,bus=usb.0,port=1 -device
>>>> usb-kbd,id=input1,bus=usb.0,port=2 -monitor stdio
>>>> (qemu) cpr-save /tmp/qemu.save restart
>>>> (qemu) cpr-exec ./build/aarch64-softmmu/qemu-system-aarch64 -machine virt,accel=kvm,gic-version=3,memfd-alloc=on -nodefaults -cpu host -m 2G -smp 1 -drive file=/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw,if=pflash,format=raw,readonly=on
>>>> -drive file=<path/to/vm.qcow2>,format=qcow2,if=none,id=drive_image1
>>>> -device virtio-blk-pci,id=image1,drive=drive_image1 -vnc :10 -device
>>>> virtio-gpu,id=video0 -device piix3-usb-uhci,id=usb -device
>>>> usb-tablet,id=input0,bus=usb.0,port=1 -device
>>>> usb-kbd,id=input1,bus=usb.0,port=2 -monitor stdio -S
>>>> (qemu) QEMU 6.0.92 monitor - type 'help' for more information
>>>> (qemu) cpr-load /tmp/qemu.save
>>>>
>>>> Did I miss something?
>>>>
>>>> On 2021/8/7 5:43, Steve Sistare wrote:
>>>>> Provide the cpr-save, cpr-exec, and cpr-load commands for live update.
>>>>> These save and restore VM state, with minimal guest pause time, so that
>>>>> qemu may be updated to a new version in between.
>>>>>
>>>>> cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
>>>>> any type of guest image and block device, but the caller must not modify
>>>>> guest block devices between cpr-save and cpr-load.  It supports two modes:
>>>>> reboot and restart.
>>>>>
>>>>> In reboot mode, the caller invokes cpr-save and then terminates qemu.
>>>>> The caller may then update the host kernel and system software and reboot.
>>>>> The caller resumes the guest by running qemu with the same arguments as the
>>>>> original process and invoking cpr-load.  To use this mode, guest ram must be
>>>>> mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
>>>>> PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.
>>>>>
>>>>> The reboot mode supports vfio devices if the caller first suspends the
>>>>> guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
>>>>> guest drivers' suspend methods flush outstanding requests and re-initialize
>>>>> the devices, and thus there is no device state to save and restore.
>>>>>
>>>>> Restart mode preserves the guest VM across a restart of the qemu process.
>>>>> After cpr-save, the caller passes qemu command-line arguments to cpr-exec,
>>>>> which directly exec's the new qemu binary.  The arguments must include -S
>>>>> so new qemu starts in a paused state and waits for the cpr-load command.
>>>>> The restart mode supports vfio devices by preserving the vfio container,
>>>>> group, device, and event descriptors across the qemu re-exec, and by
>>>>> updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
>>>>> VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
>>>>> and integrated in Linux kernel 5.12.
>>>>>
>>>>> To use the restart mode, qemu must be started with the memfd-alloc option,
>>>>> which allocates guest ram using memfd_create.  The memfd's are saved to
>>>>> the environment and kept open across exec, after which they are found from
>>>>> the environment and re-mmap'd.  Hence guest ram is preserved in place,
>>>>> albeit with new virtual addresses in the qemu process.
>>>>>
>>>>> The caller resumes the guest by invoking cpr-load, which loads state from
>>>>> the file. If the VM was running at cpr-save time, then VM execution resumes.
>>>>> If the VM was suspended at cpr-save time (reboot mode), then the caller must
>>>>> issue a system_wakeup command to resume.
>>>>>
>>>>> The first patches add reboot mode:
>>>>>   - memory: qemu_check_ram_volatile
>>>>>   - migration: fix populate_vfio_info
>>>>>   - migration: qemu file wrappers
>>>>>   - migration: simplify savevm
>>>>>   - vl: start on wakeup request
>>>>>   - cpr: reboot mode
>>>>>   - cpr: reboot HMP interfaces
>>>>>
>>>>> The next patches add restart mode:
>>>>>   - memory: flat section iterator
>>>>>   - oslib: qemu_clear_cloexec
>>>>>   - machine: memfd-alloc option
>>>>>   - qapi: list utility functions
>>>>>   - vl: helper to request re-exec
>>>>>   - cpr: preserve extra state
>>>>>   - cpr: restart mode
>>>>>   - cpr: restart HMP interfaces
>>>>>   - hostmem-memfd: cpr for memory-backend-memfd
>>>>>
>>>>> The next patches add vfio support for restart mode:
>>>>>   - pci: export functions for cpr
>>>>>   - vfio-pci: refactor for cpr
>>>>>   - vfio-pci: cpr part 1 (fd and dma)
>>>>>   - vfio-pci: cpr part 2 (msi)
>>>>>   - vfio-pci: cpr part 3 (intx)
>>>>>
>>>>> The next patches preserve various descriptor-based backend devices across
>>>>> cpr-exec:
>>>>>   - vhost: reset vhost devices for cpr
>>>>>   - chardev: cpr framework
>>>>>   - chardev: cpr for simple devices
>>>>>   - chardev: cpr for pty
>>>>>   - chardev: cpr for sockets
>>>>>   - cpr: only-cpr-capable option
>>>>>
>>>>> Here is an example of updating qemu from v4.2.0 to v4.2.1 using
>>>>> restart mode.  The software update is performed while the guest is
>>>>> running to minimize downtime.
>>>>>
>>>>> window 1                                        | window 2
>>>>>                                                 |
>>>>> # qemu-system-x86_64 ...                        |
>>>>> QEMU 4.2.0 monitor - type 'help' ...            |
>>>>> (qemu) info status                              |
>>>>> VM status: running                              |
>>>>>                                                 | # yum update qemu
>>>>> (qemu) cpr-save /tmp/qemu.sav restart           |
>>>>> (qemu) cpr-exec qemu-system-x86_64 -S ...       |
>>>>> QEMU 4.2.1 monitor - type 'help' ...            |
>>>>> (qemu) info status                              |
>>>>> VM status: paused (prelaunch)                   |
>>>>> (qemu) cpr-load /tmp/qemu.sav                   |
>>>>> (qemu) info status                              |
>>>>> VM status: running                              |
>>>>>
>>>>>
>>>>> Here is an example of updating the host kernel using reboot mode.
>>>>>
>>>>> window 1                                        | window 2
>>>>>                                                 |
>>>>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>>>>> QEMU 4.2.1 monitor - type 'help' ...            |
>>>>> (qemu) info status                              |
>>>>> VM status: running                              |
>>>>>                                                 | # yum update kernel-uek
>>>>> (qemu) cpr-save /tmp/qemu.sav reboot            |
>>>>> (qemu) quit                                     |
>>>>>                                                 |
>>>>> # systemctl kexec                               |
>>>>> kexec_core: Starting new kernel                 |
>>>>> ...                                             |
>>>>>                                                 |
>>>>> # qemu-system-x86_64 -S mem-path=/dev/dax0.0 ...|
>>>>> QEMU 4.2.1 monitor - type 'help' ...            |
>>>>> (qemu) info status                              |
>>>>> VM status: paused (prelaunch)                   |
>>>>> (qemu) cpr-load /tmp/qemu.sav                   |
>>>>> (qemu) info status                              |
>>>>> VM status: running                              |
>>>>>
>>>>> Changes from V1 to V2:
>>>>>   - revert vmstate infrastructure changes
>>>>>   - refactor cpr functions into new files
>>>>>   - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to
>>>>>     preserve memory.
>>>>>   - add framework to filter chardev's that support cpr
>>>>>   - save and restore vfio eventfd's
>>>>>   - modify cprinfo QMP interface
>>>>>   - incorporate misc review feedback
>>>>>   - remove unrelated and unneeded patches
>>>>>   - refactor all patches into a shorter and easier to review series
>>>>>
>>>>> Changes from V2 to V3:
>>>>>   - rebase to qemu 6.0.0
>>>>>   - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
>>>>>   - change memfd-alloc to a machine option
>>>>>   - Use qio_channel_socket_new_fd instead of adding qio_channel_socket_new_fd
>>>>>   - close monitor socket during cpr
>>>>>   - fix a few unreported bugs
>>>>>   - support memory-backend-memfd
>>>>>
>>>>> Changes from V3 to V4:
>>>>>   - split reboot mode into separate patches
>>>>>   - add cprexec command
>>>>>   - delete QEMU_START_FREEZE, argv_main, and /usr/bin/qemu-exec
>>>>>   - add more checks for vfio and cpr compatibility, and recover after errors
>>>>>   - save vfio pci config in vmstate
>>>>>   - rename {setenv,getenv}_event_fd to {save,load}_event_fd
>>>>>   - use qemu_strtol
>>>>>   - change 6.0 references to 6.1
>>>>>   - use strerror(), use EXIT_FAILURE, remove period from error messages
>>>>>   - distribute MAINTAINERS additions to each patch
>>>>>
>>>>> Changes from V4 to V5:
>>>>>   - rebase to master
>>>>>
>>>>> Changes from V5 to V6:
>>>>>   vfio:
>>>>>   - delete redundant bus_master_enable_region in vfio_pci_post_load
>>>>>   - delete unmap.size warning
>>>>>   - fix phys_config memory leak
>>>>>   - add INTX support
>>>>>   - add vfio_named_notifier_init() helper
>>>>>   Other:
>>>>>   - 6.1 -> 6.2
>>>>>   - rename file -> filename in qapi
>>>>>   - delete cprinfo.  qapi introspection serves the same purpose.
>>>>>   - rename cprsave, cprexec, cprload -> cpr-save, cpr-exec, cpr-load
>>>>>   - improve documentation in qapi/cpr.json
>>>>>   - rename qemu_ram_volatile -> qemu_ram_check_volatile, and use
>>>>>     qemu_ram_foreach_block
>>>>>   - rename handle -> opaque
>>>>>   - use ERRP_GUARD
>>>>>   - use g_autoptr and g_autofree, and glib allocation functions
>>>>>   - conform to error conventions for bool and int function return values
>>>>>     and function names.
>>>>>   - remove word "error" in error messages
>>>>>   - rename as_flat_walk and its callback, and add comments.
>>>>>   - rename qemu_clr_cloexec -> qemu_clear_cloexec
>>>>>   - rename close-on-cpr -> reopen-on-cpr
>>>>>   - add strList utility functions
>>>>>   - factor out start on wakeup request to a separate patch
>>>>>   - deleted unnecessary layer (cprsave etc) and squashed QMP patches
>>>>>   - conditionally compile for CONFIG_VFIO
>>>>>
>>>>> Steve Sistare (24):
>>>>>   memory: qemu_check_ram_volatile
>>>>>   migration: fix populate_vfio_info
>>>>>   migration: qemu file wrappers
>>>>>   migration: simplify savevm
>>>>>   vl: start on wakeup request
>>>>>   cpr: reboot mode
>>>>>   memory: flat section iterator
>>>>>   oslib: qemu_clear_cloexec
>>>>>   machine: memfd-alloc option
>>>>>   qapi: list utility functions
>>>>>   vl: helper to request re-exec
>>>>>   cpr: preserve extra state
>>>>>   cpr: restart mode
>>>>>   cpr: restart HMP interfaces
>>>>>   hostmem-memfd: cpr for memory-backend-memfd
>>>>>   pci: export functions for cpr
>>>>>   vfio-pci: refactor for cpr
>>>>>   vfio-pci: cpr part 1 (fd and dma)
>>>>>   vfio-pci: cpr part 2 (msi)
>>>>>   vfio-pci: cpr part 3 (intx)
>>>>>   chardev: cpr framework
>>>>>   chardev: cpr for simple devices
>>>>>   chardev: cpr for pty
>>>>>   cpr: only-cpr-capable option
>>>>>
>>>>> Mark Kanda, Steve Sistare (3):
>>>>>   cpr: reboot HMP interfaces
>>>>>   vhost: reset vhost devices for cpr
>>>>>   chardev: cpr for sockets
>>>>>
>>>>>  MAINTAINERS                   |  12 ++
>>>>>  backends/hostmem-memfd.c      |  21 +--
>>>>>  chardev/char-mux.c            |   1 +
>>>>>  chardev/char-null.c           |   1 +
>>>>>  chardev/char-pty.c            |  14 +-
>>>>>  chardev/char-serial.c         |   1 +
>>>>>  chardev/char-socket.c         |  36 +++++
>>>>>  chardev/char-stdio.c          |   8 ++
>>>>>  chardev/char.c                |  43 +++++-
>>>>>  gdbstub.c                     |   1 +
>>>>>  hmp-commands.hx               |  50 +++++++
>>>>>  hw/core/machine.c             |  19 +++
>>>>>  hw/pci/msix.c                 |  20 ++-
>>>>>  hw/pci/pci.c                  |   7 +-
>>>>>  hw/vfio/common.c              |  79 +++++++++--
>>>>>  hw/vfio/cpr.c                 | 160 ++++++++++++++++++++++
>>>>>  hw/vfio/meson.build           |   1 +
>>>>>  hw/vfio/pci.c                 | 301 +++++++++++++++++++++++++++++++++++++++---
>>>>>  hw/vfio/trace-events          |   1 +
>>>>>  hw/virtio/vhost.c             |  11 ++
>>>>>  include/chardev/char.h        |   6 +
>>>>>  include/exec/memory.h         |  39 ++++++
>>>>>  include/hw/boards.h           |   1 +
>>>>>  include/hw/pci/msix.h         |   5 +
>>>>>  include/hw/pci/pci.h          |   2 +
>>>>>  include/hw/vfio/vfio-common.h |   8 ++
>>>>>  include/hw/virtio/vhost.h     |   1 +
>>>>>  include/migration/cpr.h       |  31 +++++
>>>>>  include/monitor/hmp.h         |   3 +
>>>>>  include/qapi/util.h           |  28 ++++
>>>>>  include/qemu/osdep.h          |   1 +
>>>>>  include/sysemu/runstate.h     |   2 +
>>>>>  include/sysemu/sysemu.h       |   1 +
>>>>>  linux-headers/linux/vfio.h    |   6 +
>>>>>  migration/cpr-state.c         | 215 ++++++++++++++++++++++++++++++
>>>>>  migration/cpr.c               | 176 ++++++++++++++++++++++++
>>>>>  migration/meson.build         |   2 +
>>>>>  migration/migration.c         |   5 +
>>>>>  migration/qemu-file-channel.c |  36 +++++
>>>>>  migration/qemu-file-channel.h |   6 +
>>>>>  migration/savevm.c            |  21 +--
>>>>>  migration/target.c            |  24 +++-
>>>>>  migration/trace-events        |   5 +
>>>>>  monitor/hmp-cmds.c            |  68 ++++++----
>>>>>  monitor/hmp.c                 |   3 +
>>>>>  monitor/qmp.c                 |   3 +
>>>>>  qapi/char.json                |   7 +-
>>>>>  qapi/cpr.json                 |  76 +++++++++++
>>>>>  qapi/meson.build              |   1 +
>>>>>  qapi/qapi-schema.json         |   1 +
>>>>>  qapi/qapi-util.c              |  37 ++++++
>>>>>  qemu-options.hx               |  40 +++++-
>>>>>  softmmu/globals.c             |   1 +
>>>>>  softmmu/memory.c              |  46 +++++++
>>>>>  softmmu/physmem.c             |  55 ++++++--
>>>>>  softmmu/runstate.c            |  38 +++++-
>>>>>  softmmu/vl.c                  |  18 ++-
>>>>>  stubs/cpr-state.c             |  15 +++
>>>>>  stubs/cpr.c                   |   3 +
>>>>>  stubs/meson.build             |   2 +
>>>>>  trace-events                  |   1 +
>>>>>  util/oslib-posix.c            |   9 ++
>>>>>  util/oslib-win32.c            |   4 +
>>>>>  util/qemu-config.c            |   4 +
>>>>>  64 files changed, 1732 insertions(+), 111 deletions(-)
>>>>>  create mode 100644 hw/vfio/cpr.c
>>>>>  create mode 100644 include/migration/cpr.h
>>>>>  create mode 100644 migration/cpr-state.c
>>>>>  create mode 100644 migration/cpr.c
>>>>>  create mode 100644 qapi/cpr.json
>>>>>  create mode 100644 stubs/cpr-state.c
>>>>>  create mode 100644 stubs/cpr.c
>>>>>
>>>>
>>> .
>>>
>>
> .
> 

-- 
Regards.
Chuan



* Re: [PATCH V6 00/27] Live Update
  2021-10-27  6:16         ` Zheng Chuan
@ 2021-10-27 12:25           ` Steven Sistare
  0 siblings, 0 replies; 44+ messages in thread
From: Steven Sistare @ 2021-10-27 12:25 UTC (permalink / raw)
  To: Zheng Chuan, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Alex Williamson, Xiexiangyou,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Soon.  I'll aim for next week.  Thanks for your continued interest!

- Steve

On 10/27/2021 2:16 AM, Zheng Chuan wrote:
> Hi, Steve.
> Any updates for this series?



* Re: [PATCH V6 19/27] vfio-pci: cpr part 1 (fd and dma)
  2021-08-10 17:06   ` Alex Williamson
  2021-08-23 19:43     ` Steven Sistare
@ 2021-11-10  7:48     ` Zheng Chuan
  2021-11-30 16:11       ` Steven Sistare
  1 sibling, 1 reply; 44+ messages in thread
From: Zheng Chuan @ 2021-11-10  7:48 UTC (permalink / raw)
  To: Alex Williamson, Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Dr. David Alan Gilbert, Xiexiangyou, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster


Hi, Steve

On 2021/8/11 1:06, Alex Williamson wrote:
> On Fri,  6 Aug 2021 14:43:53 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Enable vfio-pci devices to be saved and restored across an exec restart
>> of qemu.
>>
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in cpr state.
>>
>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
>> at a different VA after exec.  DMA to already-mapped pages continues.  Save
>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>> vfio descriptors.  The flag is not cleared earlier because the descriptors
>> should not persist across miscellaneous fork and exec calls that may be
>> performed during normal operation.
>>
>> On qemu restart, vfio_realize() finds the descriptor env vars, uses
>> the descriptors, and notes that the device is being reused.  Device and
>> iommu state is already configured, so operations in vfio_realize that
>> would modify the configuration are skipped for a reused device, including
>> vfio ioctl's and writes to PCI configuration space.  The result is that
>> vfio_realize constructs qemu data structures that reflect the current
>> state of the device.  However, the reconstruction is not complete until
>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>> state.  It rebuilds vector data structures and attaches the interrupts to
>> the new KVM instance.  cpr-load then walks the flattened ranges of the
>> vfio_address_spaces and calls VFIO_DMA_MAP_FLAG_VADDR to inform the kernel
>> of the new VA's.  Lastly, it starts the VM and suppresses vfio device reset.
>>
>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
>> support.  Part 3 adds INTX support.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  MAINTAINERS                   |   1 +
>>  hw/pci/pci.c                  |   4 ++
>>  hw/vfio/common.c              |  69 ++++++++++++++++--
>>  hw/vfio/cpr.c                 | 160 ++++++++++++++++++++++++++++++++++++++++++
>>  hw/vfio/meson.build           |   1 +
>>  hw/vfio/pci.c                 |  57 +++++++++++++++
>>  hw/vfio/trace-events          |   1 +
>>  include/hw/pci/pci.h          |   1 +
>>  include/hw/vfio/vfio-common.h |   5 ++
>>  include/migration/cpr.h       |   3 +
>>  linux-headers/linux/vfio.h    |   6 ++
>>  migration/cpr.c               |  10 ++-
>>  migration/target.c            |  14 ++++
>>  13 files changed, 325 insertions(+), 7 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index a9d2ed8..3132965 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2904,6 +2904,7 @@ CPR
>>  M: Steve Sistare <steven.sistare@oracle.com>
>>  M: Mark Kanda <mark.kanda@oracle.com>
>>  S: Maintained
>> +F: hw/vfio/cpr.c
>>  F: include/migration/cpr.h
>>  F: migration/cpr.c
>>  F: qapi/cpr.json
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 59408a3..b9c6ca1 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -307,6 +307,10 @@ static void pci_do_device_reset(PCIDevice *dev)
>>  {
>>      int r;
>>  
>> +    if (dev->reused) {
>> +        return;
>> +    }
>> +
>>      pci_device_deassert_intx(dev);
>>      assert(dev->irq_state == 0);
>>  
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 7918c0d..872a1ac 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -31,6 +31,7 @@
>>  #include "exec/memory.h"
>>  #include "exec/ram_addr.h"
>>  #include "hw/hw.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/range.h"
>> @@ -464,6 +465,10 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>>      }
>>  
>> +    if (container->reused) {
>> +        return 0;
>> +    }
>> +
>>      while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>>          /*
>>           * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
>> @@ -501,6 +506,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>          .size = size,
>>      };
>>  
>> +    if (container->reused) {
>> +        return 0;
>> +    }
>> +
>>      if (!readonly) {
>>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>>      }
>> @@ -1872,6 +1881,10 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>>      if (iommu_type < 0) {
>>          return iommu_type;
>>      }
>> +    if (container->reused) {
>> +        container->iommu_type = iommu_type;
>> +        return 0;
>> +    }
>>  
> 
> I'd like to see more comments throughout, but particularly where we're
> dumping out of functions for reused containers, groups, and devices.
> For instance map/unmap we're assuming we'll reach the same IOMMU
> mapping state we had previously, how do we validate that, why can't we
> only set vaddr in the mapping path rather than skipping it for a later
> pass at the flatmap, do we actually see unmaps, is deferring listener
> registration an alternate option, which specific reset path are we
> trying to defer, why are VFIOPCIDevices the only PCIDevices that set
> reused, there are some assumptions about the iommu_type that could use
> further description, etc.
> 
>>      ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
>>      if (ret) {
>> @@ -1972,6 +1985,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>  {
>>      VFIOContainer *container;
>>      int ret, fd;
>> +    bool reused;
>>      VFIOAddressSpace *space;
>>  
>>      space = vfio_get_address_space(as);
>> @@ -2007,7 +2021,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>       * details once we know which type of IOMMU we are using.
>>       */
>>  
>> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>> +    reused = (fd >= 0);
>> +
>>      QLIST_FOREACH(container, &space->containers, next) {
>> +        if (container->fd == fd) {
>> +            break;
>> +        }
>>          if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> 
> 
> Letting the reused case call this ioctl feels a little sloppy.  I'm
> assuming we've tested this in a vIOMMU config or other setups where
> we'd actually have multiple containers and we're relying on the ioctl
> failing, but why call it at all if we already know the group is
> attached to a container.
> 
> 
>>              ret = vfio_ram_block_discard_disable(container, true);
>>              if (ret) {
>> @@ -2020,14 +2040,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>                  }
>>                  return ret;
>>              }
>> -            group->container = container;
>> -            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> +            break;
>> +        }
>> +    }
>> +
>> +    if (container) {
>> +        group->container = container;
>> +        QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> +        if (!reused) {
>>              vfio_kvm_device_add_group(group);
>> -            return 0;
>> +            cpr_save_fd("vfio_container_for_group", group->groupid,
>> +                        container->fd);
>>          }
>> +        return 0;
>> +    }
>> +
>> +    if (!reused) {
>> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>>      }
>>  
>> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>>      if (fd < 0) {
>>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>>          ret = -errno;
>> @@ -2045,6 +2076,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      container = g_malloc0(sizeof(*container));
>>      container->space = space;
>>      container->fd = fd;
>> +    container->reused = reused;
>>      container->error = NULL;
>>      container->dirty_pages_supported = false;
>>      container->dma_max_mappings = 0;
>> @@ -2183,6 +2215,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      }
>>  
>>      container->initialized = true;
>> +    cpr_save_fd("vfio_container_for_group", group->groupid, fd);
>>  
>>      return 0;
>>  listener_release_exit:
>> @@ -2212,6 +2245,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>  
>>      QLIST_REMOVE(group, container_next);
>>      group->container = NULL;
>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>  
>>      /*
>>       * Explicitly release the listener first before unset container,
>> @@ -2253,6 +2287,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>      VFIOGroup *group;
>>      char path[32];
>>      struct vfio_group_status status = { .argsz = sizeof(status) };
>> +    bool reused;
>>  
>>      QLIST_FOREACH(group, &vfio_group_list, next) {
>>          if (group->groupid == groupid) {
>> @@ -2270,7 +2305,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>      group = g_malloc0(sizeof(*group));
>>  
>>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>> -    group->fd = qemu_open_old(path, O_RDWR);
>> +
>> +    group->fd = cpr_find_fd("vfio_group", groupid);
>> +    reused = (group->fd >= 0);
>> +    if (!reused) {
>> +        group->fd = qemu_open_old(path, O_RDWR);
>> +    }
>> +
>>      if (group->fd < 0) {
>>          error_setg_errno(errp, errno, "failed to open %s", path);
>>          goto free_group_exit;
>> @@ -2304,6 +2345,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>  
>>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>>  
>> +    if (!reused) {
>> +        cpr_save_fd("vfio_group", groupid, group->fd);
>> +    }
>> +
>>      return group;
>>  
>>  close_fd_exit:
>> @@ -2328,6 +2373,7 @@ void vfio_put_group(VFIOGroup *group)
>>      vfio_disconnect_container(group);
>>      QLIST_REMOVE(group, next);
>>      trace_vfio_put_group(group->fd);
>> +    cpr_delete_fd("vfio_group", group->groupid);
>>      close(group->fd);
>>      g_free(group);
>>  
>> @@ -2341,8 +2387,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>>  {
>>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>      int ret, fd;
>> +    bool reused;
>> +
>> +    fd = cpr_find_fd(name, 0);
>> +    reused = (fd >= 0);
>> +    if (!reused) {
>> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>> +    }
>>  
>> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>>      if (fd < 0) {
>>          error_setg_errno(errp, errno, "error getting device from group %d",
>>                           group->groupid);
>> @@ -2387,6 +2439,10 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>>      vbasedev->num_irqs = dev_info.num_irqs;
>>      vbasedev->num_regions = dev_info.num_regions;
>>      vbasedev->flags = dev_info.flags;
>> +    vbasedev->reused = reused;
>> +    if (!reused) {
>> +        cpr_save_fd(name, 0, fd);
>> +    }
>>  
>>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>>                            dev_info.num_irqs);
>> @@ -2403,6 +2459,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>>      QLIST_REMOVE(vbasedev, next);
>>      vbasedev->group = NULL;
>>      trace_vfio_put_base_device(vbasedev->fd);
>> +    cpr_delete_fd(vbasedev->name, 0);
>>      close(vbasedev->fd);
>>  }
>>  
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> new file mode 100644
>> index 0000000..0981d31
>> --- /dev/null
>> +++ b/hw/vfio/cpr.c
>> @@ -0,0 +1,160 @@
>> +/*
>> + * Copyright (c) 2021 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +#include "hw/vfio/vfio-common.h"
>> +#include "sysemu/kvm.h"
>> +#include "qapi/error.h"
>> +#include "trace.h"
>> +
>> +static int
>> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>> +{
>> +    struct vfio_iommu_type1_dma_unmap unmap = {
>> +        .argsz = sizeof(unmap),
>> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
>> +        .iova = 0,
>> +        .size = 0,
>> +    };
>> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
>> +        return -errno;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int vfio_dma_map_vaddr(VFIOContainer *container, hwaddr iova,
>> +                              ram_addr_t size, void *vaddr,
>> +                              Error **errp)
>> +{
>> +    struct vfio_iommu_type1_dma_map map = {
>> +        .argsz = sizeof(map),
>> +        .flags = VFIO_DMA_MAP_FLAG_VADDR,
>> +        .vaddr = (__u64)(uintptr_t)vaddr,
>> +        .iova = iova,
>> +        .size = size,
>> +    };
>> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
>> +        error_setg_errno(errp, errno,
>> +                         "vfio_dma_map_vaddr(iova %lu, size %ld, va %p)",
>> +                         iova, size, vaddr);
>> +        return -errno;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int
>> +vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
>> +{
>> +    MemoryRegion *mr = section->mr;
>> +    VFIOContainer *container = handle;
>> +    const char *name = memory_region_name(mr);
>> +    ram_addr_t size = int128_get64(section->size);
>> +    hwaddr offset, iova, roundup;
>> +    void *vaddr;
>> +
>> +    if (vfio_listener_skipped_section(section) || memory_region_is_iommu(mr)) {
> 
> A comment reminding us why we're also skipping iommu regions would be
> useful.  It's not clear to me why this needs to happen separately from
> the listener.  There's a sufficient degree of magic here that I'm
> afraid it's going to get broken too easily if it's left to me trying to
> remember how it's supposed to work.
> 
>> +        return 0;
>> +    }
>> +
>> +    offset = section->offset_within_address_space;
>> +    iova = REAL_HOST_PAGE_ALIGN(offset);
We should not remap a section that shares a host page with other structures.
I think a check like int128_ge(int128_make64(iova), llend), as in vfio_listener_region_add(), should also be added here.
Otherwise we remap a DMA range that was never mapped, which causes the live update to fail.
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 0981d31..d231841 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -58,13 +58,21 @@ vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
     ram_addr_t size = int128_get64(section->size);
     hwaddr offset, iova, roundup;
     void *vaddr;
-
+    Int128 llend;
+
     if (vfio_listener_skipped_section(section) || memory_region_is_iommu(mr)) {
         return 0;
     }

     offset = section->offset_within_address_space;
     iova = REAL_HOST_PAGE_ALIGN(offset);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
+    if (int128_ge(int128_make64(iova), llend)) {
+        return 0;
+    }
+
     roundup = iova - offset;
     size -= roundup;
     size = REAL_HOST_PAGE_ALIGN(size);

>> +    roundup = iova - offset;
>> +    size -= roundup;
>> +    size = REAL_HOST_PAGE_ALIGN(size);
>> +    vaddr = memory_region_get_ram_ptr(mr) +
>> +            section->offset_within_region + roundup;
>> +
>> +    trace_vfio_region_remap(name, container->fd, iova, iova + size - 1, vaddr);
>> +    return vfio_dma_map_vaddr(container, iova, size, vaddr, errp);
>> +}
>> +
>> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
>> +{
>> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
>> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
>> +                         "or VFIO_UNMAP_ALL");
>> +        return false;
>> +    } else {
>> +        return true;
>> +    }
>> +}
>> +
>> +int vfio_cpr_save(Error **errp)
>> +{
>> +    ERRP_GUARD();
>> +    VFIOAddressSpace *space, *last_space;
>> +    VFIOContainer *container, *last_container;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (!vfio_is_cpr_capable(container, errp)) {
>> +                return -1;
>> +            }
>> +        }
>> +    }
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
>> +                goto unwind;
>> +            }
>> +        }
>> +    }
>> +    return 0;
>> +
>> +unwind:
>> +    last_space = space;
>> +    last_container = container;
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            Error *err;
>> +
>> +            if (space == last_space && container == last_container) {
>> +                break;
>> +            }
> 
> Isn't it sufficient to only test the container?  I think we'd be in
> trouble if we found a container on multiple address space lists.  Too
> bad we don't have a continue_reverse foreach or it might be trivial to
> convert to a qtailq. 
> 
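
The unwind pattern discussed here can be shown on a plain singly linked list. This is a minimal sketch under the assumption that each element sits on exactly one list, so comparing the element pointer alone identifies where to stop. The names struct item, item_suspend, item_resume, and suspend_all are hypothetical and stand in for the container walk and the vaddr unmap and remap calls.

```c
#include <assert.h>
#include <stddef.h>

struct item {
    struct item *next;
    int suspended;
    int fail;           /* test knob: make item_suspend() fail for this item */
};

static int item_suspend(struct item *it)
{
    if (it->fail) {
        return -1;
    }
    it->suspended = 1;
    return 0;
}

static void item_resume(struct item *it)
{
    assert(it->suspended);
    it->suspended = 0;
}

/*
 * Suspend every item on the list.  On failure, walk the list again from
 * the head and undo only the items that precede the one that failed.
 */
static int suspend_all(struct item *head)
{
    struct item *it, *failed = NULL;

    for (it = head; it; it = it->next) {
        if (item_suspend(it)) {
            failed = it;
            break;
        }
    }
    if (!failed) {
        return 0;
    }
    for (it = head; it != failed; it = it->next) {
        item_resume(it);
    }
    return -1;
}
```

A forward-only list forces the second walk to restart from the head. A doubly linked queue would allow walking backward from the failed element instead.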
>> +            if (address_space_flat_for_each_section(space->as,
>> +                                                    vfio_region_remap,
>> +                                                    container, &err)) {
>> +                error_prepend(errp, "%s", error_get_pretty(err));
>> +                error_free(err);
>> +            }
>> +        }
>> +    }
>> +    return -1;
>> +}
>> +
>> +int vfio_cpr_load(Error **errp)
>> +{
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (!vfio_is_cpr_capable(container, errp)) {
>> +                return -1;
>> +            }
>> +            container->reused = false;
>> +            if (address_space_flat_for_each_section(space->as,
>> +                                                    vfio_region_remap,
>> +                                                    container, errp)) {
>> +                return -1;
>> +            }
>> +        }
>> +    }
>> +    QLIST_FOREACH(group, &vfio_group_list, next) {
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            vbasedev->reused = false;
>> +        }
>> +    }
> 
> The above is a bit disjoint between group/device and space/container,
> how about walking container->group_list rather than the global group
> list?
> 
>> +    return 0;
>> +}
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index da9af29..e247b2b 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>>    'migration.c',
>>  ))
>>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>> +  'cpr.c',
>>    'display.c',
>>    'pci-quirks.c',
>>    'pci.c',
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index e8e371e..64e2557 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -29,6 +29,7 @@
>>  #include "hw/qdev-properties.h"
>>  #include "hw/qdev-properties-system.h"
>>  #include "migration/vmstate.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/module.h"
>> @@ -2899,6 +2900,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>          vfio_put_group(group);
>>          goto error;
>>      }
>> +    pdev->reused = vdev->vbasedev.reused;
>>  
>>      vfio_populate_device(vdev, &err);
>>      if (err) {
>> @@ -3168,6 +3170,10 @@ static void vfio_pci_reset(DeviceState *dev)
>>  {
>>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>>  
>> +    if (vdev->pdev.reused) {
>> +        return;
>> +    }
> 
> Why are we the only ones using PCIDevice.reused and why are we testing
> that rather than VFIOPCIDevice.reused above?  These have different
> lifecycles and the difference is too subtle, esp. w/o comments.
> 
>> +
>>      trace_vfio_pci_reset(vdev->vbasedev.name);
>>  
>>      vfio_pci_pre_reset(vdev);
>> @@ -3275,6 +3281,56 @@ static Property vfio_pci_dev_properties[] = {
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> +static void vfio_merge_config(VFIOPCIDevice *vdev)
>> +{
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
>> +    g_autofree uint8_t *phys_config = g_malloc(size);
>> +    uint32_t mask;
>> +    int ret, i;
>> +
>> +    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
>> +    if (ret < size) {
>> +        ret = ret < 0 ? errno : EFAULT;
>> +        error_report("failed to read device config space: %s", strerror(ret));
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < size; i++) {
>> +        mask = vdev->emulated_config_bits[i];
>> +        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
>> +    }
>> +}
>> +
>> +static int vfio_pci_post_load(void *opaque, int version_id)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    vfio_merge_config(vdev);
>> +
>> +    pdev->reused = false;
>> +
>> +    return 0;
>> +}
>> +
>> +static bool vfio_pci_needed(void *opaque)
>> +{
>> +    return cpr_mode() == CPR_MODE_RESTART;
>> +}
>> +
>> +static const VMStateDescription vfio_pci_vmstate = {
>> +    .name = "vfio-pci",
>> +    .unmigratable = 1,
> 
> 
> Doesn't this break the experimental (for now) migration support?
> 
> 
>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .post_load = vfio_pci_post_load,
>> +    .needed = vfio_pci_needed,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>  {
>>      DeviceClass *dc = DEVICE_CLASS(klass);
>> @@ -3282,6 +3338,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>  
>>      dc->reset = vfio_pci_reset;
>>      device_class_set_props(dc, vfio_pci_dev_properties);
>> +    dc->vmsd = &vfio_pci_vmstate;
>>      dc->desc = "VFIO-based PCI device assignment";
>>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>>      pdc->realize = vfio_realize;
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 0ef1b5f..63dd0fe 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>>  vfio_dma_unmap_overflow_workaround(void) ""
>> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>>  
>>  # platform.c
>>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index bf5be06..f079423 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -360,6 +360,7 @@ struct PCIDevice {
>>      /* ID of standby device in net_failover pair */
>>      char *failover_pair_id;
>>      uint32_t acpi_index;
>> +    bool reused;
>>  };
>>  
>>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index cb04cc6..0766cc4 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -85,6 +85,7 @@ typedef struct VFIOContainer {
>>      Error *error;
>>      bool initialized;
>>      bool dirty_pages_supported;
>> +    bool reused;
>>      uint64_t dirty_pgsizes;
>>      uint64_t max_dirty_bitmap_size;
>>      unsigned long pgsizes;
>> @@ -136,6 +137,7 @@ typedef struct VFIODevice {
>>      bool no_mmap;
>>      bool ram_block_discard_allowed;
>>      bool enable_migration;
>> +    bool reused;
>>      VFIODeviceOps *ops;
>>      unsigned int num_irqs;
>>      unsigned int num_regions;
>> @@ -212,6 +214,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>>  void vfio_put_group(VFIOGroup *group);
>>  int vfio_get_device(VFIOGroup *group, const char *name,
>>                      VFIODevice *vbasedev, Error **errp);
>> +int vfio_cpr_save(Error **errp);
>> +int vfio_cpr_load(Error **errp);
>> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp);
>>  
>>  extern const MemoryRegionOps vfio_region_ops;
>>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index 83f69c9..e9b987f 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -25,4 +25,7 @@ int cpr_state_load(Error **errp);
>>  CprMode cpr_state_mode(void);
>>  void cpr_state_print(void);
>>  
>> +int cpr_vfio_save(Error **errp);
>> +int cpr_vfio_load(Error **errp);
>> +
>>  #endif
>> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
>> index e680594..48a02c0 100644
>> --- a/linux-headers/linux/vfio.h
>> +++ b/linux-headers/linux/vfio.h
>> @@ -52,6 +52,12 @@
>>  /* Supports the vaddr flag for DMA map and unmap */
>>  #define VFIO_UPDATE_VADDR		10
>            ^^^^^^^^^^^^^^^^^
> 
> It's already there.  Thanks,
> 
> Alex
> 
>>  
>> +/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
>> +#define VFIO_UNMAP_ALL                        9
>> +
>> +/* Supports VFIO DMA map and unmap with the VADDR flag */
>> +#define VFIO_UPDATE_VADDR              10
>> +
>>  /*
>>   * The IOCTL interface is designed for extensibility by embedding the
>>   * structure length (argsz) and flags into structures passed between
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index 72a5f4b..16f11bd 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -7,6 +7,7 @@
>>  
>>  #include "qemu/osdep.h"
>>  #include "exec/memory.h"
>> +#include "hw/vfio/vfio-common.h"
>>  #include "io/channel-buffer.h"
>>  #include "io/channel-file.h"
>>  #include "migration.h"
>> @@ -108,7 +109,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>>          return;
>>      }
>> -
>> +    if (cpr_vfio_save(errp)) {
>> +        return;
>> +    }
>>      cpr_walk_fd(preserve_fd, 0);
>>      if (cpr_state_save(errp)) {
>>          return;
>> @@ -148,6 +151,11 @@ void qmp_cpr_load(const char *filename, Error **errp)
>>          goto out;
>>      }
>>  
>> +    if (cpr_active_mode == CPR_MODE_RESTART &&
>> +        cpr_vfio_load(errp)) {
>> +        goto out;
>> +    }
>> +
>>      state = global_state_get_runstate();
>>      if (state == RUN_STATE_RUNNING) {
>>          vm_start();
>> diff --git a/migration/target.c b/migration/target.c
>> index 4390bf0..984bc9e 100644
>> --- a/migration/target.c
>> +++ b/migration/target.c
>> @@ -8,6 +8,7 @@
>>  #include "qemu/osdep.h"
>>  #include "qapi/qapi-types-migration.h"
>>  #include "migration.h"
>> +#include "migration/cpr.h"
>>  #include CONFIG_DEVICES
>>  
>>  #ifdef CONFIG_VFIO
>> @@ -22,8 +23,21 @@ void populate_vfio_info(MigrationInfo *info)
>>          info->vfio->transferred = vfio_mig_bytes_transferred();
>>      }
>>  }
>> +
>> +int cpr_vfio_save(Error **errp)
>> +{
>> +    return vfio_cpr_save(errp);
>> +}
>> +
>> +int cpr_vfio_load(Error **errp)
>> +{
>> +    return vfio_cpr_load(errp);
>> +}
>> +
>>  #else
>>  
>>  void populate_vfio_info(MigrationInfo *info) {}
>> +int cpr_vfio_save(Error **errp) { return 0; }
>> +int cpr_vfio_load(Error **errp) { return 0; }
>>  
>>  #endif /* CONFIG_VFIO */
> 
> .
> 

-- 
Regards.
Chuan



* Re: [PATCH V6 19/27] vfio-pci: cpr part 1 (fd and dma)
  2021-11-10  7:48     ` Zheng Chuan
@ 2021-11-30 16:11       ` Steven Sistare
  0 siblings, 0 replies; 44+ messages in thread
From: Steven Sistare @ 2021-11-30 16:11 UTC (permalink / raw)
  To: Zheng Chuan, Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Dr. David Alan Gilbert, Xiexiangyou, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On 11/10/2021 2:48 AM, Zheng Chuan wrote:
> 
> Hi, Steve
> 
> On 2021/8/11 1:06, Alex Williamson wrote:
>> On Fri,  6 Aug 2021 14:43:53 -0700
>> Steve Sistare <steven.sistare@oracle.com> wrote:
>> [...]
>>> +static int
>>> +vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
>>> +{
>>> +    MemoryRegion *mr = section->mr;
>>> +    VFIOContainer *container = handle;
>>> +    const char *name = memory_region_name(mr);
>>> +    ram_addr_t size = int128_get64(section->size);
>>> +    hwaddr offset, iova, roundup;
>>> +    void *vaddr;
>>> +
>>> +    if (vfio_listener_skipped_section(section) || memory_region_is_iommu(mr)) {
>>> +        return 0;
>>> +    }
>>> +
>>> +    offset = section->offset_within_address_space;
>>> +    iova = REAL_HOST_PAGE_ALIGN(offset);
> We should not remap the section if it shares a host page with other structures.
> I think a check like int128_ge(int128_make64(iova), llend), as in vfio_listener_region_add(),
> should also be added here; otherwise we remap DMA that does not exist, which causes the
> live update to fail.
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index 0981d31..d231841 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -58,13 +58,21 @@ vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
>      ram_addr_t size = int128_get64(section->size);
>      hwaddr offset, iova, roundup;
>      void *vaddr;
> -
> +    Int128 llend;
> +
>      if (vfio_listener_skipped_section(section) || memory_region_is_iommu(mr)) {
>          return 0;
>      }
> 
>      offset = section->offset_within_address_space;
>      iova = REAL_HOST_PAGE_ALIGN(offset);
> +    llend = int128_make64(section->offset_within_address_space);
> +    llend = int128_add(llend, section->size);
> +    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
> +    if (int128_ge(int128_make64(iova), llend)) {
> +        return 0;
> +    }
> +
>      roundup = iova - offset;
>      size -= roundup;
>      size = REAL_HOST_PAGE_ALIGN(size);
> 
>>> +    roundup = iova - offset;
>>> +    size -= roundup;
>>> +    size = REAL_HOST_PAGE_ALIGN(size);
>>> +    vaddr = memory_region_get_ram_ptr(mr) +
>>> +            section->offset_within_region + roundup;
>>> +
>>> +    trace_vfio_region_remap(name, container->fd, iova, iova + size - 1, vaddr);
>>> +    return vfio_dma_map_vaddr(container, iova, size, vaddr, errp);
>>> +}

Thank you Zheng.  I intended to implement the logic you suggest, using 64-bit arithmetic,
but I botched it.  This should do the trick:

diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index df334d9..bbdeaea 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -66,8 +66,8 @@ vfio_region_remap(MemoryRegionSection *section, void *handle,
     offset = section->offset_within_address_space;
     iova = REAL_HOST_PAGE_ALIGN(offset);
     roundup = iova - offset;
-    size -= roundup;
-    size = REAL_HOST_PAGE_ALIGN(size);
+    size -= roundup;                    /* adjust for starting alignment */
+    size &= qemu_real_host_page_mask;   /* adjust for ending alignment */
     end = iova + size;
     if (iova >= end) {
         return 0;
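
For readers following the arithmetic, here is a minimal standalone sketch of the
clamping that the fix above aims for: round the start of the section up to a host
page, round the end down, and skip the section when no whole page remains.  It is
not the code from the series.  clamp_to_host_pages() and its parameters are
hypothetical names, the page size is passed in rather than taken from
qemu_real_host_page_mask, and the explicit "size < roundup" test is an added
assumption to guard against unsigned wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not QEMU code: shrink [offset, offset + size) to the
 * largest host-page-aligned range it fully contains.  Returns false when no
 * whole page fits, i.e. there is nothing to remap.  Assumes that
 * offset + page_size does not overflow 64 bits.
 */
static bool clamp_to_host_pages(uint64_t offset, uint64_t size,
                                uint64_t page_size,
                                uint64_t *iova, uint64_t *len)
{
    uint64_t mask, start, roundup;

    /* page_size must be a nonzero power of two */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    mask = ~(page_size - 1);

    start = (offset + page_size - 1) & mask;    /* round the start up */
    roundup = start - offset;
    if (size < roundup) {
        return false;       /* section ends before the next page boundary */
    }
    size -= roundup;        /* adjust for starting alignment */
    size &= mask;           /* adjust for ending alignment */
    if (size == 0) {
        return false;       /* less than one whole page remains */
    }

    *iova = start;
    *len = size;
    return true;
}
```

With a 4 KiB page, a section at offset 0x1800 of size 0x2800 clamps to iova
0x2000 and length 0x2000, while a section at offset 0x1800 of size 0x400 is
skipped.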

- Steve



* Re: [PATCH V6 21/27] vfio-pci: cpr part 3 (intx)
  2021-08-06 21:43 ` [PATCH V6 21/27] vfio-pci: cpr part 3 (intx) Steve Sistare
@ 2022-03-29 11:03   ` Fam Zheng
  2022-04-11 16:23     ` Steven Sistare
  0 siblings, 1 reply; 44+ messages in thread
From: Fam Zheng @ 2022-03-29 11:03 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 2021-08-06 14:43, Steve Sistare wrote:
> Preserve vfio INTX state across cpr restart.  Preserve VFIOINTx fields as
> follows:
>   pin : Recover this from the vfio config in kernel space
>   interrupt : Preserve its eventfd descriptor across exec.
>   unmask : Ditto
>   route.irq : This could perhaps be recovered in vfio_pci_post_load by
>     calling pci_device_route_intx_to_irq(pin), whose implementation reads
>     config space for a bridge device such as ich9.  However, there is no
>     guarantee that the bridge vmstate is read before vfio vmstate.  Rather
>     than fiddling with MigrationPriority for vmstate handlers, explicitly
>     save route.irq in vfio vmstate.
>   pending : save in vfio vmstate.
>   mmap_timeout, mmap_timer : Re-initialize
>   bool kvm_accel : Re-initialize
> 
> In vfio_realize, defer calling vfio_intx_enable until the vmstate
> is available, in vfio_pci_post_load.  Modify vfio_intx_enable and
> vfio_intx_kvm_enable to skip vfio initialization, but still perform
> kvm initialization.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
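
The field-by-field plan quoted above can be sketched as a small standalone C
example.  This is only an illustration of the split between "saved in vmstate",
"preserved as an fd" and "re-initialized"; it is not the structure or the
vmstate description from the patch.  SketchIntx, SketchIntxSaved and both
functions are hypothetical names, and clearing kvm_accel on load is an
assumption standing in for "re-initialize".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the VFIOINTx fields listed in the commit message. */
typedef struct {
    uint8_t pin;            /* recovered from device config space */
    int interrupt_fd;       /* eventfd preserved across exec */
    int unmask_fd;          /* eventfd preserved across exec */
    int route_irq;          /* saved explicitly */
    bool pending;           /* saved explicitly */
    uint32_t mmap_timeout;  /* re-initialized */
    bool kvm_accel;         /* re-initialized */
} SketchIntx;

/* Only the fields that are not recovered some other way are serialized. */
typedef struct {
    int route_irq;
    bool pending;
} SketchIntxSaved;

static SketchIntxSaved sketch_intx_save(const SketchIntx *intx)
{
    assert(intx != NULL);
    return (SketchIntxSaved){ .route_irq = intx->route_irq,
                              .pending = intx->pending };
}

static void sketch_intx_load(SketchIntx *intx, const SketchIntxSaved *saved,
                             uint8_t pin_from_config,
                             int interrupt_fd, int unmask_fd,
                             uint32_t default_mmap_timeout)
{
    assert(intx != NULL && saved != NULL);
    intx->pin = pin_from_config;            /* re-read, not saved */
    intx->interrupt_fd = interrupt_fd;      /* inherited descriptor */
    intx->unmask_fd = unmask_fd;            /* inherited descriptor */
    intx->route_irq = saved->route_irq;     /* from saved state */
    intx->pending = saved->pending;         /* from saved state */
    intx->mmap_timeout = default_mmap_timeout;
    intx->kvm_accel = false;                /* assumed: set again by kvm setup */
}
```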

Hi Steve,

Not directly related to this patch, but since the context is close: it looks
like this series only takes care of the exec restart mode of vfio-pci.  Have you
had any thoughts on a kexec reboot mode with vfio-pci?

The general idea is that if the DMAR context is not lost during kexec, we should
be able to set up the irqfds again and things will just work?

Fam

--

PS some more info below:

I have some local kernel patches that kexec reboot most of the host kernel
while keeping the IOMMU DMAR tables in a valid state; with that, not many extra
things are needed to restore it. A PoC is below (I can share
more details of the kernel changes if this patch makes any sense):


commit f8951e58be86bd6e37f816394a9a73f28d8059fc
Author: Fam Zheng <fam.zheng@bytedance.com>
Date:   Mon Mar 21 13:19:49 2022 +0000

    cpr: Add live-update support to vfio-pci devices
    
    In cpr-save, always serialize the vfio-pci states.
    
    In cpr-load, add a '-restore' mode that will do
    VFIO_GROUP_GET_DEVICE_FD_INTACT and skip DMAR setup, somewhat similar to
    the current cpr exec mode.
    
    Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 73f4259556..e36f0ef97d 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -584,10 +584,15 @@ void msix_init_vector_notifiers(PCIDevice *dev,
                                 MSIVectorReleaseNotifier release_notifier,
                                 MSIVectorPollNotifier poll_notifier)
 {
+    int vector;
+
     assert(use_notifier && release_notifier);
     dev->msix_vector_use_notifier = use_notifier;
     dev->msix_vector_release_notifier = release_notifier;
     dev->msix_vector_poll_notifier = poll_notifier;
+    for (vector = 0; vector < dev->msix_entries_nr; ++vector) {
+        msix_handle_mask_update(dev, vector, true);
+    }
 }
 
 int msix_set_vector_notifiers(PCIDevice *dev,
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 605ffbb5d0..f1240410a8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -2066,6 +2066,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     bool reused;
     VFIOAddressSpace *space;
 
+    if (restore) {
+        return 0;
+    }
     space = vfio_get_address_space(as);
     fd = cpr_find_fd("vfio_container_for_group", group->groupid);
     reused = (fd > 0);
@@ -2486,7 +2489,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
     fd = cpr_find_fd(name, 0);
     reused = (fd >= 0);
     if (!reused) {
-        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+        int op = restore ? VFIO_GROUP_GET_DEVICE_FD_INTACT : VFIO_GROUP_GET_DEVICE_FD;
+        fd = ioctl(group->fd, op, name);
     }
 
     if (fd < 0) {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e32513c668..9da5f93228 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -361,7 +361,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
      * Do not alter interrupt state during vfio_realize and cpr-load.  The
      * reused flag is cleared thereafter.
      */
-    if (!vdev->pdev.reused) {
+    if (!vdev->pdev.reused && !restore) {
         vfio_disable_interrupts(vdev);
     }
 
@@ -388,7 +388,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
 
-    if (vdev->pdev.reused) {
+    if (vdev->pdev.reused && !restore) {
         vfio_intx_reenable_kvm(vdev, &err);
         goto finish;
     }
@@ -2326,6 +2326,9 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
     int ret, i, count;
     bool multi = false;
 
+    if (restore) {
+        return 0;
+    }
     trace_vfio_pci_hot_reset(vdev->vbasedev.name, single ? "one" : "multi");
 
     if (!single) {
@@ -3185,7 +3188,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
 
         /* Wait until cpr-load reads intx routing data to enable */
-        if (!pdev->reused) {
+        if (!pdev->reused && !restore) {
             ret = vfio_intx_enable(vdev, errp);
             if (ret) {
                 goto out_deregister;
@@ -3295,7 +3298,7 @@ static void vfio_pci_reset(DeviceState *dev)
     VFIOPCIDevice *vdev = VFIO_PCI(dev);
 
     /* Do not reset the device during qemu_system_reset prior to cpr-load */
-    if (vdev->pdev.reused) {
+    if (vdev->pdev.reused || restore) {
         return;
     }
 
@@ -3429,33 +3432,40 @@ static void vfio_merge_config(VFIOPCIDevice *vdev)
 
 static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
 {
-    int i, fd;
+    int i, fd, ret;
     bool pending = false;
     PCIDevice *pdev = &vdev->pdev;
 
+    pdev->msix_entries_nr = nr_vectors;
     vdev->nr_vectors = nr_vectors;
     vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
     vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
 
-    for (i = 0; i < nr_vectors; i++) {
-        VFIOMSIVector *vector = &vdev->msi_vectors[i];
-
-        fd = load_event_fd(vdev, "interrupt", i);
-        if (fd >= 0) {
-            vfio_vector_init(vdev, i);
-            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+    if (restore) {
+        ret = vfio_enable_vectors(vdev, true);
+        if (ret) {
+            error_report("vfio: failed to enable vectors, %d", ret);
         }
+    } else {
+        for (i = 0; i < nr_vectors; i++) {
+            VFIOMSIVector *vector = &vdev->msi_vectors[i];
 
-        if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
-            vfio_add_kvm_msi_virq(vdev, vector, i, msix);
-        }
+            fd = load_event_fd(vdev, "interrupt", i);
+            if (fd >= 0) {
+                vfio_vector_init(vdev, i);
+                qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+            }
 
-        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
-            set_bit(i, vdev->msix->pending);
-            pending = true;
+            if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
+                vfio_add_kvm_msi_virq(vdev, vector, i, msix);
+            }
+
+            if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
+                set_bit(i, vdev->msix->pending);
+                pending = true;
+            }
         }
     }
-
     if (msix) {
         memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
     }
@@ -3534,7 +3544,7 @@ static const VMStateDescription vfio_intx_vmstate = {
 
 static bool vfio_pci_needed(void *opaque)
 {
-    return cpr_get_mode() == CPR_MODE_RESTART;
+    return 1;
 }
 
 static const VMStateDescription vfio_pci_vmstate = {
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 6241c20fb1..0179b0aa90 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -26,6 +26,7 @@ void configure_rtc(QemuOpts *opts);
 void qemu_init_subsystems(void);
 
 extern int autostart;
+extern int restore;
 
 typedef enum {
     VGA_NONE, VGA_STD, VGA_CIRRUS, VGA_VMWARE, VGA_XENFB, VGA_QXL,
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index e680594f27..65c3bab074 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -188,6 +188,8 @@ struct vfio_group_status {
  */
 #define VFIO_GROUP_GET_DEVICE_FD	_IO(VFIO_TYPE, VFIO_BASE + 6)
 
+#define VFIO_GROUP_GET_DEVICE_FD_INTACT	_IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /* --------------- IOCTLs for DEVICE file descriptors --------------- */
 
 /**
diff --git a/qemu-options.hx b/qemu-options.hx
index 8b90d04cb9..03666a59b3 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -3984,6 +3984,10 @@ SRST
     option is experimental.
 ERST
 
+DEF("restore", 0, QEMU_OPTION_restore, \
+    "-restore              restore mode",
+    QEMU_ARCH_ALL)
+
 DEF("S", 0, QEMU_OPTION_S, \
     "-S              freeze CPU at startup (use 'c' to start execution)\n",
     QEMU_ARCH_ALL)
diff --git a/softmmu/globals.c b/softmmu/globals.c
index a18fd8dcf3..6fcb5846b4 100644
--- a/softmmu/globals.c
+++ b/softmmu/globals.c
@@ -41,6 +41,7 @@ bool enable_cpu_pm;
 int nb_nics;
 NICInfo nd_table[MAX_NICS];
 int autostart = 1;
+int restore;
 int vga_interface_type = VGA_NONE;
 Chardev *parallel_hds[MAX_PARALLEL_PORTS];
 int win2k_install_hack;
diff --git a/softmmu/vl.c b/softmmu/vl.c
index f14e29e622..fba6b577cb 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -3088,6 +3088,9 @@ void qemu_init(int argc, char **argv, char **envp)
             case QEMU_OPTION_S:
                 autostart = 0;
                 break;
+            case QEMU_OPTION_restore:
+                restore = 1;
+                break;
             case QEMU_OPTION_k:
                 keyboard_layout = optarg;
                 break;



* Re: [PATCH V6 21/27] vfio-pci: cpr part 3 (intx)
  2022-03-29 11:03   ` Fam Zheng
@ 2022-04-11 16:23     ` Steven Sistare
  2022-04-12 11:01       ` Fam Zheng
  0 siblings, 1 reply; 44+ messages in thread
From: Steven Sistare @ 2022-04-11 16:23 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 3/29/2022 7:03 AM, Fam Zheng wrote:
> On 2021-08-06 14:43, Steve Sistare wrote:
>> Preserve vfio INTX state across cpr restart.  Preserve VFIOINTx fields as
>> follows:
>>   pin : Recover this from the vfio config in kernel space
>>   interrupt : Preserve its eventfd descriptor across exec.
>>   unmask : Ditto
>>   route.irq : This could perhaps be recovered in vfio_pci_post_load by
>>     calling pci_device_route_intx_to_irq(pin), whose implementation reads
>>     config space for a bridge device such as ich9.  However, there is no
>>     guarantee that the bridge vmstate is read before vfio vmstate.  Rather
>>     than fiddling with MigrationPriority for vmstate handlers, explicitly
>>     save route.irq in vfio vmstate.
>>   pending : save in vfio vmstate.
>>   mmap_timeout, mmap_timer : Re-initialize
>>   bool kvm_accel : Re-initialize
>>
>> In vfio_realize, defer calling vfio_intx_enable until the vmstate
>> is available, in vfio_pci_post_load.  Modify vfio_intx_enable and
>> vfio_intx_kvm_enable to skip vfio initialization, but still perform
>> kvm initialization.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Hi Steve,
> 
> Not directly related to this patch, but since the context is close: it looks
> like this series only takes care of the exec restart mode of vfio-pci.  Have you
> had any thoughts on a kexec reboot mode with vfio-pci?
> 
> The general idea is that if the DMAR context is not lost during kexec, we should
> be able to set up the irqfds again and things will just work?
> 
> Fam

Hi Fam,
  I have thought about that use case, but only in general terms.
IMO it best fits in the cpr framework as a new mode (rather than as 
a new -restore command line argument).  

In your code below, you would have fewer code changes if you set 
'reused = true' for the new mode, rather than testing both 'reused and restored' 
at multiple sites. Lastly, I cleaned up the vector handling somewhat from V6 
to V7, so you may want to try your code using V7 as a base.
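
As a rough illustration of the 'reused = true' idea, the sketch below derives
the flag once from the mode, so that most call sites keep testing a single
flag.  It is a standalone example under stated assumptions, not code from V6 or
V7: SketchCprMode, SKETCH_MODE_KEXEC, sketch_device_reused() and
sketch_should_reset() are all hypothetical names, and the "inherited fd" test
only mirrors the 'reused = (fd >= 0)' pattern visible in the patch below.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mode list; SKETCH_MODE_KEXEC stands for the proposed new mode. */
typedef enum {
    SKETCH_MODE_NONE,
    SKETCH_MODE_REBOOT,
    SKETCH_MODE_RESTART,
    SKETCH_MODE_KEXEC,
} SketchCprMode;

/*
 * Decide once whether a device carries state over from the old qemu.
 * In the new mode the device is always treated as reused; otherwise it is
 * reused only if a descriptor was inherited across exec.
 */
static bool sketch_device_reused(SketchCprMode mode, int inherited_fd)
{
    assert(mode >= SKETCH_MODE_NONE && mode <= SKETCH_MODE_KEXEC);
    if (mode == SKETCH_MODE_KEXEC) {
        return true;
    }
    return inherited_fd >= 0;
}

/* A typical call site then tests one flag instead of two. */
static bool sketch_should_reset(bool reused)
{
    return !reused;
}
```

Sites where the two modes really behave differently would still need an
explicit check of the mode, so this reduces the number of changed sites rather
than removing them all.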

- Steve

> PS some more info below:
> 
> I have some local kernel patches that kexec reboot most of the host kernel
> while keeping the IOMMU DMAR tables in a valid state; with that, not many extra
> things are needed to restore it. A PoC is below (I can share
> more details of the kernel changes if this patch makes any sense):
> 
> 
> commit f8951e58be86bd6e37f816394a9a73f28d8059fc
> Author: Fam Zheng <fam.zheng@bytedance.com>
> Date:   Mon Mar 21 13:19:49 2022 +0000
> 
>     cpr: Add live-update support to vfio-pci devices
>     
>     In cpr-save, always serialize the vfio-pci states.
>     
>     In cpr-load, add a '-restore' mode that will do
>     VFIO_GROUP_GET_DEVICE_FD_INTACT and skip DMAR setup, somewhat similar to
>     the current cpr exec mode.
>     
>     Signed-off-by: Fam Zheng <fam.zheng@bytedance.com>
> 
> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> index 73f4259556..e36f0ef97d 100644
> --- a/hw/pci/msix.c
> +++ b/hw/pci/msix.c
> @@ -584,10 +584,15 @@ void msix_init_vector_notifiers(PCIDevice *dev,
>                                  MSIVectorReleaseNotifier release_notifier,
>                                  MSIVectorPollNotifier poll_notifier)
>  {
> +    int vector;
> +
>      assert(use_notifier && release_notifier);
>      dev->msix_vector_use_notifier = use_notifier;
>      dev->msix_vector_release_notifier = release_notifier;
>      dev->msix_vector_poll_notifier = poll_notifier;
> +    for (vector = 0; vector < dev->msix_entries_nr; ++vector) {
> +        msix_handle_mask_update(dev, vector, true);
> +    }
>  }
>  
>  int msix_set_vector_notifiers(PCIDevice *dev,
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 605ffbb5d0..f1240410a8 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -2066,6 +2066,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      bool reused;
>      VFIOAddressSpace *space;
>  
> +    if (restore) {
> +        return 0;
> +    }
>      space = vfio_get_address_space(as);
>      fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>      reused = (fd > 0);
> @@ -2486,7 +2489,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>      fd = cpr_find_fd(name, 0);
>      reused = (fd >= 0);
>      if (!reused) {
> -        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +        int op = restore ? VFIO_GROUP_GET_DEVICE_FD_INTACT : VFIO_GROUP_GET_DEVICE_FD;
> +        fd = ioctl(group->fd, op, name);
>      }
>  
>      if (fd < 0) {
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index e32513c668..9da5f93228 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -361,7 +361,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>       * Do not alter interrupt state during vfio_realize and cpr-load.  The
>       * reused flag is cleared thereafter.
>       */
> -    if (!vdev->pdev.reused) {
> +    if (!vdev->pdev.reused && !restore) {
>          vfio_disable_interrupts(vdev);
>      }
>  
> @@ -388,7 +388,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>      fd = event_notifier_get_fd(&vdev->intx.interrupt);
>      qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
>  
> -    if (vdev->pdev.reused) {
> +    if (vdev->pdev.reused && !restore) {
>          vfio_intx_reenable_kvm(vdev, &err);
>          goto finish;
>      }
> @@ -2326,6 +2326,9 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single)
>      int ret, i, count;
>      bool multi = false;
>  
> +    if (restore) {
> +        return 0;
> +    }
>      trace_vfio_pci_hot_reset(vdev->vbasedev.name, single ? "one" : "multi");
>  
>      if (!single) {
> @@ -3185,7 +3188,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
>  
>          /* Wait until cpr-load reads intx routing data to enable */
> -        if (!pdev->reused) {
> +        if (!pdev->reused && !restore) {
>              ret = vfio_intx_enable(vdev, errp);
>              if (ret) {
>                  goto out_deregister;
> @@ -3295,7 +3298,7 @@ static void vfio_pci_reset(DeviceState *dev)
>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>  
>      /* Do not reset the device during qemu_system_reset prior to cpr-load */
> -    if (vdev->pdev.reused) {
> +    if (vdev->pdev.reused || restore) {
>          return;
>      }
>  
> @@ -3429,33 +3432,40 @@ static void vfio_merge_config(VFIOPCIDevice *vdev)
>  
>  static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
>  {
> -    int i, fd;
> +    int i, fd, ret;
>      bool pending = false;
>      PCIDevice *pdev = &vdev->pdev;
>  
> +    pdev->msix_entries_nr = nr_vectors;
>      vdev->nr_vectors = nr_vectors;
>      vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
>      vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
>  
> -    for (i = 0; i < nr_vectors; i++) {
> -        VFIOMSIVector *vector = &vdev->msi_vectors[i];
> -
> -        fd = load_event_fd(vdev, "interrupt", i);
> -        if (fd >= 0) {
> -            vfio_vector_init(vdev, i);
> -            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
> +    if (restore) {
> +        ret = vfio_enable_vectors(vdev, true);
> +        if (ret) {
> +            error_report("vfio: failed to enable vectors, %d", ret);
>          }
> +    } else {
> +        for (i = 0; i < nr_vectors; i++) {
> +            VFIOMSIVector *vector = &vdev->msi_vectors[i];
>  
> -        if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
> -            vfio_add_kvm_msi_virq(vdev, vector, i, msix);
> -        }
> +            fd = load_event_fd(vdev, "interrupt", i);
> +            if (fd >= 0) {
> +                vfio_vector_init(vdev, i);
> +                qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
> +            }
>  
> -        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
> -            set_bit(i, vdev->msix->pending);
> -            pending = true;
> +            if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
> +                vfio_add_kvm_msi_virq(vdev, vector, i, msix);
> +            }
> +
> +            if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
> +                set_bit(i, vdev->msix->pending);
> +                pending = true;
> +            }
>          }
>      }
> -
>      if (msix) {
>          memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
>      }
> @@ -3534,7 +3544,7 @@ static const VMStateDescription vfio_intx_vmstate = {
>  
>  static bool vfio_pci_needed(void *opaque)
>  {
> -    return cpr_get_mode() == CPR_MODE_RESTART;
> +    return 1;
>  }
>  
>  static const VMStateDescription vfio_pci_vmstate = {
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 6241c20fb1..0179b0aa90 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -26,6 +26,7 @@ void configure_rtc(QemuOpts *opts);
>  void qemu_init_subsystems(void);
>  
>  extern int autostart;
> +extern int restore;
>  
>  typedef enum {
>      VGA_NONE, VGA_STD, VGA_CIRRUS, VGA_VMWARE, VGA_XENFB, VGA_QXL,
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index e680594f27..65c3bab074 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -188,6 +188,8 @@ struct vfio_group_status {
>   */
>  #define VFIO_GROUP_GET_DEVICE_FD	_IO(VFIO_TYPE, VFIO_BASE + 6)
>  
> +#define VFIO_GROUP_GET_DEVICE_FD_INTACT	_IO(VFIO_TYPE, VFIO_BASE + 21)
> +
>  /* --------------- IOCTLs for DEVICE file descriptors --------------- */
>  
>  /**
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 8b90d04cb9..03666a59b3 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -3984,6 +3984,10 @@ SRST
>      option is experimental.
>  ERST
>  
> +DEF("restore", 0, QEMU_OPTION_restore, \
> +    "-restore              restore mode",
> +    QEMU_ARCH_ALL)
> +
>  DEF("S", 0, QEMU_OPTION_S, \
>      "-S              freeze CPU at startup (use 'c' to start execution)\n",
>      QEMU_ARCH_ALL)
> diff --git a/softmmu/globals.c b/softmmu/globals.c
> index a18fd8dcf3..6fcb5846b4 100644
> --- a/softmmu/globals.c
> +++ b/softmmu/globals.c
> @@ -41,6 +41,7 @@ bool enable_cpu_pm;
>  int nb_nics;
>  NICInfo nd_table[MAX_NICS];
>  int autostart = 1;
> +int restore;
>  int vga_interface_type = VGA_NONE;
>  Chardev *parallel_hds[MAX_PARALLEL_PORTS];
>  int win2k_install_hack;
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index f14e29e622..fba6b577cb 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -3088,6 +3088,9 @@ void qemu_init(int argc, char **argv, char **envp)
>              case QEMU_OPTION_S:
>                  autostart = 0;
>                  break;
> +            case QEMU_OPTION_restore:
> +                restore = 1;
> +                break;
>              case QEMU_OPTION_k:
>                  keyboard_layout = optarg;
>                  break;



* Re: [PATCH V6 21/27] vfio-pci: cpr part 3 (intx)
  2022-04-11 16:23     ` Steven Sistare
@ 2022-04-12 11:01       ` Fam Zheng
  0 siblings, 0 replies; 44+ messages in thread
From: Fam Zheng @ 2022-04-12 11:01 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Jason Zeng, Michael S. Tsirkin, Philippe Mathieu-Daudé,
	Juan Quintela, qemu-devel, Dr. David Alan Gilbert, Zheng Chuan,
	Marc-André Lureau, Alex Williamson, Daniel P. Berrange,
	Stefan Hajnoczi, Paolo Bonzini, Fam Zheng, Alex Bennée,
	Eric Blake, Markus Armbruster

On 2022-04-11 12:23, Steven Sistare wrote:
> On 3/29/2022 7:03 AM, Fam Zheng wrote:
> > On 2021-08-06 14:43, Steve Sistare wrote:
> >> Preserve vfio INTX state across cpr restart.  Preserve VFIOINTx fields as
> >> follows:
> >>   pin : Recover this from the vfio config in kernel space
> >>   interrupt : Preserve its eventfd descriptor across exec.
> >>   unmask : Ditto
> >>   route.irq : This could perhaps be recovered in vfio_pci_post_load by
> >>     calling pci_device_route_intx_to_irq(pin), whose implementation reads
> >>     config space for a bridge device such as ich9.  However, there is no
> >>     guarantee that the bridge vmstate is read before vfio vmstate.  Rather
> >>     than fiddling with MigrationPriority for vmstate handlers, explicitly
> >>     save route.irq in vfio vmstate.
> >>   pending : save in vfio vmstate.
> >>   mmap_timeout, mmap_timer : Re-initialize
> >>   bool kvm_accel : Re-initialize
> >>
> >> In vfio_realize, defer calling vfio_intx_enable until the vmstate
> >> is available, in vfio_pci_post_load.  Modify vfio_intx_enable and
> >> vfio_intx_kvm_enable to skip vfio initialization, but still perform
> >> kvm initialization.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > 
> > Hi Steve,
> > 
> > Not directly related to this patch, but since the context is close: it looks
> > like this series only takes care of exec restart mode of vfio-pci, have you had
> > any thoughts on kexec reboot mode with vfio-pci?
> > 
> > The general idea is if DMAR context is not lost during kexec, we should be able
> > to set up irqfds again and things will just work?
> > 
> > Fam
> 
> Hi Fam,
>   I have thought about that use case, but only in general terms.
> IMO it best fits in the cpr framework as a new mode (rather than as 
> a new -restore command line argument).  

Yes, I think that is better. I will try that.

> 
> In your code below, you would have fewer code changes if you set 
> 'reused = true' for the new mode, rather than testing both 'reused and restored' 
> at multiple sites. Lastly, I cleaned up the vector handling somewhat from V6 
> to V7, so you may want to try your code using V7 as a base.

I am cleaning up the kernel patches and will post both parts once ready.

Fam
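
The commit message quoted above says the INTx "interrupt" and "unmask"
eventfd descriptors are preserved across exec.  A minimal sketch of that
general technique follows.  It is not the code from this series: the helper
names keep_fd_across_exec, save_fd and load_fd are hypothetical, and passing
the descriptor number through an environment variable is an assumption made
only to keep the example short.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: clear FD_CLOEXEC so that the descriptor is
 * inherited by the program started with exec().  Returns 0 on success,
 * -1 on failure.
 */
static int keep_fd_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/*
 * Hypothetical helper: record the descriptor number under a name so the
 * post-exec process can look it up.  An environment variable is an
 * assumption for illustration; any state that survives exec would do.
 */
static int save_fd(const char *name, int fd)
{
    char val[16];

    snprintf(val, sizeof(val), "%d", fd);
    return setenv(name, val, 1);
}

/* Hypothetical helper: return the saved descriptor, or -1 if none. */
static int load_fd(const char *name)
{
    const char *val = getenv(name);

    return val ? atoi(val) : -1;
}
```

Before exec, the old process would call keep_fd_across_exec() and save_fd()
on each eventfd.  After exec, the new process would call load_fd() and, if
the result is >= 0, attach its handler to the existing descriptor instead of
creating a new eventfd, much as the diff above tests the return value of
load_event_fd().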




Thread overview: 44+ messages
2021-08-06 21:43 [PATCH V6 00/27] Live Update Steve Sistare
2021-08-06 21:43 ` [PATCH V6 01/27] memory: qemu_check_ram_volatile Steve Sistare
2021-08-06 21:43 ` [PATCH V6 02/27] migration: fix populate_vfio_info Steve Sistare
2021-08-06 21:43 ` [PATCH V6 03/27] migration: qemu file wrappers Steve Sistare
2021-08-06 21:43 ` [PATCH V6 04/27] migration: simplify savevm Steve Sistare
2021-08-06 21:43 ` [PATCH V6 05/27] vl: start on wakeup request Steve Sistare
2021-08-06 21:43 ` [PATCH V6 06/27] cpr: reboot mode Steve Sistare
2021-08-06 21:43 ` [PATCH V6 07/27] cpr: reboot HMP interfaces Steve Sistare
2021-08-06 21:43 ` [PATCH V6 08/27] memory: flat section iterator Steve Sistare
2021-08-06 21:43 ` [PATCH V6 09/27] oslib: qemu_clear_cloexec Steve Sistare
2021-08-06 21:43 ` [PATCH V6 10/27] machine: memfd-alloc option Steve Sistare
2021-08-06 21:43 ` [PATCH V6 11/27] qapi: list utility functions Steve Sistare
2021-08-06 21:43 ` [PATCH V6 12/27] vl: helper to request re-exec Steve Sistare
2021-08-06 21:43 ` [PATCH V6 13/27] cpr: preserve extra state Steve Sistare
2021-08-06 21:43 ` [PATCH V6 14/27] cpr: restart mode Steve Sistare
2021-08-06 21:43 ` [PATCH V6 15/27] cpr: restart HMP interfaces Steve Sistare
2021-08-06 21:43 ` [PATCH V6 16/27] hostmem-memfd: cpr for memory-backend-memfd Steve Sistare
2021-08-06 21:43 ` [PATCH V6 17/27] pci: export functions for cpr Steve Sistare
2021-08-06 21:43 ` [PATCH V6 18/27] vfio-pci: refactor " Steve Sistare
2021-08-10 16:53   ` Alex Williamson
2021-08-23 16:52     ` Steven Sistare
2021-08-06 21:43 ` [PATCH V6 19/27] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
2021-08-10 17:06   ` Alex Williamson
2021-08-23 19:43     ` Steven Sistare
2021-11-10  7:48     ` Zheng Chuan
2021-11-30 16:11       ` Steven Sistare
2021-08-06 21:43 ` [PATCH V6 20/27] vfio-pci: cpr part 2 (msi) Steve Sistare
2021-08-06 21:43 ` [PATCH V6 21/27] vfio-pci: cpr part 3 (intx) Steve Sistare
2022-03-29 11:03   ` Fam Zheng
2022-04-11 16:23     ` Steven Sistare
2022-04-12 11:01       ` Fam Zheng
2021-08-06 21:43 ` [PATCH V6 22/27] vhost: reset vhost devices for cpr Steve Sistare
2021-08-06 21:43 ` [PATCH V6 23/27] chardev: cpr framework Steve Sistare
2021-08-06 21:43 ` [PATCH V6 24/27] chardev: cpr for simple devices Steve Sistare
2021-08-06 21:43 ` [PATCH V6 25/27] chardev: cpr for pty Steve Sistare
2021-08-06 21:44 ` [PATCH V6 26/27] chardev: cpr for sockets Steve Sistare
2021-08-06 21:44 ` [PATCH V6 27/27] cpr: only-cpr-capable option Steve Sistare
2021-08-09 16:02 ` [PATCH V6 00/27] Live Update Steven Sistare
2021-08-21  8:54 ` Zheng Chuan
2021-08-23 21:36   ` Steven Sistare
2021-08-24  9:36     ` Zheng Chuan
2021-08-31 21:15       ` Steven Sistare
2021-10-27  6:16         ` Zheng Chuan
2021-10-27 12:25           ` Steven Sistare
