* [PATCH V8 00/39] Live Update
@ 2022-06-15 14:51 Steve Sistare
  2022-06-15 14:51 ` [PATCH V8 01/39] migration: fix populate_vfio_info Steve Sistare
                   ` (38 more replies)
  0 siblings, 39 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Provide the cpr-save, cpr-exec, and cpr-load commands for live update.
These save and restore VM state, with minimal guest pause time, so that
qemu may be updated to a new version in between.

cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
any type of guest image and block device, but the caller must not modify
guest block devices between cpr-save and cpr-load.  It supports two modes:
reboot and restart.

In reboot mode, the caller invokes cpr-save and then terminates qemu.
The caller may then update the host kernel and system software and reboot.
The caller resumes the guest by running qemu with the same arguments as the
original process, plus -S so new qemu starts in a paused state, and invoking
cpr-load.  For maximum efficiency in this mode, guest ram should be mapped to
a persistent shared memory file such as /dev/dax0.0, or /dev/shm PKRAM as
proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.

The reboot mode supports vfio devices if the caller first suspends the
guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
guest drivers' suspend methods flush outstanding requests and re-initialize
the devices, and thus there is no device state to save and restore.

Restart mode preserves the guest VM across a restart of the qemu process.
After cpr-save, the caller passes the original qemu command-line arguments
plus -S to cpr-exec. The restart mode supports vfio devices by preserving the
vfio container, group, device, and event descriptors across the qemu re-exec,
and by updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR
and VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
and integrated in Linux kernel 5.12.

For restart mode, the user must create guest ram using a memory-backend-memfd
or a shared memory-backend-file.  These are re-mmap'd in the updated process,
so guest ram is efficiently preserved in place, albeit with new virtual
addresses.  In addition, qemu allocates secondary guest ram blocks -- those
that cannot be specified as objects on the command line -- using memfd_create.
The memfd's are remembered and kept open across exec, after which they are
re-mmap'd.
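
The memfd handling described above can be sketched roughly as follows.  This
is a minimal illustration and not code from this series: ram_block_create,
fd_clear_cloexec, ram_block_map, and remap_preserves_contents are hypothetical
names, and an munmap/mmap cycle within one process stands in for the re-exec,
which the sketch does not perform.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a memfd of the given size. */
static int ram_block_create(const char *name, size_t size)
{
    int fd = memfd_create(name, MFD_CLOEXEC);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Clear FD_CLOEXEC so the descriptor would stay open across exec. */
static int fd_clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Map the block shared, so stores reach the memfd pages. */
static void *ram_block_map(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Returns 1 if data written through one mapping is visible in a later one. */
static int remap_preserves_contents(void)
{
    const size_t size = (size_t)sysconf(_SC_PAGESIZE);
    const char msg[] = "guest ram";
    char *old_va, *new_va;
    int ok;
    int fd = ram_block_create("demo-ram", size);

    if (fd < 0) {
        return 0;
    }
    if (fd_clear_cloexec(fd) < 0) {
        close(fd);
        return 0;
    }
    old_va = ram_block_map(fd, size);
    if (!old_va) {
        close(fd);
        return 0;
    }
    memcpy(old_va, msg, sizeof(msg));
    munmap(old_va, size);           /* stand-in for losing mappings at exec */

    new_va = ram_block_map(fd, size);   /* may land at a new virtual address */
    if (!new_va) {
        close(fd);
        return 0;
    }
    ok = memcmp(new_va, msg, sizeof(msg)) == 0;
    munmap(new_va, size);
    close(fd);
    return ok;
}
```

The pages belong to the memfd rather than to any one mapping, which is why
only the descriptor needs to survive for the contents to be preserved.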

The caller resumes the guest by invoking cpr-load, which loads state from
the file. If the VM was running at cpr-save time, then VM execution resumes.
If the VM was suspended at cpr-save time (reboot mode), then the caller must
issue a system_wakeup command to resume.

The first patches add reboot mode:
  - migration: fix populate_vfio_info
  - migration: qemu file wrappers
  - migration: simplify savevm
  - memory: RAM_ANON flag
  - vl: start on wakeup request
  - cpr: reboot mode
  - cpr: reboot HMP interfaces
  - cpr: blockers
  - cpr: register blockers
  - cpr: cpr-enable option
  - cpr: save ram blocks

The next patches add restart mode:
  - memory: flat section iterator
  - oslib: qemu_clear_cloexec
  - qapi: strList_from_string
  - qapi: QAPI_LIST_LENGTH
  - qapi: strv_from_strList
  - qapi: strList unit tests
  - vl: helper to request re-exec
  - cpr: preserve extra state
  - cpr: restart mode
  - cpr: restart HMP interfaces
  - cpr: ram block blockers
  - hostmem-memfd: cpr for memory-backend-memfd

The next patches add vfio support for restart mode:
  - pci: export msix_is_pending
  - cpr: notifiers
  - vfio-pci: refactor for cpr
  - vfio-pci: cpr part 1 (fd and dma)
  - vfio-pci: cpr part 2 (msi)
  - vfio-pci: cpr part 3 (intx)
  - vfio-pci: recover from unmap-all-vaddr failure

The next patches preserve various descriptor-based backend devices across
cpr-exec:
  - vhost: reset vhost devices for cpr
  - loader: suppress rom_reset during cpr
  - chardev: cpr framework
  - chardev: cpr for simple devices
  - chardev: cpr for pty
  - chardev: cpr for sockets
  - cpr: only-cpr-capable option

The next patches add a test:
  - python/machine: add QEMUMachine accessors
  - tests/avocado: add cpr regression test

Here is an example of updating qemu from v7.0.0 to v7.1.0 using
restart mode.  The software update is performed while the guest is
running to minimize downtime.

window 1                                        | window 2
                                                |
# qemu-system-x86_64 ...                        |
QEMU 7.0.0 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: running                              |
                                                | # yum update qemu
(qemu) cpr-save /tmp/qemu.sav restart           |
(qemu) cpr-exec qemu-system-x86_64 -S ...       |
QEMU 7.1.0 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: paused (prelaunch)                   |
(qemu) cpr-load /tmp/qemu.sav                   |
(qemu) info status                              |
VM status: running                              |


Here is an example of updating the host kernel using reboot mode.

window 1                                        | window 2
                                                |
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 7.1.0 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: running                              |
                                                | # yum update kernel-uek
(qemu) cpr-save /tmp/qemu.sav reboot            |
(qemu) quit                                     |
                                                |
# systemctl kexec                               |
kexec_core: Starting new kernel                 |
...                                             |
                                                |
# qemu-system-x86_64 -S mem-path=/dev/dax0.0 ...|
QEMU 7.1.0 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: paused (prelaunch)                   |
(qemu) cpr-load /tmp/qemu.sav                   |
(qemu) info status                              |
VM status: running                              |

Changes from V1 to V2:
  - revert vmstate infrastructure changes
  - refactor cpr functions into new files
  - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to
    preserve memory.
  - add framework to filter chardev's that support cpr
  - save and restore vfio eventfd's
  - modify cprinfo QMP interface
  - incorporate misc review feedback
  - remove unrelated and unneeded patches
  - refactor all patches into a shorter and easier to review series

Changes from V2 to V3:
  - rebase to qemu 6.0.0
  - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
  - change memfd-alloc to a machine option
  - Use qio_channel_socket_new_fd instead of adding qio_channel_socket_new_fd
  - close monitor socket during cpr
  - fix a few unreported bugs
  - support memory-backend-memfd

Changes from V3 to V4:
  - split reboot mode into separate patches
  - add cprexec command
  - delete QEMU_START_FREEZE, argv_main, and /usr/bin/qemu-exec
  - add more checks for vfio and cpr compatibility, and recover after errors
  - save vfio pci config in vmstate
  - rename {setenv,getenv}_event_fd to {save,load}_event_fd
  - use qemu_strtol
  - change 6.0 references to 6.1
  - use strerror(), use EXIT_FAILURE, remove period from error messages
  - distribute MAINTAINERS additions to each patch

Changes from V4 to V5:
  - rebase to master

Changes from V5 to V6:
  vfio:
  - delete redundant bus_master_enable_region in vfio_pci_post_load
  - delete unmap.size warning
  - fix phys_config memory leak
  - add INTX support
  - add vfio_named_notifier_init() helper
  Other:
  - 6.1 -> 6.2
  - rename file -> filename in qapi
  - delete cprinfo.  qapi introspection serves the same purpose.
  - rename cprsave, cprexec, cprload -> cpr-save, cpr-exec, cpr-load
  - improve documentation in qapi/cpr.json
  - rename qemu_ram_volatile -> qemu_ram_check_volatile, and use
    qemu_ram_foreach_block
  - rename handle -> opaque
  - use ERRP_GUARD
  - use g_autoptr and g_autofree, and glib allocation functions
  - conform to error conventions for bool and int function return values
    and function names.
  - remove word "error" in error messages
  - rename as_flat_walk and its callback, and add comments.
  - rename qemu_clr_cloexec -> qemu_clear_cloexec
  - rename close-on-cpr -> reopen-on-cpr
  - add strList utility functions
  - factor out start on wakeup request to a separate patch
  - deleted unnecessary layer (cprsave etc) and squashed QMP patches
  - conditionally compile for CONFIG_VFIO

Changes from V6 to V7:
  vfio:
  - convert all event fd's to named event fd's with the same lifecycle and
    delete vfio_pci_pre_save
  - use vfio listener callback for updating vaddr and
    defer listener registration
  - update vaddr in vfio_dma_map
  - simplify iommu_type derivation
  - refactor recovery from unmap-all-vaddr failure to a separate patch
  - add vfio_pci_pre_load to handle non-emulated config bits
  - do not call VFIO_GROUP_SET_CONTAINER if reused
  - add comments for vfio cpr
  Other:
  - suppress rom_reset during cpr
  - more robust management of cpr mode
  - delete chardev fd's iff !reopen_on_cpr

Changes from V7 to V8:
  vfio:
  - delete hardcoded vfio callouts from migration/cpr.c, and add a vmstate
    handler for the vfio container
  - register notifier to recover from unmap-all-vaddr failure
  - register blocker for unsupported container
  - fix err_notifier and req_notifier names
  - use VFIO_CHECK_EXTENSION to set iommu_type after restart
  - delete vfio_merge_config, not needed.
  - improve vfio_connect_container
  - simplify by using cpr_resave_fd
  - simplify populate_vfio_info CONFIG_VFIO fix

  Other:
  - add the -cpr-enable command-line option
  - add mode argument to cpr-load
  - register cpr blockers and notifiers, so cpr.c becomes generic.
  - save small ram blocks to file for reboot mode, using a new vmstate handler
  - add cpr_save_memfd to save used_length to support resizeable ram block
  - fix the classification of volatile ram blocks
  - add RAM_ANON flag
  - add cpr regression test
  - split strList patches, use GStrv, add unit test
  - simplify pci changes
  - rename: qemu_file_open -> qemu_fopen_file, qemu_fd_open -> qemu_fopen_fd,
    s -> mrs
  - add chardev cpr_enabled flag
  - check reopen_on_cpr for chardev socket

Steve Sistare (36):
  migration: fix populate_vfio_info
  migration: qemu file wrappers
  migration: simplify savevm
  memory: RAM_ANON flag
  vl: start on wakeup request
  cpr: reboot mode
  cpr: blockers
  cpr: register blockers
  cpr: cpr-enable option
  cpr: save ram blocks
  memory: flat section iterator
  oslib: qemu_clear_cloexec
  qapi: strList_from_string
  qapi: QAPI_LIST_LENGTH
  qapi: strv_from_strList
  qapi: strList unit tests
  vl: helper to request re-exec
  cpr: preserve extra state
  cpr: restart mode
  cpr: restart HMP interfaces
  cpr: ram block blockers
  hostmem-memfd: cpr for memory-backend-memfd
  pci: export msix_is_pending
  cpr: notifiers
  vfio-pci: refactor for cpr
  vfio-pci: cpr part 1 (fd and dma)
  vfio-pci: cpr part 2 (msi)
  vfio-pci: cpr part 3 (intx)
  vfio-pci: recover from unmap-all-vaddr failure
  loader: suppress rom_reset during cpr
  chardev: cpr framework
  chardev: cpr for simple devices
  chardev: cpr for pty
  cpr: only-cpr-capable option
  python/machine: add QEMUMachine accessors
  tests/avocado: add cpr regression test

Mark Kanda, Steve Sistare (3):
  cpr: reboot HMP interfaces
  vhost: reset vhost devices for cpr
  chardev: cpr for sockets

 MAINTAINERS                    |  14 ++
 accel/xen/xen-all.c            |   3 +
 backends/hostmem-epc.c         |   8 +-
 backends/hostmem-memfd.c       |  22 +--
 backends/hostmem-ram.c         |   1 +
 chardev/char-mux.c             |   1 +
 chardev/char-null.c            |   1 +
 chardev/char-pty.c             |  16 +-
 chardev/char-serial.c          |   1 +
 chardev/char-socket.c          |  45 ++++++
 chardev/char-stdio.c           |  10 ++
 chardev/char.c                 |  49 +++++-
 gdbstub.c                      |   1 +
 hmp-commands.hx                |  66 ++++++++
 hw/core/loader.c               |   4 +-
 hw/pci/msix.c                  |   2 +-
 hw/pci/pci.c                   |  12 ++
 hw/vfio/common.c               | 213 ++++++++++++++++++++-----
 hw/vfio/cpr.c                  | 148 +++++++++++++++++
 hw/vfio/meson.build            |   1 +
 hw/vfio/pci.c                  | 351 ++++++++++++++++++++++++++++++++++++-----
 hw/vfio/trace-events           |   1 +
 hw/virtio/vhost.c              |  17 ++
 include/chardev/char-socket.h  |   1 +
 include/chardev/char.h         |   5 +
 include/exec/memory.h          |  42 +++++
 include/exec/ram_addr.h        |   1 +
 include/exec/ramblock.h        |   1 +
 include/hw/pci/msix.h          |   1 +
 include/hw/vfio/vfio-common.h  |  11 ++
 include/hw/virtio/vhost.h      |   1 +
 include/migration/cpr.h        |  53 +++++++
 include/migration/vmstate.h    |   1 +
 include/monitor/hmp.h          |   3 +
 include/qapi/util.h            |  28 ++++
 include/qemu/osdep.h           |   1 +
 include/sysemu/runstate.h      |   2 +
 migration/cpr-state.c          | 330 ++++++++++++++++++++++++++++++++++++++
 migration/cpr.c                | 274 ++++++++++++++++++++++++++++++++
 migration/meson.build          |   2 +
 migration/migration.c          |   6 +
 migration/qemu-file-channel.c  |  36 +++++
 migration/qemu-file-channel.h  |   6 +
 migration/ram.c                |   3 +-
 migration/savevm.c             |  22 +--
 migration/target.c             |   1 +
 migration/trace-events         |   8 +
 monitor/hmp-cmds.c             |  67 ++++----
 monitor/hmp.c                  |   3 +
 monitor/qmp.c                  |   3 +
 python/qemu/machine/machine.py |  14 ++
 qapi/char.json                 |   7 +-
 qapi/cpr.json                  |  90 +++++++++++
 qapi/meson.build               |   1 +
 qapi/qapi-schema.json          |   1 +
 qapi/qapi-util.c               |  37 +++++
 qemu-options.hx                |  44 +++++-
 replay/replay.c                |   4 +
 softmmu/memory.c               |  38 +++++
 softmmu/physmem.c              | 173 +++++++++++++++++++-
 softmmu/runstate.c             |  43 ++++-
 softmmu/vl.c                   |  14 ++
 stubs/cpr-state.c              |  27 ++++
 stubs/cpr.c                    |  33 ++++
 stubs/meson.build              |   2 +
 tests/avocado/cpr.py           | 152 ++++++++++++++++++
 tests/unit/meson.build         |   1 +
 tests/unit/test-strlist.c      |  81 ++++++++++
 trace-events                   |   1 +
 util/oslib-posix.c             |   9 ++
 util/oslib-win32.c             |   4 +
 71 files changed, 2528 insertions(+), 147 deletions(-)
 create mode 100644 hw/vfio/cpr.c
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr-state.c
 create mode 100644 migration/cpr.c
 create mode 100644 qapi/cpr.json
 create mode 100644 stubs/cpr-state.c
 create mode 100644 stubs/cpr.c
 create mode 100644 tests/avocado/cpr.py
 create mode 100644 tests/unit/test-strlist.c

--
1.8.3.1




* [PATCH V8 01/39] migration: fix populate_vfio_info
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-06-16 14:41   ` Marc-André Lureau
  2022-06-15 14:51 ` [PATCH V8 02/39] migration: qemu file wrappers Steve Sistare
                   ` (37 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Include CONFIG_DEVICES so that populate_vfio_info is instantiated for
CONFIG_VFIO.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/target.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/migration/target.c b/migration/target.c
index 907ebf0..a0991bc 100644
--- a/migration/target.c
+++ b/migration/target.c
@@ -8,6 +8,7 @@
 #include "qemu/osdep.h"
 #include "qapi/qapi-types-migration.h"
 #include "migration.h"
+#include CONFIG_DEVICES
 
 #ifdef CONFIG_VFIO
 #include "hw/vfio/vfio-common.h"
-- 
1.8.3.1




* [PATCH V8 02/39] migration: qemu file wrappers
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
  2022-06-15 14:51 ` [PATCH V8 01/39] migration: fix populate_vfio_info Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-06-16  2:18   ` Guoyi Tu
                     ` (2 more replies)
  2022-06-15 14:51 ` [PATCH V8 03/39] migration: simplify savevm Steve Sistare
                   ` (36 subsequent siblings)
  38 siblings, 3 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Add qemu_fopen_file and qemu_fopen_fd to create QEMUFile objects for unix
files and file descriptors.
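
The file wrapper chooses a read or a write QEMUFile from the open flags and
rejects O_RDWR.  A minimal sketch of that dispatch, using hypothetical names
(FileDirection, file_direction).  Masking with O_ACCMODE is a choice made for
this sketch; the patch itself tests the O_RDWR and O_WRONLY bits directly.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    DIR_INPUT,          /* would map to qemu_fopen_channel_input */
    DIR_OUTPUT,         /* would map to qemu_fopen_channel_output */
    DIR_UNSUPPORTED     /* O_RDWR: a QEMUFile is one-directional */
} FileDirection;

/* Classify open(2) flags by access mode, ignoring O_CREAT, O_TRUNC, etc. */
static FileDirection file_direction(int flags)
{
    switch (flags & O_ACCMODE) {
    case O_RDONLY:
        return DIR_INPUT;
    case O_WRONLY:
        return DIR_OUTPUT;
    default:
        return DIR_UNSUPPORTED;
    }
}
```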

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/qemu-file-channel.c | 36 ++++++++++++++++++++++++++++++++++++
 migration/qemu-file-channel.h |  6 ++++++
 2 files changed, 42 insertions(+)

diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
index bb5a575..cc5aebc 100644
--- a/migration/qemu-file-channel.c
+++ b/migration/qemu-file-channel.c
@@ -27,8 +27,10 @@
 #include "qemu-file.h"
 #include "io/channel-socket.h"
 #include "io/channel-tls.h"
+#include "io/channel-file.h"
 #include "qemu/iov.h"
 #include "qemu/yank.h"
+#include "qapi/error.h"
 #include "yank_functions.h"
 
 
@@ -192,3 +194,37 @@ QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
     object_ref(OBJECT(ioc));
     return qemu_fopen_ops(ioc, &channel_output_ops, true);
 }
+
+QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
+                          const char *name, Error **errp)
+{
+    g_autoptr(QIOChannelFile) fioc = NULL;
+    QIOChannel *ioc;
+    QEMUFile *f;
+
+    if (flags & O_RDWR) {
+        error_setg(errp, "qemu_fopen_file %s: O_RDWR not supported", path);
+        return NULL;
+    }
+
+    fioc = qio_channel_file_new_path(path, flags, mode, errp);
+    if (!fioc) {
+        return NULL;
+    }
+
+    ioc = QIO_CHANNEL(fioc);
+    qio_channel_set_name(ioc, name);
+    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
+                             qemu_fopen_channel_input(ioc);
+    return f;
+}
+
+QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name)
+{
+    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+    QIOChannel *ioc = QIO_CHANNEL(fioc);
+    QEMUFile *f = writable ? qemu_fopen_channel_output(ioc) :
+                             qemu_fopen_channel_input(ioc);
+    qio_channel_set_name(ioc, name);
+    return f;
+}
diff --git a/migration/qemu-file-channel.h b/migration/qemu-file-channel.h
index 0028a09..75fd0ad 100644
--- a/migration/qemu-file-channel.h
+++ b/migration/qemu-file-channel.h
@@ -29,4 +29,10 @@
 
 QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc);
 QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc);
+
+QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
+                         const char *name, Error **errp);
+
+QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name);
+
 #endif
-- 
1.8.3.1




* [PATCH V8 03/39] migration: simplify savevm
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
  2022-06-15 14:51 ` [PATCH V8 01/39] migration: fix populate_vfio_info Steve Sistare
  2022-06-15 14:51 ` [PATCH V8 02/39] migration: qemu file wrappers Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-06-16 14:59   ` Marc-André Lureau
  2022-06-15 14:51 ` [PATCH V8 04/39] memory: RAM_ANON flag Steve Sistare
                   ` (35 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Use qemu_fopen_file to simplify a few functions in savevm.c.
No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/savevm.c | 20 ++++++--------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index d907689..0b2c5cd 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2931,7 +2931,6 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
                                 Error **errp)
 {
     QEMUFile *f;
-    QIOChannelFile *ioc;
     int saved_vm_running;
     int ret;
 
@@ -2945,14 +2944,11 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
     vm_stop(RUN_STATE_SAVE_VM);
     global_state_store_running();
 
-    ioc = qio_channel_file_new_path(filename, O_WRONLY | O_CREAT | O_TRUNC,
-                                    0660, errp);
-    if (!ioc) {
+    f = qemu_fopen_file(filename, O_WRONLY | O_CREAT | O_TRUNC, 0660,
+                        "migration-xen-save-state", errp);
+    if (!f) {
         goto the_end;
     }
-    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-save-state");
-    f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
-    object_unref(OBJECT(ioc));
     ret = qemu_save_device_state(f);
     if (ret < 0 || qemu_fclose(f) < 0) {
         error_setg(errp, QERR_IO_ERROR);
@@ -2981,7 +2977,6 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
 void qmp_xen_load_devices_state(const char *filename, Error **errp)
 {
     QEMUFile *f;
-    QIOChannelFile *ioc;
     int ret;
 
     /* Guest must be paused before loading the device state; the RAM state
@@ -2993,14 +2988,11 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
     }
     vm_stop(RUN_STATE_RESTORE_VM);
 
-    ioc = qio_channel_file_new_path(filename, O_RDONLY | O_BINARY, 0, errp);
-    if (!ioc) {
+    f = qemu_fopen_file(filename, O_RDONLY | O_BINARY, 0,
+                        "migration-xen-load-state", errp);
+    if (!f) {
         return;
     }
-    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-load-state");
-    f = qemu_fopen_channel_input(QIO_CHANNEL(ioc));
-    object_unref(OBJECT(ioc));
-
     ret = qemu_loadvm_state(f);
     qemu_fclose(f);
     if (ret < 0) {
-- 
1.8.3.1




* [PATCH V8 04/39] memory: RAM_ANON flag
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (2 preceding siblings ...)
  2022-06-15 14:51 ` [PATCH V8 03/39] migration: simplify savevm Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-06-15 20:25   ` David Hildenbrand
  2022-06-15 14:51 ` [PATCH V8 05/39] vl: start on wakeup request Steve Sistare
                   ` (34 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

A memory-backend-ram or a memory-backend-memfd block with the RAM_SHARED
flag set is not migrated when migrate_ignore_shared() is true, but this
is wrong, because it has no named backing store, and its contents will be
lost.  Define a new flag RAM_ANON to distinguish this case.  Cpr will also
test this flag, for similar reasons.
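
The effect on the ignore test can be modelled in isolation.  This is a
stand-alone sketch, not the QEMU code: block_is_ignored is a hypothetical
stand-in for ramblock_is_ignored() that takes the flags and the ignore-shared
setting as plain arguments.  RAM_ANON uses the value defined in this patch;
the other two flag values are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RAM_SHARED      (1u << 1)   /* assumed value, for illustration */
#define RAM_MIGRATABLE  (1u << 4)   /* assumed value, for illustration */
#define RAM_ANON        (1u << 9)   /* value defined in this patch */

/*
 * A block is skipped if it is not migratable, or if shared blocks are
 * being ignored and this one is shared with a named backing store.
 * Anonymous shared blocks are still migrated, as nothing else holds them.
 */
static bool block_is_ignored(uint32_t flags, bool ignore_shared)
{
    bool migratable = flags & RAM_MIGRATABLE;
    bool shared = flags & RAM_SHARED;
    bool anon = flags & RAM_ANON;

    return !migratable || (ignore_shared && shared && !anon);
}
```

With ignore_shared set, a shared file-backed block is skipped, while a shared
memfd or ram backend block, which carries RAM_ANON, is still sent.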

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 backends/hostmem-epc.c   |  2 +-
 backends/hostmem-memfd.c |  1 +
 backends/hostmem-ram.c   |  1 +
 include/exec/memory.h    |  3 +++
 include/exec/ram_addr.h  |  1 +
 migration/ram.c          |  3 ++-
 softmmu/physmem.c        | 12 +++++++++---
 7 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/backends/hostmem-epc.c b/backends/hostmem-epc.c
index 037292d..cb06255 100644
--- a/backends/hostmem-epc.c
+++ b/backends/hostmem-epc.c
@@ -37,7 +37,7 @@ sgx_epc_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     }
 
     name = object_get_canonical_path(OBJECT(backend));
-    ram_flags = (backend->share ? RAM_SHARED : 0) | RAM_PROTECTED;
+    ram_flags = (backend->share ? RAM_SHARED : 0) | RAM_PROTECTED | RAM_ANON;
     memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
                                    name, backend->size, ram_flags,
                                    fd, 0, errp);
diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 3fc85c3..c9d8001 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
+    ram_flags |= RAM_ANON;
     memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
                                    backend->size, ram_flags, fd, 0, errp);
     g_free(name);
diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index b8e55cd..5e80149 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -30,6 +30,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
+    ram_flags |= RAM_ANON;
     memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
                                            backend->size, ram_flags, errp);
     g_free(name);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index f1c1945..0daddd7 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -203,6 +203,9 @@ typedef struct IOMMUTLBEvent {
 /* RAM that isn't accessible through normal means. */
 #define RAM_PROTECTED (1 << 8)
 
+/* RAM has no name outside the qemu process. */
+#define RAM_ANON (1 << 9)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index f3e0c78..56188b8 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -94,6 +94,7 @@ static inline unsigned long int ramblock_recv_bitmap_offset(void *host_addr,
 }
 
 bool ramblock_is_pmem(RAMBlock *rb);
+bool ramblock_is_anon(RAMBlock *rb);
 
 long qemu_minrampagesize(void);
 long qemu_maxrampagesize(void);
diff --git a/migration/ram.c b/migration/ram.c
index 5f5e37f..5cdb93d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -164,7 +164,8 @@ out:
 bool ramblock_is_ignored(RAMBlock *block)
 {
     return !qemu_ram_is_migratable(block) ||
-           (migrate_ignore_shared() && qemu_ram_is_shared(block));
+           (migrate_ignore_shared() && qemu_ram_is_shared(block) &&
+            !ramblock_is_anon(block));
 }
 
 #undef RAMBLOCK_FOREACH
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 657841e..0f1ce28 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1975,6 +1975,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
     new_block->offset = find_ram_offset(new_block->max_length);
 
     if (!new_block->host) {
+        new_block->flags |= RAM_ANON;
         if (xen_enabled()) {
             xen_ram_alloc(new_block->offset, new_block->max_length,
                           new_block->mr, &err);
@@ -2059,7 +2060,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
 
     /* Just support these ram flags by now. */
     assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
-                          RAM_PROTECTED)) == 0);
+                          RAM_PROTECTED | RAM_ANON)) == 0);
 
     if (xen_enabled()) {
         error_setg(errp, "-mem-path not supported with Xen");
@@ -2151,7 +2152,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     Error *local_err = NULL;
 
     assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC |
-                          RAM_NORESERVE)) == 0);
+                          RAM_NORESERVE | RAM_ANON)) == 0);
     assert(!host ^ (ram_flags & RAM_PREALLOC));
 
     size = HOST_PAGE_ALIGN(size);
@@ -2185,7 +2186,7 @@ RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
 RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                          MemoryRegion *mr, Error **errp)
 {
-    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE | RAM_ANON)) == 0);
     return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
 }
 
@@ -3664,6 +3665,11 @@ bool ramblock_is_pmem(RAMBlock *rb)
     return rb->flags & RAM_PMEM;
 }
 
+bool ramblock_is_anon(RAMBlock *rb)
+{
+    return rb->flags & RAM_ANON;
+}
+
 static void mtree_print_phys_entries(int start, int end, int skip, int ptr)
 {
     if (start == end - 1) {
-- 
1.8.3.1




* [PATCH V8 05/39] vl: start on wakeup request
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (3 preceding siblings ...)
  2022-06-15 14:51 ` [PATCH V8 04/39] memory: RAM_ANON flag Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-06-16 15:55   ` Marc-André Lureau
  2022-06-15 14:51 ` [PATCH V8 06/39] cpr: reboot mode Steve Sistare
                   ` (33 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC
  To: qemu-devel

If qemu starts and loads a VM in the suspended state, then a later wakeup
request sets the state to running, which is not sufficient to initialize
the VM, because vm_start() was never called during this invocation of qemu.
See qemu_system_wakeup_request().

Define qemu_system_start_on_wakeup_request(), which sets the
start_on_wakeup_requested flag so that vm_start() is called when the next
wakeup request is processed.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
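For illustration only, and not part of this patch: a minimal standalone model
of the one-shot flag described above.  The model_* names and the two-value
state enum are hypothetical stand-ins for vm_start(), runstate_set() and
RunState; only the branch structure follows qemu_system_wakeup_request().

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for QEMU's RunState; hypothetical, two values only. */
typedef enum { STATE_SUSPENDED, STATE_RUNNING } ModelState;

static ModelState model_state = STATE_SUSPENDED;
static int model_vm_start_calls;        /* how often the full start ran */
static bool start_on_wakeup_requested;  /* one-shot flag, as in the patch */

/* Stand-in for vm_start(): does the full start-up work, then runs. */
void model_vm_start(void)
{
    model_vm_start_calls++;
    model_state = STATE_RUNNING;
}

/* Mirrors qemu_system_start_on_wakeup_request(): arm the flag. */
void model_start_on_wakeup_request(void)
{
    start_on_wakeup_requested = true;
}

/* Mirrors the branch added to qemu_system_wakeup_request(). */
void model_wakeup_request(void)
{
    if (start_on_wakeup_requested) {
        start_on_wakeup_requested = false;  /* consume the request */
        model_vm_start();
    } else {
        model_state = STATE_RUNNING;        /* state change only */
    }
}

/* Suspend again so that a second wakeup can be observed. */
void model_suspend(void)
{
    assert(model_state == STATE_RUNNING);
    model_state = STATE_SUSPENDED;
}
```

In this model, arming the flag and then waking runs the start routine once.
A later suspend and wakeup only flips the state, because the flag was
consumed by the first wakeup.
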
 include/sysemu/runstate.h |  1 +
 softmmu/runstate.c        | 16 +++++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index f3ed525..16c1c41 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -57,6 +57,7 @@ void qemu_system_reset_request(ShutdownCause reason);
 void qemu_system_suspend_request(void);
 void qemu_register_suspend_notifier(Notifier *notifier);
 bool qemu_wakeup_suspend_enabled(void);
+void qemu_system_start_on_wakeup_request(void);
 void qemu_system_wakeup_request(WakeupReason reason, Error **errp);
 void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
 void qemu_register_wakeup_notifier(Notifier *notifier);
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index fac7b63..9b27d74 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -115,6 +115,7 @@ static const RunStateTransition runstate_transitions_def[] = {
     { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING },
     { RUN_STATE_PRELAUNCH, RUN_STATE_FINISH_MIGRATE },
     { RUN_STATE_PRELAUNCH, RUN_STATE_INMIGRATE },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_SUSPENDED },
 
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PAUSED },
@@ -335,6 +336,7 @@ void vm_state_notify(bool running, RunState state)
     }
 }
 
+static bool start_on_wakeup_requested;
 static ShutdownCause reset_requested;
 static ShutdownCause shutdown_requested;
 static int shutdown_signal;
@@ -562,6 +564,11 @@ void qemu_register_suspend_notifier(Notifier *notifier)
     notifier_list_add(&suspend_notifiers, notifier);
 }
 
+void qemu_system_start_on_wakeup_request(void)
+{
+    start_on_wakeup_requested = true;
+}
+
 void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
 {
     trace_system_wakeup_request(reason);
@@ -574,7 +581,14 @@ void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
     if (!(wakeup_reason_mask & (1 << reason))) {
         return;
     }
-    runstate_set(RUN_STATE_RUNNING);
+
+    if (start_on_wakeup_requested) {
+        start_on_wakeup_requested = false;
+        vm_start();
+    } else {
+        runstate_set(RUN_STATE_RUNNING);
+    }
+
     wakeup_reason = reason;
     qemu_notify_event();
 }
-- 
1.8.3.1




* [PATCH V8 06/39] cpr: reboot mode
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (4 preceding siblings ...)
  2022-06-15 14:51 ` [PATCH V8 05/39] vl: start on wakeup request Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-06-16 11:10   ` Daniel P. Berrangé
  2022-06-15 14:51 ` [PATCH V8 07/39] cpr: reboot HMP interfaces Steve Sistare
                   ` (32 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC (permalink / raw)
  To: qemu-devel

Provide the cpr-save and cpr-load functions for live update.  These save and
restore VM state, with minimal guest pause time, so that qemu may be updated
to a new version in between.

cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
any type of guest image and block device, but the caller must not modify
guest block devices between cpr-save and cpr-load.

cpr-save supports several modes, the first of which is reboot.  In this mode
the caller invokes cpr-save and then terminates qemu.  The caller may then
update the host kernel and system software and reboot.  The caller resumes
the guest by running qemu with the same arguments as the original process,
plus -S so that new qemu starts in a paused state, and invoking cpr-load.
To use this mode, guest ram must be mapped to a persistent shared memory
file such as /dev/dax0.0 or /dev/shm PKRAM.

The reboot mode supports vfio devices if the caller first suspends the
guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
guest drivers' suspend methods flush outstanding requests and re-initialize
the devices, and thus there is no device state to save and restore.

cpr-load loads state from the file.  If the VM was running at cpr-save time
then VM execution resumes.  If the VM was suspended at cpr-save time, then
the caller must issue a system_wakeup command to resume.

cpr-save syntax:
  { 'enum': 'CprMode', 'data': [ 'none', 'reboot' ] }
  { 'command': 'cpr-save', 'data': { 'filename': 'str', 'mode': 'CprMode' }}

cpr-load syntax:
  { 'command': 'cpr-load', 'data': { 'filename': 'str', 'mode': 'CprMode' }}

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
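For illustration only, and not part of this patch: a standalone sketch of the
two decisions cpr-load makes, namely rejecting a file that does not begin
with the vmstate magic and version, and choosing how to resume based on the
saved run state.  The magic and version values are assumed to match QEMU's
savevm.h at the time of this series; the LoadAction and SavedState enums and
the model_* functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match QEMU_VM_FILE_MAGIC and QEMU_VM_FILE_VERSION. */
#define MODEL_VM_FILE_MAGIC   0x5145564dU   /* "QEVM" */
#define MODEL_VM_FILE_VERSION 0x00000003U

/* Hypothetical outcome codes, used by this sketch only. */
typedef enum {
    LOAD_BAD_HEADER,       /* not a vmstate file */
    LOAD_START_NOW,        /* saved while running: call vm_start() */
    LOAD_START_ON_WAKEUP,  /* saved while suspended: wait for system_wakeup */
    LOAD_RESTORE_STATE     /* any other saved state: restore it as is */
} LoadAction;

typedef enum { SAVED_RUNNING, SAVED_SUSPENDED, SAVED_PAUSED } SavedState;

/* Read a big-endian 32-bit value, as qemu_get_be32() does from a QEMUFile. */
uint32_t model_get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Validate the 8-byte header, then pick the resume action. */
LoadAction model_cpr_load(const uint8_t *buf, size_t len, SavedState saved)
{
    assert(buf != NULL);

    if (len < 8 ||
        model_get_be32(buf) != MODEL_VM_FILE_MAGIC ||
        model_get_be32(buf + 4) != MODEL_VM_FILE_VERSION) {
        return LOAD_BAD_HEADER;
    }

    switch (saved) {
    case SAVED_RUNNING:
        return LOAD_START_NOW;
    case SAVED_SUSPENDED:
        return LOAD_START_ON_WAKEUP;
    default:
        return LOAD_RESTORE_STATE;
    }
}
```

The suspended case corresponds to the call to
qemu_system_start_on_wakeup_request() in qmp_cpr_load(): the guest stays
suspended until the caller issues system_wakeup.
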
 MAINTAINERS             |   8 ++++
 include/migration/cpr.h |  16 +++++++
 migration/cpr.c         | 116 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build   |   1 +
 qapi/cpr.json           |  62 ++++++++++++++++++++++++++
 qapi/meson.build        |   1 +
 qapi/qapi-schema.json   |   1 +
 softmmu/runstate.c      |   1 +
 8 files changed, 206 insertions(+)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr.c
 create mode 100644 qapi/cpr.json

diff --git a/MAINTAINERS b/MAINTAINERS
index 4cf6174..9273891 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3152,6 +3152,14 @@ F: net/filter-rewriter.c
 F: net/filter-mirror.c
 F: tests/qtest/test-filter*
 
+CPR
+M: Steve Sistare <steven.sistare@oracle.com>
+M: Mark Kanda <mark.kanda@oracle.com>
+S: Maintained
+F: include/migration/cpr.h
+F: migration/cpr.c
+F: qapi/cpr.json
+
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
 R: Paolo Bonzini <pbonzini@redhat.com>
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
new file mode 100644
index 0000000..1b6c82f
--- /dev/null
+++ b/include/migration/cpr.h
@@ -0,0 +1,16 @@
+/*
+ * Copyright (c) 2021, 2022 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef MIGRATION_CPR_H
+#define MIGRATION_CPR_H
+
+#include "qapi/qapi-types-cpr.h"
+
+void cpr_set_mode(CprMode mode);
+CprMode cpr_get_mode(void);
+
+#endif
diff --git a/migration/cpr.c b/migration/cpr.c
new file mode 100644
index 0000000..24b0bcc
--- /dev/null
+++ b/migration/cpr.c
@@ -0,0 +1,116 @@
+/*
+ * Copyright (c) 2021, 2022 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "migration/cpr.h"
+#include "migration/global_state.h"
+#include "qapi/error.h"
+#include "qapi/qapi-commands-cpr.h"
+#include "qemu-file-channel.h"
+#include "qemu-file.h"
+#include "savevm.h"
+#include "sysemu/cpu-timers.h"
+#include "sysemu/runstate.h"
+#include "sysemu/sysemu.h"
+
+static CprMode cpr_mode = CPR_MODE_NONE;
+
+CprMode cpr_get_mode(void)
+{
+    return cpr_mode;
+}
+
+void cpr_set_mode(CprMode mode)
+{
+    cpr_mode = mode;
+}
+
+void qmp_cpr_save(const char *filename, CprMode mode, Error **errp)
+{
+    int ret;
+    QEMUFile *f;
+    int saved_vm_running = runstate_is_running();
+
+    if (global_state_store()) {
+        error_setg(errp, "Error saving global state");
+        return;
+    }
+
+    f = qemu_fopen_file(filename, O_CREAT | O_WRONLY | O_TRUNC, 0600,
+                        "cpr-save", errp);
+    if (!f) {
+        return;
+    }
+
+    if (runstate_check(RUN_STATE_SUSPENDED)) {
+        /* Update timers_state before saving.  Suspend did not do so. */
+        cpu_disable_ticks();
+    }
+    vm_stop(RUN_STATE_SAVE_VM);
+
+    cpr_set_mode(mode);
+    ret = qemu_save_device_state(f);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_setg(errp, "Error %d while saving VM state", ret);
+        goto err;
+    }
+
+    return;
+
+err:
+    if (saved_vm_running) {
+        vm_start();
+    }
+    cpr_set_mode(CPR_MODE_NONE);
+}
+
+void qmp_cpr_load(const char *filename, CprMode mode, Error **errp)
+{
+    QEMUFile *f;
+    int ret;
+    RunState state;
+
+    if (runstate_is_running()) {
+        error_setg(errp, "cpr-load called for a running VM");
+        return;
+    }
+
+    f = qemu_fopen_file(filename, O_RDONLY, 0, "cpr-load", errp);
+    if (!f) {
+        return;
+    }
+
+    if (qemu_get_be32(f) != QEMU_VM_FILE_MAGIC ||
+        qemu_get_be32(f) != QEMU_VM_FILE_VERSION) {
+        error_setg(errp, "%s is not a vmstate file", filename);
+        qemu_fclose(f);
+        return;
+    }
+
+    cpr_set_mode(mode);
+    ret = qemu_load_device_state(f);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_setg(errp, "Error %d while loading VM state", ret);
+        goto out;
+    }
+
+    state = global_state_get_runstate();
+    if (state == RUN_STATE_RUNNING) {
+        vm_start();
+    } else {
+        runstate_set(state);
+        if (runstate_check(RUN_STATE_SUSPENDED)) {
+            /* Force vm_start to be called later. */
+            qemu_system_start_on_wakeup_request();
+        }
+    }
+
+out:
+    cpr_set_mode(CPR_MODE_NONE);
+}
diff --git a/migration/meson.build b/migration/meson.build
index 6880b61..76fcfdb 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -15,6 +15,7 @@ softmmu_ss.add(files(
   'channel.c',
   'colo-failover.c',
   'colo.c',
+  'cpr.c',
   'exec.c',
   'fd.c',
   'global_state.c',
diff --git a/qapi/cpr.json b/qapi/cpr.json
new file mode 100644
index 0000000..bdaabcb
--- /dev/null
+++ b/qapi/cpr.json
@@ -0,0 +1,62 @@
+# -*- Mode: Python -*-
+#
+# Copyright (c) 2021, 2022 Oracle and/or its affiliates.
+#
+# This work is licensed under the terms of the GNU GPL, version 2.
+# See the COPYING file in the top-level directory.
+
+##
+# = CPR - CheckPoint and Restart
+##
+
+{ 'include': 'common.json' }
+
+##
+# @CprMode:
+#
+# @reboot: checkpoint can be cpr-load'ed after a host reboot.
+#
+# Since: 7.1
+##
+{ 'enum': 'CprMode',
+  'data': [ 'none', 'reboot' ] }
+
+##
+# @cpr-save:
+#
+# Pause the VCPUs, and create a checkpoint of the virtual machine device state
+# in @filename.  Unlike snapshot-save, this command completes synchronously,
+# saves state to an ordinary file, does not save guest block device blocks,
+# and does not require that guest RAM be saved in the file.  The caller must
+# not modify guest block devices between cpr-save and cpr-load.
+#
+# If @mode is 'reboot', the checkpoint remains valid after a host reboot.
+# The guest RAM memory-backend should be shared and non-volatile across
+# reboot, else it will be saved to the file.  To resume from the checkpoint,
+# issue the quit command, reboot the system, start qemu using the same
+# arguments plus -S, and issue the cpr-load command.
+#
+# @filename: name of checkpoint file
+# @mode: @CprMode mode
+#
+# Since: 7.1
+##
+{ 'command': 'cpr-save',
+  'data': { 'filename': 'str',
+            'mode': 'CprMode' } }
+
+##
+# @cpr-load:
+#
+# Load a virtual machine from the checkpoint file @filename that was created
+# earlier by the cpr-save command, and continue the VCPUs.  @mode must match
+# the mode specified for cpr-save.
+#
+# @filename: name of checkpoint file
+# @mode: @CprMode mode
+#
+# Since: 7.1
+##
+{ 'command': 'cpr-load',
+  'data': { 'filename': 'str',
+            'mode': 'CprMode' } }
diff --git a/qapi/meson.build b/qapi/meson.build
index 656ef0e..d9ab29d 100644
--- a/qapi/meson.build
+++ b/qapi/meson.build
@@ -30,6 +30,7 @@ qapi_all_modules = [
   'common',
   'compat',
   'control',
+  'cpr',
   'crypto',
   'dump',
   'error',
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index 4912b97..001d790 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -77,6 +77,7 @@
 { 'include': 'ui.json' }
 { 'include': 'authz.json' }
 { 'include': 'migration.json' }
+{ 'include': 'cpr.json' }
 { 'include': 'transaction.json' }
 { 'include': 'trace.json' }
 { 'include': 'compat.json' }
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 9b27d74..cfd6aa9 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -116,6 +116,7 @@ static const RunStateTransition runstate_transitions_def[] = {
     { RUN_STATE_PRELAUNCH, RUN_STATE_FINISH_MIGRATE },
     { RUN_STATE_PRELAUNCH, RUN_STATE_INMIGRATE },
     { RUN_STATE_PRELAUNCH, RUN_STATE_SUSPENDED },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_PAUSED },
 
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PAUSED },
-- 
1.8.3.1




* [PATCH V8 07/39] cpr: reboot HMP interfaces
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (5 preceding siblings ...)
  2022-06-15 14:51 ` [PATCH V8 06/39] cpr: reboot mode Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-06-15 14:51 ` [PATCH V8 08/39] cpr: blockers Steve Sistare
                   ` (31 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC (permalink / raw)
  To: qemu-devel

cpr-save <filename> <mode>
  Call qmp_cpr_save().
  Arguments:
    filename: save vmstate to filename
    mode: must be "reboot"

cpr-load <filename> <mode>
  Call qmp_cpr_load().
  Arguments:
    filename: load vmstate from filename
    mode: must be "reboot"

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
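For illustration only, and not part of this patch: both HMP handlers parse
the mode string with a default of -1 and call the QMP command only when the
parse yields a valid mode.  The sketch below models that pattern with a
hypothetical model_enum_parse(); it is not qapi_enum_parse() itself, and its
behaviour for a NULL string (return the default without an error) is an
assumption about how the real helper behaves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Names in the order of the CprMode enum in qapi/cpr.json. */
static const char *const mode_names[] = { "none", "reboot" };
#define N_MODES (sizeof(mode_names) / sizeof(mode_names[0]))

/*
 * Hypothetical parser: return the index of str in mode_names, or def if
 * str is NULL or unknown.  *errp is set only for an unknown name.
 */
int model_enum_parse(const char *str, int def, const char **errp)
{
    size_t i;

    assert(errp != NULL);
    *errp = NULL;

    if (!str) {
        return def;
    }
    for (i = 0; i < N_MODES; i++) {
        if (strcmp(str, mode_names[i]) == 0) {
            return (int)i;
        }
    }
    *errp = "invalid mode";
    return def;
}
```

A handler built on this would skip the save or load when the result is -1
and report the error string, if any, to the monitor.
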
 hmp-commands.hx       | 39 +++++++++++++++++++++++++++++++++++++++
 include/monitor/hmp.h |  2 ++
 monitor/hmp-cmds.c    | 27 +++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 564f1de..9d9f984 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -355,6 +355,45 @@ SRST
 ERST
 
     {
+        .name       = "cpr-save",
+        .args_type  = "filename:s,mode:s",
+        .params     = "filename 'reboot'",
+        .help       = "create a checkpoint of the VM in file",
+        .cmd        = hmp_cpr_save,
+    },
+
+SRST
+``cpr-save`` *filename* *mode*
+  Pause the VCPUs, and create a checkpoint of the virtual machine device state
+  in *filename*.  Unlike snapshot-save, this command completes synchronously,
+  saves state to an ordinary file, does not save guest block device blocks,
+  and does not require that guest RAM be saved in the file.  The caller must
+  not modify guest block devices between cpr-save and cpr-load.
+
+  If *mode* is 'reboot', the checkpoint remains valid after a host reboot.
+  The guest RAM memory-backend should be shared and non-volatile across
+  reboot, else it will be saved to the file.  To resume from the checkpoint,
+  issue the quit command, reboot the system, start qemu using the same
+  arguments plus -S, and issue the cpr-load command.
+ERST
+
+    {
+        .name       = "cpr-load",
+        .args_type  = "filename:s,mode:s",
+        .params     = "filename 'reboot'",
+
+        .help       = "load VM checkpoint from file",
+        .cmd        = hmp_cpr_load,
+    },
+
+SRST
+``cpr-load`` *filename* *mode*
+  Load a virtual machine from the checkpoint file *filename* that was created
+  earlier by the cpr-save command, and continue the VCPUs.  *mode* must match
+  the mode specified for cpr-save.
+ERST
+
+    {
         .name       = "delvm",
         .args_type  = "name:s",
         .params     = "tag",
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 96d0148..b44588e 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -59,6 +59,8 @@ void hmp_balloon(Monitor *mon, const QDict *qdict);
 void hmp_loadvm(Monitor *mon, const QDict *qdict);
 void hmp_savevm(Monitor *mon, const QDict *qdict);
 void hmp_delvm(Monitor *mon, const QDict *qdict);
+void hmp_cpr_save(Monitor *mon, const QDict *qdict);
+void hmp_cpr_load(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
 void hmp_migrate_incoming(Monitor *mon, const QDict *qdict);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 622c783..bb12589 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -33,6 +33,7 @@
 #include "qapi/qapi-commands-block.h"
 #include "qapi/qapi-commands-char.h"
 #include "qapi/qapi-commands-control.h"
+#include "qapi/qapi-commands-cpr.h"
 #include "qapi/qapi-commands-machine.h"
 #include "qapi/qapi-commands-migration.h"
 #include "qapi/qapi-commands-misc.h"
@@ -1122,6 +1123,32 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
     qapi_free_AnnounceParameters(params);
 }
 
+void hmp_cpr_save(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    const char *filename = qdict_get_try_str(qdict, "filename");
+    const char *str = qdict_get_try_str(qdict, "mode");
+    CprMode mode = qapi_enum_parse(&CprMode_lookup, str, -1, &err);
+
+    if (mode != -1) {
+        qmp_cpr_save(filename, mode, &err);
+    }
+    hmp_handle_error(mon, err);
+}
+
+void hmp_cpr_load(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    const char *filename = qdict_get_try_str(qdict, "filename");
+    const char *str = qdict_get_try_str(qdict, "mode");
+    CprMode mode = qapi_enum_parse(&CprMode_lookup, str, -1, &err);
+
+    if (mode != -1) {
+        qmp_cpr_load(filename, mode, &err);
+    }
+    hmp_handle_error(mon, err);
+}
+
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict)
 {
     qmp_migrate_cancel(NULL);
-- 
1.8.3.1




* [PATCH V8 08/39] cpr: blockers
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (6 preceding siblings ...)
  2022-06-15 14:51 ` [PATCH V8 07/39] cpr: reboot HMP interfaces Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-06-15 14:51 ` [PATCH V8 09/39] cpr: register blockers Steve Sistare
                   ` (30 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC (permalink / raw)
  To: qemu-devel

Add an interface to register a blocker for cpr-save for one or more modes.
Devices and options that do not support a cpr mode can register a blocker,
and cpr-save will fail with a descriptive error message.  Conversely, if
such a device is deleted and un-registers its blocker, cpr will be allowed
again.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
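For illustration only, and not part of this patch: a standalone model of
per-mode blocker lists.  It keeps one list per mode, builds a bitmask from a
mode list terminated by "none" or "all", and lets a variadic wrapper
delegate through a va_list-taking helper, which is the usual C idiom when
one variadic function forwards to another.  All names, the fixed-size
arrays (QEMU uses GSList), and the plain string reasons (QEMU uses Error
objects) are hypothetical simplifications.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Stand-in for the generated CprMode enum; MODE_MAX doubles as "all". */
enum { MODE_NONE, MODE_REBOOT, MODE_MAX, MODE_ALL = MODE_MAX };

#define MAX_BLOCKERS 8

/* One list of reasons per mode. */
static const char *blockers[MODE_MAX][MAX_BLOCKERS];

/* Core helper: takes the remaining modes as a va_list. */
int model_add_blocker_v(const char *reason, int mode, va_list ap)
{
    unsigned modes = 0;
    int i;

    while (mode != MODE_NONE && mode != MODE_ALL) {
        assert(mode > MODE_NONE && mode < MODE_MAX);
        modes |= 1u << mode;
        mode = va_arg(ap, int);      /* enum constants are passed as int */
    }
    if (mode == MODE_ALL) {
        modes = (1u << MODE_MAX) - 1;
    }

    for (mode = 0; mode < MODE_MAX; mode++) {
        if (!(modes & (1u << mode))) {
            continue;
        }
        for (i = 0; i < MAX_BLOCKERS && blockers[mode][i]; i++) {
            /* skip used slots */
        }
        if (i == MAX_BLOCKERS) {
            return -1;               /* list full; a limit of this sketch */
        }
        blockers[mode][i] = reason;
    }
    return 0;
}

/* Variadic front end: terminate the list with MODE_NONE or MODE_ALL. */
int model_add_blocker(const char *reason, int mode, ...)
{
    va_list ap;
    int ret;

    va_start(ap, mode);
    ret = model_add_blocker_v(reason, mode, ap);
    va_end(ap);
    return ret;
}

/* Remove the reason from every mode it was registered for. */
void model_del_blocker(const char *reason)
{
    int mode, i;

    for (mode = 0; mode < MODE_MAX; mode++) {
        for (i = 0; i < MAX_BLOCKERS; i++) {
            if (blockers[mode][i] == reason) {
                blockers[mode][i] = NULL;
            }
        }
    }
}

/* Return one blocking reason for the mode, or NULL if it is not blocked. */
const char *model_is_blocked(int mode)
{
    int i;

    for (i = 0; i < MAX_BLOCKERS; i++) {
        if (blockers[mode][i]) {
            return blockers[mode][i];
        }
    }
    return NULL;
}
```

A device that cannot be checkpointed in reboot mode would call
model_add_blocker(reason, MODE_REBOOT, MODE_NONE) when it is created and
model_del_blocker(reason) when it is removed.
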
 MAINTAINERS             |  1 +
 include/migration/cpr.h |  6 ++++
 migration/cpr.c         | 79 +++++++++++++++++++++++++++++++++++++++++++++++++
 stubs/cpr.c             | 23 ++++++++++++++
 stubs/meson.build       |  1 +
 5 files changed, 110 insertions(+)
 create mode 100644 stubs/cpr.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 9273891..1e4e72f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3159,6 +3159,7 @@ S: Maintained
 F: include/migration/cpr.h
 F: migration/cpr.c
 F: qapi/cpr.json
+F: stubs/cpr.c
 
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 1b6c82f..dfe5a1d 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -13,4 +13,10 @@
 void cpr_set_mode(CprMode mode);
 CprMode cpr_get_mode(void);
 
+#define CPR_MODE_ALL CPR_MODE__MAX
+
+int cpr_add_blocker(Error **reasonp, Error **errp, CprMode mode, ...);
+int cpr_add_blocker_str(const char *reason, Error **errp, CprMode mode, ...);
+void cpr_del_blocker(Error **reasonp);
+
 #endif
diff --git a/migration/cpr.c b/migration/cpr.c
index 24b0bcc..c1da784 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -29,12 +29,91 @@ void cpr_set_mode(CprMode mode)
     cpr_mode = mode;
 }
 
+static GSList *cpr_blockers[CPR_MODE__MAX];
+
+/*
+ * Add blocker for each mode in varargs list, or for all modes if CPR_MODE_ALL
+ * is specified.  Caller terminates the list with 0 or CPR_MODE_ALL.  This
+ * function takes ownership of *reasonp, and frees it on error, or in
+ * cpr_del_blocker.  errp is set in a later patch.
+ */
+int cpr_add_blocker(Error **reasonp, Error **errp, CprMode mode, ...)
+{
+    int modes = 0;
+    va_list ap;
+    ERRP_GUARD();
+
+    va_start(ap, mode);
+    while (mode != CPR_MODE_NONE && mode != CPR_MODE_ALL) {
+        assert(mode > CPR_MODE_NONE && mode < CPR_MODE__MAX);
+        modes |= BIT(mode);
+        mode = va_arg(ap, CprMode);
+    }
+    va_end(ap);
+    if (mode == CPR_MODE_ALL) {
+        modes = BIT(CPR_MODE__MAX) - 1;
+    }
+
+    for (mode = 0; mode < CPR_MODE__MAX; mode++) {
+        if (modes & BIT(mode)) {
+            cpr_blockers[mode] = g_slist_prepend(cpr_blockers[mode], *reasonp);
+        }
+    }
+    return 0;
+}
+
+/*
+ * Delete the blocker from all modes it is associated with.
+ */
+void cpr_del_blocker(Error **reasonp)
+{
+    CprMode mode;
+
+    if (*reasonp) {
+        for (mode = 0; mode < CPR_MODE__MAX; mode++) {
+            cpr_blockers[mode] = g_slist_remove(cpr_blockers[mode], *reasonp);
+        }
+        error_free(*reasonp);
+        *reasonp = NULL;
+    }
+}
+
+/*
+ * Add a blocker which will not be deleted.  Simpler for some callers.
+ */
+int cpr_add_blocker_str(const char *msg, Error **errp, CprMode mode, ...)
+{
+    int ret;
+    va_list ap;
+    Error *reason = NULL;
+
+    error_setg(&reason, "%s", msg);
+    va_start(ap, mode);
+    ret = cpr_add_blocker(&reason, errp, mode, ap);
+    va_end(ap);
+    return ret;
+}
+
+static bool cpr_is_blocked(Error **errp, CprMode mode)
+{
+    if (cpr_blockers[mode]) {
+        error_propagate(errp, error_copy(cpr_blockers[mode]->data));
+        return true;
+    }
+
+    return false;
+}
+
 void qmp_cpr_save(const char *filename, CprMode mode, Error **errp)
 {
     int ret;
     QEMUFile *f;
     int saved_vm_running = runstate_is_running();
 
+    if (cpr_is_blocked(errp, mode)) {
+        return;
+    }
+
     if (global_state_store()) {
         error_setg(errp, "Error saving global state");
         return;
diff --git a/stubs/cpr.c b/stubs/cpr.c
new file mode 100644
index 0000000..06a9a1c
--- /dev/null
+++ b/stubs/cpr.c
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) 2022 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "migration/cpr.h"
+
+int cpr_add_blocker(Error **reasonp, Error **errp, CprMode mode, ...)
+{
+    return 0;
+}
+
+int cpr_add_blocker_str(const char *reason, Error **errp, CprMode mode, ...)
+{
+    return 0;
+}
+
+void cpr_del_blocker(Error **reasonp)
+{
+}
diff --git a/stubs/meson.build b/stubs/meson.build
index 6f80fec..0d7565b 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -4,6 +4,7 @@ stub_ss.add(files('blk-exp-close-all.c'))
 stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
 stub_ss.add(files('change-state-handler.c'))
 stub_ss.add(files('cmos.c'))
+stub_ss.add(files('cpr.c'))
 stub_ss.add(files('cpu-get-clock.c'))
 stub_ss.add(files('cpus-get-virtual-clock.c'))
 stub_ss.add(files('qemu-timer-notify-cb.c'))
-- 
1.8.3.1




* [PATCH V8 09/39] cpr: register blockers
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (7 preceding siblings ...)
  2022-06-15 14:51 ` [PATCH V8 08/39] cpr: blockers Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-06-15 14:51 ` [PATCH V8 10/39] cpr: cpr-enable option Steve Sistare
                   ` (29 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC (permalink / raw)
  To: qemu-devel

Register the known cpr blockers.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 accel/xen/xen-all.c    | 3 +++
 backends/hostmem-epc.c | 6 ++++++
 migration/migration.c  | 6 ++++++
 replay/replay.c        | 4 ++++
 4 files changed, 19 insertions(+)

diff --git a/accel/xen/xen-all.c b/accel/xen/xen-all.c
index 69aa7d0..9dd0dc6 100644
--- a/accel/xen/xen-all.c
+++ b/accel/xen/xen-all.c
@@ -21,6 +21,7 @@
 #include "sysemu/runstate.h"
 #include "migration/misc.h"
 #include "migration/global_state.h"
+#include "migration/cpr.h"
 #include "hw/boards.h"
 
 //#define DEBUG_XEN
@@ -181,6 +182,8 @@ static int xen_init(MachineState *ms)
      * opt out of system RAM being allocated by generic code
      */
     mc->default_ram_id = NULL;
+
+    cpr_add_blocker_str("xen does not support cpr", &error_fatal, CPR_MODE_ALL);
     return 0;
 }
 
diff --git a/backends/hostmem-epc.c b/backends/hostmem-epc.c
index cb06255..094fed9 100644
--- a/backends/hostmem-epc.c
+++ b/backends/hostmem-epc.c
@@ -16,6 +16,7 @@
 #include "qapi/error.h"
 #include "sysemu/hostmem.h"
 #include "hw/i386/hostmem-epc.h"
+#include "migration/cpr.h"
 
 static void
 sgx_epc_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
@@ -23,6 +24,7 @@ sgx_epc_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     uint32_t ram_flags;
     char *name;
     int fd;
+    Error *blocker = NULL;
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
@@ -41,6 +43,10 @@ sgx_epc_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
                                    name, backend->size, ram_flags,
                                    fd, 0, errp);
+
+    error_setg(&blocker, "RAM_PROTECTED block %s does not support cpr", name);
+    cpr_add_blocker(&blocker, errp, CPR_MODE_ALL);
+
     g_free(name);
 }
 
diff --git a/migration/migration.c b/migration/migration.c
index 31739b2..1451bae 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -32,6 +32,7 @@
 #include "savevm.h"
 #include "qemu-file-channel.h"
 #include "qemu-file.h"
+#include "migration/cpr.h"
 #include "migration/vmstate.h"
 #include "block/block.h"
 #include "qapi/error.h"
@@ -1283,6 +1284,11 @@ static bool migrate_caps_check(bool *cap_list,
         return false;
     }
 
+    if (cap_list[MIGRATION_CAPABILITY_X_COLO]) {
+        return cpr_add_blocker_str("x-colo is not compatible with cpr",
+                                   errp, CPR_MODE_ALL);
+    }
+
     return true;
 }
 
diff --git a/replay/replay.c b/replay/replay.c
index 4c396bb..eb5456f 100644
--- a/replay/replay.c
+++ b/replay/replay.c
@@ -19,6 +19,7 @@
 #include "qemu/option.h"
 #include "sysemu/cpus.h"
 #include "qemu/error-report.h"
+#include "migration/cpr.h"
 
 /* Current version of the replay mechanism.
    Increase it when file format changes. */
@@ -232,6 +233,9 @@ static void replay_enable(const char *fname, int mode)
     const char *fmode = NULL;
     assert(!replay_file);
 
+    cpr_add_blocker_str("replay is not compatible with cpr",
+                        &error_fatal, CPR_MODE_ALL);
+
     switch (mode) {
     case REPLAY_MODE_RECORD:
         fmode = "wb";
-- 
1.8.3.1




* [PATCH V8 10/39] cpr: cpr-enable option
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (8 preceding siblings ...)
  2022-06-15 14:51 ` [PATCH V8 09/39] cpr: register blockers Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-06-15 14:51 ` [PATCH V8 11/39] cpr: save ram blocks Steve Sistare
                   ` (28 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC (permalink / raw)
  To: qemu-devel

Add the '-cpr-enable <mode>' command-line option as a prerequisite for
using cpr-save and cpr-load for the mode.  Multiple -cpr-enable options
may be specified, one per mode.

Requiring -cpr-enable allows qemu to initialize objects differently, if
necessary, so that cpr-save is not blocked.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
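For illustration only, and not part of this patch: a standalone model of the
enabled-modes bitmask and of the check made at the top of cpr-save and
cpr-load.  The model_* names are hypothetical, and the sketch assumes that
each -cpr-enable option contributes one bit to the mask passed to the init
function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the generated CprMode enum. */
enum { MODE_NONE, MODE_REBOOT, MODE_MAX };

#define MODE_BIT(m) (1u << (m))

static unsigned enabled_modes;

/* Mirrors cpr_init(): record the modes named on the command line. */
void model_cpr_init(unsigned modes)
{
    enabled_modes = modes;
}

/* Mirrors cpr_enabled(): true if the mode's bit is set. */
bool model_cpr_enabled(int mode)
{
    assert(mode >= 0 && mode < MODE_MAX);
    return (enabled_modes & MODE_BIT(mode)) != 0;
}

/* Hypothetical gate: return an error message, or NULL if allowed. */
const char *model_cpr_check(int mode)
{
    if (!model_cpr_enabled(mode)) {
        return "cpr mode is not enabled.  Use -cpr-enable.";
    }
    return NULL;
}
```

With no option given the mask is zero, so every save or load is refused
until the mode has been enabled at startup.
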
 hmp-commands.hx         |  4 ++++
 include/migration/cpr.h |  2 ++
 migration/cpr.c         | 22 ++++++++++++++++++++++
 qapi/cpr.json           |  4 ++++
 qemu-options.hx         | 10 ++++++++++
 softmmu/vl.c            |  8 ++++++++
 6 files changed, 50 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 9d9f984..d621968 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -370,6 +370,8 @@ SRST
   and does not require that guest RAM be saved in the file.  The caller must
   not modify guest block devices between cpr-save and cpr-load.
 
+  cpr-save requires that qemu was started with -cpr-enable for *mode*.
+
   If *mode* is 'reboot', the checkpoint remains valid after a host reboot.
   The guest RAM memory-backend should be shared and non-volatile across
   reboot, else it will be saved to the file.  To resume from the checkpoint,
@@ -391,6 +393,8 @@ SRST
   Load a virtual machine from the checkpoint file *filename* that was created
   earlier by the cpr-save command, and continue the VCPUs.  *mode* must match
   the mode specified for cpr-save.
+
+  cpr-load requires that qemu was started with -cpr-enable for *mode*.
 ERST
 
     {
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index dfe5a1d..f236cbf 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -10,8 +10,10 @@
 
 #include "qapi/qapi-types-cpr.h"
 
+void cpr_init(int modes);
 void cpr_set_mode(CprMode mode);
 CprMode cpr_get_mode(void);
+bool cpr_enabled(CprMode mode);
 
 #define CPR_MODE_ALL CPR_MODE__MAX
 
diff --git a/migration/cpr.c b/migration/cpr.c
index c1da784..76b9225 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -29,6 +29,18 @@ void cpr_set_mode(CprMode mode)
     cpr_mode = mode;
 }
 
+static int cpr_enabled_modes;
+
+void cpr_init(int modes)
+{
+    cpr_enabled_modes = modes;
+}
+
+bool cpr_enabled(CprMode mode)
+{
+    return !!(cpr_enabled_modes & BIT(mode));
+}
+
 static GSList *cpr_blockers[CPR_MODE__MAX];
 
 /*
@@ -110,6 +122,11 @@ void qmp_cpr_save(const char *filename, CprMode mode, Error **errp)
     QEMUFile *f;
     int saved_vm_running = runstate_is_running();
 
+    if (!cpr_enabled(mode)) {
+        error_setg(errp, "cpr mode is not enabled.  Use -cpr-enable.");
+        return;
+    }
+
     if (cpr_is_blocked(errp, mode)) {
         return;
     }
@@ -154,6 +171,11 @@ void qmp_cpr_load(const char *filename, CprMode mode, Error **errp)
     int ret;
     RunState state;
 
+    if (!cpr_enabled(mode)) {
+        error_setg(errp, "cpr mode is not enabled.  Use -cpr-enable.");
+        return;
+    }
+
     if (runstate_is_running()) {
         error_setg(errp, "cpr-load called for a running VM");
         return;
diff --git a/qapi/cpr.json b/qapi/cpr.json
index bdaabcb..11c6f88 100644
--- a/qapi/cpr.json
+++ b/qapi/cpr.json
@@ -30,6 +30,8 @@
 # and does not require that guest RAM be saved in the file.  The caller must
 # not modify guest block devices between cpr-save and cpr-load.
 #
+# cpr-save requires that qemu was started with -cpr-enable for @mode.
+#
 # If @mode is 'reboot', the checkpoint remains valid after a host reboot.
 # The guest RAM memory-backend should be shared and non-volatile across
 # reboot, else it will be saved to the file.  To resume from the checkpoint,
@@ -52,6 +54,8 @@
 # earlier by the cpr-save command, and continue the VCPUs.  @mode must match
 # the mode specified for cpr-save.
 #
+# cpr-load requires that qemu was started with -cpr-enable for @mode.
+#
 # @filename: name of checkpoint file
 # @mode: @CprMode mode
 #
diff --git a/qemu-options.hx b/qemu-options.hx
index 377d22f..6e51c33 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4483,6 +4483,16 @@ SRST
     an unmigratable state.
 ERST
 
+DEF("cpr-enable", HAS_ARG, QEMU_OPTION_cpr_enable, \
+    "-cpr-enable reboot    enable the cpr mode\n",
+    QEMU_ARCH_ALL)
+SRST
+``-cpr-enable reboot``
+    Enable the specified cpr mode.  May be supplied multiple times, once
+    per mode.  This is a pre-requisite for calling the cpr-save and cpr-load
+    commands.
+ERST
+
 DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
     "-nodefaults     don't create default devices\n", QEMU_ARCH_ALL)
 SRST
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 54e920a..ce779cf 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -78,6 +78,7 @@
 #include "hw/i386/pc.h"
 #include "migration/misc.h"
 #include "migration/snapshot.h"
+#include "migration/cpr.h"
 #include "sysemu/tpm.h"
 #include "sysemu/dma.h"
 #include "hw/audio/soundhw.h"
@@ -2600,6 +2601,7 @@ void qemu_init(int argc, char **argv, char **envp)
     MachineClass *machine_class;
     bool userconfig = true;
     FILE *vmstate_dump_file = NULL;
+    int cpr_modes = 0;
 
     qemu_add_opts(&qemu_drive_opts);
     qemu_add_drive_opts(&qemu_legacy_drive_opts);
@@ -3313,6 +3315,10 @@ void qemu_init(int argc, char **argv, char **envp)
             case QEMU_OPTION_only_migratable:
                 only_migratable = 1;
                 break;
+            case QEMU_OPTION_cpr_enable:
+                cpr_modes |= BIT(qapi_enum_parse(&CprMode_lookup, optarg, -1,
+                                                 &error_fatal));
+                break;
             case QEMU_OPTION_nodefaults:
                 has_defaults = 0;
                 break;
@@ -3464,6 +3470,8 @@ void qemu_init(int argc, char **argv, char **envp)
     qemu_validate_options(machine_opts_dict);
     qemu_process_sugar_options();
 
+    cpr_init(cpr_modes);
+
     /*
      * These options affect everything else and should be processed
      * before daemonizing.
-- 
1.8.3.1




* [PATCH V8 11/39] cpr: save ram blocks
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (9 preceding siblings ...)
  2022-06-15 14:51 ` [PATCH V8 10/39] cpr: cpr-enable option Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-06-15 14:51 ` [PATCH V8 12/39] memory: flat section iterator Steve Sistare
                   ` (27 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Add a vmstate handler to save volatile ram blocks in the state file.  This
is used to preserve secondary guest ram blocks (those that cannot be
specified on the command line) such as video ram and roms for cpr reboot,
as there is no option to allocate them in shared memory.  For efficiency,
the user should create a shared memory-backend-file for the VM's main ram,
so it is not copied to the state file, but this is not enforced.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
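Not part of the patch: the rule for which blocks get copied into the state
file can be modelled as a small predicate.  A minimal sketch, assuming a
flattened stand-in struct (demo_block) rather than the real RAMBlock; the
field and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical flattened view of a RAM block, for illustration only. */
struct demo_block {
    bool migratable;  /* block was marked migratable when registered */
    bool shared;      /* mapped shared rather than private */
    bool anon;        /* anonymous memory, not backed by a named file */
};

/*
 * A block is copied to the state file only in reboot mode, and only if
 * its contents would not survive the reboot on their own: it is private,
 * or it is anonymous.
 */
static bool demo_block_needs_copy(bool reboot_mode,
                                  const struct demo_block *b)
{
    return reboot_mode && b->migratable && (!b->shared || b->anon);
}
```

Under this model, main ram in a shared memory-backend-file is skipped,
while private anonymous blocks such as video ram are copied.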
 include/exec/memory.h |  6 +++++
 migration/savevm.c    |  2 ++
 softmmu/memory.c      | 18 ++++++++++++++
 softmmu/physmem.c     | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 93 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0daddd7..a03301d 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -3002,6 +3002,12 @@ bool ram_block_discard_is_disabled(void);
  */
 bool ram_block_discard_is_required(void);
 
+/*
+ * Register/unregister a ram block for cpr.
+ */
+void ram_block_register(RAMBlock *rb);
+void ram_block_unregister(RAMBlock *rb);
+
 #endif
 
 #endif
diff --git a/migration/savevm.c b/migration/savevm.c
index 0b2c5cd..9d528ed 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -3108,10 +3108,12 @@ void vmstate_register_ram(MemoryRegion *mr, DeviceState *dev)
     qemu_ram_set_idstr(mr->ram_block,
                        memory_region_name(mr), dev);
     qemu_ram_set_migratable(mr->ram_block);
+    ram_block_register(mr->ram_block);
 }
 
 void vmstate_unregister_ram(MemoryRegion *mr, DeviceState *dev)
 {
+    ram_block_unregister(mr->ram_block);
     qemu_ram_unset_idstr(mr->ram_block);
     qemu_ram_unset_migratable(mr->ram_block);
 }
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 7ba2048..0fe6fac 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -3541,13 +3541,31 @@ void __attribute__((weak)) fuzz_dma_read_cb(size_t addr,
 }
 #endif
 
+static char *
+memory_region_vmstate_if_get_id(VMStateIf *obj)
+{
+    MemoryRegion *mr = MEMORY_REGION(obj);
+    return g_strdup(mr->ram_block->idstr);
+}
+
+static void memory_region_class_init(ObjectClass *class, void *data)
+{
+    VMStateIfClass *vc = VMSTATE_IF_CLASS(class);
+    vc->get_id = memory_region_vmstate_if_get_id;
+}
+
 static const TypeInfo memory_region_info = {
     .parent             = TYPE_OBJECT,
     .name               = TYPE_MEMORY_REGION,
     .class_size         = sizeof(MemoryRegionClass),
+    .class_init         = memory_region_class_init,
     .instance_size      = sizeof(MemoryRegion),
     .instance_init      = memory_region_initfn,
     .instance_finalize  = memory_region_finalize,
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_VMSTATE_IF },
+        { }
+    }
 };
 
 static const TypeInfo iommu_memory_region_info = {
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 0f1ce28..822c424 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -66,7 +66,9 @@
 
 #include "qemu/pmem.h"
 
+#include "migration/cpr.h"
 #include "migration/vmstate.h"
+#include "migration/qemu-file.h"
 
 #include "qemu/range.h"
 #ifndef _WIN32
@@ -2450,6 +2452,71 @@ ram_addr_t qemu_ram_addr_from_host(void *ptr)
     return block->offset + offset;
 }
 
+static int put_ram_block(QEMUFile *f, void *pv, size_t size,
+                         const VMStateField *field, JSONWriter *vmdesc)
+{
+    RAMBlock *rb = pv;
+
+    if (rb->used_length > 1024 * 1024) {
+        warn_report("Large RAM block %s size %" PRIu64 " saved to state "
+                    "file. Use a shared file memory backend to avoid the copy.",
+                    rb->idstr, (uint64_t)rb->used_length);
+    }
+    qemu_put_buffer(f, rb->host, rb->used_length);
+    return 0;
+}
+
+static int get_ram_block(QEMUFile *f, void *pv, size_t size,
+                         const VMStateField *field)
+{
+    RAMBlock *rb = pv;
+    qemu_get_buffer(f, rb->host, rb->used_length);
+    return 0;
+}
+
+static const VMStateInfo vmstate_info_ram_block = {
+    .name = "ram block host",
+    .get  = get_ram_block,
+    .put  = put_ram_block,
+};
+
+#define VMSTATE_RAM_BLOCK() {           \
+    .name  = "ram_block_host",          \
+    .info  = &vmstate_info_ram_block,   \
+    .flags = VMS_SINGLE,                \
+}
+
+static bool ram_block_needed(void *opaque)
+{
+    RAMBlock *rb = opaque;
+
+    return cpr_get_mode() == CPR_MODE_REBOOT &&
+        qemu_ram_is_migratable(rb) &&
+        (!qemu_ram_is_shared(rb) || ramblock_is_anon(rb));
+}
+
+const VMStateDescription vmstate_ram_block = {
+    .name = "RAMBlock",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = ram_block_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(used_length, RAMBlock),
+        VMSTATE_RAM_BLOCK(),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+void ram_block_register(RAMBlock *rb)
+{
+    vmstate_register(VMSTATE_IF(rb->mr), 0, &vmstate_ram_block, rb);
+}
+
+void ram_block_unregister(RAMBlock *rb)
+{
+    vmstate_unregister(VMSTATE_IF(rb->mr), &vmstate_ram_block, rb);
+}
+
 static MemTxResult flatview_read(FlatView *fv, hwaddr addr,
                                  MemTxAttrs attrs, void *buf, hwaddr len);
 static MemTxResult flatview_write(FlatView *fv, hwaddr addr, MemTxAttrs attrs,
-- 
1.8.3.1




* [PATCH V8 12/39] memory: flat section iterator
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (10 preceding siblings ...)
  2022-06-15 14:51 ` [PATCH V8 11/39] cpr: save ram blocks Steve Sistare
@ 2022-06-15 14:51 ` Steve Sistare
  2022-07-03  7:52   ` Peng Liang
  2022-06-15 14:52 ` [PATCH V8 13/39] oslib: qemu_clear_cloexec Steve Sistare
                   ` (26 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Add an iterator over the sections of a flattened address space.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
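Not part of the patch: the iterator follows the usual callback contract,
where a non-zero return stops the walk and is handed back to the caller.
A minimal standalone sketch over a plain array, assuming a stand-in
struct (demo_section) instead of MemoryRegionSection and omitting the
Error ** argument; all demo_* names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for MemoryRegionSection. */
struct demo_section {
    unsigned long start;
    unsigned long size;
};

/* Callback: return 0 to continue, non-zero to stop the walk. */
typedef int (*demo_section_cb)(const struct demo_section *s, void *opaque);

/* Call func on each section; return the first non-zero result, else 0. */
static int demo_for_each_section(const struct demo_section *secs, size_t n,
                                 demo_section_cb func, void *opaque)
{
    for (size_t i = 0; i < n; i++) {
        int ret = func(&secs[i], opaque);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Example callback: sum the sizes, stopping with -1 at an empty section. */
static int demo_sum_sizes(const struct demo_section *s, void *opaque)
{
    unsigned long *total = opaque;

    if (s->size == 0) {
        return -1;
    }
    *total += s->size;
    return 0;
}
```

Sections visited before the stopping one have already been processed, so
a callback that needs all-or-nothing behaviour has to undo its own work.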
 include/exec/memory.h | 31 +++++++++++++++++++++++++++++++
 softmmu/memory.c      | 20 ++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index a03301d..6a257a4 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2343,6 +2343,37 @@ void memory_region_set_ram_discard_manager(MemoryRegion *mr,
                                            RamDiscardManager *rdm);
 
 /**
+ * memory_region_section_cb: callback for address_space_flat_for_each_section()
+ *
+ * @mrs: MemoryRegionSection of the range
+ * @opaque: data pointer passed to address_space_flat_for_each_section()
+ * @errp: error message, returned to the address_space_flat_for_each_section
+ *        caller.
+ *
+ * Returns: non-zero to stop the iteration, and 0 to continue.  The same
+ * non-zero value is returned to the address_space_flat_for_each_section caller.
+ */
+
+typedef int (*memory_region_section_cb)(MemoryRegionSection *mrs,
+                                        void *opaque,
+                                        Error **errp);
+
+/**
+ * address_space_flat_for_each_section: walk the ranges in the address space
+ * flat view and call @func for each.  Return 0 on success, else return non-zero
+ * with a message in @errp.
+ *
+ * @as: target address space
+ * @func: callback function
+ * @opaque: passed to @func
+ * @errp: passed to @func
+ */
+int address_space_flat_for_each_section(AddressSpace *as,
+                                        memory_region_section_cb func,
+                                        void *opaque,
+                                        Error **errp);
+
+/**
  * memory_region_find: translate an address/size relative to a
  * MemoryRegion into a #MemoryRegionSection.
  *
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 0fe6fac..e5aefdd 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2683,6 +2683,26 @@ bool memory_region_is_mapped(MemoryRegion *mr)
     return !!mr->container || mr->mapped_via_alias;
 }
 
+int address_space_flat_for_each_section(AddressSpace *as,
+                                        memory_region_section_cb func,
+                                        void *opaque,
+                                        Error **errp)
+{
+    FlatView *view = address_space_get_flatview(as);
+    FlatRange *fr;
+    int ret;
+
+    FOR_EACH_FLAT_RANGE(fr, view) {
+        MemoryRegionSection mrs = section_from_flat_range(fr, view);
+        ret = func(&mrs, opaque, errp);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 /* Same as memory_region_find, but it does not add a reference to the
  * returned region.  It must be called from an RCU critical section.
  */
-- 
1.8.3.1




* [PATCH V8 13/39] oslib: qemu_clear_cloexec
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (11 preceding siblings ...)
  2022-06-15 14:51 ` [PATCH V8 12/39] memory: flat section iterator Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-16 16:01   ` Marc-André Lureau
  2022-06-16 16:07   ` Daniel P. Berrangé
  2022-06-15 14:52 ` [PATCH V8 14/39] qapi: strList_from_string Steve Sistare
                   ` (25 subsequent siblings)
  38 siblings, 2 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Define qemu_clear_cloexec, analogous to qemu_set_cloexec.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
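Not part of the patch: setting and clearing the close-on-exec flag are
the same read-modify-write on the descriptor flags.  A minimal POSIX
sketch; unlike the patch, which asserts on failure, these hypothetical
demo_* helpers return -1 so that the error path can be exercised.

```c
#include <fcntl.h>

/* Set FD_CLOEXEC on fd.  Returns 0 on success, -1 on error. */
static int demo_set_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);

    if (f == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFD, f | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Clear FD_CLOEXEC so that fd stays open across exec. */
static int demo_clear_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);

    if (f == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFD, f & ~FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Returns 1 if FD_CLOEXEC is set, 0 if clear, -1 on error. */
static int demo_is_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);

    if (f == -1) {
        return -1;
    }
    return (f & FD_CLOEXEC) ? 1 : 0;
}
```

Reading the flags first preserves any other descriptor flags that the
platform may define.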
 include/qemu/osdep.h | 1 +
 util/oslib-posix.c   | 9 +++++++++
 util/oslib-win32.c   | 4 ++++
 3 files changed, 14 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index b1c161c..e916f3b 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -548,6 +548,7 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count)
     G_GNUC_WARN_UNUSED_RESULT;
 
 void qemu_set_cloexec(int fd);
+void qemu_clear_cloexec(int fd);
 
 /* Return a dynamically allocated directory path that is appropriate for storing
  * local state.
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 7a34c16..421e987 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -261,6 +261,15 @@ void qemu_set_cloexec(int fd)
     assert(f != -1);
 }
 
+void qemu_clear_cloexec(int fd)
+{
+    int f;
+    f = fcntl(fd, F_GETFD);
+    assert(f != -1);
+    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
+    assert(f != -1);
+}
+
 char *
 qemu_get_local_state_dir(void)
 {
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index 5723d3e..5bed148 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -226,6 +226,10 @@ void qemu_set_cloexec(int fd)
 {
 }
 
+void qemu_clear_cloexec(int fd)
+{
+}
+
 int qemu_get_thread_id(void)
 {
     return GetCurrentThreadId();
-- 
1.8.3.1




* [PATCH V8 14/39] qapi: strList_from_string
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (12 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 13/39] oslib: qemu_clear_cloexec Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-16 16:04   ` Marc-André Lureau
  2022-06-15 14:52 ` [PATCH V8 15/39] qapi: QAPI_LIST_LENGTH Steve Sistare
                   ` (24 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Generalize strList_from_comma_list() to take any delimiter character, rename
it to strList_from_string(), and move it to qapi/qapi-util.c.

No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
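Not part of the patch: the splitting loop has two edge cases worth
spelling out.  A trailing delimiter adds no empty element, while a leading
or doubled delimiter does.  A minimal glib-free sketch of the same loop
that builds a NULL-terminated array instead of a strList; all demo_* names
are hypothetical, and unlike the patch it returns an empty array, not
NULL, for NULL or empty input.

```c
#include <stdlib.h>
#include <string.h>

/* Copy len bytes of s into a new NUL-terminated string. */
static char *demo_dup_n(const char *s, size_t len)
{
    char *p = malloc(len + 1);

    if (p) {
        memcpy(p, s, len);
        p[len] = '\0';
    }
    return p;
}

/* Free a NULL-terminated array returned by demo_split(). */
static void demo_free(char **v)
{
    if (v) {
        for (size_t i = 0; v[i]; i++) {
            free(v[i]);
        }
        free(v);
    }
}

/*
 * Split in at each delim into a NULL-terminated array of new strings.
 * Returns NULL only on allocation failure.
 */
static char **demo_split(const char *in, char delim)
{
    size_t n = 0, cap = 4;
    char **out = malloc(cap * sizeof(*out));

    if (!out) {
        return NULL;
    }
    while (in && in[0]) {
        const char *next = strchr(in, delim);
        size_t len = next ? (size_t)(next - in) : strlen(in);

        if (n + 2 > cap) {          /* keep a slot for the terminator */
            char **tmp;

            cap *= 2;
            tmp = realloc(out, cap * sizeof(*out));
            if (!tmp) {
                out[n] = NULL;
                demo_free(out);
                return NULL;
            }
            out = tmp;
        }
        out[n] = demo_dup_n(in, len);
        if (!out[n]) {
            demo_free(out);
            return NULL;
        }
        n++;
        in = next ? next + 1 : NULL; /* skip the delimiter */
    }
    out[n] = NULL;
    return out;
}
```

With ':' as the delimiter, "a::b" yields "a", "", "b", while "a:" yields
only "a".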
 include/qapi/util.h |  9 +++++++++
 monitor/hmp-cmds.c  | 29 ++---------------------------
 qapi/qapi-util.c    | 23 +++++++++++++++++++++++
 3 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/include/qapi/util.h b/include/qapi/util.h
index 81a2b13..7d88b09 100644
--- a/include/qapi/util.h
+++ b/include/qapi/util.h
@@ -22,6 +22,8 @@ typedef struct QEnumLookup {
     const int size;
 } QEnumLookup;
 
+struct strList;
+
 const char *qapi_enum_lookup(const QEnumLookup *lookup, int val);
 int qapi_enum_parse(const QEnumLookup *lookup, const char *buf,
                     int def, Error **errp);
@@ -31,6 +33,13 @@ bool qapi_bool_parse(const char *name, const char *value, bool *obj,
 int parse_qapi_name(const char *name, bool complete);
 
 /*
+ * Produce a strList from the character delimited string @in.
+ * All strings are g_strdup'd.
+ * A NULL or empty input string returns NULL.
+ */
+struct strList *strList_from_string(const char *in, char delim);
+
+/*
  * For any GenericList @list, insert @element at the front.
  *
  * Note that this macro evaluates @element exactly once, so it is safe
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index bb12589..9f58b1f 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -43,6 +43,7 @@
 #include "qapi/qapi-commands-run-state.h"
 #include "qapi/qapi-commands-tpm.h"
 #include "qapi/qapi-commands-ui.h"
+#include "qapi/util.h"
 #include "qapi/qapi-visit-net.h"
 #include "qapi/qapi-visit-migration.h"
 #include "qapi/qmp/qdict.h"
@@ -70,32 +71,6 @@ bool hmp_handle_error(Monitor *mon, Error *err)
     return false;
 }
 
-/*
- * Produce a strList from a comma separated list.
- * A NULL or empty input string return NULL.
- */
-static strList *strList_from_comma_list(const char *in)
-{
-    strList *res = NULL;
-    strList **tail = &res;
-
-    while (in && in[0]) {
-        char *comma = strchr(in, ',');
-        char *value;
-
-        if (comma) {
-            value = g_strndup(in, comma - in);
-            in = comma + 1; /* skip the , */
-        } else {
-            value = g_strdup(in);
-            in = NULL;
-        }
-        QAPI_LIST_APPEND(tail, value);
-    }
-
-    return res;
-}
-
 void hmp_info_name(Monitor *mon, const QDict *qdict)
 {
     NameInfo *info;
@@ -1115,7 +1090,7 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
                                             migrate_announce_params());
 
     qapi_free_strList(params->interfaces);
-    params->interfaces = strList_from_comma_list(interfaces_str);
+    params->interfaces = strList_from_string(interfaces_str, ',');
     params->has_interfaces = params->interfaces != NULL;
     params->id = g_strdup(id);
     params->has_id = !!params->id;
diff --git a/qapi/qapi-util.c b/qapi/qapi-util.c
index 63596e1..b61c73c 100644
--- a/qapi/qapi-util.c
+++ b/qapi/qapi-util.c
@@ -15,6 +15,7 @@
 #include "qapi/error.h"
 #include "qemu/ctype.h"
 #include "qapi/qmp/qerror.h"
+#include "qapi/qapi-builtin-types.h"
 
 CompatPolicy compat_policy;
 
@@ -152,3 +153,25 @@ int parse_qapi_name(const char *str, bool complete)
     }
     return p - str;
 }
+
+strList *strList_from_string(const char *in, char delim)
+{
+    strList *res = NULL;
+    strList **tail = &res;
+
+    while (in && in[0]) {
+        char *next = strchr(in, delim);
+        char *value;
+
+        if (next) {
+            value = g_strndup(in, next - in);
+            in = next + 1; /* skip the delim */
+        } else {
+            value = g_strdup(in);
+            in = NULL;
+        }
+        QAPI_LIST_APPEND(tail, value);
+    }
+
+    return res;
+}
-- 
1.8.3.1




* [PATCH V8 15/39] qapi: QAPI_LIST_LENGTH
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (13 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 14/39] qapi: strList_from_string Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-16 16:06   ` Marc-André Lureau
  2022-06-15 14:52 ` [PATCH V8 16/39] qapi: strv_from_strList Steve Sistare
                   ` (23 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/qapi/util.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/qapi/util.h b/include/qapi/util.h
index 7d88b09..75dddca 100644
--- a/include/qapi/util.h
+++ b/include/qapi/util.h
@@ -65,4 +65,17 @@ struct strList *strList_from_string(const char *in, char delim);
     (tail) = &(*(tail))->next; \
 } while (0)
 
+/*
+ * For any GenericList @list, return its length.
+ */
+#define QAPI_LIST_LENGTH(list) \
+    ({ \
+        int len = 0; \
+        typeof(list) elem; \
+        for (elem = list; elem != NULL; elem = elem->next) { \
+            len++; \
+        } \
+        len; \
+    })
+
 #endif
-- 
1.8.3.1




* [PATCH V8 16/39] qapi: strv_from_strList
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (14 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 15/39] qapi: QAPI_LIST_LENGTH Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-16 16:08   ` Marc-André Lureau
  2022-06-15 14:52 ` [PATCH V8 17/39] qapi: strList unit tests Steve Sistare
                   ` (22 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/qapi/util.h |  6 ++++++
 qapi/qapi-util.c    | 14 ++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/qapi/util.h b/include/qapi/util.h
index 75dddca..51ff64e 100644
--- a/include/qapi/util.h
+++ b/include/qapi/util.h
@@ -33,6 +33,12 @@ bool qapi_bool_parse(const char *name, const char *value, bool *obj,
 int parse_qapi_name(const char *name, bool complete);
 
 /*
+ * Produce and return a NULL-terminated array of strings from @args.
+ * All strings are g_strdup'd.
+ */
+GStrv strv_from_strList(const struct strList *args);
+
+/*
  * Produce a strList from the character delimited string @in.
  * All strings are g_strdup'd.
  * A NULL or empty input string returns NULL.
diff --git a/qapi/qapi-util.c b/qapi/qapi-util.c
index b61c73c..8c96cab 100644
--- a/qapi/qapi-util.c
+++ b/qapi/qapi-util.c
@@ -154,6 +154,20 @@ int parse_qapi_name(const char *str, bool complete)
     return p - str;
 }
 
+GStrv strv_from_strList(const strList *args)
+{
+    const strList *arg;
+    int i = 0;
+    GStrv argv = g_malloc((QAPI_LIST_LENGTH(args) + 1) * sizeof(char *));
+
+    for (arg = args; arg != NULL; arg = arg->next) {
+        argv[i++] = g_strdup(arg->value);
+    }
+    argv[i] = NULL;
+
+    return argv;
+}
+
 strList *strList_from_string(const char *in, char delim)
 {
     strList *res = NULL;
-- 
1.8.3.1




* [PATCH V8 17/39] qapi: strList unit tests
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (15 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 16/39] qapi: strv_from_strList Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-16 16:10   ` Marc-André Lureau
  2022-06-15 14:52 ` [PATCH V8 18/39] vl: helper to request re-exec Steve Sistare
                   ` (21 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS               |  1 +
 tests/unit/meson.build    |  1 +
 tests/unit/test-strlist.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 83 insertions(+)
 create mode 100644 tests/unit/test-strlist.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1e4e72f..f9a6362 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3160,6 +3160,7 @@ F: include/migration/cpr.h
 F: migration/cpr.c
 F: qapi/cpr.json
 F: stubs/cpr.c
+F: tests/unit/test-strlist.c
 
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
diff --git a/tests/unit/meson.build b/tests/unit/meson.build
index 287b367..57d48d5 100644
--- a/tests/unit/meson.build
+++ b/tests/unit/meson.build
@@ -17,6 +17,7 @@ tests = {
   'test-forward-visitor': [testqapi],
   'test-string-input-visitor': [testqapi],
   'test-string-output-visitor': [testqapi],
+  'test-strlist': [testqapi],
   'test-opts-visitor': [testqapi],
   'test-visitor-serialization': [testqapi],
   'test-bitmap': [],
diff --git a/tests/unit/test-strlist.c b/tests/unit/test-strlist.c
new file mode 100644
index 0000000..ef740dc
--- /dev/null
+++ b/tests/unit/test-strlist.c
@@ -0,0 +1,81 @@
+/*
+ * Copyright (c) 2022 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/util.h"
+#include "qapi/qapi-builtin-types.h"
+
+static strList *make_list(int length)
+{
+    strList *head = 0, *list, **prev = &head;
+
+    while (length--) {
+        list = *prev = g_new0(strList, 1);
+        list->value = g_strdup("aaa");
+        prev = &list->next;
+    }
+    return head;
+}
+
+static void test_length(void)
+{
+    strList *list;
+    int i;
+
+    for (i = 0; i < 5; i++) {
+        list = make_list(i);
+        g_assert_cmpint(i, ==, QAPI_LIST_LENGTH(list));
+        qapi_free_strList(list);
+    }
+}
+
+struct {
+    const char *string;
+    char delim;
+    const char *args[5];
+} list_data[] = {
+    { 0, ',', { 0 } },
+    { "", ',', { 0 } },
+    { "a", ',', { "a", 0 } },
+    { "a,b", ',', { "a", "b", 0 } },
+    { "a,b,c", ',', { "a", "b", "c", 0 } },
+    { "first last", ' ', { "first", "last", 0 } },
+    { "a:", ':', { "a", 0 } },
+    { "a::b", ':', { "a", "", "b", 0 } },
+    { ":", ':', { "", 0 } },
+    { ":a", ':', { "", "a", 0 } },
+    { "::a", ':', { "", "", "a", 0 } },
+};
+
+static void test_strv(void)
+{
+    int i, j;
+    const char **expect;
+    strList *list;
+    GStrv args;
+
+    for (i = 0; i < ARRAY_SIZE(list_data); i++) {
+        expect = list_data[i].args;
+        list = strList_from_string(list_data[i].string, list_data[i].delim);
+        args = strv_from_strList(list);
+        qapi_free_strList(list);
+        for (j = 0; expect[j] && args[j]; j++) {
+            g_assert_cmpstr(expect[j], ==, args[j]);
+        }
+        g_assert_null(expect[j]);
+        g_assert_null(args[j]);
+        g_strfreev(args);
+    }
+}
+
+int main(int argc, char **argv)
+{
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_func("/test-string/length", test_length);
+    g_test_add_func("/test-string/strv", test_strv);
+    return g_test_run();
+}
-- 
1.8.3.1




* [PATCH V8 18/39] vl: helper to request re-exec
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (16 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 17/39] qapi: strList unit tests Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 19/39] cpr: preserve extra state Steve Sistare
                   ` (20 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Add a qemu_system_exec_request() hook that causes the main loop to exit and
re-exec qemu using the specified arguments.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
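Not part of the patch: the re-exec itself is a single execvp() call that
only returns on failure.  A minimal POSIX sketch, assuming a hypothetical
helper (demo_reexec) that reports the failure as an errno value rather
than logging it.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Replace the current process image with argv[0], searched on PATH when
 * it contains no slash.  Only returns if the exec did not happen: the
 * result is then the errno from execvp(), or EINVAL for an empty argv.
 */
static int demo_reexec(char *const argv[])
{
    if (!argv || !argv[0]) {
        return EINVAL;
    }
    execvp(argv[0], argv);
    return errno;
}
```

In the patch, a failed execvp() is reported and the main loop then
continues with the normal shutdown path.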
 include/sysemu/runstate.h |  1 +
 softmmu/runstate.c        | 26 ++++++++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index 16c1c41..6b0b4f1 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -63,6 +63,7 @@ void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
 void qemu_register_wakeup_notifier(Notifier *notifier);
 void qemu_register_wakeup_support(void);
 void qemu_system_shutdown_request(ShutdownCause reason);
+int qemu_system_exec_request(const strList *args, Error **errp);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
 void qemu_register_shutdown_notifier(Notifier *notifier);
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index cfd6aa9..c35ab09 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -37,6 +37,7 @@
 #include "monitor/monitor.h"
 #include "net/net.h"
 #include "net/vhost_net.h"
+#include "qapi/util.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-run-state.h"
 #include "qapi/qapi-events-run-state.h"
@@ -355,6 +356,7 @@ static NotifierList wakeup_notifiers =
 static NotifierList shutdown_notifiers =
     NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
 static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
+static GStrv exec_argv;
 
 ShutdownCause qemu_shutdown_requested_get(void)
 {
@@ -371,6 +373,11 @@ static int qemu_shutdown_requested(void)
     return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
 }
 
+static int qemu_exec_requested(void)
+{
+    return exec_argv != NULL;
+}
+
 static void qemu_kill_report(void)
 {
     if (!qtest_driver() && shutdown_signal) {
@@ -641,6 +648,18 @@ void qemu_system_shutdown_request(ShutdownCause reason)
     qemu_notify_event();
 }
 
+int qemu_system_exec_request(const strList *args, Error **errp)
+{
+    exec_argv = strv_from_strList(args);
+    if (!exec_argv[0]) {
+        error_setg(errp, "qemu_system_exec_request: argv[0] is NULL");
+        return 1;
+    }
+    shutdown_requested = 1;
+    qemu_notify_event();
+    return 0;
+}
+
 static void qemu_system_powerdown(void)
 {
     qapi_event_send_powerdown();
@@ -689,6 +708,13 @@ static bool main_loop_should_exit(void)
     }
     request = qemu_shutdown_requested();
     if (request) {
+
+        if (qemu_exec_requested()) {
+            execvp(exec_argv[0], exec_argv);
+            error_report("execvp %s failed: %s", exec_argv[0], strerror(errno));
+            g_strfreev(exec_argv);
+            exec_argv = NULL;
+        }
         qemu_kill_report();
         qemu_system_shutdown(request);
         if (shutdown_action == SHUTDOWN_ACTION_PAUSE) {
-- 
1.8.3.1




* [PATCH V8 19/39] cpr: preserve extra state
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (17 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 18/39] vl: helper to request re-exec Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 20/39] cpr: restart mode Steve Sistare
                   ` (19 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

cpr must save state that is needed after qemu is restarted, when devices are
realized.  Thus the extra state cannot be saved in the cpr-load vmstate file,
as objects must already exist before that file can be loaded.  Instead,
define auxiliary state structures and vmstate descriptions, not associated
with any registered object, and serialize the aux state to a memfd file.
Deserialize after qemu restarts, before devices are realized.
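
The handoff described above can be sketched roughly as follows.  This is a
simplified illustration, not the patch's implementation: the state is a plain
string rather than a vmstate stream, the fd is supplied by the caller (in the
patch it is a memfd), the exec itself is left out, and DEMO_CPR_STATE,
clear_cloexec, state_save and state_load are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define STATE_ENV "DEMO_CPR_STATE"      /* hypothetical variable name */

/* Clear FD_CLOEXEC so the descriptor survives exec(). */
int clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/*
 * Write text (including its NUL) to fd, rewind, keep fd open across exec,
 * and publish the fd number in the environment.  Returns 0 on success.
 */
int state_save(int fd, const char *text)
{
    size_t len;
    char val[16];

    assert(text != NULL);
    len = strlen(text) + 1;
    if (write(fd, text, len) != (ssize_t)len ||
        lseek(fd, 0, SEEK_SET) < 0 ||
        clear_cloexec(fd) < 0) {
        return -1;
    }
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(STATE_ENV, val, 1);
}

/*
 * Recover the fd number from the environment and read the state back.
 * Returns bytes read, 0 if no state was handed over, or -1 on error.
 */
ssize_t state_load(char *buf, size_t size)
{
    const char *val = getenv(STATE_ENV);
    char *end;
    long fd;
    ssize_t n;

    if (!val) {
        return 0;
    }
    errno = 0;
    fd = strtol(val, &end, 10);
    if (errno || end == val || *end != '\0' || fd < 0 || fd > INT_MAX) {
        return -1;
    }
    unsetenv(STATE_ENV);        /* consume the handoff exactly once */
    n = read((int)fd, buf, size);
    close((int)fd);
    return n;
}
```

The sketch parses the environment value before unsetting the variable, so the
string is never read after it has been removed from the environment.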

The following state is saved:
  * cpr mode
  * file descriptor names and values
  * memfd values and properties for anonymous ram blocks
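
For readers unfamiliar with the lookup scheme, here is a minimal sketch of a
registry keyed by (name, id), which is how the fd entries listed above are
addressed.  It uses a plain singly linked list instead of QLIST, and FdEntry,
fd_save, fd_find and fd_delete are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One saved descriptor, identified by a name plus a numeric id. */
typedef struct FdEntry {
    char *name;
    int id;
    int fd;
    struct FdEntry *next;
} FdEntry;

/* Insert a new entry at the head of the list.  Returns 0, or -1 on ENOMEM. */
int fd_save(FdEntry **head, const char *name, int id, int fd)
{
    size_t len;
    FdEntry *e;

    assert(head != NULL && name != NULL);
    len = strlen(name) + 1;
    e = malloc(sizeof(*e));
    if (!e) {
        return -1;
    }
    e->name = malloc(len);
    if (!e->name) {
        free(e);
        return -1;
    }
    memcpy(e->name, name, len);
    e->id = id;
    e->fd = fd;
    e->next = *head;
    *head = e;
    return 0;
}

/* Return the fd saved under (name, id), or -1 if there is none. */
int fd_find(const FdEntry *head, const char *name, int id)
{
    const FdEntry *e;

    for (e = head; e; e = e->next) {
        if (e->id == id && !strcmp(e->name, name)) {
            return e->fd;
        }
    }
    return -1;
}

/* Unlink and free the entry saved under (name, id), if present. */
void fd_delete(FdEntry **head, const char *name, int id)
{
    FdEntry **pp;

    for (pp = head; *pp; pp = &(*pp)->next) {
        FdEntry *e = *pp;

        if (e->id == id && !strcmp(e->name, name)) {
            *pp = e->next;
            free(e->name);
            free(e);
            return;
        }
    }
}
```

Returning -1 for a missing entry lets a caller treat "not found" the same way
as an invalid descriptor, which matches how cpr_find_fd() is declared.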

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS             |   2 +
 include/migration/cpr.h |  16 +++
 migration/cpr-state.c   | 330 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/cpr.c         |  12 --
 migration/meson.build   |   1 +
 migration/trace-events  |   8 ++
 stubs/cpr-state.c       |  27 ++++
 stubs/meson.build       |   1 +
 8 files changed, 385 insertions(+), 12 deletions(-)
 create mode 100644 migration/cpr-state.c
 create mode 100644 stubs/cpr-state.c

diff --git a/MAINTAINERS b/MAINTAINERS
index f9a6362..74a43e6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3161,6 +3161,8 @@ F: migration/cpr.c
 F: qapi/cpr.json
 F: stubs/cpr.c
 F: tests/unit/test-strlist.c
+F: migration/cpr-state.c
+F: stubs/cpr-state.c
 
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index f236cbf..b75dec4 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -15,6 +15,22 @@ void cpr_set_mode(CprMode mode);
 CprMode cpr_get_mode(void);
 bool cpr_enabled(CprMode mode);
 
+typedef int (*cpr_walk_fd_cb)(const char *name, int id, int fd, void *opaque);
+
+void cpr_save_fd(const char *name, int id, int fd);
+void cpr_delete_fd(const char *name, int id);
+int cpr_find_fd(const char *name, int id);
+int cpr_walk_fd(cpr_walk_fd_cb cb, void *opaque);
+void cpr_save_memfd(const char *name, int fd, size_t len, size_t maxlen,
+                    uint64_t align);
+int cpr_find_memfd(const char *name, size_t *lenp, size_t *maxlenp,
+                   uint64_t *alignp);
+void cpr_delete_memfd(const char *name);
+int cpr_resave_fd(const char *name, int id, int fd, Error **errp);
+int cpr_state_save(Error **errp);
+int cpr_state_load(Error **errp);
+void cpr_state_print(void);
+
 #define CPR_MODE_ALL CPR_MODE__MAX
 
 int cpr_add_blocker(Error **reasonp, Error **errp, CprMode mode, ...);
diff --git a/migration/cpr-state.c b/migration/cpr-state.c
new file mode 100644
index 0000000..ff1e122
--- /dev/null
+++ b/migration/cpr-state.c
@@ -0,0 +1,330 @@
+/*
+ * Copyright (c) 2022 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qemu/queue.h"
+#include "qemu/memfd.h"
+#include "qapi/error.h"
+#include "migration/vmstate.h"
+#include "migration/cpr.h"
+#include "migration/qemu-file.h"
+#include "migration/qemu-file-channel.h"
+#include "trace.h"
+
+/*************************************************************************/
+/* cpr state container for all information to be saved. */
+
+typedef QLIST_HEAD(CprNameList, CprName) CprNameList;
+
+typedef struct CprState {
+    CprMode mode;
+    CprNameList fds;            /* list of CprFd */
+    CprNameList memfd;          /* list of CprMemfd */
+} CprState;
+
+static CprState cpr_state = {
+    .mode = CPR_MODE_NONE,
+};
+
+/*************************************************************************/
+/* Misc accessors. */
+
+CprMode cpr_get_mode(void)
+{
+    return cpr_state.mode;
+}
+
+void cpr_set_mode(CprMode mode)
+{
+    cpr_state.mode = mode;
+}
+
+/*************************************************************************/
+/* Generic list of names. */
+
+typedef struct CprName {
+    char *name;
+    unsigned int namelen;
+    int id;
+    QLIST_ENTRY(CprName) next;
+} CprName;
+
+static const VMStateDescription vmstate_cpr_name = {
+    .name = "cpr name",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(namelen, CprName),
+        VMSTATE_VBUFFER_ALLOC_UINT32(name, CprName, 0, NULL, namelen),
+        VMSTATE_INT32(id, CprName),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static void
+add_name(CprNameList *head, const char *name, int id, CprName *elem)
+{
+    elem->name = g_strdup(name);
+    elem->namelen = strlen(name) + 1;
+    elem->id = id;
+    QLIST_INSERT_HEAD(head, elem, next);
+}
+
+static CprName *find_name(CprNameList *head, const char *name, int id)
+{
+    CprName *elem;
+
+    QLIST_FOREACH(elem, head, next) {
+        if (!strcmp(elem->name, name) && elem->id == id) {
+            return elem;
+        }
+    }
+    return NULL;
+}
+
+static void delete_name(CprNameList *head, const char *name, int id)
+{
+    CprName *elem = find_name(head, name, id);
+
+    if (elem) {
+        QLIST_REMOVE(elem, next);
+        g_free(elem->name);
+        g_free(elem);
+    }
+}
+
+/****************************************************************************/
+/* Lists of named things.  The first field of each entry must be a CprName. */
+
+typedef struct CprFd {
+    CprName name;               /* must be first */
+    int fd;
+} CprFd;
+
+static const VMStateDescription vmstate_cpr_fd = {
+    .name = "cpr fd",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_STRUCT(name, CprFd, 1, vmstate_cpr_name, CprName),
+        VMSTATE_INT32(fd, CprFd),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+#define CPR_FD(elem)        ((CprFd *)(elem))
+#define CPR_FD_FD(elem)     (CPR_FD(elem)->fd)
+
+void cpr_save_fd(const char *name, int id, int fd)
+{
+    CprFd *elem = g_new0(CprFd, 1);
+
+    trace_cpr_save_fd(name, id, fd);
+    elem->fd = fd;
+    add_name(&cpr_state.fds, name, id, &elem->name);
+}
+
+void cpr_delete_fd(const char *name, int id)
+{
+    trace_cpr_delete_fd(name, id);
+    delete_name(&cpr_state.fds, name, id);
+}
+
+int cpr_find_fd(const char *name, int id)
+{
+    CprName *elem = find_name(&cpr_state.fds, name, id);
+    int fd = elem ? CPR_FD_FD(elem) : -1;
+
+    trace_cpr_find_fd(name, id, fd);
+    return fd;
+}
+
+int cpr_walk_fd(cpr_walk_fd_cb cb, void *opaque)
+{
+    CprName *elem;
+
+    QLIST_FOREACH(elem, &cpr_state.fds, next) {
+        if (cb(elem->name, elem->id, CPR_FD_FD(elem), opaque)) {
+            return 1;
+        }
+    }
+    return 0;
+}
+
+int cpr_resave_fd(const char *name, int id, int fd, Error **errp)
+{
+    int old_fd = cpr_find_fd(name, id);
+
+    if (old_fd < 0) {
+        cpr_save_fd(name, id, fd);
+        return 0;
+    } else if (old_fd == fd) {
+        return 0;
+    } else {
+        error_setg(errp, "fd %s %d already saved with a different value %d",
+                   name, id, old_fd);
+        return 1;
+    }
+}
+
+/*************************************************************************/
+/* A memfd ram block. */
+
+typedef struct CprMemfd {
+    CprName name;               /* must be first */
+    size_t len;
+    size_t maxlen;
+    uint64_t align;
+} CprMemfd;
+
+static const VMStateDescription vmstate_cpr_memfd = {
+    .name = "cpr memfd",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_STRUCT(name, CprMemfd, 1, vmstate_cpr_name, CprName),
+        VMSTATE_UINT64(len, CprMemfd),
+        VMSTATE_UINT64(maxlen, CprMemfd),
+        VMSTATE_UINT64(align, CprMemfd),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+#define CPR_MEMFD(elem)        ((CprMemfd *)(elem))
+#define CPR_MEMFD_LEN(elem)    (CPR_MEMFD(elem)->len)
+#define CPR_MEMFD_MAXLEN(elem) (CPR_MEMFD(elem)->maxlen)
+#define CPR_MEMFD_ALIGN(elem)  (CPR_MEMFD(elem)->align)
+
+void cpr_save_memfd(const char *name, int fd, size_t len, size_t maxlen,
+                    uint64_t align)
+{
+    CprMemfd *elem = g_new0(CprMemfd, 1);
+
+    trace_cpr_save_memfd(name, len, maxlen, align);
+    elem->len = len;
+    elem->maxlen = maxlen;
+    elem->align = align;
+    add_name(&cpr_state.memfd, name, 0, &elem->name);
+    cpr_save_fd(name, 0, fd);
+}
+
+void cpr_delete_memfd(const char *name)
+{
+    trace_cpr_delete_memfd(name);
+    delete_name(&cpr_state.memfd, name, 0);
+    cpr_delete_fd(name, 0);
+}
+
+int cpr_find_memfd(const char *name, size_t *lenp, size_t *maxlenp,
+                   uint64_t *alignp)
+{
+    int fd = cpr_find_fd(name, 0);
+    CprName *elem = find_name(&cpr_state.memfd, name, 0);
+
+    if (elem) {
+        *lenp = CPR_MEMFD_LEN(elem);
+        *maxlenp = CPR_MEMFD_MAXLEN(elem);
+        *alignp = CPR_MEMFD_ALIGN(elem);
+    } else {
+        *lenp = 0;
+        *maxlenp = 0;
+        *alignp = 0;
+    }
+
+    trace_cpr_find_memfd(name, *lenp, *maxlenp, *alignp);
+    return fd;
+}
+
+/*************************************************************************/
+/* cpr state container interface and implementation. */
+
+#define CPR_STATE_NAME "QEMU_CPR_STATE"
+
+static const VMStateDescription vmstate_cpr_state = {
+    .name = CPR_STATE_NAME,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(mode, CprState),
+        VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, name.next),
+        VMSTATE_QLIST_V(memfd, CprState, 1, vmstate_cpr_memfd, CprMemfd,
+                        name.next),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+int cpr_state_save(Error **errp)
+{
+    int ret, mfd;
+    QEMUFile *f;
+    char val[16];
+
+    mfd = memfd_create(CPR_STATE_NAME, 0);
+    if (mfd < 0) {
+        error_setg_errno(errp, errno, "memfd_create failed");
+        return -1;
+    }
+    qemu_clear_cloexec(mfd);
+    f = qemu_fopen_fd(mfd, true, CPR_STATE_NAME);
+    if (!f) {
+        error_setg(errp, "qemu_fopen_fd %d failed", mfd);
+        return -1;
+    }
+
+    ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
+    if (ret) {
+        error_setg(errp, "vmstate_save_state error %d", ret);
+        return ret;
+    }
+
+    /* Do not close f, as mfd must remain open. */
+    qemu_fflush(f);
+    lseek(mfd, 0, SEEK_SET);
+
+    /* Remember mfd for post-exec cpr_state_load */
+    snprintf(val, sizeof(val), "%d", mfd);
+    g_setenv(CPR_STATE_NAME, val, 1);
+
+    return 0;
+}
+
+int cpr_state_load(Error **errp)
+{
+    int ret, mfd;
+    QEMUFile *f;
+    const char *val = g_getenv(CPR_STATE_NAME);
+
+    if (!val) {
+        return 0;
+    }
+    g_unsetenv(CPR_STATE_NAME);
+    if (qemu_strtoi(val, NULL, 10, &mfd)) {
+        error_setg(errp, "Bad %s env value %s", CPR_STATE_NAME, val);
+        return 1;
+    }
+    f = qemu_fopen_fd(mfd, false, CPR_STATE_NAME);
+    ret = vmstate_load_state(f, &vmstate_cpr_state, &cpr_state, 1);
+    qemu_fclose(f);
+    return ret;
+}
+
+void cpr_state_print(void)
+{
+    CprName *elem;
+
+    printf("cpr_state:\n");
+    printf("- mode = %d\n", cpr_state.mode);
+    QLIST_FOREACH(elem, &cpr_state.fds, next) {
+        printf("- %s %d : fd=%d\n", elem->name, elem->id, CPR_FD_FD(elem));
+    }
+    QLIST_FOREACH(elem, &cpr_state.memfd, next) {
+        printf("- %s : len=%zu, maxlen=%zu, align=%" PRIu64 "\n", elem->name,
+               CPR_MEMFD_LEN(elem), CPR_MEMFD_MAXLEN(elem),
+               CPR_MEMFD_ALIGN(elem));
+    }
+}
diff --git a/migration/cpr.c b/migration/cpr.c
index 76b9225..1cc8738 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -17,18 +17,6 @@
 #include "sysemu/runstate.h"
 #include "sysemu/sysemu.h"
 
-static CprMode cpr_mode = CPR_MODE_NONE;
-
-CprMode cpr_get_mode(void)
-{
-    return cpr_mode;
-}
-
-void cpr_set_mode(CprMode mode)
-{
-    cpr_mode = mode;
-}
-
 static int cpr_enabled_modes;
 
 void cpr_init(int modes)
diff --git a/migration/meson.build b/migration/meson.build
index 76fcfdb..6bb502d 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -16,6 +16,7 @@ softmmu_ss.add(files(
   'colo-failover.c',
   'colo.c',
   'cpr.c',
+  'cpr-state.c',
   'exec.c',
   'fd.c',
   'global_state.c',
diff --git a/migration/trace-events b/migration/trace-events
index 1aec580..bfde1ac 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -310,6 +310,14 @@ colo_receive_message(const char *msg) "Receive '%s' message"
 # colo-failover.c
 colo_failover_set_state(const char *new_state) "new state %s"
 
+# cpr-state.c
+cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
+cpr_delete_fd(const char *name, int id) "%s, id %d"
+cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
+cpr_save_memfd(const char *name, size_t len, size_t maxlen, uint64_t align) "%s, len %zu, maxlen %zu, align %" PRIu64
+cpr_delete_memfd(const char *name) "%s"
+cpr_find_memfd(const char *name, size_t len, size_t maxlen, uint64_t align) "%s, len %zu, maxlen %zu, align %" PRIu64
+
 # block-dirty-bitmap.c
 send_bitmap_header_enter(void) ""
 send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "flags: 0x%x, start_sector: %" PRIu64 ", nr_sectors: %" PRIu32 ", data_size: %" PRIu64
diff --git a/stubs/cpr-state.c b/stubs/cpr-state.c
new file mode 100644
index 0000000..cdd32aa
--- /dev/null
+++ b/stubs/cpr-state.c
@@ -0,0 +1,27 @@
+/*
+ * Copyright (c) 2022 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "migration/cpr.h"
+
+void cpr_save_fd(const char *name, int id, int fd)
+{
+}
+
+void cpr_delete_fd(const char *name, int id)
+{
+}
+
+int cpr_find_fd(const char *name, int id)
+{
+    return -1;
+}
+
+int cpr_resave_fd(const char *name, int id, int fd, Error **errp)
+{
+    return 0;
+}
diff --git a/stubs/meson.build b/stubs/meson.build
index 0d7565b..8186834 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -5,6 +5,7 @@ stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
 stub_ss.add(files('change-state-handler.c'))
 stub_ss.add(files('cmos.c'))
 stub_ss.add(files('cpr.c'))
+stub_ss.add(files('cpr-state.c'))
 stub_ss.add(files('cpu-get-clock.c'))
 stub_ss.add(files('cpus-get-virtual-clock.c'))
 stub_ss.add(files('qemu-timer-notify-cb.c'))
-- 
1.8.3.1




* [PATCH V8 20/39] cpr: restart mode
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (18 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 19/39] cpr: preserve extra state Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-07-03  8:15   ` Peng Liang
  2022-06-15 14:52 ` [PATCH V8 21/39] cpr: restart HMP interfaces Steve Sistare
                   ` (18 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Provide the cpr-save restart mode, which preserves the guest VM across a
restart of the qemu process.  After cpr-save, the caller passes qemu
command-line arguments to cpr-exec, which directly exec's the new qemu
binary.  The arguments must include -S so new qemu starts in a paused state.
The caller resumes the guest by calling cpr-load.

To use the restart mode, guest RAM must be backed by a memory-backend-file
with share=on.  The '-cpr-enable restart' option causes secondary guest
ram blocks (those not specified on the command line) to be allocated by
mmap'ing a memfd.  The memfd values are saved in special cpr state, and the
memfds are kept open across exec.  After exec, the values are retrieved and
the memfds are re-mmap'd.  Hence guest RAM is preserved in place, albeit
with new virtual addresses in the qemu process.
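
The effect can be demonstrated in a single process with the sketch below,
which is an illustration under simplifying assumptions rather than the code
in this patch.  An unlinked temporary file stands in for the memfd, and
dropping and recreating the mapping stands in for the exec.  create_backing,
map_shared and remap_after_restart are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create an unlinked temporary file of len bytes and return its fd, or -1.
 * This stands in for a memfd so the sketch stays portable.
 */
int create_backing(size_t len)
{
    FILE *tmp = tmpfile();
    int fd;

    if (!tmp) {
        return -1;
    }
    fd = dup(fileno(tmp));      /* keep the file alive after fclose() */
    fclose(tmp);
    if (fd >= 0 && ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map len bytes of fd with MAP_SHARED, so stores reach the backing pages. */
void *map_shared(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * Simulate the restart: drop the old mapping, then map the same fd again.
 * The returned address may differ, but the contents are the same pages.
 */
void *remap_after_restart(int fd, void *old, size_t len)
{
    assert(old != NULL);
    if (munmap(old, len) < 0) {
        return NULL;
    }
    return map_shared(fd, len);
}
```

MAP_SHARED is what makes this work: a MAP_PRIVATE mapping would keep stores
in copy-on-write pages that disappear with the mapping.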

The restart mode supports vfio devices and memory-backend-memfd in
subsequent patches.

cpr-exec syntax:
  { 'command': 'cpr-exec', 'data': { 'argv': [ 'str' ] } }

Add the restart mode:
  { 'enum': 'CprMode', 'data': [ 'none', 'reboot', 'restart' ] }

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/cpr.c   | 35 +++++++++++++++++++++++++++++++++++
 qapi/cpr.json     | 26 +++++++++++++++++++++++++-
 qemu-options.hx   |  2 +-
 softmmu/physmem.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 trace-events      |  1 +
 5 files changed, 107 insertions(+), 3 deletions(-)

diff --git a/migration/cpr.c b/migration/cpr.c
index 1cc8738..8b3fffd 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -22,6 +22,7 @@ static int cpr_enabled_modes;
 void cpr_init(int modes)
 {
     cpr_enabled_modes = modes;
+    cpr_state_load(&error_fatal);
 }
 
 bool cpr_enabled(CprMode mode)
@@ -153,6 +154,37 @@ err:
     cpr_set_mode(CPR_MODE_NONE);
 }
 
+static int preserve_fd(const char *name, int id, int fd, void *opaque)
+{
+    qemu_clear_cloexec(fd);
+    return 0;
+}
+
+static int unpreserve_fd(const char *name, int id, int fd, void *opaque)
+{
+    qemu_set_cloexec(fd);
+    return 0;
+}
+
+void qmp_cpr_exec(strList *args, Error **errp)
+{
+    if (!runstate_check(RUN_STATE_SAVE_VM)) {
+        error_setg(errp, "runstate is not save-vm");
+        return;
+    }
+    if (cpr_get_mode() != CPR_MODE_RESTART) {
+        error_setg(errp, "cpr-exec requires cpr-save with restart mode");
+        return;
+    }
+
+    cpr_walk_fd(preserve_fd, 0);
+    if (cpr_state_save(errp)) {
+        return;
+    }
+
+    assert(qemu_system_exec_request(args, errp) == 0);
+}
+
 void qmp_cpr_load(const char *filename, CprMode mode, Error **errp)
 {
     QEMUFile *f;
@@ -189,6 +221,9 @@ void qmp_cpr_load(const char *filename, CprMode mode, Error **errp)
         goto out;
     }
 
+    /* Set cloexec to prevent fd leaks until the next cpr-save */
+    cpr_walk_fd(unpreserve_fd, 0);
+
     state = global_state_get_runstate();
     if (state == RUN_STATE_RUNNING) {
         vm_start();
diff --git a/qapi/cpr.json b/qapi/cpr.json
index 11c6f88..47ee4ff 100644
--- a/qapi/cpr.json
+++ b/qapi/cpr.json
@@ -15,11 +15,12 @@
 # @CprMode:
 #
 # @reboot: checkpoint can be cpr-load'ed after a host reboot.
+# @restart: checkpoint can be cpr-load'ed after restarting qemu.
 #
 # Since: 7.1
 ##
 { 'enum': 'CprMode',
-  'data': [ 'none', 'reboot' ] }
+  'data': [ 'none', 'reboot', 'restart' ] }
 
 ##
 # @cpr-save:
@@ -38,6 +39,11 @@
 # issue the quit command, reboot the system, start qemu using the same
 # arguments plus -S, and issue the cpr-load command.
 #
+# If @mode is 'restart', the checkpoint remains valid after restarting
+# qemu using a subsequent cpr-exec.  Guest RAM must be backed by a
+# memory-backend-file with share=on.
+# To resume from the checkpoint, issue the cpr-load command.
+#
 # @filename: name of checkpoint file
 # @mode: @CprMode mode
 #
@@ -48,6 +54,24 @@
             'mode': 'CprMode' } }
 
 ##
+# @cpr-exec:
+#
+# Restart qemu by directly exec'ing @argv[0], replacing the qemu process.
+# The PID remains the same.  Must be called after cpr-save restart.
+#
+# @argv[0] should be the path of a new qemu binary, or a prefix command that
+# in turn exec's the new qemu binary.  The arguments must match those used
+# to initially start qemu, plus the -S option so new qemu starts in a paused
+# state.
+#
+# @argv: arguments to be passed to exec().
+#
+# Since: 7.1
+##
+{ 'command': 'cpr-exec',
+  'data': { 'argv': [ 'str' ] } }
+
+##
 # @cpr-load:
 #
 # Load a virtual machine from the checkpoint file @filename that was created
diff --git a/qemu-options.hx b/qemu-options.hx
index 6e51c33..1b49360 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4484,7 +4484,7 @@ SRST
 ERST
 
 DEF("cpr-enable", HAS_ARG, QEMU_OPTION_cpr_enable, \
-    "-cpr-enable reboot    enable the cpr mode\n",
+    "-cpr-enable reboot|restart    enable the cpr mode\n",
     QEMU_ARCH_ALL)
 SRST
 ``-cpr-enable reboot``
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 822c424..412cc80 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -44,6 +44,7 @@
 #include "qemu/qemu-print.h"
 #include "qemu/log.h"
 #include "qemu/memalign.h"
+#include "qemu/memfd.h"
 #include "exec/memory.h"
 #include "exec/ioport.h"
 #include "sysemu/dma.h"
@@ -1962,6 +1963,40 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
     }
 }
 
+static bool memory_region_is_backend(MemoryRegion *mr)
+{
+    return !!object_dynamic_cast(mr->parent_obj.parent, TYPE_MEMORY_BACKEND);
+}
+
+static void *qemu_anon_memfd_alloc(RAMBlock *rb, size_t maxlen, Error **errp)
+{
+    size_t len, align;
+    void *addr;
+    struct MemoryRegion *mr = rb->mr;
+    const char *name = memory_region_name(mr);
+    int mfd = cpr_find_memfd(name, &len, &maxlen, &align);
+
+    if (mfd >= 0) {
+        rb->used_length = len;
+        rb->max_length = maxlen;
+        mr->align = align;
+    } else {
+        len = rb->used_length;
+        maxlen = rb->max_length;
+        mr->align = QEMU_VMALLOC_ALIGN;
+        mfd = qemu_memfd_create(name, maxlen + mr->align, 0, 0, 0, errp);
+        if (mfd < 0) {
+            return NULL;
+        }
+        cpr_save_memfd(name, mfd, len, maxlen, mr->align);
+    }
+    rb->flags |= RAM_SHARED;
+    qemu_set_cloexec(mfd);
+    addr = file_ram_alloc(rb, maxlen, mfd, false, false, 0, errp);
+    trace_anon_memfd_alloc(name, maxlen, addr, mfd);
+    return addr;
+}
+
 static void ram_block_add(RAMBlock *new_block, Error **errp)
 {
     const bool noreserve = qemu_ram_is_noreserve(new_block);
@@ -1986,6 +2021,14 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                 qemu_mutex_unlock_ramlist();
                 return;
             }
+        } else if (cpr_enabled(CPR_MODE_RESTART) &&
+                   !memory_region_is_backend(new_block->mr)) {
+            new_block->host = qemu_anon_memfd_alloc(new_block,
+                                                    new_block->max_length,
+                                                    errp);
+            if (!new_block->host) {
+                return;
+            }
         } else {
             new_block->host = qemu_anon_ram_alloc(new_block->max_length,
                                                   &new_block->mr->align,
@@ -1997,8 +2040,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                 qemu_mutex_unlock_ramlist();
                 return;
             }
-            memory_try_enable_merging(new_block->host, new_block->max_length);
         }
+        memory_try_enable_merging(new_block->host, new_block->max_length);
     }
 
     new_ram_size = MAX(old_ram_size,
@@ -2231,6 +2274,7 @@ void qemu_ram_free(RAMBlock *block)
     }
 
     qemu_mutex_lock_ramlist();
+    cpr_delete_memfd(memory_region_name(block->mr));
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
     /* Write list before version */
diff --git a/trace-events b/trace-events
index bc71006..07369bb 100644
--- a/trace-events
+++ b/trace-events
@@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
 # accel/tcg/cputlb.c
 memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
 memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
+anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
 
 # gdbstub.c
 gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
-- 
1.8.3.1




* [PATCH V8 21/39] cpr: restart HMP interfaces
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (19 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 20/39] cpr: restart mode Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 22/39] cpr: ram block blockers Steve Sistare
                   ` (17 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

cpr-save <filename> <mode>
  mode may be "restart"

cpr-exec <command>
  Call qmp_cpr_exec().
  Arguments:
    command : command line to execute, with space-separated arguments
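
To make the splitting rule concrete, the sketch below tokenizes a command
line on spaces.  It is only an approximation of what the HMP command does:
split_command and free_args are hypothetical names, and skipping the empty
tokens produced by repeated spaces is a choice made for this example that may
differ from the strList_from_string() helper used in the series.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Free a NULL-terminated argument array and every string in it. */
void free_args(char **argv)
{
    size_t i;

    if (!argv) {
        return;
    }
    for (i = 0; argv[i]; i++) {
        free(argv[i]);
    }
    free(argv);
}

/*
 * Split cmd on spaces into a newly allocated, NULL-terminated array.
 * There is no quoting or escaping, so an argument cannot contain a space.
 * Returns NULL if an allocation fails.
 */
char **split_command(const char *cmd)
{
    size_t n = 0, i = 0;
    const char *p = cmd;
    char **argv;

    /* First pass: count the tokens. */
    while (*p) {
        while (*p == ' ') {
            p++;
        }
        if (*p) {
            n++;
            while (*p && *p != ' ') {
                p++;
            }
        }
    }

    argv = calloc(n + 1, sizeof(*argv));    /* +1 for the terminator */
    if (!argv) {
        return NULL;
    }

    /* Second pass: copy each token. */
    p = cmd;
    while (*p) {
        const char *start;
        size_t len;

        while (*p == ' ') {
            p++;
        }
        if (!*p) {
            break;
        }
        start = p;
        while (*p && *p != ' ') {
            p++;
        }
        len = (size_t)(p - start);
        argv[i] = malloc(len + 1);
        if (!argv[i]) {
            free_args(argv);
            return NULL;
        }
        memcpy(argv[i], start, len);
        argv[i][len] = '\0';
        i++;
    }
    assert(i == n);
    return argv;
}
```

Because of the missing quoting support, a path or option value that contains
a space cannot be passed this way; the QMP form, which takes an explicit
argv list, does not have that limitation.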

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hmp-commands.hx       | 29 ++++++++++++++++++++++++++---
 include/monitor/hmp.h |  1 +
 monitor/hmp-cmds.c    | 11 +++++++++++
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index d621968..da5dd60 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -357,7 +357,7 @@ ERST
     {
         .name       = "cpr-save",
         .args_type  = "filename:s,mode:s",
-        .params     = "filename 'reboot'",
+        .params     = "filename 'reboot'|'restart'",
         .help       = "create a checkpoint of the VM in file",
         .cmd        = hmp_cpr_save,
     },
@@ -377,13 +377,36 @@ SRST
   reboot, else it will be saved to the file.  To resume from the checkpoint,
   issue the quit command, reboot the system, start qemu using the same
   arguments plus -S, and issue the cpr-load command.
+
+  If *mode* is 'restart', the checkpoint remains valid after restarting
+  qemu using a subsequent cpr-exec.  Guest RAM must be backed by a
+  memory-backend-file with share=on.
+  To resume from the checkpoint, issue the cpr-load command.
+ERST
+
+    {
+        .name       = "cpr-exec",
+        .args_type  = "command:S",
+        .params     = "command",
+        .help       = "Restart qemu by directly exec'ing command",
+        .cmd        = hmp_cpr_exec,
+    },
+
+SRST
+``cpr-exec`` *command*
+  Restart qemu by directly exec'ing *command*, replacing the qemu process.
+  The PID remains the same.  Must be called after cpr-save restart.
+
+  *command*[0] should be the path of a new qemu binary, or a prefix command that
+  in turn exec's the new qemu binary.  The arguments must match those used
+  to initially start qemu, plus the -S option so new qemu starts in a paused
+  state.
 ERST
 
     {
         .name       = "cpr-load",
         .args_type  = "filename:s,mode:s",
-        .params     = "filename 'reboot'",
-
+        .params     = "filename 'reboot'|'restart'",
         .help       = "load VM checkpoint from file",
         .cmd        = hmp_cpr_load,
     },
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index b44588e..ec4fa44 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -60,6 +60,7 @@ void hmp_loadvm(Monitor *mon, const QDict *qdict);
 void hmp_savevm(Monitor *mon, const QDict *qdict);
 void hmp_delvm(Monitor *mon, const QDict *qdict);
 void hmp_cpr_save(Monitor *mon, const QDict *qdict);
+void hmp_cpr_exec(Monitor *mon, const QDict *qdict);
 void hmp_cpr_load(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 9f58b1f..b866c7f 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1111,6 +1111,17 @@ void hmp_cpr_save(Monitor *mon, const QDict *qdict)
     hmp_handle_error(mon, err);
 }
 
+void hmp_cpr_exec(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    const char *command = qdict_get_try_str(qdict, "command");
+    strList *args = strList_from_string(command, ' ');
+
+    qmp_cpr_exec(args, &err);
+    qapi_free_strList(args);
+    hmp_handle_error(mon, err);
+}
+
 void hmp_cpr_load(Monitor *mon, const QDict *qdict)
 {
     Error *err = NULL;
-- 
1.8.3.1




* [PATCH V8 22/39] cpr: ram block blockers
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (20 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 21/39] cpr: restart HMP interfaces Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 23/39] hostmem-memfd: cpr for memory-backend-memfd Steve Sistare
                   ` (16 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Unlike reboot mode, restart mode cannot save volatile ram blocks in the
vmstate file and recreate them later, because the physical memory for the
blocks is pinned and registered for vfio.  Add a restart-mode blocker for
volatile ram blocks.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/memory.h   |  2 ++
 include/exec/ramblock.h |  1 +
 softmmu/physmem.c       | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 softmmu/vl.c            |  2 ++
 4 files changed, 53 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 6a257a4..812226f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -3039,6 +3039,8 @@ bool ram_block_discard_is_required(void);
 void ram_block_register(RAMBlock *rb);
 void ram_block_unregister(RAMBlock *rb);
 
+void ram_block_add_cpr_blockers(Error **errp);
+
 #endif
 
 #endif
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 6cbedf9..a5cbd9e 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -39,6 +39,7 @@ struct RAMBlock {
     /* RCU-enabled, writes protected by the ramlist lock */
     QLIST_ENTRY(RAMBlock) next;
     QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
+    Error *cpr_blocker;
     int fd;
     size_t page_size;
     /* dirty bitmap used during migration */
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 412cc80..b90ab4e 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1968,6 +1968,53 @@ static bool memory_region_is_backend(MemoryRegion *mr)
     return !!object_dynamic_cast(mr->parent_obj.parent, TYPE_MEMORY_BACKEND);
 }
 
+/*
+ * Return true if ram contents would be lost during cpr for CPR_MODE_RESTART.
+ * Return false for ram_device because it is remapped after restart.  Do not
+ * exclude rom, even though it is readonly, because the rom file could change
+ * in the new qemu.  Return false for non-migratable blocks.  They are either
+ * re-created after restart, or are handled specially, or are covered by a
+ * device-level cpr blocker.  Return false for an fd, because it is visible and
+ * can be remapped in the new process.
+ */
+static bool ram_is_volatile(RAMBlock *rb)
+{
+    MemoryRegion *mr = rb->mr;
+
+    return mr &&
+        memory_region_is_ram(mr) &&
+        !memory_region_is_ram_device(mr) &&
+        (!qemu_ram_is_shared(rb) || ramblock_is_anon(rb)) &&
+        qemu_ram_is_migratable(rb) &&
+        rb->fd < 0;
+}
+
+/*
+ * Add a CPR_MODE_RESTART blocker for each volatile ram block.  This cannot be
+ * performed in ram_block_add because the migratable flag has not been set yet.
+ */
+void ram_block_add_cpr_blockers(Error **errp)
+{
+    RAMBlock *rb;
+
+    RAMBLOCK_FOREACH(rb) {
+        if (ram_is_volatile(rb)) {
+            const char *name = memory_region_name(rb->mr);
+            rb->cpr_blocker = NULL;
+            if (memory_region_is_backend(rb->mr)) {
+                error_setg(&rb->cpr_blocker,
+                    "Memory region %s is volatile. A memory-backend-memfd or"
+                    " memory-backend-file with share=on is required.", name);
+            } else {
+                error_setg(&rb->cpr_blocker,
+                    "Memory region %s is volatile. "
+                    "-cpr-enable restart is required.", name);
+            }
+            cpr_add_blocker(&rb->cpr_blocker, errp, CPR_MODE_RESTART, 0);
+        }
+    }
+}
+
 static void *qemu_anon_memfd_alloc(RAMBlock *rb, size_t maxlen, Error **errp)
 {
     size_t len, align;
@@ -2275,6 +2322,7 @@ void qemu_ram_free(RAMBlock *block)
 
     qemu_mutex_lock_ramlist();
     cpr_delete_memfd(memory_region_name(block->mr));
+    cpr_del_blocker(&block->cpr_blocker);
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
     /* Write list before version */
diff --git a/softmmu/vl.c b/softmmu/vl.c
index ce779cf..3e19c74 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -28,6 +28,7 @@
 #include "qemu/units.h"
 #include "exec/cpu-common.h"
 #include "exec/page-vary.h"
+#include "exec/memory.h"
 #include "hw/qdev-properties.h"
 #include "qapi/compat-policy.h"
 #include "qapi/error.h"
@@ -2569,6 +2570,7 @@ void qmp_x_exit_preconfig(Error **errp)
     qemu_init_board();
     qemu_create_cli_devices();
     qemu_machine_creation_done();
+    ram_block_add_cpr_blockers(&error_fatal);
 
     if (loadvm) {
         load_snapshot(loadvm, NULL, false, NULL, &error_fatal);
-- 
1.8.3.1




* [PATCH V8 23/39] hostmem-memfd: cpr for memory-backend-memfd
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (21 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 22/39] cpr: ram block blockers Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 24/39] pci: export msix_is_pending Steve Sistare
                   ` (15 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Preserve memory-backend-memfd memory objects during cpr.
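
The allocation path follows a find-or-create pattern: look up a saved
descriptor by name, and only create and record a new one if none is found.
A minimal standalone sketch of that pattern follows.  The sketch_* names,
the fixed-size table, and the fake creator are all hypothetical; they only
mimic the shape of cpr_find_fd() and cpr_save_fd() as they are called in the
diff below, and say nothing about how the real table survives exec.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_FDS 16

struct sketch_fd_entry {
    char name[64];
    int id;
    int fd;
};

static struct sketch_fd_entry sketch_fds[SKETCH_MAX_FDS];
static int sketch_nfds;

/* Return the saved descriptor for (name, id), or -1 if there is none. */
static int sketch_find_fd(const char *name, int id)
{
    for (int i = 0; i < sketch_nfds; i++) {
        if (sketch_fds[i].id == id &&
            strcmp(sketch_fds[i].name, name) == 0) {
            return sketch_fds[i].fd;
        }
    }
    return -1;
}

/* Record a descriptor; returns 0 on success, -1 if it does not fit. */
static int sketch_save_fd(const char *name, int id, int fd)
{
    if (sketch_nfds == SKETCH_MAX_FDS ||
        strlen(name) >= sizeof(sketch_fds[0].name)) {
        return -1;
    }
    strcpy(sketch_fds[sketch_nfds].name, name);
    sketch_fds[sketch_nfds].id = id;
    sketch_fds[sketch_nfds].fd = fd;
    sketch_nfds++;
    return 0;
}

/* Stand-in for the real creator; hands out fake descriptor numbers. */
static int sketch_create_calls;
static int sketch_fake_create(void)
{
    return 100 + sketch_create_calls++;
}

/* Reuse a saved descriptor if present, else create one and save it. */
static int sketch_get_backend_fd(const char *name)
{
    int fd = sketch_find_fd(name, 0);

    if (fd < 0) {
        fd = sketch_fake_create();
        if (fd < 0) {
            return -1;
        }
        sketch_save_fd(name, 0, fd);
    }
    return fd;
}
```

Calling sketch_get_backend_fd() twice with the same name invokes the creator
only once, which is the property the backend relies on after restart.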

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 backends/hostmem-memfd.c | 21 ++++++++++++---------
 hmp-commands.hx          |  2 +-
 qapi/cpr.json            |  2 +-
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index c9d8001..2aeb5d1 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -14,6 +14,7 @@
 #include "sysemu/hostmem.h"
 #include "qom/object_interfaces.h"
 #include "qemu/memfd.h"
+#include "migration/cpr.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
 #include "qom/object.h"
@@ -36,23 +37,25 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
     HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend);
     uint32_t ram_flags;
-    char *name;
-    int fd;
+    char *name = host_memory_backend_get_name(backend);
+    int fd = cpr_find_fd(name, 0);
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
         return;
     }
 
-    fd = qemu_memfd_create(TYPE_MEMORY_BACKEND_MEMFD, backend->size,
-                           m->hugetlb, m->hugetlbsize, m->seal ?
-                           F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL : 0,
-                           errp);
-    if (fd == -1) {
-        return;
+    if (fd < 0) {
+        fd = qemu_memfd_create(TYPE_MEMORY_BACKEND_MEMFD, backend->size,
+                               m->hugetlb, m->hugetlbsize, m->seal ?
+                               F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL : 0,
+                               errp);
+        if (fd == -1) {
+            return;
+        }
+        cpr_save_fd(name, 0, fd);
     }
 
-    name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     ram_flags |= RAM_ANON;
diff --git a/hmp-commands.hx b/hmp-commands.hx
index da5dd60..540f9be 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -380,7 +380,7 @@ SRST
 
   If *mode* is 'restart', the checkpoint remains valid after restarting
   qemu using a subsequent cpr-exec.  Guest RAM must be backed by a
-  memory-backend-file with share=on.
+  memory-backend-memfd or memory-backend-file object with share=on.
   To resume from the checkpoint, issue the cpr-load command.
 ERST
 
diff --git a/qapi/cpr.json b/qapi/cpr.json
index 47ee4ff..1ec5aae 100644
--- a/qapi/cpr.json
+++ b/qapi/cpr.json
@@ -41,7 +41,7 @@
 #
 # If @mode is 'restart', the checkpoint remains valid after restarting
 # qemu using a subsequent cpr-exec.  Guest RAM must be backed by a
-# memory-backend-file with share=on.
+# memory-backend-memfd or memory-backend-file object with share=on.
 # To resume from the checkpoint, issue the cpr-load command.
 #
 # @filename: name of checkpoint file
-- 
1.8.3.1




* [PATCH V8 24/39] pci: export msix_is_pending
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (22 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 23/39] hostmem-memfd: cpr for memory-backend-memfd Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-27 22:44   ` Michael S. Tsirkin
  2022-06-15 14:52 ` [PATCH V8 25/39] cpr: notifiers Steve Sistare
                   ` (14 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Export msix_is_pending for use by cpr.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/pci/msix.c         | 2 +-
 include/hw/pci/msix.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331c..e492ce0 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -64,7 +64,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
     return dev->msix_pba + vector / 8;
 }
 
-static int msix_is_pending(PCIDevice *dev, int vector)
+int msix_is_pending(PCIDevice *dev, unsigned int vector)
 {
     return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
 }
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 4c4a60c..0065354 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
 bool msix_is_masked(PCIDevice *dev, unsigned vector);
 void msix_set_pending(PCIDevice *dev, unsigned vector);
 void msix_clr_pending(PCIDevice *dev, int vector);
+int msix_is_pending(PCIDevice *dev, unsigned vector);
 
 int msix_vector_use(PCIDevice *dev, unsigned vector);
 void msix_vector_unuse(PCIDevice *dev, unsigned vector);
-- 
1.8.3.1




* [PATCH V8 25/39] cpr: notifiers
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (23 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 24/39] pci: export msix_is_pending Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 26/39] vfio-pci: refactor for cpr Steve Sistare
                   ` (13 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Add an interface to register notifiers for cpr transitions.  It is used to
support vfio cpr in a subsequent patch.
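
The interface keeps one notifier list per transition state and calls only
the list that matches the transition.  The standalone sketch below shows
that idea with a hand-rolled linked list.  All sketch_* names are
hypothetical, and the list code is a simplification; the patch itself uses
qemu's NotifierList helpers.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_state {
    SKETCH_EXEC,
    SKETCH_SAVE_FAILED,
    SKETCH_LOAD_FAILED,
    SKETCH_NUM
};

struct sketch_notifier {
    void (*cb)(struct sketch_notifier *n, void *data);
    struct sketch_notifier *next;
    int calls;                      /* demo counter, bumped by the callback */
};

/* One list head per transition state. */
static struct sketch_notifier *sketch_lists[SKETCH_NUM];

static void sketch_add(struct sketch_notifier *n,
                       void (*cb)(struct sketch_notifier *n, void *data),
                       enum sketch_state state)
{
    assert((int)state >= 0 && (int)state < SKETCH_NUM);
    n->cb = cb;
    n->next = sketch_lists[state];
    sketch_lists[state] = n;
}

/* Unlink n from whichever list holds it, then clear its callback. */
static void sketch_remove(struct sketch_notifier *n)
{
    for (int s = 0; s < SKETCH_NUM; s++) {
        struct sketch_notifier **pp = &sketch_lists[s];

        for (; *pp; pp = &(*pp)->next) {
            if (*pp == n) {
                *pp = n->next;
                n->next = NULL;
                n->cb = NULL;
                return;
            }
        }
    }
}

/* Invoke every notifier registered for one state. */
static void sketch_call(enum sketch_state state)
{
    struct sketch_notifier *n = sketch_lists[state];

    while (n) {
        struct sketch_notifier *next = n->next;   /* cb may unlink n */
        n->cb(n, NULL);
        n = next;
    }
}

/* Example callback: count how often it was invoked. */
static void sketch_count_cb(struct sketch_notifier *n, void *data)
{
    (void)data;
    n->calls++;
}
```

A notifier registered for SKETCH_SAVE_FAILED is not called for SKETCH_EXEC,
and is not called at all once removed.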

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/cpr.h | 13 +++++++++++++
 migration/cpr.c         | 25 +++++++++++++++++++++++++
 stubs/cpr.c             | 10 ++++++++++
 3 files changed, 48 insertions(+)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index b75dec4..ab5f53e 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -9,6 +9,7 @@
 #define MIGRATION_CPR_H
 
 #include "qapi/qapi-types-cpr.h"
+#include "qemu/notify.h"
 
 void cpr_init(int modes);
 void cpr_set_mode(CprMode mode);
@@ -37,4 +38,16 @@ int cpr_add_blocker(Error **reasonp, Error **errp, CprMode mode, ...);
 int cpr_add_blocker_str(const char *reason, Error **errp, CprMode mode, ...);
 void cpr_del_blocker(Error **reasonp);
 
+typedef enum CprNotifyState {
+    CPR_NOTIFY_EXEC,
+    CPR_NOTIFY_SAVE_FAILED,
+    CPR_NOTIFY_LOAD_FAILED,
+    CPR_NOTIFY_NUM
+} CprNotifyState;
+
+void cpr_add_notifier(Notifier *notify,
+                      void (*cb)(Notifier *notifier, void *data),
+                      CprNotifyState state);
+void cpr_remove_notifier(Notifier *notify);
+
 #endif
diff --git a/migration/cpr.c b/migration/cpr.c
index 8b3fffd..9d6bca4 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -105,6 +105,28 @@ static bool cpr_is_blocked(Error **errp, CprMode mode)
     return false;
 }
 
+static NotifierList cpr_notifiers[CPR_NOTIFY_NUM];
+
+void cpr_add_notifier(Notifier *notify,
+                      void (*cb)(Notifier *notifier, void *data),
+                      CprNotifyState state)
+{
+    assert(state >= 0 && state < CPR_NOTIFY_NUM);
+    notify->notify = cb;
+    notifier_list_add(&cpr_notifiers[state], notify);
+}
+
+void cpr_remove_notifier(Notifier *notify)
+{
+    notifier_remove(notify);
+    notify->notify = NULL;
+}
+
+static void cpr_call_notifiers(CprNotifyState state)
+{
+    notifier_list_notify(&cpr_notifiers[state], 0);
+}
+
 void qmp_cpr_save(const char *filename, CprMode mode, Error **errp)
 {
     int ret;
@@ -142,6 +164,7 @@ void qmp_cpr_save(const char *filename, CprMode mode, Error **errp)
     qemu_fclose(f);
     if (ret < 0) {
         error_setg(errp, "Error %d while saving VM state", ret);
+        cpr_call_notifiers(CPR_NOTIFY_SAVE_FAILED);
         goto err;
     }
 
@@ -182,6 +205,7 @@ void qmp_cpr_exec(strList *args, Error **errp)
         return;
     }
 
+    cpr_call_notifiers(CPR_NOTIFY_EXEC);
     assert(qemu_system_exec_request(args, errp) == 0);
 }
 
@@ -218,6 +242,7 @@ void qmp_cpr_load(const char *filename, CprMode mode, Error **errp)
     qemu_fclose(f);
     if (ret < 0) {
         error_setg(errp, "Error %d while loading VM state", ret);
+        cpr_call_notifiers(CPR_NOTIFY_LOAD_FAILED);
         goto out;
     }
 
diff --git a/stubs/cpr.c b/stubs/cpr.c
index 06a9a1c..9262e78 100644
--- a/stubs/cpr.c
+++ b/stubs/cpr.c
@@ -21,3 +21,13 @@ int cpr_add_blocker_str(const char *reason, Error **errp, CprMode mode, ...)
 void cpr_del_blocker(Error **reasonp)
 {
 }
+
+void cpr_add_notifier(Notifier *notify,
+                      void (*cb)(Notifier *notifier, void *data),
+                      CprNotifyState state)
+{
+}
+
+void cpr_remove_notifier(Notifier *notify)
+{
+}
-- 
1.8.3.1




* [PATCH V8 26/39] vfio-pci: refactor for cpr
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (24 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 25/39] cpr: notifiers Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 27/39] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
                   ` (12 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Refactor vector use into a helper vfio_vector_init.
Add vfio_notifier_init and vfio_notifier_cleanup for named notifiers,
and pass additional arguments to vfio_remove_kvm_msi_virq.

These changes are for use by cpr in a subsequent patch.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 99 +++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 66 insertions(+), 33 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 939dcc3..0143c9a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -52,6 +52,27 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
 
+/* Create new or reuse existing eventfd */
+static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
+                              const char *name, int nr)
+{
+    int fd = -1;   /* placeholder until a subsequent patch */
+    int ret = 0;
+
+    if (fd >= 0) {
+        event_notifier_init_fd(e, fd);
+    } else {
+        ret = event_notifier_init(e, 0);
+    }
+    return ret;
+}
+
+static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
+                                  const char *name, int nr)
+{
+    event_notifier_cleanup(e);
+}
+
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
  * also be a huge overhead.  We try to get the best of both worlds by
@@ -132,8 +153,8 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     pci_irq_deassert(&vdev->pdev);
 
     /* Get an eventfd for resample/unmask */
-    if (event_notifier_init(&vdev->intx.unmask, 0)) {
-        error_setg(errp, "event_notifier_init failed eoi");
+    if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
+        error_setg(errp, "vfio_notifier_init intx-unmask failed");
         goto fail;
     }
 
@@ -165,7 +186,7 @@ fail_vfio:
     kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
                                           vdev->intx.route.irq);
 fail_irqfd:
-    event_notifier_cleanup(&vdev->intx.unmask);
+    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
 fail:
     qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
     vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
@@ -194,7 +215,7 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
     }
 
     /* We only need to close the eventfd for VFIO to cleanup the kernel side */
-    event_notifier_cleanup(&vdev->intx.unmask);
+    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
 
     /* QEMU starts listening for interrupt events. */
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
@@ -285,9 +306,10 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     }
 #endif
 
-    ret = event_notifier_init(&vdev->intx.interrupt, 0);
+    ret = vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
     if (ret) {
-        error_setg_errno(errp, -ret, "event_notifier_init failed");
+        error_setg_errno(errp, -ret,
+                         "vfio_notifier_init intx-interrupt failed");
         return ret;
     }
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -296,7 +318,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->intx.interrupt);
+        vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
         return -errno;
     }
 
@@ -324,7 +346,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
 
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->intx.interrupt);
+    vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
 
     vdev->interrupt = VFIO_INT_NONE;
 
@@ -424,13 +446,15 @@ static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
                                              vector_n, &vdev->pdev);
 }
 
-static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector, int nr)
 {
+    const char *name = "kvm_interrupt";
+
     if (vector->virq < 0) {
         return;
     }
 
-    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+    if (vfio_notifier_init(vector->vdev, &vector->kvm_interrupt, name, nr)) {
         goto fail_notifier;
     }
 
@@ -442,19 +466,20 @@ static void vfio_connect_kvm_msi_virq(VFIOMSIVector *vector)
     return;
 
 fail_kvm:
-    event_notifier_cleanup(&vector->kvm_interrupt);
+    vfio_notifier_cleanup(vector->vdev, &vector->kvm_interrupt, name, nr);
 fail_notifier:
     kvm_irqchip_release_virq(kvm_state, vector->virq);
     vector->virq = -1;
 }
 
-static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+                                     int nr)
 {
     kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
                                           vector->virq);
     kvm_irqchip_release_virq(kvm_state, vector->virq);
     vector->virq = -1;
-    event_notifier_cleanup(&vector->kvm_interrupt);
+    vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
 }
 
 static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
@@ -464,6 +489,20 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
     kvm_irqchip_commit_routes(kvm_state);
 }
 
+static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
+{
+    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+    PCIDevice *pdev = &vdev->pdev;
+
+    vector->vdev = vdev;
+    vector->virq = -1;
+    if (vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr)) {
+        error_report("vfio: vfio_notifier_init interrupt failed");
+    }
+    vector->use = true;
+    msix_vector_use(pdev, nr);
+}
+
 static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                                    MSIMessage *msg, IOHandler *handler)
 {
@@ -476,13 +515,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     vector = &vdev->msi_vectors[nr];
 
     if (!vector->use) {
-        vector->vdev = vdev;
-        vector->virq = -1;
-        if (event_notifier_init(&vector->interrupt, 0)) {
-            error_report("vfio: Error: event_notifier_init failed");
-        }
-        vector->use = true;
-        msix_vector_use(pdev, nr);
+        vfio_vector_init(vdev, nr);
     }
 
     qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -494,7 +527,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
      */
     if (vector->virq >= 0) {
         if (!msg) {
-            vfio_remove_kvm_msi_virq(vector);
+            vfio_remove_kvm_msi_virq(vdev, vector, nr);
         } else {
             vfio_update_kvm_msi_virq(vector, *msg, pdev);
         }
@@ -506,7 +539,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                 vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
                 vfio_add_kvm_msi_virq(vdev, vector, nr, true);
                 kvm_irqchip_commit_route_changes(&vfio_route_change);
-                vfio_connect_kvm_msi_virq(vector);
+                vfio_connect_kvm_msi_virq(vector, nr);
             }
         }
     }
@@ -602,7 +635,7 @@ static void vfio_commit_kvm_msi_virq_batch(VFIOPCIDevice *vdev)
     kvm_irqchip_commit_route_changes(&vfio_route_change);
 
     for (i = 0; i < vdev->nr_vectors; i++) {
-        vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i]);
+        vfio_connect_kvm_msi_virq(&vdev->msi_vectors[i], i);
     }
 }
 
@@ -681,8 +714,8 @@ retry:
         vector->virq = -1;
         vector->use = true;
 
-        if (event_notifier_init(&vector->interrupt, 0)) {
-            error_report("vfio: Error: event_notifier_init failed");
+        if (vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i)) {
+            error_report("vfio: Error: vfio_notifier_init failed");
         }
 
         qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -737,11 +770,11 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
         VFIOMSIVector *vector = &vdev->msi_vectors[i];
         if (vdev->msi_vectors[i].use) {
             if (vector->virq >= 0) {
-                vfio_remove_kvm_msi_virq(vector);
+                vfio_remove_kvm_msi_virq(vdev, vector, i);
             }
             qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
                                 NULL, NULL, NULL);
-            event_notifier_cleanup(&vector->interrupt);
+            vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
         }
     }
 
@@ -2740,7 +2773,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (event_notifier_init(&vdev->err_notifier, 0)) {
+    if (vfio_notifier_init(vdev, &vdev->err_notifier, "err_notifier", 0)) {
         error_report("vfio: Unable to init event notifier for error detection");
         vdev->pci_aer = false;
         return;
@@ -2753,7 +2786,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->err_notifier);
+        vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
         vdev->pci_aer = false;
     }
 }
@@ -2772,7 +2805,7 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
     }
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->err_notifier),
                         NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->err_notifier);
+    vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
 }
 
 static void vfio_req_notifier_handler(void *opaque)
@@ -2806,7 +2839,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (event_notifier_init(&vdev->req_notifier, 0)) {
+    if (vfio_notifier_init(vdev, &vdev->req_notifier, "req_notifier", 0)) {
         error_report("vfio: Unable to init event notifier for device request");
         return;
     }
@@ -2818,7 +2851,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
                            VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->req_notifier);
+        vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
     } else {
         vdev->req_enabled = true;
     }
@@ -2838,7 +2871,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     }
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->req_notifier),
                         NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->req_notifier);
+    vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
 
     vdev->req_enabled = false;
 }
-- 
1.8.3.1




* [PATCH V8 27/39] vfio-pci: cpr part 1 (fd and dma)
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (25 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 26/39] vfio-pci: refactor for cpr Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-29 19:14   ` Alex Williamson
  2022-07-03  8:32   ` Peng Liang
  2022-06-15 14:52 ` [PATCH V8 28/39] vfio-pci: cpr part 2 (msi) Steve Sistare
                   ` (11 subsequent siblings)
  38 siblings, 2 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Enable vfio-pci devices to be saved and restored across an exec restart
of qemu.

At vfio creation time, save the value of vfio container, group, and device
descriptors in cpr state.

In the container pre_save handler, suspend the use of virtual addresses in
DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be
remapped at a different VA after exec.  DMA to already-mapped pages
continues.  Save the msi message area as part of vfio-pci vmstate, save the
interrupt and notifier eventfds in cpr state, and clear the close-on-exec
flag for the vfio descriptors.  The flag is not cleared earlier because the
descriptors should not persist across miscellaneous fork and exec calls
that may be performed during normal operation.
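
For readers unfamiliar with the close-on-exec step, the sketch below shows
one way to set or clear FD_CLOEXEC on a descriptor using the POSIX fcntl
interface.  sketch_set_cloexec is a hypothetical helper written for this
illustration; it is not the function the series uses.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Set (on != 0) or clear (on == 0) FD_CLOEXEC on fd.
 * Returns 0 on success, -1 on error with errno set by fcntl.
 */
static int sketch_set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    if (on) {
        flags |= FD_CLOEXEC;
    } else {
        flags &= ~FD_CLOEXEC;
    }
    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}
```

Clearing the flag only immediately before the exec, as described above,
keeps the descriptor from leaking into unrelated child processes.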

On qemu restart, vfio_realize() finds the saved descriptors, uses
the descriptors, and notes that the device is being reused.  Device and
iommu state is already configured, so operations in vfio_realize that
would modify the configuration are skipped for a reused device, including
vfio ioctls and writes to PCI configuration space.  Vfio PCI device reset
is also suppressed.  The result is that vfio_realize constructs qemu data
structures that reflect the current state of the device.  However, the
reconstruction is not complete until cpr-load is called.  cpr-load loads
the msi data.  The vfio post_load handler finds eventfds in cpr state,
rebuilds vector data structures, and attaches the interrupts to the new KVM
instance.  The container post_load handler then invokes the main vfio
listener callback, which walks the flattened ranges of the vfio address
space and issues VFIO_IOMMU_MAP_DMA with VFIO_DMA_MAP_FLAG_VADDR for each
range to inform the kernel of the new VAs.  Lastly,
cpr-load starts the VM.
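
The two-phase handling of virtual addresses can be pictured with a toy
model: before exec every mapping keeps its iova and size but its vaddr is
marked invalid, and after exec each mapping is given a new vaddr.  The
sketch below is only that, a userspace model with hypothetical sketch_*
names.  It issues no ioctls, and the exact-match rule on iova and size is an
assumption of the model rather than something stated in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one DMA mapping as seen by the iommu driver. */
struct sketch_mapping {
    uint64_t iova;
    uint64_t size;
    void *vaddr;
    bool vaddr_valid;
};

/* Phase 1 (before exec): keep iova translations, invalidate every vaddr. */
static void sketch_suspend_vaddr(struct sketch_mapping *maps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        maps[i].vaddr_valid = false;
    }
}

/*
 * Phase 2 (after exec): supply the new vaddr for an existing mapping.
 * Returns 0 if a mapping with the same iova and size exists, else -1.
 */
static int sketch_update_vaddr(struct sketch_mapping *maps, size_t n,
                               uint64_t iova, uint64_t size, void *new_vaddr)
{
    for (size_t i = 0; i < n; i++) {
        if (maps[i].iova == iova && maps[i].size == size) {
            maps[i].vaddr = new_vaddr;
            maps[i].vaddr_valid = true;
            return 0;
        }
    }
    return -1;
}
```

In the model, a mapping stays unusable for vaddr-based operations until its
range has been revisited, which mirrors why the listener callback must walk
every flattened range before the VM is started.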

This functionality is delivered by 3 patches for clarity.  Part 1 handles
device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
support.  Part 3 adds INTX support.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS                   |   1 +
 hw/pci/pci.c                  |  12 ++++
 hw/vfio/common.c              | 151 +++++++++++++++++++++++++++++++++++-------
 hw/vfio/cpr.c                 | 119 +++++++++++++++++++++++++++++++++
 hw/vfio/meson.build           |   1 +
 hw/vfio/pci.c                 |  44 ++++++++++++
 hw/vfio/trace-events          |   1 +
 include/hw/vfio/vfio-common.h |  11 +++
 include/migration/vmstate.h   |   1 +
 9 files changed, 317 insertions(+), 24 deletions(-)
 create mode 100644 hw/vfio/cpr.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 74a43e6..864aec6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3156,6 +3156,7 @@ CPR
 M: Steve Sistare <steven.sistare@oracle.com>
 M: Mark Kanda <mark.kanda@oracle.com>
 S: Maintained
+F: hw/vfio/cpr.c
 F: include/migration/cpr.h
 F: migration/cpr.c
 F: qapi/cpr.json
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 6e70153..a3b19eb 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -32,6 +32,7 @@
 #include "hw/pci/pci_host.h"
 #include "hw/qdev-properties.h"
 #include "hw/qdev-properties-system.h"
+#include "migration/cpr.h"
 #include "migration/qemu-file-types.h"
 #include "migration/vmstate.h"
 #include "monitor/monitor.h"
@@ -341,6 +342,17 @@ static void pci_reset_regions(PCIDevice *dev)
 
 static void pci_do_device_reset(PCIDevice *dev)
 {
+    /*
+     * A PCI device that is resuming for cpr is already configured, so do
+     * not reset it here when we are called from qemu_system_reset prior to
+     * cpr-load, else interrupts may be lost for vfio-pci devices.  It is
+     * safe to skip this reset for all PCI devices, because cpr-load will set
+     * all fields that would have been set here.
+     */
+    if (cpr_get_mode() == CPR_MODE_RESTART) {
+        return;
+    }
+
     pci_device_deassert_intx(dev);
     assert(dev->irq_state == 0);
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ace9562..c7d73b6 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -31,6 +31,7 @@
 #include "exec/memory.h"
 #include "exec/ram_addr.h"
 #include "hw/hw.h"
+#include "migration/cpr.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/range.h"
@@ -460,6 +461,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
+    assert(!container->reused);
+
     if (iotlb && container->dirty_pages_supported &&
         vfio_devices_all_running_and_saving(container)) {
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
@@ -496,12 +499,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
 {
     struct vfio_iommu_type1_dma_map map = {
         .argsz = sizeof(map),
-        .flags = VFIO_DMA_MAP_FLAG_READ,
         .vaddr = (__u64)(uintptr_t)vaddr,
         .iova = iova,
         .size = size,
     };
 
+    /*
+     * Set the new vaddr for any mappings registered during cpr-load.
+     * Reused is cleared thereafter.
+     */
+    if (container->reused) {
+        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
+        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+            goto fail;
+        }
+        return 0;
+    }
+
+    map.flags = VFIO_DMA_MAP_FLAG_READ;
     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
     }
@@ -517,7 +532,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         return 0;
     }
 
-    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
+fail:
+    error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
+        (container->reused ? "VADDR" : ""), iova, size, vaddr, strerror(errno));
     return -errno;
 }
 
@@ -882,6 +899,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    vfio_container_region_add(container, section);
+}
+
+void vfio_container_region_add(VFIOContainer *container,
+                               MemoryRegionSection *section)
+{
     hwaddr iova, end;
     Int128 llend, llsize;
     void *vaddr;
@@ -1492,6 +1515,12 @@ static void vfio_listener_release(VFIOContainer *container)
     }
 }
 
+void vfio_listener_register(VFIOContainer *container)
+{
+    container->listener = vfio_memory_listener;
+    memory_listener_register(&container->listener, container->space->as);
+}
+
 static struct vfio_info_cap_header *
 vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
 {
@@ -1910,6 +1939,22 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
 {
     int iommu_type, ret;
 
+    /*
+     * If container is reused, just set its type and skip the ioctls, as the
+     * container and group are already configured in the kernel.
+     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
+     */
+    if (container->reused) {
+        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU)) {
+            container->iommu_type = VFIO_TYPE1v2_IOMMU;
+            return 0;
+        } else {
+            error_setg(errp, "container was reused but VFIO_TYPE1v2_IOMMU "
+                             "is not supported");
+            return -ENOTSUP;
+        }
+    }
+
     iommu_type = vfio_get_iommu_type(container, errp);
     if (iommu_type < 0) {
         return iommu_type;
@@ -2014,9 +2059,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 {
     VFIOContainer *container;
     int ret, fd;
+    bool reused;
     VFIOAddressSpace *space;
 
     space = vfio_get_address_space(as);
+    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
+    reused = (fd >= 0);
 
     /*
      * VFIO is currently incompatible with discarding of RAM insofar as the
@@ -2049,27 +2097,47 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
      * details once we know which type of IOMMU we are using.
      */
 
+    /*
+     * If the container is reused, then the group is already attached in the
+     * kernel.  If a container with matching fd is found, then update the
+     * userland group list and return.  If not, then after the loop, create
+     * the container struct and group list.
+     */
+
     QLIST_FOREACH(container, &space->containers, next) {
-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
-            ret = vfio_ram_block_discard_disable(container, true);
-            if (ret) {
-                error_setg_errno(errp, -ret,
-                                 "Cannot set discarding of RAM broken");
-                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
-                          &container->fd)) {
-                    error_report("vfio: error disconnecting group %d from"
-                                 " container", group->groupid);
-                }
-                return ret;
+        if (reused) {
+            if (container->fd != fd) {
+                continue;
             }
-            group->container = container;
-            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+            continue;
+        }
+
+        ret = vfio_ram_block_discard_disable(container, true);
+        if (ret) {
+            error_setg_errno(errp, -ret,
+                             "Cannot set discarding of RAM broken");
+            if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
+                      &container->fd)) {
+                error_report("vfio: error disconnecting group %d from"
+                             " container", group->groupid);
+            }
+            return ret;
+        }
+        group->container = container;
+        QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+        if (!reused) {
             vfio_kvm_device_add_group(group);
-            return 0;
+            cpr_save_fd("vfio_container_for_group", group->groupid,
+                        container->fd);
         }
+        return 0;
+    }
+
+    if (!reused) {
+        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
     }
 
-    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
     if (fd < 0) {
         error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
         ret = -errno;
@@ -2087,6 +2155,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container = g_malloc0(sizeof(*container));
     container->space = space;
     container->fd = fd;
+    container->reused = reused;
     container->error = NULL;
     container->dirty_pages_supported = false;
     container->dma_max_mappings = 0;
@@ -2099,6 +2168,11 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         goto free_container_exit;
     }
 
+    ret = vfio_cpr_register_container(container, errp);
+    if (ret) {
+        goto free_container_exit;
+    }
+
     ret = vfio_ram_block_discard_disable(container, true);
     if (ret) {
         error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
@@ -2213,9 +2287,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
 
-    container->listener = vfio_memory_listener;
-
-    memory_listener_register(&container->listener, container->space->as);
+    /*
+     * If reused, register the listener later, after all state that may
+     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
+     * the listener will invoke its callback on each flat section and call
+     * vfio_dma_map to supply the new vaddr, and the calls will match the
+     * mappings remembered by the kernel.
+     */
+    if (!reused) {
+        vfio_listener_register(container);
+    }
 
     if (container->error) {
         ret = -1;
@@ -2225,8 +2306,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }
 
     container->initialized = true;
+    ret = cpr_resave_fd("vfio_container_for_group", group->groupid, fd, errp);
 
-    return 0;
+    return ret;
 listener_release_exit:
     QLIST_REMOVE(group, container_next);
     QLIST_REMOVE(container, next);
@@ -2254,6 +2336,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
+    cpr_delete_fd("vfio_container_for_group", group->groupid);
 
     /*
      * Explicitly release the listener first before unset container,
@@ -2290,6 +2373,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
         }
 
         trace_vfio_disconnect_container(container->fd);
+        vfio_cpr_unregister_container(container);
         close(container->fd);
         g_free(container);
 
@@ -2319,7 +2403,12 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
     group = g_malloc0(sizeof(*group));
 
     snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open_old(path, O_RDWR);
+
+    group->fd = cpr_find_fd("vfio_group", groupid);
+    if (group->fd < 0) {
+        group->fd = qemu_open_old(path, O_RDWR);
+    }
+
     if (group->fd < 0) {
         error_setg_errno(errp, errno, "failed to open %s", path);
         goto free_group_exit;
@@ -2353,6 +2442,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 
     QLIST_INSERT_HEAD(&vfio_group_list, group, next);
 
+    if (cpr_resave_fd("vfio_group", groupid, group->fd, errp)) {
+        goto close_fd_exit;
+    }
+
     return group;
 
 close_fd_exit:
@@ -2377,6 +2470,7 @@ void vfio_put_group(VFIOGroup *group)
     vfio_disconnect_container(group);
     QLIST_REMOVE(group, next);
     trace_vfio_put_group(group->fd);
+    cpr_delete_fd("vfio_group", group->groupid);
     close(group->fd);
     g_free(group);
 
@@ -2390,8 +2484,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 {
     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     int ret, fd;
+    bool reused;
+
+    fd = cpr_find_fd(name, 0);
+    reused = (fd >= 0);
+    if (!reused) {
+        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    }
 
-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
     if (fd < 0) {
         error_setg_errno(errp, errno, "error getting device from group %d",
                          group->groupid);
@@ -2436,12 +2536,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
     vbasedev->num_irqs = dev_info.num_irqs;
     vbasedev->num_regions = dev_info.num_regions;
     vbasedev->flags = dev_info.flags;
+    vbasedev->reused = reused;
 
     trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
                           dev_info.num_irqs);
 
     vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
-    return 0;
+    ret = cpr_resave_fd(name, 0, fd, errp);
+    return ret;
 }
 
 void vfio_put_base_device(VFIODevice *vbasedev)
@@ -2452,6 +2554,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
     QLIST_REMOVE(vbasedev, next);
     vbasedev->group = NULL;
     trace_vfio_put_base_device(vbasedev->fd);
+    cpr_delete_fd(vbasedev->name, 0);
     close(vbasedev->fd);
 }
 
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
new file mode 100644
index 0000000..a227d5e
--- /dev/null
+++ b/hw/vfio/cpr.c
@@ -0,0 +1,119 @@
+/*
+ * Copyright (c) 2021, 2022 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "hw/vfio/vfio-common.h"
+#include "sysemu/kvm.h"
+#include "qapi/error.h"
+#include "migration/cpr.h"
+#include "migration/vmstate.h"
+#include "trace.h"
+
+static int
+vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
+        .iova = 0,
+        .size = 0,
+    };
+    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
+        return -errno;
+    }
+    container->vaddr_unmapped = true;
+    return 0;
+}
+
+static bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
+{
+    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
+        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
+        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
+                         "or VFIO_UNMAP_ALL");
+        return false;
+    } else {
+        return true;
+    }
+}
+
+static bool vfio_vmstate_needed(void *opaque)
+{
+    return cpr_get_mode() == CPR_MODE_RESTART;
+}
+
+static int vfio_container_pre_save(void *opaque)
+{
+    VFIOContainer *container = (VFIOContainer *)opaque;
+    Error *err = NULL;
+
+    if (!vfio_is_cpr_capable(container, &err) ||
+        vfio_dma_unmap_vaddr_all(container, &err)) {
+        error_report_err(err);
+        return -1;
+    }
+    return 0;
+}
+
+static int vfio_container_post_load(void *opaque, int version_id)
+{
+    VFIOContainer *container = (VFIOContainer *)opaque;
+    VFIOGroup *group;
+    Error *err = NULL;
+    VFIODevice *vbasedev;
+
+    if (!vfio_is_cpr_capable(container, &err)) {
+        error_report_err(err);
+        return -1;
+    }
+
+    vfio_listener_register(container);
+    container->reused = false;
+
+    QLIST_FOREACH(group, &container->group_list, container_next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vbasedev->reused = false;
+        }
+    }
+    return 0;
+}
+
+static const VMStateDescription vfio_container_vmstate = {
+    .name = "vfio-container",
+    .unmigratable = 1,
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .pre_save = vfio_container_pre_save,
+    .post_load = vfio_container_post_load,
+    .needed = vfio_vmstate_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+int vfio_cpr_register_container(VFIOContainer *container, Error **errp)
+{
+    container->cpr_blocker = NULL;
+    if (!vfio_is_cpr_capable(container, &container->cpr_blocker)) {
+        return cpr_add_blocker(&container->cpr_blocker, errp,
+                               CPR_MODE_RESTART, 0);
+    }
+
+    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
+
+    return 0;
+}
+
+void vfio_cpr_unregister_container(VFIOContainer *container)
+{
+    cpr_del_blocker(&container->cpr_blocker);
+
+    vmstate_unregister(NULL, &vfio_container_vmstate, container);
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af29..e247b2b 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -5,6 +5,7 @@ vfio_ss.add(files(
   'migration.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
+  'cpr.c',
   'display.c',
   'pci-quirks.c',
   'pci.c',
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 0143c9a..237231b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -30,6 +30,7 @@
 #include "hw/qdev-properties-system.h"
 #include "migration/vmstate.h"
 #include "qapi/qmp/qdict.h"
+#include "migration/cpr.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/module.h"
@@ -2514,6 +2515,7 @@ const VMStateDescription vmstate_vfio_pci_config = {
     .name = "VFIOPCIDevice",
     .version_id = 1,
     .minimum_version_id = 1,
+    .priority = MIG_PRI_VFIO_PCI,   /* must load before container */
     .fields = (VMStateField[]) {
         VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
         VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
@@ -3243,6 +3245,11 @@ static void vfio_pci_reset(DeviceState *dev)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(dev);
 
+    /* Do not reset the device during qemu_system_reset prior to cpr-load */
+    if (vdev->vbasedev.reused) {
+        return;
+    }
+
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
     vfio_pci_pre_reset(vdev);
@@ -3350,6 +3357,42 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+/*
+ * The kernel may change non-emulated config bits.  Exclude them from the
+ * changed-bits check in get_pci_config_device.
+ */
+static int vfio_pci_pre_load(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+    int size = MIN(pci_config_size(pdev), vdev->config_size);
+    int i;
+
+    for (i = 0; i < size; i++) {
+        pdev->cmask[i] &= vdev->emulated_config_bits[i];
+    }
+
+    return 0;
+}
+
+static bool vfio_pci_needed(void *opaque)
+{
+    return cpr_get_mode() == CPR_MODE_RESTART;
+}
+
+static const VMStateDescription vfio_pci_vmstate = {
+    .name = "vfio-pci",
+    .unmigratable = 1,
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .priority = MIG_PRI_VFIO_PCI,       /* must load before container */
+    .pre_load = vfio_pci_pre_load,
+    .needed = vfio_pci_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3357,6 +3400,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     device_class_set_props(dc, vfio_pci_dev_properties);
+    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 73dffe9..a6d0034 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -119,6 +119,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
+vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index e573f5a..17ad9ba 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -81,10 +81,14 @@ typedef struct VFIOContainer {
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     MemoryListener listener;
     MemoryListener prereg_listener;
+    Notifier cpr_notifier;
+    Error *cpr_blocker;
     unsigned iommu_type;
     Error *error;
     bool initialized;
     bool dirty_pages_supported;
+    bool reused;
+    bool vaddr_unmapped;
     uint64_t dirty_pgsizes;
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
@@ -136,6 +140,7 @@ typedef struct VFIODevice {
     bool no_mmap;
     bool ram_block_discard_allowed;
     bool enable_migration;
+    bool reused;
     VFIODeviceOps *ops;
     unsigned int num_irqs;
     unsigned int num_regions;
@@ -213,6 +218,9 @@ void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
 
+int vfio_cpr_register_container(VFIOContainer *container, Error **errp);
+void vfio_cpr_unregister_container(VFIOContainer *container);
+
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
 extern VFIOGroupList vfio_group_list;
@@ -234,6 +242,9 @@ struct vfio_info_cap_header *
 vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
+void vfio_listener_register(VFIOContainer *container);
+void vfio_container_region_add(VFIOContainer *container,
+                               MemoryRegionSection *section);
 
 int vfio_spapr_create_window(VFIOContainer *container,
                              MemoryRegionSection *section,
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index ad24aa1..19f1538 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -157,6 +157,7 @@ typedef enum {
     MIG_PRI_GICV3_ITS,          /* Must happen before PCI devices */
     MIG_PRI_GICV3,              /* Must happen before the ITS */
     MIG_PRI_MAX,
+    MIG_PRI_VFIO_PCI = MIG_PRI_IOMMU,
 } MigrationPriority;
 
 struct VMStateField {
-- 
1.8.3.1




* [PATCH V8 28/39] vfio-pci: cpr part 2 (msi)
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (26 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 27/39] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-29 20:19   ` Alex Williamson
  2022-06-15 14:52 ` [PATCH V8 29/39] vfio-pci: cpr part 3 (intx) Steve Sistare
                   ` (10 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Finish cpr for vfio-pci MSI/MSI-X devices by preserving eventfds and
vector state.
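
As an illustration only (not part of this patch): each eventfd is recorded
under a (name, id) key, and new qemu looks the key up before creating a fresh
descriptor.  The sketch below models that lookup with a hypothetical in-memory
list.  The sketch_* names are invented here; the real helpers used by this
series are cpr_find_fd, cpr_resave_fd and cpr_delete_fd, whose implementation
is not shown in this message and may well differ.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the cpr fd table; not QEMU's implementation. */
typedef struct SketchFd {
    char *name;
    int id;
    int fd;
    struct SketchFd *next;
} SketchFd;

static SketchFd *sketch_fds;

/* Build "<device>_<event>", mirroring EVENT_FD_NAME.  Caller frees. */
static char *sketch_event_fd_name(const char *dev, const char *event)
{
    size_t len = strlen(dev) + strlen(event) + 2;   /* '_' and NUL */
    char *s = malloc(len);

    assert(s);
    snprintf(s, len, "%s_%s", dev, event);
    return s;
}

/* Return the link slot holding (name, id), or the tail slot if absent. */
static SketchFd **sketch_lookup(const char *name, int id)
{
    SketchFd **p;

    for (p = &sketch_fds; *p; p = &(*p)->next) {
        if ((*p)->id == id && !strcmp((*p)->name, name)) {
            break;
        }
    }
    return p;
}

/* Return the fd saved under (name, id), or -1 if none was saved. */
static int sketch_find_fd(const char *name, int id)
{
    SketchFd *e = *sketch_lookup(name, id);

    return e ? e->fd : -1;
}

/* Remember fd under (name, id), replacing any previous value. */
static void sketch_save_fd(const char *name, int id, int fd)
{
    SketchFd **p = sketch_lookup(name, id);

    if (!*p) {
        SketchFd *e = calloc(1, sizeof(*e));

        assert(e);
        e->name = malloc(strlen(name) + 1);
        assert(e->name);
        strcpy(e->name, name);
        e->id = id;
        *p = e;             /* append; e->next is NULL from calloc */
    }
    (*p)->fd = fd;
}

/* Forget the entry for (name, id), if any.  The fd itself is not closed. */
static void sketch_delete_fd(const char *name, int id)
{
    SketchFd **p = sketch_lookup(name, id);
    SketchFd *e = *p;

    if (e) {
        *p = e->next;
        free(e->name);
        free(e);
    }
}
```

In this model, a caller in the position of vfio_notifier_init would try
sketch_find_fd first and fall back to creating and saving a new descriptor
only when it returns -1.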

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 121 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 237231b..2fd7121 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -53,17 +53,53 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
 
+#define EVENT_FD_NAME(vdev, name)   \
+    g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
+
+static int save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+                         EventNotifier *ev)
+{
+    int fd = event_notifier_get_fd(ev);
+
+    if (fd >= 0) {
+        Error *err = NULL;
+        g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+
+        if (cpr_resave_fd(fdname, nr, fd, &err)) {
+            error_report_err(err);
+            return 1;
+        }
+    }
+    return 0;
+}
+
+static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+    int fd = cpr_find_fd(fdname, nr);
+    return fd;
+}
+
+static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+    cpr_delete_fd(fdname, nr);
+}
+
 /* Create new or reuse existing eventfd */
 static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
                               const char *name, int nr)
 {
-    int fd = -1;   /* placeholder until a subsequent patch */
     int ret = 0;
+    int fd = load_event_fd(vdev, name, nr);
 
     if (fd >= 0) {
         event_notifier_init_fd(e, fd);
     } else {
         ret = event_notifier_init(e, 0);
+        if (!ret) {
+            save_event_fd(vdev, name, nr, e);
+        }
     }
     return ret;
 }
@@ -71,6 +107,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
 static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
                                   const char *name, int nr)
 {
+    delete_event_fd(vdev, name, nr);
     event_notifier_cleanup(e);
 }
 
@@ -511,6 +548,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     VFIOMSIVector *vector;
     int ret;
 
+    /*
+     * Ignore the callback from msix_set_vector_notifiers during resume.
+     * The necessary subset of these actions is called from vfio_claim_vectors
+     * during post load.
+     */
+    if (vdev->vbasedev.reused) {
+        return 0;
+    }
+
     trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
 
     vector = &vdev->msi_vectors[nr];
@@ -2784,6 +2830,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->err_notifier);
     qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
 
+    /* Do not alter irq_signaling during vfio_realize for cpr */
+    if (vdev->vbasedev.reused) {
+        return;
+    }
+
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -2849,6 +2900,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->req_notifier);
     qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
 
+    /* Do not alter irq_signaling during vfio_realize for cpr */
+    if (vdev->vbasedev.reused) {
+        vdev->req_enabled = true;
+        return;
+    }
+
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
                            VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -3357,6 +3414,43 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
+{
+    int i, fd;
+    bool pending = false;
+    PCIDevice *pdev = &vdev->pdev;
+
+    vdev->nr_vectors = nr_vectors;
+    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
+    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
+
+    for (i = 0; i < nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+
+        fd = load_event_fd(vdev, "interrupt", i);
+        if (fd >= 0) {
+            vfio_vector_init(vdev, i);
+            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+        }
+
+        if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
+            vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
+            vfio_add_kvm_msi_virq(vdev, vector, i, msix);
+            kvm_irqchip_commit_route_changes(&vfio_route_change);
+            vfio_connect_kvm_msi_virq(vector, i);
+        }
+
+        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
+            set_bit(i, vdev->msix->pending);
+            pending = true;
+        }
+    }
+
+    if (msix) {
+        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
+    }
+}
+
 /*
  * The kernel may change non-emulated config bits.  Exclude them from the
  * changed-bits check in get_pci_config_device.
@@ -3375,6 +3469,29 @@ static int vfio_pci_pre_load(void *opaque)
     return 0;
 }
 
+static int vfio_pci_post_load(void *opaque, int version_id)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+    int nr_vectors;
+
+    if (msix_enabled(pdev)) {
+        msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
+                                   vfio_msix_vector_release, NULL);
+        nr_vectors = vdev->msix->entries;
+        vfio_claim_vectors(vdev, nr_vectors, true);
+
+    } else if (msi_enabled(pdev)) {
+        nr_vectors = msi_nr_vectors_allocated(pdev);
+        vfio_claim_vectors(vdev, nr_vectors, false);
+
+    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+        assert(0);      /* completed in a subsequent patch */
+    }
+
+    return 0;
+}
+
 static bool vfio_pci_needed(void *opaque)
 {
     return cpr_get_mode() == CPR_MODE_RESTART;
@@ -3387,8 +3504,11 @@ static const VMStateDescription vfio_pci_vmstate = {
     .minimum_version_id = 0,
     .priority = MIG_PRI_VFIO_PCI,       /* must load before container */
     .pre_load = vfio_pci_pre_load,
+    .post_load = vfio_pci_post_load,
     .needed = vfio_pci_needed,
     .fields = (VMStateField[]) {
+        VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+        VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
         VMSTATE_END_OF_LIST()
     }
 };
-- 
1.8.3.1




* [PATCH V8 29/39] vfio-pci: cpr part 3 (intx)
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (27 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 28/39] vfio-pci: cpr part 2 (msi) Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-29 20:43   ` Alex Williamson
  2022-06-15 14:52 ` [PATCH V8 30/39] vfio-pci: recover from unmap-all-vaddr failure Steve Sistare
                   ` (9 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Preserve vfio INTx state across cpr restart.  Preserve the VFIOINTx fields
as follows:
  pin : Recover this from the vfio config in kernel space.
  interrupt : Preserve its eventfd descriptor across exec.
  unmask : Ditto.
  route.irq : This could perhaps be recovered in vfio_pci_post_load by
    calling pci_device_route_intx_to_irq(pin), whose implementation reads
    config space for a bridge device such as ich9.  However, there is no
    guarantee that the bridge vmstate is read before vfio vmstate.  Rather
    than fiddling with MigrationPriority for vmstate handlers, explicitly
    save route.irq in vfio vmstate.
  pending : Save in vfio vmstate.
  mmap_timeout, mmap_timer : Re-initialize.
  kvm_accel : Re-initialize.

In vfio_realize, defer calling vfio_intx_enable until the vmstate
is available, in vfio_pci_post_load.  Modify vfio_intx_enable and
vfio_intx_kvm_enable to skip vfio initialization, but still perform
kvm initialization.
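
A toy model of the resulting control flow, for illustration only.  SketchDev
and the sketch_* functions are invented here and merely count which
initialization steps run; they are not the vfio code.  The assumption, taken
from the description above, is that a reused device skips vfio initialization,
still performs kvm initialization, and is enabled from post-load rather than
from realize.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device model; the field names are not those of VFIOPCIDevice. */
typedef struct SketchDev {
    bool reused;        /* device fd was inherited from the old process */
    bool intx_enabled;
    int vfio_inits;     /* times the kernel vfio trigger was programmed */
    int kvm_inits;      /* times the kvm irqfd was attached */
    int route_irq;      /* restored from vmstate on the resume path */
} SketchDev;

static void sketch_intx_enable(SketchDev *d)
{
    if (!d->reused) {
        /* Fresh start: the kernel does not know the eventfd yet. */
        d->vfio_inits++;
    }
    /* Both paths: kvm state belongs to this process (when kvm applies). */
    d->kvm_inits++;
    d->intx_enabled = true;
}

static void sketch_realize(SketchDev *d)
{
    /* On resume, routing data is not loaded yet, so defer the enable. */
    if (!d->reused) {
        sketch_intx_enable(d);
    }
}

static void sketch_post_load(SketchDev *d, int saved_route_irq)
{
    /* route.irq was saved explicitly, so it is available here. */
    d->route_irq = saved_route_irq;
    sketch_intx_enable(d);
}
```

The model leaves the reused flag set after post-load; in the series it is
cleared later, from the container's post-load handler.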

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 83 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2fd7121..b8aee91 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -173,14 +173,45 @@ static void vfio_intx_eoi(VFIODevice *vbasedev)
     vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
 }
 
+#ifdef CONFIG_KVM
+static bool vfio_no_kvm_intx(VFIOPCIDevice *vdev)
+{
+    return vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
+           vdev->intx.route.mode != PCI_INTX_ENABLED ||
+           !kvm_resamplefds_enabled();
+}
+#endif
+
+static void vfio_intx_reenable_kvm(VFIOPCIDevice *vdev, Error **errp)
+{
+#ifdef CONFIG_KVM
+    if (vfio_no_kvm_intx(vdev)) {
+        return;
+    }
+
+    if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
+        error_setg(errp, "vfio_notifier_init intx-unmask failed");
+        return;
+    }
+
+    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
+                                           &vdev->intx.interrupt,
+                                           &vdev->intx.unmask,
+                                           vdev->intx.route.irq)) {
+        error_setg_errno(errp, errno, "failed to setup resample irqfd");
+        return;
+    }
+
+    vdev->intx.kvm_accel = true;
+#endif
+}
+
 static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
 {
 #ifdef CONFIG_KVM
     int irq_fd = event_notifier_get_fd(&vdev->intx.interrupt);
 
-    if (vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
-        vdev->intx.route.mode != PCI_INTX_ENABLED ||
-        !kvm_resamplefds_enabled()) {
+    if (vfio_no_kvm_intx(vdev)) {
         return;
     }
 
@@ -328,7 +359,13 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
         return 0;
     }
 
-    vfio_disable_interrupts(vdev);
+    /*
+     * Do not alter interrupt state during vfio_realize and cpr-load.  The
+     * reused flag is cleared thereafter.
+     */
+    if (!vdev->vbasedev.reused) {
+        vfio_disable_interrupts(vdev);
+    }
 
     vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
     pci_config_set_interrupt_pin(vdev->pdev.config, pin);
@@ -353,6 +390,11 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
 
+    if (vdev->vbasedev.reused) {
+        vfio_intx_reenable_kvm(vdev, &err);
+        goto finish;
+    }
+
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
@@ -365,6 +407,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
         warn_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
     }
 
+finish:
     vdev->interrupt = VFIO_INT_INTx;
 
     trace_vfio_intx_enable(vdev->vbasedev.name);
@@ -3195,9 +3238,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
                                              vfio_intx_routing_notifier);
         vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
         kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
-        ret = vfio_intx_enable(vdev, errp);
-        if (ret) {
-            goto out_deregister;
+
+        /* Wait until cpr-load reads intx routing data to enable */
+        if (!vdev->vbasedev.reused) {
+            ret = vfio_intx_enable(vdev, errp);
+            if (ret) {
+                goto out_deregister;
+            }
         }
     }
 
@@ -3474,6 +3521,7 @@ static int vfio_pci_post_load(void *opaque, int version_id)
     VFIOPCIDevice *vdev = opaque;
     PCIDevice *pdev = &vdev->pdev;
     int nr_vectors;
+    int ret = 0;
 
     if (msix_enabled(pdev)) {
         msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
@@ -3486,10 +3534,35 @@ static int vfio_pci_post_load(void *opaque, int version_id)
         vfio_claim_vectors(vdev, nr_vectors, false);
 
     } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
-        assert(0);      /* completed in a subsequent patch */
+        Error *err = NULL;
+        ret = vfio_intx_enable(vdev, &err);
+        if (ret) {
+            error_report_err(err);
+        }
     }
 
-    return 0;
+    return ret;
+}
+
+static const VMStateDescription vfio_intx_vmstate = {
+    .name = "vfio-intx",
+    .unmigratable = 1,
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .fields = (VMStateField[]) {
+        VMSTATE_BOOL(pending, VFIOINTx),
+        VMSTATE_UINT32(route.mode, VFIOINTx),
+        VMSTATE_INT32(route.irq, VFIOINTx),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+#define VMSTATE_VFIO_INTX(_field, _state) {                         \
+    .name       = (stringify(_field)),                              \
+    .size       = sizeof(VFIOINTx),                                 \
+    .vmsd       = &vfio_intx_vmstate,                               \
+    .flags      = VMS_STRUCT,                                       \
+    .offset     = vmstate_offset_value(_state, _field, VFIOINTx),   \
 }
 
 static bool vfio_pci_needed(void *opaque)
@@ -3509,6 +3582,7 @@ static const VMStateDescription vfio_pci_vmstate = {
     .fields = (VMStateField[]) {
         VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
         VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
+        VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
         VMSTATE_END_OF_LIST()
     }
 };
-- 
1.8.3.1
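
Editorial sketch, not part of the patch: the control flow described above can
be modelled with a toy device.  All names below (ToyDevice, toy_realize,
toy_post_load, toy_intx_enable) are hypothetical stand-ins, and the counters
only record which initialization steps ran.  The sketch assumes that "reused"
is set by the caller before realize and cleared elsewhere after the load
completes, and it ignores the interrupt-pin check made by the real post_load.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device; not the real VFIOPCIDevice. */
typedef struct ToyDevice {
    bool reused;           /* true while a preserved device is being restored */
    bool intx_enabled;
    int vfio_setup_calls;  /* kernel-side (vfio) interrupt setup performed */
    int kvm_setup_calls;   /* kvm irqfd setup performed */
} ToyDevice;

/* Enable INTx: skip the vfio side for a reused device, always do the kvm side. */
static void toy_intx_enable(ToyDevice *dev)
{
    if (!dev->reused) {
        dev->vfio_setup_calls++;
    }
    dev->kvm_setup_calls++;
    dev->intx_enabled = true;
}

/* Realize: a reused device waits for post_load, when routing data is known. */
static void toy_realize(ToyDevice *dev)
{
    if (!dev->reused) {
        toy_intx_enable(dev);
    }
}

/* Post-load: routing data has been read, so the deferred enable can run. */
static void toy_post_load(ToyDevice *dev)
{
    toy_intx_enable(dev);
}
```

A fresh device is fully set up in toy_realize.  A reused device is left alone
until toy_post_load, which then performs only the kvm half.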




* [PATCH V8 30/39] vfio-pci: recover from unmap-all-vaddr failure
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (28 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 29/39] vfio-pci: cpr part 3 (intx) Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-29 22:58   ` Alex Williamson
  2022-06-15 14:52 ` [PATCH V8 31/39] vhost: reset vhost devices for cpr Steve Sistare
                   ` (8 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

If vfio_cpr_save fails to unmap all vaddr's, then recover by walking all
flat sections to restore the vaddr for each.  Do so by invoking the
vfio listener callback, and passing a new "replay" flag that tells it
to replay a mapping without allocating new userland data structures.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/common.c              | 66 ++++++++++++++++++++++++++++++++-----------
 hw/vfio/cpr.c                 | 29 +++++++++++++++++++
 include/hw/vfio/vfio-common.h |  2 +-
 3 files changed, 80 insertions(+), 17 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c7d73b6..5f2bd50 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -895,15 +895,35 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
     return true;
 }
 
+static VFIORamDiscardListener *vfio_find_ram_discard_listener(
+    VFIOContainer *container, MemoryRegionSection *section)
+{
+    VFIORamDiscardListener *vrdl = NULL;
+
+    QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
+        if (vrdl->mr == section->mr &&
+            vrdl->offset_within_address_space ==
+            section->offset_within_address_space) {
+            break;
+        }
+    }
+
+    if (!vrdl) {
+        hw_error("vfio: Trying to sync missing RAM discard listener");
+        /* does not return */
+    }
+    return vrdl;
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-    vfio_container_region_add(container, section);
+    vfio_container_region_add(container, section, false);
 }
 
 void vfio_container_region_add(VFIOContainer *container,
-                               MemoryRegionSection *section)
+                               MemoryRegionSection *section, bool replay)
 {
     hwaddr iova, end;
     Int128 llend, llsize;
@@ -1033,6 +1053,23 @@ void vfio_container_region_add(VFIOContainer *container,
         int iommu_idx;
 
         trace_vfio_listener_region_add_iommu(iova, end);
+
+        if (replay) {
+            hwaddr as_offset = section->offset_within_address_space;
+            hwaddr iommu_offset = as_offset - section->offset_within_region;
+
+            QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+                if (giommu->iommu_mr == iommu_mr &&
+                    giommu->iommu_offset == iommu_offset) {
+                    memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
+                    return;
+                }
+            }
+            error_report("Container cannot find iommu region %s offset %"
+                HWADDR_PRIx, memory_region_name(section->mr), iommu_offset);
+            goto fail;
+        }
+
         /*
          * FIXME: For VFIO iommu types which have KVM acceleration to
          * avoid bouncing all map/unmaps through qemu this way, this
@@ -1083,7 +1120,15 @@ void vfio_container_region_add(VFIOContainer *container,
      * about changes.
      */
     if (memory_region_has_ram_discard_manager(section->mr)) {
-        vfio_register_ram_discard_listener(container, section);
+        if (replay) {
+            VFIORamDiscardListener *vrdl =
+                vfio_find_ram_discard_listener(container, section);
+            if (vfio_ram_discard_notify_populate(&vrdl->listener, section)) {
+                error_report("vfio_ram_discard_notify_populate failed");
+            }
+        } else {
+            vfio_register_ram_discard_listener(container, section);
+        }
         return;
     }
 
@@ -1417,19 +1462,8 @@ static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
                                                    MemoryRegionSection *section)
 {
     RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
-    VFIORamDiscardListener *vrdl = NULL;
-
-    QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
-        if (vrdl->mr == section->mr &&
-            vrdl->offset_within_address_space ==
-            section->offset_within_address_space) {
-            break;
-        }
-    }
-
-    if (!vrdl) {
-        hw_error("vfio: Trying to sync missing RAM discard listener");
-    }
+    VFIORamDiscardListener *vrdl =
+        vfio_find_ram_discard_listener(container, section);
 
     /*
      * We only want/can synchronize the bitmap for actually mapped parts -
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index a227d5e..2b5e77c 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -32,6 +32,15 @@ vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
     return 0;
 }
 
+static int
+vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
+{
+    VFIOContainer *container = handle;
+    vfio_container_region_add(container, section, true);
+    container->vaddr_unmapped = false;
+    return 0;
+}
+
 static bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
 {
     if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
@@ -98,6 +107,22 @@ static const VMStateDescription vfio_container_vmstate = {
     }
 };
 
+static void vfio_cpr_save_failed_notifier(Notifier *notifier, void *data)
+{
+    Error *err = NULL;
+    VFIOContainer *container =
+        container_of(notifier, VFIOContainer, cpr_notifier);
+
+    /* Set reused so vfio_dma_map restores vaddr */
+    container->reused = true;
+    if (address_space_flat_for_each_section(container->space->as,
+                                            vfio_region_remap,
+                                            container, &err)) {
+        error_report_err(err);
+    }
+    container->reused = false;
+}
+
 int vfio_cpr_register_container(VFIOContainer *container, Error **errp)
 {
     container->cpr_blocker = NULL;
@@ -108,6 +133,8 @@ int vfio_cpr_register_container(VFIOContainer *container, Error **errp)
 
     vmstate_register(NULL, -1, &vfio_container_vmstate, container);
 
+    cpr_add_notifier(&container->cpr_notifier, vfio_cpr_save_failed_notifier,
+                     CPR_NOTIFY_SAVE_FAILED);
     return 0;
 }
 
@@ -116,4 +143,6 @@ void vfio_cpr_unregister_container(VFIOContainer *container)
     cpr_del_blocker(&container->cpr_blocker);
 
     vmstate_unregister(NULL, &vfio_container_vmstate, container);
+
+    cpr_remove_notifier(&container->cpr_notifier);
 }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 17ad9ba..dd6bbcf 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -244,7 +244,7 @@ vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
 extern const MemoryListener vfio_prereg_listener;
 void vfio_listener_register(VFIOContainer *container);
 void vfio_container_region_add(VFIOContainer *container,
-                               MemoryRegionSection *section);
+                               MemoryRegionSection *section, bool replay);
 
 int vfio_spapr_create_window(VFIOContainer *container,
                              MemoryRegionSection *section,
-- 
1.8.3.1
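
Editorial sketch, not part of the patch: the "replay" idea can be shown with a
toy container that tracks mappings by iova.  Everything here (ToyContainer,
toy_region_add, toy_unmap_all_vaddr) is a hypothetical model, not the vfio
code.  A NULL vaddr stands for "vaddr unmapped".  With replay set, the add
path refreshes an existing entry instead of allocating a new one, and fails
if no entry exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MAPPINGS 8

typedef struct ToyMapping {
    uint64_t iova;
    void *vaddr;            /* NULL models an unmapped vaddr */
} ToyMapping;

typedef struct ToyContainer {
    ToyMapping maps[TOY_MAX_MAPPINGS];
    size_t nr_maps;
} ToyContainer;

static ToyMapping *toy_find(ToyContainer *c, uint64_t iova)
{
    for (size_t i = 0; i < c->nr_maps; i++) {
        if (c->maps[i].iova == iova) {
            return &c->maps[i];
        }
    }
    return NULL;
}

/* Add a mapping, or with replay=true restore the vaddr of an existing one. */
static int toy_region_add(ToyContainer *c, uint64_t iova, void *vaddr,
                          bool replay)
{
    ToyMapping *m;

    if (replay) {
        m = toy_find(c, iova);
        if (!m) {
            return -1;      /* nothing to replay */
        }
        m->vaddr = vaddr;   /* no new bookkeeping is allocated */
        return 0;
    }
    if (c->nr_maps == TOY_MAX_MAPPINGS) {
        return -1;
    }
    m = &c->maps[c->nr_maps++];
    m->iova = iova;
    m->vaddr = vaddr;
    return 0;
}

/* Model of the save path invalidating every vaddr. */
static void toy_unmap_all_vaddr(ToyContainer *c)
{
    for (size_t i = 0; i < c->nr_maps; i++) {
        c->maps[i].vaddr = NULL;
    }
}
```

Recovery then amounts to walking every section and calling toy_region_add
with replay=true.  The number of tracked mappings is unchanged afterwards.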




* [PATCH V8 31/39] vhost: reset vhost devices for cpr
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (29 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 30/39] vfio-pci: recover from unmap-all-vaddr failure Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 32/39] loader: suppress rom_reset during cpr Steve Sistare
                   ` (7 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

A vhost device is implicitly preserved across re-exec because its fd is not
closed, and the value of the fd is specified on the command line for the
new qemu to find.  However, new qemu issues a VHOST_SET_OWNER ioctl,
which fails because the device already has an owner.  To fix, reset the
owner prior to exec.

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/virtio/vhost.c         | 17 +++++++++++++++++
 include/hw/virtio/vhost.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index dd3263d..efaa28c 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -23,6 +23,7 @@
 #include "standard-headers/linux/vhost_types.h"
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
+#include "migration/cpr.h"
 #include "migration/blocker.h"
 #include "migration/qemu-file-types.h"
 #include "sysemu/dma.h"
@@ -1306,6 +1307,17 @@ static void vhost_virtqueue_cleanup(struct vhost_virtqueue *vq)
     event_notifier_cleanup(&vq->masked_notifier);
 }
 
+static void vhost_cpr_exec_notifier(Notifier *notifier, void *data)
+{
+    struct vhost_dev *dev = container_of(notifier, struct vhost_dev,
+                                         cpr_notifier);
+    int r = dev->vhost_ops->vhost_reset_device(dev);
+
+    if (r < 0) {
+        VHOST_OPS_DEBUG(r, "vhost_reset_device failed");
+    }
+}
+
 int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
                    VhostBackendType backend_type, uint32_t busyloop_timeout,
                    Error **errp)
@@ -1405,6 +1417,8 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
     hdev->log_enabled = false;
     hdev->started = false;
     memory_listener_register(&hdev->memory_listener, &address_space_memory);
+    cpr_add_notifier(&hdev->cpr_notifier, vhost_cpr_exec_notifier,
+                     CPR_NOTIFY_EXEC);
     QLIST_INSERT_HEAD(&vhost_devices, hdev, entry);
 
     if (used_memslots > hdev->vhost_ops->vhost_backend_memslots_limit(hdev)) {
@@ -1444,6 +1458,9 @@ void vhost_dev_cleanup(struct vhost_dev *hdev)
         migrate_del_blocker(hdev->migration_blocker);
         error_free(hdev->migration_blocker);
     }
+    if (hdev->cpr_notifier.notify) {
+        cpr_remove_notifier(&hdev->cpr_notifier);
+    }
     g_free(hdev->mem);
     g_free(hdev->mem_sections);
     if (hdev->vhost_ops) {
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index b291fe4..1316b14 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -100,6 +100,7 @@ struct vhost_dev {
     QLIST_ENTRY(vhost_dev) entry;
     QLIST_HEAD(, vhost_iommu) iommu_list;
     IOMMUNotifier n;
+    Notifier cpr_notifier;
     const VhostDevConfigOps *config_ops;
 };
 
-- 
1.8.3.1




* [PATCH V8 32/39] loader: suppress rom_reset during cpr
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (30 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 31/39] vhost: reset vhost devices for cpr Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 33/39] chardev: cpr framework Steve Sistare
                   ` (6 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Reported-by: Zheng Chuan <zhengchuan@huawei.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/core/loader.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/core/loader.c b/hw/core/loader.c
index 0548830..7b39c07 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -51,6 +51,7 @@
 #include "hw/hw.h"
 #include "disas/disas.h"
 #include "migration/vmstate.h"
+#include "migration/cpr.h"
 #include "monitor/monitor.h"
 #include "sysemu/reset.h"
 #include "sysemu/sysemu.h"
@@ -1153,6 +1154,7 @@ ssize_t rom_add_option(const char *file, int32_t bootindex)
 static void rom_reset(void *unused)
 {
     Rom *rom;
+    bool cpr_is_active = (cpr_get_mode() != CPR_MODE_NONE);
 
     QTAILQ_FOREACH(rom, &roms, next) {
         if (rom->fw_file) {
@@ -1163,7 +1165,7 @@ static void rom_reset(void *unused)
          * the data in during the next incoming migration in all cases.  Note
          * that some of those RAMs can actually be modified by the guest.
          */
-        if (runstate_check(RUN_STATE_INMIGRATE)) {
+        if (runstate_check(RUN_STATE_INMIGRATE) || cpr_is_active) {
             if (rom->data && rom->isrom) {
                 /*
                  * Free it so that a rom_reset after migration doesn't
-- 
1.8.3.1




* [PATCH V8 33/39] chardev: cpr framework
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (31 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 32/39] loader: suppress rom_reset during cpr Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 34/39] chardev: cpr for simple devices Steve Sistare
                   ` (5 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Add QEMU_CHAR_FEATURE_CPR for devices that support cpr by preserving an
open descriptor across exec.  Add the chardev reopen-on-cpr option for
devices that should be closed on cpr and reopened after exec.

Enable cpr for a chardev if it has QEMU_CHAR_FEATURE_CPR and reopen-on-cpr
is false.  Allow cpr-save only if every chardev in the configuration either
has QEMU_CHAR_FEATURE_CPR or sets reopen-on-cpr.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char.c         | 49 +++++++++++++++++++++++++++++++++++++++++++++----
 include/chardev/char.h |  5 +++++
 qapi/char.json         |  7 ++++++-
 qemu-options.hx        | 26 ++++++++++++++++++++++----
 4 files changed, 78 insertions(+), 9 deletions(-)

diff --git a/chardev/char.c b/chardev/char.c
index 0169d8d..ef3f196 100644
--- a/chardev/char.c
+++ b/chardev/char.c
@@ -36,9 +36,11 @@
 #include "qemu/help_option.h"
 #include "qemu/module.h"
 #include "qemu/option.h"
+#include "migration/cpr.h"
 #include "qemu/id.h"
 #include "qemu/coroutine.h"
 #include "qemu/yank.h"
+#include "sysemu/sysemu.h"
 
 #include "chardev-internal.h"
 
@@ -236,26 +238,55 @@ int qemu_chr_add_client(Chardev *s, int fd)
 static void qemu_char_open(Chardev *chr, ChardevBackend *backend,
                            bool *be_opened, Error **errp)
 {
+    ERRP_GUARD();
+    g_autofree char *fdname = NULL;
+
     ChardevClass *cc = CHARDEV_GET_CLASS(chr);
     /* Any ChardevCommon member would work */
     ChardevCommon *common = backend ? backend->u.null.data : NULL;
+    bool has_logfile = (common && common->has_logfile);
+    bool has_feature_cpr;
 
-    if (common && common->has_logfile) {
+    if (has_logfile) {
         int flags = O_WRONLY;
+        fdname = g_strdup_printf("%s_log", chr->label);
         if (common->has_logappend &&
             common->logappend) {
             flags |= O_APPEND;
         } else {
             flags |= O_TRUNC;
         }
-        chr->logfd = qemu_create(common->logfile, flags, 0666, errp);
+        chr->logfd = cpr_find_fd(fdname, 0);
+        if (chr->logfd < 0) {
+            chr->logfd = qemu_create(common->logfile, flags, 0666, errp);
+        }
         if (chr->logfd < 0) {
             return;
         }
     }
 
+    chr->reopen_on_cpr = (common && common->reopen_on_cpr);
+
     if (cc->open) {
         cc->open(chr, backend, be_opened, errp);
+        if (*errp) {
+            return;
+        }
+    }
+
+    /* Evaluate this after the open method sets the feature */
+    has_feature_cpr = qemu_chr_has_feature(chr, QEMU_CHAR_FEATURE_CPR);
+    chr->cpr_enabled = !chr->reopen_on_cpr && has_feature_cpr;
+
+    if (!chr->reopen_on_cpr && !has_feature_cpr) {
+        chr->cpr_blocker = NULL;
+        error_setg(&chr->cpr_blocker,
+                "chardev %s -> %s does not allow cpr. See reopen-on-cpr.",
+                chr->label, chr->filename);
+        cpr_add_blocker(&chr->cpr_blocker, errp, CPR_MODE_RESTART, 0);
+
+    } else if (chr->cpr_enabled && has_logfile) {
+        cpr_resave_fd(fdname, 0, chr->logfd, errp);
     }
 }
 
@@ -297,11 +328,16 @@ static void char_finalize(Object *obj)
     if (chr->be) {
         chr->be->chr = NULL;
     }
-    g_free(chr->filename);
-    g_free(chr->label);
     if (chr->logfd != -1) {
+        g_autofree char *fdname = g_strdup_printf("%s_log", chr->label);
+        if (chr->cpr_enabled) {
+            cpr_delete_fd(fdname, 0);
+        }
         close(chr->logfd);
     }
+    cpr_del_blocker(&chr->cpr_blocker);
+    g_free(chr->filename);
+    g_free(chr->label);
     qemu_mutex_destroy(&chr->chr_write_lock);
 }
 
@@ -501,6 +537,8 @@ void qemu_chr_parse_common(QemuOpts *opts, ChardevCommon *backend)
 
     backend->has_logappend = true;
     backend->logappend = qemu_opt_get_bool(opts, "logappend", false);
+
+    backend->reopen_on_cpr = qemu_opt_get_bool(opts, "reopen-on-cpr", false);
 }
 
 static const ChardevClass *char_get_class(const char *driver, Error **errp)
@@ -942,6 +980,9 @@ QemuOptsList qemu_chardev_opts = {
         },{
             .name = "abstract",
             .type = QEMU_OPT_BOOL,
+        },{
+            .name = "reopen-on-cpr",
+            .type = QEMU_OPT_BOOL,
 #endif
         },
         { /* end of list */ }
diff --git a/include/chardev/char.h b/include/chardev/char.h
index a319b5f..bbf2560 100644
--- a/include/chardev/char.h
+++ b/include/chardev/char.h
@@ -50,6 +50,8 @@ typedef enum {
     /* Whether the gcontext can be changed after calling
      * qemu_chr_be_update_read_handlers() */
     QEMU_CHAR_FEATURE_GCONTEXT,
+    /* Whether the device supports cpr */
+    QEMU_CHAR_FEATURE_CPR,
 
     QEMU_CHAR_FEATURE_LAST,
 } ChardevFeature;
@@ -67,6 +69,9 @@ struct Chardev {
     int be_open;
     /* used to coordinate the chardev-change special-case: */
     bool handover_yank_instance;
+    bool reopen_on_cpr;
+    bool cpr_enabled;
+    Error *cpr_blocker;
     GSource *gsource;
     GMainContext *gcontext;
     DECLARE_BITMAP(features, QEMU_CHAR_FEATURE_LAST);
diff --git a/qapi/char.json b/qapi/char.json
index 923dc50..0c3558e 100644
--- a/qapi/char.json
+++ b/qapi/char.json
@@ -204,12 +204,17 @@
 # @logfile: The name of a logfile to save output
 # @logappend: true to append instead of truncate
 #             (default to false to truncate)
+# @reopen-on-cpr: if true, close the device's fd on cpr-save and reopen it
+#                 after cpr-exec.  Set this to allow cpr on a device that does
+#                 not support QEMU_CHAR_FEATURE_CPR.  Defaults to false.
+#                 (Since 7.1)
 #
 # Since: 2.6
 ##
 { 'struct': 'ChardevCommon',
   'data': { '*logfile': 'str',
-            '*logappend': 'bool' } }
+            '*logappend': 'bool',
+            '*reopen-on-cpr': 'bool' } }
 
 ##
 # @ChardevFile:
diff --git a/qemu-options.hx b/qemu-options.hx
index 1b49360..2f4bb2b 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -3291,43 +3291,57 @@ DEFHEADING(Character device options:)
 
 DEF("chardev", HAS_ARG, QEMU_OPTION_chardev,
     "-chardev help\n"
-    "-chardev null,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "-chardev null,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off][,reopen-on-cpr=on|off]\n"
     "-chardev socket,id=id[,host=host],port=port[,to=to][,ipv4=on|off][,ipv6=on|off][,nodelay=on|off]\n"
     "         [,server=on|off][,wait=on|off][,telnet=on|off][,websocket=on|off][,reconnect=seconds][,mux=on|off]\n"
-    "         [,logfile=PATH][,logappend=on|off][,tls-creds=ID][,tls-authz=ID] (tcp)\n"
+    "         [,logfile=PATH][,logappend=on|off][,tls-creds=ID][,tls-authz=ID][,reopen-on-cpr=on|off] (tcp)\n"
     "-chardev socket,id=id,path=path[,server=on|off][,wait=on|off][,telnet=on|off][,websocket=on|off][,reconnect=seconds]\n"
-    "         [,mux=on|off][,logfile=PATH][,logappend=on|off][,abstract=on|off][,tight=on|off] (unix)\n"
+    "         [,mux=on|off][,logfile=PATH][,logappend=on|off][,abstract=on|off][,tight=on|off][,reopen-on-cpr=on|off] (unix)\n"
     "-chardev udp,id=id[,host=host],port=port[,localaddr=localaddr]\n"
     "         [,localport=localport][,ipv4=on|off][,ipv6=on|off][,mux=on|off]\n"
-    "         [,logfile=PATH][,logappend=on|off]\n"
+    "         [,logfile=PATH][,logappend=on|off][,reopen-on-cpr=on|off]\n"
     "-chardev msmouse,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev vc,id=id[[,width=width][,height=height]][[,cols=cols][,rows=rows]]\n"
     "         [,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev ringbuf,id=id[,size=size][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev file,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev pipe,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #ifdef _WIN32
     "-chardev console,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
     "-chardev serial,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
 #else
     "-chardev pty,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev stdio,id=id[,mux=on|off][,signal=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #ifdef CONFIG_BRLAPI
     "-chardev braille,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #if defined(__linux__) || defined(__sun__) || defined(__FreeBSD__) \
         || defined(__NetBSD__) || defined(__OpenBSD__) || defined(__DragonFly__)
     "-chardev serial,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev tty,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #if defined(__linux__) || defined(__FreeBSD__) || defined(__DragonFly__)
     "-chardev parallel,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev parport,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #if defined(CONFIG_SPICE)
     "-chardev spicevmc,id=id,name=name[,debug=debug][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev spiceport,id=id,name=name[,debug=debug][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
     , QEMU_ARCH_ALL
 )
@@ -3402,6 +3416,10 @@ The general form of a character device option is:
     ``logappend`` option controls whether the log file will be truncated
     or appended to when opened.
 
+    Every backend supports the ``reopen-on-cpr`` option.  If on, the
+    device's descriptor is closed during cpr-save, and reopened after exec.
+    This is useful for devices that do not support cpr.
+
 The available backends are:
 
 ``-chardev null,id=id``
-- 
1.8.3.1




* [PATCH V8 34/39] chardev: cpr for simple devices
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (32 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 33/39] chardev: cpr framework Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 35/39] chardev: cpr for pty Steve Sistare
                   ` (4 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

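Set QEMU_CHAR_FEATURE_CPR for devices that trivially support cpr.
char-stdio is slightly less trivial: it also calls term_exit from a
CPR_NOTIFY_EXEC notifier.  Allow the gdb server by setting reopen_on_cpr
on its monitor chardev, so that it is closed on exec.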

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-mux.c    |  1 +
 chardev/char-null.c   |  1 +
 chardev/char-serial.c |  1 +
 chardev/char-stdio.c  | 10 ++++++++++
 gdbstub.c             |  1 +
 5 files changed, 14 insertions(+)

diff --git a/chardev/char-mux.c b/chardev/char-mux.c
index ee2d47b..d47fa31 100644
--- a/chardev/char-mux.c
+++ b/chardev/char-mux.c
@@ -337,6 +337,7 @@ static void qemu_chr_open_mux(Chardev *chr,
      */
     *be_opened = muxes_opened;
     qemu_chr_fe_init(&d->chr, drv, errp);
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
 }
 
 static void qemu_chr_parse_mux(QemuOpts *opts, ChardevBackend *backend,
diff --git a/chardev/char-null.c b/chardev/char-null.c
index 1c6a290..02acaff 100644
--- a/chardev/char-null.c
+++ b/chardev/char-null.c
@@ -32,6 +32,7 @@ static void null_chr_open(Chardev *chr,
                           Error **errp)
 {
     *be_opened = false;
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
 }
 
 static void char_null_class_init(ObjectClass *oc, void *data)
diff --git a/chardev/char-serial.c b/chardev/char-serial.c
index 4b0b83d..7aa2042 100644
--- a/chardev/char-serial.c
+++ b/chardev/char-serial.c
@@ -277,6 +277,7 @@ static void qmp_chardev_open_serial(Chardev *chr,
     }
     tty_serial_init(fd, 115200, 'N', 8, 1);
 
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
     qemu_chr_open_fd(chr, fd, fd);
 }
 #endif /* __linux__ || __sun__ */
diff --git a/chardev/char-stdio.c b/chardev/char-stdio.c
index 3c64867..520f1db 100644
--- a/chardev/char-stdio.c
+++ b/chardev/char-stdio.c
@@ -27,6 +27,7 @@
 #include "qemu/option.h"
 #include "qemu/sockets.h"
 #include "qapi/error.h"
+#include "migration/cpr.h"
 #include "chardev/char.h"
 
 #ifdef _WIN32
@@ -44,6 +45,7 @@ static int old_fd0_flags;
 static bool stdio_in_use;
 static bool stdio_allow_signal;
 static bool stdio_echo_state;
+static Notifier cpr_notifier;
 
 static void term_exit(void)
 {
@@ -53,6 +55,11 @@ static void term_exit(void)
     }
 }
 
+static void term_cpr_exec_notifier(Notifier *notifier, void *data)
+{
+    term_exit();
+}
+
 static void qemu_chr_set_echo_stdio(Chardev *chr, bool echo)
 {
     struct termios tty;
@@ -117,6 +124,8 @@ static void qemu_chr_open_stdio(Chardev *chr,
 
     stdio_allow_signal = !opts->has_signal || opts->signal;
     qemu_chr_set_echo_stdio(chr, false);
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
+    cpr_add_notifier(&cpr_notifier, term_cpr_exec_notifier, CPR_NOTIFY_EXEC);
 }
 #endif
 
@@ -147,6 +156,7 @@ static void char_stdio_finalize(Object *obj)
 {
 #ifndef _WIN32
     term_exit();
+    cpr_remove_notifier(&cpr_notifier);
 #endif
 }
 
diff --git a/gdbstub.c b/gdbstub.c
index 88a34c8..7865c3d 100644
--- a/gdbstub.c
+++ b/gdbstub.c
@@ -3584,6 +3584,7 @@ int gdbserver_start(const char *device)
         mon_chr = gdbserver_state.mon_chr;
         reset_gdbserver_state();
     }
+    mon_chr->reopen_on_cpr = true;
 
     create_processes(&gdbserver_state);
 
-- 
1.8.3.1




* [PATCH V8 35/39] chardev: cpr for pty
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (33 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 34/39] chardev: cpr for simple devices Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 36/39] chardev: cpr for sockets Steve Sistare
                   ` (3 subsequent siblings)
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Save and restore pty descriptors across cpr-save and cpr-load.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
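Note for readers: the find-or-create flow in char_pty_open() can be sketched
in isolation as below.  This is only an illustration under stated
assumptions, not the cpr implementation.  The table, sketch_find_fd(),
sketch_save_fd(), sketch_open_new_pty() and sketch_pty_open() are all
hypothetical stand-ins for cpr_find_fd(), cpr_save_fd() and
qemu_openpty_raw(), and the descriptor numbers are fake.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_FDS 8

/* Hypothetical saved-descriptor table, keyed by (name, id). */
typedef struct SketchSavedFd {
    const char *name;
    int id;
    int fd;
    bool used;
} SketchSavedFd;

static SketchSavedFd sketch_fds[SKETCH_MAX_FDS];
static int sketch_open_calls;   /* how many times a new pty was "opened" */

/* Return the saved descriptor for (name, id), or -1 if none is saved. */
static int sketch_find_fd(const char *name, int id)
{
    for (size_t i = 0; i < SKETCH_MAX_FDS; i++) {
        if (sketch_fds[i].used && sketch_fds[i].id == id &&
            strcmp(sketch_fds[i].name, name) == 0) {
            return sketch_fds[i].fd;
        }
    }
    return -1;
}

/* Record fd under (name, id); returns false if the table is full. */
static bool sketch_save_fd(const char *name, int id, int fd)
{
    for (size_t i = 0; i < SKETCH_MAX_FDS; i++) {
        if (!sketch_fds[i].used) {
            sketch_fds[i] = (SketchSavedFd){ name, id, fd, true };
            return true;
        }
    }
    return false;
}

/* Stand-in for qemu_openpty_raw(): hands out fake descriptor numbers. */
static int sketch_open_new_pty(void)
{
    return 100 + sketch_open_calls++;
}

/* Reuse a preserved descriptor if present, else open and save a new one. */
static int sketch_pty_open(const char *label, bool cpr_enabled)
{
    int fd = sketch_find_fd(label, 0);

    if (fd >= 0) {
        return fd;              /* preserved across cpr, nothing to open */
    }
    fd = sketch_open_new_pty();
    if (fd >= 0 && cpr_enabled) {
        sketch_save_fd(label, 0, fd);
    }
    return fd;
}
```

In this model, a second sketch_pty_open() call with the same label plays the
role of new qemu: it gets the saved descriptor back and does not open a
new pty.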
 chardev/char-pty.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/chardev/char-pty.c b/chardev/char-pty.c
index 53f25c6..ff5b00a 100644
--- a/chardev/char-pty.c
+++ b/chardev/char-pty.c
@@ -29,6 +29,7 @@
 #include "qemu/sockets.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
+#include "migration/cpr.h"
 #include "qemu/qemu-print.h"
 
 #include "chardev/char-io.h"
@@ -190,6 +191,9 @@ static void char_pty_finalize(Object *obj)
     Chardev *chr = CHARDEV(obj);
     PtyChardev *s = PTY_CHARDEV(obj);
 
+    if (chr->cpr_enabled) {
+        cpr_delete_fd(chr->label, 0);
+    }
     pty_chr_state(chr, 0);
     object_unref(OBJECT(s->ioc));
     pty_chr_timer_cancel(s);
@@ -317,12 +321,20 @@ static void char_pty_open(Chardev *chr,
     char pty_name[PATH_MAX];
     char *name;
 
+    master_fd = cpr_find_fd(chr->label, 0);
+    if (master_fd >= 0) {
+        chr->filename = g_strdup_printf("pty:unknown");
+        goto have_fd;
+    }
+
     master_fd = qemu_openpty_raw(&slave_fd, pty_name);
     if (master_fd < 0) {
         error_setg_errno(errp, errno, "Failed to create PTY");
         return;
     }
-
+    if (chr->cpr_enabled) {
+        cpr_save_fd(chr->label, 0, master_fd);
+    }
     close(slave_fd);
     if (!g_unix_set_fd_nonblocking(master_fd, true, NULL)) {
         error_setg_errno(errp, errno, "Failed to set FD nonblocking");
@@ -333,6 +345,8 @@ static void char_pty_open(Chardev *chr,
     qemu_printf("char device redirected to %s (label %s)\n",
                 pty_name, chr->label);
 
+have_fd:
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
     s = PTY_CHARDEV(chr);
     s->ioc = QIO_CHANNEL(qio_channel_file_new_fd(master_fd));
     name = g_strdup_printf("chardev-pty-%s", chr->label);
-- 
1.8.3.1




* [PATCH V8 36/39] chardev: cpr for sockets
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (34 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 35/39] chardev: cpr for pty Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-07-03  8:19   ` Peng Liang
  2022-06-15 14:52 ` [PATCH V8 37/39] cpr: only-cpr-capable option Steve Sistare
                   ` (2 subsequent siblings)
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Save accepted socket fds before cpr-save, and look for them after cpr-load.
Block cpr-exec if a socket enables the TLS or websocket option.  Allow a
monitor socket by closing it on exec.

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
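Note for readers: the three outcomes described in the commit message follow
from the branch added to qmp_chardev_open_socket().  The sketch below
restates that branch as a pure function, as an aid to review only.  The enum
and sketch_sock_cpr_policy() are hypothetical names that do not exist in
the tree.

```c
#include <stdbool.h>

/* Hypothetical labels for the three outcomes of the new branch. */
typedef enum {
    SOCK_CPR_CAPABLE,   /* QEMU_CHAR_FEATURE_CPR is set on the chardev */
    SOCK_CPR_REOPENED,  /* no feature and no blocker: reopen_on_cpr is set */
    SOCK_CPR_BLOCKED    /* a cpr blocker is registered */
} SketchSockCprPolicy;

/* Mirror of: if (!tls && !websock) ... else if (!reopen_on_cpr) ... */
static SketchSockCprPolicy sketch_sock_cpr_policy(bool has_tls,
                                                  bool is_websock,
                                                  bool reopen_on_cpr)
{
    if (!has_tls && !is_websock) {
        return SOCK_CPR_CAPABLE;
    }
    if (!reopen_on_cpr) {
        return SOCK_CPR_BLOCKED;
    }
    return SOCK_CPR_REOPENED;
}
```

A monitor socket using TLS lands in the SOCK_CPR_REOPENED case, because
monitor_init_hmp() and monitor_init_qmp() set reopen_on_cpr.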
 chardev/char-socket.c         | 45 +++++++++++++++++++++++++++++++++++++++++++
 include/chardev/char-socket.h |  1 +
 monitor/hmp.c                 |  3 +++
 monitor/qmp.c                 |  3 +++
 4 files changed, 52 insertions(+)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index dc4e218..3a1e36b 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -26,6 +26,7 @@
 #include "chardev/char.h"
 #include "io/channel-socket.h"
 #include "io/channel-websock.h"
+#include "migration/cpr.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
 #include "qemu/option.h"
@@ -33,6 +34,7 @@
 #include "qapi/clone-visitor.h"
 #include "qapi/qapi-visit-sockets.h"
 #include "qemu/yank.h"
+#include "sysemu/sysemu.h"
 
 #include "chardev/char-io.h"
 #include "chardev/char-socket.h"
@@ -358,6 +360,11 @@ static void tcp_chr_free_connection(Chardev *chr)
     SocketChardev *s = SOCKET_CHARDEV(chr);
     int i;
 
+    if (chr->cpr_enabled) {
+        cpr_delete_fd(chr->label, 0);
+    }
+    cpr_del_blocker(&s->cpr_blocker);
+
     if (s->read_msgfds_num) {
         for (i = 0; i < s->read_msgfds_num; i++) {
             close(s->read_msgfds[i]);
@@ -923,6 +930,10 @@ static void tcp_chr_accept(QIONetListener *listener,
                                QIO_CHANNEL(cioc));
     }
     tcp_chr_new_client(chr, cioc);
+
+    if (s->sioc && chr->cpr_enabled) {
+        cpr_resave_fd(chr->label, 0, s->sioc->fd, NULL);
+    }
 }
 
 
@@ -1178,6 +1189,26 @@ static gboolean socket_reconnect_timeout(gpointer opaque)
     return false;
 }
 
+static int load_char_socket_fd(Chardev *chr, Error **errp)
+{
+    SocketChardev *sockchar = SOCKET_CHARDEV(chr);
+    QIOChannelSocket *sioc;
+    const char *label = chr->label;
+    int fd = cpr_find_fd(label, 0);
+
+    if (fd != -1) {
+        sockchar = SOCKET_CHARDEV(chr);
+        sioc = qio_channel_socket_new_fd(fd, errp);
+        if (sioc) {
+            tcp_chr_accept(sockchar->listener, sioc, chr);
+            object_unref(OBJECT(sioc));
+        } else {
+            error_setg(errp, "could not restore socket for %s", label);
+            return -1;
+        }
+    }
+    return 0;
+}
 
 static int qmp_chardev_open_socket_server(Chardev *chr,
                                           bool is_telnet,
@@ -1388,6 +1419,18 @@ static void qmp_chardev_open_socket(Chardev *chr,
     }
     s->registered_yank = true;
 
+    if (!s->tls_creds && !s->is_websock) {
+        qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
+    } else if (!chr->reopen_on_cpr) {
+        s->cpr_blocker = NULL;
+        error_setg(&s->cpr_blocker,
+                   "error: socket %s is not cpr capable due to %s option",
+                   chr->label, (s->tls_creds ? "TLS" : "websocket"));
+        if (cpr_add_blocker(&s->cpr_blocker, errp, CPR_MODE_RESTART, 0)) {
+            return;
+        }
+    }
+
     /* be isn't opened until we get a connection */
     *be_opened = false;
 
@@ -1403,6 +1446,8 @@ static void qmp_chardev_open_socket(Chardev *chr,
             return;
         }
     }
+
+    load_char_socket_fd(chr, errp);
 }
 
 static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend *backend,
diff --git a/include/chardev/char-socket.h b/include/chardev/char-socket.h
index 0708ca6..1c3abf7 100644
--- a/include/chardev/char-socket.h
+++ b/include/chardev/char-socket.h
@@ -78,6 +78,7 @@ struct SocketChardev {
     bool connect_err_reported;
 
     QIOTask *connect_task;
+    Error *cpr_blocker;
 };
 typedef struct SocketChardev SocketChardev;
 
diff --git a/monitor/hmp.c b/monitor/hmp.c
index 15ca047..75e6739 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -1501,4 +1501,7 @@ void monitor_init_hmp(Chardev *chr, bool use_readline, Error **errp)
     qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read, monitor_read,
                              monitor_event, NULL, &mon->common, NULL, true);
     monitor_list_append(&mon->common);
+
+    /* monitor cannot yet be preserved across cpr */
+    chr->reopen_on_cpr = true;
 }
diff --git a/monitor/qmp.c b/monitor/qmp.c
index 092c527..0043459 100644
--- a/monitor/qmp.c
+++ b/monitor/qmp.c
@@ -535,4 +535,7 @@ void monitor_init_qmp(Chardev *chr, bool pretty, Error **errp)
                                  NULL, &mon->common, NULL, true);
         monitor_list_append(&mon->common);
     }
+
+    /* Monitor cannot yet be preserved across cpr */
+    chr->reopen_on_cpr = true;
 }
-- 
1.8.3.1




* [PATCH V8 37/39] cpr: only-cpr-capable option
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (35 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 36/39] chardev: cpr for sockets Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 38/39] python/machine: add QEMUMachine accessors Steve Sistare
  2022-06-15 14:52 ` [PATCH V8 39/39] tests/avocado: add cpr regression test Steve Sistare
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Add the only-cpr-capable option, which causes qemu to exit with an error
if any devices that are not capable of cpr are added.  This guarantees that
cpr-save will not fail due to a blocker.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
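Note for readers: the effect of the option on cpr_add_blocker() can be
modeled in a few lines.  This is a simplified sketch, not the code in
migration/cpr.c.  CprSketch, the MODE_* enum and sketch_add_blocker() are
hypothetical, and one reason string per mode stands in for the GSList of
Error objects.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define BIT(n) (1u << (n))

/* Hypothetical mode list, modeled on CprMode. */
enum { MODE_REBOOT, MODE_RESTART, MODE__MAX };

typedef struct CprSketch {
    unsigned enabled_modes;           /* modes given by -cpr-enable */
    bool only_cpr_capable;            /* -only-cpr-capable was given */
    const char *blockers[MODE__MAX];  /* simplified: one reason per mode */
} CprSketch;

/*
 * Record a blocker for each mode in 'modes'.  With only_cpr_capable set,
 * a blocker that affects an enabled mode is refused, so the caller fails
 * at device creation time instead of at cpr-save time.
 */
static int sketch_add_blocker(CprSketch *s, const char *reason,
                              unsigned modes)
{
    if (s->only_cpr_capable && (s->enabled_modes & modes)) {
        return -EACCES;
    }
    for (int mode = 0; mode < MODE__MAX; mode++) {
        if (modes & BIT(mode)) {
            s->blockers[mode] = reason;
        }
    }
    return 0;
}
```

In this model a blocker for a mode that was not enabled is still recorded
and does not trigger the error, which matches the
(cpr_enabled_modes & modes) test in the patch.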
 include/migration/cpr.h |  2 +-
 migration/cpr.c         | 13 +++++++++++--
 qemu-options.hx         |  8 ++++++++
 softmmu/vl.c            |  6 +++++-
 4 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index ab5f53e..c7eb914 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -11,7 +11,7 @@
 #include "qapi/qapi-types-cpr.h"
 #include "qemu/notify.h"
 
-void cpr_init(int modes);
+void cpr_init(int modes, bool only_cpr_capable);
 void cpr_set_mode(CprMode mode);
 CprMode cpr_get_mode(void);
 bool cpr_enabled(CprMode mode);
diff --git a/migration/cpr.c b/migration/cpr.c
index 9d6bca4..7f507f1 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -18,9 +18,11 @@
 #include "sysemu/sysemu.h"
 
 static int cpr_enabled_modes;
+static bool only_cpr_capable;
 
-void cpr_init(int modes)
+void cpr_init(int modes, bool only_cpr)
 {
+    only_cpr_capable = only_cpr;
     cpr_enabled_modes = modes;
     cpr_state_load(&error_fatal);
 }
@@ -36,7 +38,7 @@ static GSList *cpr_blockers[CPR_MODE__MAX];
  * Add blocker for each mode in varargs list, or for all modes if CPR_MODE_ALL
  * is specified.  Caller terminates the list with 0 or CPR_MODE_ALL.  This
  * function takes ownership of *reasonp, and frees it on error, or in
- * cpr_del_blocker.  errp is set in a later patch.
+ * cpr_del_blocker.
  */
 int cpr_add_blocker(Error **reasonp, Error **errp, CprMode mode, ...)
 {
@@ -55,6 +57,13 @@ int cpr_add_blocker(Error **reasonp, Error **errp, CprMode mode, ...)
         modes = BIT(CPR_MODE__MAX) - 1;
     }
 
+    if (only_cpr_capable && (cpr_enabled_modes & modes)) {
+        error_propagate_prepend(errp, *reasonp,
+                                "-only-cpr-capable specified, but: ");
+        *reasonp = NULL;
+        return -EACCES;
+    }
+
     for (mode = 0; mode < CPR_MODE__MAX; mode++) {
         if (modes & BIT(mode)) {
             cpr_blockers[mode] = g_slist_prepend(cpr_blockers[mode], *reasonp);
diff --git a/qemu-options.hx b/qemu-options.hx
index 2f4bb2b..25e392f 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4511,6 +4511,14 @@ SRST
     commands.
 ERST
 
+DEF("only-cpr-capable", 0, QEMU_OPTION_only_cpr_capable, \
+    "-only-cpr-capable    allow only cpr capable devices\n", QEMU_ARCH_ALL)
+SRST
+``-only-cpr-capable``
+    Only allow cpr capable devices, which guarantees that cpr-save will not
+    fail due to a cpr blocker.
+ERST
+
 DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
     "-nodefaults     don't create default devices\n", QEMU_ARCH_ALL)
 SRST
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 3e19c74..1bee692 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2604,6 +2604,7 @@ void qemu_init(int argc, char **argv, char **envp)
     bool userconfig = true;
     FILE *vmstate_dump_file = NULL;
     int cpr_modes = 0;
+    bool only_cpr_capable = false;
 
     qemu_add_opts(&qemu_drive_opts);
     qemu_add_drive_opts(&qemu_legacy_drive_opts);
@@ -3321,6 +3322,9 @@ void qemu_init(int argc, char **argv, char **envp)
                 cpr_modes |= BIT(qapi_enum_parse(&CprMode_lookup, optarg, -1,
                                                  &error_fatal));
                 break;
+            case QEMU_OPTION_only_cpr_capable:
+                only_cpr_capable = true;
+                break;
             case QEMU_OPTION_nodefaults:
                 has_defaults = 0;
                 break;
@@ -3472,7 +3476,7 @@ void qemu_init(int argc, char **argv, char **envp)
     qemu_validate_options(machine_opts_dict);
     qemu_process_sugar_options();
 
-    cpr_init(cpr_modes);
+    cpr_init(cpr_modes, only_cpr_capable);
 
     /*
      * These options affect everything else and should be processed
-- 
1.8.3.1




* [PATCH V8 38/39] python/machine: add QEMUMachine accessors
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (36 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 37/39] cpr: only-cpr-capable option Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  2022-06-17 14:16   ` John Snow
  2022-06-15 14:52 ` [PATCH V8 39/39] tests/avocado: add cpr regression test Steve Sistare
  38 siblings, 1 reply; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Provide full_args() to return all command-line arguments used to start a
vm, some of which are not otherwise visible to QEMUMachine clients.  This
is needed by the cpr test, which must start a vm, then pass all qemu
command-line arguments to the cpr-exec monitor call.

Provide reopen_qmp_connection() to reopen a closed monitor connection.
This is needed by cpr, because cpr-exec closes the monitor socket.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 python/qemu/machine/machine.py | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py
index 37191f4..60b934d 100644
--- a/python/qemu/machine/machine.py
+++ b/python/qemu/machine/machine.py
@@ -332,6 +332,11 @@ def args(self) -> List[str]:
         """Returns the list of arguments given to the QEMU binary."""
         return self._args
 
+    @property
+    def full_args(self) -> List[str]:
+        """Returns the full list of arguments used to launch QEMU."""
+        return list(self._qemu_full_args)
+
     def _pre_launch(self) -> None:
         if self._console_set:
             self._remove_files.append(self._console_address)
@@ -486,6 +491,15 @@ def _close_qmp_connection(self) -> None:
         finally:
             self._qmp_connection = None
 
+    def reopen_qmp_connection(self):
+        self._close_qmp_connection()
+        self._qmp_connection = QEMUMonitorProtocol(
+            self._monitor_address,
+            server=True,
+            nickname=self._name
+        )
+        self._qmp.accept(self._qmp_timer)
+
     def _early_cleanup(self) -> None:
         """
         Perform any cleanup that needs to happen before the VM exits.
-- 
1.8.3.1




* [PATCH V8 39/39] tests/avocado: add cpr regression test
  2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
                   ` (37 preceding siblings ...)
  2022-06-15 14:52 ` [PATCH V8 38/39] python/machine: add QEMUMachine accessors Steve Sistare
@ 2022-06-15 14:52 ` Steve Sistare
  38 siblings, 0 replies; 84+ messages in thread
From: Steve Sistare @ 2022-06-15 14:52 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Steve Sistare, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS          |   1 +
 tests/avocado/cpr.py | 152 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 153 insertions(+)
 create mode 100644 tests/avocado/cpr.py

diff --git a/MAINTAINERS b/MAINTAINERS
index 864aec6..4e6e7ab 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3164,6 +3164,7 @@ F: stubs/cpr.c
 F: tests/unit/test-strlist.c
 F: migration/cpr-state.c
 F: stubs/cpr-state.c
+F: tests/avocado/cpr.py
 
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
diff --git a/tests/avocado/cpr.py b/tests/avocado/cpr.py
new file mode 100644
index 0000000..feb43d1
--- /dev/null
+++ b/tests/avocado/cpr.py
@@ -0,0 +1,152 @@
+# cpr test
+
+# Copyright (c) 2021, 2022 Oracle and/or its affiliates.
+#
+# This work is licensed under the terms of the GNU GPL, version 2.
+# See the COPYING file in the top-level directory.
+
+import tempfile
+from avocado_qemu import QemuSystemTest
+from avocado.utils import wait
+
+class Cpr(QemuSystemTest):
+    """
+    :avocado: tags=cpr
+    """
+
+    timeout = 5
+    fast_timeout = 1
+
+    @staticmethod
+    def has_status(vm, status):
+        return vm.command('query-status')['status'] == status
+
+    def wait_for_status(self, vm, status):
+        wait.wait_for(self.has_status,
+                      timeout=self.timeout,
+                      step=0.1,
+                      args=(vm,status,))
+
+    def run_and_fail(self, vm, msg):
+        # Qemu will fail fast, so disable monitor to avoid timeout in accept
+        vm.set_qmp_monitor(False)
+        vm.launch()
+        vm.wait(self.timeout)
+        self.assertRegex(vm.get_log(), msg)
+
+    def do_cpr_restart(self, vmstate_name):
+        vm = self.get_vm('-nodefaults',
+                         '-cpr-enable', 'restart',
+                         '-object', 'memory-backend-memfd,id=pc.ram,size=8M',
+                         '-machine', 'memory-backend=pc.ram')
+
+        vm.launch()
+
+        vm.qmp('cpr-save', filename=vmstate_name, mode='restart')
+        vm.event_wait(name='STOP', timeout=self.fast_timeout)
+
+        args = vm.full_args + ['-S']
+        vm.qmp('cpr-exec', argv=args)
+
+        # exec closes the monitor socket, so reopen it.
+        vm.reopen_qmp_connection()
+
+        self.wait_for_status(vm, 'prelaunch')
+        vm.qmp('cpr-load', filename=vmstate_name, mode='restart')
+        vm.event_wait(name='RESUME', timeout=self.fast_timeout)
+
+        self.assertEqual(vm.command('query-status')['status'], 'running')
+
+    def do_cpr_reboot(self, vmstate_name):
+        old_vm = self.get_vm('-nodefaults',
+                             '-cpr-enable', 'reboot')
+        old_vm.launch()
+
+        old_vm.qmp('cpr-save', filename=vmstate_name, mode='reboot')
+        old_vm.event_wait(name='STOP', timeout=self.fast_timeout)
+
+        new_vm = self.get_vm('-nodefaults',
+                             '-cpr-enable', 'reboot',
+                             '-S')
+        new_vm.launch()
+        self.wait_for_status(new_vm, 'prelaunch')
+
+        new_vm.qmp('cpr-load', filename=vmstate_name, mode='reboot')
+        new_vm.event_wait(name='RESUME', timeout=self.fast_timeout)
+
+        self.assertEqual(new_vm.command('query-status')['status'], 'running')
+
+    def test_cpr_restart(self):
+        """
+        Verify that cpr restart mode works
+        """
+        with tempfile.NamedTemporaryFile() as vmstate_file:
+            self.do_cpr_restart(vmstate_file.name)
+
+    def test_cpr_reboot(self):
+        """
+        Verify that cpr reboot mode works
+        """
+        with tempfile.NamedTemporaryFile() as vmstate_file:
+            self.do_cpr_reboot(vmstate_file.name)
+
+    def test_cpr_block_cpr_save(self):
+
+        """
+        Verify that qemu rejects cpr-save for volatile memory
+        """
+        vm = self.get_vm('-nodefaults',
+                         '-cpr-enable', 'restart')
+        vm.launch()
+        rsp = vm.qmp('cpr-save', filename='/dev/null', mode='restart')
+        vm.qmp('quit')
+
+        expect = r'Memory region .* is volatile'
+        self.assertRegex(rsp['error']['desc'], expect)
+
+    def test_cpr_block_memfd(self):
+
+        """
+        Verify that qemu complains for only-cpr-capable and volatile memory
+        """
+        vm = self.get_vm('-nodefaults',
+                         '-cpr-enable', 'restart',
+                         '-only-cpr-capable')
+        self.run_and_fail(vm, r'only-cpr-capable specified.* Memory ')
+
+    def test_cpr_block_replay(self):
+        """
+        Verify that qemu complains for only-cpr-capable and replay
+        """
+        vm = self.get_vm('-nodefaults',
+                         '-cpr-enable', 'restart',
+                         '-object', 'memory-backend-memfd,id=pc.ram,size=8M',
+                         '-machine', 'memory-backend=pc.ram',
+                         '-only-cpr-capable',
+                         '-icount', 'shift=10,rr=record,rrfile=/dev/null')
+        self.run_and_fail(vm, r'only-cpr-capable specified.* replay ')
+
+    def test_cpr_block_chardev(self):
+        """
+        Verify that qemu complains for only-cpr-capable and unsupported chardev
+        """
+        vm = self.get_vm('-nodefaults',
+                         '-cpr-enable', 'restart',
+                         '-object', 'memory-backend-memfd,id=pc.ram,size=8M',
+                         '-machine', 'memory-backend=pc.ram',
+                         '-only-cpr-capable',
+                         '-chardev', 'vc,id=vc1')
+        self.run_and_fail(vm, r'only-cpr-capable specified.* vc1 ')
+
+    def test_cpr_allow_chardev(self):
+        """
+        Verify that qemu allows unsupported chardev with reopen-on-cpr
+        """
+        vm = self.get_vm('-nodefaults',
+                         '-cpr-enable', 'restart',
+                         '-object', 'memory-backend-memfd,id=pc.ram,size=8M',
+                         '-machine', 'memory-backend=pc.ram',
+                         '-only-cpr-capable',
+                         '-chardev', 'vc,id=vc1,reopen-on-cpr=on')
+        vm.launch()
+        self.wait_for_status(vm, 'running')
-- 
1.8.3.1




* Re: [PATCH V8 04/39] memory: RAM_ANON flag
  2022-06-15 14:51 ` [PATCH V8 04/39] memory: RAM_ANON flag Steve Sistare
@ 2022-06-15 20:25   ` David Hildenbrand
  2022-07-05 18:23     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2022-06-15 20:25 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, John Snow

On 15.06.22 16:51, Steve Sistare wrote:
> A memory-backend-ram or a memory-backend-memfd block with the RAM_SHARED
> flag set is not migrated when migrate_ignore_shared() is true, but this
> is wrong, because it has no named backing store, and its contents will be
> lost.  Define a new flag RAM_ANON to distinguish this case.  Cpr will also
> test this flag, for similar reasons.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  backends/hostmem-epc.c   |  2 +-
>  backends/hostmem-memfd.c |  1 +
>  backends/hostmem-ram.c   |  1 +
>  include/exec/memory.h    |  3 +++
>  include/exec/ram_addr.h  |  1 +
>  migration/ram.c          |  3 ++-
>  softmmu/physmem.c        | 12 +++++++++---
>  7 files changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/backends/hostmem-epc.c b/backends/hostmem-epc.c
> index 037292d..cb06255 100644
> --- a/backends/hostmem-epc.c
> +++ b/backends/hostmem-epc.c
> @@ -37,7 +37,7 @@ sgx_epc_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>      }
>  
>      name = object_get_canonical_path(OBJECT(backend));
> -    ram_flags = (backend->share ? RAM_SHARED : 0) | RAM_PROTECTED;
> +    ram_flags = (backend->share ? RAM_SHARED : 0) | RAM_PROTECTED | MAP_ANON;

I'm pretty sure that doesn't compile. -> RAM_ANON

>      memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
>                                     name, backend->size, ram_flags,
>                                     fd, 0, errp);
> diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
> index 3fc85c3..c9d8001 100644
> --- a/backends/hostmem-memfd.c
> +++ b/backends/hostmem-memfd.c
> @@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>      name = host_memory_backend_get_name(backend);
>      ram_flags = backend->share ? RAM_SHARED : 0;
>      ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
> +    ram_flags |= RAM_ANON;
>      memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
>                                     backend->size, ram_flags, fd, 0, errp);
>      g_free(name);
> diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
> index b8e55cd..5e80149 100644
> --- a/backends/hostmem-ram.c
> +++ b/backends/hostmem-ram.c
> @@ -30,6 +30,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>      name = host_memory_backend_get_name(backend);
>      ram_flags = backend->share ? RAM_SHARED : 0;
>      ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
> +    ram_flags |= RAM_ANON;
>      memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
>                                             backend->size, ram_flags, errp);
>      g_free(name);
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index f1c1945..0daddd7 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -203,6 +203,9 @@ typedef struct IOMMUTLBEvent {
>  /* RAM that isn't accessible through normal means. */
>  #define RAM_PROTECTED (1 << 8)
>  
> +/* RAM has no name outside the qemu process. */
> +#define RAM_ANON (1 << 9)

That name is a bit misleading because it mangles anonymous memory with
an anonymous file, which doesn't provide anonymous memory in "kernel
speak". Please find a better name, some idea below ...

I think what you actually want to know is: is this from a real file,
instead of from an anonymous file or anonymous memory. A real file can
be re-opened and remapped after closing QEMU. Further, you need
MAP_SHARED semantics.


/* RAM maps a real file instead of an anonymous file or no file/fd. */
#define RAM_REAL_FILE (1 << 9)

bool ramblock_maps_real_file(RAMBlock *rb)
{
    return rb->flags & RAM_REAL_FILE;
}


Maybe we can come up with a better name for "real file".


Set the flag from applicable callsites. When setting the flag
internally, assert that we don't have a fd -- that cannot possibly make
sense.

At applicable callsites check for ramblock_maps_real_file() and that
it's actually a shared mapping. If not, it cannot be preserved by
restarting QEMU (easily, there might be ways for memfd involving other
processes).
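
A minimal sketch of that combined check, under stated assumptions: the
RAMBlockSketch type and ramblock_survives_restart() are hypothetical, the
RAM_SHARED value is illustrative, and the assert encodes one reading of the
"no fd" remark above (a block flagged as a real file must have an fd).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RAM_SHARED    (1u << 1)   /* illustrative value */
#define RAM_REAL_FILE (1u << 9)   /* value proposed above */

/* Hypothetical stand-in for RAMBlock; only the fields used here. */
typedef struct RAMBlockSketch {
    uint32_t flags;
    int fd;                       /* -1 when there is no descriptor */
} RAMBlockSketch;

static bool ramblock_maps_real_file(const RAMBlockSketch *rb)
{
    return (rb->flags & RAM_REAL_FILE) != 0;
}

/* True if the contents can be found again after restarting QEMU. */
static bool ramblock_survives_restart(const RAMBlockSketch *rb)
{
    /* A real file without a descriptor cannot possibly make sense. */
    assert(!ramblock_maps_real_file(rb) || rb->fd >= 0);

    return ramblock_maps_real_file(rb) && (rb->flags & RAM_SHARED) != 0;
}
```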


Make sense?

-- 
Thanks,

David / dhildenb




* Re: [PATCH V8 02/39] migration: qemu file wrappers
  2022-06-15 14:51 ` [PATCH V8 02/39] migration: qemu file wrappers Steve Sistare
@ 2022-06-16  2:18   ` Guoyi Tu
  2022-07-05 18:24     ` Steven Sistare
  2022-06-16 14:55   ` Marc-André Lureau
  2022-06-16 15:29   ` Daniel P. Berrangé
  2 siblings, 1 reply; 84+ messages in thread
From: Guoyi Tu @ 2022-06-16  2:18 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: tugy, Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 2022/6/15 22:51, Steve Sistare wrote:
> Add qemu_file_open and qemu_fd_open to create QEMUFile objects for unix
> files and file descriptors.
> 
The function names in the commit message should be updated: the patch
defines qemu_fopen_file and qemu_fopen_fd.

--
Guoyi
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   migration/qemu-file-channel.c | 36 ++++++++++++++++++++++++++++++++++++
>   migration/qemu-file-channel.h |  6 ++++++
>   2 files changed, 42 insertions(+)
> 
> diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
> index bb5a575..cc5aebc 100644
> --- a/migration/qemu-file-channel.c
> +++ b/migration/qemu-file-channel.c
> @@ -27,8 +27,10 @@
>   #include "qemu-file.h"
>   #include "io/channel-socket.h"
>   #include "io/channel-tls.h"
> +#include "io/channel-file.h"
>   #include "qemu/iov.h"
>   #include "qemu/yank.h"
> +#include "qapi/error.h"
>   #include "yank_functions.h"
>   
>   
> @@ -192,3 +194,37 @@ QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
>       object_ref(OBJECT(ioc));
>       return qemu_fopen_ops(ioc, &channel_output_ops, true);
>   }
> +
> +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
> +                          const char *name, Error **errp)
> +{
> +    g_autoptr(QIOChannelFile) fioc = NULL;
> +    QIOChannel *ioc;
> +    QEMUFile *f;
> +
> +    if (flags & O_RDWR) {
> +        error_setg(errp, "qemu_fopen_file %s: O_RDWR not supported", path);
> +        return NULL;
> +    }
> +
> +    fioc = qio_channel_file_new_path(path, flags, mode, errp);
> +    if (!fioc) {
> +        return NULL;
> +    }
> +
> +    ioc = QIO_CHANNEL(fioc);
> +    qio_channel_set_name(ioc, name);
> +    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
> +                             qemu_fopen_channel_input(ioc);
> +    return f;
> +}
> +
> +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name)
> +{
> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
> +    QEMUFile *f = writable ? qemu_fopen_channel_output(ioc) :
> +                             qemu_fopen_channel_input(ioc);
> +    qio_channel_set_name(ioc, name);
> +    return f;
> +}
> diff --git a/migration/qemu-file-channel.h b/migration/qemu-file-channel.h
> index 0028a09..75fd0ad 100644
> --- a/migration/qemu-file-channel.h
> +++ b/migration/qemu-file-channel.h
> @@ -29,4 +29,10 @@
>   
>   QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc);
>   QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc);
> +
> +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
> +                         const char *name, Error **errp);
> +
> +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name);
> +
>   #endif



* Re: [PATCH V8 06/39] cpr: reboot mode
  2022-06-15 14:51 ` [PATCH V8 06/39] cpr: reboot mode Steve Sistare
@ 2022-06-16 11:10   ` Daniel P. Berrangé
  2022-07-05 18:26     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: Daniel P. Berrangé @ 2022-06-16 11:10 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On Wed, Jun 15, 2022 at 07:51:53AM -0700, Steve Sistare wrote:
> Provide the cpr-save and cpr-load functions for live update.  These save and
> restore VM state, with minimal guest pause time, so that qemu may be updated
> to a new version in between.
> 
> cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
> any type of guest image and block device, but the caller must not modify
> guest block devices between cpr-save and cpr-load.
> 
> cpr-save supports several modes, the first of which is reboot. In this mode
> the caller invokes cpr-save and then terminates qemu.  The caller may then
> update the host kernel and system software and reboot.  The caller resumes
> the guest by running qemu with the same arguments as the original process
> and invoking cpr-load.  To use this mode, guest ram must be mapped to a
> persistent shared memory file such as /dev/dax0.0 or /dev/shm PKRAM.
> 
> The reboot mode supports vfio devices if the caller first suspends the
> guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
> guest drivers' suspend methods flush outstanding requests and re-initialize
> the devices, and thus there is no device state to save and restore.
> 
> cpr-load loads state from the file.  If the VM was running at cpr-save time
> then VM execution resumes.  If the VM was suspended at cpr-save time, then
> the caller must issue a system_wakeup command to resume.
> 
> cpr-save syntax:
>   { 'enum': 'CprMode', 'data': [ 'reboot' ] }
>   { 'command': 'cpr-save', 'data': { 'filename': 'str', 'mode': 'CprMode' }}
> 
> cpr-load syntax:
>   { 'command': 'cpr-load', 'data': { 'filename': 'str', 'mode': 'CprMode' }}

I'm still a little unsure if this direction for QAPI exposure is the
best, or whether we should instead leverage the migration commands.

I'm particularly concerned that we might regret having an API that
is designed only around storage in local files/blockdevs. The
migration layer has flexibility to use many protocols which has
been useful in the past to be able to offload work to an external
process. For example, libvirt uses migrate-to-fd so it can use
a helper that adds O_DIRECT support such that we avoid trashing
the host I/O cache for save/restore.

At the same time though, the migrate APIs don't currently support
a plain "file" protocol. This was because historically we needed
the QEMUFile to support O_NONBLOCK and this fails with plain
files or block devices, so QEMU threads could get blocked. For
the save side this doesn't matter so much, as QEMU now has the
outgoing migrate channels in blocking mode, only the incoming
side uses non-blocking.  We could add a plain "file" protocol
to migration if we clearly document its limitations, and indeed
I've suggested we do that for another unrelated bit of work
for libvirt's VM save/restore functionality.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V8 01/39] migration: fix populate_vfio_info
  2022-06-15 14:51 ` [PATCH V8 01/39] migration: fix populate_vfio_info Steve Sistare
@ 2022-06-16 14:41   ` Marc-André Lureau
  0 siblings, 0 replies; 84+ messages in thread
From: Marc-André Lureau @ 2022-06-16 14:41 UTC (permalink / raw)
  To: Steve Sistare
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On Wed, Jun 15, 2022 at 7:20 PM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Include CONFIG_DEVICES so that populate_vfio_info is instantiated for
> CONFIG_VFIO.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>

Fixes: 43bd0bf30fce ("migration: Move populate_vfio_info() into a separate
file")

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

> ---
>  migration/target.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/migration/target.c b/migration/target.c
> index 907ebf0..a0991bc 100644
> --- a/migration/target.c
> +++ b/migration/target.c
> @@ -8,6 +8,7 @@
>  #include "qemu/osdep.h"
>  #include "qapi/qapi-types-migration.h"
>  #include "migration.h"
> +#include CONFIG_DEVICES
>
>  #ifdef CONFIG_VFIO
>  #include "hw/vfio/vfio-common.h"
> --
> 1.8.3.1
>
>
>

-- 
Marc-André Lureau


* Re: [PATCH V8 02/39] migration: qemu file wrappers
  2022-06-15 14:51 ` [PATCH V8 02/39] migration: qemu file wrappers Steve Sistare
  2022-06-16  2:18   ` Guoyi Tu
@ 2022-06-16 14:55   ` Marc-André Lureau
  2022-07-05 18:25     ` Steven Sistare
  2022-06-16 15:29   ` Daniel P. Berrangé
  2 siblings, 1 reply; 84+ messages in thread
From: Marc-André Lureau @ 2022-06-16 14:55 UTC (permalink / raw)
  To: Steve Sistare
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Hi

On Wed, Jun 15, 2022 at 6:54 PM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Add qemu_file_open and qemu_fd_open to create QEMUFile objects for unix
> files and file descriptors.
>

File descriptors are not really unix specific, but that's a detail.

The names of the functions in the summary do not match the code, also
details :)

Eventually, I would suggest following the libc fopen/fdopen naming, if that
makes sense (or the QIOChannel naming).


> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  migration/qemu-file-channel.c | 36 ++++++++++++++++++++++++++++++++++++
>  migration/qemu-file-channel.h |  6 ++++++
>  2 files changed, 42 insertions(+)
>
> diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
> index bb5a575..cc5aebc 100644
> --- a/migration/qemu-file-channel.c
> +++ b/migration/qemu-file-channel.c
> @@ -27,8 +27,10 @@
>  #include "qemu-file.h"
>  #include "io/channel-socket.h"
>  #include "io/channel-tls.h"
> +#include "io/channel-file.h"
>  #include "qemu/iov.h"
>  #include "qemu/yank.h"
> +#include "qapi/error.h"
>  #include "yank_functions.h"
>
>
> @@ -192,3 +194,37 @@ QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
>      object_ref(OBJECT(ioc));
>      return qemu_fopen_ops(ioc, &channel_output_ops, true);
>  }
> +
> +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
> +                          const char *name, Error **errp)
> +{
>

I would add ERRP_GUARD();


> +    g_autoptr(QIOChannelFile) fioc = NULL;
> +    QIOChannel *ioc;
> +    QEMUFile *f;
> +
> +    if (flags & O_RDWR) {
> +        error_setg(errp, "qemu_fopen_file %s: O_RDWR not supported", path);
> +        return NULL;
> +    }
>

Why not take a "bool writable" instead, like the fdopen below?


> +
> +    fioc = qio_channel_file_new_path(path, flags, mode, errp);
> +    if (!fioc) {
> +        return NULL;
> +    }
> +
> +    ioc = QIO_CHANNEL(fioc);
> +    qio_channel_set_name(ioc, name);
> +    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
> +                             qemu_fopen_channel_input(ioc);
> +    return f;
>

"f" and parentheses are kinda superfluous


> +}
> +
> +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name)
> +{
> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
> +    QEMUFile *f = writable ? qemu_fopen_channel_output(ioc) :
> +                             qemu_fopen_channel_input(ioc);
> +    qio_channel_set_name(ioc, name);
> +    return f;
>

or:

g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
QIOChannel *ioc = QIO_CHANNEL(fioc);
qio_channel_set_name(ioc, name);
return writable ? qemu_fopen_channel_output(ioc) :
                  qemu_fopen_channel_input(ioc);


> +}
> diff --git a/migration/qemu-file-channel.h b/migration/qemu-file-channel.h
> index 0028a09..75fd0ad 100644
> --- a/migration/qemu-file-channel.h
> +++ b/migration/qemu-file-channel.h
> @@ -29,4 +29,10 @@
>
>  QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc);
>  QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc);
> +
> +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
> +                         const char *name, Error **errp);
> +
> +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name);
> +
>  #endif
> --
> 1.8.3.1
>
>
>

-- 
Marc-André Lureau


* Re: [PATCH V8 03/39] migration: simplify savevm
  2022-06-15 14:51 ` [PATCH V8 03/39] migration: simplify savevm Steve Sistare
@ 2022-06-16 14:59   ` Marc-André Lureau
  0 siblings, 0 replies; 84+ messages in thread
From: Marc-André Lureau @ 2022-06-16 14:59 UTC (permalink / raw)
  To: Steve Sistare
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Hi

On Wed, Jun 15, 2022 at 6:57 PM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Use qemu_file_open to simplify a few functions in savevm.c.
> No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>

(ok, I get why you keep the mode_t in fopen)

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

> ---
>  migration/savevm.c | 20 ++++++--------------
>  1 file changed, 6 insertions(+), 14 deletions(-)
>
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d907689..0b2c5cd 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2931,7 +2931,6 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
>                                  Error **errp)
>  {
>      QEMUFile *f;
> -    QIOChannelFile *ioc;
>      int saved_vm_running;
>      int ret;
>
> @@ -2945,14 +2944,11 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
>      vm_stop(RUN_STATE_SAVE_VM);
>      global_state_store_running();
>
> -    ioc = qio_channel_file_new_path(filename, O_WRONLY | O_CREAT | O_TRUNC,
> -                                    0660, errp);
> -    if (!ioc) {
> +    f = qemu_fopen_file(filename, O_WRONLY | O_CREAT | O_TRUNC, 0660,
> +                        "migration-xen-save-state", errp);
> +    if (!f) {
>          goto the_end;
>      }
> -    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-save-state");
> -    f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
> -    object_unref(OBJECT(ioc));
>      ret = qemu_save_device_state(f);
>      if (ret < 0 || qemu_fclose(f) < 0) {
>          error_setg(errp, QERR_IO_ERROR);
> @@ -2981,7 +2977,6 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
>  void qmp_xen_load_devices_state(const char *filename, Error **errp)
>  {
>      QEMUFile *f;
> -    QIOChannelFile *ioc;
>      int ret;
>
>      /* Guest must be paused before loading the device state; the RAM state
> @@ -2993,14 +2988,11 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
>      }
>      vm_stop(RUN_STATE_RESTORE_VM);
>
> -    ioc = qio_channel_file_new_path(filename, O_RDONLY | O_BINARY, 0, errp);
> -    if (!ioc) {
> +    f = qemu_fopen_file(filename, O_RDONLY | O_BINARY, 0,
> +                        "migration-xen-load-state", errp);
> +    if (!f) {
>          return;
>      }
> -    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-load-state");
> -    f = qemu_fopen_channel_input(QIO_CHANNEL(ioc));
> -    object_unref(OBJECT(ioc));
> -
>      ret = qemu_loadvm_state(f);
>      qemu_fclose(f);
>      if (ret < 0) {
> --
> 1.8.3.1
>
>
>

-- 
Marc-André Lureau


* Re: [PATCH V8 02/39] migration: qemu file wrappers
  2022-06-15 14:51 ` [PATCH V8 02/39] migration: qemu file wrappers Steve Sistare
  2022-06-16  2:18   ` Guoyi Tu
  2022-06-16 14:55   ` Marc-André Lureau
@ 2022-06-16 15:29   ` Daniel P. Berrangé
  2022-07-05 18:25     ` Steven Sistare
  2 siblings, 1 reply; 84+ messages in thread
From: Daniel P. Berrangé @ 2022-06-16 15:29 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On Wed, Jun 15, 2022 at 07:51:49AM -0700, Steve Sistare wrote:
> Add qemu_file_open and qemu_fd_open to create QEMUFile objects for unix
> files and file descriptors.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  migration/qemu-file-channel.c | 36 ++++++++++++++++++++++++++++++++++++
>  migration/qemu-file-channel.h |  6 ++++++
>  2 files changed, 42 insertions(+)
> 
> diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
> index bb5a575..cc5aebc 100644
> --- a/migration/qemu-file-channel.c
> +++ b/migration/qemu-file-channel.c
> @@ -27,8 +27,10 @@
>  #include "qemu-file.h"
>  #include "io/channel-socket.h"
>  #include "io/channel-tls.h"
> +#include "io/channel-file.h"
>  #include "qemu/iov.h"
>  #include "qemu/yank.h"
> +#include "qapi/error.h"
>  #include "yank_functions.h"
>  
>  
> @@ -192,3 +194,37 @@ QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
>      object_ref(OBJECT(ioc));
>      return qemu_fopen_ops(ioc, &channel_output_ops, true);
>  }
> +
> +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
> +                          const char *name, Error **errp)
> +{
> +    g_autoptr(QIOChannelFile) fioc = NULL;
> +    QIOChannel *ioc;
> +    QEMUFile *f;
> +
> +    if (flags & O_RDWR) {

IIRC, O_RDWR may expand to more than 1 bit, so needs a strict
equality test

   if ((flags & O_RDWR) == O_RDWR)
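
For illustration only (not part of the patch or of this review): a minimal
sketch of an access-mode check that masks with POSIX O_ACCMODE before
comparing, so the result does not depend on how O_RDONLY, O_WRONLY and
O_RDWR happen to be encoded on a given platform. The helper names are
hypothetical.

```c
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helpers: classify open(2) flags by masking out the
 * access mode first, rather than testing individual bits.
 */
bool flags_are_rdwr(int flags)
{
    return (flags & O_ACCMODE) == O_RDWR;
}

bool flags_are_wronly(int flags)
{
    return (flags & O_ACCMODE) == O_WRONLY;
}

bool flags_are_rdonly(int flags)
{
    return (flags & O_ACCMODE) == O_RDONLY;
}
```

The same mask would also apply to the later `flags & O_WRONLY` test that
selects between the output and input channel.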

> +        error_setg(errp, "qemu_fopen_file %s: O_RDWR not supported", path);
> +        return NULL;
> +    }
> +
> +    fioc = qio_channel_file_new_path(path, flags, mode, errp);
> +    if (!fioc) {
> +        return NULL;
> +    }
> +
> +    ioc = QIO_CHANNEL(fioc);
> +    qio_channel_set_name(ioc, name);
> +    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
> +                             qemu_fopen_channel_input(ioc);
> +    return f;
> +}
> +
> +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name)
> +{
> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
> +    QEMUFile *f = writable ? qemu_fopen_channel_output(ioc) :
> +                             qemu_fopen_channel_input(ioc);
> +    qio_channel_set_name(ioc, name);
> +    return f;
> +}
> diff --git a/migration/qemu-file-channel.h b/migration/qemu-file-channel.h
> index 0028a09..75fd0ad 100644
> --- a/migration/qemu-file-channel.h
> +++ b/migration/qemu-file-channel.h
> @@ -29,4 +29,10 @@
>  
>  QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc);
>  QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc);
> +
> +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
> +                         const char *name, Error **errp);
> +
> +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name);

Note we used the explicit names "_input" and "_output" in
the existing methods as they're more readable in the calling
sides than "true" / "false".

Similarly we had qemu_open vs qemu_create, so that we don't
have the ambiguity of whether 'mode' is needed or not. IOW,
I'd suggest we have 

 QEMUFile *qemu_fopen_file_output(const char *path, int mode,
                                  const char *name, Error **errp);
 QEMUFile *qemu_fopen_file_input(const char *path,
                                  const char *name, Error **errp);

 QEMUFile *qemu_fopen_fd_input(int fd, const char *name);
 QEMUFile *qemu_fopen_fd_output(int fd, const char *name);
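
As a rough libc analogue of the split proposed above (a sketch only; the
function names are hypothetical and this uses FILE and fdopen(), not
QEMUFile or QIOChannel), direction-specific wrappers let a call site state
its intent without a bare true/false argument:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical analogue of qemu_fopen_fd_input(): read-only stream. */
FILE *stream_open_fd_input(int fd)
{
    return fdopen(fd, "r");
}

/* Hypothetical analogue of qemu_fopen_fd_output(): write-only stream. */
FILE *stream_open_fd_output(int fd)
{
    return fdopen(fd, "w");
}

/* Round-trip a short message through a pipe; returns 0 on success. */
int stream_demo(void)
{
    int p[2];
    char buf[8] = "";
    FILE *out, *in;

    if (pipe(p) != 0) {
        return -1;
    }
    out = stream_open_fd_output(p[1]);
    in = stream_open_fd_input(p[0]);
    if (!out || !in) {
        return -1;
    }
    fputs("cpr", out);
    fclose(out);                /* flushes and closes the write end */
    if (!fgets(buf, sizeof(buf), in)) {
        fclose(in);
        return -1;
    }
    fclose(in);
    return strcmp(buf, "cpr") == 0 ? 0 : -1;
}
```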


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V8 05/39] vl: start on wakeup request
  2022-06-15 14:51 ` [PATCH V8 05/39] vl: start on wakeup request Steve Sistare
@ 2022-06-16 15:55   ` Marc-André Lureau
  2022-07-05 18:26     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: Marc-André Lureau @ 2022-06-16 15:55 UTC (permalink / raw)
  To: Steve Sistare
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Hi

On Wed, Jun 15, 2022 at 7:27 PM Steve Sistare <steven.sistare@oracle.com>
wrote:

> If qemu starts and loads a VM in the suspended state, then a later wakeup
> request will set the state to running, which is not sufficient to initialize
> the vm, as vm_start was never called during this invocation of qemu.  See
> qemu_system_wakeup_request().
>
> Define the start_on_wakeup_requested() hook to cause vm_start() to be called
> when processing the wakeup request.
>

Nothing calls qemu_system_start_on_wakeup_request() yet, so it would be
useful to say where this is going to be used next.

(otherwise, it seems ok to me)
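
To make the intended behaviour concrete, here is a small stand-alone model
of the one-shot flag described above. It is only a sketch: the model_*
names, the two-value state enum and the call counter are hypothetical
stand-ins, not QEMU's runstate code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for QEMU's run state and vm_start(). */
typedef enum { MODEL_SUSPENDED, MODEL_RUNNING } ModelState;

static ModelState model_state = MODEL_SUSPENDED;
static bool model_start_on_wakeup;
static int model_vm_start_count;

static void model_vm_start(void)
{
    model_vm_start_count++;
    model_state = MODEL_RUNNING;
}

/* Arm the hook: the next wakeup performs a full start. */
void model_start_on_wakeup_request(void)
{
    model_start_on_wakeup = true;
}

/* A wakeup consumes the flag once; later wakeups only set the state. */
void model_wakeup_request(void)
{
    if (model_start_on_wakeup) {
        model_start_on_wakeup = false;
        model_vm_start();
    } else {
        model_state = MODEL_RUNNING;
    }
}

int model_vm_start_calls(void)
{
    return model_vm_start_count;
}
```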


> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/sysemu/runstate.h |  1 +
>  softmmu/runstate.c        | 16 +++++++++++++++-
>  2 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
> index f3ed525..16c1c41 100644
> --- a/include/sysemu/runstate.h
> +++ b/include/sysemu/runstate.h
> @@ -57,6 +57,7 @@ void qemu_system_reset_request(ShutdownCause reason);
>  void qemu_system_suspend_request(void);
>  void qemu_register_suspend_notifier(Notifier *notifier);
>  bool qemu_wakeup_suspend_enabled(void);
> +void qemu_system_start_on_wakeup_request(void);
>  void qemu_system_wakeup_request(WakeupReason reason, Error **errp);
>  void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
>  void qemu_register_wakeup_notifier(Notifier *notifier);
> diff --git a/softmmu/runstate.c b/softmmu/runstate.c
> index fac7b63..9b27d74 100644
> --- a/softmmu/runstate.c
> +++ b/softmmu/runstate.c
> @@ -115,6 +115,7 @@ static const RunStateTransition runstate_transitions_def[] = {
>      { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING },
>      { RUN_STATE_PRELAUNCH, RUN_STATE_FINISH_MIGRATE },
>      { RUN_STATE_PRELAUNCH, RUN_STATE_INMIGRATE },
> +    { RUN_STATE_PRELAUNCH, RUN_STATE_SUSPENDED },
>
>      { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
>      { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PAUSED },
> @@ -335,6 +336,7 @@ void vm_state_notify(bool running, RunState state)
>      }
>  }
>
> +static bool start_on_wakeup_requested;
>  static ShutdownCause reset_requested;
>  static ShutdownCause shutdown_requested;
>  static int shutdown_signal;
> @@ -562,6 +564,11 @@ void qemu_register_suspend_notifier(Notifier *notifier)
>      notifier_list_add(&suspend_notifiers, notifier);
>  }
>
> +void qemu_system_start_on_wakeup_request(void)
> +{
> +    start_on_wakeup_requested = true;
> +}
> +
>  void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
>  {
>      trace_system_wakeup_request(reason);
> @@ -574,7 +581,14 @@ void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
>      if (!(wakeup_reason_mask & (1 << reason))) {
>          return;
>      }
> -    runstate_set(RUN_STATE_RUNNING);
> +
> +    if (start_on_wakeup_requested) {
> +        start_on_wakeup_requested = false;
> +        vm_start();
> +    } else {
> +        runstate_set(RUN_STATE_RUNNING);
> +    }
> +
>      wakeup_reason = reason;
>      qemu_notify_event();
>  }
> --
> 1.8.3.1
>
>
>

-- 
Marc-André Lureau


* Re: [PATCH V8 13/39] oslib: qemu_clear_cloexec
  2022-06-15 14:52 ` [PATCH V8 13/39] oslib: qemu_clear_cloexec Steve Sistare
@ 2022-06-16 16:01   ` Marc-André Lureau
  2022-06-16 16:07   ` Daniel P. Berrangé
  1 sibling, 0 replies; 84+ messages in thread
From: Marc-André Lureau @ 2022-06-16 16:01 UTC (permalink / raw)
  To: Steve Sistare
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On Wed, Jun 15, 2022 at 7:01 PM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Define qemu_clear_cloexec, analogous to qemu_set_cloexec.
>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>


> ---
>  include/qemu/osdep.h | 1 +
>  util/oslib-posix.c   | 9 +++++++++
>  util/oslib-win32.c   | 4 ++++
>  3 files changed, 14 insertions(+)
>
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index b1c161c..e916f3b 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -548,6 +548,7 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count)
>      G_GNUC_WARN_UNUSED_RESULT;
>
>  void qemu_set_cloexec(int fd);
> +void qemu_clear_cloexec(int fd);
>
>  /* Return a dynamically allocated directory path that is appropriate for storing
>   * local state.
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 7a34c16..421e987 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -261,6 +261,15 @@ void qemu_set_cloexec(int fd)
>      assert(f != -1);
>  }
>
> +void qemu_clear_cloexec(int fd)
> +{
> +    int f;
> +    f = fcntl(fd, F_GETFD);
> +    assert(f != -1);
> +    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
> +    assert(f != -1);
> +}
> +
>  char *
>  qemu_get_local_state_dir(void)
>  {
> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index 5723d3e..5bed148 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -226,6 +226,10 @@ void qemu_set_cloexec(int fd)
>  {
>  }
>
> +void qemu_clear_cloexec(int fd)
> +{
> +}
> +
>  int qemu_get_thread_id(void)
>  {
>      return GetCurrentThreadId();
> --
> 1.8.3.1
>
>
>
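
For readers unfamiliar with the descriptor flag involved, a minimal POSIX
sketch of setting, clearing and querying FD_CLOEXEC with fcntl(). The
function names are hypothetical and the asserts simply mirror the style of
the patch; this is not the QEMU implementation.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Set FD_CLOEXEC so @fd is closed automatically on exec(). */
void set_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);
    assert(f != -1);
    f = fcntl(fd, F_SETFD, f | FD_CLOEXEC);
    assert(f != -1);
}

/* Clear FD_CLOEXEC so @fd stays open across exec(). */
void clear_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);
    assert(f != -1);
    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
    assert(f != -1);
}

bool has_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);
    assert(f != -1);
    return (f & FD_CLOEXEC) != 0;
}

/* Set then clear the flag on a fresh pipe; returns 0 on success. */
int cloexec_roundtrip(void)
{
    int p[2];
    bool ok;

    if (pipe(p) != 0) {
        return -1;
    }
    set_cloexec(p[0]);
    ok = has_cloexec(p[0]);
    clear_cloexec(p[0]);
    ok = ok && !has_cloexec(p[0]);
    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```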

-- 
Marc-André Lureau


* Re: [PATCH V8 14/39] qapi: strList_from_string
  2022-06-15 14:52 ` [PATCH V8 14/39] qapi: strList_from_string Steve Sistare
@ 2022-06-16 16:04   ` Marc-André Lureau
  2022-07-05 18:28     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: Marc-André Lureau @ 2022-06-16 16:04 UTC (permalink / raw)
  To: Steve Sistare
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Hi

On Wed, Jun 15, 2022 at 7:04 PM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Generalize strList_from_comma_list() to take any delimiter character, rename
> as strList_from_string(), and move it to qapi/util.c.
>
> No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/qapi/util.h |  9 +++++++++
>  monitor/hmp-cmds.c  | 29 ++---------------------------
>  qapi/qapi-util.c    | 23 +++++++++++++++++++++++
>  3 files changed, 34 insertions(+), 27 deletions(-)
>
> diff --git a/include/qapi/util.h b/include/qapi/util.h
> index 81a2b13..7d88b09 100644
> --- a/include/qapi/util.h
> +++ b/include/qapi/util.h
> @@ -22,6 +22,8 @@ typedef struct QEnumLookup {
>      const int size;
>  } QEnumLookup;
>
> +struct strList;
> +
>

Suspicious: can't you include qapi/qapi-builtin-types.h here?

>  const char *qapi_enum_lookup(const QEnumLookup *lookup, int val);
>  int qapi_enum_parse(const QEnumLookup *lookup, const char *buf,
>                      int def, Error **errp);
> @@ -31,6 +33,13 @@ bool qapi_bool_parse(const char *name, const char *value, bool *obj,
>  int parse_qapi_name(const char *name, bool complete);
>
>  /*
> + * Produce a strList from the character delimited string @in.
> + * All strings are g_strdup'd.
> + * A NULL or empty input string returns NULL.
> + */
> +struct strList *strList_from_string(const char *in, char delim);
> +
> +/*
>   * For any GenericList @list, insert @element at the front.
>   *
>   * Note that this macro evaluates @element exactly once, so it is safe
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index bb12589..9f58b1f 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -43,6 +43,7 @@
>  #include "qapi/qapi-commands-run-state.h"
>  #include "qapi/qapi-commands-tpm.h"
>  #include "qapi/qapi-commands-ui.h"
> +#include "qapi/util.h"
>  #include "qapi/qapi-visit-net.h"
>  #include "qapi/qapi-visit-migration.h"
>  #include "qapi/qmp/qdict.h"
> @@ -70,32 +71,6 @@ bool hmp_handle_error(Monitor *mon, Error *err)
>      return false;
>  }
>
> -/*
> - * Produce a strList from a comma separated list.
> - * A NULL or empty input string return NULL.
> - */
> -static strList *strList_from_comma_list(const char *in)
> -{
> -    strList *res = NULL;
> -    strList **tail = &res;
> -
> -    while (in && in[0]) {
> -        char *comma = strchr(in, ',');
> -        char *value;
> -
> -        if (comma) {
> -            value = g_strndup(in, comma - in);
> -            in = comma + 1; /* skip the , */
> -        } else {
> -            value = g_strdup(in);
> -            in = NULL;
> -        }
> -        QAPI_LIST_APPEND(tail, value);
> -    }
> -
> -    return res;
> -}
> -
>  void hmp_info_name(Monitor *mon, const QDict *qdict)
>  {
>      NameInfo *info;
> @@ -1115,7 +1090,7 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
>                                              migrate_announce_params());
>
>      qapi_free_strList(params->interfaces);
> -    params->interfaces = strList_from_comma_list(interfaces_str);
> +    params->interfaces = strList_from_string(interfaces_str, ',');
>      params->has_interfaces = params->interfaces != NULL;
>      params->id = g_strdup(id);
>      params->has_id = !!params->id;
> diff --git a/qapi/qapi-util.c b/qapi/qapi-util.c
> index 63596e1..b61c73c 100644
> --- a/qapi/qapi-util.c
> +++ b/qapi/qapi-util.c
> @@ -15,6 +15,7 @@
>  #include "qapi/error.h"
>  #include "qemu/ctype.h"
>  #include "qapi/qmp/qerror.h"
> +#include "qapi/qapi-builtin-types.h"
>
>  CompatPolicy compat_policy;
>
> @@ -152,3 +153,25 @@ int parse_qapi_name(const char *str, bool complete)
>      }
>      return p - str;
>  }
> +
> +strList *strList_from_string(const char *in, char delim)
> +{
> +    strList *res = NULL;
> +    strList **tail = &res;
> +
> +    while (in && in[0]) {
> +        char *next = strchr(in, delim);
> +        char *value;
> +
> +        if (next) {
> +            value = g_strndup(in, next - in);
> +            in = next + 1; /* skip the delim */
> +        } else {
> +            value = g_strdup(in);
> +            in = NULL;
> +        }
> +        QAPI_LIST_APPEND(tail, value);
> +    }
> +
> +    return res;
> +}
> --
> 1.8.3.1
>
>
>
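
To illustrate the splitting behaviour of the loop in the patch outside of
QEMU, here is a plain-C sketch that follows the same control flow. StrNode
and the str_list_* names are hypothetical stand-ins for strList and the
glib/QAPI helpers, and a non-NUL delimiter is assumed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for QAPI's generated strList. */
typedef struct StrNode {
    struct StrNode *next;
    char *value;
} StrNode;

/*
 * Split @in on @delim (assumed non-NUL), copying each piece.
 * A NULL or empty input returns NULL.
 */
StrNode *str_list_from_string(const char *in, char delim)
{
    StrNode *res = NULL;
    StrNode **tail = &res;

    while (in && in[0]) {
        const char *next = strchr(in, delim);
        size_t len = next ? (size_t)(next - in) : strlen(in);
        StrNode *node = calloc(1, sizeof(*node));

        if (!node) {
            abort();
        }
        node->value = malloc(len + 1);
        if (!node->value) {
            abort();
        }
        memcpy(node->value, in, len);
        node->value[len] = '\0';

        *tail = node;                  /* append at the tail */
        tail = &node->next;
        in = next ? next + 1 : NULL;   /* skip the delimiter */
    }
    return res;
}

void str_list_free(StrNode *list)
{
    while (list) {
        StrNode *next = list->next;
        free(list->value);
        free(list);
        list = next;
    }
}
```

With this control flow, adjacent delimiters yield an empty element, while a
single trailing delimiter does not add one, because the loop stops as soon
as the remaining input is empty.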

-- 
Marc-André Lureau


* Re: [PATCH V8 15/39] qapi: QAPI_LIST_LENGTH
  2022-06-15 14:52 ` [PATCH V8 15/39] qapi: QAPI_LIST_LENGTH Steve Sistare
@ 2022-06-16 16:06   ` Marc-André Lureau
  0 siblings, 0 replies; 84+ messages in thread
From: Marc-André Lureau @ 2022-06-16 16:06 UTC (permalink / raw)
  To: Steve Sistare
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

Hi

On Wed, Jun 15, 2022 at 7:38 PM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>


> ---
>  include/qapi/util.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/include/qapi/util.h b/include/qapi/util.h
> index 7d88b09..75dddca 100644
> --- a/include/qapi/util.h
> +++ b/include/qapi/util.h
> @@ -65,4 +65,17 @@ struct strList *strList_from_string(const char *in, char delim);
>      (tail) = &(*(tail))->next; \
>  } while (0)
>
> +/*
> + * For any GenericList @list, return its length.
> + */
> +#define QAPI_LIST_LENGTH(list) \
> +    ({ \
> +        int len = 0; \
> +        typeof(list) elem; \
> +        for (elem = list; elem != NULL; elem = elem->next) { \
> +            len++; \
> +        } \
> +        len; \
> +    })
> +
>  #endif
> --
> 1.8.3.1
>
>
>

-- 
Marc-André Lureau

* Re: [PATCH V8 13/39] oslib: qemu_clear_cloexec
  2022-06-15 14:52 ` [PATCH V8 13/39] oslib: qemu_clear_cloexec Steve Sistare
  2022-06-16 16:01   ` Marc-André Lureau
@ 2022-06-16 16:07   ` Daniel P. Berrangé
  2022-07-05 18:27     ` Steven Sistare
  1 sibling, 1 reply; 84+ messages in thread
From: Daniel P. Berrangé @ 2022-06-16 16:07 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On Wed, Jun 15, 2022 at 07:52:00AM -0700, Steve Sistare wrote:
> Define qemu_clear_cloexec, analogous to qemu_set_cloexec.
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/qemu/osdep.h | 1 +
>  util/oslib-posix.c   | 9 +++++++++
>  util/oslib-win32.c   | 4 ++++
>  3 files changed, 14 insertions(+)
> 
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index b1c161c..e916f3b 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -548,6 +548,7 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count)
>      G_GNUC_WARN_UNUSED_RESULT;
>  
>  void qemu_set_cloexec(int fd);
> +void qemu_clear_cloexec(int fd);

I'm a little wary of adding this helper without any accompanying
comment.

It is almost never correct to use this new method in a threaded
program like QEMU, unless you have strong confidence that all
the other threads are idle and not liable to perform a fork+exec
for any other reason.

IIUC, this can be satisfied by the CPR code because it will be
used only immediately before exec'ing the updated QEMU binary,
and it has suspended any other CPUs and no other monitor
commands are concurrently running.

IOW, I just ask that you put a comment with a big warning that
essentially no one should use this method, except CPR code.
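
To illustrate the kind of warning meant here, a minimal sketch follows.
It is not the code from the patch: clear_cloexec_sketch() is a
hypothetical stand-in for qemu_clear_cloexec(), and the comment wording
is only a suggestion. It uses nothing beyond POSIX fcntl().

```c
#include <assert.h>
#include <fcntl.h>

/*
 * WARNING: clearing FD_CLOEXEC is almost never safe in a threaded
 * program.  Any other thread that does fork+exec while the flag is
 * clear will leak this descriptor into the child.  Only call this when
 * all other threads are known to be idle, e.g. immediately before
 * exec'ing a replacement binary.
 *
 * Hypothetical stand-in for qemu_clear_cloexec(); the name is an
 * assumption made for this sketch.
 */
static void clear_cloexec_sketch(int fd)
{
    int flags = fcntl(fd, F_GETFD);      /* read the descriptor flags */

    assert(flags != -1);
    flags = fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);  /* drop the flag */
    assert(flags != -1);
}
```

The descriptor then survives the exec, which is the desired effect for
CPR and the hazard for every other caller.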


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH V8 16/39] qapi: strv_from_strList
  2022-06-15 14:52 ` [PATCH V8 16/39] qapi: strv_from_strList Steve Sistare
@ 2022-06-16 16:08   ` Marc-André Lureau
  2022-07-05 18:28     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: Marc-André Lureau @ 2022-06-16 16:08 UTC (permalink / raw)
  To: Steve Sistare
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

[-- Attachment #1: Type: text/plain, Size: 1789 bytes --]

Hi

On Wed, Jun 15, 2022 at 7:30 PM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/qapi/util.h |  6 ++++++
>  qapi/qapi-util.c    | 14 ++++++++++++++
>  2 files changed, 20 insertions(+)
>
> diff --git a/include/qapi/util.h b/include/qapi/util.h
> index 75dddca..51ff64e 100644
> --- a/include/qapi/util.h
> +++ b/include/qapi/util.h
> @@ -33,6 +33,12 @@ bool qapi_bool_parse(const char *name, const char *value, bool *obj,
>  int parse_qapi_name(const char *name, bool complete);
>
>  /*
> + * Produce and return a NULL-terminated array of strings from @args.
> + * All strings are g_strdup'd.
> + */
> +GStrv strv_from_strList(const struct strList *args);
> +
> +/*
>   * Produce a strList from the character delimited string @in.
>   * All strings are g_strdup'd.
>   * A NULL or empty input string returns NULL.
> diff --git a/qapi/qapi-util.c b/qapi/qapi-util.c
> index b61c73c..8c96cab 100644
> --- a/qapi/qapi-util.c
> +++ b/qapi/qapi-util.c
> @@ -154,6 +154,20 @@ int parse_qapi_name(const char *str, bool complete)
>      return p - str;
>  }
>
> +GStrv strv_from_strList(const strList *args)
> +{
> +    const strList *arg;
> +    int i = 0;
> +    GStrv argv = g_malloc((QAPI_LIST_LENGTH(args) + 1) * sizeof(char *));
> +
>

Better use g_new() here. Otherwise:
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
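
For context, g_new(type, n) returns a correctly typed pointer and guards
the n * sizeof(type) multiplication against overflow, which a bare
g_malloc(n * sizeof(...)) does not. Here the call would presumably
become something like g_new(char *, QAPI_LIST_LENGTH(args) + 1). The
plain-C sketch below only mimics that behaviour so it can be built
without GLib; alloc_n_sketch(), NEW_SKETCH() and argv_alloc_sketch() are
hypothetical names, and unlike g_new() the helper returns NULL on
overflow instead of aborting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n elements of the given size, refusing if n * size overflows. */
static void *alloc_n_sketch(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size) {
        return NULL;                /* g_new() would abort here instead */
    }
    return malloc(n * size);
}

/* Typed wrapper: the cast ties the result to the element type. */
#define NEW_SKETCH(type, n) ((type *)alloc_n_sketch((n), sizeof(type)))

/* Allocate a NULL-terminated string vector with room for nargs entries. */
static char **argv_alloc_sketch(size_t nargs)
{
    char **argv = NEW_SKETCH(char *, nargs + 1);

    assert(argv != NULL);
    argv[nargs] = NULL;             /* terminator, as in a GStrv */
    return argv;
}
```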


> +    for (arg = args; arg != NULL; arg = arg->next) {
> +        argv[i++] = g_strdup(arg->value);
> +    }
> +    argv[i] = NULL;
> +
> +    return argv;
> +}
> +
>  strList *strList_from_string(const char *in, char delim)
>  {
>      strList *res = NULL;
> --
> 1.8.3.1
>
>
>

-- 
Marc-André Lureau

* Re: [PATCH V8 17/39] qapi: strList unit tests
  2022-06-15 14:52 ` [PATCH V8 17/39] qapi: strList unit tests Steve Sistare
@ 2022-06-16 16:10   ` Marc-André Lureau
  0 siblings, 0 replies; 84+ messages in thread
From: Marc-André Lureau @ 2022-06-16 16:10 UTC (permalink / raw)
  To: Steve Sistare
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

[-- Attachment #1: Type: text/plain, Size: 3835 bytes --]

On Wed, Jun 15, 2022 at 6:58 PM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>


> ---
>  MAINTAINERS               |  1 +
>  tests/unit/meson.build    |  1 +
>  tests/unit/test-strlist.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 83 insertions(+)
>  create mode 100644 tests/unit/test-strlist.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1e4e72f..f9a6362 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3160,6 +3160,7 @@ F: include/migration/cpr.h
>  F: migration/cpr.c
>  F: qapi/cpr.json
>  F: stubs/cpr.c
> +F: tests/unit/test-strlist.c
>
>  Record/replay
>  M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
> diff --git a/tests/unit/meson.build b/tests/unit/meson.build
> index 287b367..57d48d5 100644
> --- a/tests/unit/meson.build
> +++ b/tests/unit/meson.build
> @@ -17,6 +17,7 @@ tests = {
>    'test-forward-visitor': [testqapi],
>    'test-string-input-visitor': [testqapi],
>    'test-string-output-visitor': [testqapi],
> +  'test-strlist': [testqapi],
>    'test-opts-visitor': [testqapi],
>    'test-visitor-serialization': [testqapi],
>    'test-bitmap': [],
> diff --git a/tests/unit/test-strlist.c b/tests/unit/test-strlist.c
> new file mode 100644
> index 0000000..ef740dc
> --- /dev/null
> +++ b/tests/unit/test-strlist.c
> @@ -0,0 +1,81 @@
> +/*
> + * Copyright (c) 2022 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/util.h"
> +#include "qapi/qapi-builtin-types.h"
> +
> +static strList *make_list(int length)
> +{
> +    strList *head = 0, *list, **prev = &head;
> +
> +    while (length--) {
> +        list = *prev = g_new0(strList, 1);
> +        list->value = g_strdup("aaa");
> +        prev = &list->next;
> +    }
> +    return head;
> +}
> +
> +static void test_length(void)
> +{
> +    strList *list;
> +    int i;
> +
> +    for (i = 0; i < 5; i++) {
> +        list = make_list(i);
> +        g_assert_cmpint(i, ==, QAPI_LIST_LENGTH(list));
> +        qapi_free_strList(list);
> +    }
> +}
> +
> +struct {
> +    const char *string;
> +    char delim;
> +    const char *args[5];
> +} list_data[] = {
> +    { 0, ',', { 0 } },
> +    { "", ',', { 0 } },
> +    { "a", ',', { "a", 0 } },
> +    { "a,b", ',', { "a", "b", 0 } },
> +    { "a,b,c", ',', { "a", "b", "c", 0 } },
> +    { "first last", ' ', { "first", "last", 0 } },
> +    { "a:", ':', { "a", 0 } },
> +    { "a::b", ':', { "a", "", "b", 0 } },
> +    { ":", ':', { "", 0 } },
> +    { ":a", ':', { "", "a", 0 } },
> +    { "::a", ':', { "", "", "a", 0 } },
> +};
> +
> +static void test_strv(void)
> +{
> +    int i, j;
> +    const char **expect;
> +    strList *list;
> +    GStrv args;
> +
> +    for (i = 0; i < ARRAY_SIZE(list_data); i++) {
> +        expect = list_data[i].args;
> +        list = strList_from_string(list_data[i].string, list_data[i].delim);
> +        args = strv_from_strList(list);
> +        qapi_free_strList(list);
> +        for (j = 0; expect[j] && args[j]; j++) {
> +            g_assert_cmpstr(expect[j], ==, args[j]);
> +        }
> +        g_assert_null(expect[j]);
> +        g_assert_null(args[j]);
> +        g_strfreev(args);
> +    }
> +}
> +
> +int main(int argc, char **argv)
> +{
> +    g_test_init(&argc, &argv, NULL);
> +    g_test_add_func("/test-string/length", test_length);
> +    g_test_add_func("/test-string/strv", test_strv);
> +    return g_test_run();
> +}
> --
> 1.8.3.1
>
>
>

-- 
Marc-André Lureau

* Re: [PATCH V8 38/39] python/machine: add QEMUMachine accessors
  2022-06-15 14:52 ` [PATCH V8 38/39] python/machine: add QEMUMachine accessors Steve Sistare
@ 2022-06-17 14:16   ` John Snow
  2022-07-05 18:30     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: John Snow @ 2022-06-17 14:16 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Daniel P. Berrange, Juan Quintela, Markus Armbruster, Eric Blake,
	Jason Zeng, Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand

[-- Attachment #1: Type: text/plain, Size: 1914 bytes --]

On Wed, Jun 15, 2022, 11:27 AM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Provide full_args() to return all command-line arguments used to start a
> vm, some of which are not otherwise visible to QEMUMachine clients.  This
> is needed by the cpr test, which must start a vm, then pass all qemu
> command-line arguments to the cpr-exec monitor call.
>
> Provide reopen_qmp_connection() to reopen a closed monitor connection.
> This is needed by cpr, because qemu-exec closes the monitor socket.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  python/qemu/machine/machine.py | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py
> index 37191f4..60b934d 100644
> --- a/python/qemu/machine/machine.py
> +++ b/python/qemu/machine/machine.py
> @@ -332,6 +332,11 @@ def args(self) -> List[str]:
>          """Returns the list of arguments given to the QEMU binary."""
>          return self._args
>
> +    @property
> +    def full_args(self) -> List[str]:
> +        """Returns the full list of arguments used to launch QEMU."""
> +        return list(self._qemu_full_args)
> +
>

OK

>      def _pre_launch(self) -> None:
>          if self._console_set:
>              self._remove_files.append(self._console_address)
> @@ -486,6 +491,15 @@ def _close_qmp_connection(self) -> None:
>          finally:
>              self._qmp_connection = None
>
> +    def reopen_qmp_connection(self):
> +        self._close_qmp_connection()
> +        self._qmp_connection = QEMUMonitorProtocol(
> +            self._monitor_address,
> +            server=True,
> +            nickname=self._name
> +        )
> +        self._qmp.accept(self._qmp_timer)
> +
>

Unrelated change, please split into a new commit. (Sorry.)

Seems harmless enough, though. Happy to give RB and AB to both if you split
the commits.

--js

* Re: [PATCH V8 24/39] pci: export export msix_is_pending
  2022-06-15 14:52 ` [PATCH V8 24/39] pci: export export msix_is_pending Steve Sistare
@ 2022-06-27 22:44   ` Michael S. Tsirkin
  2022-07-05 18:29     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: Michael S. Tsirkin @ 2022-06-27 22:44 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On Wed, Jun 15, 2022 at 07:52:11AM -0700, Steve Sistare wrote:
> Export msix_is_pending for use by cpr.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

The subject repeats "export" twice.
With that fixed:

Acked-by: Michael S. Tsirkin <mst@redhat.com>


> ---
>  hw/pci/msix.c         | 2 +-
>  include/hw/pci/msix.h | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> index ae9331c..e492ce0 100644
> --- a/hw/pci/msix.c
> +++ b/hw/pci/msix.c
> @@ -64,7 +64,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
>      return dev->msix_pba + vector / 8;
>  }
>  
> -static int msix_is_pending(PCIDevice *dev, int vector)
> +int msix_is_pending(PCIDevice *dev, unsigned int vector)
>  {
>      return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
>  }
> diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
> index 4c4a60c..0065354 100644
> --- a/include/hw/pci/msix.h
> +++ b/include/hw/pci/msix.h
> @@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
>  bool msix_is_masked(PCIDevice *dev, unsigned vector);
>  void msix_set_pending(PCIDevice *dev, unsigned vector);
>  void msix_clr_pending(PCIDevice *dev, int vector);
> +int msix_is_pending(PCIDevice *dev, unsigned vector);
>  
>  int msix_vector_use(PCIDevice *dev, unsigned vector);
>  void msix_vector_unuse(PCIDevice *dev, unsigned vector);
> -- 
> 1.8.3.1



* Re: [PATCH V8 27/39] vfio-pci: cpr part 1 (fd and dma)
  2022-06-15 14:52 ` [PATCH V8 27/39] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
@ 2022-06-29 19:14   ` Alex Williamson
  2022-07-06 17:45     ` Steven Sistare
  2022-07-03  8:32   ` Peng Liang
  1 sibling, 1 reply; 84+ messages in thread
From: Alex Williamson @ 2022-06-29 19:14 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On Wed, 15 Jun 2022 07:52:14 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:

> Enable vfio-pci devices to be saved and restored across an exec restart
> of qemu.
> 
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in cpr state.
> 
> In the container pre_save handler, suspend the use of virtual addresses in
> DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be
> remapped at a different VA after exec.  DMA to already-mapped pages
> continues.  Save the msi message area as part of vfio-pci vmstate, save the
> interrupt and notifier eventfd's in cpr state, and clear the close-on-exec
> flag for the vfio descriptors.  The flag is not cleared earlier because the
> descriptors should not persist across miscellaneous fork and exec calls
> that may be performed during normal operation.
> 
> On qemu restart, vfio_realize() finds the saved descriptors, uses
> the descriptors, and notes that the device is being reused.  Device and
> iommu state is already configured, so operations in vfio_realize that
> would modify the configuration are skipped for a reused device, including
> vfio ioctl's and writes to PCI configuration space.  Vfio PCI device reset
> is also suppressed. The result is that vfio_realize constructs qemu data
> structures that reflect the current state of the device.  However, the
> reconstruction is not complete until cpr-load is called. cpr-load loads the
> msi data.  The vfio post_load handler finds eventfds in cpr state, rebuilds
> vector data structures, and attaches the interrupts to the new KVM instance.
> The container post_load handler then invokes the main vfio listener
> callback, which walks the flattened ranges of the vfio address space and
> calls VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly,
> cpr-load starts the VM.
> 
> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> support.  Part 3 adds INTX support.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  MAINTAINERS                   |   1 +
>  hw/pci/pci.c                  |  12 ++++
>  hw/vfio/common.c              | 151 +++++++++++++++++++++++++++++++++++-------
>  hw/vfio/cpr.c                 | 119 +++++++++++++++++++++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 |  44 ++++++++++++
>  hw/vfio/trace-events          |   1 +
>  include/hw/vfio/vfio-common.h |  11 +++
>  include/migration/vmstate.h   |   1 +
>  9 files changed, 317 insertions(+), 24 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 74a43e6..864aec6 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3156,6 +3156,7 @@ CPR
>  M: Steve Sistare <steven.sistare@oracle.com>
>  M: Mark Kanda <mark.kanda@oracle.com>
>  S: Maintained
> +F: hw/vfio/cpr.c
>  F: include/migration/cpr.h
>  F: migration/cpr.c
>  F: qapi/cpr.json
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 6e70153..a3b19eb 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -32,6 +32,7 @@
>  #include "hw/pci/pci_host.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/qdev-properties-system.h"
> +#include "migration/cpr.h"
>  #include "migration/qemu-file-types.h"
>  #include "migration/vmstate.h"
>  #include "monitor/monitor.h"
> @@ -341,6 +342,17 @@ static void pci_reset_regions(PCIDevice *dev)
>  
>  static void pci_do_device_reset(PCIDevice *dev)
>  {
> +    /*
> +     * A PCI device that is resuming for cpr is already configured, so do
> +     * not reset it here when we are called from qemu_system_reset prior to
> +     * cpr-load, else interrupts may be lost for vfio-pci devices.  It is
> +     * safe to skip this reset for all PCI devices, because cpr-load will set
> +     * all fields that would have been set here.
> +     */
> +    if (cpr_get_mode() == CPR_MODE_RESTART) {
> +        return;
> +    }
> +
>      pci_device_deassert_intx(dev);
>      assert(dev->irq_state == 0);
>  
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index ace9562..c7d73b6 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -31,6 +31,7 @@
>  #include "exec/memory.h"
>  #include "exec/ram_addr.h"
>  #include "hw/hw.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/range.h"
> @@ -460,6 +461,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> +    assert(!container->reused);
> +
>      if (iotlb && container->dirty_pages_supported &&
>          vfio_devices_all_running_and_saving(container)) {
>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> @@ -496,12 +499,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>  {
>      struct vfio_iommu_type1_dma_map map = {
>          .argsz = sizeof(map),
> -        .flags = VFIO_DMA_MAP_FLAG_READ,
>          .vaddr = (__u64)(uintptr_t)vaddr,
>          .iova = iova,
>          .size = size,
>      };
>  
> +    /*
> +     * Set the new vaddr for any mappings registered during cpr-load.
> +     * Reused is cleared thereafter.
> +     */
> +    if (container->reused) {
> +        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
> +        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +            goto fail;
> +        }
> +        return 0;
> +    }
> +
> +    map.flags = VFIO_DMA_MAP_FLAG_READ;
>      if (!readonly) {
>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>      }
> @@ -517,7 +532,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          return 0;
>      }
>  
> -    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
> +fail:
> +    error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
> +        (container->reused ? "VADDR" : ""), iova, size, vaddr, strerror(errno));
>      return -errno;
>  }
>  
> @@ -882,6 +899,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    vfio_container_region_add(container, section);
> +}
> +
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section)
> +{
>      hwaddr iova, end;
>      Int128 llend, llsize;
>      void *vaddr;
> @@ -1492,6 +1515,12 @@ static void vfio_listener_release(VFIOContainer *container)
>      }
>  }
>  
> +void vfio_listener_register(VFIOContainer *container)
> +{
> +    container->listener = vfio_memory_listener;
> +    memory_listener_register(&container->listener, container->space->as);
> +}
> +
>  static struct vfio_info_cap_header *
>  vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
>  {
> @@ -1910,6 +1939,22 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>  {
>      int iommu_type, ret;
>  
> +    /*
> +     * If container is reused, just set its type and skip the ioctls, as the
> +     * container and group are already configured in the kernel.
> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> +     */
> +    if (container->reused) {
> +        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU)) {
> +            container->iommu_type = VFIO_TYPE1v2_IOMMU;
> +            return 0;
> +        } else {
> +            error_setg(errp, "container was reused but VFIO_TYPE1v2_IOMMU "
> +                             "is not supported");
> +            return -errno;
> +        }
> +    }
> +
>      iommu_type = vfio_get_iommu_type(container, errp);
>      if (iommu_type < 0) {
>          return iommu_type;
> @@ -2014,9 +2059,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  {
>      VFIOContainer *container;
>      int ret, fd;
> +    bool reused;
>      VFIOAddressSpace *space;
>  
>      space = vfio_get_address_space(as);
> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
> +    reused = (fd > 0);
>  
>      /*
>       * VFIO is currently incompatible with discarding of RAM insofar as the
> @@ -2049,27 +2097,47 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       * details once we know which type of IOMMU we are using.
>       */
>  
> +    /*
> +     * If the container is reused, then the group is already attached in the
> +     * kernel.  If a container with matching fd is found, then update the
> +     * userland group list and return.  If not, then after the loop, create
> +     * the container struct and group list.
> +     */
> +
>      QLIST_FOREACH(container, &space->containers, next) {
> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> -            ret = vfio_ram_block_discard_disable(container, true);
> -            if (ret) {
> -                error_setg_errno(errp, -ret,
> -                                 "Cannot set discarding of RAM broken");
> -                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
> -                          &container->fd)) {
> -                    error_report("vfio: error disconnecting group %d from"
> -                                 " container", group->groupid);
> -                }
> -                return ret;
> +        if (reused) {
> +            if (container->fd != fd) {
> +                continue;
>              }
> -            group->container = container;
> -            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +            continue;
> +        }
> +
> +        ret = vfio_ram_block_discard_disable(container, true);
> +        if (ret) {
> +            error_setg_errno(errp, -ret,
> +                             "Cannot set discarding of RAM broken");
> +            if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
> +                      &container->fd)) {
> +                error_report("vfio: error disconnecting group %d from"
> +                             " container", group->groupid);
> +            }
> +            return ret;
> +        }
> +        group->container = container;
> +        QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +        if (!reused) {
>              vfio_kvm_device_add_group(group);
> -            return 0;
> +            cpr_save_fd("vfio_container_for_group", group->groupid,
> +                        container->fd);
>          }
> +        return 0;
> +    }
> +
> +    if (!reused) {
> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>      }
>  
> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>          ret = -errno;
> @@ -2087,6 +2155,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      container = g_malloc0(sizeof(*container));
>      container->space = space;
>      container->fd = fd;
> +    container->reused = reused;
>      container->error = NULL;
>      container->dirty_pages_supported = false;
>      container->dma_max_mappings = 0;
> @@ -2099,6 +2168,11 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>          goto free_container_exit;
>      }
>  
> +    ret = vfio_cpr_register_container(container, errp);
> +    if (ret) {
> +        goto free_container_exit;
> +    }
> +
>      ret = vfio_ram_block_discard_disable(container, true);
>      if (ret) {
>          error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
> @@ -2213,9 +2287,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      group->container = container;
>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>  
> -    container->listener = vfio_memory_listener;
> -
> -    memory_listener_register(&container->listener, container->space->as);
> +    /*
> +     * If reused, register the listener later, after all state that may
> +     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
> +     * the listener will invoke its callback on each flat section and call
> +     * vfio_dma_map to supply the new vaddr, and the calls will match the
> +     * mappings remembered by the kernel.
> +     */
> +    if (!reused) {
> +        vfio_listener_register(container);
> +    }
>  
>      if (container->error) {
>          ret = -1;
> @@ -2225,8 +2306,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      container->initialized = true;
> +    ret = cpr_resave_fd("vfio_container_for_group", group->groupid, fd, errp);
>  
> -    return 0;
> +    return ret;


This needs to fall through to unwind if that resave fails.

There also needs to be vfio_cpr_unregister_container() and
cpr_delete_fd() calls in the unwind below, right?
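
As a rough sketch of the shape being asked for (not the actual fix): a
failure in the last step should jump to an unwind label that undoes the
earlier steps in reverse order, rather than returning the error
directly. Everything below is a hypothetical model; the struct fields
merely stand in for vfio_cpr_register_container(), the memory listener,
and the cpr_resave_fd()/cpr_delete_fd() pair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state that the connect path builds up. */
struct connect_sketch {
    bool registered;    /* models vfio_cpr_register_container() */
    bool listener;      /* models the registered memory listener */
    bool fd_saved;      /* models the fd recorded by cpr_resave_fd() */
};

/* fail_resave simulates an error from the final resave step. */
static int connect_sketch_run(struct connect_sketch *s, bool fail_resave)
{
    int ret;

    assert(s != NULL);
    s->registered = true;
    s->listener = true;

    ret = fail_resave ? -1 : 0;     /* stands in for cpr_resave_fd() */
    if (ret) {
        goto unwind;                /* fall into cleanup, do not return */
    }
    s->fd_saved = true;
    return 0;

unwind:
    s->fd_saved = false;            /* stands in for cpr_delete_fd() */
    s->listener = false;            /* release the listener */
    s->registered = false;          /* stands in for the unregister call */
    return ret;
}
```

With this shape a failed resave leaves no registered container or saved
descriptor behind.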


>  listener_release_exit:
>      QLIST_REMOVE(group, container_next);
>      QLIST_REMOVE(container, next);
> @@ -2254,6 +2336,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  
>      QLIST_REMOVE(group, container_next);
>      group->container = NULL;
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>  
>      /*
>       * Explicitly release the listener first before unset container,
> @@ -2290,6 +2373,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>          }
>  
>          trace_vfio_disconnect_container(container->fd);
> +        vfio_cpr_unregister_container(container);
>          close(container->fd);
>          g_free(container);
>  
> @@ -2319,7 +2403,12 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      group = g_malloc0(sizeof(*group));
>  
>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open_old(path, O_RDWR);
> +
> +    group->fd = cpr_find_fd("vfio_group", groupid);
> +    if (group->fd < 0) {
> +        group->fd = qemu_open_old(path, O_RDWR);
> +    }
> +
>      if (group->fd < 0) {
>          error_setg_errno(errp, errno, "failed to open %s", path);
>          goto free_group_exit;
> @@ -2353,6 +2442,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  
>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>  
> +    if (cpr_resave_fd("vfio_group", groupid, group->fd, errp)) {
> +        goto close_fd_exit;
> +    }
> +
>      return group;
>  
>  close_fd_exit:
> @@ -2377,6 +2470,7 @@ void vfio_put_group(VFIOGroup *group)
>      vfio_disconnect_container(group);
>      QLIST_REMOVE(group, next);
>      trace_vfio_put_group(group->fd);
> +    cpr_delete_fd("vfio_group", group->groupid);
>      close(group->fd);
>      g_free(group);
>  
> @@ -2390,8 +2484,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>  {
>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>      int ret, fd;
> +    bool reused;
> +
> +    fd = cpr_find_fd(name, 0);
> +    reused = (fd >= 0);
> +    if (!reused) {
> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    }
>  
> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "error getting device from group %d",
>                           group->groupid);
> @@ -2436,12 +2536,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>      vbasedev->num_irqs = dev_info.num_irqs;
>      vbasedev->num_regions = dev_info.num_regions;
>      vbasedev->flags = dev_info.flags;
> +    vbasedev->reused = reused;
>  
>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>                            dev_info.num_irqs);
>  
>      vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
> -    return 0;
> +    ret = cpr_resave_fd(name, 0, fd, errp);
> +    return ret;


This requires new unwind code.


>  }
>  
>  void vfio_put_base_device(VFIODevice *vbasedev)
> @@ -2452,6 +2554,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>      QLIST_REMOVE(vbasedev, next);
>      vbasedev->group = NULL;
>      trace_vfio_put_base_device(vbasedev->fd);
> +    cpr_delete_fd(vbasedev->name, 0);
>      close(vbasedev->fd);
>  }
>  
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> new file mode 100644
> index 0000000..a227d5e
> --- /dev/null
> +++ b/hw/vfio/cpr.c
> @@ -0,0 +1,119 @@
> +/*
> + * Copyright (c) 2021, 2022 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +#include "hw/vfio/vfio-common.h"
> +#include "sysemu/kvm.h"
> +#include "qapi/error.h"
> +#include "migration/cpr.h"
> +#include "migration/vmstate.h"
> +#include "trace.h"
> +
> +static int
> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> +{
> +    struct vfio_iommu_type1_dma_unmap unmap = {
> +        .argsz = sizeof(unmap),
> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
> +        .iova = 0,
> +        .size = 0,
> +    };
> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> +        return -errno;
> +    }
> +    container->vaddr_unmapped = true;
> +    return 0;
> +}
> +
> +static bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> +{
> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
> +                         "or VFIO_UNMAP_ALL");
> +        return false;
> +    } else {
> +        return true;
> +    }
> +}
> +
> +static bool vfio_vmstate_needed(void *opaque)
> +{
> +    return cpr_get_mode() == CPR_MODE_RESTART;
> +}
> +
> +static int vfio_container_pre_save(void *opaque)
> +{
> +    VFIOContainer *container = (VFIOContainer *)opaque;
> +    Error *err;
> +
> +    if (!vfio_is_cpr_capable(container, &err) ||
> +        vfio_dma_unmap_vaddr_all(container, &err)) {
> +        error_report_err(err);
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +static int vfio_container_post_load(void *opaque, int version_id)
> +{
> +    VFIOContainer *container = (VFIOContainer *)opaque;
> +    VFIOGroup *group;
> +    Error *err;
> +    VFIODevice *vbasedev;
> +
> +    if (!vfio_is_cpr_capable(container, &err)) {
> +        error_report_err(err);
> +        return -1;
> +    }
> +
> +    vfio_listener_register(container);
> +    container->reused = false;
> +
> +    QLIST_FOREACH(group, &container->group_list, container_next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vbasedev->reused = false;
> +        }
> +    }
> +    return 0;
> +}
> +
> +static const VMStateDescription vfio_container_vmstate = {
> +    .name = "vfio-container",
> +    .unmigratable = 1,


How does this work with vfio devices supporting migration?  This needs
to be coordinated with efforts to enable migration of vfio devices.


> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .pre_save = vfio_container_pre_save,
> +    .post_load = vfio_container_post_load,
> +    .needed = vfio_vmstate_needed,


I don't see that .needed is evaluated relative to .unmigratable above
in determining if migration is blocked.
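
As I read it, the blocked check behaves roughly like the simplified model
below.  This is a sketch under that assumption, with a cut-down hypothetical
struct rather than the real VMStateDescription: any registered description
with the flag set blocks, and .needed is never consulted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down description; the real structure has more fields. */
typedef struct Desc {
    const char *name;
    bool unmigratable;
    bool (*needed)(void *opaque);
} Desc;

/* Models a .needed callback that says "not needed" outside cpr restart. */
static bool never_needed(void *opaque)
{
    (void)opaque;
    return false;
}

/* Blocked if any registered description is flagged; .needed is ignored. */
static bool state_blocked(const Desc *descs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (descs[i].unmigratable) {
            return true;
        }
    }
    return false;
}
```

With that model, a description that is both unmigratable and "not needed"
still blocks, which is the interaction I'm asking about.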


> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +int vfio_cpr_register_container(VFIOContainer *container, Error **errp)
> +{
> +    container->cpr_blocker = NULL;
> +    if (!vfio_is_cpr_capable(container, &container->cpr_blocker)) {
> +        return cpr_add_blocker(&container->cpr_blocker, errp,
> +                               CPR_MODE_RESTART, 0);
> +    }
> +
> +    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
> +
> +    return 0;
> +}
> +
> +void vfio_cpr_unregister_container(VFIOContainer *container)
> +{
> +    cpr_del_blocker(&container->cpr_blocker);
> +
> +    vmstate_unregister(NULL, &vfio_container_vmstate, container);
> +}
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index da9af29..e247b2b 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>    'migration.c',
>  ))
>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> +  'cpr.c',
>    'display.c',
>    'pci-quirks.c',
>    'pci.c',
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 0143c9a..237231b 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -30,6 +30,7 @@
>  #include "hw/qdev-properties-system.h"
>  #include "migration/vmstate.h"
>  #include "qapi/qmp/qdict.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/module.h"
> @@ -2514,6 +2515,7 @@ const VMStateDescription vmstate_vfio_pci_config = {
>      .name = "VFIOPCIDevice",
>      .version_id = 1,
>      .minimum_version_id = 1,
> +    .priority = MIG_PRI_VFIO_PCI,   /* must load before container */
>      .fields = (VMStateField[]) {
>          VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>          VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
> @@ -3243,6 +3245,11 @@ static void vfio_pci_reset(DeviceState *dev)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>  
> +    /* Do not reset the device during qemu_system_reset prior to cpr-load */
> +    if (vdev->vbasedev.reused) {
> +        return;
> +    }
> +
>      trace_vfio_pci_reset(vdev->vbasedev.name);
>  
>      vfio_pci_pre_reset(vdev);
> @@ -3350,6 +3357,42 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +/*
> + * The kernel may change non-emulated config bits.  Exclude them from the
> + * changed-bits check in get_pci_config_device.
> + */
> +static int vfio_pci_pre_load(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    int i;
> +
> +    for (i = 0; i < size; i++) {
> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
> +    }
> +
> +    return 0;
> +}
> +
> +static bool vfio_pci_needed(void *opaque)
> +{
> +    return cpr_get_mode() == CPR_MODE_RESTART;
> +}
> +
> +static const VMStateDescription vfio_pci_vmstate = {
> +    .name = "vfio-pci",
> +    .unmigratable = 1,


Same question here.


> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .priority = MIG_PRI_VFIO_PCI,       /* must load before container */
> +    .pre_load = vfio_pci_pre_load,
> +    .needed = vfio_pci_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -3357,6 +3400,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  
>      dc->reset = vfio_pci_reset;
>      device_class_set_props(dc, vfio_pci_dev_properties);
> +    dc->vmsd = &vfio_pci_vmstate;
>      dc->desc = "VFIO-based PCI device assignment";
>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>      pdc->realize = vfio_realize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 73dffe9..a6d0034 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -119,6 +119,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  vfio_dma_unmap_overflow_workaround(void) ""
> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>  
>  # platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index e573f5a..17ad9ba 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -81,10 +81,14 @@ typedef struct VFIOContainer {
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      MemoryListener listener;
>      MemoryListener prereg_listener;
> +    Notifier cpr_notifier;
> +    Error *cpr_blocker;
>      unsigned iommu_type;
>      Error *error;
>      bool initialized;
>      bool dirty_pages_supported;
> +    bool reused;
> +    bool vaddr_unmapped;
>      uint64_t dirty_pgsizes;
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
> @@ -136,6 +140,7 @@ typedef struct VFIODevice {
>      bool no_mmap;
>      bool ram_block_discard_allowed;
>      bool enable_migration;
> +    bool reused;
>      VFIODeviceOps *ops;
>      unsigned int num_irqs;
>      unsigned int num_regions;
> @@ -213,6 +218,9 @@ void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
>  
> +int vfio_cpr_register_container(VFIOContainer *container, Error **errp);
> +void vfio_cpr_unregister_container(VFIOContainer *container);
> +
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>  extern VFIOGroupList vfio_group_list;
> @@ -234,6 +242,9 @@ struct vfio_info_cap_header *
>  vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
> +void vfio_listener_register(VFIOContainer *container);
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section);
>  
>  int vfio_spapr_create_window(VFIOContainer *container,
>                               MemoryRegionSection *section,
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index ad24aa1..19f1538 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -157,6 +157,7 @@ typedef enum {
>      MIG_PRI_GICV3_ITS,          /* Must happen before PCI devices */
>      MIG_PRI_GICV3,              /* Must happen before the ITS */
>      MIG_PRI_MAX,
> +    MIG_PRI_VFIO_PCI = MIG_PRI_IOMMU,


Based on the current contents of this enum, why are we aliasing an
existing priority vs defining a new one?  Thanks,

Alex
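
To illustrate the difference with an abbreviated, made-up copy of the enum
(not the full QEMU list; all names below are hypothetical): an alias placed
after the MAX sentinel compares equal to the entry it aliases, while a new
enumerator placed before MAX gets its own slot.

```c
#include <assert.h>

/* Alias form: the extra name has the same value as PRI_IOMMU. */
typedef enum {
    PRI_DEFAULT = 0,
    PRI_IOMMU,
    PRI_PCI_BUS,
    PRI_MAX,
    PRI_VFIO_ALIAS = PRI_IOMMU,
} AliasedPriority;

/* Distinct form: the new entry sits in its own slot below the sentinel. */
typedef enum {
    NEW_PRI_DEFAULT = 0,
    NEW_PRI_IOMMU,
    NEW_PRI_VFIO_PCI,
    NEW_PRI_PCI_BUS,
    NEW_PRI_MAX,
} DistinctPriority;

/* Returns 1 when the vfio priority differs from the iommu priority. */
static int alias_is_distinct(void)
{
    return PRI_VFIO_ALIAS != PRI_IOMMU;
}

static int new_is_distinct(void)
{
    return NEW_PRI_VFIO_PCI != NEW_PRI_IOMMU;
}
```

Where a distinct entry would belong relative to the others is a separate
question; the slot chosen above is arbitrary.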


>  } MigrationPriority;
>  
>  struct VMStateField {




* Re: [PATCH V8 28/39] vfio-pci: cpr part 2 (msi)
  2022-06-15 14:52 ` [PATCH V8 28/39] vfio-pci: cpr part 2 (msi) Steve Sistare
@ 2022-06-29 20:19   ` Alex Williamson
  2022-07-06 17:46     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: Alex Williamson @ 2022-06-29 20:19 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On Wed, 15 Jun 2022 07:52:15 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:

> Finish cpr for vfio-pci MSI/MSI-X devices by preserving eventfd's and
> vector state.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/vfio/pci.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 121 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 237231b..2fd7121 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -53,17 +53,53 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>  static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>  
> +#define EVENT_FD_NAME(vdev, name)   \
> +    g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
> +
> +static int save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
> +                         EventNotifier *ev)
> +{
> +    int fd = event_notifier_get_fd(ev);
> +
> +    if (fd >= 0) {
> +        Error *err;
> +        g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
> +
> +        if (cpr_resave_fd(fdname, nr, fd, &err)) {
> +            error_report_err(err);
> +            return 1;


Preferably -1, but the caller doesn't actually test the return value
anyway :-\


> +        }
> +    }
> +    return 0;
> +}
> +
> +static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
> +{
> +    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
> +    int fd = cpr_find_fd(fdname, nr);
> +    return fd;


    return cpr_find_fd(EVENT_FD_NAME(vdev, name), nr);


> +}
> +
> +static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
> +{
> +    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
> +    cpr_delete_fd(fdname, nr);


    cpr_delete_fd(EVENT_FD_NAME(vdev, name), nr);


> +}
> +
>  /* Create new or reuse existing eventfd */
>  static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>                                const char *name, int nr)
>  {
> -    int fd = -1;   /* placeholder until a subsequent patch */
>      int ret = 0;
> +    int fd = load_event_fd(vdev, name, nr);
>  
>      if (fd >= 0) {
>          event_notifier_init_fd(e, fd);
>      } else {
>          ret = event_notifier_init(e, 0);
> +        if (!ret) {
> +            save_event_fd(vdev, name, nr, e);


Return value not tested.  The function generates an error report if it
fails, but it doesn't seem that that actually blocks a cpr attempt.  Do we
just wind up with that error report as a breadcrumb to why cpr breaks
with a missing fd down the road?
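
One possible shape, sketched with stand-in stubs (hypothetical names, not
the real event notifier or cpr API): treat a failed save as a failed init,
so the caller sees the error at the point where it happens.

```c
#include <assert.h>
#include <stdbool.h>

static bool save_fails;   /* test knob for the stub below */
static int cleanups;      /* counts notifier cleanups */

/* Stubs: assumptions standing in for notifier init/cleanup and the fd save. */
static int stub_notifier_init(int *fd) { *fd = 7; return 0; }
static void stub_notifier_cleanup(int *fd) { *fd = -1; cleanups++; }
static int stub_save_event_fd(int fd) { (void)fd; return save_fails ? -1 : 0; }

/* Create a new eventfd and fail the whole init if it cannot be saved. */
static int notifier_init_checked(int *fd)
{
    int ret = stub_notifier_init(fd);

    if (ret) {
        return ret;
    }
    ret = stub_save_event_fd(*fd);
    if (ret) {
        stub_notifier_cleanup(fd);   /* do not keep an fd cpr cannot find */
    }
    return ret;
}
```

The alternative would be to register a cpr blocker on failure instead of
failing the init; the sketch only shows the first option.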


> +        }
>      }
>      return ret;
>  }
> @@ -71,6 +107,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>  static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
>                                    const char *name, int nr)
>  {
> +    delete_event_fd(vdev, name, nr);
>      event_notifier_cleanup(e);
>  }
>  
> @@ -511,6 +548,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>      VFIOMSIVector *vector;
>      int ret;
>  
> +    /*
> +     * Ignore the callback from msix_set_vector_notifiers during resume.
> +     * The necessary subset of these actions is called from vfio_claim_vectors
> +     * during post load.
> +     */
> +    if (vdev->vbasedev.reused) {
> +        return 0;
> +    }
> +
>      trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
>  
>      vector = &vdev->msi_vectors[nr];
> @@ -2784,6 +2830,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>      fd = event_notifier_get_fd(&vdev->err_notifier);
>      qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
>  
> +    /* Do not alter irq_signaling during vfio_realize for cpr */
> +    if (vdev->vbasedev.reused) {
> +        return;
> +    }
> +
>      if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
>                                 VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>          error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
> @@ -2849,6 +2900,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>      fd = event_notifier_get_fd(&vdev->req_notifier);
>      qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
>  
> +    /* Do not alter irq_signaling during vfio_realize for cpr */
> +    if (vdev->vbasedev.reused) {
> +        vdev->req_enabled = true;
> +        return;
> +    }


vfio_notifier_init() transparently gets the old fd or creates a new
one.  How do we know which has occurred, and therefore whether this
eventfd is already configured?

Don't we also have the same issue relative to vdev->pci_aer for the
error handler?

> +
>      if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
>                             VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>          error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
> @@ -3357,6 +3414,43 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
> +{
> +    int i, fd;
> +    bool pending = false;
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    vdev->nr_vectors = nr_vectors;
> +    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
> +    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
> +
> +    for (i = 0; i < nr_vectors; i++) {
> +        VFIOMSIVector *vector = &vdev->msi_vectors[i];
> +
> +        fd = load_event_fd(vdev, "interrupt", i);
> +        if (fd >= 0) {
> +            vfio_vector_init(vdev, i);
> +            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
> +        }
> +
> +        if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
> +            vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
> +            vfio_add_kvm_msi_virq(vdev, vector, i, msix);
> +            kvm_irqchip_commit_route_changes(&vfio_route_change);
> +            vfio_connect_kvm_msi_virq(vector, i);


Shouldn't we take advantage of the batching support here?
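
Roughly what I mean, as a toy model (the transaction type and helpers below
are hypothetical, not the KVM irqchip API): the posted loop opens and
commits one transaction per vector, whereas batching queues all vectors and
commits once.

```c
#include <assert.h>

/* Hypothetical model of a route-change transaction. */
typedef struct RouteTxn {
    int pending;    /* routes queued since the last begin */
    int commits;    /* commits that actually pushed routes */
} RouteTxn;

static void txn_begin(RouteTxn *t) { t->pending = 0; }
static void txn_add(RouteTxn *t, int vector) { (void)vector; t->pending++; }

static void txn_commit(RouteTxn *t)
{
    if (t->pending) {
        t->commits++;
        t->pending = 0;
    }
}

/* Shape of the posted loop: one transaction per vector. */
static int claim_per_vector(RouteTxn *t, int nr_vectors)
{
    for (int i = 0; i < nr_vectors; i++) {
        txn_begin(t);
        txn_add(t, i);
        txn_commit(t);
    }
    return t->commits;
}

/* Batched shape: queue every vector, then commit once. */
static int claim_batched(RouteTxn *t, int nr_vectors)
{
    txn_begin(t);
    for (int i = 0; i < nr_vectors; i++) {
        txn_add(t, i);
    }
    txn_commit(t);
    return t->commits;
}
```

In the real code the connect step would presumably have to move to a second
pass after the single commit.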


> +        }


How do we debug if one of the above fails that shouldn't have failed?
Should we have an assert or change this to a non-void return if we
cannot set up an interrupt that we think is configured?


> +
> +        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
> +            set_bit(i, vdev->msix->pending);
> +            pending = true;
> +        }
> +    }
> +
> +    if (msix) {
> +        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
> +    }
> +}
> +
>  /*
>   * The kernel may change non-emulated config bits.  Exclude them from the
>   * changed-bits check in get_pci_config_device.
> @@ -3375,6 +3469,29 @@ static int vfio_pci_pre_load(void *opaque)
>      return 0;
>  }
>  
> +static int vfio_pci_post_load(void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int nr_vectors;
> +
> +    if (msix_enabled(pdev)) {
> +        msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
> +                                   vfio_msix_vector_release, NULL);
> +        nr_vectors = vdev->msix->entries;



Maybe this is why we're not generating an error above: we don't know
which vectors are configured other than if they have a saved eventfd,
and we don't test whether we were actually able to save the fd.
Thanks,

Alex


> +        vfio_claim_vectors(vdev, nr_vectors, true);
> +
> +    } else if (msi_enabled(pdev)) {
> +        nr_vectors = msi_nr_vectors_allocated(pdev);
> +        vfio_claim_vectors(vdev, nr_vectors, false);
> +
> +    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
> +        assert(0);      /* completed in a subsequent patch */
> +    }
> +
> +    return 0;
> +}
> +
>  static bool vfio_pci_needed(void *opaque)
>  {
>      return cpr_get_mode() == CPR_MODE_RESTART;
> @@ -3387,8 +3504,11 @@ static const VMStateDescription vfio_pci_vmstate = {
>      .minimum_version_id = 0,
>      .priority = MIG_PRI_VFIO_PCI,       /* must load before container */
>      .pre_load = vfio_pci_pre_load,
> +    .post_load = vfio_pci_post_load,
>      .needed = vfio_pci_needed,
>      .fields = (VMStateField[]) {
> +        VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
> +        VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
>          VMSTATE_END_OF_LIST()
>      }
>  };




* Re: [PATCH V8 29/39] vfio-pci: cpr part 3 (intx)
  2022-06-15 14:52 ` [PATCH V8 29/39] vfio-pci: cpr part 3 (intx) Steve Sistare
@ 2022-06-29 20:43   ` Alex Williamson
  2022-07-06 17:46     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: Alex Williamson @ 2022-06-29 20:43 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On Wed, 15 Jun 2022 07:52:16 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:

> Preserve vfio INTX state across cpr restart.  Preserve VFIOINTx fields as
> follows:
>   pin : Recover this from the vfio config in kernel space
>   interrupt : Preserve its eventfd descriptor across exec.
>   unmask : Ditto
>   route.irq : This could perhaps be recovered in vfio_pci_post_load by
>     calling pci_device_route_intx_to_irq(pin), whose implementation reads
>     config space for a bridge device such as ich9.  However, there is no
>     guarantee that the bridge vmstate is read before vfio vmstate.  Rather
>     than fiddling with MigrationPriority for vmstate handlers, explicitly
>     save route.irq in vfio vmstate.
>   pending : save in vfio vmstate.
>   mmap_timeout, mmap_timer : Re-initialize
>   bool kvm_accel : Re-initialize
> 
> In vfio_realize, defer calling vfio_intx_enable until the vmstate
> is available, in vfio_pci_post_load.  Modify vfio_intx_enable and
> vfio_intx_kvm_enable to skip vfio initialization, but still perform
> kvm initialization.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/vfio/pci.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 83 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 2fd7121..b8aee91 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -173,14 +173,45 @@ static void vfio_intx_eoi(VFIODevice *vbasedev)
>      vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>  }
>  
> +#ifdef CONFIG_KVM
> +static bool vfio_no_kvm_intx(VFIOPCIDevice *vdev)
> +{
> +    return vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
> +           vdev->intx.route.mode != PCI_INTX_ENABLED ||
> +           !kvm_resamplefds_enabled();
> +}
> +#endif
> +
> +static void vfio_intx_reenable_kvm(VFIOPCIDevice *vdev, Error **errp)
> +{
> +#ifdef CONFIG_KVM
> +    if (vfio_no_kvm_intx(vdev)) {
> +        return;
> +    }
> +
> +    if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
> +        error_setg(errp, "vfio_notifier_init intx-unmask failed");
> +        return;
> +    }
> +
> +    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
> +                                           &vdev->intx.interrupt,
> +                                           &vdev->intx.unmask,
> +                                           vdev->intx.route.irq)) {
> +        error_setg_errno(errp, errno, "failed to setup resample irqfd");


Does not unwind with vfio_notifier_cleanup().  This also exactly
duplicates code in vfio_intx_enable_kvm(), which suggests it needs
further refactoring to a common helper.
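
A sketch of the kind of common helper I have in mind, using a toy model
with stub steps (all names hypothetical): both paths share the KVM-side
setup and its unwind, and only the vfio-side steps are skipped when the
device is reused.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the INTx KVM state; not the real VFIOINTx. */
typedef struct IntxModel {
    bool unmask_ready;    /* unmask notifier initialized */
    bool kvm_accel;       /* resample irqfd installed */
    int vfio_calls;       /* number of vfio-side steps issued */
} IntxModel;

static bool irqfd_fails;  /* test knob for the stub below */

/* Stubs: assumptions standing in for the notifier, irqfd and vfio calls. */
static int stub_unmask_init(IntxModel *x) { x->unmask_ready = true; return 0; }
static void stub_unmask_cleanup(IntxModel *x) { x->unmask_ready = false; }
static int stub_add_resample_irqfd(IntxModel *x) { (void)x; return irqfd_fails ? -1 : 0; }
static void stub_vfio_step(IntxModel *x) { x->vfio_calls++; }

/* One helper for both paths; vfio-side steps run only when not reused. */
static int intx_enable_kvm_common(IntxModel *x, bool reused)
{
    int ret;

    if (!reused) {
        stub_vfio_step(x);              /* mask while switching over */
    }
    ret = stub_unmask_init(x);
    if (ret) {
        goto fail;
    }
    ret = stub_add_resample_irqfd(x);
    if (ret) {
        stub_unmask_cleanup(x);         /* unwind the notifier */
        goto fail;
    }
    if (!reused) {
        stub_vfio_step(x);              /* hand the unmask fd to vfio */
        stub_vfio_step(x);              /* unmask again */
    }
    x->kvm_accel = true;
    return 0;

fail:
    if (!reused) {
        stub_vfio_step(x);              /* undo the initial mask */
    }
    return ret;
}
```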



> +        return;
> +    }
> +
> +    vdev->intx.kvm_accel = true;
> +#endif
> +}
> +
>  static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>  {
>  #ifdef CONFIG_KVM
>      int irq_fd = event_notifier_get_fd(&vdev->intx.interrupt);
>  
> -    if (vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
> -        vdev->intx.route.mode != PCI_INTX_ENABLED ||
> -        !kvm_resamplefds_enabled()) {
> +    if (vfio_no_kvm_intx(vdev)) {
>          return;
>      }
>  
> @@ -328,7 +359,13 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>          return 0;
>      }
>  
> -    vfio_disable_interrupts(vdev);
> +    /*
> +     * Do not alter interrupt state during vfio_realize and cpr-load.  The
> +     * reused flag is cleared thereafter.
> +     */
> +    if (!vdev->vbasedev.reused) {
> +        vfio_disable_interrupts(vdev);
> +    }
>  
>      vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
>      pci_config_set_interrupt_pin(vdev->pdev.config, pin);
> @@ -353,6 +390,11 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>      fd = event_notifier_get_fd(&vdev->intx.interrupt);
>      qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
>  
> +    if (vdev->vbasedev.reused) {
> +        vfio_intx_reenable_kvm(vdev, &err);
> +        goto finish;
> +    }
> +

This only jumps over the vfio_set_irq_signaling() and
vfio_intx_enable_kvm(), largely replacing the latter with chunks of
code taken from it.  Doesn't seem like the right factoring.

>      if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
>                                 VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
>          qemu_set_fd_handler(fd, NULL, NULL, vdev);
> @@ -365,6 +407,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>          warn_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>      }
>  
> +finish:
>      vdev->interrupt = VFIO_INT_INTx;
>  
>      trace_vfio_intx_enable(vdev->vbasedev.name);
> @@ -3195,9 +3238,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>                                               vfio_intx_routing_notifier);
>          vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
>          kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
> -        ret = vfio_intx_enable(vdev, errp);
> -        if (ret) {
> -            goto out_deregister;
> +
> +        /* Wait until cpr-load reads intx routing data to enable */
> +        if (!vdev->vbasedev.reused) {
> +            ret = vfio_intx_enable(vdev, errp);
> +            if (ret) {
> +                goto out_deregister;
> +            }
>          }
>      }
>  
> @@ -3474,6 +3521,7 @@ static int vfio_pci_post_load(void *opaque, int version_id)
>      VFIOPCIDevice *vdev = opaque;
>      PCIDevice *pdev = &vdev->pdev;
>      int nr_vectors;
> +    int ret = 0;
>  
>      if (msix_enabled(pdev)) {
>          msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
> @@ -3486,10 +3534,35 @@ static int vfio_pci_post_load(void *opaque, int version_id)
>          vfio_claim_vectors(vdev, nr_vectors, false);
>  
>      } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
> -        assert(0);      /* completed in a subsequent patch */
> +        Error *err = 0;
> +        ret = vfio_intx_enable(vdev, &err);
> +        if (ret) {
> +            error_report_err(err);
> +        }
>      }
>  
> -    return 0;
> +    return ret;
> +}
> +
> +static const VMStateDescription vfio_intx_vmstate = {
> +    .name = "vfio-intx",
> +    .unmigratable = 1,


unmigratable-vmstates-to-interfere-with-migration++

Thanks,
Alex


> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_BOOL(pending, VFIOINTx),
> +        VMSTATE_UINT32(route.mode, VFIOINTx),
> +        VMSTATE_INT32(route.irq, VFIOINTx),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +#define VMSTATE_VFIO_INTX(_field, _state) {                         \
> +    .name       = (stringify(_field)),                              \
> +    .size       = sizeof(VFIOINTx),                                 \
> +    .vmsd       = &vfio_intx_vmstate,                               \
> +    .flags      = VMS_STRUCT,                                       \
> +    .offset     = vmstate_offset_value(_state, _field, VFIOINTx),   \
>  }
>  
>  static bool vfio_pci_needed(void *opaque)
> @@ -3509,6 +3582,7 @@ static const VMStateDescription vfio_pci_vmstate = {
>      .fields = (VMStateField[]) {
>          VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>          VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
> +        VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
>          VMSTATE_END_OF_LIST()
>      }
>  };




* Re: [PATCH V8 30/39] vfio-pci: recover from unmap-all-vaddr failure
  2022-06-15 14:52 ` [PATCH V8 30/39] vfio-pci: recover from unmap-all-vaddr failure Steve Sistare
@ 2022-06-29 22:58   ` Alex Williamson
  2022-07-06 17:46     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: Alex Williamson @ 2022-06-29 22:58 UTC (permalink / raw)
  To: Steve Sistare
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On Wed, 15 Jun 2022 07:52:17 -0700
Steve Sistare <steven.sistare@oracle.com> wrote:

> If vfio_cpr_save fails to unmap all vaddr's, then recover by walking all
> flat sections to restore the vaddr for each.  Do so by invoking the
> vfio listener callback, and passing a new "replay" flag that tells it
> to replay a mapping without re-allocating new userland data structures.

Is this comment accurate?  I thought we had unwind in the kernel for
vaddr invalidation, and the notifier here is hooked up to any fault, so
it's at least misleading regarding vaddr.  The replay option really
needs some documentation in comments.

> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/vfio/common.c              | 66 ++++++++++++++++++++++++++++++++-----------
>  hw/vfio/cpr.c                 | 29 +++++++++++++++++++
>  include/hw/vfio/vfio-common.h |  2 +-
>  3 files changed, 80 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index c7d73b6..5f2bd50 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -895,15 +895,35 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
>      return true;
>  }
>  
> +static VFIORamDiscardListener *vfio_find_ram_discard_listener(
> +    VFIOContainer *container, MemoryRegionSection *section)
> +{
> +    VFIORamDiscardListener *vrdl = NULL;

This initialization was copied from current code, but...

#define QLIST_FOREACH(var, head, field)                                 \
        for ((var) = ((head)->lh_first);                                \
               ...

it doesn't look necessary.  Thanks,

Alex
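
For reference, a small stand-alone illustration with re-created list macros
(the MY_LIST_* names and the Listener type are made up for the example):
the for-init assigns the iterator from the list head, so it is NULL after a
full walk or on an empty list regardless of its initial value.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal re-creation of the list macros, for illustration only. */
#define MY_LIST_HEAD(name, type)  struct name { struct type *lh_first; }
#define MY_LIST_ENTRY(type)       struct { struct type *le_next; }
#define MY_LIST_FOREACH(var, head, field)        \
    for ((var) = ((head)->lh_first);             \
         (var);                                  \
         (var) = ((var)->field.le_next))

struct Listener {
    int id;
    MY_LIST_ENTRY(Listener) next;
};
MY_LIST_HEAD(ListenerList, Listener);

/* Return the matching element, or NULL if the walk runs off the end. */
static struct Listener *find_listener(struct ListenerList *head, int id)
{
    struct Listener *l;   /* no initializer: the macro assigns it first */

    MY_LIST_FOREACH(l, head, next) {
        if (l->id == id) {
            break;
        }
    }
    return l;
}
```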

> +
> +    QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
> +        if (vrdl->mr == section->mr &&
> +            vrdl->offset_within_address_space ==
> +            section->offset_within_address_space) {
> +            break;
> +        }
> +    }
> +
> +    if (!vrdl) {
> +        hw_error("vfio: Trying to sync missing RAM discard listener");
> +        /* does not return */
> +    }
> +    return vrdl;
> +}
> +
>  static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> -    vfio_container_region_add(container, section);
> +    vfio_container_region_add(container, section, false);
>  }
>  
>  void vfio_container_region_add(VFIOContainer *container,
> -                               MemoryRegionSection *section)
> +                               MemoryRegionSection *section, bool replay)
>  {
>      hwaddr iova, end;
>      Int128 llend, llsize;
> @@ -1033,6 +1053,23 @@ void vfio_container_region_add(VFIOContainer *container,
>          int iommu_idx;
>  
>          trace_vfio_listener_region_add_iommu(iova, end);
> +
> +        if (replay) {
> +            hwaddr as_offset = section->offset_within_address_space;
> +            hwaddr iommu_offset = as_offset - section->offset_within_region;
> +
> +            QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> +                if (giommu->iommu_mr == iommu_mr &&
> +                    giommu->iommu_offset == iommu_offset) {
> +                    memory_region_iommu_replay(giommu->iommu_mr, &giommu->n);
> +                    return;
> +                }
> +            }
> +            error_report("Container cannot find iommu region %s offset %lx",
> +                memory_region_name(section->mr), iommu_offset);
> +            goto fail;
> +        }
> +
>          /*
>           * FIXME: For VFIO iommu types which have KVM acceleration to
>           * avoid bouncing all map/unmaps through qemu this way, this
> @@ -1083,7 +1120,15 @@ void vfio_container_region_add(VFIOContainer *container,
>       * about changes.
>       */
>      if (memory_region_has_ram_discard_manager(section->mr)) {
> -        vfio_register_ram_discard_listener(container, section);
> +        if (replay)  {
> +            VFIORamDiscardListener *vrdl =
> +                vfio_find_ram_discard_listener(container, section);
> +            if (vfio_ram_discard_notify_populate(&vrdl->listener, section)) {
> +                error_report("ram_discard_manager_replay_populated failed");
> +            }
> +        } else {
> +            vfio_register_ram_discard_listener(container, section);
> +        }
>          return;
>      }
>  
> @@ -1417,19 +1462,8 @@ static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
>                                                     MemoryRegionSection *section)
>  {
>      RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
> -    VFIORamDiscardListener *vrdl = NULL;
> -
> -    QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
> -        if (vrdl->mr == section->mr &&
> -            vrdl->offset_within_address_space ==
> -            section->offset_within_address_space) {
> -            break;
> -        }
> -    }
> -
> -    if (!vrdl) {
> -        hw_error("vfio: Trying to sync missing RAM discard listener");
> -    }
> +    VFIORamDiscardListener *vrdl =
> +        vfio_find_ram_discard_listener(container, section);
>  
>      /*
>       * We only want/can synchronize the bitmap for actually mapped parts -
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index a227d5e..2b5e77c 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -32,6 +32,15 @@ vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>      return 0;
>  }
>  
> +static int
> +vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
> +{
> +    VFIOContainer *container = handle;
> +    vfio_container_region_add(container, section, true);
> +    container->vaddr_unmapped = false;
> +    return 0;
> +}
> +
>  static bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
>  {
>      if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> @@ -98,6 +107,22 @@ static const VMStateDescription vfio_container_vmstate = {
>      }
>  };
>  
> +static void vfio_cpr_save_failed_notifier(Notifier *notifier, void *data)
> +{
> +    Error *err = NULL;
> +    VFIOContainer *container =
> +        container_of(notifier, VFIOContainer, cpr_notifier);
> +
> +    /* Set reused so vfio_dma_map restores vaddr */
> +    container->reused = true;
> +    if (address_space_flat_for_each_section(container->space->as,
> +                                            vfio_region_remap,
> +                                            container, &err)) {
> +        error_report_err(err);
> +    }
> +    container->reused = false;
> +}
> +
>  int vfio_cpr_register_container(VFIOContainer *container, Error **errp)
>  {
>      container->cpr_blocker = NULL;
> @@ -108,6 +133,8 @@ int vfio_cpr_register_container(VFIOContainer *container, Error **errp)
>  
>      vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>  
> +    cpr_add_notifier(&container->cpr_notifier, vfio_cpr_save_failed_notifier,
> +                     CPR_NOTIFY_SAVE_FAILED);
>      return 0;
>  }
>  
> @@ -116,4 +143,6 @@ void vfio_cpr_unregister_container(VFIOContainer *container)
>      cpr_del_blocker(&container->cpr_blocker);
>  
>      vmstate_unregister(NULL, &vfio_container_vmstate, container);
> +
> +    cpr_remove_notifier(&container->cpr_notifier);
>  }
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 17ad9ba..dd6bbcf 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -244,7 +244,7 @@ vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>  extern const MemoryListener vfio_prereg_listener;
>  void vfio_listener_register(VFIOContainer *container);
>  void vfio_container_region_add(VFIOContainer *container,
> -                               MemoryRegionSection *section);
> +                               MemoryRegionSection *section, bool replay);
>  
>  int vfio_spapr_create_window(VFIOContainer *container,
>                               MemoryRegionSection *section,




* Re: [PATCH V8 12/39] memory: flat section iterator
  2022-06-15 14:51 ` [PATCH V8 12/39] memory: flat section iterator Steve Sistare
@ 2022-07-03  7:52   ` Peng Liang
  2022-07-05 18:26     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: Peng Liang @ 2022-07-03  7:52 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow



On 6/15/2022 10:51 PM, Steve Sistare wrote:
> Add an iterator over the sections of a flattened address space.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> ---
>  include/exec/memory.h | 31 +++++++++++++++++++++++++++++++
>  softmmu/memory.c      | 20 ++++++++++++++++++++
>  2 files changed, 51 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index a03301d..6a257a4 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -2343,6 +2343,37 @@ void memory_region_set_ram_discard_manager(MemoryRegion *mr,
>                                             RamDiscardManager *rdm);
>  
>  /**
> + * memory_region_section_cb: callback for address_space_flat_for_each_section()
> + *
> + * @mrs: MemoryRegionSection of the range
> + * @opaque: data pointer passed to address_space_flat_for_each_section()
> + * @errp: error message, returned to the address_space_flat_for_each_section
> + *        caller.
> + *
> + * Returns: non-zero to stop the iteration, and 0 to continue.  The same
> + * non-zero value is returned to the address_space_flat_for_each_section caller.
> + */
> +
> +typedef int (*memory_region_section_cb)(MemoryRegionSection *mrs,
> +                                        void *opaque,
> +                                        Error **errp);
> +
> +/**
> + * address_space_flat_for_each_section: walk the ranges in the address space
> + * flat view and call @func for each.  Return 0 on success, else return non-zero
> + * with a message in @errp.
> + *
> + * @as: target address space
> + * @func: callback function
> + * @opaque: passed to @func
> + * @errp: passed to @func
> + */
> +int address_space_flat_for_each_section(AddressSpace *as,
> +                                        memory_region_section_cb func,
> +                                        void *opaque,
> +                                        Error **errp);
> +
> +/**
>   * memory_region_find: translate an address/size relative to a
>   * MemoryRegion into a #MemoryRegionSection.
>   *
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 0fe6fac..e5aefdd 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -2683,6 +2683,26 @@ bool memory_region_is_mapped(MemoryRegion *mr)
>      return !!mr->container || mr->mapped_via_alias;
>  }
>  
> +int address_space_flat_for_each_section(AddressSpace *as,
> +                                        memory_region_section_cb func,
> +                                        void *opaque,
> +                                        Error **errp)
> +{
> +    FlatView *view = address_space_get_flatview(as);
> +    FlatRange *fr;
> +    int ret;
> +
> +    FOR_EACH_FLAT_RANGE(fr, view) {
> +        MemoryRegionSection mrs = section_from_flat_range(fr, view);
> +        ret = func(&mrs, opaque, errp);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +

Hi Steve,
I think a flatview_unref(view) is missing here, because
address_space_get_flatview() returns the view with a reference already
taken via flatview_ref().  The early "return ret" inside the loop would
need the same unref.

> +    return 0;
> +}
> +
>  /* Same as memory_region_find, but it does not add a reference to the
>   * returned region.  It must be called from an RCU critical section.
>   */
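
To illustrate the missing unref noted above, here is a minimal standalone
sketch of the pattern, not QEMU code.  DemoView, demo_get_view(),
demo_view_unref() and demo_for_each_range() are hypothetical stand-ins for
FlatView, address_space_get_flatview(), flatview_unref() and the new
iterator.  The point is only that the reference is dropped on both the
early-stop path and the normal path:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted flat view; not the QEMU type. */
typedef struct DemoView {
    int refcount;
    int nr_ranges;
    int ranges[4];
} DemoView;

typedef int (*demo_range_cb)(int range, void *opaque);

/* Like address_space_get_flatview(): returns the view with a ref taken. */
static DemoView *demo_get_view(DemoView *v)
{
    v->refcount++;
    return v;
}

static void demo_view_unref(DemoView *v)
{
    assert(v->refcount > 0);
    v->refcount--;
}

/* Walk all ranges and drop the reference on every exit path. */
static int demo_for_each_range(DemoView *owner, demo_range_cb func,
                               void *opaque)
{
    DemoView *view = demo_get_view(owner);
    int ret = 0;

    for (int i = 0; i < view->nr_ranges; i++) {
        ret = func(view->ranges[i], opaque);
        if (ret) {
            break;              /* fall through to the unref below */
        }
    }

    demo_view_unref(view);
    return ret;
}

/* Example callback: stop the walk when the range equals *opaque. */
static int demo_stop_at(int range, void *opaque)
{
    return range == *(int *)opaque ? -1 : 0;
}
```

Using "break" plus a single unref before the return keeps the two exit paths
from diverging; a "goto out" label would work equally well.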



* Re: [PATCH V8 20/39] cpr: restart mode
  2022-06-15 14:52 ` [PATCH V8 20/39] cpr: restart mode Steve Sistare
@ 2022-07-03  8:15   ` Peng Liang
  2022-07-05 18:29     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: Peng Liang @ 2022-07-03  8:15 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow



On 6/15/2022 10:52 PM, Steve Sistare wrote:
> Provide the cpr-save restart mode, which preserves the guest VM across a
> restart of the qemu process.  After cpr-save, the caller passes qemu
> command-line arguments to cpr-exec, which directly exec's the new qemu
> binary.  The arguments must include -S so new qemu starts in a paused state.
> The caller resumes the guest by calling cpr-load.
> 
> To use the restart mode, guest RAM must be backed by a memory-backend-file
> with share=on.  The '-cpr-enable restart' option causes secondary guest
> ram blocks (those not specified on the command line) to be allocated by
> mmap'ing a memfd.  The memfd descriptors are saved in special cpr state and
> kept open across exec, after which they are retrieved and re-mmap'd.  Hence
> guest RAM is preserved in place, albeit with new virtual addresses in the
> qemu process.
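
As a rough standalone illustration of the paragraph above (Linux only, not
QEMU code): the sketch below writes through one shared mapping of a memfd,
drops that mapping, and maps the same descriptor again, which is roughly
what the new qemu does after exec.  All demo_* names are hypothetical.  It
calls memfd_create through syscall() only so that it does not depend on the
libc wrapper being available.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an anonymous memory file of the given size; returns fd or -1. */
static int demo_memfd_new(const char *name, size_t size)
{
    int fd = (int)syscall(SYS_memfd_create, name, 0U);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the whole file shared, so stores land in the memfd's pages. */
static void *demo_memfd_map(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * Write msg through one mapping, unmap it, then map the same fd again.
 * Returns 1 if the contents survived, 0 if not, -1 on error.
 */
static int demo_memfd_survives_remap(const char *msg)
{
    const size_t size = 4096;
    size_t len;
    int fd, same;
    char *old_va, *new_va;

    assert(msg != NULL);
    len = strlen(msg) + 1;
    if (len > size) {
        return -1;
    }
    fd = demo_memfd_new("demo-ram", size);
    if (fd < 0) {
        return -1;
    }

    old_va = demo_memfd_map(fd, size);
    if (!old_va) {
        close(fd);
        return -1;
    }
    memcpy(old_va, msg, len);
    munmap(old_va, size);           /* the old virtual address is gone */

    new_va = demo_memfd_map(fd, size);
    if (!new_va) {
        close(fd);
        return -1;
    }
    same = (memcmp(new_va, msg, len) == 0);
    munmap(new_va, size);
    close(fd);
    return same;
}
```

The second mmap may return a different address than the first, but the data
is the same because the pages belong to the memfd rather than to the mapping.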
> 
> The restart mode supports vfio devices and memory-backend-memfd in
> subsequent patches.
> 
> cpr-exec syntax:
>   { 'command': 'cpr-exec', 'data': { 'argv': [ 'str' ] } }
> 
> Add the restart mode:
>   { 'enum': 'CprMode', 'data': [ 'reboot', 'restart' ] }
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  migration/cpr.c   | 35 +++++++++++++++++++++++++++++++++++
>  qapi/cpr.json     | 26 +++++++++++++++++++++++++-
>  qemu-options.hx   |  2 +-
>  softmmu/physmem.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
>  trace-events      |  1 +
>  5 files changed, 107 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 1cc8738..8b3fffd 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -22,6 +22,7 @@ static int cpr_enabled_modes;
>  void cpr_init(int modes)
>  {
>      cpr_enabled_modes = modes;
> +    cpr_state_load(&error_fatal);
>  }
>  
>  bool cpr_enabled(CprMode mode)
> @@ -153,6 +154,37 @@ err:
>      cpr_set_mode(CPR_MODE_NONE);
>  }
>  
> +static int preserve_fd(const char *name, int id, int fd, void *opaque)
> +{
> +    qemu_clear_cloexec(fd);
> +    return 0;
> +}
> +
> +static int unpreserve_fd(const char *name, int id, int fd, void *opaque)
> +{
> +    qemu_set_cloexec(fd);
> +    return 0;
> +}
> +
> +void qmp_cpr_exec(strList *args, Error **errp)
> +{
> +    if (!runstate_check(RUN_STATE_SAVE_VM)) {
> +        error_setg(errp, "runstate is not save-vm");
> +        return;
> +    }
> +    if (cpr_get_mode() != CPR_MODE_RESTART) {
> +        error_setg(errp, "cpr-exec requires cpr-save with restart mode");
> +        return;
> +    }
> +
> +    cpr_walk_fd(preserve_fd, 0);
> +    if (cpr_state_save(errp)) {
> +        return;
> +    }
> +
> +    assert(qemu_system_exec_request(args, errp) == 0);
> +}
> +
>  void qmp_cpr_load(const char *filename, CprMode mode, Error **errp)
>  {
>      QEMUFile *f;
> @@ -189,6 +221,9 @@ void qmp_cpr_load(const char *filename, CprMode mode, Error **errp)
>          goto out;
>      }
>  
> +    /* Set cloexec to prevent fd leaks until the next cpr-save */
> +    cpr_walk_fd(unpreserve_fd, 0);
> +
>      state = global_state_get_runstate();
>      if (state == RUN_STATE_RUNNING) {
>          vm_start();
> diff --git a/qapi/cpr.json b/qapi/cpr.json
> index 11c6f88..47ee4ff 100644
> --- a/qapi/cpr.json
> +++ b/qapi/cpr.json
> @@ -15,11 +15,12 @@
>  # @CprMode:
>  #
>  # @reboot: checkpoint can be cpr-load'ed after a host reboot.
> +# @restart: checkpoint can be cpr-load'ed after restarting qemu.
>  #
>  # Since: 7.1
>  ##
>  { 'enum': 'CprMode',
> -  'data': [ 'none', 'reboot' ] }
> +  'data': [ 'none', 'reboot', 'restart' ] }
>  
>  ##
>  # @cpr-save:
> @@ -38,6 +39,11 @@
>  # issue the quit command, reboot the system, start qemu using the same
>  # arguments plus -S, and issue the cpr-load command.
>  #
> +# If @mode is 'restart', the checkpoint remains valid after restarting
> +# qemu using a subsequent cpr-exec.  Guest RAM must be backed by a
> +# memory-backend-file with share=on.
> +# To resume from the checkpoint, issue the cpr-load command.
> +#
>  # @filename: name of checkpoint file
>  # @mode: @CprMode mode
>  #
> @@ -48,6 +54,24 @@
>              'mode': 'CprMode' } }
>  
>  ##
> +# @cpr-exec:
> +#
> +# Restart qemu by directly exec'ing @argv[0], replacing the qemu process.
> +# The PID remains the same.  Must be called after cpr-save restart.
> +#
> +# @argv[0] should be the path of a new qemu binary, or a prefix command that
> +# in turn exec's the new qemu binary.  The arguments must match those used
> +# to initially start qemu, plus the -S option so new qemu starts in a paused
> +# state.
> +#
> +# @argv: arguments to be passed to exec().
> +#
> +# Since: 7.1
> +##
> +{ 'command': 'cpr-exec',
> +  'data': { 'argv': [ 'str' ] } }
> +
> +##
>  # @cpr-load:
>  #
>  # Load a virtual machine from the checkpoint file @filename that was created
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 6e51c33..1b49360 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -4484,7 +4484,7 @@ SRST
>  ERST
>  
>  DEF("cpr-enable", HAS_ARG, QEMU_OPTION_cpr_enable, \
> -    "-cpr-enable reboot    enable the cpr mode\n",
> +    "-cpr-enable reboot|restart    enable the cpr mode\n",
>      QEMU_ARCH_ALL)
>  SRST
>  ``-cpr-enable reboot``
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 822c424..412cc80 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -44,6 +44,7 @@
>  #include "qemu/qemu-print.h"
>  #include "qemu/log.h"
>  #include "qemu/memalign.h"
> +#include "qemu/memfd.h"
>  #include "exec/memory.h"
>  #include "exec/ioport.h"
>  #include "sysemu/dma.h"
> @@ -1962,6 +1963,40 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
>      }
>  }
>  
> +static bool memory_region_is_backend(MemoryRegion *mr)
> +{
> +    return !!object_dynamic_cast(mr->parent_obj.parent, TYPE_MEMORY_BACKEND);
> +}

Maybe OBJECT(mr)->parent or mr->owner is more readable?

> +
> +static void *qemu_anon_memfd_alloc(RAMBlock *rb, size_t maxlen, Error **errp)
> +{
> +    size_t len, align;
> +    void *addr;
> +    struct MemoryRegion *mr = rb->mr;
> +    const char *name = memory_region_name(mr);
> +    int mfd = cpr_find_memfd(name, &len, &maxlen, &align);
> +
> +    if (mfd >= 0) {
> +        rb->used_length = len;
> +        rb->max_length = maxlen;
> +        mr->align = align;
> +    } else {
> +        len = rb->used_length;
> +        maxlen = rb->max_length;
> +        mr->align = QEMU_VMALLOC_ALIGN;
> +        mfd = qemu_memfd_create(name, maxlen + mr->align, 0, 0, 0, errp);
> +        if (mfd < 0) {
> +            return NULL;
> +        }
> +        cpr_save_memfd(name, mfd, len, maxlen, mr->align);
> +    }
> +    rb->flags |= RAM_SHARED;
> +    qemu_set_cloexec(mfd);
> +    addr = file_ram_alloc(rb, maxlen, mfd, false, false, 0, errp);
> +    trace_anon_memfd_alloc(name, maxlen, addr, mfd);
> +    return addr;
> +}
> +
>  static void ram_block_add(RAMBlock *new_block, Error **errp)
>  {
>      const bool noreserve = qemu_ram_is_noreserve(new_block);
> @@ -1986,6 +2021,14 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
> +        } else if (cpr_enabled(CPR_MODE_RESTART) &&
> +                   !memory_region_is_backend(new_block->mr)) {
> +            new_block->host = qemu_anon_memfd_alloc(new_block,
> +                                                    new_block->max_length,
> +                                                    errp);
> +            if (!new_block->host) {
> +                return;
> +            }
>          } else {
>              new_block->host = qemu_anon_ram_alloc(new_block->max_length,
>                                                    &new_block->mr->align,
> @@ -1997,8 +2040,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
> -            memory_try_enable_merging(new_block->host, new_block->max_length);
>          }
> +        memory_try_enable_merging(new_block->host, new_block->max_length);
>      }
>  
>      new_ram_size = MAX(old_ram_size,
> @@ -2231,6 +2274,7 @@ void qemu_ram_free(RAMBlock *block)
>      }
>  
>      qemu_mutex_lock_ramlist();
> +    cpr_delete_memfd(memory_region_name(block->mr));
>      QLIST_REMOVE_RCU(block, next);
>      ram_list.mru_block = NULL;
>      /* Write list before version */
> diff --git a/trace-events b/trace-events
> index bc71006..07369bb 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
>  # accel/tcg/cputlb.c
>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
>  
>  # gdbstub.c
>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
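
For reference, the preserve_fd()/unpreserve_fd() pair in this patch comes
down to toggling FD_CLOEXEC around the exec.  A minimal standalone sketch
using plain fcntl(), assuming POSIX; demo_set_cloexec() and
demo_is_cloexec() are hypothetical helpers, not the qemu_set_cloexec() and
qemu_clear_cloexec() implementations:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Set (on != 0) or clear (on == 0) FD_CLOEXEC.  Returns 0 or -1. */
static int demo_set_cloexec(int fd, int on)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFD);
    if (flags < 0) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}

/* Returns 1 if exec would close fd, 0 if fd would survive, -1 on error. */
static int demo_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags < 0 ? -1 : !!(flags & FD_CLOEXEC);
}
```

The flag is cleared just before exec so the descriptor survives into the new
process, and set again after cpr-load so it does not leak into unrelated
fork and exec calls.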



* Re: [PATCH V8 36/39] chardev: cpr for sockets
  2022-06-15 14:52 ` [PATCH V8 36/39] chardev: cpr for sockets Steve Sistare
@ 2022-07-03  8:19   ` Peng Liang
  2022-07-05 18:29     ` Steven Sistare
  0 siblings, 1 reply; 84+ messages in thread
From: Peng Liang @ 2022-07-03  8:19 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow



On 6/15/2022 10:52 PM, Steve Sistare wrote:
> Save accepted socket fds before cpr-save, and look for them after cpr-load.
> Block cpr-exec if a socket enables the TLS or websocket option.  Allow a
> monitor socket by closing it on exec.
> 
> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  chardev/char-socket.c         | 45 +++++++++++++++++++++++++++++++++++++++++++
>  include/chardev/char-socket.h |  1 +
>  monitor/hmp.c                 |  3 +++
>  monitor/qmp.c                 |  3 +++
>  4 files changed, 52 insertions(+)
> 
> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> index dc4e218..3a1e36b 100644
> --- a/chardev/char-socket.c
> +++ b/chardev/char-socket.c
> @@ -26,6 +26,7 @@
>  #include "chardev/char.h"
>  #include "io/channel-socket.h"
>  #include "io/channel-websock.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/module.h"
>  #include "qemu/option.h"
> @@ -33,6 +34,7 @@
>  #include "qapi/clone-visitor.h"
>  #include "qapi/qapi-visit-sockets.h"
>  #include "qemu/yank.h"
> +#include "sysemu/sysemu.h"
>  
>  #include "chardev/char-io.h"
>  #include "chardev/char-socket.h"
> @@ -358,6 +360,11 @@ static void tcp_chr_free_connection(Chardev *chr)
>      SocketChardev *s = SOCKET_CHARDEV(chr);
>      int i;
>  
> +    if (chr->cpr_enabled) {
> +        cpr_delete_fd(chr->label, 0);
> +    }
> +    cpr_del_blocker(&s->cpr_blocker);
> +
>      if (s->read_msgfds_num) {
>          for (i = 0; i < s->read_msgfds_num; i++) {
>              close(s->read_msgfds[i]);
> @@ -923,6 +930,10 @@ static void tcp_chr_accept(QIONetListener *listener,
>                                 QIO_CHANNEL(cioc));
>      }
>      tcp_chr_new_client(chr, cioc);
> +
> +    if (s->sioc && chr->cpr_enabled) {
> +        cpr_resave_fd(chr->label, 0, s->sioc->fd, NULL);
> +    }
>  }
>  
>  
> @@ -1178,6 +1189,26 @@ static gboolean socket_reconnect_timeout(gpointer opaque)
>      return false;
>  }
>  
> +static int load_char_socket_fd(Chardev *chr, Error **errp)
> +{
> +    SocketChardev *sockchar = SOCKET_CHARDEV(chr);
> +    QIOChannelSocket *sioc;
> +    const char *label = chr->label;
> +    int fd = cpr_find_fd(label, 0);
> +
> +    if (fd != -1) {
> +        sockchar = SOCKET_CHARDEV(chr);
> +        sioc = qio_channel_socket_new_fd(fd, errp);
> +        if (sioc) {
> +            tcp_chr_accept(sockchar->listener, sioc, chr);
> +            object_unref(OBJECT(sioc));
> +        } else {
> +            error_setg(errp, "could not restore socket for %s", label);

If we get here, qio_channel_socket_new_fd() has failed and errp should already
be set, so I think error_prepend() is more appropriate here than error_setg().

> +            return -1;
> +        }
> +    }
> +    return 0;
> +}
>  
>  static int qmp_chardev_open_socket_server(Chardev *chr,
>                                            bool is_telnet,
> @@ -1388,6 +1419,18 @@ static void qmp_chardev_open_socket(Chardev *chr,
>      }
>      s->registered_yank = true;
>  
> +    if (!s->tls_creds && !s->is_websock) {
> +        qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
> +    } else if (!chr->reopen_on_cpr) {
> +        s->cpr_blocker = NULL;
> +        error_setg(&s->cpr_blocker,
> +                   "error: socket %s is not cpr capable due to %s option",
> +                   chr->label, (s->tls_creds ? "TLS" : "websocket"));
> +        if (cpr_add_blocker(&s->cpr_blocker, errp, CPR_MODE_RESTART, 0)) {
> +            return;
> +        }
> +    }
> +
>      /* be isn't opened until we get a connection */
>      *be_opened = false;
>  
> @@ -1403,6 +1446,8 @@ static void qmp_chardev_open_socket(Chardev *chr,
>              return;
>          }
>      }
> +
> +    load_char_socket_fd(chr, errp);
>  }
>  
>  static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend *backend,
> diff --git a/include/chardev/char-socket.h b/include/chardev/char-socket.h
> index 0708ca6..1c3abf7 100644
> --- a/include/chardev/char-socket.h
> +++ b/include/chardev/char-socket.h
> @@ -78,6 +78,7 @@ struct SocketChardev {
>      bool connect_err_reported;
>  
>      QIOTask *connect_task;
> +    Error *cpr_blocker;
>  };
>  typedef struct SocketChardev SocketChardev;
>  
> diff --git a/monitor/hmp.c b/monitor/hmp.c
> index 15ca047..75e6739 100644
> --- a/monitor/hmp.c
> +++ b/monitor/hmp.c
> @@ -1501,4 +1501,7 @@ void monitor_init_hmp(Chardev *chr, bool use_readline, Error **errp)
>      qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read, monitor_read,
>                               monitor_event, NULL, &mon->common, NULL, true);
>      monitor_list_append(&mon->common);
> +
> +    /* monitor cannot yet be preserved across cpr */
> +    chr->reopen_on_cpr = true;
>  }
> diff --git a/monitor/qmp.c b/monitor/qmp.c
> index 092c527..0043459 100644
> --- a/monitor/qmp.c
> +++ b/monitor/qmp.c
> @@ -535,4 +535,7 @@ void monitor_init_qmp(Chardev *chr, bool pretty, Error **errp)
>                                   NULL, &mon->common, NULL, true);
>          monitor_list_append(&mon->common);
>      }
> +
> +    /* Monitor cannot yet be preserved across cpr */
> +    chr->reopen_on_cpr = true;
>  }
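
To make the error_prepend suggestion for load_char_socket_fd() concrete,
here is a standalone sketch of the pattern.  It uses a tiny mock error type,
not QEMU's Error API: DemoError, demo_error_set() and demo_error_prepend()
are hypothetical and only imitate the behaviour of error_setg() and
error_prepend() that matters here.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature error object. */
typedef struct DemoError {
    char msg[256];
} DemoError;

/* Like error_setg(): the destination must not already hold an error. */
static void demo_error_set(DemoError **errp, const char *fmt, ...)
{
    va_list ap;
    DemoError *err;

    assert(errp && *errp == NULL);
    err = calloc(1, sizeof(*err));
    assert(err);
    va_start(ap, fmt);
    vsnprintf(err->msg, sizeof(err->msg), fmt, ap);
    va_end(ap);
    *errp = err;
}

/* Like error_prepend(): keep the existing message, add context in front. */
static void demo_error_prepend(DemoError **errp, const char *fmt, ...)
{
    va_list ap;
    char prefix[128];
    char joined[sizeof(prefix) + sizeof((*errp)->msg)];

    assert(errp && *errp);
    va_start(ap, fmt);
    vsnprintf(prefix, sizeof(prefix), fmt, ap);
    va_end(ap);
    snprintf(joined, sizeof(joined), "%s%s", prefix, (*errp)->msg);
    memcpy((*errp)->msg, joined, sizeof((*errp)->msg) - 1);
    (*errp)->msg[sizeof((*errp)->msg) - 1] = '\0';
}

/* Stand-in for a callee that fails and reports why. */
static int demo_open_channel(int fd, DemoError **errp)
{
    if (fd < 0) {
        demo_error_set(errp, "invalid fd %d", fd);
        return -1;
    }
    return 0;
}

/* The caller adds context instead of setting a second error. */
static int demo_restore_socket(const char *label, int fd, DemoError **errp)
{
    if (demo_open_channel(fd, errp) < 0) {
        demo_error_prepend(errp, "could not restore socket for %s: ", label);
        return -1;
    }
    return 0;
}
```

With this shape the caller sees both the context and the underlying reason,
and the "set" helper never runs on an errp that already holds an error.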



* Re: [PATCH V8 27/39] vfio-pci: cpr part 1 (fd and dma)
  2022-06-15 14:52 ` [PATCH V8 27/39] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
  2022-06-29 19:14   ` Alex Williamson
@ 2022-07-03  8:32   ` Peng Liang
  2022-07-05 18:29     ` Steven Sistare
  1 sibling, 1 reply; 84+ messages in thread
From: Peng Liang @ 2022-07-03  8:32 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow



On 6/15/2022 10:52 PM, Steve Sistare wrote:
> Enable vfio-pci devices to be saved and restored across an exec restart
> of qemu.
> 
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in cpr state.
> 
> In the container pre_save handler, suspend the use of virtual addresses in
> DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be
> remapped at a different VA after exec.  DMA to already-mapped pages
> continues.  Save the msi message area as part of vfio-pci vmstate, save the
> interrupt and notifier eventfd's in cpr state, and clear the close-on-exec
> flag for the vfio descriptors.  The flag is not cleared earlier because the
> descriptors should not persist across miscellaneous fork and exec calls
> that may be performed during normal operation.
> 
> On qemu restart, vfio_realize() finds the saved descriptors, uses
> the descriptors, and notes that the device is being reused.  Device and
> iommu state is already configured, so operations in vfio_realize that
> would modify the configuration are skipped for a reused device, including
> vfio ioctl's and writes to PCI configuration space.  Vfio PCI device reset
> is also suppressed. The result is that vfio_realize constructs qemu data
> structures that reflect the current state of the device.  However, the
> reconstruction is not complete until cpr-load is called. cpr-load loads the
> msi data.  The vfio post_load handler finds eventfds in cpr state, rebuilds
> vector data structures, and attaches the interrupts to the new KVM instance.
> The container post_load handler then invokes the main vfio listener
> callback, which walks the flattened ranges of the vfio address space and
> calls VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly,
> cpr-load starts the VM.
> 
> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> support.  Part 3 adds INTX support.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  MAINTAINERS                   |   1 +
>  hw/pci/pci.c                  |  12 ++++
>  hw/vfio/common.c              | 151 +++++++++++++++++++++++++++++++++++-------
>  hw/vfio/cpr.c                 | 119 +++++++++++++++++++++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 |  44 ++++++++++++
>  hw/vfio/trace-events          |   1 +
>  include/hw/vfio/vfio-common.h |  11 +++
>  include/migration/vmstate.h   |   1 +
>  9 files changed, 317 insertions(+), 24 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 74a43e6..864aec6 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3156,6 +3156,7 @@ CPR
>  M: Steve Sistare <steven.sistare@oracle.com>
>  M: Mark Kanda <mark.kanda@oracle.com>
>  S: Maintained
> +F: hw/vfio/cpr.c
>  F: include/migration/cpr.h
>  F: migration/cpr.c
>  F: qapi/cpr.json
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 6e70153..a3b19eb 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -32,6 +32,7 @@
>  #include "hw/pci/pci_host.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/qdev-properties-system.h"
> +#include "migration/cpr.h"
>  #include "migration/qemu-file-types.h"
>  #include "migration/vmstate.h"
>  #include "monitor/monitor.h"
> @@ -341,6 +342,17 @@ static void pci_reset_regions(PCIDevice *dev)
>  
>  static void pci_do_device_reset(PCIDevice *dev)
>  {
> +    /*
> +     * A PCI device that is resuming for cpr is already configured, so do
> +     * not reset it here when we are called from qemu_system_reset prior to
> +     * cpr-load, else interrupts may be lost for vfio-pci devices.  It is
> +     * safe to skip this reset for all PCI devices, because cpr-load will set
> +     * all fields that would have been set here.
> +     */
> +    if (cpr_get_mode() == CPR_MODE_RESTART) {
> +        return;
> +    }
> +
>      pci_device_deassert_intx(dev);
>      assert(dev->irq_state == 0);
>  
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index ace9562..c7d73b6 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -31,6 +31,7 @@
>  #include "exec/memory.h"
>  #include "exec/ram_addr.h"
>  #include "hw/hw.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/range.h"
> @@ -460,6 +461,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> +    assert(!container->reused);
> +
>      if (iotlb && container->dirty_pages_supported &&
>          vfio_devices_all_running_and_saving(container)) {
>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> @@ -496,12 +499,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>  {
>      struct vfio_iommu_type1_dma_map map = {
>          .argsz = sizeof(map),
> -        .flags = VFIO_DMA_MAP_FLAG_READ,
>          .vaddr = (__u64)(uintptr_t)vaddr,
>          .iova = iova,
>          .size = size,
>      };
>  
> +    /*
> +     * Set the new vaddr for any mappings registered during cpr-load.
> +     * Reused is cleared thereafter.
> +     */
> +    if (container->reused) {
> +        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
> +        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +            goto fail;
> +        }
> +        return 0;
> +    }
> +
> +    map.flags = VFIO_DMA_MAP_FLAG_READ;
>      if (!readonly) {
>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>      }
> @@ -517,7 +532,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          return 0;
>      }
>  
> -    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
> +fail:
> +    error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
> +        (container->reused ? "VADDR" : ""), iova, size, vaddr, strerror(errno));
>      return -errno;
>  }
>  
> @@ -882,6 +899,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    vfio_container_region_add(container, section);
> +}
> +
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section)
> +{
>      hwaddr iova, end;
>      Int128 llend, llsize;
>      void *vaddr;
> @@ -1492,6 +1515,12 @@ static void vfio_listener_release(VFIOContainer *container)
>      }
>  }
>  
> +void vfio_listener_register(VFIOContainer *container)
> +{
> +    container->listener = vfio_memory_listener;
> +    memory_listener_register(&container->listener, container->space->as);
> +}
> +
>  static struct vfio_info_cap_header *
>  vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
>  {
> @@ -1910,6 +1939,22 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>  {
>      int iommu_type, ret;
>  
> +    /*
> +     * If container is reused, just set its type and skip the ioctls, as the
> +     * container and group are already configured in the kernel.
> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> +     */
> +    if (container->reused) {
> +        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU)) {
> +            container->iommu_type = VFIO_TYPE1v2_IOMMU;
> +            return 0;
> +        } else {
> +            error_setg(errp, "container was reused but VFIO_TYPE1v2_IOMMU "
> +                             "is not supported");
> +            return -errno;
> +        }
> +    }
> +
>      iommu_type = vfio_get_iommu_type(container, errp);
>      if (iommu_type < 0) {
>          return iommu_type;
> @@ -2014,9 +2059,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  {
>      VFIOContainer *container;
>      int ret, fd;
> +    bool reused;
>      VFIOAddressSpace *space;
>  
>      space = vfio_get_address_space(as);
> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
> +    reused = (fd > 0);
>  
>      /*
>       * VFIO is currently incompatible with discarding of RAM insofar as the
> @@ -2049,27 +2097,47 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       * details once we know which type of IOMMU we are using.
>       */
>  
> +    /*
> +     * If the container is reused, then the group is already attached in the
> +     * kernel.  If a container with matching fd is found, then update the
> +     * userland group list and return.  If not, then after the loop, create
> +     * the container struct and group list.
> +     */
> +
>      QLIST_FOREACH(container, &space->containers, next) {
> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> -            ret = vfio_ram_block_discard_disable(container, true);
> -            if (ret) {
> -                error_setg_errno(errp, -ret,
> -                                 "Cannot set discarding of RAM broken");
> -                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
> -                          &container->fd)) {
> -                    error_report("vfio: error disconnecting group %d from"
> -                                 " container", group->groupid);
> -                }
> -                return ret;
> +        if (reused) {
> +            if (container->fd != fd) {
> +                continue;
>              }
> -            group->container = container;
> -            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +            continue;
> +        }
> +
> +        ret = vfio_ram_block_discard_disable(container, true);
> +        if (ret) {
> +            error_setg_errno(errp, -ret,
> +                             "Cannot set discarding of RAM broken");
> +            if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
> +                      &container->fd)) {
> +                error_report("vfio: error disconnecting group %d from"
> +                             " container", group->groupid);
> +            }
> +            return ret;
> +        }
> +        group->container = container;
> +        QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +        if (!reused) {
>              vfio_kvm_device_add_group(group);
> -            return 0;
> +            cpr_save_fd("vfio_container_for_group", group->groupid,
> +                        container->fd);
>          }
> +        return 0;
> +    }
> +
> +    if (!reused) {
> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>      }
>  
> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>          ret = -errno;
> @@ -2087,6 +2155,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      container = g_malloc0(sizeof(*container));
>      container->space = space;
>      container->fd = fd;
> +    container->reused = reused;
>      container->error = NULL;
>      container->dirty_pages_supported = false;
>      container->dma_max_mappings = 0;
> @@ -2099,6 +2168,11 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>          goto free_container_exit;
>      }
>  
> +    ret = vfio_cpr_register_container(container, errp);
> +    if (ret) {
> +        goto free_container_exit;
> +    }
> +
>      ret = vfio_ram_block_discard_disable(container, true);
>      if (ret) {
>          error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
> @@ -2213,9 +2287,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      group->container = container;
>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>  
> -    container->listener = vfio_memory_listener;
> -
> -    memory_listener_register(&container->listener, container->space->as);
> +    /*
> +     * If reused, register the listener later, after all state that may
> +     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
> +     * the listener will invoke its callback on each flat section and call
> +     * vfio_dma_map to supply the new vaddr, and the calls will match the
> +     * mappings remembered by the kernel.
> +     */
> +    if (!reused) {
> +        vfio_listener_register(container);
> +    }
>  
>      if (container->error) {
>          ret = -1;
> @@ -2225,8 +2306,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      container->initialized = true;
> +    ret = cpr_resave_fd("vfio_container_for_group", group->groupid, fd, errp);
>  
> -    return 0;
> +    return ret;
>  listener_release_exit:
>      QLIST_REMOVE(group, container_next);
>      QLIST_REMOVE(container, next);
> @@ -2254,6 +2336,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  
>      QLIST_REMOVE(group, container_next);
>      group->container = NULL;
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>  
>      /*
>       * Explicitly release the listener first before unset container,
> @@ -2290,6 +2373,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>          }
>  
>          trace_vfio_disconnect_container(container->fd);
> +        vfio_cpr_unregister_container(container);
>          close(container->fd);
>          g_free(container);
>  
> @@ -2319,7 +2403,12 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      group = g_malloc0(sizeof(*group));
>  
>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open_old(path, O_RDWR);
> +
> +    group->fd = cpr_find_fd("vfio_group", groupid);
> +    if (group->fd < 0) {
> +        group->fd = qemu_open_old(path, O_RDWR);
> +    }
> +
>      if (group->fd < 0) {
>          error_setg_errno(errp, errno, "failed to open %s", path);
>          goto free_group_exit;
> @@ -2353,6 +2442,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  
>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>  
> +    if (cpr_resave_fd("vfio_group", groupid, group->fd, errp)) {
> +        goto close_fd_exit;
> +    }
> +
>      return group;
>  
>  close_fd_exit:
> @@ -2377,6 +2470,7 @@ void vfio_put_group(VFIOGroup *group)
>      vfio_disconnect_container(group);
>      QLIST_REMOVE(group, next);
>      trace_vfio_put_group(group->fd);
> +    cpr_delete_fd("vfio_group", group->groupid);
>      close(group->fd);
>      g_free(group);
>  
> @@ -2390,8 +2484,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>  {
>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>      int ret, fd;
> +    bool reused;
> +
> +    fd = cpr_find_fd(name, 0);
> +    reused = (fd >= 0);
> +    if (!reused) {
> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    }
>  
> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "error getting device from group %d",
>                           group->groupid);
> @@ -2436,12 +2536,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>      vbasedev->num_irqs = dev_info.num_irqs;
>      vbasedev->num_regions = dev_info.num_regions;
>      vbasedev->flags = dev_info.flags;
> +    vbasedev->reused = reused;
>  
>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>                            dev_info.num_irqs);
>  
>      vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
> -    return 0;
> +    ret = cpr_resave_fd(name, 0, fd, errp);
> +    return ret;
>  }
>  
>  void vfio_put_base_device(VFIODevice *vbasedev)
> @@ -2452,6 +2554,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>      QLIST_REMOVE(vbasedev, next);
>      vbasedev->group = NULL;
>      trace_vfio_put_base_device(vbasedev->fd);
> +    cpr_delete_fd(vbasedev->name, 0);
>      close(vbasedev->fd);
>  }
>  
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> new file mode 100644
> index 0000000..a227d5e
> --- /dev/null
> +++ b/hw/vfio/cpr.c
> @@ -0,0 +1,119 @@
> +/*
> + * Copyright (c) 2021, 2022 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +#include "hw/vfio/vfio-common.h"
> +#include "sysemu/kvm.h"
> +#include "qapi/error.h"
> +#include "migration/cpr.h"
> +#include "migration/vmstate.h"
> +#include "trace.h"
> +
> +static int
> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> +{
> +    struct vfio_iommu_type1_dma_unmap unmap = {
> +        .argsz = sizeof(unmap),
> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
> +        .iova = 0,
> +        .size = 0,
> +    };
> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> +        return -errno;
> +    }
> +    container->vaddr_unmapped = true;
> +    return 0;
> +}
> +
> +static bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> +{
> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
> +                         "or VFIO_UNMAP_ALL");
> +        return false;
> +    } else {
> +        return true;
> +    }
> +}
> +
> +static bool vfio_vmstate_needed(void *opaque)
> +{
> +    return cpr_get_mode() == CPR_MODE_RESTART;
> +}
> +
> +static int vfio_container_pre_save(void *opaque)
> +{
> +    VFIOContainer *container = (VFIOContainer *)opaque;
> +    Error *err;

According to the description of error_setg(), local Error variables should be
initialized to NULL.  The following coccinelle script from Markus should help
fix the problem automatically :)

@ r @
identifier id;
@@
(
  static Error *id;
|
  Error *id
+ = NULL
  ;
)
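
To illustrate why the initialization matters, here is a minimal standalone
sketch.  The Error struct and the sketch_error_setg()/sketch_error_free()
helpers are hypothetical stand-ins, not QEMU's real implementation; they only
mimic the contract that the caller's Error pointer must be NULL when an error
is stored into it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for QEMU's Error; not the real type. */
typedef struct Error {
    const char *msg;
} Error;

/*
 * Mimics the error_setg() contract: *errp must be NULL on entry.
 * Returns 0 if the error was stored or ignored, and -1 if the contract
 * was violated (this sketch reports the violation instead of aborting).
 */
static int sketch_error_setg(Error **errp, const char *msg)
{
    Error *err;

    if (errp == NULL) {
        return 0;               /* caller chose to ignore errors */
    }
    if (*errp != NULL) {
        return -1;              /* stale or garbage pointer: contract broken */
    }
    err = malloc(sizeof(*err));
    if (err == NULL) {
        abort();
    }
    err->msg = msg;
    *errp = err;
    return 0;
}

static void sketch_error_free(Error *err)
{
    free(err);
}
```

With "Error *err;" left uninitialized, as in vfio_container_pre_save() and
vfio_container_post_load() in this patch, the pointer handed to the callee is
indeterminate, so the "*errp != NULL" case above can be hit by accident.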

> +
> +    if (!vfio_is_cpr_capable(container, &err) ||
> +        vfio_dma_unmap_vaddr_all(container, &err)) {
> +        error_report_err(err);
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +static int vfio_container_post_load(void *opaque, int version_id)
> +{
> +    VFIOContainer *container = (VFIOContainer *)opaque;
> +    VFIOGroup *group;
> +    Error *err;
> +    VFIODevice *vbasedev;
> +
> +    if (!vfio_is_cpr_capable(container, &err)) {
> +        error_report_err(err);
> +        return -1;
> +    }
> +
> +    vfio_listener_register(container);
> +    container->reused = false;
> +
> +    QLIST_FOREACH(group, &container->group_list, container_next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vbasedev->reused = false;
> +        }
> +    }
> +    return 0;
> +}
> +
> +static const VMStateDescription vfio_container_vmstate = {
> +    .name = "vfio-container",
> +    .unmigratable = 1,
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .pre_save = vfio_container_pre_save,
> +    .post_load = vfio_container_post_load,
> +    .needed = vfio_vmstate_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +int vfio_cpr_register_container(VFIOContainer *container, Error **errp)
> +{
> +    container->cpr_blocker = NULL;
> +    if (!vfio_is_cpr_capable(container, &container->cpr_blocker)) {
> +        return cpr_add_blocker(&container->cpr_blocker, errp,
> +                               CPR_MODE_RESTART, 0);
> +    }
> +
> +    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
> +
> +    return 0;
> +}
> +
> +void vfio_cpr_unregister_container(VFIOContainer *container)
> +{
> +    cpr_del_blocker(&container->cpr_blocker);
> +
> +    vmstate_unregister(NULL, &vfio_container_vmstate, container);
> +}
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index da9af29..e247b2b 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>    'migration.c',
>  ))
>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> +  'cpr.c',
>    'display.c',
>    'pci-quirks.c',
>    'pci.c',
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 0143c9a..237231b 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -30,6 +30,7 @@
>  #include "hw/qdev-properties-system.h"
>  #include "migration/vmstate.h"
>  #include "qapi/qmp/qdict.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/module.h"
> @@ -2514,6 +2515,7 @@ const VMStateDescription vmstate_vfio_pci_config = {
>      .name = "VFIOPCIDevice",
>      .version_id = 1,
>      .minimum_version_id = 1,
> +    .priority = MIG_PRI_VFIO_PCI,   /* must load before container */
>      .fields = (VMStateField[]) {
>          VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>          VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
> @@ -3243,6 +3245,11 @@ static void vfio_pci_reset(DeviceState *dev)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>  
> +    /* Do not reset the device during qemu_system_reset prior to cpr-load */
> +    if (vdev->vbasedev.reused) {
> +        return;
> +    }
> +
>      trace_vfio_pci_reset(vdev->vbasedev.name);
>  
>      vfio_pci_pre_reset(vdev);
> @@ -3350,6 +3357,42 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +/*
> + * The kernel may change non-emulated config bits.  Exclude them from the
> + * changed-bits check in get_pci_config_device.
> + */
> +static int vfio_pci_pre_load(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    int i;
> +
> +    for (i = 0; i < size; i++) {
> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
> +    }
> +
> +    return 0;
> +}
> +
> +static bool vfio_pci_needed(void *opaque)
> +{
> +    return cpr_get_mode() == CPR_MODE_RESTART;
> +}
> +
> +static const VMStateDescription vfio_pci_vmstate = {
> +    .name = "vfio-pci",
> +    .unmigratable = 1,
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .priority = MIG_PRI_VFIO_PCI,       /* must load before container */
> +    .pre_load = vfio_pci_pre_load,
> +    .needed = vfio_pci_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -3357,6 +3400,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  
>      dc->reset = vfio_pci_reset;
>      device_class_set_props(dc, vfio_pci_dev_properties);
> +    dc->vmsd = &vfio_pci_vmstate;
>      dc->desc = "VFIO-based PCI device assignment";
>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>      pdc->realize = vfio_realize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 73dffe9..a6d0034 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -119,6 +119,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  vfio_dma_unmap_overflow_workaround(void) ""
> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>  
>  # platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index e573f5a..17ad9ba 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -81,10 +81,14 @@ typedef struct VFIOContainer {
>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>      MemoryListener listener;
>      MemoryListener prereg_listener;
> +    Notifier cpr_notifier;
> +    Error *cpr_blocker;
>      unsigned iommu_type;
>      Error *error;
>      bool initialized;
>      bool dirty_pages_supported;
> +    bool reused;
> +    bool vaddr_unmapped;
>      uint64_t dirty_pgsizes;
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
> @@ -136,6 +140,7 @@ typedef struct VFIODevice {
>      bool no_mmap;
>      bool ram_block_discard_allowed;
>      bool enable_migration;
> +    bool reused;
>      VFIODeviceOps *ops;
>      unsigned int num_irqs;
>      unsigned int num_regions;
> @@ -213,6 +218,9 @@ void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
>  
> +int vfio_cpr_register_container(VFIOContainer *container, Error **errp);
> +void vfio_cpr_unregister_container(VFIOContainer *container);
> +
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>  extern VFIOGroupList vfio_group_list;
> @@ -234,6 +242,9 @@ struct vfio_info_cap_header *
>  vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
> +void vfio_listener_register(VFIOContainer *container);
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section);
>  
>  int vfio_spapr_create_window(VFIOContainer *container,
>                               MemoryRegionSection *section,
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index ad24aa1..19f1538 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -157,6 +157,7 @@ typedef enum {
>      MIG_PRI_GICV3_ITS,          /* Must happen before PCI devices */
>      MIG_PRI_GICV3,              /* Must happen before the ITS */
>      MIG_PRI_MAX,
> +    MIG_PRI_VFIO_PCI = MIG_PRI_IOMMU,
>  } MigrationPriority;
>  
>  struct VMStateField {


* Re: [PATCH V8 04/39] memory: RAM_ANON flag
  2022-06-15 20:25   ` David Hildenbrand
@ 2022-07-05 18:23     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:23 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, John Snow

On 6/15/2022 4:25 PM, David Hildenbrand wrote:
> On 15.06.22 16:51, Steve Sistare wrote:
>> A memory-backend-ram or a memory-backend-memfd block with the RAM_SHARED
>> flag set is not migrated when migrate_ignore_shared() is true, but this
>> is wrong, because it has no named backing store, and its contents will be
>> lost.  Define a new flag RAM_ANON to distinguish this case.  Cpr will also
>> test this flag, for similar reasons.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  backends/hostmem-epc.c   |  2 +-
>>  backends/hostmem-memfd.c |  1 +
>>  backends/hostmem-ram.c   |  1 +
>>  include/exec/memory.h    |  3 +++
>>  include/exec/ram_addr.h  |  1 +
>>  migration/ram.c          |  3 ++-
>>  softmmu/physmem.c        | 12 +++++++++---
>>  7 files changed, 18 insertions(+), 5 deletions(-)
>>
>> diff --git a/backends/hostmem-epc.c b/backends/hostmem-epc.c
>> index 037292d..cb06255 100644
>> --- a/backends/hostmem-epc.c
>> +++ b/backends/hostmem-epc.c
>> @@ -37,7 +37,7 @@ sgx_epc_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>      }
>>  
>>      name = object_get_canonical_path(OBJECT(backend));
>> -    ram_flags = (backend->share ? RAM_SHARED : 0) | RAM_PROTECTED;
>> +    ram_flags = (backend->share ? RAM_SHARED : 0) | RAM_PROTECTED | MAP_ANON;
> 
> I'm pretty sure that doesn't compile. -> RAM_ANON

Oh, it does compile, but it does not do what we want!  Thanks for the catch.
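
For anyone wondering why the typo compiles: MAP_ANON is an ordinary integer
macro from <sys/mman.h>, so the compiler accepts it in any flag expression.
A minimal sketch, using the RAM_PROTECTED and RAM_ANON values quoted in this
patch.  The sketch_* helpers are hypothetical, the RAM_SHARED term of the real
expression is left out for brevity, and the numeric value of MAP_ANON is
platform-specific.

```c
#define _DEFAULT_SOURCE         /* expose MAP_ANON under strict -std= modes */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/mman.h>

#define RAM_PROTECTED (1 << 8)  /* value from include/exec/memory.h */
#define RAM_ANON      (1 << 9)  /* value proposed in this patch */

/* Hypothetical helper: builds the flags the way the buggy line did. */
static uint32_t sketch_epc_flags_buggy(void)
{
    return RAM_PROTECTED | MAP_ANON;    /* compiles, but wrong flag namespace */
}

/* Hypothetical helper: builds the flags as intended. */
static uint32_t sketch_epc_flags_fixed(void)
{
    return RAM_PROTECTED | RAM_ANON;
}

/* True if the RAM_ANON bit is set in a set of ram flags. */
static bool sketch_is_anon(uint32_t ram_flags)
{
    return (ram_flags & RAM_ANON) != 0;
}
```

The buggy variant sets whichever RAM_* bit happens to share MAP_ANON's value
on the build host, and the RAM_ANON test never sees the bit it is looking for.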

>>      memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend),
>>                                     name, backend->size, ram_flags,
>>                                     fd, 0, errp);
>> diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
>> index 3fc85c3..c9d8001 100644
>> --- a/backends/hostmem-memfd.c
>> +++ b/backends/hostmem-memfd.c
>> @@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>      name = host_memory_backend_get_name(backend);
>>      ram_flags = backend->share ? RAM_SHARED : 0;
>>      ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>> +    ram_flags |= RAM_ANON;
>>      memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
>>                                     backend->size, ram_flags, fd, 0, errp);
>>      g_free(name);
>> diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
>> index b8e55cd..5e80149 100644
>> --- a/backends/hostmem-ram.c
>> +++ b/backends/hostmem-ram.c
>> @@ -30,6 +30,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
>>      name = host_memory_backend_get_name(backend);
>>      ram_flags = backend->share ? RAM_SHARED : 0;
>>      ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
>> +    ram_flags |= RAM_ANON;
>>      memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
>>                                             backend->size, ram_flags, errp);
>>      g_free(name);
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index f1c1945..0daddd7 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -203,6 +203,9 @@ typedef struct IOMMUTLBEvent {
>>  /* RAM that isn't accessible through normal means. */
>>  #define RAM_PROTECTED (1 << 8)
>>  
>> +/* RAM has no name outside the qemu process. */
>> +#define RAM_ANON (1 << 9)
> 
> That name is a bit misleading because it mangles anonymous memory with
> an anonymous file, which doesn't provide anonymous memory in "kernel
> speak". Please find a better name, some idea below ...
> 
> I think what you actually want to know is: is this from a real file,
> instead of from an anonymous file or anonymous memory. A real file can
> be re-opened and remapped after closing QEMU. Further, you need
> MAP_SHARED semantics.
> 
> 
> /* RAM maps a real file instead of an anonymous file or no file/fd. */
> #define RAM_REAL_FILE (1 << 9)
> 
> bool ramblock_maps_real_file(RAMBlock *rb)
> {
>     return rb->flags & RAM_REAL_FILE;
> }
> 
> 
> Maybe we can come up with a better name for "real file".

Sure.  Ideas:
  RAM_FILE
  RAM_NAMED
  RAM_NAMED_FILE

> Set the flag from applicable callsites. When setting the flag
> internally, assert that we don't have a fd -- that cannot possibly make
> sense.

It will only be set in hostmem-file.c.

> At applicable callsites check for ramblock_maps_real_file() and that
> it's actually a shared mapping. If not, it cannot be preserved by
> restarting QEMU (easily, there might be ways for memfd involving other
> processes).

Memfd is allowed for cpr restart by virtue of being shared and having an
fd which can be mapped, which I test for.  See ram_is_volatile in patch 22.
ramblock_is_anon() becomes !ramblock_is_file().
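
A rough sketch of the two checks discussed above, to make the distinction
concrete.  This is not the code of ram_is_volatile in patch 22.
SketchRAMBlock is a reduced hypothetical stand-in for RAMBlock, RAM_NAMED_FILE
is just one of the candidate names, and the RAM_SHARED value is assumed to
match QEMU's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RAM_SHARED     (1 << 1)     /* assumed to match QEMU's value */
#define RAM_NAMED_FILE (1 << 9)     /* hypothetical; one candidate name */

/* Hypothetical, reduced stand-in for RAMBlock. */
typedef struct SketchRAMBlock {
    uint32_t flags;
    int fd;                         /* -1 if the block has no descriptor */
} SketchRAMBlock;

/* The inverse of the old ramblock_is_anon() test. */
static bool sketch_ramblock_is_file(const SketchRAMBlock *rb)
{
    return (rb->flags & RAM_NAMED_FILE) != 0;
}

/*
 * David's criterion: a real file with MAP_SHARED semantics can be
 * re-opened and remapped after QEMU has exited.
 */
static bool sketch_reopenable_after_exit(const SketchRAMBlock *rb)
{
    return (rb->flags & RAM_SHARED) && sketch_ramblock_is_file(rb);
}

/*
 * The restart criterion described above: a shared block with an fd that
 * can be mapped again is enough, which is why memfd qualifies.
 */
static bool sketch_ok_for_restart(const SketchRAMBlock *rb)
{
    return (rb->flags & RAM_SHARED) && rb->fd >= 0;
}
```

A shared memfd block passes the restart check but not the re-open check,
since it has no name outside the qemu process.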

- Steve


* Re: [PATCH V8 02/39] migration: qemu file wrappers
  2022-06-16  2:18   ` Guoyi Tu
@ 2022-07-05 18:24     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:24 UTC (permalink / raw)
  To: Guoyi Tu, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/15/2022 10:18 PM, Guoyi Tu wrote:
> On 2022/6/15 22:51, Steve Sistare wrote:
>> Add qemu_file_open and qemu_fd_open to create QEMUFile objects for unix
>> files and file descriptors.
>>
> the function names should be updated.
> 
> -- 
> Guoyi

Yes indeed, thanks - Steve

>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   migration/qemu-file-channel.c | 36 ++++++++++++++++++++++++++++++++++++
>>   migration/qemu-file-channel.h |  6 ++++++
>>   2 files changed, 42 insertions(+)
>>
>> diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
>> index bb5a575..cc5aebc 100644
>> --- a/migration/qemu-file-channel.c
>> +++ b/migration/qemu-file-channel.c
>> @@ -27,8 +27,10 @@
>>   #include "qemu-file.h"
>>   #include "io/channel-socket.h"
>>   #include "io/channel-tls.h"
>> +#include "io/channel-file.h"
>>   #include "qemu/iov.h"
>>   #include "qemu/yank.h"
>> +#include "qapi/error.h"
>>   #include "yank_functions.h"
>>     @@ -192,3 +194,37 @@ QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
>>       object_ref(OBJECT(ioc));
>>       return qemu_fopen_ops(ioc, &channel_output_ops, true);
>>   }
>> +
>> +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
>> +                          const char *name, Error **errp)
>> +{
>> +    g_autoptr(QIOChannelFile) fioc = NULL;
>> +    QIOChannel *ioc;
>> +    QEMUFile *f;
>> +
>> +    if (flags & O_RDWR) {
>> +        error_setg(errp, "qemu_fopen_file %s: O_RDWR not supported", path);
>> +        return NULL;
>> +    }
>> +
>> +    fioc = qio_channel_file_new_path(path, flags, mode, errp);
>> +    if (!fioc) {
>> +        return NULL;
>> +    }
>> +
>> +    ioc = QIO_CHANNEL(fioc);
>> +    qio_channel_set_name(ioc, name);
>> +    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
>> +                             qemu_fopen_channel_input(ioc);
>> +    return f;
>> +}
>> +
>> +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name)
>> +{
>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>> +    QEMUFile *f = writable ? qemu_fopen_channel_output(ioc) :
>> +                             qemu_fopen_channel_input(ioc);
>> +    qio_channel_set_name(ioc, name);
>> +    return f;
>> +}
>> diff --git a/migration/qemu-file-channel.h b/migration/qemu-file-channel.h
>> index 0028a09..75fd0ad 100644
>> --- a/migration/qemu-file-channel.h
>> +++ b/migration/qemu-file-channel.h
>> @@ -29,4 +29,10 @@
>>     QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc);
>>   QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc);
>> +
>> +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
>> +                         const char *name, Error **errp);
>> +
>> +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name);
>> +
>>   #endif


* Re: [PATCH V8 02/39] migration: qemu file wrappers
  2022-06-16 14:55   ` Marc-André Lureau
@ 2022-07-05 18:25     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:25 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/16/2022 10:55 AM, Marc-André Lureau wrote:
> Hi
> 
> On Wed, Jun 15, 2022 at 6:54 PM Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>     Add qemu_file_open and qemu_fd_open to create QEMUFile objects for unix
>     files and file descriptors.
> 
> File descriptors are not really unix specific, but that's a detail.

OK, I will change the description.

> The names of the functions in the summary do not match the code, also details :)

Yup, will fix.

> Eventually, I would suggest to follow the libc fopen/fdopen naming, if that makes sense. (or the QIOChannel naming)

OK. I'll use the names that Daniel suggests.

>     Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>     ---
>      migration/qemu-file-channel.c | 36 ++++++++++++++++++++++++++++++++++++
>      migration/qemu-file-channel.h |  6 ++++++
>      2 files changed, 42 insertions(+)
> 
>     diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
>     index bb5a575..cc5aebc 100644
>     --- a/migration/qemu-file-channel.c
>     +++ b/migration/qemu-file-channel.c
>     @@ -27,8 +27,10 @@
>      #include "qemu-file.h"
>      #include "io/channel-socket.h"
>      #include "io/channel-tls.h"
>     +#include "io/channel-file.h"
>      #include "qemu/iov.h"
>      #include "qemu/yank.h"
>     +#include "qapi/error.h"
>      #include "yank_functions.h"
> 
> 
>     @@ -192,3 +194,37 @@ QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
>          object_ref(OBJECT(ioc));
>          return qemu_fopen_ops(ioc, &channel_output_ops, true);
>      }
>     +
>     +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
>     +                          const char *name, Error **errp)
>     +{
> 
> 
> I would add ERRP_GUARD();

error.h advises us not to clutter the code with ERRP_GUARD when it is not needed.

>     +    g_autoptr(QIOChannelFile) fioc = NULL;
>     +    QIOChannel *ioc;
>     +    QEMUFile *f;
>     +
>     +    if (flags & O_RDWR) {
>     +        error_setg(errp, "qemu_fopen_file %s: O_RDWR not supported", path);
>     +        return NULL;
>     +    }
> 
> 
> Why not take a "bool writable" instead, like the fdopen below?

I will ditch the bools and expand the function names as Daniel suggests.

>     +
>     +    fioc = qio_channel_file_new_path(path, flags, mode, errp);
>     +    if (!fioc) {
>     +        return NULL;
>     +    }
>     +
>     +    ioc = QIO_CHANNEL(fioc);
>     +    qio_channel_set_name(ioc, name);
>     +    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
>     +                             qemu_fopen_channel_input(ioc);
>     +    return f;
> 
> 
> "f" and parentheses are kinda superfluous

OK, will fix.

>     +}
>     +
>     +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name)
>     +{
>     +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>     +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>     +    QEMUFile *f = writable ? qemu_fopen_channel_output(ioc) :
>     +                             qemu_fopen_channel_input(ioc);
>     +    qio_channel_set_name(ioc, name);
>     +    return f;
> 
> 
> or:
> 
> g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
> qio_channel_set_name(QIO_CHANNEL(fioc), name);
> return writable ? qemu_fopen_channel_output(ioc) : qemu_fopen_channel_input(ioc);

OK:
    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
    qio_channel_set_name(QIO_CHANNEL(fioc), name);
    return writable ? qemu_fopen_channel_output(QIO_CHANNEL(fioc)) :
                      qemu_fopen_channel_input(QIO_CHANNEL(fioc));

- Steve

>     +}
>     diff --git a/migration/qemu-file-channel.h b/migration/qemu-file-channel.h
>     index 0028a09..75fd0ad 100644
>     --- a/migration/qemu-file-channel.h
>     +++ b/migration/qemu-file-channel.h
>     @@ -29,4 +29,10 @@
> 
>      QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc);
>      QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc);
>     +
>     +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
>     +                         const char *name, Error **errp);
>     +
>     +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name);
>     +
>      #endif
>     -- 
>     1.8.3.1
> 
> 
> 
> 
> -- 
> Marc-André Lureau


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V8 02/39] migration: qemu file wrappers
  2022-06-16 15:29   ` Daniel P. Berrangé
@ 2022-07-05 18:25     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:25 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/16/2022 11:29 AM, Daniel P. Berrangé wrote:
> On Wed, Jun 15, 2022 at 07:51:49AM -0700, Steve Sistare wrote:
>> Add qemu_fopen_file and qemu_fopen_fd to create QEMUFile objects for unix
>> files and file descriptors.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  migration/qemu-file-channel.c | 36 ++++++++++++++++++++++++++++++++++++
>>  migration/qemu-file-channel.h |  6 ++++++
>>  2 files changed, 42 insertions(+)
>>
>> diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
>> index bb5a575..cc5aebc 100644
>> --- a/migration/qemu-file-channel.c
>> +++ b/migration/qemu-file-channel.c
>> @@ -27,8 +27,10 @@
>>  #include "qemu-file.h"
>>  #include "io/channel-socket.h"
>>  #include "io/channel-tls.h"
>> +#include "io/channel-file.h"
>>  #include "qemu/iov.h"
>>  #include "qemu/yank.h"
>> +#include "qapi/error.h"
>>  #include "yank_functions.h"
>>  
>>  
>> @@ -192,3 +194,37 @@ QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
>>      object_ref(OBJECT(ioc));
>>      return qemu_fopen_ops(ioc, &channel_output_ops, true);
>>  }
>> +
>> +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
>> +                          const char *name, Error **errp)
>> +{
>> +    g_autoptr(QIOChannelFile) fioc = NULL;
>> +    QIOChannel *ioc;
>> +    QEMUFile *f;
>> +
>> +    if (flags & O_RDWR) {
> 
> IIRC, O_RDWR may expand to more than 1 bit, so needs a strict
> equality test
> 
>    if ((flags & O_RDWR) == O_RDWR)

Hmm, on what OS?  No harm if I just do it, but the next reviewer will tell 
me to remove the unnecessary equality test :)
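
A minimal sketch of the portable form of this check, assuming only POSIX
<fcntl.h>.  The helper names are made up for illustration and are not part
of the patch.  On Linux the access modes are the small integers 0, 1 and 2,
so the bit tests in the patch happen to work there; masking with O_ACCMODE
and comparing for equality does not depend on that encoding.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helpers: classify open(2) flags by access mode.
 * O_ACCMODE masks out everything except the access mode, so the
 * comparison is valid whether the modes are encoded as small
 * integers or as separate bits.
 */
static bool flags_are_rdwr(int flags)
{
    return (flags & O_ACCMODE) == O_RDWR;
}

static bool flags_are_wronly(int flags)
{
    return (flags & O_ACCMODE) == O_WRONLY;
}
```

With helpers like these, the O_RDWR rejection and the input/output selection
in qemu_fopen_file would both test the masked access mode rather than
individual bits.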

>> +        error_setg(errp, "qemu_fopen_file %s: O_RDWR not supported", path);
>> +        return NULL;
>> +    }
>> +
>> +    fioc = qio_channel_file_new_path(path, flags, mode, errp);
>> +    if (!fioc) {
>> +        return NULL;
>> +    }
>> +
>> +    ioc = QIO_CHANNEL(fioc);
>> +    qio_channel_set_name(ioc, name);
>> +    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
>> +                             qemu_fopen_channel_input(ioc);
>> +    return f;
>> +}
>> +
>> +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name)
>> +{
>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>> +    QEMUFile *f = writable ? qemu_fopen_channel_output(ioc) :
>> +                             qemu_fopen_channel_input(ioc);
>> +    qio_channel_set_name(ioc, name);
>> +    return f;
>> +}
>> diff --git a/migration/qemu-file-channel.h b/migration/qemu-file-channel.h
>> index 0028a09..75fd0ad 100644
>> --- a/migration/qemu-file-channel.h
>> +++ b/migration/qemu-file-channel.h
>> @@ -29,4 +29,10 @@
>>  
>>  QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc);
>>  QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc);
>> +
>> +QEMUFile *qemu_fopen_file(const char *path, int flags, int mode,
>> +                         const char *name, Error **errp);
>> +
>> +QEMUFile *qemu_fopen_fd(int fd, bool writable, const char *name);
> 
> Note we used the explicit names "_input" and "_output" in
> the existing methods as they're more readable in the calling
> sides than "true" / "false".
> 
> Similarly we had qemu_open vs qemu_create, so that we don't
> have the ambiguity of whether 'mode' is needed or not. IOW,
> I'd suggest we have 
> 
>  QEMUFile *qemu_fopen_file_output(const char *path, int mode,
>                                   const char *name, Error **errp);
>  QEMUFile *qemu_fopen_file_input(const char *path,
>                                   const char *name, Error **errp);
> 
>  QEMUFile *qemu_fopen_fd_input(int fd, const char *name);
>  QEMUFile *qemu_fopen_fd_output(int fd, const char *name);

Will do.  I do need the flags argument in the fopen_file calls, though.

- Steve


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V8 05/39] vl: start on wakeup request
  2022-06-16 15:55   ` Marc-André Lureau
@ 2022-07-05 18:26     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:26 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/16/2022 11:55 AM, Marc-André Lureau wrote:
> Hi
> 
> On Wed, Jun 15, 2022 at 7:27 PM Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>     If qemu starts and loads a VM in the suspended state, then a later wakeup
>     request will set the state to running, which is not sufficient to initialize
>     the vm, as vm_start was never called during this invocation of qemu.  See
>     qemu_system_wakeup_request().
> 
>     Define the start_on_wakeup_requested() hook to cause vm_start() to be called
>     when processing the wakeup request.
> 
> 
> Nothing calls qemu_system_start_on_wakeup_request() yet, so it would be useful to say where this is going to be used next.
> 
> (otherwise, it seems ok to me)

Will do, thanks - Steve

>     Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>     ---
>      include/sysemu/runstate.h |  1 +
>      softmmu/runstate.c        | 16 +++++++++++++++-
>      2 files changed, 16 insertions(+), 1 deletion(-)
> 
>     diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
>     index f3ed525..16c1c41 100644
>     --- a/include/sysemu/runstate.h
>     +++ b/include/sysemu/runstate.h
>     @@ -57,6 +57,7 @@ void qemu_system_reset_request(ShutdownCause reason);
>      void qemu_system_suspend_request(void);
>      void qemu_register_suspend_notifier(Notifier *notifier);
>      bool qemu_wakeup_suspend_enabled(void);
>     +void qemu_system_start_on_wakeup_request(void);
>      void qemu_system_wakeup_request(WakeupReason reason, Error **errp);
>      void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
>      void qemu_register_wakeup_notifier(Notifier *notifier);
>     diff --git a/softmmu/runstate.c b/softmmu/runstate.c
>     index fac7b63..9b27d74 100644
>     --- a/softmmu/runstate.c
>     +++ b/softmmu/runstate.c
>     @@ -115,6 +115,7 @@ static const RunStateTransition runstate_transitions_def[] = {
>          { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING },
>          { RUN_STATE_PRELAUNCH, RUN_STATE_FINISH_MIGRATE },
>          { RUN_STATE_PRELAUNCH, RUN_STATE_INMIGRATE },
>     +    { RUN_STATE_PRELAUNCH, RUN_STATE_SUSPENDED },
> 
>          { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
>          { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PAUSED },
>     @@ -335,6 +336,7 @@ void vm_state_notify(bool running, RunState state)
>          }
>      }
> 
>     +static bool start_on_wakeup_requested;
>      static ShutdownCause reset_requested;
>      static ShutdownCause shutdown_requested;
>      static int shutdown_signal;
>     @@ -562,6 +564,11 @@ void qemu_register_suspend_notifier(Notifier *notifier)
>          notifier_list_add(&suspend_notifiers, notifier);
>      }
> 
>     +void qemu_system_start_on_wakeup_request(void)
>     +{
>     +    start_on_wakeup_requested = true;
>     +}
>     +
>      void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
>      {
>          trace_system_wakeup_request(reason);
>     @@ -574,7 +581,14 @@ void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
>          if (!(wakeup_reason_mask & (1 << reason))) {
>              return;
>          }
>     -    runstate_set(RUN_STATE_RUNNING);
>     +
>     +    if (start_on_wakeup_requested) {
>     +        start_on_wakeup_requested = false;
>     +        vm_start();
>     +    } else {
>     +        runstate_set(RUN_STATE_RUNNING);
>     +    }
>     +
>          wakeup_reason = reason;
>          qemu_notify_event();
>      }
>     -- 
>     1.8.3.1
> 
> 
> 
> 
> -- 
> Marc-André Lureau


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V8 06/39] cpr: reboot mode
  2022-06-16 11:10   ` Daniel P. Berrangé
@ 2022-07-05 18:26     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:26 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/16/2022 7:10 AM, Daniel P. Berrangé wrote:
> On Wed, Jun 15, 2022 at 07:51:53AM -0700, Steve Sistare wrote:
>> Provide the cpr-save and cpr-load functions for live update.  These save and
>> restore VM state, with minimal guest pause time, so that qemu may be updated
>> to a new version in between.
>>
>> cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
>> any type of guest image and block device, but the caller must not modify
>> guest block devices between cpr-save and cpr-load.
>>
>> cpr-save supports several modes, the first of which is reboot. In this mode
>> the caller invokes cpr-save and then terminates qemu.  The caller may then
>> update the host kernel and system software and reboot.  The caller resumes
>> the guest by running qemu with the same arguments as the original process
>> and invoking cpr-load.  To use this mode, guest ram must be mapped to a
>> persistent shared memory file such as /dev/dax0.0 or /dev/shm PKRAM.
>>
>> The reboot mode supports vfio devices if the caller first suspends the
>> guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
>> guest drivers' suspend methods flush outstanding requests and re-initialize
>> the devices, and thus there is no device state to save and restore.
>>
>> cpr-load loads state from the file.  If the VM was running at cpr-save time
>> then VM execution resumes.  If the VM was suspended at cpr-save time, then
>> the caller must issue a system_wakeup command to resume.
>>
>> cpr-save syntax:
>>   { 'enum': 'CprMode', 'data': [ 'reboot' ] }
>>   { 'command': 'cpr-save', 'data': { 'filename': 'str', 'mode': 'CprMode' }}
>>
>> cpr-load syntax:
>>   { 'command': 'cpr-load', 'data': { 'filename': 'str', 'mode': 'CprMode' }}
> 
> I'm still a little unsure if this direction for QAPI exposure is the
> best, or whether we should instead leverage the migration commands.
> 
> I'm particularly concerned that we might regret having an API that
> is designed only around storage in local files/blockdevs. The
> migration layer has flexibility to use many protocols which has
> been useful in the past to be able to offload work to an external
> process. For example, libvirt uses migrate-to-fd so it can use
> a helper that adds O_DIRECT support such that we avoid trashing
> the host I/O cache for save/restore.
> 
> At the same time though, the migrate APIs don't currently support
> a plain "file" protocol. This was because historically we needed
> the QEMUFile to support O_NONBLOCK and this fails with plain
> files or block devices, so QEMU threads could get blocked. For
> the save side this doesn't matter so much, as QEMU now has the
> outgoing migrate channels in blocking mode, only the incoming
> side uses non-blocking.  We could add a plain "file" protocol
> to migration if we clearly document its limitations, and indeed
> I've suggested we do that for another unrelated bit of work
> for libvirt's VM save/restore functionality.

OK, I'll give it a try:
  - delete cpr-save, cpr-load, and cpr-exec
  - add file uri
  - add argv to MigrationParameters for the execv call.

- Steve


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V8 12/39] memory: flat section iterator
  2022-07-03  7:52   ` Peng Liang
@ 2022-07-05 18:26     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:26 UTC (permalink / raw)
  To: Peng Liang, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 7/3/2022 3:52 AM, Peng Liang wrote:
> On 6/15/2022 10:51 PM, Steve Sistare wrote:
>> Add an iterator over the sections of a flattened address space.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
>> ---
>>  include/exec/memory.h | 31 +++++++++++++++++++++++++++++++
>>  softmmu/memory.c      | 20 ++++++++++++++++++++
>>  2 files changed, 51 insertions(+)
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index a03301d..6a257a4 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -2343,6 +2343,37 @@ void memory_region_set_ram_discard_manager(MemoryRegion *mr,
>>                                             RamDiscardManager *rdm);
>>  
>>  /**
>> + * memory_region_section_cb: callback for address_space_flat_for_each_section()
>> + *
>> + * @mrs: MemoryRegionSection of the range
>> + * @opaque: data pointer passed to address_space_flat_for_each_section()
>> + * @errp: error message, returned to the address_space_flat_for_each_section
>> + *        caller.
>> + *
>> + * Returns: non-zero to stop the iteration, and 0 to continue.  The same
>> + * non-zero value is returned to the address_space_flat_for_each_section caller.
>> + */
>> +
>> +typedef int (*memory_region_section_cb)(MemoryRegionSection *mrs,
>> +                                        void *opaque,
>> +                                        Error **errp);
>> +
>> +/**
>> + * address_space_flat_for_each_section: walk the ranges in the address space
>> + * flat view and call @func for each.  Return 0 on success, else return non-zero
>> + * with a message in @errp.
>> + *
>> + * @as: target address space
>> + * @func: callback function
>> + * @opaque: passed to @func
>> + * @errp: passed to @func
>> + */
>> +int address_space_flat_for_each_section(AddressSpace *as,
>> +                                        memory_region_section_cb func,
>> +                                        void *opaque,
>> +                                        Error **errp);
>> +
>> +/**
>>   * memory_region_find: translate an address/size relative to a
>>   * MemoryRegion into a #MemoryRegionSection.
>>   *
>> diff --git a/softmmu/memory.c b/softmmu/memory.c
>> index 0fe6fac..e5aefdd 100644
>> --- a/softmmu/memory.c
>> +++ b/softmmu/memory.c
>> @@ -2683,6 +2683,26 @@ bool memory_region_is_mapped(MemoryRegion *mr)
>>      return !!mr->container || mr->mapped_via_alias;
>>  }
>>  
>> +int address_space_flat_for_each_section(AddressSpace *as,
>> +                                        memory_region_section_cb func,
>> +                                        void *opaque,
>> +                                        Error **errp)
>> +{
>> +    FlatView *view = address_space_get_flatview(as);
>> +    FlatRange *fr;
>> +    int ret;
>> +
>> +    FOR_EACH_FLAT_RANGE(fr, view) {
>> +        MemoryRegionSection mrs = section_from_flat_range(fr, view);
>> +        ret = func(&mrs, opaque, errp);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +
> 
> Hi Steve,
> I guess a flatview_unref(view) is missing here?  The view returned by
> address_space_get_flatview() already holds a reference taken via flatview_ref.

Yes!  Good catch, will fix, thanks - Steve
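
The shape of the fix can be sketched with a toy reference-counted view.
Everything below (ToyView, toy_view_get, toy_view_unref, toy_view_for_each)
is a stand-in invented for illustration, not the QEMU FlatView API.  The
point is only that the reference taken at the top is dropped on both the
early-exit path and the normal path.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a reference-counted flat view; not the QEMU type. */
typedef struct ToyView {
    int refcount;
    int nr;
    const int *ranges;
} ToyView;

typedef int (*toy_range_cb)(int range, void *opaque);

static ToyView *toy_view_get(ToyView *v)
{
    v->refcount++;
    return v;
}

static void toy_view_unref(ToyView *v)
{
    assert(v->refcount > 0);
    v->refcount--;
}

/* Walk the ranges; stop at the first non-zero callback result. */
static int toy_view_for_each(ToyView *shared, toy_range_cb func, void *opaque)
{
    ToyView *view = toy_view_get(shared);
    int ret = 0;

    for (int i = 0; i < view->nr; i++) {
        ret = func(view->ranges[i], opaque);
        if (ret) {
            break;              /* fall through to the single unref below */
        }
    }

    toy_view_unref(view);       /* runs on both the early and normal exit */
    return ret;
}

/* Example callback: count visits and stop with -1 on a negative range. */
static int toy_stop_on_negative(int range, void *opaque)
{
    int *visited = opaque;

    (*visited)++;
    return range < 0 ? -1 : 0;
}
```

Breaking out of the loop instead of returning from inside it keeps a single
unref site, which is one way to avoid missing the early-return path.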

>> +    return 0;
>> +}
>> +
>>  /* Same as memory_region_find, but it does not add a reference to the
>>   * returned region.  It must be called from an RCU critical section.
>>   */


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V8 13/39] oslib: qemu_clear_cloexec
  2022-06-16 16:07   ` Daniel P. Berrangé
@ 2022-07-05 18:27     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:27 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/16/2022 12:07 PM, Daniel P. Berrangé wrote:
> On Wed, Jun 15, 2022 at 07:52:00AM -0700, Steve Sistare wrote:
>> Define qemu_clear_cloexec, analogous to qemu_set_cloexec.
>>
>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  include/qemu/osdep.h | 1 +
>>  util/oslib-posix.c   | 9 +++++++++
>>  util/oslib-win32.c   | 4 ++++
>>  3 files changed, 14 insertions(+)
>>
>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> index b1c161c..e916f3b 100644
>> --- a/include/qemu/osdep.h
>> +++ b/include/qemu/osdep.h
>> @@ -548,6 +548,7 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count)
>>      G_GNUC_WARN_UNUSED_RESULT;
>>  
>>  void qemu_set_cloexec(int fd);
>> +void qemu_clear_cloexec(int fd);
> 
> I'm a little wary of adding this helper without any accompanying
> comment.
> 
> It is almost never correct to use this new method in a threaded
> program like QEMU, unless you have strong confidence that all
> the other threads are idle and not liable to perform a fork+exec
> for any other reason.
> 
> IIUC, this can be satisfied by the CPR code because it will be
> used only immediately before exec'ing the updated QEMU binary,
> and it has suspended any other CPUs and no other monitor
> commands are concurrently running.
> 
> IOW, I just ask that you put a comment with a big warning that
> essentially no one should use this method, except CPR code.
> 
> With regards,
> Daniel

Sure thing, will do - Steve
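
For context, clearing close-on-exec on POSIX is a read-modify-write of the
descriptor flags with fcntl().  The sketch below assumes POSIX <fcntl.h>;
the fd_* helper names are hypothetical and this is not the code from the
patch.  The warning above applies to it as well: in a threaded program a
concurrent fork+exec elsewhere would inherit the descriptor while the flag
is cleared.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Set FD_CLOEXEC on fd.  Returns 0 on success, -1 on error. */
static int fd_set_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);

    if (f == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFD, f | FD_CLOEXEC) == -1 ? -1 : 0;
}

/*
 * Clear FD_CLOEXEC on fd so it stays open across exec.
 * WARNING: only safe when no other thread can fork+exec concurrently.
 * Returns 0 on success, -1 on error.
 */
static int fd_clear_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);

    if (f == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFD, f & ~FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Returns 1 if FD_CLOEXEC is set, 0 if clear, -1 on error. */
static int fd_has_cloexec(int fd)
{
    int f = fcntl(fd, F_GETFD);

    return f == -1 ? -1 : !!(f & FD_CLOEXEC);
}
```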


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V8 14/39] qapi: strList_from_string
  2022-06-16 16:04   ` Marc-André Lureau
@ 2022-07-05 18:28     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:28 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/16/2022 12:04 PM, Marc-André Lureau wrote:
> Hi
> 
> On Wed, Jun 15, 2022 at 7:04 PM Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>     Generalize strList_from_comma_list() to take any delimiter character, rename
>     as strList_from_string(), and move it to qapi/util.c.
> 
>     No functional change.
> 
>     Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>     ---
>      include/qapi/util.h |  9 +++++++++
>      monitor/hmp-cmds.c  | 29 ++---------------------------
>      qapi/qapi-util.c    | 23 +++++++++++++++++++++++
>      3 files changed, 34 insertions(+), 27 deletions(-)
> 
>     diff --git a/include/qapi/util.h b/include/qapi/util.h
>     index 81a2b13..7d88b09 100644
>     --- a/include/qapi/util.h
>     +++ b/include/qapi/util.h
>     @@ -22,6 +22,8 @@ typedef struct QEnumLookup {
>          const int size;
>      } QEnumLookup;
> 
>     +struct strList;
>     +
> 
> 
> suspicious, you can't include qapi/qapi-builtin-types.h here?

Nope.  qapi-builtin-types.h includes util.h because it needs QEnumLookup.
See the code generation in types.py:
          self._genh.preamble_add(mcgen('''
#include "qapi/util.h"

- Steve
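
For readers without a QEMU tree at hand, the splitting loop in the quoted
patch can be approximated in plain C.  This is a sketch under assumptions:
StrNode, str_list_from_string and str_list_free are illustrative names, and
malloc replaces the GLib allocators and QAPI_LIST_APPEND used by the real
code.  It keeps the documented behaviour that a NULL or empty input yields
NULL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative singly linked list of strings; not the QAPI strList. */
typedef struct StrNode {
    struct StrNode *next;
    char *value;
} StrNode;

/* Copy len bytes of s into a new NUL-terminated string. */
static char *dup_range(const char *s, size_t len)
{
    char *p = malloc(len + 1);

    if (!p) {
        abort();
    }
    memcpy(p, s, len);
    p[len] = '\0';
    return p;
}

/* Split @in at each @delim.  A NULL or empty input returns NULL. */
static StrNode *str_list_from_string(const char *in, char delim)
{
    StrNode *res = NULL;
    StrNode **tail = &res;

    assert(delim != '\0');      /* a NUL delimiter would run past the end */

    while (in && in[0]) {
        const char *next = strchr(in, delim);
        size_t len = next ? (size_t)(next - in) : strlen(in);
        StrNode *node = malloc(sizeof(*node));

        if (!node) {
            abort();
        }
        node->value = dup_range(in, len);
        node->next = NULL;
        *tail = node;           /* append at the tail to preserve order */
        tail = &node->next;
        in = next ? next + 1 : NULL;   /* skip the delimiter */
    }
    return res;
}

static void str_list_free(StrNode *list)
{
    while (list) {
        StrNode *next = list->next;

        free(list->value);
        free(list);
        list = next;
    }
}
```

With this loop an empty field between two delimiters is kept as an empty
string, while a trailing delimiter does not produce a final empty element.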

>      const char *qapi_enum_lookup(const QEnumLookup *lookup, int val);
>      int qapi_enum_parse(const QEnumLookup *lookup, const char *buf,
>                          int def, Error **errp);
>     @@ -31,6 +33,13 @@ bool qapi_bool_parse(const char *name, const char *value, bool *obj,
>      int parse_qapi_name(const char *name, bool complete);
> 
>      /*
>     + * Produce a strList from the character delimited string @in.
>     + * All strings are g_strdup'd.
>     + * A NULL or empty input string returns NULL.
>     + */
>     +struct strList *strList_from_string(const char *in, char delim);
>     +
>     +/*
>       * For any GenericList @list, insert @element at the front.
>       *
>       * Note that this macro evaluates @element exactly once, so it is safe
>     diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
>     index bb12589..9f58b1f 100644
>     --- a/monitor/hmp-cmds.c
>     +++ b/monitor/hmp-cmds.c
>     @@ -43,6 +43,7 @@
>      #include "qapi/qapi-commands-run-state.h"
>      #include "qapi/qapi-commands-tpm.h"
>      #include "qapi/qapi-commands-ui.h"
>     +#include "qapi/util.h"
>      #include "qapi/qapi-visit-net.h"
>      #include "qapi/qapi-visit-migration.h"
>      #include "qapi/qmp/qdict.h"
>     @@ -70,32 +71,6 @@ bool hmp_handle_error(Monitor *mon, Error *err)
>          return false;
>      }
> 
>     -/*
>     - * Produce a strList from a comma separated list.
>     - * A NULL or empty input string return NULL.
>     - */
>     -static strList *strList_from_comma_list(const char *in)
>     -{
>     -    strList *res = NULL;
>     -    strList **tail = &res;
>     -
>     -    while (in && in[0]) {
>     -        char *comma = strchr(in, ',');
>     -        char *value;
>     -
>     -        if (comma) {
>     -            value = g_strndup(in, comma - in);
>     -            in = comma + 1; /* skip the , */
>     -        } else {
>     -            value = g_strdup(in);
>     -            in = NULL;
>     -        }
>     -        QAPI_LIST_APPEND(tail, value);
>     -    }
>     -
>     -    return res;
>     -}
>     -
>      void hmp_info_name(Monitor *mon, const QDict *qdict)
>      {
>          NameInfo *info;
>     @@ -1115,7 +1090,7 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
>                                                  migrate_announce_params());
> 
>          qapi_free_strList(params->interfaces);
>     -    params->interfaces = strList_from_comma_list(interfaces_str);
>     +    params->interfaces = strList_from_string(interfaces_str, ',');
>          params->has_interfaces = params->interfaces != NULL;
>          params->id = g_strdup(id);
>          params->has_id = !!params->id;
>     diff --git a/qapi/qapi-util.c b/qapi/qapi-util.c
>     index 63596e1..b61c73c 100644
>     --- a/qapi/qapi-util.c
>     +++ b/qapi/qapi-util.c
>     @@ -15,6 +15,7 @@
>      #include "qapi/error.h"
>      #include "qemu/ctype.h"
>      #include "qapi/qmp/qerror.h"
>     +#include "qapi/qapi-builtin-types.h"
> 
>      CompatPolicy compat_policy;
> 
>     @@ -152,3 +153,25 @@ int parse_qapi_name(const char *str, bool complete)
>          }
>          return p - str;
>      }
>     +
>     +strList *strList_from_string(const char *in, char delim)
>     +{
>     +    strList *res = NULL;
>     +    strList **tail = &res;
>     +
>     +    while (in && in[0]) {
>     +        char *next = strchr(in, delim);
>     +        char *value;
>     +
>     +        if (next) {
>     +            value = g_strndup(in, next - in);
>     +            in = next + 1; /* skip the delim */
>     +        } else {
>     +            value = g_strdup(in);
>     +            in = NULL;
>     +        }
>     +        QAPI_LIST_APPEND(tail, value);
>     +    }
>     +
>     +    return res;
>     +}
>     -- 
>     1.8.3.1
> 
> 
> 
> 
> -- 
> Marc-André Lureau


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V8 16/39] qapi: strv_from_strList
  2022-06-16 16:08   ` Marc-André Lureau
@ 2022-07-05 18:28     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:28 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: QEMU, Paolo Bonzini, Stefan Hajnoczi, Alex Bennée,
	Dr. David Alan Gilbert, Michael S. Tsirkin, Marcel Apfelbaum,
	Alex Williamson, Daniel P. Berrange, Juan Quintela,
	Markus Armbruster, Eric Blake, Jason Zeng, Zheng Chuan,
	Mark Kanda, Guoyi Tu, Peter Maydell, Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/16/2022 12:08 PM, Marc-André Lureau wrote:
> Hi
> 
> On Wed, Jun 15, 2022 at 7:30 PM Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>     Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>     ---
>      include/qapi/util.h |  6 ++++++
>      qapi/qapi-util.c    | 14 ++++++++++++++
>      2 files changed, 20 insertions(+)
> 
>     diff --git a/include/qapi/util.h b/include/qapi/util.h
>     index 75dddca..51ff64e 100644
>     --- a/include/qapi/util.h
>     +++ b/include/qapi/util.h
>     @@ -33,6 +33,12 @@ bool qapi_bool_parse(const char *name, const char *value, bool *obj,
>      int parse_qapi_name(const char *name, bool complete);
> 
>      /*
>     + * Produce and return a NULL-terminated array of strings from @args.
>     + * All strings are g_strdup'd.
>     + */
>     +GStrv strv_from_strList(const struct strList *args);
>     +
>     +/*
>       * Produce a strList from the character delimited string @in.
>       * All strings are g_strdup'd.
>       * A NULL or empty input string returns NULL.
>     diff --git a/qapi/qapi-util.c b/qapi/qapi-util.c
>     index b61c73c..8c96cab 100644
>     --- a/qapi/qapi-util.c
>     +++ b/qapi/qapi-util.c
>     @@ -154,6 +154,20 @@ int parse_qapi_name(const char *str, bool complete)
>          return p - str;
>      }
> 
>     +GStrv strv_from_strList(const strList *args)
>     +{
>     +    const strList *arg;
>     +    int i = 0;
>     +    GStrv argv = g_malloc((QAPI_LIST_LENGTH(args) + 1) * sizeof(char *));
>     +
> 
> 
> Better use g_new() here. Otherwise:
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>

Will do, thanks - Steve

>     +    for (arg = args; arg != NULL; arg = arg->next) {
>     +        argv[i++] = g_strdup(arg->value);
>     +    }
>     +    argv[i] = NULL;
>     +
>     +    return argv;
>     +}
>     +
>      strList *strList_from_string(const char *in, char delim)
>      {
>          strList *res = NULL;
>     -- 
>     1.8.3.1
> 
> 
> 
> 
> -- 
> Marc-André Lureau


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH V8 20/39] cpr: restart mode
  2022-07-03  8:15   ` Peng Liang
@ 2022-07-05 18:29     ` Steven Sistare
  2022-07-06  0:15       ` Peng Liang
  0 siblings, 1 reply; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:29 UTC (permalink / raw)
  To: Peng Liang, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 7/3/2022 4:15 AM, Peng Liang wrote:
> On 6/15/2022 10:52 PM, Steve Sistare wrote:
>> Provide the cpr-save restart mode, which preserves the guest VM across a
>> restart of the qemu process.  After cpr-save, the caller passes qemu
>> command-line arguments to cpr-exec, which directly exec's the new qemu
>> binary.  The arguments must include -S so new qemu starts in a paused state.
>> The caller resumes the guest by calling cpr-load.
>>
>> To use the restart mode, guest RAM must be backed by a memory-backend-file
>> with share=on.  The '-cpr-enable restart' option causes secondary guest
>> ram blocks (those not specified on the command line) to be allocated by
>> mmap'ing a memfd.  The memfd values are saved in special cpr state, and the
>> memfds are kept open across exec, after which the values are retrieved and
>> the memfds are re-mmap'd.  Hence guest RAM is preserved in place, albeit
>> with new virtual addresses in the qemu process.
>>
>> The restart mode supports vfio devices and memory-backend-memfd in
>> subsequent patches.
>>
>> cpr-exec syntax:
>>   { 'command': 'cpr-exec', 'data': { 'argv': [ 'str' ] } }
>>
>> Add the restart mode:
>>   { 'enum': 'CprMode', 'data': [ 'reboot', 'restart' ] }
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  migration/cpr.c   | 35 +++++++++++++++++++++++++++++++++++
>>  qapi/cpr.json     | 26 +++++++++++++++++++++++++-
>>  qemu-options.hx   |  2 +-
>>  softmmu/physmem.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
>>  trace-events      |  1 +
>>  5 files changed, 107 insertions(+), 3 deletions(-)
>>
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index 1cc8738..8b3fffd 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -22,6 +22,7 @@ static int cpr_enabled_modes;
>>  void cpr_init(int modes)
>>  {
>>      cpr_enabled_modes = modes;
>> +    cpr_state_load(&error_fatal);
>>  }
>>  
>>  bool cpr_enabled(CprMode mode)
>> @@ -153,6 +154,37 @@ err:
>>      cpr_set_mode(CPR_MODE_NONE);
>>  }
>>  
>> +static int preserve_fd(const char *name, int id, int fd, void *opaque)
>> +{
>> +    qemu_clear_cloexec(fd);
>> +    return 0;
>> +}
>> +
>> +static int unpreserve_fd(const char *name, int id, int fd, void *opaque)
>> +{
>> +    qemu_set_cloexec(fd);
>> +    return 0;
>> +}
>> +
>> +void qmp_cpr_exec(strList *args, Error **errp)
>> +{
>> +    if (!runstate_check(RUN_STATE_SAVE_VM)) {
>> +        error_setg(errp, "runstate is not save-vm");
>> +        return;
>> +    }
>> +    if (cpr_get_mode() != CPR_MODE_RESTART) {
>> +        error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>> +        return;
>> +    }
>> +
>> +    cpr_walk_fd(preserve_fd, 0);
>> +    if (cpr_state_save(errp)) {
>> +        return;
>> +    }
>> +
>> +    assert(qemu_system_exec_request(args, errp) == 0);
>> +}
>> +
>>  void qmp_cpr_load(const char *filename, CprMode mode, Error **errp)
>>  {
>>      QEMUFile *f;
>> @@ -189,6 +221,9 @@ void qmp_cpr_load(const char *filename, CprMode mode, Error **errp)
>>          goto out;
>>      }
>>  
>> +    /* Set cloexec to prevent fd leaks until the next cpr-save */
>> +    cpr_walk_fd(unpreserve_fd, 0);
>> +
>>      state = global_state_get_runstate();
>>      if (state == RUN_STATE_RUNNING) {
>>          vm_start();
>> diff --git a/qapi/cpr.json b/qapi/cpr.json
>> index 11c6f88..47ee4ff 100644
>> --- a/qapi/cpr.json
>> +++ b/qapi/cpr.json
>> @@ -15,11 +15,12 @@
>>  # @CprMode:
>>  #
>>  # @reboot: checkpoint can be cpr-load'ed after a host reboot.
>> +# @restart: checkpoint can be cpr-load'ed after restarting qemu.
>>  #
>>  # Since: 7.1
>>  ##
>>  { 'enum': 'CprMode',
>> -  'data': [ 'none', 'reboot' ] }
>> +  'data': [ 'none', 'reboot', 'restart' ] }
>>  
>>  ##
>>  # @cpr-save:
>> @@ -38,6 +39,11 @@
>>  # issue the quit command, reboot the system, start qemu using the same
>>  # arguments plus -S, and issue the cpr-load command.
>>  #
>> +# If @mode is 'restart', the checkpoint remains valid after restarting
>> +# qemu using a subsequent cpr-exec.  Guest RAM must be backed by a
>> +# memory-backend-file with share=on.
>> +# To resume from the checkpoint, issue the cpr-load command.
>> +#
>>  # @filename: name of checkpoint file
>>  # @mode: @CprMode mode
>>  #
>> @@ -48,6 +54,24 @@
>>              'mode': 'CprMode' } }
>>  
>>  ##
>> +# @cpr-exec:
>> +#
>> +# Restart qemu by directly exec'ing @argv[0], replacing the qemu process.
>> +# The PID remains the same.  Must be called after cpr-save restart.
>> +#
>> +# @argv[0] should be the path of a new qemu binary, or a prefix command that
>> +# in turn exec's the new qemu binary.  The arguments must match those used
>> +# to initially start qemu, plus the -S option so new qemu starts in a paused
>> +# state.
>> +#
>> +# @argv: arguments to be passed to exec().
>> +#
>> +# Since: 7.1
>> +##
>> +{ 'command': 'cpr-exec',
>> +  'data': { 'argv': [ 'str' ] } }
>> +
>> +##
>>  # @cpr-load:
>>  #
>>  # Load a virtual machine from the checkpoint file @filename that was created
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 6e51c33..1b49360 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -4484,7 +4484,7 @@ SRST
>>  ERST
>>  
>>  DEF("cpr-enable", HAS_ARG, QEMU_OPTION_cpr_enable, \
>> -    "-cpr-enable reboot    enable the cpr mode\n",
>> +    "-cpr-enable reboot|restart    enable the cpr mode\n",
>>      QEMU_ARCH_ALL)
>>  SRST
>>  ``-cpr-enable reboot``
>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>> index 822c424..412cc80 100644
>> --- a/softmmu/physmem.c
>> +++ b/softmmu/physmem.c
>> @@ -44,6 +44,7 @@
>>  #include "qemu/qemu-print.h"
>>  #include "qemu/log.h"
>>  #include "qemu/memalign.h"
>> +#include "qemu/memfd.h"
>>  #include "exec/memory.h"
>>  #include "exec/ioport.h"
>>  #include "sysemu/dma.h"
>> @@ -1962,6 +1963,40 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
>>      }
>>  }
>>  
>> +static bool memory_region_is_backend(MemoryRegion *mr)
>> +{
>> +    return !!object_dynamic_cast(mr->parent_obj.parent, TYPE_MEMORY_BACKEND);
>> +}
> 
> Maybe or mr->owner is more readable?

Maybe OBJECT(mr)->parent.

mr->owner is not always the same as mr->parent_obj.parent.

- Steve

>> +
>> +static void *qemu_anon_memfd_alloc(RAMBlock *rb, size_t maxlen, Error **errp)
>> +{
>> +    size_t len, align;
>> +    void *addr;
>> +    struct MemoryRegion *mr = rb->mr;
>> +    const char *name = memory_region_name(mr);
>> +    int mfd = cpr_find_memfd(name, &len, &maxlen, &align);
>> +
>> +    if (mfd >= 0) {
>> +        rb->used_length = len;
>> +        rb->max_length = maxlen;
>> +        mr->align = align;
>> +    } else {
>> +        len = rb->used_length;
>> +        maxlen = rb->max_length;
>> +        mr->align = QEMU_VMALLOC_ALIGN;
>> +        mfd = qemu_memfd_create(name, maxlen + mr->align, 0, 0, 0, errp);
>> +        if (mfd < 0) {
>> +            return NULL;
>> +        }
>> +        cpr_save_memfd(name, mfd, len, maxlen, mr->align);
>> +    }
>> +    rb->flags |= RAM_SHARED;
>> +    qemu_set_cloexec(mfd);
>> +    addr = file_ram_alloc(rb, maxlen, mfd, false, false, 0, errp);
>> +    trace_anon_memfd_alloc(name, maxlen, addr, mfd);
>> +    return addr;
>> +}
>> +
>>  static void ram_block_add(RAMBlock *new_block, Error **errp)
>>  {
>>      const bool noreserve = qemu_ram_is_noreserve(new_block);
>> @@ -1986,6 +2021,14 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>                  qemu_mutex_unlock_ramlist();
>>                  return;
>>              }
>> +        } else if (cpr_enabled(CPR_MODE_RESTART) &&
>> +                   !memory_region_is_backend(new_block->mr)) {
>> +            new_block->host = qemu_anon_memfd_alloc(new_block,
>> +                                                    new_block->max_length,
>> +                                                    errp);
>> +            if (!new_block->host) {
>> +                return;
>> +            }
>>          } else {
>>              new_block->host = qemu_anon_ram_alloc(new_block->max_length,
>>                                                    &new_block->mr->align,
>> @@ -1997,8 +2040,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>                  qemu_mutex_unlock_ramlist();
>>                  return;
>>              }
>> -            memory_try_enable_merging(new_block->host, new_block->max_length);
>>          }
>> +        memory_try_enable_merging(new_block->host, new_block->max_length);
>>      }
>>  
>>      new_ram_size = MAX(old_ram_size,
>> @@ -2231,6 +2274,7 @@ void qemu_ram_free(RAMBlock *block)
>>      }
>>  
>>      qemu_mutex_lock_ramlist();
>> +    cpr_delete_memfd(memory_region_name(block->mr));
>>      QLIST_REMOVE_RCU(block, next);
>>      ram_list.mru_block = NULL;
>>      /* Write list before version */
>> diff --git a/trace-events b/trace-events
>> index bc71006..07369bb 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
>>  # accel/tcg/cputlb.c
>>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
>>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
>> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
>>  
>>  # gdbstub.c
>>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
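
For readers following the fd handling in the quoted patch: preserve_fd() and unpreserve_fd() toggle the close-on-exec flag so that the saved descriptors survive cpr-exec but do not leak into unrelated fork/exec calls afterwards. Below is a minimal sketch of that mechanism, assuming qemu_set_cloexec() and qemu_clear_cloexec() boil down to fcntl() on FD_CLOEXEC. The sketch_* names are hypothetical and are not QEMU functions.

```c
#include <fcntl.h>

/* Hypothetical stand-in for qemu_set_cloexec(): close fd on exec. */
int sketch_set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Hypothetical stand-in for qemu_clear_cloexec(): keep fd across exec. */
int sketch_clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Returns 1 if fd would be inherited by an exec'd program, 0 if not,
 * and -1 if the flags cannot be read. */
int sketch_fd_survives_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return (flags & FD_CLOEXEC) ? 0 : 1;
}
```

In terms of the patch, cpr-exec corresponds to calling the "clear" helper on every saved fd just before exec, and cpr-load corresponds to calling the "set" helper on each of them again once the new process has picked them up.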



* Re: [PATCH V8 24/39] pci: export export msix_is_pending
  2022-06-27 22:44   ` Michael S. Tsirkin
@ 2022-07-05 18:29     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/27/2022 6:44 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 15, 2022 at 07:52:11AM -0700, Steve Sistare wrote:
>> Export msix_is_pending for use by cpr.  No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> the subject repeats export twice.
> With that fixed:
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>

Will fix, thanks!

- Steve

>> ---
>>  hw/pci/msix.c         | 2 +-
>>  include/hw/pci/msix.h | 1 +
>>  2 files changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
>> index ae9331c..e492ce0 100644
>> --- a/hw/pci/msix.c
>> +++ b/hw/pci/msix.c
>> @@ -64,7 +64,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
>>      return dev->msix_pba + vector / 8;
>>  }
>>  
>> -static int msix_is_pending(PCIDevice *dev, int vector)
>> +int msix_is_pending(PCIDevice *dev, unsigned int vector)
>>  {
>>      return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
>>  }
>> diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
>> index 4c4a60c..0065354 100644
>> --- a/include/hw/pci/msix.h
>> +++ b/include/hw/pci/msix.h
>> @@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
>>  bool msix_is_masked(PCIDevice *dev, unsigned vector);
>>  void msix_set_pending(PCIDevice *dev, unsigned vector);
>>  void msix_clr_pending(PCIDevice *dev, int vector);
>> +int msix_is_pending(PCIDevice *dev, unsigned vector);
>>  
>>  int msix_vector_use(PCIDevice *dev, unsigned vector);
>>  void msix_vector_unuse(PCIDevice *dev, unsigned vector);
>> -- 
>> 1.8.3.1
> 
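
As a side note on what the exported function computes: msix_is_pending() tests a single bit of the MSI-X Pending Bit Array, one bit per vector, eight vectors per byte. Here is a minimal sketch on a plain byte array, assuming the mask helper (msix_pending_mask, not shown in the quoted diff) is 1 << (vector % 8). The sketch_* names are hypothetical and operate on a bare buffer rather than a PCIDevice.

```c
#include <stdint.h>

/* Byte of the pending bit array that holds the bit for this vector. */
uint8_t *sketch_pending_byte(uint8_t *pba, unsigned int vector)
{
    return pba + vector / 8;
}

/* Bit within that byte (assumed layout: vector % 8, LSB first). */
uint8_t sketch_pending_mask(unsigned int vector)
{
    return (uint8_t)(1u << (vector % 8));
}

/* Nonzero if the vector's pending bit is set. */
int sketch_is_pending(uint8_t *pba, unsigned int vector)
{
    return *sketch_pending_byte(pba, vector) & sketch_pending_mask(vector);
}

/* Mark the vector as pending. */
void sketch_set_pending(uint8_t *pba, unsigned int vector)
{
    *sketch_pending_byte(pba, vector) |= sketch_pending_mask(vector);
}
```

With this layout, vector 10 lands in byte 1, bit 2 of the array.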



* Re: [PATCH V8 27/39] vfio-pci: cpr part 1 (fd and dma)
  2022-07-03  8:32   ` Peng Liang
@ 2022-07-05 18:29     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:29 UTC (permalink / raw)
  To: Peng Liang, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 7/3/2022 4:32 AM, Peng Liang wrote:
> On 6/15/2022 10:52 PM, Steve Sistare wrote:
[...]  
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
[...]
>> +static int vfio_container_pre_save(void *opaque)
>> +{
>> +    VFIOContainer *container = (VFIOContainer *)opaque;
>> +    Error *err;
> 
> According to the description of error_setg, local Error variables should be
> initialized to NULL. The following coccinelle script from Markus should be helpful
> to auto fix the problem :) :

Thanks!  I will fix this and the other instances in my code.

- Steve

> 
> @ r @
> identifier id;
> @@
> (
>   static Error *id;
> |
>   Error *id
> + = NULL
>   ;
> )
> 
[...]
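
To illustrate why the initialization matters: as far as I understand QEMU's error API, error_setg() refuses to overwrite an error that is already set, so a local "Error *err;" holding stack garbage can trip that check when its address is passed down as errp. The following is a toy model of that rule, not QEMU code. SketchError and all sketch_* functions are hypothetical stand-ins for Error, error_setg() and error_free().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct SketchError {
    char *msg;
} SketchError;

/* Toy stand-in for error_setg(): never overwrites an existing error. */
void sketch_error_set(SketchError **errp, const char *msg)
{
    SketchError *err;
    size_t n = strlen(msg) + 1;

    if (!errp) {
        return;                 /* caller does not want error details */
    }
    assert(*errp == NULL);      /* an uninitialized local can fail here */

    err = malloc(sizeof(*err));
    if (!err) {
        abort();
    }
    err->msg = malloc(n);
    if (!err->msg) {
        abort();
    }
    memcpy(err->msg, msg, n);
    *errp = err;
}

void sketch_error_free(SketchError *err)
{
    if (err) {
        free(err->msg);
        free(err);
    }
}

/* Hypothetical callee that reports failure through errp. */
int sketch_step(int fail, SketchError **errp)
{
    if (fail) {
        sketch_error_set(errp, "step failed");
        return -1;
    }
    return 0;
}

/* Caller in the style of a pre_save handler. */
int sketch_pre_save(int fail)
{
    SketchError *err = NULL;    /* the initialization asked for in review */

    if (sketch_step(fail, &err) < 0) {
        sketch_error_free(err);
        return -1;
    }
    return 0;
}
```

Dropping the "= NULL" in sketch_pre_save() would make the assert in sketch_error_set() depend on whatever happened to be on the stack.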



* Re: [PATCH V8 36/39] chardev: cpr for sockets
  2022-07-03  8:19   ` Peng Liang
@ 2022-07-05 18:29     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:29 UTC (permalink / raw)
  To: Peng Liang, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 7/3/2022 4:19 AM, Peng Liang wrote:
> On 6/15/2022 10:52 PM, Steve Sistare wrote:
>> Save accepted socket fds before cpr-save, and look for them after cpr-load.
>> Block cpr-exec if a socket enables the TLS or websocket option.  Allow a
>> monitor socket by closing it on exec.
>>
>> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  chardev/char-socket.c         | 45 +++++++++++++++++++++++++++++++++++++++++++
>>  include/chardev/char-socket.h |  1 +
>>  monitor/hmp.c                 |  3 +++
>>  monitor/qmp.c                 |  3 +++
>>  4 files changed, 52 insertions(+)
>>
>> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
>> index dc4e218..3a1e36b 100644
>> --- a/chardev/char-socket.c
>> +++ b/chardev/char-socket.c
>> @@ -26,6 +26,7 @@
>>  #include "chardev/char.h"
>>  #include "io/channel-socket.h"
>>  #include "io/channel-websock.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/module.h"
>>  #include "qemu/option.h"
>> @@ -33,6 +34,7 @@
>>  #include "qapi/clone-visitor.h"
>>  #include "qapi/qapi-visit-sockets.h"
>>  #include "qemu/yank.h"
>> +#include "sysemu/sysemu.h"
>>  
>>  #include "chardev/char-io.h"
>>  #include "chardev/char-socket.h"
>> @@ -358,6 +360,11 @@ static void tcp_chr_free_connection(Chardev *chr)
>>      SocketChardev *s = SOCKET_CHARDEV(chr);
>>      int i;
>>  
>> +    if (chr->cpr_enabled) {
>> +        cpr_delete_fd(chr->label, 0);
>> +    }
>> +    cpr_del_blocker(&s->cpr_blocker);
>> +
>>      if (s->read_msgfds_num) {
>>          for (i = 0; i < s->read_msgfds_num; i++) {
>>              close(s->read_msgfds[i]);
>> @@ -923,6 +930,10 @@ static void tcp_chr_accept(QIONetListener *listener,
>>                                 QIO_CHANNEL(cioc));
>>      }
>>      tcp_chr_new_client(chr, cioc);
>> +
>> +    if (s->sioc && chr->cpr_enabled) {
>> +        cpr_resave_fd(chr->label, 0, s->sioc->fd, NULL);
>> +    }
>>  }
>>  
>>  
>> @@ -1178,6 +1189,26 @@ static gboolean socket_reconnect_timeout(gpointer opaque)
>>      return false;
>>  }
>>  
>> +static int load_char_socket_fd(Chardev *chr, Error **errp)
>> +{
>> +    SocketChardev *sockchar = SOCKET_CHARDEV(chr);
>> +    QIOChannelSocket *sioc;
>> +    const char *label = chr->label;
>> +    int fd = cpr_find_fd(label, 0);
>> +
>> +    if (fd != -1) {
>> +        sockchar = SOCKET_CHARDEV(chr);
>> +        sioc = qio_channel_socket_new_fd(fd, errp);
>> +        if (sioc) {
>> +            tcp_chr_accept(sockchar->listener, sioc, chr);
>> +            object_unref(OBJECT(sioc));
>> +        } else {
>> +            error_setg(errp, "could not restore socket for %s", label);
> 
> If we go here, then qio_channel_socket_new_fd fails and errp should be set. So I think
> error_prepend is more appropriate here.

Good suggestion, will do, thanks - Steve

[...]
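
The point of the suggestion, as I read it: when qio_channel_socket_new_fd() fails it has already filled in errp, and a second error must not be set on top of it, so the socket-specific context should be added in front of the existing message instead. A toy model of that "prepend" behaviour follows. SketchError and the sketch_* functions are hypothetical, and unlike QEMU's error_prepend() this version takes a plain string rather than a format string.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct SketchError {
    char *msg;
} SketchError;

/* Allocate an error object carrying a copy of msg. */
SketchError *sketch_error_new(const char *msg)
{
    SketchError *err = malloc(sizeof(*err));
    size_t n = strlen(msg) + 1;

    if (!err) {
        abort();
    }
    err->msg = malloc(n);
    if (!err->msg) {
        abort();
    }
    memcpy(err->msg, msg, n);
    return err;
}

/* Put prefix in front of the message already stored in err. */
void sketch_error_prepend(SketchError *err, const char *prefix)
{
    size_t len = strlen(prefix) + strlen(err->msg) + 1;
    char *joined = malloc(len);

    if (!joined) {
        abort();
    }
    snprintf(joined, len, "%s%s", prefix, err->msg);
    free(err->msg);
    err->msg = joined;
}

void sketch_error_free(SketchError *err)
{
    if (err) {
        free(err->msg);
        free(err);
    }
}
```

The caller then reports both the high-level context ("could not restore socket for <label>") and the low-level cause in a single message.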



* Re: [PATCH V8 38/39] python/machine: add QEMUMachine accessors
  2022-06-17 14:16   ` John Snow
@ 2022-07-05 18:30     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-05 18:30 UTC (permalink / raw)
  To: John Snow
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Alex Williamson,
	Daniel P. Berrange, Juan Quintela, Markus Armbruster, Eric Blake,
	Jason Zeng, Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand

On 6/17/2022 10:16 AM, John Snow wrote:
> On Wed, Jun 15, 2022, 11:27 AM Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>     Provide full_args() to return all command-line arguments used to start a
>     vm, some of which are not otherwise visible to QEMUMachine clients.  This
>     is needed by the cpr test, which must start a vm, then pass all qemu
>     command-line arguments to the cpr-exec monitor call.
> 
>     Provide reopen_qmp_connection() to reopen a closed monitor connection.
>     This is needed by cpr, because cpr-exec closes the monitor socket.
> 
>     Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>     ---
>      python/qemu/machine/machine.py | 14 ++++++++++++++
>      1 file changed, 14 insertions(+)
> 
>     diff --git a/python/qemu/machine/machine.py b/python/qemu/machine/machine.py
>     index 37191f4..60b934d 100644
>     --- a/python/qemu/machine/machine.py
>     +++ b/python/qemu/machine/machine.py
>     @@ -332,6 +332,11 @@ def args(self) -> List[str]:
>              """Returns the list of arguments given to the QEMU binary."""
>              return self._args
> 
>     +    @property
>     +    def full_args(self) -> List[str]:
>     +        """Returns the full list of arguments used to launch QEMU."""
>     +        return list(self._qemu_full_args)
>     +
> 
> 
> OK
> 
>          def _pre_launch(self) -> None:
>              if self._console_set:
>                  self._remove_files.append(self._console_address)
>     @@ -486,6 +491,15 @@ def _close_qmp_connection(self) -> None:
>              finally:
>                  self._qmp_connection = None
> 
>     +    def reopen_qmp_connection(self):
>     +        self._close_qmp_connection()
>     +        self._qmp_connection = QEMUMonitorProtocol(
>     +            self._monitor_address,
>     +            server=True,
>     +            nickname=self._name
>     +        )
>     +        self._qmp.accept(self._qmp_timer)
>     +
> 
> 
> Unrelated change, please split into a new commit. (Sorry.)
> 
> Seems harmless enough, though. Happy to give RB and AB to both if you split the commits.

Cool.  Will do, thanks.

- Steve



* Re: [PATCH V8 20/39] cpr: restart mode
  2022-07-05 18:29     ` Steven Sistare
@ 2022-07-06  0:15       ` Peng Liang
  0 siblings, 0 replies; 84+ messages in thread
From: Peng Liang @ 2022-07-06  0:15 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée, Dr. David Alan Gilbert, Michael S. Tsirkin,
	Marcel Apfelbaum, Alex Williamson, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow



On 7/6/2022 2:29 AM, Steven Sistare wrote:
> On 7/3/2022 4:15 AM, Peng Liang wrote:
>> On 6/15/2022 10:52 PM, Steve Sistare wrote:
>>> Provide the cpr-save restart mode, which preserves the guest VM across a
>>> restart of the qemu process.  After cpr-save, the caller passes qemu
>>> command-line arguments to cpr-exec, which directly exec's the new qemu
>>> binary.  The arguments must include -S so new qemu starts in a paused state.
>>> The caller resumes the guest by calling cpr-load.
>>>
>>> To use the restart mode, guest RAM must be backed by a memory-backend-file
>>> with share=on.  The '-cpr-enable restart' option causes secondary guest
>>> ram blocks (those not specified on the command line) to be allocated by
>>> mmap'ing a memfd.  The memfd values are saved in special cpr state which
>>> is retrieved after exec, and are kept open across exec, after which they
>>> are retrieved and re-mmap'd.  Hence guest RAM is preserved in place, albeit
>>> with new virtual addresses in the qemu process.
>>>
>>> The restart mode supports vfio devices and memory-backend-memfd in
>>> subsequent patches.
>>>
>>> cpr-exec syntax:
>>>   { 'command': 'cpr-exec', 'data': { 'argv': [ 'str' ] } }
>>>
>>> Add the restart mode:
>>>   { 'enum': 'CprMode', 'data': [ 'reboot', 'restart' ] }
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>  migration/cpr.c   | 35 +++++++++++++++++++++++++++++++++++
>>>  qapi/cpr.json     | 26 +++++++++++++++++++++++++-
>>>  qemu-options.hx   |  2 +-
>>>  softmmu/physmem.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
>>>  trace-events      |  1 +
>>>  5 files changed, 107 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>> index 1cc8738..8b3fffd 100644
>>> --- a/migration/cpr.c
>>> +++ b/migration/cpr.c
>>> @@ -22,6 +22,7 @@ static int cpr_enabled_modes;
>>>  void cpr_init(int modes)
>>>  {
>>>      cpr_enabled_modes = modes;
>>> +    cpr_state_load(&error_fatal);
>>>  }
>>>  
>>>  bool cpr_enabled(CprMode mode)
>>> @@ -153,6 +154,37 @@ err:
>>>      cpr_set_mode(CPR_MODE_NONE);
>>>  }
>>>  
>>> +static int preserve_fd(const char *name, int id, int fd, void *opaque)
>>> +{
>>> +    qemu_clear_cloexec(fd);
>>> +    return 0;
>>> +}
>>> +
>>> +static int unpreserve_fd(const char *name, int id, int fd, void *opaque)
>>> +{
>>> +    qemu_set_cloexec(fd);
>>> +    return 0;
>>> +}
>>> +
>>> +void qmp_cpr_exec(strList *args, Error **errp)
>>> +{
>>> +    if (!runstate_check(RUN_STATE_SAVE_VM)) {
>>> +        error_setg(errp, "runstate is not save-vm");
>>> +        return;
>>> +    }
>>> +    if (cpr_get_mode() != CPR_MODE_RESTART) {
>>> +        error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>>> +        return;
>>> +    }
>>> +
>>> +    cpr_walk_fd(preserve_fd, 0);
>>> +    if (cpr_state_save(errp)) {
>>> +        return;
>>> +    }
>>> +
>>> +    assert(qemu_system_exec_request(args, errp) == 0);
>>> +}
>>> +
>>>  void qmp_cpr_load(const char *filename, CprMode mode, Error **errp)
>>>  {
>>>      QEMUFile *f;
>>> @@ -189,6 +221,9 @@ void qmp_cpr_load(const char *filename, CprMode mode, Error **errp)
>>>          goto out;
>>>      }
>>>  
>>> +    /* Set cloexec to prevent fd leaks until the next cpr-save */
>>> +    cpr_walk_fd(unpreserve_fd, 0);
>>> +
>>>      state = global_state_get_runstate();
>>>      if (state == RUN_STATE_RUNNING) {
>>>          vm_start();
>>> diff --git a/qapi/cpr.json b/qapi/cpr.json
>>> index 11c6f88..47ee4ff 100644
>>> --- a/qapi/cpr.json
>>> +++ b/qapi/cpr.json
>>> @@ -15,11 +15,12 @@
>>>  # @CprMode:
>>>  #
>>>  # @reboot: checkpoint can be cpr-load'ed after a host reboot.
>>> +# @restart: checkpoint can be cpr-load'ed after restarting qemu.
>>>  #
>>>  # Since: 7.1
>>>  ##
>>>  { 'enum': 'CprMode',
>>> -  'data': [ 'none', 'reboot' ] }
>>> +  'data': [ 'none', 'reboot', 'restart' ] }
>>>  
>>>  ##
>>>  # @cpr-save:
>>> @@ -38,6 +39,11 @@
>>>  # issue the quit command, reboot the system, start qemu using the same
>>>  # arguments plus -S, and issue the cpr-load command.
>>>  #
>>> +# If @mode is 'restart', the checkpoint remains valid after restarting
>>> +# qemu using a subsequent cpr-exec.  Guest RAM must be backed by a
>>> +# memory-backend-file with share=on.
>>> +# To resume from the checkpoint, issue the cpr-load command.
>>> +#
>>>  # @filename: name of checkpoint file
>>>  # @mode: @CprMode mode
>>>  #
>>> @@ -48,6 +54,24 @@
>>>              'mode': 'CprMode' } }
>>>  
>>>  ##
>>> +# @cpr-exec:
>>> +#
>>> +# Restart qemu by directly exec'ing @argv[0], replacing the qemu process.
>>> +# The PID remains the same.  Must be called after cpr-save restart.
>>> +#
>>> +# @argv[0] should be the path of a new qemu binary, or a prefix command that
>>> +# in turn exec's the new qemu binary.  The arguments must match those used
>>> +# to initially start qemu, plus the -S option so new qemu starts in a paused
>>> +# state.
>>> +#
>>> +# @argv: arguments to be passed to exec().
>>> +#
>>> +# Since: 7.1
>>> +##
>>> +{ 'command': 'cpr-exec',
>>> +  'data': { 'argv': [ 'str' ] } }
>>> +
>>> +##
>>>  # @cpr-load:
>>>  #
>>>  # Load a virtual machine from the checkpoint file @filename that was created
>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>> index 6e51c33..1b49360 100644
>>> --- a/qemu-options.hx
>>> +++ b/qemu-options.hx
>>> @@ -4484,7 +4484,7 @@ SRST
>>>  ERST
>>>  
>>>  DEF("cpr-enable", HAS_ARG, QEMU_OPTION_cpr_enable, \
>>> -    "-cpr-enable reboot    enable the cpr mode\n",
>>> +    "-cpr-enable reboot|restart    enable the cpr mode\n",
>>>      QEMU_ARCH_ALL)
>>>  SRST
>>>  ``-cpr-enable reboot``
>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>>> index 822c424..412cc80 100644
>>> --- a/softmmu/physmem.c
>>> +++ b/softmmu/physmem.c
>>> @@ -44,6 +44,7 @@
>>>  #include "qemu/qemu-print.h"
>>>  #include "qemu/log.h"
>>>  #include "qemu/memalign.h"
>>> +#include "qemu/memfd.h"
>>>  #include "exec/memory.h"
>>>  #include "exec/ioport.h"
>>>  #include "sysemu/dma.h"
>>> @@ -1962,6 +1963,40 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
>>>      }
>>>  }
>>>  
>>> +static bool memory_region_is_backend(MemoryRegion *mr)
>>> +{
>>> +    return !!object_dynamic_cast(mr->parent_obj.parent, TYPE_MEMORY_BACKEND);
>>> +}
>>
>> Maybe or mr->owner is more readable?
> 
> Maybe OBJECT(mr)->parent.

Yes. I originally meant "OBJECT(mr)->parent or mr->owner", but "OBJECT(mr)" was missing
(maybe I deleted it by mistake).

[...]
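
For context on the memfd scheme described in the quoted commit message (secondary RAM blocks are backed by a memfd that stays open across exec and is mmap'd again afterwards), here is a minimal Linux-only sketch of the underlying idea. It does not exec. The second mmap() stands in for the mapping that new qemu would create, possibly at a different virtual address. The sketch_* names are hypothetical, and memfd_create is invoked through syscall() so the sketch does not depend on the glibc wrapper being available.

```c
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a memfd of the given size; returns the fd or -1.
 * Flags are 0, so close-on-exec is not set on the new fd. */
int sketch_memfd_new(const char *name, size_t size)
{
    int fd = (int)syscall(SYS_memfd_create, name, 0u);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the memfd shared, so that writes land in the memfd's pages. */
void *sketch_memfd_map(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Write through one mapping, drop it, map the same fd again and compare.
 * Returns 1 if the contents were preserved, 0 if not, -1 on error. */
int sketch_memfd_roundtrip(const char *msg)
{
    size_t size = 4096;
    size_t n = strlen(msg) + 1;
    char *first, *second;
    int fd, same;

    if (n > size) {
        return -1;
    }
    fd = sketch_memfd_new("sketch-ram", size);
    if (fd < 0) {
        return -1;
    }
    first = sketch_memfd_map(fd, size);
    if (!first) {
        close(fd);
        return -1;
    }
    memcpy(first, msg, n);
    munmap(first, size);        /* the old mapping goes away, as on exec */

    second = sketch_memfd_map(fd, size);
    if (!second) {
        close(fd);
        return -1;
    }
    same = (memcmp(second, msg, n) == 0);
    munmap(second, size);
    close(fd);
    return same;
}
```

The data lives in the memfd rather than in any particular mapping, which is why the patch only needs to carry the fd, lengths and alignment across exec.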



* Re: [PATCH V8 27/39] vfio-pci: cpr part 1 (fd and dma)
  2022-06-29 19:14   ` Alex Williamson
@ 2022-07-06 17:45     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-06 17:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/29/2022 3:14 PM, Alex Williamson wrote:
> On Wed, 15 Jun 2022 07:52:14 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Enable vfio-pci devices to be saved and restored across an exec restart
>> of qemu.
>>
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in cpr state.
>>
>> In the container pre_save handler, suspend the use of virtual addresses in
>> DMA mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be
>> remapped at a different VA after exec.  DMA to already-mapped pages
>> continues.  Save the msi message area as part of vfio-pci vmstate, save the
>> interrupt and notifier eventfd's in cpr state, and clear the close-on-exec
>> flag for the vfio descriptors.  The flag is not cleared earlier because the
>> descriptors should not persist across miscellaneous fork and exec calls
>> that may be performed during normal operation.
>>
>> On qemu restart, vfio_realize() finds the saved descriptors, uses
>> the descriptors, and notes that the device is being reused.  Device and
>> iommu state is already configured, so operations in vfio_realize that
>> would modify the configuration are skipped for a reused device, including
>> vfio ioctl's and writes to PCI configuration space.  Vfio PCI device reset
>> is also suppressed. The result is that vfio_realize constructs qemu data
>> structures that reflect the current state of the device.  However, the
>> reconstruction is not complete until cpr-load is called. cpr-load loads the
>> msi data.  The vfio post_load handler finds eventfds in cpr state, rebuilds
>> vector data structures, and attaches the interrupts to the new KVM instance.
>> The container post_load handler then invokes the main vfio listener
>> callback, which walks the flattened ranges of the vfio address space and
>> calls VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly,
>> cpr-load starts the VM.
>>
>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
>> support.  Part 3 adds INTX support.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  MAINTAINERS                   |   1 +
>>  hw/pci/pci.c                  |  12 ++++
>>  hw/vfio/common.c              | 151 +++++++++++++++++++++++++++++++++++-------
>>  hw/vfio/cpr.c                 | 119 +++++++++++++++++++++++++++++++++
>>  hw/vfio/meson.build           |   1 +
>>  hw/vfio/pci.c                 |  44 ++++++++++++
>>  hw/vfio/trace-events          |   1 +
>>  include/hw/vfio/vfio-common.h |  11 +++
>>  include/migration/vmstate.h   |   1 +
>>  9 files changed, 317 insertions(+), 24 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 74a43e6..864aec6 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -3156,6 +3156,7 @@ CPR
>>  M: Steve Sistare <steven.sistare@oracle.com>
>>  M: Mark Kanda <mark.kanda@oracle.com>
>>  S: Maintained
>> +F: hw/vfio/cpr.c
>>  F: include/migration/cpr.h
>>  F: migration/cpr.c
>>  F: qapi/cpr.json
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 6e70153..a3b19eb 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -32,6 +32,7 @@
>>  #include "hw/pci/pci_host.h"
>>  #include "hw/qdev-properties.h"
>>  #include "hw/qdev-properties-system.h"
>> +#include "migration/cpr.h"
>>  #include "migration/qemu-file-types.h"
>>  #include "migration/vmstate.h"
>>  #include "monitor/monitor.h"
>> @@ -341,6 +342,17 @@ static void pci_reset_regions(PCIDevice *dev)
>>  
>>  static void pci_do_device_reset(PCIDevice *dev)
>>  {
>> +    /*
>> +     * A PCI device that is resuming for cpr is already configured, so do
>> +     * not reset it here when we are called from qemu_system_reset prior to
>> +     * cpr-load, else interrupts may be lost for vfio-pci devices.  It is
>> +     * safe to skip this reset for all PCI devices, because cpr-load will set
>> +     * all fields that would have been set here.
>> +     */
>> +    if (cpr_get_mode() == CPR_MODE_RESTART) {
>> +        return;
>> +    }
>> +
>>      pci_device_deassert_intx(dev);
>>      assert(dev->irq_state == 0);
>>  
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index ace9562..c7d73b6 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -31,6 +31,7 @@
>>  #include "exec/memory.h"
>>  #include "exec/ram_addr.h"
>>  #include "hw/hw.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/range.h"
>> @@ -460,6 +461,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>          .size = size,
>>      };
>>  
>> +    assert(!container->reused);
>> +
>>      if (iotlb && container->dirty_pages_supported &&
>>          vfio_devices_all_running_and_saving(container)) {
>>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>> @@ -496,12 +499,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>  {
>>      struct vfio_iommu_type1_dma_map map = {
>>          .argsz = sizeof(map),
>> -        .flags = VFIO_DMA_MAP_FLAG_READ,
>>          .vaddr = (__u64)(uintptr_t)vaddr,
>>          .iova = iova,
>>          .size = size,
>>      };
>>  
>> +    /*
>> +     * Set the new vaddr for any mappings registered during cpr-load.
>> +     * Reused is cleared thereafter.
>> +     */
>> +    if (container->reused) {
>> +        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
>> +        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
>> +            goto fail;
>> +        }
>> +        return 0;
>> +    }
>> +
>> +    map.flags = VFIO_DMA_MAP_FLAG_READ;
>>      if (!readonly) {
>>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>>      }
>> @@ -517,7 +532,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>          return 0;
>>      }
>>  
>> -    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
>> +fail:
>> +    error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
>> +        (container->reused ? "VADDR" : ""), iova, size, vaddr, strerror(errno));
>>      return -errno;
>>  }
>>  
>> @@ -882,6 +899,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>                                       MemoryRegionSection *section)
>>  {
>>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>> +    vfio_container_region_add(container, section);
>> +}
>> +
>> +void vfio_container_region_add(VFIOContainer *container,
>> +                               MemoryRegionSection *section)
>> +{
>>      hwaddr iova, end;
>>      Int128 llend, llsize;
>>      void *vaddr;
>> @@ -1492,6 +1515,12 @@ static void vfio_listener_release(VFIOContainer *container)
>>      }
>>  }
>>  
>> +void vfio_listener_register(VFIOContainer *container)
>> +{
>> +    container->listener = vfio_memory_listener;
>> +    memory_listener_register(&container->listener, container->space->as);
>> +}
>> +
>>  static struct vfio_info_cap_header *
>>  vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
>>  {
>> @@ -1910,6 +1939,22 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>>  {
>>      int iommu_type, ret;
>>  
>> +    /*
>> +     * If container is reused, just set its type and skip the ioctls, as the
>> +     * container and group are already configured in the kernel.
>> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
>> +     */
>> +    if (container->reused) {
>> +        if (ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU)) {
>> +            container->iommu_type = VFIO_TYPE1v2_IOMMU;
>> +            return 0;
>> +        } else {
>> +            error_setg(errp, "container was reused but VFIO_TYPE1v2_IOMMU "
>> +                             "is not supported");
>> +            return -errno;
>> +        }
>> +    }
>> +
>>      iommu_type = vfio_get_iommu_type(container, errp);
>>      if (iommu_type < 0) {
>>          return iommu_type;
>> @@ -2014,9 +2059,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>  {
>>      VFIOContainer *container;
>>      int ret, fd;
>> +    bool reused;
>>      VFIOAddressSpace *space;
>>  
>>      space = vfio_get_address_space(as);
>> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>> +    reused = (fd > 0);
>>  
>>      /*
>>       * VFIO is currently incompatible with discarding of RAM insofar as the
>> @@ -2049,27 +2097,47 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>       * details once we know which type of IOMMU we are using.
>>       */
>>  
>> +    /*
>> +     * If the container is reused, then the group is already attached in the
>> +     * kernel.  If a container with matching fd is found, then update the
>> +     * userland group list and return.  If not, then after the loop, create
>> +     * the container struct and group list.
>> +     */
>> +
>>      QLIST_FOREACH(container, &space->containers, next) {
>> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> -            ret = vfio_ram_block_discard_disable(container, true);
>> -            if (ret) {
>> -                error_setg_errno(errp, -ret,
>> -                                 "Cannot set discarding of RAM broken");
>> -                if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
>> -                          &container->fd)) {
>> -                    error_report("vfio: error disconnecting group %d from"
>> -                                 " container", group->groupid);
>> -                }
>> -                return ret;
>> +        if (reused) {
>> +            if (container->fd != fd) {
>> +                continue;
>>              }
>> -            group->container = container;
>> -            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> +        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> +            continue;
>> +        }
>> +
>> +        ret = vfio_ram_block_discard_disable(container, true);
>> +        if (ret) {
>> +            error_setg_errno(errp, -ret,
>> +                             "Cannot set discarding of RAM broken");
>> +            if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER,
>> +                      &container->fd)) {
>> +                error_report("vfio: error disconnecting group %d from"
>> +                             " container", group->groupid);
>> +            }
>> +            return ret;
>> +        }
>> +        group->container = container;
>> +        QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> +        if (!reused) {
>>              vfio_kvm_device_add_group(group);
>> -            return 0;
>> +            cpr_save_fd("vfio_container_for_group", group->groupid,
>> +                        container->fd);
>>          }
>> +        return 0;
>> +    }
>> +
>> +    if (!reused) {
>> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>>      }
>>  
>> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>>      if (fd < 0) {
>>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>>          ret = -errno;
>> @@ -2087,6 +2155,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      container = g_malloc0(sizeof(*container));
>>      container->space = space;
>>      container->fd = fd;
>> +    container->reused = reused;
>>      container->error = NULL;
>>      container->dirty_pages_supported = false;
>>      container->dma_max_mappings = 0;
>> @@ -2099,6 +2168,11 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>          goto free_container_exit;
>>      }
>>  
>> +    ret = vfio_cpr_register_container(container, errp);
>> +    if (ret) {
>> +        goto free_container_exit;
>> +    }
>> +
>>      ret = vfio_ram_block_discard_disable(container, true);
>>      if (ret) {
>>          error_setg_errno(errp, -ret, "Cannot set discarding of RAM broken");
>> @@ -2213,9 +2287,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      group->container = container;
>>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>  
>> -    container->listener = vfio_memory_listener;
>> -
>> -    memory_listener_register(&container->listener, container->space->as);
>> +    /*
>> +     * If reused, register the listener later, after all state that may
>> +     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
>> +     * the listener will invoke its callback on each flat section and call
>> +     * vfio_dma_map to supply the new vaddr, and the calls will match the
>> +     * mappings remembered by the kernel.
>> +     */
>> +    if (!reused) {
>> +        vfio_listener_register(container);
>> +    }
>>  
>>      if (container->error) {
>>          ret = -1;
>> @@ -2225,8 +2306,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      }
>>  
>>      container->initialized = true;
>> +    ret = cpr_resave_fd("vfio_container_for_group", group->groupid, fd, errp);
>>  
>> -    return 0;
>> +    return ret;
> 
> 
> This needs to fall through to unwind if that resave fails.
> There also needs to be vfio_cpr_unregister_container() and
> cpr_delete_fd() calls in the unwind below, right?

Will do, thanks.

>>  listener_release_exit:
>>      QLIST_REMOVE(group, container_next);
>>      QLIST_REMOVE(container, next);
>> @@ -2254,6 +2336,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>  
>>      QLIST_REMOVE(group, container_next);
>>      group->container = NULL;
>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>>  
>>      /*
>>       * Explicitly release the listener first before unset container,
>> @@ -2290,6 +2373,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>          }
>>  
>>          trace_vfio_disconnect_container(container->fd);
>> +        vfio_cpr_unregister_container(container);
>>          close(container->fd);
>>          g_free(container);
>>  
>> @@ -2319,7 +2403,12 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>      group = g_malloc0(sizeof(*group));
>>  
>>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>> -    group->fd = qemu_open_old(path, O_RDWR);
>> +
>> +    group->fd = cpr_find_fd("vfio_group", groupid);
>> +    if (group->fd < 0) {
>> +        group->fd = qemu_open_old(path, O_RDWR);
>> +    }
>> +
>>      if (group->fd < 0) {
>>          error_setg_errno(errp, errno, "failed to open %s", path);
>>          goto free_group_exit;
>> @@ -2353,6 +2442,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>  
>>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>>  
>> +    if (cpr_resave_fd("vfio_group", groupid, group->fd, errp)) {
>> +        goto close_fd_exit;
>> +    }
>> +
>>      return group;
>>  
>>  close_fd_exit:
>> @@ -2377,6 +2470,7 @@ void vfio_put_group(VFIOGroup *group)
>>      vfio_disconnect_container(group);
>>      QLIST_REMOVE(group, next);
>>      trace_vfio_put_group(group->fd);
>> +    cpr_delete_fd("vfio_group", group->groupid);
>>      close(group->fd);
>>      g_free(group);
>>  
>> @@ -2390,8 +2484,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>>  {
>>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>      int ret, fd;
>> +    bool reused;
>> +
>> +    fd = cpr_find_fd(name, 0);
>> +    reused = (fd >= 0);
>> +    if (!reused) {
>> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>> +    }
>>  
>> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>>      if (fd < 0) {
>>          error_setg_errno(errp, errno, "error getting device from group %d",
>>                           group->groupid);
>> @@ -2436,12 +2536,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>>      vbasedev->num_irqs = dev_info.num_irqs;
>>      vbasedev->num_regions = dev_info.num_regions;
>>      vbasedev->flags = dev_info.flags;
>> +    vbasedev->reused = reused;
>>  
>>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>>                            dev_info.num_irqs);
>>  
>>      vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
>> -    return 0;
>> +    ret = cpr_resave_fd(name, 0, fd, errp);
>> +    return ret;
> 
> 
> This requires new unwind code.

Yeah, the introduction of cpr_resave_fd with an error return value in V8 makes
this messier.  cpr_resave_fd only fails if the fd was already saved with a different
value.  That is a qemu coding bug, either in the cpr state code, or in life cycle
mgmt for an object.  I propose to change this to an assertion inside cpr_resave_fd
and return void, so callers can always assume it succeeds.  Hence no unwind.

Sound OK?
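
To make the proposal concrete, here is a minimal standalone sketch, not the
actual cpr state code.  The table sketch_fds, its fixed size, and the names
sketch_find_fd and sketch_resave_fd are hypothetical stand-ins.  The only point
it illustrates is that a mismatch between the saved and new fd trips an
assertion, so the function can return void and callers need no unwind path.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the cpr fd state; not the QEMU implementation. */
#define SKETCH_MAX_FDS 16

struct sketch_fd_entry {
    const char *name;   /* not copied; the caller keeps the string alive */
    int id;
    int fd;
};

static struct sketch_fd_entry sketch_fds[SKETCH_MAX_FDS];
static int sketch_nr_fds;

/* Return the saved fd for (name, id), or -1 if none was saved. */
static int sketch_find_fd(const char *name, int id)
{
    for (int i = 0; i < sketch_nr_fds; i++) {
        if (sketch_fds[i].id == id && !strcmp(sketch_fds[i].name, name)) {
            return sketch_fds[i].fd;
        }
    }
    return -1;
}

/*
 * Save fd under (name, id) if it is not yet saved.  Finding a different
 * saved value is treated as a coding bug, so assert instead of returning
 * an error that every caller would have to unwind.
 */
static void sketch_resave_fd(const char *name, int id, int fd)
{
    int old_fd = sketch_find_fd(name, id);

    if (old_fd < 0) {
        assert(sketch_nr_fds < SKETCH_MAX_FDS);
        sketch_fds[sketch_nr_fds++] = (struct sketch_fd_entry){ name, id, fd };
    } else {
        assert(old_fd == fd);
    }
}
```

With this shape, a call such as sketch_resave_fd("vfio_group", groupid, fd) can
be repeated with the same fd without effect, and the callers simply return
their normal success value afterwards.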

>>  void vfio_put_base_device(VFIODevice *vbasedev)
>> @@ -2452,6 +2554,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>>      QLIST_REMOVE(vbasedev, next);
>>      vbasedev->group = NULL;
>>      trace_vfio_put_base_device(vbasedev->fd);
>> +    cpr_delete_fd(vbasedev->name, 0);
>>      close(vbasedev->fd);
>>  }
>>  
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> new file mode 100644
>> index 0000000..a227d5e
>> --- /dev/null
>> +++ b/hw/vfio/cpr.c
>> @@ -0,0 +1,119 @@
>> +/*
>> + * Copyright (c) 2021, 2022 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +#include "hw/vfio/vfio-common.h"
>> +#include "sysemu/kvm.h"
>> +#include "qapi/error.h"
>> +#include "migration/cpr.h"
>> +#include "migration/vmstate.h"
>> +#include "trace.h"
>> +
>> +static int
>> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>> +{
>> +    struct vfio_iommu_type1_dma_unmap unmap = {
>> +        .argsz = sizeof(unmap),
>> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
>> +        .iova = 0,
>> +        .size = 0,
>> +    };
>> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
>> +        return -errno;
>> +    }
>> +    container->vaddr_unmapped = true;
>> +    return 0;
>> +}
>> +
>> +static bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
>> +{
>> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
>> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
>> +                         "or VFIO_UNMAP_ALL");
>> +        return false;
>> +    } else {
>> +        return true;
>> +    }
>> +}
>> +
>> +static bool vfio_vmstate_needed(void *opaque)
>> +{
>> +    return cpr_get_mode() == CPR_MODE_RESTART;
>> +}
>> +
>> +static int vfio_container_pre_save(void *opaque)
>> +{
>> +    VFIOContainer *container = (VFIOContainer *)opaque;
>> +    Error *err;
>> +
>> +    if (!vfio_is_cpr_capable(container, &err) ||
>> +        vfio_dma_unmap_vaddr_all(container, &err)) {
>> +        error_report_err(err);
>> +        return -1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int vfio_container_post_load(void *opaque, int version_id)
>> +{
>> +    VFIOContainer *container = (VFIOContainer *)opaque;
>> +    VFIOGroup *group;
>> +    Error *err;
>> +    VFIODevice *vbasedev;
>> +
>> +    if (!vfio_is_cpr_capable(container, &err)) {
>> +        error_report_err(err);
>> +        return -1;
>> +    }
>> +
>> +    vfio_listener_register(container);
>> +    container->reused = false;
>> +
>> +    QLIST_FOREACH(group, &container->group_list, container_next) {
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            vbasedev->reused = false;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +static const VMStateDescription vfio_container_vmstate = {
>> +    .name = "vfio-container",
>> +    .unmigratable = 1,
> 
> 
> How does this work with vfio devices supporting migration?  This needs
> to be coordinated with efforts to enable migration of vfio devices.

The .unmigratable setting is a mistake that I found with more testing after
posting, and I will delete it.  After deletion, this vmstate will not block
migration, and will not be used in migration, because vfio_vmstate_needed only
returns true for cpr.
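
As a rough illustration of that gating, here is a standalone sketch.  The types
SketchCprMode and SketchVMSD and the function sketch_count_saved are
hypothetical and only model the idea that a descriptor whose needed callback
returns false is skipped.  They are not the QEMU vmstate code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mode variable; stands in for what cpr_get_mode() reports. */
typedef enum {
    SKETCH_MODE_NONE,
    SKETCH_MODE_REBOOT,
    SKETCH_MODE_RESTART,
} SketchCprMode;

static SketchCprMode sketch_mode = SKETCH_MODE_NONE;

/* Hypothetical, heavily reduced descriptor with only a needed callback. */
typedef struct {
    const char *name;
    bool (*needed)(void *opaque);   /* NULL means always needed */
} SketchVMSD;

/* Mirrors the shape of vfio_vmstate_needed: true only in restart mode. */
static bool sketch_vfio_needed(void *opaque)
{
    (void)opaque;
    return sketch_mode == SKETCH_MODE_RESTART;
}

/* Count the descriptors that would be processed in the current mode. */
static int sketch_count_saved(const SketchVMSD *v, size_t n)
{
    int saved = 0;

    for (size_t i = 0; i < n; i++) {
        if (!v[i].needed || v[i].needed(NULL)) {
            saved++;
        }
    }
    return saved;
}
```

In this sketch, a descriptor using sketch_vfio_needed is skipped while the mode
is SKETCH_MODE_NONE and is processed once the mode is SKETCH_MODE_RESTART.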

>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .pre_save = vfio_container_pre_save,
>> +    .post_load = vfio_container_post_load,
>> +    .needed = vfio_vmstate_needed,
> 
> 
> I don't see that .needed is evaluated relative to .unmigratable above
> in determining if migration is blocked.
> 
> 
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>> +int vfio_cpr_register_container(VFIOContainer *container, Error **errp)
>> +{
>> +    container->cpr_blocker = NULL;
>> +    if (!vfio_is_cpr_capable(container, &container->cpr_blocker)) {
>> +        return cpr_add_blocker(&container->cpr_blocker, errp,
>> +                               CPR_MODE_RESTART, 0);
>> +    }
>> +
>> +    vmstate_register(NULL, -1, &vfio_container_vmstate, container);
>> +
>> +    return 0;
>> +}
>> +
>> +void vfio_cpr_unregister_container(VFIOContainer *container)
>> +{
>> +    cpr_del_blocker(&container->cpr_blocker);
>> +
>> +    vmstate_unregister(NULL, &vfio_container_vmstate, container);
>> +}
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index da9af29..e247b2b 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>>    'migration.c',
>>  ))
>>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>> +  'cpr.c',
>>    'display.c',
>>    'pci-quirks.c',
>>    'pci.c',
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 0143c9a..237231b 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -30,6 +30,7 @@
>>  #include "hw/qdev-properties-system.h"
>>  #include "migration/vmstate.h"
>>  #include "qapi/qmp/qdict.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/module.h"
>> @@ -2514,6 +2515,7 @@ const VMStateDescription vmstate_vfio_pci_config = {
>>      .name = "VFIOPCIDevice",
>>      .version_id = 1,
>>      .minimum_version_id = 1,
>> +    .priority = MIG_PRI_VFIO_PCI,   /* * must load before container */
>>      .fields = (VMStateField[]) {
>>          VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>>          VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
>> @@ -3243,6 +3245,11 @@ static void vfio_pci_reset(DeviceState *dev)
>>  {
>>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>>  
>> +    /* Do not reset the device during qemu_system_reset prior to cpr-load */
>> +    if (vdev->vbasedev.reused) {
>> +        return;
>> +    }
>> +
>>      trace_vfio_pci_reset(vdev->vbasedev.name);
>>  
>>      vfio_pci_pre_reset(vdev);
>> @@ -3350,6 +3357,42 @@ static Property vfio_pci_dev_properties[] = {
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> +/*
>> + * The kernel may change non-emulated config bits.  Exclude them from the
>> + * changed-bits check in get_pci_config_device.
>> + */
>> +static int vfio_pci_pre_load(void *opaque)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
>> +    int i;
>> +
>> +    for (i = 0; i < size; i++) {
>> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static bool vfio_pci_needed(void *opaque)
>> +{
>> +    return cpr_get_mode() == CPR_MODE_RESTART;
>> +}
>> +
>> +static const VMStateDescription vfio_pci_vmstate = {
>> +    .name = "vfio-pci",
>> +    .unmigratable = 1,
> 
> 
> Same question here.

Same, I will delete it.

>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .priority = MIG_PRI_VFIO_PCI,       /* must load before container */
>> +    .pre_load = vfio_pci_pre_load,
>> +    .needed = vfio_pci_needed,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>  {
>>      DeviceClass *dc = DEVICE_CLASS(klass);
>> @@ -3357,6 +3400,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>  
>>      dc->reset = vfio_pci_reset;
>>      device_class_set_props(dc, vfio_pci_dev_properties);
>> +    dc->vmsd = &vfio_pci_vmstate;
>>      dc->desc = "VFIO-based PCI device assignment";
>>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>>      pdc->realize = vfio_realize;
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 73dffe9..a6d0034 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -119,6 +119,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>>  vfio_dma_unmap_overflow_workaround(void) ""
>> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>>  
>>  # platform.c
>>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index e573f5a..17ad9ba 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -81,10 +81,14 @@ typedef struct VFIOContainer {
>>      int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>>      MemoryListener listener;
>>      MemoryListener prereg_listener;
>> +    Notifier cpr_notifier;
>> +    Error *cpr_blocker;
>>      unsigned iommu_type;
>>      Error *error;
>>      bool initialized;
>>      bool dirty_pages_supported;
>> +    bool reused;
>> +    bool vaddr_unmapped;
>>      uint64_t dirty_pgsizes;
>>      uint64_t max_dirty_bitmap_size;
>>      unsigned long pgsizes;
>> @@ -136,6 +140,7 @@ typedef struct VFIODevice {
>>      bool no_mmap;
>>      bool ram_block_discard_allowed;
>>      bool enable_migration;
>> +    bool reused;
>>      VFIODeviceOps *ops;
>>      unsigned int num_irqs;
>>      unsigned int num_regions;
>> @@ -213,6 +218,9 @@ void vfio_put_group(VFIOGroup *group);
>>  int vfio_get_device(VFIOGroup *group, const char *name,
>>                      VFIODevice *vbasedev, Error **errp);
>>  
>> +int vfio_cpr_register_container(VFIOContainer *container, Error **errp);
>> +void vfio_cpr_unregister_container(VFIOContainer *container);
>> +
>>  extern const MemoryRegionOps vfio_region_ops;
>>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>>  extern VFIOGroupList vfio_group_list;
>> @@ -234,6 +242,9 @@ struct vfio_info_cap_header *
>>  vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>>  #endif
>>  extern const MemoryListener vfio_prereg_listener;
>> +void vfio_listener_register(VFIOContainer *container);
>> +void vfio_container_region_add(VFIOContainer *container,
>> +                               MemoryRegionSection *section);
>>  
>>  int vfio_spapr_create_window(VFIOContainer *container,
>>                               MemoryRegionSection *section,
>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>> index ad24aa1..19f1538 100644
>> --- a/include/migration/vmstate.h
>> +++ b/include/migration/vmstate.h
>> @@ -157,6 +157,7 @@ typedef enum {
>>      MIG_PRI_GICV3_ITS,          /* Must happen before PCI devices */
>>      MIG_PRI_GICV3,              /* Must happen before the ITS */
>>      MIG_PRI_MAX,
>> +    MIG_PRI_VFIO_PCI = MIG_PRI_IOMMU,
> 
> 
> Based on the current contents of this enum, why are we aliasing an
> existing priority vs defining a new one?  Thanks,

Sharing a priority is slightly more efficient because migration iterates over
a list of lists of vmstate handlers indexed by priority, and it is more maintainable
because it expresses the minimal ordering requirement and reduces the likelihood
of priority ordering conflicts later.

The minimal ordering requirement for the vfio handlers is that they come
before the vfio container handler at MIG_PRI_DEFAULT = 0.  Thus the minimal
priority is 1, which matches MIG_PRI_IOMMU.  I should have expressed that more
clearly like so:

  MIG_PRI_VFIO_PCI = MIG_PRI_DEFAULT + 1,    /* Must happen before vfio containers */
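
For illustration only, here is a standalone sketch of that ordering.  The enum
is an abbreviated, hypothetical copy with illustrative values, and
sketch_loads_before is a made-up helper that encodes the assumption stated
above, namely that a handler with a higher priority is processed before one
with a lower priority.

```c
/* Abbreviated, hypothetical copy of the priority enum; values are illustrative. */
typedef enum {
    SKETCH_PRI_DEFAULT = 0,
    SKETCH_PRI_IOMMU,                               /* 1 in this sketch */
    SKETCH_PRI_MAX,
    SKETCH_PRI_VFIO_PCI = SKETCH_PRI_DEFAULT + 1,   /* Must happen before vfio containers */
} SketchPriority;

/* Assumption: a handler with higher priority is processed first. */
static int sketch_loads_before(SketchPriority a, SketchPriority b)
{
    return a > b;
}
```

Defined this way, the vfio-pci priority states its requirement relative to the
default priority, and only happens to equal the IOMMU priority.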

- Steve




* Re: [PATCH V8 28/39] vfio-pci: cpr part 2 (msi)
  2022-06-29 20:19   ` Alex Williamson
@ 2022-07-06 17:46     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-06 17:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/29/2022 4:19 PM, Alex Williamson wrote:
> On Wed, 15 Jun 2022 07:52:15 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Finish cpr for vfio-pci MSI/MSI-X devices by preserving eventfd's and
>> vector state.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  hw/vfio/pci.c | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 121 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 237231b..2fd7121 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -53,17 +53,53 @@ static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
>>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
>>  static void vfio_msi_disable_common(VFIOPCIDevice *vdev);
>>  
>> +#define EVENT_FD_NAME(vdev, name)   \
>> +    g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
>> +
>> +static int save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
>> +                         EventNotifier *ev)
>> +{
>> +    int fd = event_notifier_get_fd(ev);
>> +
>> +    if (fd >= 0) {
>> +        Error *err;
>> +        g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
>> +
>> +        if (cpr_resave_fd(fdname, nr, fd, &err)) {
>> +            error_report_err(err);
>> +            return 1;
> 
> 
> Preferably -1, but the caller doesn't actually test the return value
> anyway :-\

Per my previous email, I suggest that cpr_resave_fd return void, and hence
save_event_fd becomes void as well.

>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
>> +{
>> +    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
>> +    int fd = cpr_find_fd(fdname, nr);
>> +    return fd;
> 
> 
>     return cpr_find_fd(EVENT_FD_NAME(vdev, name), nr);

That leaks EVENT_FD_NAME, produced by g_strdup_printf, but I can reduce it to:
    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
    return cpr_find_fd(fdname, nr);
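
The ownership point can be shown with a standalone sketch that does not depend
on GLib.  It uses the GCC/Clang cleanup variable attribute, which is the
mechanism g_autofree is built on.  All names here (sketch_autofree,
SKETCH_AUTOFREE, sketch_fd_name, sketch_lookup, sketch_load) are hypothetical,
and sketch_lookup merely stands in for a lookup that reads the name without
taking ownership of it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int sketch_frees;    /* counts cleanup calls, for demonstration only */

/* Cleanup handler: receives a pointer to the annotated variable. */
static void sketch_autofree(char **p)
{
    free(*p);
    sketch_frees++;
}

#define SKETCH_AUTOFREE __attribute__((cleanup(sketch_autofree)))

/* Allocate "<dev>_<name>"; the caller owns the returned string. */
static char *sketch_fd_name(const char *dev, const char *name)
{
    size_t len = strlen(dev) + strlen(name) + 2;
    char *s = malloc(len);

    if (s) {
        snprintf(s, len, "%s_%s", dev, name);
    }
    return s;
}

/* Hypothetical lookup that only reads the name and does not free it. */
static int sketch_lookup(const char *fdname)
{
    return fdname ? (int)strlen(fdname) : -1;
}

static int sketch_load(const char *dev, const char *name)
{
    /* Naming the temporary lets the cleanup handler free it on scope exit. */
    SKETCH_AUTOFREE char *fdname = sketch_fd_name(dev, name);

    return sketch_lookup(fdname);
}
```

Passing sketch_fd_name(dev, name) directly as the argument of sketch_lookup
would leave no variable holding the allocation, so nothing could free it.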

>> +}
>> +
>> +static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
>> +{
>> +    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
>> +    cpr_delete_fd(fdname, nr);
> 
> 
>     cpr_delete_fd(EVENT_FD_NAME(vdev, name), nr);

Ditto.

>> +}
>> +
>>  /* Create new or reuse existing eventfd */
>>  static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>>                                const char *name, int nr)
>>  {
>> -    int fd = -1;   /* placeholder until a subsequent patch */
>>      int ret = 0;
>> +    int fd = load_event_fd(vdev, name, nr);
>>  
>>      if (fd >= 0) {
>>          event_notifier_init_fd(e, fd);
>>      } else {
>>          ret = event_notifier_init(e, 0);
>> +        if (!ret) {
>> +            save_event_fd(vdev, name, nr, e);
> 
> 
> Return value not tested.  The function generates an error report if it
> fails, but it doesn't seem that actually blocks a cpr attempt.  Do we
> just wind up with that error report as a breadcrumb to why cpr breaks
> with a missing fd down the road?

Thanks, that is a bug, it should have been:
    ret = save_event_fd(vdev, name, nr, e)
... but per the previous comment save_event_fd becomes void.

>> +        }
>>      }
>>      return ret;
>>  }
>> @@ -71,6 +107,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
>>  static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
>>                                    const char *name, int nr)
>>  {
>> +    delete_event_fd(vdev, name, nr);
>>      event_notifier_cleanup(e);
>>  }
>>  
>> @@ -511,6 +548,15 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
>>      VFIOMSIVector *vector;
>>      int ret;
>>  
>> +    /*
>> +     * Ignore the callback from msix_set_vector_notifiers during resume.
>> +     * The necessary subset of these actions is called from vfio_claim_vectors
>> +     * during post load.
>> +     */
>> +    if (vdev->vbasedev.reused) {
>> +        return 0;
>> +    }
>> +
>>      trace_vfio_msix_vector_do_use(vdev->vbasedev.name, nr);
>>  
>>      vector = &vdev->msi_vectors[nr];
>> @@ -2784,6 +2830,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
>>      fd = event_notifier_get_fd(&vdev->err_notifier);
>>      qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
>>  
>> +    /* Do not alter irq_signaling during vfio_realize for cpr */
>> +    if (vdev->vbasedev.reused) {
>> +        return;
>> +    }
>> +
>>      if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
>>                                 VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>>          error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>> @@ -2849,6 +2900,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>>      fd = event_notifier_get_fd(&vdev->req_notifier);
>>      qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
>>  
>> +    /* Do not alter irq_signaling during vfio_realize for cpr */
>> +    if (vdev->vbasedev.reused) {
>> +        vdev->req_enabled = true;
>> +        return;
>> +    }
> 
> vfio_notifier_init() transparently gets the old fd or creates a new
> one, how do we know which has occurred to know that this eventfd is
> already configured?

The caller can check the reused flag, which is set iff an old fd exists.
I could pass reused to vfio_notifier_init to assert that, but in some cases
I would need to pass a reused flag down through several functions to reach
vfio_notifier_init, which just seems ugly.

> Don't we also have the same issue relative to vdev->pci_aer for the
> error handler?

Same answer:
    vfio_register_err_notifier()
        vfio_notifier_init();
        if (vdev->vbasedev.reused)
            return;
        vfio_set_irq_signaling() ...

>> +
>>      if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
>>                             VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
>>          error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>> @@ -3357,6 +3414,43 @@ static Property vfio_pci_dev_properties[] = {
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> +static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
>> +{
>> +    int i, fd;
>> +    bool pending = false;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    vdev->nr_vectors = nr_vectors;
>> +    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
>> +    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
>> +
>> +    for (i = 0; i < nr_vectors; i++) {
>> +        VFIOMSIVector *vector = &vdev->msi_vectors[i];
>> +
>> +        fd = load_event_fd(vdev, "interrupt", i);
>> +        if (fd >= 0) {
>> +            vfio_vector_init(vdev, i);
>> +            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
>> +        }
>> +
>> +        if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
>> +            vfio_route_change = kvm_irqchip_begin_route_changes(kvm_state);
>> +            vfio_add_kvm_msi_virq(vdev, vector, i, msix);
>> +            kvm_irqchip_commit_route_changes(&vfio_route_change);
>> +            vfio_connect_kvm_msi_virq(vector, i);
> 
> 
> Shouldn't we take advantage of the batching support here?

OK, will do.

>> +        }
> 
> How do we debug if one of the above fails that shouldn't have failed?
> Should we have an assert or change this to a non-void return if we
> cannot setup an interrupt that we think is configured?

The path above ending with qemu_set_fd_handler always succeeds, because:

    fd = load_event_fd(vdev, "interrupt", i);
    if (fd >= 0) {
        vfio_vector_init(vdev, i)
            vfio_notifier_init(..., "interrupt", i)
                int fd = load_event_fd(vdev, name, i);
                if (fd >= 0) {
                    event_notifier_init_fd(e, fd);      <-- void, never fails

In the kvm_interrupt clause, only vfio_connect_kvm_msi_virq() can fail.  But, it
returns void, and other callers also assume it succeeds.  Good enough, or do
you want to do better here?

>> +
>> +        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
>> +            set_bit(i, vdev->msix->pending);
>> +            pending = true;
>> +        }
>> +    }
>> +
>> +    if (msix) {
>> +        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
>> +    }
>> +}
>> +
>>  /*
>>   * The kernel may change non-emulated config bits.  Exclude them from the
>>   * changed-bits check in get_pci_config_device.
>> @@ -3375,6 +3469,29 @@ static int vfio_pci_pre_load(void *opaque)
>>      return 0;
>>  }
>>  
>> +static int vfio_pci_post_load(void *opaque, int version_id)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    int nr_vectors;
>> +
>> +    if (msix_enabled(pdev)) {
>> +        msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
>> +                                   vfio_msix_vector_release, NULL);
>> +        nr_vectors = vdev->msix->entries;
> 
> Maybe this is why we're not generating an error above, we don't know
> which vectors are configured other than if they have a saved eventfd,
> where we don't test whether we were able to actually save the fd.
> Thanks,
> 
> Alex
> 
> 
>> +        vfio_claim_vectors(vdev, nr_vectors, true);
>> +
>> +    } else if (msi_enabled(pdev)) {
>> +        nr_vectors = msi_nr_vectors_allocated(pdev);
>> +        vfio_claim_vectors(vdev, nr_vectors, false);
>> +
>> +    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
>> +        assert(0);      /* completed in a subsequent patch */
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>  static bool vfio_pci_needed(void *opaque)
>>  {
>>      return cpr_get_mode() == CPR_MODE_RESTART;
>> @@ -3387,8 +3504,11 @@ static const VMStateDescription vfio_pci_vmstate = {
>>      .minimum_version_id = 0,
>>      .priority = MIG_PRI_VFIO_PCI,       /* must load before container */
>>      .pre_load = vfio_pci_pre_load,
>> +    .post_load = vfio_pci_post_load,
>>      .needed = vfio_pci_needed,
>>      .fields = (VMStateField[]) {
>> +        VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>> +        VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
>>          VMSTATE_END_OF_LIST()
>>      }
>>  };
> 



* Re: [PATCH V8 29/39] vfio-pci: cpr part 3 (intx)
  2022-06-29 20:43   ` Alex Williamson
@ 2022-07-06 17:46     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-06 17:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/29/2022 4:43 PM, Alex Williamson wrote:
> On Wed, 15 Jun 2022 07:52:16 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Preserve vfio INTX state across cpr restart.  Preserve VFIOINTx fields as
>> follows:
>>   pin : Recover this from the vfio config in kernel space
>>   interrupt : Preserve its eventfd descriptor across exec.
>>   unmask : Ditto
>>   route.irq : This could perhaps be recovered in vfio_pci_post_load by
>>     calling pci_device_route_intx_to_irq(pin), whose implementation reads
>>     config space for a bridge device such as ich9.  However, there is no
>>     guarantee that the bridge vmstate is read before vfio vmstate.  Rather
>>     than fiddling with MigrationPriority for vmstate handlers, explicitly
>>     save route.irq in vfio vmstate.
>>   pending : save in vfio vmstate.
>>   mmap_timeout, mmap_timer : Re-initialize
>>   bool kvm_accel : Re-initialize
>>
>> In vfio_realize, defer calling vfio_intx_enable until the vmstate
>> is available, in vfio_pci_post_load.  Modify vfio_intx_enable and
>> vfio_intx_kvm_enable to skip vfio initialization, but still perform
>> kvm initialization.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  hw/vfio/pci.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
>>  1 file changed, 83 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 2fd7121..b8aee91 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -173,14 +173,45 @@ static void vfio_intx_eoi(VFIODevice *vbasedev)
>>      vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>>  }
>>  
>> +#ifdef CONFIG_KVM
>> +static bool vfio_no_kvm_intx(VFIOPCIDevice *vdev)
>> +{
>> +    return vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
>> +           vdev->intx.route.mode != PCI_INTX_ENABLED ||
>> +           !kvm_resamplefds_enabled();
>> +}
>> +#endif
>> +
>> +static void vfio_intx_reenable_kvm(VFIOPCIDevice *vdev, Error **errp)
>> +{
>> +#ifdef CONFIG_KVM
>> +    if (vfio_no_kvm_intx(vdev)) {
>> +        return;
>> +    }
>> +
>> +    if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
>> +        error_setg(errp, "vfio_notifier_init intx-unmask failed");
>> +        return;
>> +    }
>> +
>> +    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
>> +                                           &vdev->intx.interrupt,
>> +                                           &vdev->intx.unmask,
>> +                                           vdev->intx.route.irq)) {
>> +        error_setg_errno(errp, errno, "failed to setup resample irqfd");
> 
> 
> Does not unwind with vfio_notifier_cleanup().  This also exactly
> duplicates code in vfio_intx_enable_kvm(), which suggests it needs
> further refactoring to a common helper.

I will delete vfio_intx_reenable_kvm and add conditionals to vfio_intx_enable_kvm.
That looks better.
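
The unwind Alex asks for can be sketched roughly as below.  This is a
standalone illustration, not QEMU code: Notifier, notifier_init,
notifier_cleanup, add_irqfd and enable_kvm_intx are hypothetical stand-ins
for the event notifier, vfio_notifier_init, vfio_notifier_cleanup,
kvm_irqchip_add_irqfd_notifier_gsi and the merged enable path.  The only
point is that a failure in the second step releases what the first step
created.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an event notifier; not the QEMU type. */
typedef struct {
    bool initialized;
} Notifier;

static int notifier_init(Notifier *n)
{
    n->initialized = true;
    return 0;
}

static void notifier_cleanup(Notifier *n)
{
    n->initialized = false;
}

/* Stand-in for attaching the resample irqfd; fails when 'fail' is set. */
static int add_irqfd(Notifier *unmask, bool fail)
{
    assert(unmask->initialized);
    return fail ? -1 : 0;
}

/* Returns 0 on success; on failure the notifier is left uninitialized. */
static int enable_kvm_intx(Notifier *unmask, bool *kvm_accel, bool fail_irqfd)
{
    if (notifier_init(unmask)) {
        return -1;
    }
    if (add_irqfd(unmask, fail_irqfd)) {
        notifier_cleanup(unmask);   /* unwind the step that succeeded */
        return -1;
    }
    *kvm_accel = true;
    return 0;
}
```

With a single function owning both steps, the cleanup lives in one place
instead of being duplicated across an enable and a reenable variant.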

>> +        return;
>> +    }
>> +
>> +    vdev->intx.kvm_accel = true;
>> +#endif
>> +}
>> +
>>  static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>>  {
>>  #ifdef CONFIG_KVM
>>      int irq_fd = event_notifier_get_fd(&vdev->intx.interrupt);
>>  
>> -    if (vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
>> -        vdev->intx.route.mode != PCI_INTX_ENABLED ||
>> -        !kvm_resamplefds_enabled()) {
>> +    if (vfio_no_kvm_intx(vdev)) {
>>          return;
>>      }
>>  
>> @@ -328,7 +359,13 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>>          return 0;
>>      }
>>  
>> -    vfio_disable_interrupts(vdev);
>> +    /*
>> +     * Do not alter interrupt state during vfio_realize and cpr-load.  The
>> +     * reused flag is cleared thereafter.
>> +     */
>> +    if (!vdev->vbasedev.reused) {
>> +        vfio_disable_interrupts(vdev);
>> +    }
>>  
>>      vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
>>      pci_config_set_interrupt_pin(vdev->pdev.config, pin);
>> @@ -353,6 +390,11 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>>      fd = event_notifier_get_fd(&vdev->intx.interrupt);
>>      qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
>>  
>> +    if (vdev->vbasedev.reused) {
>> +        vfio_intx_reenable_kvm(vdev, &err);
>> +        goto finish;
>> +    }
>> +
> 
> This only jumps over the vfio_set_irq_signaling() and
> vfio_intx_enable_kvm(), largely replacing the latter with chunks of
> code taken from it.  Doesn't seem like the right factoring.
Cleaned up in the next version.

>>      if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
>>                                 VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
>>          qemu_set_fd_handler(fd, NULL, NULL, vdev);
>> @@ -365,6 +407,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
>>          warn_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
>>      }
>>  
>> +finish:
>>      vdev->interrupt = VFIO_INT_INTx;
>>  
>>      trace_vfio_intx_enable(vdev->vbasedev.name);
>> @@ -3195,9 +3238,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>                                               vfio_intx_routing_notifier);
>>          vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
>>          kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
>> -        ret = vfio_intx_enable(vdev, errp);
>> -        if (ret) {
>> -            goto out_deregister;
>> +
>> +        /* Wait until cpr-load reads intx routing data to enable */
>> +        if (!vdev->vbasedev.reused) {
>> +            ret = vfio_intx_enable(vdev, errp);
>> +            if (ret) {
>> +                goto out_deregister;
>> +            }
>>          }
>>      }
>>  
>> @@ -3474,6 +3521,7 @@ static int vfio_pci_post_load(void *opaque, int version_id)
>>      VFIOPCIDevice *vdev = opaque;
>>      PCIDevice *pdev = &vdev->pdev;
>>      int nr_vectors;
>> +    int ret = 0;
>>  
>>      if (msix_enabled(pdev)) {
>>          msix_set_vector_notifiers(pdev, vfio_msix_vector_use,
>> @@ -3486,10 +3534,35 @@ static int vfio_pci_post_load(void *opaque, int version_id)
>>          vfio_claim_vectors(vdev, nr_vectors, false);
>>  
>>      } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
>> -        assert(0);      /* completed in a subsequent patch */
>> +        Error *err = 0;
>> +        ret = vfio_intx_enable(vdev, &err);
>> +        if (ret) {
>> +            error_report_err(err);
>> +        }
>>      }
>>  
>> -    return 0;
>> +    return ret;
>> +}
>> +
>> +static const VMStateDescription vfio_intx_vmstate = {
>> +    .name = "vfio-intx",
>> +    .unmigratable = 1,
> 
> 
> unmigratable-vmstates-to-interfere-with-migration+
A bug, will delete.

- Steve
 
>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_BOOL(pending, VFIOINTx),
>> +        VMSTATE_UINT32(route.mode, VFIOINTx),
>> +        VMSTATE_INT32(route.irq, VFIOINTx),
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>> +#define VMSTATE_VFIO_INTX(_field, _state) {                         \
>> +    .name       = (stringify(_field)),                              \
>> +    .size       = sizeof(VFIOINTx),                                 \
>> +    .vmsd       = &vfio_intx_vmstate,                               \
>> +    .flags      = VMS_STRUCT,                                       \
>> +    .offset     = vmstate_offset_value(_state, _field, VFIOINTx),   \
>>  }
>>  
>>  static bool vfio_pci_needed(void *opaque)
>> @@ -3509,6 +3582,7 @@ static const VMStateDescription vfio_pci_vmstate = {
>>      .fields = (VMStateField[]) {
>>          VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
>>          VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
>> +        VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
>>          VMSTATE_END_OF_LIST()
>>      }
>>  };
> 



* Re: [PATCH V8 30/39] vfio-pci: recover from unmap-all-vaddr failure
  2022-06-29 22:58   ` Alex Williamson
@ 2022-07-06 17:46     ` Steven Sistare
  0 siblings, 0 replies; 84+ messages in thread
From: Steven Sistare @ 2022-07-06 17:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert,
	Michael S. Tsirkin, Marcel Apfelbaum, Daniel P. Berrange,
	Juan Quintela, Markus Armbruster, Eric Blake, Jason Zeng,
	Zheng Chuan, Mark Kanda, Guoyi Tu, Peter Maydell,
	Philippe Mathieu-Daudé,
	Igor Mammedov, David Hildenbrand, John Snow

On 6/29/2022 6:58 PM, Alex Williamson wrote:
> On Wed, 15 Jun 2022 07:52:17 -0700
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> If vfio_cpr_save fails to unmap all vaddr's, then recover by walking all
>> flat sections to restore the vaddr for each.  Do so by invoking the
>> vfio listener callback, and passing a new "replay" flag that tells it
>> to replay a mapping without re-allocating new userland data structures.
> 
> Is this comment accurate?  I thought we had unwind in the kernel for
> vaddr invalidation, and the notifier here is hooked up to any fault, so
> it's at least misleading regarding vaddr.  

The comment is misleading; I'll fix it.  If there are multiple containers and
unmap-all fails for one of them, we need to remap vaddr for the other
containers for which unmap-all succeeded.
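
A rough sketch of that recovery, for illustration only.  Container,
unmap_all_vaddr, remap_vaddr and save_all are hypothetical names, not the
functions in this patch, and the sketch assumes that a container whose
unmap-all fails keeps its own vaddr valid (the kernel-side unwind Alex
mentions), so only the containers processed earlier need to be restored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vfio container; not the QEMU type. */
typedef struct {
    bool vaddr_valid;   /* true while the vaddr mapping is usable */
    bool unmap_fails;   /* knob for the sketch: make unmap-all fail here */
} Container;

/* Assumption: on failure this container's vaddr is left untouched. */
static int unmap_all_vaddr(Container *c)
{
    if (c->unmap_fails) {
        return -1;
    }
    c->vaddr_valid = false;
    return 0;
}

static void remap_vaddr(Container *c)
{
    c->vaddr_valid = true;
}

/*
 * Invalidate vaddr in every container.  If one fails, restore vaddr in
 * the containers that were already invalidated, then report the error.
 */
static int save_all(Container *cs, size_t n)
{
    assert(cs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (unmap_all_vaddr(&cs[i])) {
            while (i-- > 0) {
                remap_vaddr(&cs[i]);
            }
            return -1;
        }
    }
    return 0;
}
```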

> The replay option really
> needs some documentation in comments.

Will do.

>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  hw/vfio/common.c              | 66 ++++++++++++++++++++++++++++++++-----------
>>  hw/vfio/cpr.c                 | 29 +++++++++++++++++++
>>  include/hw/vfio/vfio-common.h |  2 +-
>>  3 files changed, 80 insertions(+), 17 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index c7d73b6..5f2bd50 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -895,15 +895,35 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
>>      return true;
>>  }
>>  
>> +static VFIORamDiscardListener *vfio_find_ram_discard_listener(
>> +    VFIOContainer *container, MemoryRegionSection *section)
>> +{
>> +    VFIORamDiscardListener *vrdl = NULL;
> 
> This initialization was copied from current code, but...
> 
> #define QLIST_FOREACH(var, head, field)                                 \
>         for ((var) = ((head)->lh_first);                                \
>                ...
> 
> it doesn't look necessary.  Thanks,

Sure, will remove it.
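
For reference, a small standalone sketch of why the initializer is
redundant.  LIST_FOREACH_SKETCH is a simplified stand-in with the same
shape as the macro quoted above, and Node, Head and find are hypothetical
names.  The loop assigns the variable before testing it, and a traversal
that finds no match ends with the variable equal to NULL.

```c
#include <assert.h>
#include <stddef.h>

struct Node {
    int key;
    struct Node *le_next;
};

struct Head {
    struct Node *lh_first;
};

/* Simplified stand-in for QLIST_FOREACH: var is assigned before any test. */
#define LIST_FOREACH_SKETCH(var, head)                                  \
        for ((var) = ((head)->lh_first); (var); (var) = ((var)->le_next))

/* Returns the matching node, or NULL when the loop runs to completion. */
static struct Node *find(struct Head *head, int key)
{
    struct Node *n;     /* no initializer: the loop always assigns it */

    assert(head != NULL);

    LIST_FOREACH_SKETCH(n, head) {
        if (n->key == key) {
            break;
        }
    }
    return n;
}
```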

- Steve
 
>> +
>> +    QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
>> +        if (vrdl->mr == section->mr &&
>> +            vrdl->offset_within_address_space ==
>> +            section->offset_within_address_space) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    if (!vrdl) {
>> +        hw_error("vfio: Trying to sync missing RAM discard listener");
>> +        /* does not return */
>> +    }
>> +    return vrdl;
>> +}
>> +
[...]



end of thread, other threads:[~2022-07-06 17:54 UTC | newest]

Thread overview: 84+ messages
2022-06-15 14:51 [PATCH V8 00/39] Live Update Steve Sistare
2022-06-15 14:51 ` [PATCH V8 01/39] migration: fix populate_vfio_info Steve Sistare
2022-06-16 14:41   ` Marc-André Lureau
2022-06-15 14:51 ` [PATCH V8 02/39] migration: qemu file wrappers Steve Sistare
2022-06-16  2:18   ` Guoyi Tu
2022-07-05 18:24     ` Steven Sistare
2022-06-16 14:55   ` Marc-André Lureau
2022-07-05 18:25     ` Steven Sistare
2022-06-16 15:29   ` Daniel P. Berrangé
2022-07-05 18:25     ` Steven Sistare
2022-06-15 14:51 ` [PATCH V8 03/39] migration: simplify savevm Steve Sistare
2022-06-16 14:59   ` Marc-André Lureau
2022-06-15 14:51 ` [PATCH V8 04/39] memory: RAM_ANON flag Steve Sistare
2022-06-15 20:25   ` David Hildenbrand
2022-07-05 18:23     ` Steven Sistare
2022-06-15 14:51 ` [PATCH V8 05/39] vl: start on wakeup request Steve Sistare
2022-06-16 15:55   ` Marc-André Lureau
2022-07-05 18:26     ` Steven Sistare
2022-06-15 14:51 ` [PATCH V8 06/39] cpr: reboot mode Steve Sistare
2022-06-16 11:10   ` Daniel P. Berrangé
2022-07-05 18:26     ` Steven Sistare
2022-06-15 14:51 ` [PATCH V8 07/39] cpr: reboot HMP interfaces Steve Sistare
2022-06-15 14:51 ` [PATCH V8 08/39] cpr: blockers Steve Sistare
2022-06-15 14:51 ` [PATCH V8 09/39] cpr: register blockers Steve Sistare
2022-06-15 14:51 ` [PATCH V8 10/39] cpr: cpr-enable option Steve Sistare
2022-06-15 14:51 ` [PATCH V8 11/39] cpr: save ram blocks Steve Sistare
2022-06-15 14:51 ` [PATCH V8 12/39] memory: flat section iterator Steve Sistare
2022-07-03  7:52   ` Peng Liang
2022-07-05 18:26     ` Steven Sistare
2022-06-15 14:52 ` [PATCH V8 13/39] oslib: qemu_clear_cloexec Steve Sistare
2022-06-16 16:01   ` Marc-André Lureau
2022-06-16 16:07   ` Daniel P. Berrangé
2022-07-05 18:27     ` Steven Sistare
2022-06-15 14:52 ` [PATCH V8 14/39] qapi: strList_from_string Steve Sistare
2022-06-16 16:04   ` Marc-André Lureau
2022-07-05 18:28     ` Steven Sistare
2022-06-15 14:52 ` [PATCH V8 15/39] qapi: QAPI_LIST_LENGTH Steve Sistare
2022-06-16 16:06   ` Marc-André Lureau
2022-06-15 14:52 ` [PATCH V8 16/39] qapi: strv_from_strList Steve Sistare
2022-06-16 16:08   ` Marc-André Lureau
2022-07-05 18:28     ` Steven Sistare
2022-06-15 14:52 ` [PATCH V8 17/39] qapi: strList unit tests Steve Sistare
2022-06-16 16:10   ` Marc-André Lureau
2022-06-15 14:52 ` [PATCH V8 18/39] vl: helper to request re-exec Steve Sistare
2022-06-15 14:52 ` [PATCH V8 19/39] cpr: preserve extra state Steve Sistare
2022-06-15 14:52 ` [PATCH V8 20/39] cpr: restart mode Steve Sistare
2022-07-03  8:15   ` Peng Liang
2022-07-05 18:29     ` Steven Sistare
2022-07-06  0:15       ` Peng Liang
2022-06-15 14:52 ` [PATCH V8 21/39] cpr: restart HMP interfaces Steve Sistare
2022-06-15 14:52 ` [PATCH V8 22/39] cpr: ram block blockers Steve Sistare
2022-06-15 14:52 ` [PATCH V8 23/39] hostmem-memfd: cpr for memory-backend-memfd Steve Sistare
2022-06-15 14:52 ` [PATCH V8 24/39] pci: export export msix_is_pending Steve Sistare
2022-06-27 22:44   ` Michael S. Tsirkin
2022-07-05 18:29     ` Steven Sistare
2022-06-15 14:52 ` [PATCH V8 25/39] cpr: notifiers Steve Sistare
2022-06-15 14:52 ` [PATCH V8 26/39] vfio-pci: refactor for cpr Steve Sistare
2022-06-15 14:52 ` [PATCH V8 27/39] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
2022-06-29 19:14   ` Alex Williamson
2022-07-06 17:45     ` Steven Sistare
2022-07-03  8:32   ` Peng Liang
2022-07-05 18:29     ` Steven Sistare
2022-06-15 14:52 ` [PATCH V8 28/39] vfio-pci: cpr part 2 (msi) Steve Sistare
2022-06-29 20:19   ` Alex Williamson
2022-07-06 17:46     ` Steven Sistare
2022-06-15 14:52 ` [PATCH V8 29/39] vfio-pci: cpr part 3 (intx) Steve Sistare
2022-06-29 20:43   ` Alex Williamson
2022-07-06 17:46     ` Steven Sistare
2022-06-15 14:52 ` [PATCH V8 30/39] vfio-pci: recover from unmap-all-vaddr failure Steve Sistare
2022-06-29 22:58   ` Alex Williamson
2022-07-06 17:46     ` Steven Sistare
2022-06-15 14:52 ` [PATCH V8 31/39] vhost: reset vhost devices for cpr Steve Sistare
2022-06-15 14:52 ` [PATCH V8 32/39] loader: suppress rom_reset during cpr Steve Sistare
2022-06-15 14:52 ` [PATCH V8 33/39] chardev: cpr framework Steve Sistare
2022-06-15 14:52 ` [PATCH V8 34/39] chardev: cpr for simple devices Steve Sistare
2022-06-15 14:52 ` [PATCH V8 35/39] chardev: cpr for pty Steve Sistare
2022-06-15 14:52 ` [PATCH V8 36/39] chardev: cpr for sockets Steve Sistare
2022-07-03  8:19   ` Peng Liang
2022-07-05 18:29     ` Steven Sistare
2022-06-15 14:52 ` [PATCH V8 37/39] cpr: only-cpr-capable option Steve Sistare
2022-06-15 14:52 ` [PATCH V8 38/39] python/machine: add QEMUMachine accessors Steve Sistare
2022-06-17 14:16   ` John Snow
2022-07-05 18:30     ` Steven Sistare
2022-06-15 14:52 ` [PATCH V8 39/39] tests/avocado: add cpr regression test Steve Sistare
