All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V7 00/29] Live Update
@ 2021-12-22 19:05 Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 01/29] memory: qemu_check_ram_volatile Steve Sistare
                   ` (29 more replies)
  0 siblings, 30 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Provide the cpr-save, cpr-exec, and cpr-load commands for live update.
These save and restore VM state, with minimal guest pause time, so that
qemu may be updated to a new version in between.

cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
any type of guest image and block device, but the caller must not modify
guest block devices between cpr-save and cpr-load.  It supports two modes:
reboot and restart.

In reboot mode, the caller invokes cpr-save and then terminates qemu.
The caller may then update the host kernel and system software and reboot.
The caller resumes the guest by running qemu with the same arguments as the
original process and invoking cpr-load.  To use this mode, guest ram must be
mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.

The reboot mode supports vfio devices if the caller first suspends the
guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
guest drivers' suspend methods flush outstanding requests and re-initialize
the devices, and thus there is no device state to save and restore.

Restart mode preserves the guest VM across a restart of the qemu process.
After cpr-save, the caller passes qemu command-line arguments to cpr-exec,
which directly exec's the new qemu binary.  The arguments must include -S
so new qemu starts in a paused state and waits for the cpr-load command.
The restart mode supports vfio devices by preserving the vfio container,
group, device, and event descriptors across the qemu re-exec, and by
updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
and integrated in Linux kernel 5.12.

To use the restart mode, qemu must be started with the memfd-alloc option,
which allocates guest ram using memfd_create.  The memfd's are saved to
the environment and kept open across exec, after which they are found from
the environment and re-mmap'd.  Hence guest ram is preserved in place,
albeit with new virtual addresses in the qemu process.

The caller resumes the guest by invoking cpr-load, which loads state from
the file. If the VM was running at cpr-save time, then VM execution resumes.
If the VM was suspended at cpr-save time (reboot mode), then the caller must
issue a system_wakeup command to resume.

The first patches add reboot mode:
  - memory: qemu_check_ram_volatile
  - migration: fix populate_vfio_info
  - migration: qemu file wrappers
  - migration: simplify savevm
  - vl: start on wakeup request
  - cpr: reboot mode
  - cpr: reboot HMP interfaces

The next patches add restart mode:
  - memory: flat section iterator
  - oslib: qemu_clear_cloexec
  - machine: memfd-alloc option
  - qapi: list utility functions
  - vl: helper to request re-exec
  - cpr: preserve extra state
  - cpr: restart mode
  - cpr: restart HMP interfaces
  - hostmem-memfd: cpr for memory-backend-memfd

The next patches add vfio support for restart mode:
  - pci: export functions for cpr
  - vfio-pci: refactor for cpr
  - vfio-pci: cpr part 1 (fd and dma)
  - vfio-pci: cpr part 2 (msi)
  - vfio-pci: cpr part 3 (intx)
  - vfio-pci: recover from unmap-all-vaddr failure

The next patches preserve various descriptor-based backend devices across
cprexec:
  - loader: suppress rom_reset during cpr
  - vhost: reset vhost devices for cpr
  - chardev: cpr framework
  - chardev: cpr for simple devices
  - chardev: cpr for pty
  - chardev: cpr for sockets
  - cpr: only-cpr-capable option

Here is an example of updating qemu from v4.2.0 to v4.2.1 using
restart mode.  The software update is performed while the guest is
running to minimize downtime.

window 1                                        | window 2
                                                |
# qemu-system-x86_64 ...                        |
QEMU 4.2.0 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: running                              |
                                                | # yum update qemu
(qemu) cpr-save /tmp/qemu.sav restart           |
(qemu) cpr-exec qemu-system-x86_64 -S ...       |
QEMU 4.2.1 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: paused (prelaunch)                   |
(qemu) cpr-load /tmp/qemu.sav                   |
(qemu) info status                              |
VM status: running                              |


Here is an example of updating the host kernel using reboot mode.

window 1                                        | window 2
                                                |
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: running                              |
                                                | # yum update kernel-uek
(qemu) cpr-save /tmp/qemu.sav reboot            |
(qemu) quit                                     |
                                                |
# systemctl kexec                               |
kexec_core: Starting new kernel                 |
...                                             |
                                                |
# qemu-system-x86_64 -S mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...            |
(qemu) info status                              |
VM status: paused (prelaunch)                   |
(qemu) cpr-load /tmp/qemu.sav                   |
(qemu) info status                              |
VM status: running                              |

Changes from V1 to V2:
  - revert vmstate infrastructure changes
  - refactor cpr functions into new files
  - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to
    preserve memory.
  - add framework to filter chardev's that support cpr
  - save and restore vfio eventfd's
  - modify cprinfo QMP interface
  - incorporate misc review feedback
  - remove unrelated and unneeded patches
  - refactor all patches into a shorter and easier to review series

Changes from V2 to V3:
  - rebase to qemu 6.0.0
  - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
  - change memfd-alloc to a machine option
  - Use qio_channel_socket_new_fd instead of adding qio_channel_socket_new_fd
  - close monitor socket during cpr
  - fix a few unreported bugs
  - support memory-backend-memfd

Changes from V3 to V4:
  - split reboot mode into separate patches
  - add cprexec command
  - delete QEMU_START_FREEZE, argv_main, and /usr/bin/qemu-exec
  - add more checks for vfio and cpr compatibility, and recover after errors
  - save vfio pci config in vmstate
  - rename {setenv,getenv}_event_fd to {save,load}_event_fd
  - use qemu_strtol
  - change 6.0 references to 6.1
  - use strerror(), use EXIT_FAILURE, remove period from error messages
  - distribute MAINTAINERS additions to each patch

Changes from V4 to V5:
  - rebase to master

Changes from V5 to V6:
  vfio:
  - delete redundant bus_master_enable_region in vfio_pci_post_load
  - delete unmap.size warning
  - fix phys_config memory leak
  - add INTX support
  - add vfio_named_notifier_init() helper
  Other:
  - 6.1 -> 6.2
  - rename file -> filename in qapi
  - delete cprinfo.  qapi introspection serves the same purpose.
  - rename cprsave, cprexec, cprload -> cpr-save, cpr-exec, cpr-load
  - improve documentation in qapi/cpr.json
  - rename qemu_ram_volatile -> qemu_ram_check_volatile, and use
    qemu_ram_foreach_block
  - rename handle -> opaque
  - use ERRP_GUARD
  - use g_autoptr and g_autofree, and glib allocation functions
  - conform to error conventions for bool and int function return values
    and function names.
  - remove word "error" in error messages
  - rename as_flat_walk and its callback, and add comments.
  - rename qemu_clr_cloexec -> qemu_clear_cloexec
  - rename close-on-cpr -> reopen-on-cpr
  - add strList utility functions
  - factor out start on wakeup request to a separate patch
  - deleted unnecessary layer (cprsave etc) and squashed QMP patches
  - conditionally compile for CONFIG_VFIO

Changes from V6 to V7:
  vfio:
  - convert all event fd's to named event fd's with the same lifecycle and
    delete vfio_pci_pre_save
  - use vfio listener callback for updating vaddr and
    defer listener registration
  - update vaddr in vfio_dma_map
  - simplify iommu_type derivation
  - refactor recovery from unmap-all-vaddr failure to a separate patch
  - add vfio_pci_pre_load to handle non-emulated config bits
  - do not call VFIO_GROUP_SET_CONTAINER if reused
  - add comments for vfio cpr
  Other:
  - suppress rom_reset during cpr
  - more robust management of cpr mode
  - delete chardev fd's iff !reopen_on_cpr

Steve Sistare (26):
  memory: qemu_check_ram_volatile
  migration: fix populate_vfio_info
  migration: qemu file wrappers
  migration: simplify savevm
  vl: start on wakeup request
  cpr: reboot mode
  memory: flat section iterator
  oslib: qemu_clear_cloexec
  machine: memfd-alloc option
  qapi: list utility functions
  vl: helper to request re-exec
  cpr: preserve extra state
  cpr: restart mode
  cpr: restart HMP interfaces
  hostmem-memfd: cpr for memory-backend-memfd
  pci: export functions for cpr
  vfio-pci: refactor for cpr
  vfio-pci: cpr part 1 (fd and dma)
  vfio-pci: cpr part 2 (msi)
  vfio-pci: cpr part 3 (intx)
  vfio-pci: recover from unmap-all-vaddr failure
  loader: suppress rom_reset during cpr
  chardev: cpr framework
  chardev: cpr for simple devices
  chardev: cpr for pty
  cpr: only-cpr-capable option

Mark Kanda, Steve Sistare (3):
  cpr: reboot HMP interfaces
  vhost: reset vhost devices for cpr
  chardev: cpr for sockets

 MAINTAINERS                   |  12 ++
 backends/hostmem-memfd.c      |  21 +--
 chardev/char-mux.c            |   1 +
 chardev/char-null.c           |   1 +
 chardev/char-pty.c            |  16 +-
 chardev/char-serial.c         |   1 +
 chardev/char-socket.c         |  39 +++++
 chardev/char-stdio.c          |   8 +
 chardev/char.c                |  45 +++++-
 gdbstub.c                     |   1 +
 hmp-commands.hx               |  50 ++++++
 hw/core/loader.c              |   4 +-
 hw/core/machine.c             |  19 +++
 hw/pci/msix.c                 |  20 ++-
 hw/pci/pci.c                  |  13 +-
 hw/vfio/common.c              | 184 ++++++++++++++++++---
 hw/vfio/cpr.c                 | 129 +++++++++++++++
 hw/vfio/meson.build           |   1 +
 hw/vfio/pci.c                 | 368 +++++++++++++++++++++++++++++++++++++-----
 hw/vfio/trace-events          |   1 +
 hw/virtio/vhost.c             |  11 ++
 include/chardev/char.h        |   6 +
 include/exec/memory.h         |  39 +++++
 include/hw/boards.h           |   1 +
 include/hw/pci/msix.h         |   5 +
 include/hw/pci/pci.h          |   2 +
 include/hw/vfio/vfio-common.h |  10 ++
 include/hw/virtio/vhost.h     |   1 +
 include/migration/cpr.h       |  31 ++++
 include/monitor/hmp.h         |   3 +
 include/qapi/util.h           |  28 ++++
 include/qemu/osdep.h          |   1 +
 include/sysemu/runstate.h     |   2 +
 include/sysemu/sysemu.h       |   1 +
 migration/cpr-state.c         | 228 ++++++++++++++++++++++++++
 migration/cpr.c               | 167 +++++++++++++++++++
 migration/meson.build         |   2 +
 migration/migration.c         |   5 +
 migration/qemu-file-channel.c |  36 +++++
 migration/qemu-file-channel.h |   6 +
 migration/savevm.c            |  21 +--
 migration/target.c            |  24 ++-
 migration/trace-events        |   5 +
 monitor/hmp-cmds.c            |  68 ++++----
 monitor/hmp.c                 |   3 +
 monitor/qmp.c                 |   3 +
 qapi/char.json                |   7 +-
 qapi/cpr.json                 |  76 +++++++++
 qapi/meson.build              |   1 +
 qapi/qapi-schema.json         |   1 +
 qapi/qapi-util.c              |  37 +++++
 qemu-options.hx               |  40 ++++-
 softmmu/globals.c             |   1 +
 softmmu/memory.c              |  46 ++++++
 softmmu/physmem.c             |  55 +++++--
 softmmu/runstate.c            |  38 ++++-
 softmmu/vl.c                  |  18 ++-
 stubs/cpr-state.c             |  15 ++
 stubs/cpr.c                   |   3 +
 stubs/meson.build             |   2 +
 trace-events                  |   1 +
 util/oslib-posix.c            |   9 ++
 util/oslib-win32.c            |   4 +
 util/qemu-config.c            |   4 +
 64 files changed, 1852 insertions(+), 149 deletions(-)
 create mode 100644 hw/vfio/cpr.c
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr-state.c
 create mode 100644 migration/cpr.c
 create mode 100644 qapi/cpr.json
 create mode 100644 stubs/cpr-state.c
 create mode 100644 stubs/cpr.c

-- 
1.8.3.1



^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH V7 01/29] memory: qemu_check_ram_volatile
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-02-24 18:28   ` Dr. David Alan Gilbert
  2022-03-04 12:47   ` Philippe Mathieu-Daudé
  2021-12-22 19:05 ` [PATCH V7 02/29] migration: fix populate_vfio_info Steve Sistare
                   ` (28 subsequent siblings)
  29 siblings, 2 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add a function that returns an error if any ram_list block represents
volatile memory.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/memory.h |  8 ++++++++
 softmmu/memory.c      | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 20f1b27..137f5f3 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2981,6 +2981,14 @@ bool ram_block_discard_is_disabled(void);
  */
 bool ram_block_discard_is_required(void);
 
+/**
+ * qemu_ram_check_volatile: return 1 if any memory regions are writable and not
+ * backed by shared memory, else return 0.
+ *
+ * @errp: returned error message identifying the first volatile region found.
+ */
+int qemu_check_ram_volatile(Error **errp);
+
 #endif
 
 #endif
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 7340e19..30b2f68 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2837,6 +2837,32 @@ void memory_global_dirty_log_stop(unsigned int flags)
     memory_global_dirty_log_do_stop(flags);
 }
 
+static int check_volatile(RAMBlock *rb, void *opaque)
+{
+    MemoryRegion *mr = rb->mr;
+
+    if (mr &&
+        memory_region_is_ram(mr) &&
+        !memory_region_is_ram_device(mr) &&
+        !memory_region_is_rom(mr) &&
+        (rb->fd == -1 || !qemu_ram_is_shared(rb))) {
+        *(const char **)opaque = memory_region_name(mr);
+        return -1;
+    }
+    return 0;
+}
+
+int qemu_check_ram_volatile(Error **errp)
+{
+    char *name;
+
+    if (qemu_ram_foreach_block(check_volatile, &name)) {
+        error_setg(errp, "Memory region %s is volatile", name);
+        return -1;
+    }
+    return 0;
+}
+
 static void listener_add_address_space(MemoryListener *listener,
                                        AddressSpace *as)
 {
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 02/29] migration: fix populate_vfio_info
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 01/29] memory: qemu_check_ram_volatile Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-02-24 18:42   ` Peter Maydell
  2021-12-22 19:05 ` [PATCH V7 03/29] migration: qemu file wrappers Steve Sistare
                   ` (27 subsequent siblings)
  29 siblings, 1 reply; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Include CONFIG_DEVICES so that populate_vfio_info is instantiated for
CONFIG_VFIO.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/target.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/migration/target.c b/migration/target.c
index 907ebf0..4390bf0 100644
--- a/migration/target.c
+++ b/migration/target.c
@@ -8,18 +8,22 @@
 #include "qemu/osdep.h"
 #include "qapi/qapi-types-migration.h"
 #include "migration.h"
+#include CONFIG_DEVICES
 
 #ifdef CONFIG_VFIO
+
 #include "hw/vfio/vfio-common.h"
-#endif
 
 void populate_vfio_info(MigrationInfo *info)
 {
-#ifdef CONFIG_VFIO
     if (vfio_mig_active()) {
         info->has_vfio = true;
         info->vfio = g_malloc0(sizeof(*info->vfio));
         info->vfio->transferred = vfio_mig_bytes_transferred();
     }
-#endif
 }
+#else
+
+void populate_vfio_info(MigrationInfo *info) {}
+
+#endif /* CONFIG_VFIO */
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 03/29] migration: qemu file wrappers
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 01/29] memory: qemu_check_ram_volatile Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 02/29] migration: fix populate_vfio_info Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-02-24 18:21   ` Dr. David Alan Gilbert
  2021-12-22 19:05 ` [PATCH V7 04/29] migration: simplify savevm Steve Sistare
                   ` (26 subsequent siblings)
  29 siblings, 1 reply; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add qemu_file_open and qemu_fd_open to create QEMUFile objects for unix
files and file descriptors.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/qemu-file-channel.c | 36 ++++++++++++++++++++++++++++++++++++
 migration/qemu-file-channel.h |  6 ++++++
 2 files changed, 42 insertions(+)

diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
index bb5a575..afb16d7 100644
--- a/migration/qemu-file-channel.c
+++ b/migration/qemu-file-channel.c
@@ -27,8 +27,10 @@
 #include "qemu-file.h"
 #include "io/channel-socket.h"
 #include "io/channel-tls.h"
+#include "io/channel-file.h"
 #include "qemu/iov.h"
 #include "qemu/yank.h"
+#include "qapi/error.h"
 #include "yank_functions.h"
 
 
@@ -192,3 +194,37 @@ QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
     object_ref(OBJECT(ioc));
     return qemu_fopen_ops(ioc, &channel_output_ops, true);
 }
+
+QEMUFile *qemu_file_open(const char *path, int flags, int mode,
+                         const char *name, Error **errp)
+{
+    g_autoptr(QIOChannelFile) fioc = NULL;
+    QIOChannel *ioc;
+    QEMUFile *f;
+
+    if (flags & O_RDWR) {
+        error_setg(errp, "qemu_file_open %s: O_RDWR not supported", path);
+        return NULL;
+    }
+
+    fioc = qio_channel_file_new_path(path, flags, mode, errp);
+    if (!fioc) {
+        return NULL;
+    }
+
+    ioc = QIO_CHANNEL(fioc);
+    qio_channel_set_name(ioc, name);
+    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
+                             qemu_fopen_channel_input(ioc);
+    return f;
+}
+
+QEMUFile *qemu_fd_open(int fd, bool writable, const char *name)
+{
+    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
+    QIOChannel *ioc = QIO_CHANNEL(fioc);
+    QEMUFile *f = writable ? qemu_fopen_channel_output(ioc) :
+                             qemu_fopen_channel_input(ioc);
+    qio_channel_set_name(ioc, name);
+    return f;
+}
diff --git a/migration/qemu-file-channel.h b/migration/qemu-file-channel.h
index 0028a09..324ae2d 100644
--- a/migration/qemu-file-channel.h
+++ b/migration/qemu-file-channel.h
@@ -29,4 +29,10 @@
 
 QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc);
 QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc);
+
+QEMUFile *qemu_file_open(const char *path, int flags, int mode,
+                         const char *name, Error **errp);
+
+QEMUFile *qemu_fd_open(int fd, bool writable, const char *name);
+
 #endif
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 04/29] migration: simplify savevm
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (2 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 03/29] migration: qemu file wrappers Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-02-24 18:25   ` Dr. David Alan Gilbert
  2021-12-22 19:05 ` [PATCH V7 05/29] vl: start on wakeup request Steve Sistare
                   ` (25 subsequent siblings)
  29 siblings, 1 reply; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Use qemu_file_open to simplify a few functions in savevm.c.
No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/savevm.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 0bef031..c71d525 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2910,8 +2910,9 @@ bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
 void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
                                 Error **errp)
 {
+    const char *ioc_name = "migration-xen-save-state";
+    int flags = O_WRONLY | O_CREAT | O_TRUNC;
     QEMUFile *f;
-    QIOChannelFile *ioc;
     int saved_vm_running;
     int ret;
 
@@ -2925,14 +2926,10 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
     vm_stop(RUN_STATE_SAVE_VM);
     global_state_store_running();
 
-    ioc = qio_channel_file_new_path(filename, O_WRONLY | O_CREAT | O_TRUNC,
-                                    0660, errp);
-    if (!ioc) {
+    f = qemu_file_open(filename, flags, 0660, ioc_name, errp);
+    if (!f) {
         goto the_end;
     }
-    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-save-state");
-    f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
-    object_unref(OBJECT(ioc));
     ret = qemu_save_device_state(f);
     if (ret < 0 || qemu_fclose(f) < 0) {
         error_setg(errp, QERR_IO_ERROR);
@@ -2960,8 +2957,8 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
 
 void qmp_xen_load_devices_state(const char *filename, Error **errp)
 {
+    const char *ioc_name = "migration-xen-load-state";
     QEMUFile *f;
-    QIOChannelFile *ioc;
     int ret;
 
     /* Guest must be paused before loading the device state; the RAM state
@@ -2973,14 +2970,10 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
     }
     vm_stop(RUN_STATE_RESTORE_VM);
 
-    ioc = qio_channel_file_new_path(filename, O_RDONLY | O_BINARY, 0, errp);
-    if (!ioc) {
+    f = qemu_file_open(filename, O_RDONLY | O_BINARY, 0, ioc_name, errp);
+    if (!f) {
         return;
     }
-    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-load-state");
-    f = qemu_fopen_channel_input(QIO_CHANNEL(ioc));
-    object_unref(OBJECT(ioc));
-
     ret = qemu_loadvm_state(f);
     qemu_fclose(f);
     if (ret < 0) {
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 05/29] vl: start on wakeup request
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (3 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 04/29] migration: simplify savevm Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-02-24 18:51   ` Dr. David Alan Gilbert
  2021-12-22 19:05 ` [PATCH V7 06/29] cpr: reboot mode Steve Sistare
                   ` (24 subsequent siblings)
  29 siblings, 1 reply; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

If qemu starts and loads a VM in the suspended state, then a later wakeup
request will set the state to running, which is not sufficient to initialize
the vm, as vm_start was never called during this invocation of qemu.  See
qemu_system_wakeup_request().

Define the start_on_wakeup_requested() hook to cause vm_start() to be called
when processing the wakeup request.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/sysemu/runstate.h |  1 +
 softmmu/runstate.c        | 17 ++++++++++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index a535691..b655c7b 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -51,6 +51,7 @@ void qemu_system_reset_request(ShutdownCause reason);
 void qemu_system_suspend_request(void);
 void qemu_register_suspend_notifier(Notifier *notifier);
 bool qemu_wakeup_suspend_enabled(void);
+void qemu_system_start_on_wakeup_request(void);
 void qemu_system_wakeup_request(WakeupReason reason, Error **errp);
 void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
 void qemu_register_wakeup_notifier(Notifier *notifier);
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 10d9b73..3d344c9 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -115,6 +115,8 @@ static const RunStateTransition runstate_transitions_def[] = {
     { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING },
     { RUN_STATE_PRELAUNCH, RUN_STATE_FINISH_MIGRATE },
     { RUN_STATE_PRELAUNCH, RUN_STATE_INMIGRATE },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_SUSPENDED },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_PAUSED },
 
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PAUSED },
@@ -335,6 +337,7 @@ void vm_state_notify(bool running, RunState state)
     }
 }
 
+static bool start_on_wakeup_requested;
 static ShutdownCause reset_requested;
 static ShutdownCause shutdown_requested;
 static int shutdown_signal;
@@ -562,6 +565,11 @@ void qemu_register_suspend_notifier(Notifier *notifier)
     notifier_list_add(&suspend_notifiers, notifier);
 }
 
+void qemu_system_start_on_wakeup_request(void)
+{
+    start_on_wakeup_requested = true;
+}
+
 void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
 {
     trace_system_wakeup_request(reason);
@@ -574,7 +582,14 @@ void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
     if (!(wakeup_reason_mask & (1 << reason))) {
         return;
     }
-    runstate_set(RUN_STATE_RUNNING);
+
+    if (start_on_wakeup_requested) {
+        start_on_wakeup_requested = false;
+        vm_start();
+    } else {
+        runstate_set(RUN_STATE_RUNNING);
+    }
+
     wakeup_reason = reason;
     qemu_notify_event();
 }
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 06/29] cpr: reboot mode
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (4 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 05/29] vl: start on wakeup request Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 07/29] cpr: reboot HMP interfaces Steve Sistare
                   ` (23 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Provide the cpr-save and cpr-load functions for live update.  These save and
restore VM state, with minimal guest pause time, so that qemu may be updated
to a new version in between.

cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
any type of guest image and block device, but the caller must not modify
guest block devices between cpr-save and cpr-load.

cpr-save supports several modes, the first of which is reboot. In this mode,
the caller invokes cpr-save and then terminates qemu.  The caller may then
update the host kernel and system software and reboot.  The caller resumes
the guest by running qemu with the same arguments as the original process
and invoking cpr-load.  To use this mode, guest ram must be mapped to a
persistent shared memory file such as /dev/dax0.0 or /dev/shm PKRAM.

The reboot mode supports vfio devices if the caller first suspends the
guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
guest drivers' suspend methods flush outstanding requests and re-initialize
the devices, and thus there is no device state to save and restore.

cpr-load loads state from the file.  If the VM was running at cpr-save time,
then VM execution resumes.  If the VM was suspended at cpr-save time, then
the caller must issue a system_wakeup command to resume.

cpr-save syntax:
  { 'enum': 'CprMode', 'data': [ 'reboot' ] }
  { 'command': 'cpr-save', 'data': { 'filename': 'str', 'mode': 'CprMode' }}

cpr-load syntax:
  { 'command': 'cpr-load', 'data': { 'filename': 'str' } }

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS             |   8 +++
 include/migration/cpr.h |  17 +++++++
 migration/cpr.c         | 128 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build   |   1 +
 qapi/cpr.json           |  56 +++++++++++++++++++++
 qapi/meson.build        |   1 +
 qapi/qapi-schema.json   |   1 +
 7 files changed, 212 insertions(+)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr.c
 create mode 100644 qapi/cpr.json

diff --git a/MAINTAINERS b/MAINTAINERS
index dc4b6f7..3c53b0d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2988,6 +2988,14 @@ F: net/colo*
 F: net/filter-rewriter.c
 F: net/filter-mirror.c
 
+CPR
+M: Steve Sistare <steven.sistare@oracle.com>
+M: Mark Kanda <mark.kanda@oracle.com>
+S: Maintained
+F: include/migration/cpr.h
+F: migration/cpr.c
+F: qapi/cpr.json
+
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
 R: Paolo Bonzini <pbonzini@redhat.com>
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
new file mode 100644
index 0000000..0f27b61
--- /dev/null
+++ b/include/migration/cpr.h
@@ -0,0 +1,17 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef MIGRATION_CPR_H
+#define MIGRATION_CPR_H
+
+#include "qapi/qapi-types-cpr.h"
+
+#define CPR_MODE_NONE ((CprMode)(-1))
+
+static void cpr_set_mode(CprMode mode) {}   /* no-op until a later patch */
+
+#endif
diff --git a/migration/cpr.c b/migration/cpr.c
new file mode 100644
index 0000000..ca76124
--- /dev/null
+++ b/migration/cpr.c
@@ -0,0 +1,128 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "exec/memory.h"
+#include "io/channel-buffer.h"
+#include "io/channel-file.h"
+#include "migration.h"
+#include "migration/cpr.h"
+#include "migration/global_state.h"
+#include "migration/misc.h"
+#include "migration/snapshot.h"
+#include "qapi/error.h"
+#include "qapi/qapi-commands-cpr.h"
+#include "qapi/qmp/qerror.h"
+#include "qemu-file-channel.h"
+#include "qemu-file.h"
+#include "savevm.h"
+#include "sysemu/cpu-timers.h"
+#include "sysemu/replay.h"
+#include "sysemu/runstate.h"
+#include "sysemu/runstate-action.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/xen.h"
+
+void qmp_cpr_save(const char *filename, CprMode mode, Error **errp)
+{
+    int ret;
+    QEMUFile *f;
+    int flags = O_CREAT | O_WRONLY | O_TRUNC;
+    int saved_vm_running = runstate_is_running();
+
+    if (qemu_check_ram_volatile(errp)) {
+        return;
+    }
+
+    if (migrate_colo_enabled()) {
+        error_setg(errp, "cpr-save does not support x-colo");
+        return;
+    }
+
+    if (replay_mode != REPLAY_MODE_NONE) {
+        error_setg(errp, "cpr-save does not support replay");
+        return;
+    }
+
+    if (global_state_store()) {
+        error_setg(errp, "Error saving global state");
+        return;
+    }
+
+    f = qemu_file_open(filename, flags, 0600, "cpr-save", errp);
+    if (!f) {
+        return;
+    }
+
+    if (runstate_check(RUN_STATE_SUSPENDED)) {
+        /* Update timers_state before saving.  Suspend did not so do. */
+        cpu_disable_ticks();
+    }
+    vm_stop(RUN_STATE_SAVE_VM);
+
+    cpr_set_mode(mode);
+    ret = qemu_save_device_state(f);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_setg(errp, "Error %d while saving VM state", ret);
+        goto err;
+    }
+
+    return;
+
+err:
+    if (saved_vm_running) {
+        vm_start();
+    }
+    cpr_set_mode(CPR_MODE_NONE);
+}
+
+void qmp_cpr_load(const char *filename, Error **errp)
+{
+    QEMUFile *f;
+    int ret;
+    RunState state;
+
+    if (runstate_is_running()) {
+        error_setg(errp, "cpr-load called for a running VM");
+        return;
+    }
+
+    f = qemu_file_open(filename, O_RDONLY, 0, "cpr-load", errp);
+    if (!f) {
+        return;
+    }
+
+    if (qemu_get_be32(f) != QEMU_VM_FILE_MAGIC ||
+        qemu_get_be32(f) != QEMU_VM_FILE_VERSION) {
+        error_setg(errp, "%s is not a vmstate file", filename);
+        qemu_fclose(f);
+        return;
+    }
+
+    cpr_set_mode(CPR_MODE_REBOOT);
+    ret = qemu_load_device_state(f);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_setg(errp, "Error %d while loading VM state", ret);
+        goto out;
+    }
+
+    state = global_state_get_runstate();
+    if (state == RUN_STATE_RUNNING) {
+        vm_start();
+    } else {
+        runstate_set(state);
+        if (runstate_check(RUN_STATE_SUSPENDED)) {
+            /* Force vm_start to be called later. */
+            qemu_system_start_on_wakeup_request();
+        }
+    }
+
+out:
+    cpr_set_mode(CPR_MODE_NONE);
+}
diff --git a/migration/meson.build b/migration/meson.build
index f8714dc..fd59281 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -15,6 +15,7 @@ softmmu_ss.add(files(
   'channel.c',
   'colo-failover.c',
   'colo.c',
+  'cpr.c',
   'exec.c',
   'fd.c',
   'global_state.c',
diff --git a/qapi/cpr.json b/qapi/cpr.json
new file mode 100644
index 0000000..2edd08e
--- /dev/null
+++ b/qapi/cpr.json
@@ -0,0 +1,56 @@
+# -*- Mode: Python -*-
+#
+# Copyright (c) 2021 Oracle and/or its affiliates.
+#
+# This work is licensed under the terms of the GNU GPL, version 2.
+# See the COPYING file in the top-level directory.
+
+##
+# = CPR - CheckPoint and Restart
+##
+
+{ 'include': 'common.json' }
+
+##
+# @CprMode:
+#
+# @reboot: checkpoint can be cpr-load'ed after a host kexec reboot.
+#
+# Since: 6.2
+##
+{ 'enum': 'CprMode',
+  'data': [ 'reboot' ] }
+
+##
+# @cpr-save:
+#
+# Create a checkpoint of the virtual machine device state in @filename.
+# Unlike snapshot-save, this command completes synchronously, saves state
+# to an ordinary file, and does not save guest RAM or guest block device
+# blocks.  The caller must not modify guest block devices between cpr-save
+# and cpr-load.
+#
+# For reboot mode, all guest RAM objects must be non-volatile across reboot,
+# and created with the share=on parameter.
+#
+# @filename: name of checkpoint file
+# @mode: @CprMode mode
+#
+# Since: 6.2
+##
+{ 'command': 'cpr-save',
+  'data': { 'filename': 'str',
+            'mode': 'CprMode' } }
+
+##
+# @cpr-load:
+#
+# Start virtual machine from checkpoint file that was created earlier using
+# the cpr-save command.
+#
+# @filename: name of checkpoint file
+#
+# Since: 6.2
+##
+{ 'command': 'cpr-load',
+  'data': { 'filename': 'str' } }
diff --git a/qapi/meson.build b/qapi/meson.build
index c0c49c1..8d5515d 100644
--- a/qapi/meson.build
+++ b/qapi/meson.build
@@ -30,6 +30,7 @@ qapi_all_modules = [
   'common',
   'compat',
   'control',
+  'cpr',
   'crypto',
   'dump',
   'error',
diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json
index 4912b97..001d790 100644
--- a/qapi/qapi-schema.json
+++ b/qapi/qapi-schema.json
@@ -77,6 +77,7 @@
 { 'include': 'ui.json' }
 { 'include': 'authz.json' }
 { 'include': 'migration.json' }
+{ 'include': 'cpr.json' }
 { 'include': 'transaction.json' }
 { 'include': 'trace.json' }
 { 'include': 'compat.json' }
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 07/29] cpr: reboot HMP interfaces
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (5 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 06/29] cpr: reboot mode Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 08/29] memory: flat section iterator Steve Sistare
                   ` (22 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

cpr-save <filename> <mode>
  Call qmp_cpr_save().
  Arguments:
    filename : save vmstate to filename
    mode: must be "reboot"

cpr-load <filename>
  Call qmp_cpr_load().
  Arguments:
    filename : load vmstate from filename

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hmp-commands.hx       | 31 +++++++++++++++++++++++++++++++
 include/monitor/hmp.h |  2 ++
 monitor/hmp-cmds.c    | 28 ++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 70a9136..350c886 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -351,6 +351,37 @@ SRST
 ERST
 
     {
+        .name       = "cpr-save",
+        .args_type  = "filename:s,mode:s",
+        .params     = "filename 'reboot'",
+        .help       = "create a checkpoint of the VM in file",
+        .cmd        = hmp_cpr_save,
+    },
+
+SRST
+``cpr-save`` *filename* *mode*
+Pause the VCPUs,
+create a checkpoint of the whole virtual machine, and save it in *filename*.
+If *mode* is 'reboot', the checkpoint remains valid after a host kexec
+reboot, and guest ram must be backed by persistent shared memory.  To
+resume from the checkpoint, issue the quit command, reboot the system,
+and issue the cpr-load command.
+ERST
+
+    {
+        .name       = "cpr-load",
+        .args_type  = "filename:s",
+        .params     = "filename",
+        .help       = "load VM checkpoint from file",
+        .cmd        = hmp_cpr_load,
+    },
+
+SRST
+``cpr-load`` *filename*
+Load a virtual machine from checkpoint file *filename* and continue VCPUs.
+ERST
+
+    {
         .name       = "delvm",
         .args_type  = "name:s",
         .params     = "tag",
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 96d0148..b44588e 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -59,6 +59,8 @@ void hmp_balloon(Monitor *mon, const QDict *qdict);
 void hmp_loadvm(Monitor *mon, const QDict *qdict);
 void hmp_savevm(Monitor *mon, const QDict *qdict);
 void hmp_delvm(Monitor *mon, const QDict *qdict);
+void hmp_cpr_save(Monitor *mon, const QDict *qdict);
+void hmp_cpr_load(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
 void hmp_migrate_incoming(Monitor *mon, const QDict *qdict);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 2669156..b8c22da 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -33,6 +33,7 @@
 #include "qapi/qapi-commands-block.h"
 #include "qapi/qapi-commands-char.h"
 #include "qapi/qapi-commands-control.h"
+#include "qapi/qapi-commands-cpr.h"
 #include "qapi/qapi-commands-machine.h"
 #include "qapi/qapi-commands-migration.h"
 #include "qapi/qapi-commands-misc.h"
@@ -1110,6 +1111,33 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
     qapi_free_AnnounceParameters(params);
 }
 
+void hmp_cpr_save(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    const char *mode;
+    int val;
+
+    mode = qdict_get_try_str(qdict, "mode");
+    val = qapi_enum_parse(&CprMode_lookup, mode, -1, &err);
+
+    if (val == -1) {
+        goto out;
+    }
+
+    qmp_cpr_save(qdict_get_try_str(qdict, "filename"), val, &err);
+
+out:
+    hmp_handle_error(mon, err);
+}
+
+void hmp_cpr_load(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+
+    qmp_cpr_load(qdict_get_try_str(qdict, "filename"), &err);
+    hmp_handle_error(mon, err);
+}
+
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict)
 {
     qmp_migrate_cancel(NULL);
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 08/29] memory: flat section iterator
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (6 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 07/29] cpr: reboot HMP interfaces Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-03-04 12:48   ` Philippe Mathieu-Daudé
  2022-03-09 14:18   ` Marc-André Lureau
  2021-12-22 19:05 ` [PATCH V7 09/29] oslib: qemu_clear_cloexec Steve Sistare
                   ` (21 subsequent siblings)
  29 siblings, 2 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add an iterator over the sections of a flattened address space.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/exec/memory.h | 31 +++++++++++++++++++++++++++++++
 softmmu/memory.c      | 20 ++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 137f5f3..9660475 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2338,6 +2338,37 @@ void memory_region_set_ram_discard_manager(MemoryRegion *mr,
                                            RamDiscardManager *rdm);
 
 /**
+ * memory_region_section_cb: callback for address_space_flat_for_each_section()
+ *
+ * @s: MemoryRegionSection of the range
+ * @opaque: data pointer passed to address_space_flat_for_each_section()
+ * @errp: error message, returned to the address_space_flat_for_each_section
+ *        caller.
+ *
+ * Returns: non-zero to stop the iteration, and 0 to continue.  The same
+ * non-zero value is returned to the address_space_flat_for_each_section caller.
+ */
+
+typedef int (*memory_region_section_cb)(MemoryRegionSection *s,
+                                        void *opaque,
+                                        Error **errp);
+
+/**
+ * address_space_flat_for_each_section: walk the ranges in the address space
+ * flat view and call @func for each.  Return 0 on success, else return non-zero
+ * with a message in @errp.
+ *
+ * @as: target address space
+ * @func: callback function
+ * @opaque: passed to @func
+ * @errp: passed to @func
+ */
+int address_space_flat_for_each_section(AddressSpace *as,
+                                        memory_region_section_cb func,
+                                        void *opaque,
+                                        Error **errp);
+
+/**
  * memory_region_find: translate an address/size relative to a
  * MemoryRegion into a #MemoryRegionSection.
  *
diff --git a/softmmu/memory.c b/softmmu/memory.c
index 30b2f68..40f3522 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2663,6 +2663,26 @@ bool memory_region_is_mapped(MemoryRegion *mr)
     return mr->container ? true : false;
 }
 
+int address_space_flat_for_each_section(AddressSpace *as,
+                                        memory_region_section_cb func,
+                                        void *opaque,
+                                        Error **errp)
+{
+    FlatView *view = address_space_get_flatview(as);
+    FlatRange *fr;
+    int ret;
+
+    FOR_EACH_FLAT_RANGE(fr, view) {
+        MemoryRegionSection section = section_from_flat_range(fr, view);
+        ret = func(&section, opaque, errp);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 /* Same as memory_region_find, but it does not add a reference to the
  * returned region.  It must be called from an RCU critical section.
  */
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 09/29] oslib: qemu_clear_cloexec
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (7 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 08/29] memory: flat section iterator Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 10/29] machine: memfd-alloc option Steve Sistare
                   ` (20 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Define qemu_clear_cloexec, analogous to qemu_set_cloexec.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/qemu/osdep.h | 1 +
 util/oslib-posix.c   | 9 +++++++++
 util/oslib-win32.c   | 4 ++++
 3 files changed, 14 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 60718fc..1ad7714 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -637,6 +637,7 @@ static inline void qemu_timersub(const struct timeval *val1,
 #endif
 
 void qemu_set_cloexec(int fd);
+void qemu_clear_cloexec(int fd);
 
 /* Starting on QEMU 2.5, qemu_hw_version() returns "2.5+" by default
  * instead of QEMU_VERSION, so setting hw_version on MachineClass
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index e8bdb02..7913334 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -309,6 +309,15 @@ void qemu_set_cloexec(int fd)
     assert(f != -1);
 }
 
+void qemu_clear_cloexec(int fd)
+{
+    int f;
+    f = fcntl(fd, F_GETFD);
+    assert(f != -1);
+    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
+    assert(f != -1);
+}
+
 /*
  * Creates a pipe with FD_CLOEXEC set on both file descriptors
  */
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index af559ef..acc3e06 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -265,6 +265,10 @@ void qemu_set_cloexec(int fd)
 {
 }
 
+void qemu_clear_cloexec(int fd)
+{
+}
+
 /* Offset between 1/1/1601 and 1/1/1970 in 100 nanosec units */
 #define _W32_FT_OFFSET (116444736000000000ULL)
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 10/29] machine: memfd-alloc option
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (8 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 09/29] oslib: qemu_clear_cloexec Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-02-18  8:05   ` Guoyi Tu
                     ` (3 more replies)
  2021-12-22 19:05 ` [PATCH V7 11/29] qapi: list utility functions Steve Sistare
                   ` (19 subsequent siblings)
  29 siblings, 4 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Allocate anonymous memory using memfd_create if the memfd-alloc machine
option is set.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/core/machine.c   | 19 +++++++++++++++++++
 include/hw/boards.h |  1 +
 qemu-options.hx     |  6 ++++++
 softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
 softmmu/vl.c        |  1 +
 trace-events        |  1 +
 util/qemu-config.c  |  4 ++++
 7 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 53a99ab..7739d88 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
     ms->mem_merge = value;
 }
 
+static bool machine_get_memfd_alloc(Object *obj, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    return ms->memfd_alloc;
+}
+
+static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
+{
+    MachineState *ms = MACHINE(obj);
+
+    ms->memfd_alloc = value;
+}
+
 static bool machine_get_usb(Object *obj, Error **errp)
 {
     MachineState *ms = MACHINE(obj);
@@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
     object_class_property_set_description(oc, "mem-merge",
         "Enable/disable memory merge support");
 
+    object_class_property_add_bool(oc, "memfd-alloc",
+        machine_get_memfd_alloc, machine_set_memfd_alloc);
+    object_class_property_set_description(oc, "memfd-alloc",
+        "Enable/disable allocating anonymous memory using memfd_create");
+
     object_class_property_add_bool(oc, "usb",
         machine_get_usb, machine_set_usb);
     object_class_property_set_description(oc, "usb",
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 9c1c190..a57d7a0 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -327,6 +327,7 @@ struct MachineState {
     char *dt_compatible;
     bool dump_guest_core;
     bool mem_merge;
+    bool memfd_alloc;
     bool usb;
     bool usb_disabled;
     char *firmware;
diff --git a/qemu-options.hx b/qemu-options.hx
index 7d47510..33c8173 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
     "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
     "                mem-merge=on|off controls memory merge support (default: on)\n"
+    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
     "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
     "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
     "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
@@ -76,6 +77,11 @@ SRST
         supported by the host, de-duplicates identical memory pages
         among VMs instances (enabled by default).
 
+    ``memfd-alloc=on|off``
+        Enables or disables allocation of anonymous guest RAM using
+        memfd_create.  Any associated memory-backend objects are created with
+        share=on.  The memfd-alloc default is off.
+
     ``aes-key-wrap=on|off``
         Enables or disables AES key wrapping support on s390-ccw hosts.
         This feature controls whether AES wrapping keys will be created
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 3524c04..95e2b49 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -41,6 +41,7 @@
 #include "qemu/config-file.h"
 #include "qemu/error-report.h"
 #include "qemu/qemu-print.h"
+#include "qemu/memfd.h"
 #include "exec/memory.h"
 #include "exec/ioport.h"
 #include "sysemu/dma.h"
@@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
     const bool shared = qemu_ram_is_shared(new_block);
     RAMBlock *block;
     RAMBlock *last_block = NULL;
+    struct MemoryRegion *mr = new_block->mr;
     ram_addr_t old_ram_size, new_ram_size;
     Error *err = NULL;
+    const char *name;
+    void *addr = 0;
+    size_t maxlen;
+    MachineState *ms = MACHINE(qdev_get_machine());
 
     old_ram_size = last_ram_page();
 
     qemu_mutex_lock_ramlist();
-    new_block->offset = find_ram_offset(new_block->max_length);
+    maxlen = new_block->max_length;
+    new_block->offset = find_ram_offset(maxlen);
 
     if (!new_block->host) {
         if (xen_enabled()) {
-            xen_ram_alloc(new_block->offset, new_block->max_length,
-                          new_block->mr, &err);
+            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
             if (err) {
                 error_propagate(errp, err);
                 qemu_mutex_unlock_ramlist();
                 return;
             }
         } else {
-            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
-                                                  &new_block->mr->align,
-                                                  shared, noreserve);
-            if (!new_block->host) {
+            name = memory_region_name(mr);
+            if (ms->memfd_alloc) {
+                Object *parent = &mr->parent_obj;
+                int mfd = -1;          /* placeholder until next patch */
+                mr->align = QEMU_VMALLOC_ALIGN;
+                if (mfd < 0) {
+                    mfd = qemu_memfd_create(name, maxlen + mr->align,
+                                            0, 0, 0, &err);
+                    if (mfd < 0) {
+                        return;
+                    }
+                }
+                qemu_set_cloexec(mfd);
+                /* The memory backend already set its desired flags. */
+                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
+                    new_block->flags |= RAM_SHARED;
+                }
+                addr = file_ram_alloc(new_block, maxlen, mfd,
+                                      false, false, 0, errp);
+                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
+            } else {
+                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
+                                           shared, noreserve);
+            }
+
+            if (!addr) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
-                                 memory_region_name(new_block->mr));
+                                 name);
                 qemu_mutex_unlock_ramlist();
                 return;
             }
-            memory_try_enable_merging(new_block->host, new_block->max_length);
+            memory_try_enable_merging(addr, maxlen);
+            new_block->host = addr;
         }
     }
 
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 620a1f1..ab3648a 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
         object_property_set_str(obj, "mem-path", path, &error_fatal);
     }
     object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
+    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
     object_property_add_child(object_get_objects_root(), mc->default_ram_id,
                               obj);
     /* Ensure backend's memory region name is equal to mc->default_ram_id */
diff --git a/trace-events b/trace-events
index a637a61..770a9ac 100644
--- a/trace-events
+++ b/trace-events
@@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
 # accel/tcg/cputlb.c
 memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
 memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
+anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
 
 # gdbstub.c
 gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
diff --git a/util/qemu-config.c b/util/qemu-config.c
index 436ab63..3606e5c 100644
--- a/util/qemu-config.c
+++ b/util/qemu-config.c
@@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "enable/disable memory merge support",
         },{
+            .name = "memfd-alloc",
+            .type = QEMU_OPT_BOOL,
+            .help = "enable/disable memfd_create for anonymous memory",
+        },{
             .name = "usb",
             .type = QEMU_OPT_BOOL,
             .help = "Set on/off to enable/disable usb",
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 11/29] qapi: list utility functions
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (9 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 10/29] machine: memfd-alloc option Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-03-09 14:11   ` Marc-André Lureau
  2021-12-22 19:05 ` [PATCH V7 12/29] vl: helper to request re-exec Steve Sistare
                   ` (18 subsequent siblings)
  29 siblings, 1 reply; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Generalize strList_from_comma_list() to take any delimiter character, rename
as strList_from_string(), and move it to qapi/util.c.  Also add
strv_from_strList() and QAPI_LIST_LENGTH().

No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/qapi/util.h | 28 ++++++++++++++++++++++++++++
 monitor/hmp-cmds.c  | 29 ++---------------------------
 qapi/qapi-util.c    | 37 +++++++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+), 27 deletions(-)

diff --git a/include/qapi/util.h b/include/qapi/util.h
index 81a2b13..c249108 100644
--- a/include/qapi/util.h
+++ b/include/qapi/util.h
@@ -22,6 +22,8 @@ typedef struct QEnumLookup {
     const int size;
 } QEnumLookup;
 
+struct strList;
+
 const char *qapi_enum_lookup(const QEnumLookup *lookup, int val);
 int qapi_enum_parse(const QEnumLookup *lookup, const char *buf,
                     int def, Error **errp);
@@ -31,6 +33,19 @@ bool qapi_bool_parse(const char *name, const char *value, bool *obj,
 int parse_qapi_name(const char *name, bool complete);
 
 /*
+ * Produce and return a NULL-terminated array of strings from @args.
+ * All strings are g_strdup'd.
+ */
+char **strv_from_strList(const struct strList *args);
+
+/*
+ * Produce a strList from the character delimited string @in.
+ * All strings are g_strdup'd.
+ * A NULL or empty input string returns NULL.
+ */
+struct strList *strList_from_string(const char *in, char delim);
+
+/*
  * For any GenericList @list, insert @element at the front.
  *
  * Note that this macro evaluates @element exactly once, so it is safe
@@ -56,4 +71,17 @@ int parse_qapi_name(const char *name, bool complete);
     (tail) = &(*(tail))->next; \
 } while (0)
 
+/*
+ * For any GenericList @list, return its length.
+ */
+#define QAPI_LIST_LENGTH(list) \
+    ({ \
+        int len = 0; \
+        typeof(list) elem; \
+        for (elem = list; elem != NULL; elem = elem->next) { \
+            len++; \
+        } \
+        len; \
+    })
+
 #endif
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index b8c22da..5ca8b4b 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -43,6 +43,7 @@
 #include "qapi/qapi-commands-run-state.h"
 #include "qapi/qapi-commands-tpm.h"
 #include "qapi/qapi-commands-ui.h"
+#include "qapi/util.h"
 #include "qapi/qapi-visit-net.h"
 #include "qapi/qapi-visit-migration.h"
 #include "qapi/qmp/qdict.h"
@@ -70,32 +71,6 @@ bool hmp_handle_error(Monitor *mon, Error *err)
     return false;
 }
 
-/*
- * Produce a strList from a comma separated list.
- * A NULL or empty input string return NULL.
- */
-static strList *strList_from_comma_list(const char *in)
-{
-    strList *res = NULL;
-    strList **tail = &res;
-
-    while (in && in[0]) {
-        char *comma = strchr(in, ',');
-        char *value;
-
-        if (comma) {
-            value = g_strndup(in, comma - in);
-            in = comma + 1; /* skip the , */
-        } else {
-            value = g_strdup(in);
-            in = NULL;
-        }
-        QAPI_LIST_APPEND(tail, value);
-    }
-
-    return res;
-}
-
 void hmp_info_name(Monitor *mon, const QDict *qdict)
 {
     NameInfo *info;
@@ -1103,7 +1078,7 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
                                             migrate_announce_params());
 
     qapi_free_strList(params->interfaces);
-    params->interfaces = strList_from_comma_list(interfaces_str);
+    params->interfaces = strList_from_string(interfaces_str, ',');
     params->has_interfaces = params->interfaces != NULL;
     params->id = g_strdup(id);
     params->has_id = !!params->id;
diff --git a/qapi/qapi-util.c b/qapi/qapi-util.c
index fda7044..edd51b3 100644
--- a/qapi/qapi-util.c
+++ b/qapi/qapi-util.c
@@ -15,6 +15,7 @@
 #include "qapi/error.h"
 #include "qemu/ctype.h"
 #include "qapi/qmp/qerror.h"
+#include "qapi/qapi-builtin-types.h"
 
 CompatPolicy compat_policy;
 
@@ -152,3 +153,39 @@ int parse_qapi_name(const char *str, bool complete)
     }
     return p - str;
 }
+
+char **strv_from_strList(const strList *args)
+{
+    const strList *arg;
+    int i = 0;
+    char **argv = g_malloc((QAPI_LIST_LENGTH(args) + 1) * sizeof(char *));
+
+    for (arg = args; arg != NULL; arg = arg->next) {
+        argv[i++] = g_strdup(arg->value);
+    }
+    argv[i] = NULL;
+
+    return argv;
+}
+
+strList *strList_from_string(const char *in, char delim)
+{
+    strList *res = NULL;
+    strList **tail = &res;
+
+    while (in && in[0]) {
+        char *next = strchr(in, delim);
+        char *value;
+
+        if (next) {
+            value = g_strndup(in, next - in);
+            in = next + 1; /* skip the delim */
+        } else {
+            value = g_strdup(in);
+            in = NULL;
+        }
+        QAPI_LIST_APPEND(tail, value);
+    }
+
+    return res;
+}
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 12/29] vl: helper to request re-exec
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (10 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 11/29] qapi: list utility functions Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-03-09 14:16   ` Marc-André Lureau
  2021-12-22 19:05 ` [PATCH V7 13/29] cpr: preserve extra state Steve Sistare
                   ` (17 subsequent siblings)
  29 siblings, 1 reply; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add a qemu_system_exec_request() hook that causes the main loop to exit and
re-exec qemu using the specified arguments.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/sysemu/runstate.h |  1 +
 softmmu/runstate.c        | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index b655c7b..198211b 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -57,6 +57,7 @@ void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
 void qemu_register_wakeup_notifier(Notifier *notifier);
 void qemu_register_wakeup_support(void);
 void qemu_system_shutdown_request(ShutdownCause reason);
+void qemu_system_exec_request(const strList *args);
 void qemu_system_powerdown_request(void);
 void qemu_register_powerdown_notifier(Notifier *notifier);
 void qemu_register_shutdown_notifier(Notifier *notifier);
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 3d344c9..309a4bf 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -38,6 +38,7 @@
 #include "monitor/monitor.h"
 #include "net/net.h"
 #include "net/vhost_net.h"
+#include "qapi/util.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-run-state.h"
 #include "qapi/qapi-events-run-state.h"
@@ -355,6 +356,7 @@ static NotifierList wakeup_notifiers =
 static NotifierList shutdown_notifiers =
     NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
 static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
+static char **exec_argv;
 
 ShutdownCause qemu_shutdown_requested_get(void)
 {
@@ -371,6 +373,11 @@ static int qemu_shutdown_requested(void)
     return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
 }
 
+static int qemu_exec_requested(void)
+{
+    return exec_argv != NULL;
+}
+
 static void qemu_kill_report(void)
 {
     if (!qtest_driver() && shutdown_signal) {
@@ -641,6 +648,13 @@ void qemu_system_shutdown_request(ShutdownCause reason)
     qemu_notify_event();
 }
 
+void qemu_system_exec_request(const strList *args)
+{
+    exec_argv = strv_from_strList(args);
+    shutdown_requested = 1;
+    qemu_notify_event();
+}
+
 static void qemu_system_powerdown(void)
 {
     qapi_event_send_powerdown();
@@ -689,6 +703,13 @@ static bool main_loop_should_exit(void)
     }
     request = qemu_shutdown_requested();
     if (request) {
+
+        if (qemu_exec_requested()) {
+            execvp(exec_argv[0], exec_argv);
+            error_report("execvp %s failed: %s", exec_argv[0], strerror(errno));
+            g_strfreev(exec_argv);
+            exec_argv = NULL;
+        }
         qemu_kill_report();
         qemu_system_shutdown(request);
         if (shutdown_action == SHUTDOWN_ACTION_PAUSE) {
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 13/29] cpr: preserve extra state
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (11 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 12/29] vl: helper to request re-exec Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 14/29] cpr: restart mode Steve Sistare
                   ` (16 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

cpr must save state that is needed after qemu is restarted, when devices are
realized.  Thus the extra state cannot be saved in the cpr-load vmstate file,
as objects must already exist before that file can be loaded.  Instead,
define auxilliary state structures and vmstate descriptions, not associated
with any registered object, and serialize the aux state to a memfd file.
Deserialize after qemu restarts, before devices are realized.

Currently file descriptors comprise the only such state, but more could
be added in the future.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS             |   2 +
 include/migration/cpr.h |  13 ++-
 migration/cpr-state.c   | 228 ++++++++++++++++++++++++++++++++++++++++++++++++
 migration/meson.build   |   1 +
 migration/trace-events  |   5 ++
 stubs/cpr-state.c       |  15 ++++
 stubs/meson.build       |   1 +
 7 files changed, 264 insertions(+), 1 deletion(-)
 create mode 100644 migration/cpr-state.c
 create mode 100644 stubs/cpr-state.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3c53b0d..cfe7480 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2995,6 +2995,8 @@ S: Maintained
 F: include/migration/cpr.h
 F: migration/cpr.c
 F: qapi/cpr.json
+F: migration/cpr-state.c
+F: stubs/cpr-state.c
 
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index 0f27b61..a4da24e 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -12,6 +12,17 @@
 
 #define CPR_MODE_NONE ((CprMode)(-1))
 
-static void cpr_set_mode(CprMode mode) {}   /* no-op until a later patch */
+void cpr_set_mode(CprMode mode);
+CprMode cpr_get_mode(void);
+
+typedef int (*cpr_walk_fd_cb)(const char *name, int id, int fd, void *opaque);
+
+void cpr_save_fd(const char *name, int id, int fd);
+void cpr_delete_fd(const char *name, int id);
+int cpr_find_fd(const char *name, int id);
+int cpr_walk_fd(cpr_walk_fd_cb cb, void *handle);
+int cpr_state_save(Error **errp);
+int cpr_state_load(Error **errp);
+void cpr_state_print(void);
 
 #endif
diff --git a/migration/cpr-state.c b/migration/cpr-state.c
new file mode 100644
index 0000000..42465f8
--- /dev/null
+++ b/migration/cpr-state.c
@@ -0,0 +1,228 @@
+#include "qemu/osdep.h"
+#include "qemu/cutils.h"
+#include "qemu/queue.h"
+#include "qemu/memfd.h"
+#include "qapi/error.h"
+#include "migration/vmstate.h"
+#include "migration/cpr.h"
+#include "migration/qemu-file.h"
+#include "migration/qemu-file-channel.h"
+#include "trace.h"
+
+/*************************************************************************/
+/* cpr state container for all information to be saved. */
+
+typedef QLIST_HEAD(CprNameList, CprName) CprNameList;
+
+typedef struct CprState {
+    CprMode mode;
+    CprNameList fds;            /* list of CprFd */
+} CprState;
+
+static CprState cpr_state = {
+    .mode = CPR_MODE_NONE,
+};
+
+/*************************************************************************/
+/* Generic list of names. */
+
+typedef struct CprName {
+    char *name;
+    unsigned int namelen;
+    int id;
+    QLIST_ENTRY(CprName) next;
+} CprName;
+
+static const VMStateDescription vmstate_cpr_name = {
+    .name = "cpr name",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(namelen, CprName),
+        VMSTATE_VBUFFER_ALLOC_UINT32(name, CprName, 0, NULL, namelen),
+        VMSTATE_INT32(id, CprName),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static void
+add_name(CprNameList *head, const char *name, int id, CprName *elem)
+{
+    elem->name = g_strdup(name);
+    elem->namelen = strlen(name) + 1;
+    elem->id = id;
+    QLIST_INSERT_HEAD(head, elem, next);
+}
+
+static CprName *find_name(CprNameList *head, const char *name, int id)
+{
+    CprName *elem;
+
+    QLIST_FOREACH(elem, head, next) {
+        if (!strcmp(elem->name, name) && elem->id == id) {
+            return elem;
+        }
+    }
+    return NULL;
+}
+
+static void delete_name(CprNameList *head, const char *name, int id)
+{
+    CprName *elem = find_name(head, name, id);
+
+    if (elem) {
+        QLIST_REMOVE(elem, next);
+        g_free(elem->name);
+        g_free(elem);
+    }
+}
+
+/****************************************************************************/
+/* Lists of named things.  The first field of each entry must be a CprName. */
+
+typedef struct CprFd {
+    CprName name;               /* must be first */
+    int fd;
+} CprFd;
+
+static const VMStateDescription vmstate_cpr_fd = {
+    .name = "cpr fd",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_STRUCT(name, CprFd, 1, vmstate_cpr_name, CprName),
+        VMSTATE_INT32(fd, CprFd),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+#define CPR_FD(elem)        ((CprFd *)(elem))
+#define CPR_FD_FD(elem)     (CPR_FD(elem)->fd)
+
+void cpr_save_fd(const char *name, int id, int fd)
+{
+    CprFd *elem = g_new0(CprFd, 1);
+
+    trace_cpr_save_fd(name, id, fd);
+    elem->fd = fd;
+    add_name(&cpr_state.fds, name, id, &elem->name);
+}
+
+void cpr_delete_fd(const char *name, int id)
+{
+    trace_cpr_delete_fd(name, id);
+    delete_name(&cpr_state.fds, name, id);
+}
+
+int cpr_find_fd(const char *name, int id)
+{
+    CprName *elem = find_name(&cpr_state.fds, name, id);
+    int fd = elem ? CPR_FD_FD(elem) : -1;
+
+    trace_cpr_find_fd(name, id, fd);
+    return fd;
+}
+
+int cpr_walk_fd(cpr_walk_fd_cb cb, void *opaque)
+{
+    CprName *elem;
+
+    QLIST_FOREACH(elem, &cpr_state.fds, next) {
+        if (cb(elem->name, elem->id, CPR_FD_FD(elem), opaque)) {
+            return 1;
+        }
+    }
+    return 0;
+}
+
+/*************************************************************************/
+/* cpr state container interface and implementation. */
+
+#define CPR_STATE_NAME "QEMU_CPR_STATE"
+
+static const VMStateDescription vmstate_cpr_state = {
+    .name = CPR_STATE_NAME,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(mode, CprState),
+        VMSTATE_QLIST_V(fds, CprState, 1, vmstate_cpr_fd, CprFd, name.next),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+int cpr_state_save(Error **errp)
+{
+    int ret, mfd;
+    QEMUFile *f;
+    char val[16];
+
+    mfd = memfd_create(CPR_STATE_NAME, 0);
+    if (mfd < 0) {
+        error_setg_errno(errp, errno, "memfd_create failed");
+        return -1;
+    }
+    qemu_clear_cloexec(mfd);
+    f = qemu_fd_open(mfd, true, CPR_STATE_NAME);
+    if (!f) {
+        error_setg(errp, "qemu_fd_open %d failed", mfd);
+        return -1;
+    }
+
+    ret = vmstate_save_state(f, &vmstate_cpr_state, &cpr_state, 0);
+    if (ret) {
+        error_setg(errp, "vmstate_save_state error %d", ret);
+        return ret;
+    }
+
+    /* Do not close f, as mfd must remain open. */
+    qemu_fflush(f);
+    lseek(mfd, 0, SEEK_SET);
+
+    /* Remember mfd for post-exec cpr_state_load */
+    snprintf(val, sizeof(val), "%d", mfd);
+    g_setenv(CPR_STATE_NAME, val, 1);
+
+    return 0;
+}
+
+int cpr_state_load(Error **errp)
+{
+    int ret, mfd;
+    QEMUFile *f;
+    const char *val = g_getenv(CPR_STATE_NAME);
+
+    if (!val) {
+        return 0;
+    }
+    g_unsetenv(CPR_STATE_NAME);
+    if (qemu_strtoi(val, NULL, 10, &mfd)) {
+        error_setg(errp, "Bad %s env value %s", CPR_STATE_NAME, val);
+        return 1;
+    }
+    f = qemu_fd_open(mfd, false, CPR_STATE_NAME);
+    ret = vmstate_load_state(f, &vmstate_cpr_state, &cpr_state, 1);
+    qemu_fclose(f);
+    return ret;
+}
+
+CprMode cpr_get_mode(void)
+{
+    return cpr_state.mode;
+}
+
+void cpr_set_mode(CprMode mode)
+{
+    cpr_state.mode = mode;
+}
+
+void cpr_state_print(void)
+{
+    CprName *elem;
+
+    printf("cpr_state:\n");
+    printf("- mode = %d\n", cpr_state.mode);
+    QLIST_FOREACH(elem, &cpr_state.fds, next) {
+        printf("- %s %d : fd=%d\n", elem->name, elem->id, CPR_FD_FD(elem));
+    }
+}
diff --git a/migration/meson.build b/migration/meson.build
index fd59281..b79d02c 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -16,6 +16,7 @@ softmmu_ss.add(files(
   'colo-failover.c',
   'colo.c',
   'cpr.c',
+  'cpr-state.c',
   'exec.c',
   'fd.c',
   'global_state.c',
diff --git a/migration/trace-events b/migration/trace-events
index b48d873..35b4627 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -312,6 +312,11 @@ colo_receive_message(const char *msg) "Receive '%s' message"
 # colo-failover.c
 colo_failover_set_state(const char *new_state) "new state %s"
 
+# cpr-state.c
+cpr_save_fd(const char *name, int id, int fd) "%s, id %d, fd %d"
+cpr_delete_fd(const char *name, int id) "%s, id %d"
+cpr_find_fd(const char *name, int id, int fd) "%s, id %d returns %d"
+
 # block-dirty-bitmap.c
 send_bitmap_header_enter(void) ""
 send_bitmap_bits(uint32_t flags, uint64_t start_sector, uint32_t nr_sectors, uint64_t data_size) "flags: 0x%x, start_sector: %" PRIu64 ", nr_sectors: %" PRIu32 ", data_size: %" PRIu64
diff --git a/stubs/cpr-state.c b/stubs/cpr-state.c
new file mode 100644
index 0000000..24a9057
--- /dev/null
+++ b/stubs/cpr-state.c
@@ -0,0 +1,15 @@
+#include "qemu/osdep.h"
+#include "migration/cpr.h"
+
+void cpr_save_fd(const char *name, int id, int fd)
+{
+}
+
+void cpr_delete_fd(const char *name, int id)
+{
+}
+
+int cpr_find_fd(const char *name, int id)
+{
+    return -1;
+}
diff --git a/stubs/meson.build b/stubs/meson.build
index 71469c1..9565c7d 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -4,6 +4,7 @@ stub_ss.add(files('blk-exp-close-all.c'))
 stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
 stub_ss.add(files('change-state-handler.c'))
 stub_ss.add(files('cmos.c'))
+stub_ss.add(files('cpr-state.c'))
 stub_ss.add(files('cpu-get-clock.c'))
 stub_ss.add(files('cpus-get-virtual-clock.c'))
 stub_ss.add(files('qemu-timer-notify-cb.c'))
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 14/29] cpr: restart mode
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (12 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 13/29] cpr: preserve extra state Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 15/29] cpr: restart HMP interfaces Steve Sistare
                   ` (15 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Provide the cpr-save restart mode, which preserves the guest VM across a
restart of the qemu process.  After cpr-save, the caller passes qemu
command-line arguments to cpr-exec, which directly exec's the new qemu
binary.  The arguments must include -S so new qemu starts in a paused state.
The caller resumes the guest by calling cpr-load.

To use the restart mode, all guest RAM objects must be shared.  The
share=on property is required for memory created with an explicit -object
option.  The memfd-alloc machine property is required for memory that is
implicitly created.  The memfd values are saved in special cpr state which
is retrieved after exec, and are kept open across exec, after which they
are retrieved and re-mmap'd.  Hence guest RAM is preserved in place,
albeit with new virtual addresses in the qemu process.

The restart mode supports vfio devices and explicit memory-backend-memfd
objects in subsequent patches.

cpr-exec syntax:
  { 'command': 'cpr-exec', 'data': { 'argv': [ 'str' ] } }

Add the restart mode:
  { 'enum': 'CprMode', 'data': [ 'reboot', 'restart' ] }

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/cpr.c   | 29 ++++++++++++++++++++++++++++-
 qapi/cpr.json     | 22 +++++++++++++++++++++-
 softmmu/physmem.c |  5 ++++-
 softmmu/vl.c      |  3 +++
 4 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/migration/cpr.c b/migration/cpr.c
index ca76124..37eca66 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -81,6 +81,34 @@ err:
     cpr_set_mode(CPR_MODE_NONE);
 }
 
+static int preserve_fd(const char *name, int id, int fd, void *opaque)
+{
+    qemu_clear_cloexec(fd);
+    return 0;
+}
+
+void qmp_cpr_exec(strList *args, Error **errp)
+{
+    if (xen_enabled()) {
+        error_setg(errp, "xen does not support cpr-exec");
+        return;
+    }
+    if (!runstate_check(RUN_STATE_SAVE_VM)) {
+        error_setg(errp, "runstate is not save-vm");
+        return;
+    }
+    if (cpr_get_mode() != CPR_MODE_RESTART) {
+        error_setg(errp, "cpr-exec requires cpr-save with restart mode");
+        return;
+    }
+
+    cpr_walk_fd(preserve_fd, 0);
+    if (cpr_state_save(errp)) {
+        return;
+    }
+    qemu_system_exec_request(args);
+}
+
 void qmp_cpr_load(const char *filename, Error **errp)
 {
     QEMUFile *f;
@@ -104,7 +132,6 @@ void qmp_cpr_load(const char *filename, Error **errp)
         return;
     }
 
-    cpr_set_mode(CPR_MODE_REBOOT);
     ret = qemu_load_device_state(f);
     qemu_fclose(f);
     if (ret < 0) {
diff --git a/qapi/cpr.json b/qapi/cpr.json
index 2edd08e..56be0e5 100644
--- a/qapi/cpr.json
+++ b/qapi/cpr.json
@@ -15,11 +15,12 @@
 # @CprMode:
 #
 # @reboot: checkpoint can be cpr-load'ed after a host kexec reboot.
+# @restart: checkpoint can be cpr-load'ed after restarting qemu.
 #
 # Since: 6.2
 ##
 { 'enum': 'CprMode',
-  'data': [ 'reboot' ] }
+  'data': [ 'reboot', 'restart' ] }
 
 ##
 # @cpr-save:
@@ -33,6 +34,11 @@
 # For reboot mode, all guest RAM objects must be non-volatile across reboot,
 # and created with the share=on parameter.
 #
+# For restart mode, all guest RAM objects must be shared.  The share=on
+# property is required for memory created with an explicit -object option,
+# and the memfd-alloc machine property is required for memory that is
+# implicitly created.
+#
 # @filename: name of checkpoint file
 # @mode: @CprMode mode
 #
@@ -43,6 +49,20 @@
             'mode': 'CprMode' } }
 
 ##
+# @cpr-exec:
+#
+# exec() a command and replace the qemu process.  The PID remains the same.
+# @argv[0] should be the path of a new qemu binary, or a prefix command that
+# in turn exec's the new qemu binary.  Must be called after cpr-save restart.
+#
+# @argv: arguments to be passed to exec().
+#
+# Since: 6.2
+##
+{ 'command': 'cpr-exec',
+  'data': { 'argv': [ 'str' ] } }
+
+##
 # @cpr-load:
 #
 # Start virtual machine from checkpoint file that was created earlier using
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 95e2b49..e227195 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -65,6 +65,7 @@
 
 #include "qemu/pmem.h"
 
+#include "migration/cpr.h"
 #include "migration/vmstate.h"
 
 #include "qemu/range.h"
@@ -1991,7 +1992,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
             name = memory_region_name(mr);
             if (ms->memfd_alloc) {
                 Object *parent = &mr->parent_obj;
-                int mfd = -1;          /* placeholder until next patch */
+                int mfd = cpr_find_fd(name, 0);
                 mr->align = QEMU_VMALLOC_ALIGN;
                 if (mfd < 0) {
                     mfd = qemu_memfd_create(name, maxlen + mr->align,
@@ -1999,6 +2000,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                     if (mfd < 0) {
                         return;
                     }
+                    cpr_save_fd(name, 0, mfd);
                 }
                 qemu_set_cloexec(mfd);
                 /* The memory backend already set its desired flags. */
@@ -2255,6 +2257,7 @@ void qemu_ram_free(RAMBlock *block)
     }
 
     qemu_mutex_lock_ramlist();
+    cpr_delete_fd(memory_region_name(block->mr), 0);
     QLIST_REMOVE_RCU(block, next);
     ram_list.mru_block = NULL;
     /* Write list before version */
diff --git a/softmmu/vl.c b/softmmu/vl.c
index ab3648a..4319e1a 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -76,6 +76,7 @@
 #include "hw/i386/pc.h"
 #include "migration/misc.h"
 #include "migration/snapshot.h"
+#include "migration/cpr.h"
 #include "sysemu/tpm.h"
 #include "sysemu/dma.h"
 #include "hw/audio/soundhw.h"
@@ -3675,6 +3676,8 @@ void qemu_init(int argc, char **argv, char **envp)
     qemu_validate_options(machine_opts_dict);
     qemu_process_sugar_options();
 
+    cpr_state_load(&error_fatal);
+
     /*
      * These options affect everything else and should be processed
      * before daemonizing.
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 15/29] cpr: restart HMP interfaces
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (13 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 14/29] cpr: restart mode Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 16/29] hostmem-memfd: cpr for memory-backend-memfd Steve Sistare
                   ` (14 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

cpr-save <filename> <mode>
  mode may be "restart"

cpr-exec <command>
  Call qmp_cpr_exec().
  Arguments:
    command : command line to execute, with space-separated arguments

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hmp-commands.hx       | 21 ++++++++++++++++++++-
 include/monitor/hmp.h |  1 +
 monitor/hmp-cmds.c    | 11 +++++++++++
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 350c886..0fd5b1b 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -353,7 +353,7 @@ ERST
     {
         .name       = "cpr-save",
         .args_type  = "filename:s,mode:s",
-        .params     = "filename 'reboot'",
+        .params     = "filename 'reboot'|'restart'",
         .help       = "create a checkpoint of the VM in file",
         .cmd        = hmp_cpr_save,
     },
@@ -366,6 +366,25 @@ If *mode* is 'reboot', the checkpoint remains valid after a host kexec
 reboot, and guest ram must be backed by persistent shared memory.  To
 resume from the checkpoint, issue the quit command, reboot the system,
 and issue the cpr-load command.
+
+If *mode* is 'restart', the checkpoint remains valid after restarting qemu
+using a subsequent cpr-exec.  All guest RAM objects must be shared.  The
+share=on property is required for memory created with an explicit -object
+option, and the memfd-alloc machine property is required for memory that is
+implicitly created.  To resume from the checkpoint, issue the cpr-load command.
+ERST
+
+    {
+        .name       = "cpr-exec",
+        .args_type  = "command:S",
+        .params     = "command",
+        .help       = "Restart qemu by directly exec'ing command",
+        .cmd        = hmp_cpr_exec,
+    },
+
+SRST
+``cpr-exec`` *command*
+Restart qemu by directly exec'ing *command*, replacing the qemu process.
 ERST
 
     {
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index b44588e..ec4fa44 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -60,6 +60,7 @@ void hmp_loadvm(Monitor *mon, const QDict *qdict);
 void hmp_savevm(Monitor *mon, const QDict *qdict);
 void hmp_delvm(Monitor *mon, const QDict *qdict);
 void hmp_cpr_save(Monitor *mon, const QDict *qdict);
+void hmp_cpr_exec(Monitor *mon, const QDict *qdict);
 void hmp_cpr_load(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 5ca8b4b..39894d8 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1105,6 +1105,17 @@ out:
     hmp_handle_error(mon, err);
 }
 
+void hmp_cpr_exec(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    const char *command = qdict_get_try_str(qdict, "command");
+    strList *args = strList_from_string(command, ' ');
+
+    qmp_cpr_exec(args, &err);
+    qapi_free_strList(args);
+    hmp_handle_error(mon, err);
+}
+
 void hmp_cpr_load(Monitor *mon, const QDict *qdict)
 {
     Error *err = NULL;
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 16/29] hostmem-memfd: cpr for memory-backend-memfd
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (14 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 15/29] cpr: restart HMP interfaces Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 17/29] pci: export functions for cpr Steve Sistare
                   ` (13 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Preserve memory-backend-memfd memory objects during cpr.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 backends/hostmem-memfd.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 3fc85c3..5097a05 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -14,6 +14,7 @@
 #include "sysemu/hostmem.h"
 #include "qom/object_interfaces.h"
 #include "qemu/memfd.h"
+#include "migration/cpr.h"
 #include "qemu/module.h"
 #include "qapi/error.h"
 #include "qom/object.h"
@@ -36,23 +37,25 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
     HostMemoryBackendMemfd *m = MEMORY_BACKEND_MEMFD(backend);
     uint32_t ram_flags;
-    char *name;
-    int fd;
+    char *name = host_memory_backend_get_name(backend);
+    int fd = cpr_find_fd(name, 0);
 
     if (!backend->size) {
         error_setg(errp, "can't create backend with size 0");
         return;
     }
 
-    fd = qemu_memfd_create(TYPE_MEMORY_BACKEND_MEMFD, backend->size,
-                           m->hugetlb, m->hugetlbsize, m->seal ?
-                           F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL : 0,
-                           errp);
-    if (fd == -1) {
-        return;
+    if (fd < 0) {
+        fd = qemu_memfd_create(TYPE_MEMORY_BACKEND_MEMFD, backend->size,
+                               m->hugetlb, m->hugetlbsize, m->seal ?
+                               F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL : 0,
+                               errp);
+        if (fd == -1) {
+            return;
+        }
+        cpr_save_fd(name, 0, fd);
     }
 
-    name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
     memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 17/29] pci: export functions for cpr
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (15 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 16/29] hostmem-memfd: cpr for memory-backend-memfd Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 23:07   ` Michael S. Tsirkin
  2021-12-22 19:05 ` [PATCH V7 18/29] vfio-pci: refactor " Steve Sistare
                   ` (12 subsequent siblings)
  29 siblings, 1 reply; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Export msix_is_pending, msix_init_vector_notifiers, and pci_update_mappings
for use by cpr.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/pci/msix.c         | 20 ++++++++++++++------
 hw/pci/pci.c          |  3 +--
 include/hw/pci/msix.h |  5 +++++
 include/hw/pci/pci.h  |  1 +
 4 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331c..73f4259 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -64,7 +64,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
     return dev->msix_pba + vector / 8;
 }
 
-static int msix_is_pending(PCIDevice *dev, int vector)
+int msix_is_pending(PCIDevice *dev, unsigned int vector)
 {
     return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
 }
@@ -579,6 +579,17 @@ static void msix_unset_notifier_for_vector(PCIDevice *dev, unsigned int vector)
     dev->msix_vector_release_notifier(dev, vector);
 }
 
+void msix_init_vector_notifiers(PCIDevice *dev,
+                                MSIVectorUseNotifier use_notifier,
+                                MSIVectorReleaseNotifier release_notifier,
+                                MSIVectorPollNotifier poll_notifier)
+{
+    assert(use_notifier && release_notifier);
+    dev->msix_vector_use_notifier = use_notifier;
+    dev->msix_vector_release_notifier = release_notifier;
+    dev->msix_vector_poll_notifier = poll_notifier;
+}
+
 int msix_set_vector_notifiers(PCIDevice *dev,
                               MSIVectorUseNotifier use_notifier,
                               MSIVectorReleaseNotifier release_notifier,
@@ -586,11 +597,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
 {
     int vector, ret;
 
-    assert(use_notifier && release_notifier);
-
-    dev->msix_vector_use_notifier = use_notifier;
-    dev->msix_vector_release_notifier = release_notifier;
-    dev->msix_vector_poll_notifier = poll_notifier;
+    msix_init_vector_notifiers(dev, use_notifier, release_notifier,
+                               poll_notifier);
 
     if ((dev->config[dev->msix_cap + MSIX_CONTROL_OFFSET] &
         (MSIX_ENABLE_MASK | MSIX_MASKALL_MASK)) == MSIX_ENABLE_MASK) {
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index e5993c1..0fd21e1 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -225,7 +225,6 @@ static const TypeInfo pcie_bus_info = {
 };
 
 static PCIBus *pci_find_bus_nr(PCIBus *bus, int bus_num);
-static void pci_update_mappings(PCIDevice *d);
 static void pci_irq_handler(void *opaque, int irq_num, int level);
 static void pci_add_option_rom(PCIDevice *pdev, bool is_default_rom, Error **);
 static void pci_del_option_rom(PCIDevice *pdev);
@@ -1366,7 +1365,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
     return new_addr;
 }
 
-static void pci_update_mappings(PCIDevice *d)
+void pci_update_mappings(PCIDevice *d)
 {
     PCIIORegion *r;
     int i;
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 4c4a60c..46606cf 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
 bool msix_is_masked(PCIDevice *dev, unsigned vector);
 void msix_set_pending(PCIDevice *dev, unsigned vector);
 void msix_clr_pending(PCIDevice *dev, int vector);
+int msix_is_pending(PCIDevice *dev, unsigned vector);
 
 int msix_vector_use(PCIDevice *dev, unsigned vector);
 void msix_vector_unuse(PCIDevice *dev, unsigned vector);
@@ -41,6 +42,10 @@ void msix_notify(PCIDevice *dev, unsigned vector);
 
 void msix_reset(PCIDevice *dev);
 
+void msix_init_vector_notifiers(PCIDevice *dev,
+                                MSIVectorUseNotifier use_notifier,
+                                MSIVectorReleaseNotifier release_notifier,
+                                MSIVectorPollNotifier poll_notifier);
 int msix_set_vector_notifiers(PCIDevice *dev,
                               MSIVectorUseNotifier use_notifier,
                               MSIVectorReleaseNotifier release_notifier,
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index e7cdf2d..cc63dd4 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -910,5 +910,6 @@ extern const VMStateDescription vmstate_pci_device;
 
 MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
 void pci_set_power(PCIDevice *pci_dev, bool state);
+void pci_update_mappings(PCIDevice *d);
 
 #endif
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 18/29] vfio-pci: refactor for cpr
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (16 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 17/29] pci: export functions for cpr Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-03-03 23:21   ` Alex Williamson
  2021-12-22 19:05 ` [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
                   ` (11 subsequent siblings)
  29 siblings, 1 reply; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Export vfio_address_spaces.
Refactor vector use into a helper vfio_vector_init.
Add vfio_notifier_init and vfio_notifier_cleanup for named notifiers,
and pass additional arguments to vfio_remove_kvm_msi_virq.

All for use by cpr in a subsequent patch.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/common.c              |   2 +-
 hw/vfio/pci.c                 | 102 +++++++++++++++++++++++++++---------------
 include/hw/vfio/vfio-common.h |   2 +
 3 files changed, 70 insertions(+), 36 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 080046e..5b87f95 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -43,7 +43,7 @@
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
-static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
+VFIOAddressSpaceList vfio_address_spaces =
     QLIST_HEAD_INITIALIZER(vfio_address_spaces);
 
 #ifdef CONFIG_KVM
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7b45353..a90cce2 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -48,6 +48,27 @@
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 
+/* Create new or reuse existing eventfd */
+static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
+                              const char *name, int nr)
+{
+    int fd = -1;   /* placeholder until a subsequent patch */
+    int ret = 0;
+
+    if (fd >= 0) {
+        event_notifier_init_fd(e, fd);
+    } else {
+        ret = event_notifier_init(e, 0);
+    }
+    return ret;
+}
+
+static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
+                                  const char *name, int nr)
+{
+    event_notifier_cleanup(e);
+}
+
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
  * also be a huge overhead.  We try to get the best of both worlds by
@@ -128,8 +149,8 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     pci_irq_deassert(&vdev->pdev);
 
     /* Get an eventfd for resample/unmask */
-    if (event_notifier_init(&vdev->intx.unmask, 0)) {
-        error_setg(errp, "event_notifier_init failed eoi");
+    if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
+        error_setg(errp, "vfio_notifier_init intx-unmask failed");
         goto fail;
     }
 
@@ -161,7 +182,7 @@ fail_vfio:
     kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
                                           vdev->intx.route.irq);
 fail_irqfd:
-    event_notifier_cleanup(&vdev->intx.unmask);
+    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
 fail:
     qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
     vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
@@ -190,7 +211,7 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
     }
 
     /* We only need to close the eventfd for VFIO to cleanup the kernel side */
-    event_notifier_cleanup(&vdev->intx.unmask);
+    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
 
     /* QEMU starts listening for interrupt events. */
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
@@ -281,9 +302,10 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     }
 #endif
 
-    ret = event_notifier_init(&vdev->intx.interrupt, 0);
+    ret = vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
     if (ret) {
-        error_setg_errno(errp, -ret, "event_notifier_init failed");
+        error_setg_errno(errp, -ret,
+                         "vfio_notifier_init intx-interrupt failed");
         return ret;
     }
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
@@ -292,7 +314,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->intx.interrupt);
+        vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
         return -errno;
     }
 
@@ -320,7 +342,7 @@ static void vfio_intx_disable(VFIOPCIDevice *vdev)
 
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->intx.interrupt);
+    vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
 
     vdev->interrupt = VFIO_INT_NONE;
 
@@ -410,41 +432,43 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
 }
 
 static void vfio_add_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
-                                  int vector_n, bool msix)
+                                  int nr, bool msix)
 {
     int virq;
+    const char *name = "kvm_interrupt";
 
     if ((msix && vdev->no_kvm_msix) || (!msix && vdev->no_kvm_msi)) {
         return;
     }
 
-    if (event_notifier_init(&vector->kvm_interrupt, 0)) {
+    if (vfio_notifier_init(vdev, &vector->kvm_interrupt, name, nr)) {
         return;
     }
 
-    virq = kvm_irqchip_add_msi_route(kvm_state, vector_n, &vdev->pdev);
+    virq = kvm_irqchip_add_msi_route(kvm_state, nr, &vdev->pdev);
     if (virq < 0) {
-        event_notifier_cleanup(&vector->kvm_interrupt);
+        vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, name, nr);
         return;
     }
 
     if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
                                        NULL, virq) < 0) {
         kvm_irqchip_release_virq(kvm_state, virq);
-        event_notifier_cleanup(&vector->kvm_interrupt);
+        vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, name, nr);
         return;
     }
 
     vector->virq = virq;
 }
 
-static void vfio_remove_kvm_msi_virq(VFIOMSIVector *vector)
+static void vfio_remove_kvm_msi_virq(VFIOPCIDevice *vdev, VFIOMSIVector *vector,
+                                     int nr)
 {
     kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vector->kvm_interrupt,
                                           vector->virq);
     kvm_irqchip_release_virq(kvm_state, vector->virq);
     vector->virq = -1;
-    event_notifier_cleanup(&vector->kvm_interrupt);
+    vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
 }
 
 static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
@@ -454,6 +478,20 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
     kvm_irqchip_commit_routes(kvm_state);
 }
 
+static void vfio_vector_init(VFIOPCIDevice *vdev, int nr)
+{
+    VFIOMSIVector *vector = &vdev->msi_vectors[nr];
+    PCIDevice *pdev = &vdev->pdev;
+
+    vector->vdev = vdev;
+    vector->virq = -1;
+    if (vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr)) {
+        error_report("vfio: vfio_notifier_init interrupt failed");
+    }
+    vector->use = true;
+    msix_vector_use(pdev, nr);
+}
+
 static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
                                    MSIMessage *msg, IOHandler *handler)
 {
@@ -466,13 +504,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     vector = &vdev->msi_vectors[nr];
 
     if (!vector->use) {
-        vector->vdev = vdev;
-        vector->virq = -1;
-        if (event_notifier_init(&vector->interrupt, 0)) {
-            error_report("vfio: Error: event_notifier_init failed");
-        }
-        vector->use = true;
-        msix_vector_use(pdev, nr);
+        vfio_vector_init(vdev, nr);
     }
 
     qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -484,7 +516,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
      */
     if (vector->virq >= 0) {
         if (!msg) {
-            vfio_remove_kvm_msi_virq(vector);
+            vfio_remove_kvm_msi_virq(vdev, vector, nr);
         } else {
             vfio_update_kvm_msi_virq(vector, *msg, pdev);
         }
@@ -629,8 +661,8 @@ retry:
         vector->virq = -1;
         vector->use = true;
 
-        if (event_notifier_init(&vector->interrupt, 0)) {
-            error_report("vfio: Error: event_notifier_init failed");
+        if (vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i)) {
+            error_report("vfio: Error: vfio_notifier_init failed");
         }
 
         qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
@@ -658,11 +690,11 @@ retry:
         for (i = 0; i < vdev->nr_vectors; i++) {
             VFIOMSIVector *vector = &vdev->msi_vectors[i];
             if (vector->virq >= 0) {
-                vfio_remove_kvm_msi_virq(vector);
+                vfio_remove_kvm_msi_virq(vdev, vector, i);
             }
             qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
                                 NULL, NULL, NULL);
-            event_notifier_cleanup(&vector->interrupt);
+            vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
         }
 
         g_free(vdev->msi_vectors);
@@ -697,11 +729,11 @@ static void vfio_msi_disable_common(VFIOPCIDevice *vdev)
         VFIOMSIVector *vector = &vdev->msi_vectors[i];
         if (vdev->msi_vectors[i].use) {
             if (vector->virq >= 0) {
-                vfio_remove_kvm_msi_virq(vector);
+                vfio_remove_kvm_msi_virq(vdev, vector, i);
             }
             qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
                                 NULL, NULL, NULL);
-            event_notifier_cleanup(&vector->interrupt);
+            vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
         }
     }
 
@@ -2694,7 +2726,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (event_notifier_init(&vdev->err_notifier, 0)) {
+    if (vfio_notifier_init(vdev, &vdev->err_notifier, "err", 0)) {
         error_report("vfio: Unable to init event notifier for error detection");
         vdev->pci_aer = false;
         return;
@@ -2707,7 +2739,7 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->err_notifier);
+        vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
         vdev->pci_aer = false;
     }
 }
@@ -2726,7 +2758,7 @@ static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
     }
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->err_notifier),
                         NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->err_notifier);
+    vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
 }
 
 static void vfio_req_notifier_handler(void *opaque)
@@ -2760,7 +2792,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
         return;
     }
 
-    if (event_notifier_init(&vdev->req_notifier, 0)) {
+    if (vfio_notifier_init(vdev, &vdev->req_notifier, "req", 0)) {
         error_report("vfio: Unable to init event notifier for device request");
         return;
     }
@@ -2772,7 +2804,7 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
                            VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
-        event_notifier_cleanup(&vdev->req_notifier);
+        vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
     } else {
         vdev->req_enabled = true;
     }
@@ -2792,7 +2824,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     }
     qemu_set_fd_handler(event_notifier_get_fd(&vdev->req_notifier),
                         NULL, NULL, vdev);
-    event_notifier_cleanup(&vdev->req_notifier);
+    vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
 
     vdev->req_enabled = false;
 }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8af11b0..1641753 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -216,6 +216,8 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
 extern VFIOGroupList vfio_group_list;
+typedef QLIST_HEAD(, VFIOAddressSpace) VFIOAddressSpaceList;
+extern VFIOAddressSpaceList vfio_address_spaces;
 
 bool vfio_mig_active(void);
 int64_t vfio_mig_bytes_transferred(void);
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (17 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 18/29] vfio-pci: refactor " Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 23:15   ` Michael S. Tsirkin
  2022-03-07 22:16   ` Alex Williamson
  2021-12-22 19:05 ` [PATCH V7 20/29] vfio-pci: cpr part 2 (msi) Steve Sistare
                   ` (10 subsequent siblings)
  29 siblings, 2 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Enable vfio-pci devices to be saved and restored across an exec restart
of qemu.

At vfio creation time, save the value of vfio container, group, and device
descriptors in cpr state.

In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
at a different VA after exec.  DMA to already-mapped pages continues.  Save
the msi message area as part of vfio-pci vmstate, save the interrupt and
notifier eventfd's in cpr state, and clear the close-on-exec flag for the
vfio descriptors.  The flag is not cleared earlier because the descriptors
should not persist across miscellaneous fork and exec calls that may be
performed during normal operation.

On qemu restart, vfio_realize() finds the saved descriptors, uses
the descriptors, and notes that the device is being reused.  Device and
iommu state is already configured, so operations in vfio_realize that
would modify the configuration are skipped for a reused device, including
vfio ioctl's and writes to PCI configuration space.  The result is that
vfio_realize constructs qemu data structures that reflect the current
state of the device.  However, the reconstruction is not complete until
cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
state.  It rebuilds vector data structures and attaches the interrupts to
the new KVM instance.  cpr-load then invokes the main vfio listener callback,
which walks the flattened ranges of the vfio_address_spaces and calls
VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
starts the VM and suppresses vfio pci device reset.

This functionality is delivered by 3 patches for clarity.  Part 1 handles
device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
support.  Part 3 adds INTX support.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS                   |   1 +
 hw/pci/pci.c                  |  10 ++++
 hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
 hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
 hw/vfio/meson.build           |   1 +
 hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
 hw/vfio/trace-events          |   1 +
 include/hw/pci/pci.h          |   1 +
 include/hw/vfio/vfio-common.h |   8 +++
 include/migration/cpr.h       |   3 ++
 migration/cpr.c               |  10 +++-
 migration/target.c            |  14 +++++
 12 files changed, 324 insertions(+), 11 deletions(-)
 create mode 100644 hw/vfio/cpr.c

diff --git a/MAINTAINERS b/MAINTAINERS
index cfe7480..feed239 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2992,6 +2992,7 @@ CPR
 M: Steve Sistare <steven.sistare@oracle.com>
 M: Mark Kanda <mark.kanda@oracle.com>
 S: Maintained
+F: hw/vfio/cpr.c
 F: include/migration/cpr.h
 F: migration/cpr.c
 F: qapi/cpr.json
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 0fd21e1..e35df4f 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
 {
     int r;
 
+    /*
+     * A reused vfio-pci device is already configured, so do not reset it
+     * during qemu_system_reset prior to cpr-load, else interrupts may be
+     * lost.  By contrast, pure-virtual pci devices may be reset here and
+     * updated with new state in cpr-load with no ill effects.
+     */
+    if (dev->reused) {
+        return;
+    }
+
     pci_device_deassert_intx(dev);
     assert(dev->irq_state == 0);
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 5b87f95..90f66ad 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -31,6 +31,7 @@
 #include "exec/memory.h"
 #include "exec/ram_addr.h"
 #include "hw/hw.h"
+#include "migration/cpr.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/range.h"
@@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
+    assert(!container->reused);
+
     if (iotlb && container->dirty_pages_supported &&
         vfio_devices_all_running_and_saving(container)) {
         return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
@@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
 {
     struct vfio_iommu_type1_dma_map map = {
         .argsz = sizeof(map),
-        .flags = VFIO_DMA_MAP_FLAG_READ,
         .vaddr = (__u64)(uintptr_t)vaddr,
         .iova = iova,
         .size = size,
     };
 
+    /*
+     * Set the new vaddr for any mappings registered during cpr-load.
+     * Reused is cleared thereafter.
+     */
+    if (container->reused) {
+        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
+        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+            goto fail;
+        }
+        return 0;
+    }
+
+    map.flags = VFIO_DMA_MAP_FLAG_READ;
     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
     }
@@ -516,7 +531,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         return 0;
     }
 
-    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
+fail:
+    error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
+        (container->reused ? "VADDR" : ""), iova, size, vaddr, strerror(errno));
     return -errno;
 }
 
@@ -865,6 +882,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+    vfio_container_region_add(container, section);
+}
+
+void vfio_container_region_add(VFIOContainer *container,
+                               MemoryRegionSection *section)
+{
     hwaddr iova, end;
     Int128 llend, llsize;
     void *vaddr;
@@ -985,6 +1008,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
         int iommu_idx;
 
         trace_vfio_listener_region_add_iommu(iova, end);
+
         /*
          * FIXME: For VFIO iommu types which have KVM acceleration to
          * avoid bouncing all map/unmaps through qemu this way, this
@@ -1459,6 +1483,12 @@ static void vfio_listener_release(VFIOContainer *container)
     }
 }
 
+void vfio_listener_register(VFIOContainer *container)
+{
+    container->listener = vfio_memory_listener;
+    memory_listener_register(&container->listener, container->space->as);
+}
+
 static struct vfio_info_cap_header *
 vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
 {
@@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
 {
     int iommu_type, ret;
 
+    /*
+     * If container is reused, just set its type and skip the ioctls, as the
+     * container and group are already configured in the kernel.
+     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
+     * If you ever add new types or spapr cpr support, kind reader, please
+     * also implement VFIO_GET_IOMMU.
+     */
+    if (container->reused) {
+        container->iommu_type = VFIO_TYPE1v2_IOMMU;
+        return 0;
+    }
+
     iommu_type = vfio_get_iommu_type(container, errp);
     if (iommu_type < 0) {
         return iommu_type;
@@ -1982,9 +2024,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 {
     VFIOContainer *container;
     int ret, fd;
+    bool reused;
     VFIOAddressSpace *space;
 
     space = vfio_get_address_space(as);
+    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
+    reused = (fd > 0);
 
     /*
      * VFIO is currently incompatible with discarding of RAM insofar as the
@@ -2017,8 +2062,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
      * details once we know which type of IOMMU we are using.
      */
 
+    /*
+     * If the container is reused, then the group is already attached in the
+     * kernel.  If a container with matching fd is found, then update the
+     * userland group list and return.  It not, then after the loop, create
+     * the container struct and group list.
+     */
+
     QLIST_FOREACH(container, &space->containers, next) {
-        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+        if ((reused && container->fd == fd) ||
+            !ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
             ret = vfio_ram_block_discard_disable(container, true);
             if (ret) {
                 error_setg_errno(errp, -ret,
@@ -2032,12 +2085,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
             }
             group->container = container;
             QLIST_INSERT_HEAD(&container->group_list, group, container_next);
-            vfio_kvm_device_add_group(group);
+            if (!reused) {
+                vfio_kvm_device_add_group(group);
+                cpr_save_fd("vfio_container_for_group", group->groupid,
+                            container->fd);
+            }
             return 0;
         }
     }
 
-    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
+    if (!reused) {
+        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
+    }
+
     if (fd < 0) {
         error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
         ret = -errno;
@@ -2055,6 +2115,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container = g_malloc0(sizeof(*container));
     container->space = space;
     container->fd = fd;
+    container->reused = reused;
     container->error = NULL;
     container->dirty_pages_supported = false;
     container->dma_max_mappings = 0;
@@ -2181,9 +2242,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     group->container = container;
     QLIST_INSERT_HEAD(&container->group_list, group, container_next);
 
-    container->listener = vfio_memory_listener;
-
-    memory_listener_register(&container->listener, container->space->as);
+    /*
+     * If reused, register the listener later, after all state that may
+     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
+     * the listener will invoke its callback on each flat section and call
+     * vfio_dma_map to supply the new vaddr, and the calls will match the
+     * mappings remembered by the kernel.
+     */
+    if (!reused) {
+        vfio_listener_register(container);
+    }
 
     if (container->error) {
         ret = -1;
@@ -2193,6 +2261,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     }
 
     container->initialized = true;
+    if (!reused) {
+        cpr_save_fd("vfio_container_for_group", group->groupid, fd);
+    }
 
     return 0;
 listener_release_exit:
@@ -2222,6 +2293,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
+    cpr_delete_fd("vfio_container_for_group", group->groupid);
 
     /*
      * Explicitly release the listener first before unset container,
@@ -2270,6 +2342,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
     VFIOGroup *group;
     char path[32];
     struct vfio_group_status status = { .argsz = sizeof(status) };
+    bool reused;
 
     QLIST_FOREACH(group, &vfio_group_list, next) {
         if (group->groupid == groupid) {
@@ -2287,7 +2360,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
     group = g_malloc0(sizeof(*group));
 
     snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open_old(path, O_RDWR);
+
+    group->fd = cpr_find_fd("vfio_group", groupid);
+    reused = (group->fd >= 0);
+    if (!reused) {
+        group->fd = qemu_open_old(path, O_RDWR);
+    }
+
     if (group->fd < 0) {
         error_setg_errno(errp, errno, "failed to open %s", path);
         goto free_group_exit;
@@ -2321,6 +2400,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 
     QLIST_INSERT_HEAD(&vfio_group_list, group, next);
 
+    if (!reused) {
+        cpr_save_fd("vfio_group", groupid, group->fd);
+    }
+
     return group;
 
 close_fd_exit:
@@ -2345,6 +2428,7 @@ void vfio_put_group(VFIOGroup *group)
     vfio_disconnect_container(group);
     QLIST_REMOVE(group, next);
     trace_vfio_put_group(group->fd);
+    cpr_delete_fd("vfio_group", group->groupid);
     close(group->fd);
     g_free(group);
 
@@ -2358,8 +2442,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
 {
     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     int ret, fd;
+    bool reused;
+
+    fd = cpr_find_fd(name, 0);
+    reused = (fd >= 0);
+    if (!reused) {
+        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    }
 
-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
     if (fd < 0) {
         error_setg_errno(errp, errno, "error getting device from group %d",
                          group->groupid);
@@ -2404,6 +2494,10 @@ int vfio_get_device(VFIOGroup *group, const char *name,
     vbasedev->num_irqs = dev_info.num_irqs;
     vbasedev->num_regions = dev_info.num_regions;
     vbasedev->flags = dev_info.flags;
+    vbasedev->reused = reused;
+    if (!reused) {
+        cpr_save_fd(name, 0, fd);
+    }
 
     trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
                           dev_info.num_irqs);
@@ -2420,6 +2514,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
     QLIST_REMOVE(vbasedev, next);
     vbasedev->group = NULL;
     trace_vfio_put_base_device(vbasedev->fd);
+    cpr_delete_fd(vbasedev->name, 0);
     close(vbasedev->fd);
 }
 
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
new file mode 100644
index 0000000..2c39cd5
--- /dev/null
+++ b/hw/vfio/cpr.c
@@ -0,0 +1,94 @@
+/*
+ * Copyright (c) 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include <sys/ioctl.h>
+#include <linux/vfio.h>
+#include "hw/vfio/vfio-common.h"
+#include "sysemu/kvm.h"
+#include "qapi/error.h"
+#include "trace.h"
+
+static int
+vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
+        .iova = 0,
+        .size = 0,
+    };
+    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
+        return -errno;
+    }
+    return 0;
+}
+
+bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
+{
+    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
+        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
+        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
+                         "or VFIO_UNMAP_ALL");
+        return false;
+    } else {
+        return true;
+    }
+}
+
+/*
+ * Verify that all containers support CPR, and unmap all dma vaddr's.
+ */
+int vfio_cpr_save(Error **errp)
+{
+    ERRP_GUARD();
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            if (!vfio_is_cpr_capable(container, errp)) {
+                return -1;
+            }
+            if (vfio_dma_unmap_vaddr_all(container, errp)) {
+                return -1;
+            }
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Register the listener for each container, which causes its callback to be
+ * invoked for every flat section.  The callback will see that the container
+ * is reused, and call vfo_dma_map with the new vaddr.
+ */
+int vfio_cpr_load(Error **errp)
+{
+    VFIOAddressSpace *space;
+    VFIOContainer *container;
+    VFIOGroup *group;
+    VFIODevice *vbasedev;
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            if (!vfio_is_cpr_capable(container, errp)) {
+                return -1;
+            }
+            vfio_listener_register(container);
+            container->reused = false;
+        }
+    }
+    QLIST_FOREACH(group, &vfio_group_list, next) {
+        QLIST_FOREACH(vbasedev, &group->device_list, next) {
+            vbasedev->reused = false;
+        }
+    }
+    return 0;
+}
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af29..e247b2b 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -5,6 +5,7 @@ vfio_ss.add(files(
   'migration.c',
 ))
 vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
+  'cpr.c',
   'display.c',
   'pci-quirks.c',
   'pci.c',
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a90cce2..acac8a7 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -30,6 +30,7 @@
 #include "hw/qdev-properties-system.h"
 #include "migration/vmstate.h"
 #include "qapi/qmp/qdict.h"
+#include "migration/cpr.h"
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qemu/module.h"
@@ -2926,6 +2927,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         vfio_put_group(group);
         goto error;
     }
+    pdev->reused = vdev->vbasedev.reused;
 
     vfio_populate_device(vdev, &err);
     if (err) {
@@ -3195,6 +3197,11 @@ static void vfio_pci_reset(DeviceState *dev)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(dev);
 
+    /* Do not reset the device during qemu_system_reset prior to cpr-load */
+    if (vdev->pdev.reused) {
+        return;
+    }
+
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
     vfio_pci_pre_reset(vdev);
@@ -3302,6 +3309,75 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void vfio_merge_config(VFIOPCIDevice *vdev)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    int size = MIN(pci_config_size(pdev), vdev->config_size);
+    g_autofree uint8_t *phys_config = g_malloc(size);
+    uint32_t mask;
+    int ret, i;
+
+    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
+    if (ret < size) {
+        ret = ret < 0 ? errno : EFAULT;
+        error_report("failed to read device config space: %s", strerror(ret));
+        return;
+    }
+
+    for (i = 0; i < size; i++) {
+        mask = vdev->emulated_config_bits[i];
+        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
+    }
+}
+
+/*
+ * The kernel may change non-emulated config bits.  Exclude them from the
+ * changed-bits check in get_pci_config_device.
+ */
+static int vfio_pci_pre_load(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+    int size = MIN(pci_config_size(pdev), vdev->config_size);
+    int i;
+
+    for (i = 0; i < size; i++) {
+        pdev->cmask[i] &= vdev->emulated_config_bits[i];
+    }
+
+    return 0;
+}
+
+static int vfio_pci_post_load(void *opaque, int version_id)
+{
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+
+    vfio_merge_config(vdev);
+
+    pdev->reused = false;
+
+    return 0;
+}
+
+static bool vfio_pci_needed(void *opaque)
+{
+    return cpr_get_mode() == CPR_MODE_RESTART;
+}
+
+static const VMStateDescription vfio_pci_vmstate = {
+    .name = "vfio-pci",
+    .unmigratable = 1,
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .pre_load = vfio_pci_pre_load,
+    .post_load = vfio_pci_post_load,
+    .needed = vfio_pci_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3309,6 +3385,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     device_class_set_props(dc, vfio_pci_dev_properties);
+    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 0ef1b5f..63dd0fe 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
+vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index cc63dd4..8557e82 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -361,6 +361,7 @@ struct PCIDevice {
     /* ID of standby device in net_failover pair */
     char *failover_pair_id;
     uint32_t acpi_index;
+    bool reused;
 };
 
 void pci_register_bar(PCIDevice *pci_dev, int region_num,
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1641753..bc23c29 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -85,6 +85,7 @@ typedef struct VFIOContainer {
     Error *error;
     bool initialized;
     bool dirty_pages_supported;
+    bool reused;
     uint64_t dirty_pgsizes;
     uint64_t max_dirty_bitmap_size;
     unsigned long pgsizes;
@@ -136,6 +137,7 @@ typedef struct VFIODevice {
     bool no_mmap;
     bool ram_block_discard_allowed;
     bool enable_migration;
+    bool reused;
     VFIODeviceOps *ops;
     unsigned int num_irqs;
     unsigned int num_regions;
@@ -212,6 +214,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
                     VFIODevice *vbasedev, Error **errp);
+int vfio_cpr_save(Error **errp);
+int vfio_cpr_load(Error **errp);
+bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
@@ -236,6 +241,9 @@ struct vfio_info_cap_header *
 vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
 #endif
 extern const MemoryListener vfio_prereg_listener;
+void vfio_listener_register(VFIOContainer *container);
+void vfio_container_region_add(VFIOContainer *container,
+                               MemoryRegionSection *section);
 
 int vfio_spapr_create_window(VFIOContainer *container,
                              MemoryRegionSection *section,
diff --git a/include/migration/cpr.h b/include/migration/cpr.h
index a4da24e..a4007cf 100644
--- a/include/migration/cpr.h
+++ b/include/migration/cpr.h
@@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
 int cpr_state_load(Error **errp);
 void cpr_state_print(void);
 
+int cpr_vfio_save(Error **errp);
+int cpr_vfio_load(Error **errp);
+
 #endif
diff --git a/migration/cpr.c b/migration/cpr.c
index 37eca66..cee82cf 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -7,6 +7,7 @@
 
 #include "qemu/osdep.h"
 #include "exec/memory.h"
+#include "hw/vfio/vfio-common.h"
 #include "io/channel-buffer.h"
 #include "io/channel-file.h"
 #include "migration.h"
@@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
         error_setg(errp, "cpr-exec requires cpr-save with restart mode");
         return;
     }
-
+    if (cpr_vfio_save(errp)) {
+        return;
+    }
     cpr_walk_fd(preserve_fd, 0);
     if (cpr_state_save(errp)) {
         return;
@@ -139,6 +142,11 @@ void qmp_cpr_load(const char *filename, Error **errp)
         goto out;
     }
 
+    if (cpr_get_mode() == CPR_MODE_RESTART &&
+        cpr_vfio_load(errp)) {
+        goto out;
+    }
+
     state = global_state_get_runstate();
     if (state == RUN_STATE_RUNNING) {
         vm_start();
diff --git a/migration/target.c b/migration/target.c
index 4390bf0..984bc9e 100644
--- a/migration/target.c
+++ b/migration/target.c
@@ -8,6 +8,7 @@
 #include "qemu/osdep.h"
 #include "qapi/qapi-types-migration.h"
 #include "migration.h"
+#include "migration/cpr.h"
 #include CONFIG_DEVICES
 
 #ifdef CONFIG_VFIO
@@ -22,8 +23,21 @@ void populate_vfio_info(MigrationInfo *info)
         info->vfio->transferred = vfio_mig_bytes_transferred();
     }
 }
+
+int cpr_vfio_save(Error **errp)
+{
+    return vfio_cpr_save(errp);
+}
+
+int cpr_vfio_load(Error **errp)
+{
+    return vfio_cpr_load(errp);
+}
+
 #else
 
 void populate_vfio_info(MigrationInfo *info) {}
+int cpr_vfio_save(Error **errp) { return 0; }
+int cpr_vfio_load(Error **errp) { return 0; }
 
 #endif /* CONFIG_VFIO */
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 20/29] vfio-pci: cpr part 2 (msi)
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (18 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 21/29] vfio-pci: cpr part 3 (intx) Steve Sistare
                   ` (9 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Finish cpr for vfio-pci MSI/MSI-X devices by preserving eventfd's and
vector state.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 102 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index acac8a7..abef9b2 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -49,17 +49,55 @@
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
 
+#define EVENT_FD_NAME(vdev, name)   \
+    g_strdup_printf("%s_%s", (vdev)->vbasedev.name, (name))
+
+static int save_event_fd(VFIOPCIDevice *vdev, const char *name, int nr,
+                         EventNotifier *ev)
+{
+    int fd = event_notifier_get_fd(ev);
+
+    if (fd >= 0) {
+        g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+        int old_fd = cpr_find_fd(fdname, nr);
+        if (old_fd < 0) {
+            cpr_save_fd(fdname, nr, fd);
+        } else if (old_fd != fd) {
+            error_report("fd %s %d already saved with a different value %d",
+                         name, fd, old_fd);
+            return 1;
+        }
+    }
+    return 0;
+}
+
+static int load_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+    int fd = cpr_find_fd(fdname, nr);
+    return fd;
+}
+
+static void delete_event_fd(VFIOPCIDevice *vdev, const char *name, int nr)
+{
+    g_autofree char *fdname = EVENT_FD_NAME(vdev, name);
+    cpr_delete_fd(fdname, nr);
+}
+
 /* Create new or reuse existing eventfd */
 static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
                               const char *name, int nr)
 {
-    int fd = -1;   /* placeholder until a subsequent patch */
     int ret = 0;
+    int fd = load_event_fd(vdev, name, nr);
 
     if (fd >= 0) {
         event_notifier_init_fd(e, fd);
     } else {
         ret = event_notifier_init(e, 0);
+        if (!ret) {
+            save_event_fd(vdev, name, nr, e);
+        }
     }
     return ret;
 }
@@ -67,6 +105,7 @@ static int vfio_notifier_init(VFIOPCIDevice *vdev, EventNotifier *e,
 static void vfio_notifier_cleanup(VFIOPCIDevice *vdev, EventNotifier *e,
                                   const char *name, int nr)
 {
+    delete_event_fd(vdev, name, nr);
     event_notifier_cleanup(e);
 }
 
@@ -2736,6 +2775,11 @@ static void vfio_register_err_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->err_notifier);
     qemu_set_fd_handler(fd, vfio_err_notifier_handler, NULL, vdev);
 
+    /* Do not alter irq_signaling during vfio_realize for cpr */
+    if (vdev->pdev.reused) {
+        return;
+    }
+
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_ERR_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -2801,6 +2845,12 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
     fd = event_notifier_get_fd(&vdev->req_notifier);
     qemu_set_fd_handler(fd, vfio_req_notifier_handler, NULL, vdev);
 
+    /* Do not alter irq_signaling during vfio_realize for cpr */
+    if (vdev->pdev.reused) {
+        vdev->req_enabled = true;
+        return;
+    }
+
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_REQ_IRQ_INDEX, 0,
                            VFIO_IRQ_SET_ACTION_TRIGGER, fd, &err)) {
         error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
@@ -3330,6 +3380,40 @@ static void vfio_merge_config(VFIOPCIDevice *vdev)
     }
 }
 
+static void vfio_claim_vectors(VFIOPCIDevice *vdev, int nr_vectors, bool msix)
+{
+    int i, fd;
+    bool pending = false;
+    PCIDevice *pdev = &vdev->pdev;
+
+    vdev->nr_vectors = nr_vectors;
+    vdev->msi_vectors = g_new0(VFIOMSIVector, nr_vectors);
+    vdev->interrupt = msix ? VFIO_INT_MSIX : VFIO_INT_MSI;
+
+    for (i = 0; i < nr_vectors; i++) {
+        VFIOMSIVector *vector = &vdev->msi_vectors[i];
+
+        fd = load_event_fd(vdev, "interrupt", i);
+        if (fd >= 0) {
+            vfio_vector_init(vdev, i);
+            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL, vector);
+        }
+
+        if (load_event_fd(vdev, "kvm_interrupt", i) >= 0) {
+            vfio_add_kvm_msi_virq(vdev, vector, i, msix);
+        }
+
+        if (msix && msix_is_pending(pdev, i) && msix_is_masked(pdev, i)) {
+            set_bit(i, vdev->msix->pending);
+            pending = true;
+        }
+    }
+
+    if (msix) {
+        memory_region_set_enabled(&pdev->msix_pba_mmio, pending);
+    }
+}
+
 /*
  * The kernel may change non-emulated config bits.  Exclude them from the
  * changed-bits check in get_pci_config_device.
@@ -3352,9 +3436,24 @@ static int vfio_pci_post_load(void *opaque, int version_id)
 {
     VFIOPCIDevice *vdev = opaque;
     PCIDevice *pdev = &vdev->pdev;
+    int nr_vectors;
 
     vfio_merge_config(vdev);
 
+    if (msix_enabled(pdev)) {
+        nr_vectors = vdev->msix->entries;
+        vfio_claim_vectors(vdev, nr_vectors, true);
+        msix_init_vector_notifiers(pdev, vfio_msix_vector_use,
+                                   vfio_msix_vector_release, NULL);
+
+    } else if (msi_enabled(pdev)) {
+        nr_vectors = msi_nr_vectors_allocated(pdev);
+        vfio_claim_vectors(vdev, nr_vectors, false);
+
+    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+        assert(0);      /* completed in a subsequent patch */
+    }
+
     pdev->reused = false;
 
     return 0;
@@ -3374,6 +3473,8 @@ static const VMStateDescription vfio_pci_vmstate = {
     .post_load = vfio_pci_post_load,
     .needed = vfio_pci_needed,
     .fields = (VMStateField[]) {
+        VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
+        VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
         VMSTATE_END_OF_LIST()
     }
 };
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 21/29] vfio-pci: cpr part 3 (intx)
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (19 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 20/29] vfio-pci: cpr part 2 (msi) Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 22/29] vfio-pci: recover from unmap-all-vaddr failure Steve Sistare
                   ` (8 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Preserve vfio INTX state across cpr restart.  Preserve VFIOINTx fields as
follows:
  pin : Recover this from the vfio config in kernel space
  interrupt : Preserve its eventfd descriptor across exec.
  unmask : Ditto
  route.irq : This could perhaps be recovered in vfio_pci_post_load by
    calling pci_device_route_intx_to_irq(pin), whose implementation reads
    config space for a bridge device such as ich9.  However, there is no
    guarantee that the bridge vmstate is read before vfio vmstate.  Rather
    than fiddling with MigrationPriority for vmstate handlers, explicitly
    save route.irq in vfio vmstate.
  pending : save in vfio vmstate.
  mmap_timeout, mmap_timer : Re-initialize
  bool kvm_accel : Re-initialize

In vfio_realize, defer calling vfio_intx_enable until the vmstate
is available, in vfio_pci_post_load.  Modify vfio_intx_enable and
vfio_intx_kvm_enable to skip vfio initialization, but still perform
kvm initialization.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/pci.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 83 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index abef9b2..e32513c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -171,14 +171,45 @@ static void vfio_intx_eoi(VFIODevice *vbasedev)
     vfio_unmask_single_irqindex(vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
 }
 
+#ifdef CONFIG_KVM
+static bool vfio_no_kvm_intx(VFIOPCIDevice *vdev)
+{
+    return vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
+           vdev->intx.route.mode != PCI_INTX_ENABLED ||
+           !kvm_resamplefds_enabled();
+}
+#endif
+
+static void vfio_intx_reenable_kvm(VFIOPCIDevice *vdev, Error **errp)
+{
+#ifdef CONFIG_KVM
+    if (vfio_no_kvm_intx(vdev)) {
+        return;
+    }
+
+    if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
+        error_setg(errp, "vfio_notifier_init intx-unmask failed");
+        return;
+    }
+
+    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
+                                           &vdev->intx.interrupt,
+                                           &vdev->intx.unmask,
+                                           vdev->intx.route.irq)) {
+        error_setg_errno(errp, errno, "failed to setup resample irqfd");
+        return;
+    }
+
+    vdev->intx.kvm_accel = true;
+#endif
+}
+
 static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
 {
 #ifdef CONFIG_KVM
     int irq_fd = event_notifier_get_fd(&vdev->intx.interrupt);
 
-    if (vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
-        vdev->intx.route.mode != PCI_INTX_ENABLED ||
-        !kvm_resamplefds_enabled()) {
+    if (vfio_no_kvm_intx(vdev)) {
         return;
     }
 
@@ -326,7 +357,13 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
         return 0;
     }
 
-    vfio_disable_interrupts(vdev);
+    /*
+     * Do not alter interrupt state during vfio_realize and cpr-load.  The
+     * reused flag is cleared thereafter.
+     */
+    if (!vdev->pdev.reused) {
+        vfio_disable_interrupts(vdev);
+    }
 
     vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
     pci_config_set_interrupt_pin(vdev->pdev.config, pin);
@@ -351,6 +388,11 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
     fd = event_notifier_get_fd(&vdev->intx.interrupt);
     qemu_set_fd_handler(fd, vfio_intx_interrupt, NULL, vdev);
 
+    if (vdev->pdev.reused) {
+        vfio_intx_reenable_kvm(vdev, &err);
+        goto finish;
+    }
+
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_TRIGGER, fd, errp)) {
         qemu_set_fd_handler(fd, NULL, NULL, vdev);
@@ -363,6 +405,7 @@ static int vfio_intx_enable(VFIOPCIDevice *vdev, Error **errp)
         warn_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
     }
 
+finish:
     vdev->interrupt = VFIO_INT_INTx;
 
     trace_vfio_intx_enable(vdev->vbasedev.name);
@@ -3140,9 +3183,13 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
                                              vfio_intx_routing_notifier);
         vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
         kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
-        ret = vfio_intx_enable(vdev, errp);
-        if (ret) {
-            goto out_deregister;
+
+        /* Wait until cpr-load reads intx routing data to enable */
+        if (!pdev->reused) {
+            ret = vfio_intx_enable(vdev, errp);
+            if (ret) {
+                goto out_deregister;
+            }
         }
     }
 
@@ -3437,6 +3484,7 @@ static int vfio_pci_post_load(void *opaque, int version_id)
     VFIOPCIDevice *vdev = opaque;
     PCIDevice *pdev = &vdev->pdev;
     int nr_vectors;
+    int ret = 0;
 
     vfio_merge_config(vdev);
 
@@ -3451,12 +3499,37 @@ static int vfio_pci_post_load(void *opaque, int version_id)
         vfio_claim_vectors(vdev, nr_vectors, false);
 
     } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
-        assert(0);      /* completed in a subsequent patch */
+        Error *err = 0;
+        ret = vfio_intx_enable(vdev, &err);
+        if (ret) {
+            error_report_err(err);
+        }
     }
 
     pdev->reused = false;
 
-    return 0;
+    return ret;
+}
+
+static const VMStateDescription vfio_intx_vmstate = {
+    .name = "vfio-intx",
+    .unmigratable = 1,
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .fields = (VMStateField[]) {
+        VMSTATE_BOOL(pending, VFIOINTx),
+        VMSTATE_UINT32(route.mode, VFIOINTx),
+        VMSTATE_INT32(route.irq, VFIOINTx),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+#define VMSTATE_VFIO_INTX(_field, _state) {                         \
+    .name       = (stringify(_field)),                              \
+    .size       = sizeof(VFIOINTx),                                 \
+    .vmsd       = &vfio_intx_vmstate,                               \
+    .flags      = VMS_STRUCT,                                       \
+    .offset     = vmstate_offset_value(_state, _field, VFIOINTx),   \
 }
 
 static bool vfio_pci_needed(void *opaque)
@@ -3475,6 +3548,7 @@ static const VMStateDescription vfio_pci_vmstate = {
     .fields = (VMStateField[]) {
         VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
         VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
+        VMSTATE_VFIO_INTX(intx, VFIOPCIDevice),
         VMSTATE_END_OF_LIST()
     }
 };
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 22/29] vfio-pci: recover from unmap-all-vaddr failure
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (20 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 21/29] vfio-pci: cpr part 3 (intx) Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 23/29] vhost: reset vhost devices for cpr Steve Sistare
                   ` (7 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

If vfio_cpr_save fails to unmap all vaddr's, then recover by walking all
flat sections to restore the vaddr for each.  Do so by invoking the
vfio listener callback, and passing a new "replay" flag that tells it
to replay a mapping without re-allocating new userland data structures.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/vfio/common.c              | 65 ++++++++++++++++++++++++++++++++-----------
 hw/vfio/cpr.c                 | 41 +++++++++++++++++++++++++--
 include/hw/vfio/vfio-common.h |  2 +-
 3 files changed, 88 insertions(+), 20 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 90f66ad..f2b4a81 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -878,15 +878,35 @@ static void vfio_unregister_ram_discard_listener(VFIOContainer *container,
     g_free(vrdl);
 }
 
+static VFIORamDiscardListener *vfio_find_ram_discard_listener(
+    VFIOContainer *container, MemoryRegionSection *section)
+{
+    VFIORamDiscardListener *vrdl = NULL;
+
+    QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
+        if (vrdl->mr == section->mr &&
+            vrdl->offset_within_address_space ==
+            section->offset_within_address_space) {
+            break;
+        }
+    }
+
+    if (!vrdl) {
+        hw_error("vfio: Trying to sync missing RAM discard listener");
+        /* does not return */
+    }
+    return vrdl;
+}
+
 static void vfio_listener_region_add(MemoryListener *listener,
                                      MemoryRegionSection *section)
 {
     VFIOContainer *container = container_of(listener, VFIOContainer, listener);
-    vfio_container_region_add(container, section);
+    vfio_container_region_add(container, section, false);
 }
 
 void vfio_container_region_add(VFIOContainer *container,
-                               MemoryRegionSection *section)
+                               MemoryRegionSection *section, bool replay)
 {
     hwaddr iova, end;
     Int128 llend, llsize;
@@ -1009,6 +1029,22 @@ void vfio_container_region_add(VFIOContainer *container,
 
         trace_vfio_listener_region_add_iommu(iova, end);
 
+        if (replay) {
+            hwaddr as_offset = section->offset_within_address_space;
+            hwaddr iommu_offset = as_offset - section->offset_within_region;
+
+            QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+                if (giommu->iommu == iommu_mr &&
+                    giommu->iommu_offset == iommu_offset) {
+                    memory_region_iommu_replay(giommu->iommu, &giommu->n);
+                    return;
+                }
+            }
+            error_report("Container cannot find iommu region %s offset %lx",
+                memory_region_name(section->mr), iommu_offset);
+            goto fail;
+        }
+
         /*
          * FIXME: For VFIO iommu types which have KVM acceleration to
          * avoid bouncing all map/unmaps through qemu this way, this
@@ -1059,7 +1095,15 @@ void vfio_container_region_add(VFIOContainer *container,
      * about changes.
      */
     if (memory_region_has_ram_discard_manager(section->mr)) {
-        vfio_register_ram_discard_listener(container, section);
+        if (replay)  {
+            VFIORamDiscardListener *vrdl =
+                vfio_find_ram_discard_listener(container, section);
+            if (vfio_ram_discard_notify_populate(&vrdl->listener, section)) {
+                error_report("ram_discard_manager_replay_populated failed");
+            }
+        } else {
+            vfio_register_ram_discard_listener(container, section);
+        }
         return;
     }
 
@@ -1385,19 +1429,8 @@ static int vfio_sync_ram_discard_listener_dirty_bitmap(VFIOContainer *container,
                                                    MemoryRegionSection *section)
 {
     RamDiscardManager *rdm = memory_region_get_ram_discard_manager(section->mr);
-    VFIORamDiscardListener *vrdl = NULL;
-
-    QLIST_FOREACH(vrdl, &container->vrdl_list, next) {
-        if (vrdl->mr == section->mr &&
-            vrdl->offset_within_address_space ==
-            section->offset_within_address_space) {
-            break;
-        }
-    }
-
-    if (!vrdl) {
-        hw_error("vfio: Trying to sync missing RAM discard listener");
-    }
+    VFIORamDiscardListener *vrdl =
+        vfio_find_ram_discard_listener(container, section);
 
     /*
      * We only want/can synchronize the bitmap for actually mapped parts -
diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index 2c39cd5..ea673ea 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -29,6 +29,14 @@ vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
     return 0;
 }
 
+static int
+vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
+{
+    VFIOContainer *container = handle;
+    vfio_container_region_add(container, section, true);
+    return 0;
+}
+
 bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
 {
     if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
@@ -48,20 +56,47 @@ int vfio_cpr_save(Error **errp)
 {
     ERRP_GUARD();
     VFIOAddressSpace *space;
-    VFIOContainer *container;
+    VFIOContainer *container, *last_container;
 
     QLIST_FOREACH(space, &vfio_address_spaces, list) {
         QLIST_FOREACH(container, &space->containers, next) {
             if (!vfio_is_cpr_capable(container, errp)) {
                 return -1;
             }
+        }
+    }
+
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
             if (vfio_dma_unmap_vaddr_all(container, errp)) {
-                return -1;
+                goto unwind;
             }
         }
     }
-
     return 0;
+
+unwind:
+    last_container = container;
+    QLIST_FOREACH(space, &vfio_address_spaces, list) {
+        QLIST_FOREACH(container, &space->containers, next) {
+            Error *err;
+
+            if (container == last_container) {
+                break;
+            }
+
+            /* Set reused so vfio_dma_map restores vaddr */
+            container->reused = true;
+            if (address_space_flat_for_each_section(space->as,
+                                                    vfio_region_remap,
+                                                    container, &err)) {
+                error_prepend(errp, "%s", error_get_pretty(err));
+                error_free(err);
+            }
+            container->reused = false;
+        }
+    }
+    return -1;
 }
 
 /*
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index bc23c29..af960dc 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -243,7 +243,7 @@ vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
 extern const MemoryListener vfio_prereg_listener;
 void vfio_listener_register(VFIOContainer *container);
 void vfio_container_region_add(VFIOContainer *container,
-                               MemoryRegionSection *section);
+                               MemoryRegionSection *section, bool replay);
 
 int vfio_spapr_create_window(VFIOContainer *container,
                              MemoryRegionSection *section,
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 23/29] vhost: reset vhost devices for cpr
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (21 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 22/29] vfio-pci: recover from unmap-all-vaddr failure Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 24/29] loader: suppress rom_reset during cpr Steve Sistare
                   ` (6 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

A vhost device is implicitly preserved across re-exec because its fd is not
closed, and the value of the fd is specified on the command line for the
new qemu to find.  However, new qemu issues an VHOST_RESET_OWNER ioctl,
which fails because the device already has an owner.  To fix, reset the
owner prior to exec.

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/virtio/vhost.c         | 11 +++++++++++
 include/hw/virtio/vhost.h |  1 +
 migration/cpr.c           |  2 ++
 3 files changed, 14 insertions(+)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 20913cf..35d0836 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1853,6 +1853,17 @@ void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev)
     hdev->vdev = NULL;
 }
 
+void vhost_dev_reset_all(void)
+{
+    struct vhost_dev *dev;
+
+    QLIST_FOREACH(dev, &vhost_devices, entry) {
+        if (dev->vhost_ops->vhost_reset_device(dev) < 0) {
+            VHOST_OPS_DEBUG("vhost_reset_device failed");
+        }
+    }
+}
+
 int vhost_net_set_backend(struct vhost_dev *hdev,
                           struct vhost_vring_file *file)
 {
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 58a73e7..d436eba 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -114,6 +114,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
 void vhost_dev_cleanup(struct vhost_dev *hdev);
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev);
+void vhost_dev_reset_all(void);
 int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
 void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev);
 
diff --git a/migration/cpr.c b/migration/cpr.c
index cee82cf..4229c17 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -8,6 +8,7 @@
 #include "qemu/osdep.h"
 #include "exec/memory.h"
 #include "hw/vfio/vfio-common.h"
+#include "hw/virtio/vhost.h"
 #include "io/channel-buffer.h"
 #include "io/channel-file.h"
 #include "migration.h"
@@ -109,6 +110,7 @@ void qmp_cpr_exec(strList *args, Error **errp)
     if (cpr_state_save(errp)) {
         return;
     }
+    vhost_dev_reset_all();
     qemu_system_exec_request(args);
 }
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 24/29] loader: suppress rom_reset during cpr
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (22 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 23/29] vhost: reset vhost devices for cpr Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 25/29] chardev: cpr framework Steve Sistare
                   ` (5 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Reported-by: Zheng Chuan <zhengchuan@huawei.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/core/loader.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/core/loader.c b/hw/core/loader.c
index 052a0fd..e88fab2 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -52,6 +52,7 @@
 #include "hw/hw.h"
 #include "disas/disas.h"
 #include "migration/vmstate.h"
+#include "migration/cpr.h"
 #include "monitor/monitor.h"
 #include "sysemu/reset.h"
 #include "sysemu/sysemu.h"
@@ -1137,6 +1138,7 @@ int rom_add_option(const char *file, int32_t bootindex)
 static void rom_reset(void *unused)
 {
     Rom *rom;
+    bool cpr_is_active = (cpr_get_mode() != CPR_MODE_NONE);
 
     QTAILQ_FOREACH(rom, &roms, next) {
         if (rom->fw_file) {
@@ -1147,7 +1149,7 @@ static void rom_reset(void *unused)
          * the data in during the next incoming migration in all cases.  Note
          * that some of those RAMs can actually be modified by the guest.
          */
-        if (runstate_check(RUN_STATE_INMIGRATE)) {
+        if (runstate_check(RUN_STATE_INMIGRATE) || cpr_is_active) {
             if (rom->data && rom->isrom) {
                 /*
                  * Free it so that a rom_reset after migration doesn't
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 25/29] chardev: cpr framework
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (23 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 24/29] loader: suppress rom_reset during cpr Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 26/29] chardev: cpr for simple devices Steve Sistare
                   ` (4 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add QEMU_CHAR_FEATURE_CPR for devices that support cpr.
Add the chardev reopen-on-cpr option for devices that can be closed on cpr
and reopened after exec.
cpr is allowed only if either QEMU_CHAR_FEATURE_CPR or reopen-on-cpr is set
for all chardevs in the configuration.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char.c         | 45 ++++++++++++++++++++++++++++++++++++++++++---
 include/chardev/char.h |  5 +++++
 migration/cpr.c        |  1 +
 qapi/char.json         |  7 ++++++-
 qemu-options.hx        | 26 ++++++++++++++++++++++----
 5 files changed, 76 insertions(+), 8 deletions(-)

diff --git a/chardev/char.c b/chardev/char.c
index 0169d8d..230bf16 100644
--- a/chardev/char.c
+++ b/chardev/char.c
@@ -36,6 +36,7 @@
 #include "qemu/help_option.h"
 #include "qemu/module.h"
 #include "qemu/option.h"
+#include "migration/cpr.h"
 #include "qemu/id.h"
 #include "qemu/coroutine.h"
 #include "qemu/yank.h"
@@ -240,15 +241,24 @@ static void qemu_char_open(Chardev *chr, ChardevBackend *backend,
     /* Any ChardevCommon member would work */
     ChardevCommon *common = backend ? backend->u.null.data : NULL;
 
+    chr->reopen_on_cpr = (common && common->reopen_on_cpr);
+
     if (common && common->has_logfile) {
         int flags = O_WRONLY;
+        g_autofree char *fdname = g_strdup_printf("%s_log", chr->label);
         if (common->has_logappend &&
             common->logappend) {
             flags |= O_APPEND;
         } else {
             flags |= O_TRUNC;
         }
-        chr->logfd = qemu_create(common->logfile, flags, 0666, errp);
+        chr->logfd = cpr_find_fd(fdname, 0);
+        if (chr->logfd < 0) {
+            chr->logfd = qemu_create(common->logfile, flags, 0666, errp);
+            if (!chr->reopen_on_cpr) {
+                cpr_save_fd(fdname, 0, chr->logfd);
+            }
+        }
         if (chr->logfd < 0) {
             return;
         }
@@ -297,11 +307,15 @@ static void char_finalize(Object *obj)
     if (chr->be) {
         chr->be->chr = NULL;
     }
-    g_free(chr->filename);
-    g_free(chr->label);
     if (chr->logfd != -1) {
+        g_autofree char *fdname = g_strdup_printf("%s_log", chr->label);
+        if (!chr->reopen_on_cpr) {
+            cpr_delete_fd(fdname, 0);
+        }
         close(chr->logfd);
     }
+    g_free(chr->filename);
+    g_free(chr->label);
     qemu_mutex_destroy(&chr->chr_write_lock);
 }
 
@@ -501,6 +515,8 @@ void qemu_chr_parse_common(QemuOpts *opts, ChardevCommon *backend)
 
     backend->has_logappend = true;
     backend->logappend = qemu_opt_get_bool(opts, "logappend", false);
+
+    backend->reopen_on_cpr = qemu_opt_get_bool(opts, "reopen-on-cpr", false);
 }
 
 static const ChardevClass *char_get_class(const char *driver, Error **errp)
@@ -942,6 +958,9 @@ QemuOptsList qemu_chardev_opts = {
         },{
             .name = "abstract",
             .type = QEMU_OPT_BOOL,
+        },{
+            .name = "reopen-on-cpr",
+            .type = QEMU_OPT_BOOL,
 #endif
         },
         { /* end of list */ }
@@ -1217,6 +1236,26 @@ GSource *qemu_chr_timeout_add_ms(Chardev *chr, guint ms,
     return source;
 }
 
+static int chr_cpr_capable(Object *obj, void *opaque)
+{
+    Chardev *chr = (Chardev *)obj;
+    Error **errp = opaque;
+
+    if (qemu_chr_has_feature(chr, QEMU_CHAR_FEATURE_CPR) ||
+        chr->reopen_on_cpr) {
+        return 0;
+    }
+    error_setg(errp,
+               "chardev %s -> %s is not capable of cpr. See reopen-on-cpr",
+               chr->label, chr->filename);
+    return -1;
+}
+
+bool qemu_chr_is_cpr_capable(Error **errp)
+{
+    return !object_child_foreach(get_chardevs_root(), chr_cpr_capable, errp);
+}
+
 void qemu_chr_cleanup(void)
 {
     object_unparent(get_chardevs_root());
diff --git a/include/chardev/char.h b/include/chardev/char.h
index a319b5f..299e129 100644
--- a/include/chardev/char.h
+++ b/include/chardev/char.h
@@ -50,6 +50,8 @@ typedef enum {
     /* Whether the gcontext can be changed after calling
      * qemu_chr_be_update_read_handlers() */
     QEMU_CHAR_FEATURE_GCONTEXT,
+    /* Whether the device supports cpr */
+    QEMU_CHAR_FEATURE_CPR,
 
     QEMU_CHAR_FEATURE_LAST,
 } ChardevFeature;
@@ -67,6 +69,7 @@ struct Chardev {
     int be_open;
     /* used to coordinate the chardev-change special-case: */
     bool handover_yank_instance;
+    bool reopen_on_cpr;
     GSource *gsource;
     GMainContext *gcontext;
     DECLARE_BITMAP(features, QEMU_CHAR_FEATURE_LAST);
@@ -323,4 +326,6 @@ void resume_mux_open(void);
 /* console.c */
 void qemu_chr_parse_vc(QemuOpts *opts, ChardevBackend *backend, Error **errp);
 
+bool qemu_chr_is_cpr_capable(Error **errp);
+
 #endif
diff --git a/migration/cpr.c b/migration/cpr.c
index 4229c17..3bda83e 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -6,6 +6,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "chardev/char.h"
 #include "exec/memory.h"
 #include "hw/vfio/vfio-common.h"
 #include "hw/virtio/vhost.h"
diff --git a/qapi/char.json b/qapi/char.json
index 7b42151..dfa6baf 100644
--- a/qapi/char.json
+++ b/qapi/char.json
@@ -204,12 +204,17 @@
 # @logfile: The name of a logfile to save output
 # @logappend: true to append instead of truncate
 #             (default to false to truncate)
+# @reopen-on-cpr: if true, close device's fd on cpr-save and reopen it after
+#                 cpr-exec. Set this to allow CPR on a device that does not
+#                 support QEMU_CHAR_FEATURE_CPR. defaults to false.
+#                 since 6.2.
 #
 # Since: 2.6
 ##
 { 'struct': 'ChardevCommon',
   'data': { '*logfile': 'str',
-            '*logappend': 'bool' } }
+            '*logappend': 'bool',
+            '*reopen-on-cpr': 'bool' } }
 
 ##
 # @ChardevFile:
diff --git a/qemu-options.hx b/qemu-options.hx
index 33c8173..1859b55 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -3227,43 +3227,57 @@ DEFHEADING(Character device options:)
 
 DEF("chardev", HAS_ARG, QEMU_OPTION_chardev,
     "-chardev help\n"
-    "-chardev null,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "-chardev null,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off][,reopen-on-cpr=on|off]\n"
     "-chardev socket,id=id[,host=host],port=port[,to=to][,ipv4=on|off][,ipv6=on|off][,nodelay=on|off]\n"
     "         [,server=on|off][,wait=on|off][,telnet=on|off][,websocket=on|off][,reconnect=seconds][,mux=on|off]\n"
-    "         [,logfile=PATH][,logappend=on|off][,tls-creds=ID][,tls-authz=ID] (tcp)\n"
+    "         [,logfile=PATH][,logappend=on|off][,tls-creds=ID][,tls-authz=ID][,reopen-on-cpr=on|off] (tcp)\n"
     "-chardev socket,id=id,path=path[,server=on|off][,wait=on|off][,telnet=on|off][,websocket=on|off][,reconnect=seconds]\n"
-    "         [,mux=on|off][,logfile=PATH][,logappend=on|off][,abstract=on|off][,tight=on|off] (unix)\n"
+    "         [,mux=on|off][,logfile=PATH][,logappend=on|off][,abstract=on|off][,tight=on|off][,reopen-on-cpr=on|off] (unix)\n"
     "-chardev udp,id=id[,host=host],port=port[,localaddr=localaddr]\n"
     "         [,localport=localport][,ipv4=on|off][,ipv6=on|off][,mux=on|off]\n"
-    "         [,logfile=PATH][,logappend=on|off]\n"
+    "         [,logfile=PATH][,logappend=on|off][,reopen-on-cpr=on|off]\n"
     "-chardev msmouse,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev vc,id=id[[,width=width][,height=height]][[,cols=cols][,rows=rows]]\n"
     "         [,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev ringbuf,id=id[,size=size][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev file,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev pipe,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #ifdef _WIN32
     "-chardev console,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
     "-chardev serial,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
 #else
     "-chardev pty,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev stdio,id=id[,mux=on|off][,signal=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #ifdef CONFIG_BRLAPI
     "-chardev braille,id=id[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #if defined(__linux__) || defined(__sun__) || defined(__FreeBSD__) \
         || defined(__NetBSD__) || defined(__OpenBSD__) || defined(__DragonFly__)
     "-chardev serial,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev tty,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #if defined(__linux__) || defined(__FreeBSD__) || defined(__DragonFly__)
     "-chardev parallel,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev parport,id=id,path=path[,mux=on|off][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
 #if defined(CONFIG_SPICE)
     "-chardev spicevmc,id=id,name=name[,debug=debug][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
     "-chardev spiceport,id=id,name=name[,debug=debug][,logfile=PATH][,logappend=on|off]\n"
+    "         [,reopen-on-cpr=on|off]\n"
 #endif
     , QEMU_ARCH_ALL
 )
@@ -3338,6 +3352,10 @@ The general form of a character device option is:
     ``logappend`` option controls whether the log file will be truncated
     or appended to when opened.
 
+    Every backend supports the ``reopen-on-cpr`` option.  If on, the
+    devices's descriptor is closed during cpr-save, and reopened after exec.
+    This is useful for devices that do not support cpr.
+
 The available backends are:
 
 ``-chardev null,id=id``
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 26/29] chardev: cpr for simple devices
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (24 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 25/29] chardev: cpr framework Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 27/29] chardev: cpr for pty Steve Sistare
                   ` (3 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Set QEMU_CHAR_FEATURE_CPR for devices that trivially support cpr.
char-stdio is slightly less trivial.  Allow the gdb server by
closing it on exec.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-mux.c     | 1 +
 chardev/char-null.c    | 1 +
 chardev/char-serial.c  | 1 +
 chardev/char-stdio.c   | 8 ++++++++
 gdbstub.c              | 1 +
 include/chardev/char.h | 1 +
 migration/cpr.c        | 1 +
 7 files changed, 14 insertions(+)

diff --git a/chardev/char-mux.c b/chardev/char-mux.c
index ee2d47b..d47fa31 100644
--- a/chardev/char-mux.c
+++ b/chardev/char-mux.c
@@ -337,6 +337,7 @@ static void qemu_chr_open_mux(Chardev *chr,
      */
     *be_opened = muxes_opened;
     qemu_chr_fe_init(&d->chr, drv, errp);
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
 }
 
 static void qemu_chr_parse_mux(QemuOpts *opts, ChardevBackend *backend,
diff --git a/chardev/char-null.c b/chardev/char-null.c
index 1c6a290..02acaff 100644
--- a/chardev/char-null.c
+++ b/chardev/char-null.c
@@ -32,6 +32,7 @@ static void null_chr_open(Chardev *chr,
                           Error **errp)
 {
     *be_opened = false;
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
 }
 
 static void char_null_class_init(ObjectClass *oc, void *data)
diff --git a/chardev/char-serial.c b/chardev/char-serial.c
index 7c3d84a..b585085 100644
--- a/chardev/char-serial.c
+++ b/chardev/char-serial.c
@@ -274,6 +274,7 @@ static void qmp_chardev_open_serial(Chardev *chr,
     qemu_set_nonblock(fd);
     tty_serial_init(fd, 115200, 'N', 8, 1);
 
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
     qemu_chr_open_fd(chr, fd, fd);
 }
 #endif /* __linux__ || __sun__ */
diff --git a/chardev/char-stdio.c b/chardev/char-stdio.c
index 403da30..9410c16 100644
--- a/chardev/char-stdio.c
+++ b/chardev/char-stdio.c
@@ -114,9 +114,17 @@ static void qemu_chr_open_stdio(Chardev *chr,
 
     stdio_allow_signal = !opts->has_signal || opts->signal;
     qemu_chr_set_echo_stdio(chr, false);
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
 }
 #endif
 
+void qemu_term_exit(void)
+{
+#ifndef _WIN32
+    term_exit();
+#endif
+}
+
 static void qemu_chr_parse_stdio(QemuOpts *opts, ChardevBackend *backend,
                                  Error **errp)
 {
diff --git a/gdbstub.c b/gdbstub.c
index 3c14c6a..137deeb 100644
--- a/gdbstub.c
+++ b/gdbstub.c
@@ -3569,6 +3569,7 @@ int gdbserver_start(const char *device)
         mon_chr = gdbserver_state.mon_chr;
         reset_gdbserver_state();
     }
+    mon_chr->reopen_on_cpr = true;
 
     create_processes(&gdbserver_state);
 
diff --git a/include/chardev/char.h b/include/chardev/char.h
index 299e129..fc24d28 100644
--- a/include/chardev/char.h
+++ b/include/chardev/char.h
@@ -327,5 +327,6 @@ void resume_mux_open(void);
 void qemu_chr_parse_vc(QemuOpts *opts, ChardevBackend *backend, Error **errp);
 
 bool qemu_chr_is_cpr_capable(Error **errp);
+void qemu_term_exit(void);
 
 #endif
diff --git a/migration/cpr.c b/migration/cpr.c
index 3bda83e..eb8ce2a 100644
--- a/migration/cpr.c
+++ b/migration/cpr.c
@@ -112,6 +112,7 @@ void qmp_cpr_exec(strList *args, Error **errp)
         return;
     }
     vhost_dev_reset_all();
+    qemu_term_exit();
     qemu_system_exec_request(args);
 }
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 27/29] chardev: cpr for pty
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (25 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 26/29] chardev: cpr for simple devices Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2021-12-22 19:05 ` [PATCH V7 28/29] chardev: cpr for sockets Steve Sistare
                   ` (2 subsequent siblings)
  29 siblings, 0 replies; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Save and restore pty descriptors across cpr-save and cpr-load.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-pty.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/chardev/char-pty.c b/chardev/char-pty.c
index a2d1e7c..9801a4f 100644
--- a/chardev/char-pty.c
+++ b/chardev/char-pty.c
@@ -30,6 +30,7 @@
 #include "qemu/sockets.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
+#include "migration/cpr.h"
 #include "qemu/qemu-print.h"
 
 #include "chardev/char-io.h"
@@ -191,6 +192,9 @@ static void char_pty_finalize(Object *obj)
     Chardev *chr = CHARDEV(obj);
     PtyChardev *s = PTY_CHARDEV(obj);
 
+    if (!chr->reopen_on_cpr) {
+        cpr_delete_fd(chr->label, 0);
+    }
     pty_chr_state(chr, 0);
     object_unref(OBJECT(s->ioc));
     pty_chr_timer_cancel(s);
@@ -207,12 +211,20 @@ static void char_pty_open(Chardev *chr,
     char pty_name[PATH_MAX];
     char *name;
 
+    master_fd = cpr_find_fd(chr->label, 0);
+    if (master_fd >= 0) {
+        chr->filename = g_strdup_printf("pty:unknown");
+        goto have_fd;
+    }
+
     master_fd = qemu_openpty_raw(&slave_fd, pty_name);
     if (master_fd < 0) {
         error_setg_errno(errp, errno, "Failed to create PTY");
         return;
     }
-
+    if (!chr->reopen_on_cpr) {
+        cpr_save_fd(chr->label, 0, master_fd);
+    }
     close(slave_fd);
     qemu_set_nonblock(master_fd);
 
@@ -220,6 +232,8 @@ static void char_pty_open(Chardev *chr,
     qemu_printf("char device redirected to %s (label %s)\n",
                 pty_name, chr->label);
 
+have_fd:
+    qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
     s = PTY_CHARDEV(chr);
     s->ioc = QIO_CHANNEL(qio_channel_file_new_fd(master_fd));
     name = g_strdup_printf("chardev-pty-%s", chr->label);
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 28/29] chardev: cpr for sockets
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (26 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 27/29] chardev: cpr for pty Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-02-18  9:03   ` Guoyi Tu
  2021-12-22 19:05 ` [PATCH V7 29/29] cpr: only-cpr-capable option Steve Sistare
  2022-01-07 18:45 ` [PATCH V7 00/29] Live Update Steven Sistare
  29 siblings, 1 reply; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Save accepted socket fds before cpr-save, and look for them after cpr-load.
in the environment after cpr-load.  Reject cpr-exec if a socket enables
the TLS or websocket option.  Allow a monitor socket by closing it on exec.

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-socket.c | 35 +++++++++++++++++++++++++++++++++++
 monitor/hmp.c         |  3 +++
 monitor/qmp.c         |  3 +++
 3 files changed, 41 insertions(+)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index d619088..c111e17 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -26,6 +26,7 @@
 #include "chardev/char.h"
 #include "io/channel-socket.h"
 #include "io/channel-websock.h"
+#include "migration/cpr.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
 #include "qemu/option.h"
@@ -358,6 +359,10 @@ static void tcp_chr_free_connection(Chardev *chr)
     SocketChardev *s = SOCKET_CHARDEV(chr);
     int i;
 
+    if (!chr->reopen_on_cpr) {
+        cpr_delete_fd(chr->label, 0);
+    }
+
     if (s->read_msgfds_num) {
         for (i = 0; i < s->read_msgfds_num; i++) {
             close(s->read_msgfds[i]);
@@ -920,6 +925,10 @@ static void tcp_chr_accept(QIONetListener *listener,
                                QIO_CHANNEL(cioc));
     }
     tcp_chr_new_client(chr, cioc);
+
+    if (s->sioc && !chr->reopen_on_cpr) {
+        cpr_save_fd(chr->label, 0, s->sioc->fd);
+    }
 }
 
 
@@ -1175,6 +1184,26 @@ static gboolean socket_reconnect_timeout(gpointer opaque)
     return false;
 }
 
+static int load_char_socket_fd(Chardev *chr, Error **errp)
+{
+    SocketChardev *sockchar = SOCKET_CHARDEV(chr);
+    QIOChannelSocket *sioc;
+    const char *label = chr->label;
+    int fd = cpr_find_fd(label, 0);
+
+    if (fd != -1) {
+        sockchar = SOCKET_CHARDEV(chr);
+        sioc = qio_channel_socket_new_fd(fd, errp);
+        if (sioc) {
+            tcp_chr_accept(sockchar->listener, sioc, chr);
+            object_unref(OBJECT(sioc));
+        } else {
+            error_setg(errp, "could not restore socket for %s", label);
+            return -1;
+        }
+    }
+    return 0;
+}
 
 static int qmp_chardev_open_socket_server(Chardev *chr,
                                           bool is_telnet,
@@ -1385,6 +1414,10 @@ static void qmp_chardev_open_socket(Chardev *chr,
     }
     s->registered_yank = true;
 
+    if (!s->tls_creds && !s->is_websock) {
+        qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
+    }
+
     /* be isn't opened until we get a connection */
     *be_opened = false;
 
@@ -1400,6 +1433,8 @@ static void qmp_chardev_open_socket(Chardev *chr,
             return;
         }
     }
+
+    load_char_socket_fd(chr, errp);
 }
 
 static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend *backend,
diff --git a/monitor/hmp.c b/monitor/hmp.c
index b20737e..a425894 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -1484,4 +1484,7 @@ void monitor_init_hmp(Chardev *chr, bool use_readline, Error **errp)
     qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read, monitor_read,
                              monitor_event, NULL, &mon->common, NULL, true);
     monitor_list_append(&mon->common);
+
+    /* monitor cannot yet be preserved across cpr */
+    chr->reopen_on_cpr = true;
 }
diff --git a/monitor/qmp.c b/monitor/qmp.c
index 092c527..0043459 100644
--- a/monitor/qmp.c
+++ b/monitor/qmp.c
@@ -535,4 +535,7 @@ void monitor_init_qmp(Chardev *chr, bool pretty, Error **errp)
                                  NULL, &mon->common, NULL, true);
         monitor_list_append(&mon->common);
     }
+
+    /* Monitor cannot yet be preserved across cpr */
+    chr->reopen_on_cpr = true;
 }
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH V7 29/29] cpr: only-cpr-capable option
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (27 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 28/29] chardev: cpr for sockets Steve Sistare
@ 2021-12-22 19:05 ` Steve Sistare
  2022-02-18  9:43   ` Guoyi Tu
  2022-01-07 18:45 ` [PATCH V7 00/29] Live Update Steven Sistare
  29 siblings, 1 reply; 96+ messages in thread
From: Steve Sistare @ 2021-12-22 19:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Add the only-cpr-capable option, which causes qemu to exit with an error
if any devices that are not capable of cpr are added.  This guarantees that
a cpr-exec operation will not fail with an unsupported device error.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 MAINTAINERS             |  1 +
 chardev/char-socket.c   |  4 ++++
 hw/vfio/common.c        |  6 ++++++
 include/sysemu/sysemu.h |  1 +
 migration/migration.c   |  5 +++++
 qemu-options.hx         |  8 ++++++++
 softmmu/globals.c       |  1 +
 softmmu/physmem.c       |  5 +++++
 softmmu/vl.c            | 14 +++++++++++++-
 stubs/cpr.c             |  3 +++
 stubs/meson.build       |  1 +
 11 files changed, 48 insertions(+), 1 deletion(-)
 create mode 100644 stubs/cpr.c

diff --git a/MAINTAINERS b/MAINTAINERS
index feed239..af5abc3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2998,6 +2998,7 @@ F: migration/cpr.c
 F: qapi/cpr.json
 F: migration/cpr-state.c
 F: stubs/cpr-state.c
+F: stubs/cpr.c
 
 Record/replay
 M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index c111e17..a4513a7 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -34,6 +34,7 @@
 #include "qapi/clone-visitor.h"
 #include "qapi/qapi-visit-sockets.h"
 #include "qemu/yank.h"
+#include "sysemu/sysemu.h"
 
 #include "chardev/char-io.h"
 #include "chardev/char-socket.h"
@@ -1416,6 +1417,9 @@ static void qmp_chardev_open_socket(Chardev *chr,
 
     if (!s->tls_creds && !s->is_websock) {
         qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
+    } else if (only_cpr_capable) {
+        error_setg(errp, "error: socket %s is not cpr capable due to %s option",
+                   chr->label, (s->tls_creds ? "TLS" : "websocket"));
     }
 
     /* be isn't opened until we get a connection */
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f2b4a81..605ffbb 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -38,6 +38,7 @@
 #include "sysemu/kvm.h"
 #include "sysemu/reset.h"
 #include "sysemu/runstate.h"
+#include "sysemu/sysemu.h"
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/migration.h"
@@ -1923,12 +1924,17 @@ static void vfio_put_address_space(VFIOAddressSpace *space)
 static int vfio_get_iommu_type(VFIOContainer *container,
                                Error **errp)
 {
+    ERRP_GUARD();
     int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
                           VFIO_SPAPR_TCE_v2_IOMMU, VFIO_SPAPR_TCE_IOMMU };
     int i;
 
     for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
         if (ioctl(container->fd, VFIO_CHECK_EXTENSION, iommu_types[i])) {
+            if (only_cpr_capable && !vfio_is_cpr_capable(container, errp)) {
+                error_prepend(errp, "only-cpr-capable is specified: ");
+                return -EINVAL;
+            }
             return iommu_types[i];
         }
     }
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 8fae667..6241c20 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -9,6 +9,7 @@
 /* vl.c */
 
 extern int only_migratable;
+extern bool only_cpr_capable;
 extern const char *qemu_name;
 extern QemuUUID qemu_uuid;
 extern bool qemu_uuid_set;
diff --git a/migration/migration.c b/migration/migration.c
index 3de11ae..f08db0d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1257,6 +1257,11 @@ static bool migrate_caps_check(bool *cap_list,
         return false;
     }
 
+    if (cap_list[MIGRATION_CAPABILITY_X_COLO] && only_cpr_capable) {
+        error_setg(errp, "x-colo is not compatible with -only-cpr-capable");
+        return false;
+    }
+
     return true;
 }
 
diff --git a/qemu-options.hx b/qemu-options.hx
index 1859b55..0cbf2e3 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4434,6 +4434,14 @@ SRST
     an unmigratable state.
 ERST
 
+DEF("only-cpr-capable", 0, QEMU_OPTION_only_cpr_capable, \
+    "-only-cpr-capable    allow only cpr capable devices\n", QEMU_ARCH_ALL)
+SRST
+``-only-cpr-capable``
+    Only allow cpr capable devices, which guarantees that cpr-save and
+    cpr-exec will not fail with an unsupported device error.
+ERST
+
 DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
     "-nodefaults     don't create default devices\n", QEMU_ARCH_ALL)
 SRST
diff --git a/softmmu/globals.c b/softmmu/globals.c
index 7d0fc81..a18fd8d 100644
--- a/softmmu/globals.c
+++ b/softmmu/globals.c
@@ -59,6 +59,7 @@ int boot_menu;
 bool boot_strict;
 uint8_t *boot_splash_filedata;
 int only_migratable; /* turn it off unless user states otherwise */
+bool only_cpr_capable;
 int icount_align_option;
 
 /* The bytes in qemu_uuid are in the order specified by RFC4122, _not_ in the
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index e227195..e7869f8 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -47,6 +47,7 @@
 #include "sysemu/dma.h"
 #include "sysemu/hostmem.h"
 #include "sysemu/hw_accel.h"
+#include "sysemu/sysemu.h"
 #include "sysemu/xen-mapcache.h"
 #include "trace/trace-root.h"
 
@@ -2010,6 +2011,10 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
                 addr = file_ram_alloc(new_block, maxlen, mfd,
                                       false, false, 0, errp);
                 trace_anon_memfd_alloc(name, maxlen, addr, mfd);
+            } else if (only_cpr_capable) {
+                error_setg(errp,
+                    "only-cpr-capable requires -machine memfd-alloc=on");
+                return;
             } else {
                 addr = qemu_anon_ram_alloc(maxlen, &mr->align,
                                            shared, noreserve);
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 4319e1a..f14e29e 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2743,11 +2743,20 @@ void qmp_x_exit_preconfig(Error **errp)
     qemu_create_cli_devices();
     qemu_machine_creation_done();
 
+    if (only_cpr_capable && !qemu_chr_is_cpr_capable(errp)) {
+        ;    /* not reached due to error_fatal */
+    }
+
     if (loadvm) {
         load_snapshot(loadvm, NULL, false, NULL, &error_fatal);
     }
     if (replay_mode != REPLAY_MODE_NONE) {
-        replay_vmstate_init();
+        if (only_cpr_capable) {
+            error_setg(errp, "replay is not compatible with -only-cpr-capable");
+            /* not reached due to error_fatal */
+        } else {
+            replay_vmstate_init();
+        }
     }
 
     if (incoming) {
@@ -3507,6 +3516,9 @@ void qemu_init(int argc, char **argv, char **envp)
             case QEMU_OPTION_only_migratable:
                 only_migratable = 1;
                 break;
+            case QEMU_OPTION_only_cpr_capable:
+                only_cpr_capable = true;
+                break;
             case QEMU_OPTION_nodefaults:
                 has_defaults = 0;
                 break;
diff --git a/stubs/cpr.c b/stubs/cpr.c
new file mode 100644
index 0000000..aaa189e
--- /dev/null
+++ b/stubs/cpr.c
@@ -0,0 +1,3 @@
+#include "qemu/osdep.h"
+
+bool only_cpr_capable;
diff --git a/stubs/meson.build b/stubs/meson.build
index 9565c7d..4c9c4ea 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -4,6 +4,7 @@ stub_ss.add(files('blk-exp-close-all.c'))
 stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
 stub_ss.add(files('change-state-handler.c'))
 stub_ss.add(files('cmos.c'))
+stub_ss.add(files('cpr.c'))
 stub_ss.add(files('cpr-state.c'))
 stub_ss.add(files('cpu-get-clock.c'))
 stub_ss.add(files('cpus-get-virtual-clock.c'))
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 17/29] pci: export functions for cpr
  2021-12-22 19:05 ` [PATCH V7 17/29] pci: export functions for cpr Steve Sistare
@ 2021-12-22 23:07   ` Michael S. Tsirkin
  2022-01-05 17:22     ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2021-12-22 23:07 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Wed, Dec 22, 2021 at 11:05:22AM -0800, Steve Sistare wrote:
> Export msix_is_pending, msix_init_vector_notifiers, and pci_update_mappings
> for use by cpr.  No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

With things like that, I prefer when the API is exported
together with the patch that uses it.
This was I can see why we are exporting these APIs.
Esp wrt pci_update_mappings, it's designed as an
internal API.

> ---
>  hw/pci/msix.c         | 20 ++++++++++++++------
>  hw/pci/pci.c          |  3 +--
>  include/hw/pci/msix.h |  5 +++++
>  include/hw/pci/pci.h  |  1 +
>  4 files changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> index ae9331c..73f4259 100644
> --- a/hw/pci/msix.c
> +++ b/hw/pci/msix.c
> @@ -64,7 +64,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
>      return dev->msix_pba + vector / 8;
>  }
>  
> -static int msix_is_pending(PCIDevice *dev, int vector)
> +int msix_is_pending(PCIDevice *dev, unsigned int vector)
>  {
>      return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
>  }
> @@ -579,6 +579,17 @@ static void msix_unset_notifier_for_vector(PCIDevice *dev, unsigned int vector)
>      dev->msix_vector_release_notifier(dev, vector);
>  }
>  
> +void msix_init_vector_notifiers(PCIDevice *dev,
> +                                MSIVectorUseNotifier use_notifier,
> +                                MSIVectorReleaseNotifier release_notifier,
> +                                MSIVectorPollNotifier poll_notifier)
> +{
> +    assert(use_notifier && release_notifier);
> +    dev->msix_vector_use_notifier = use_notifier;
> +    dev->msix_vector_release_notifier = release_notifier;
> +    dev->msix_vector_poll_notifier = poll_notifier;
> +}
> +
>  int msix_set_vector_notifiers(PCIDevice *dev,
>                                MSIVectorUseNotifier use_notifier,
>                                MSIVectorReleaseNotifier release_notifier,
> @@ -586,11 +597,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
>  {
>      int vector, ret;
>  
> -    assert(use_notifier && release_notifier);
> -
> -    dev->msix_vector_use_notifier = use_notifier;
> -    dev->msix_vector_release_notifier = release_notifier;
> -    dev->msix_vector_poll_notifier = poll_notifier;
> +    msix_init_vector_notifiers(dev, use_notifier, release_notifier,
> +                               poll_notifier);
>  
>      if ((dev->config[dev->msix_cap + MSIX_CONTROL_OFFSET] &
>          (MSIX_ENABLE_MASK | MSIX_MASKALL_MASK)) == MSIX_ENABLE_MASK) {
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index e5993c1..0fd21e1 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -225,7 +225,6 @@ static const TypeInfo pcie_bus_info = {
>  };
>  
>  static PCIBus *pci_find_bus_nr(PCIBus *bus, int bus_num);
> -static void pci_update_mappings(PCIDevice *d);
>  static void pci_irq_handler(void *opaque, int irq_num, int level);
>  static void pci_add_option_rom(PCIDevice *pdev, bool is_default_rom, Error **);
>  static void pci_del_option_rom(PCIDevice *pdev);
> @@ -1366,7 +1365,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
>      return new_addr;
>  }
>  
> -static void pci_update_mappings(PCIDevice *d)
> +void pci_update_mappings(PCIDevice *d)
>  {
>      PCIIORegion *r;
>      int i;
> diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
> index 4c4a60c..46606cf 100644
> --- a/include/hw/pci/msix.h
> +++ b/include/hw/pci/msix.h
> @@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
>  bool msix_is_masked(PCIDevice *dev, unsigned vector);
>  void msix_set_pending(PCIDevice *dev, unsigned vector);
>  void msix_clr_pending(PCIDevice *dev, int vector);
> +int msix_is_pending(PCIDevice *dev, unsigned vector);
>  
>  int msix_vector_use(PCIDevice *dev, unsigned vector);
>  void msix_vector_unuse(PCIDevice *dev, unsigned vector);
> @@ -41,6 +42,10 @@ void msix_notify(PCIDevice *dev, unsigned vector);
>  
>  void msix_reset(PCIDevice *dev);
>  
> +void msix_init_vector_notifiers(PCIDevice *dev,
> +                                MSIVectorUseNotifier use_notifier,
> +                                MSIVectorReleaseNotifier release_notifier,
> +                                MSIVectorPollNotifier poll_notifier);
>  int msix_set_vector_notifiers(PCIDevice *dev,
>                                MSIVectorUseNotifier use_notifier,
>                                MSIVectorReleaseNotifier release_notifier,
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index e7cdf2d..cc63dd4 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -910,5 +910,6 @@ extern const VMStateDescription vmstate_pci_device;
>  
>  MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
>  void pci_set_power(PCIDevice *pci_dev, bool state);
> +void pci_update_mappings(PCIDevice *d);
>  
>  #endif
> -- 
> 1.8.3.1



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2021-12-22 19:05 ` [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
@ 2021-12-22 23:15   ` Michael S. Tsirkin
  2022-01-05 17:24     ` Steven Sistare
  2022-03-07 22:16   ` Alex Williamson
  1 sibling, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2021-12-22 23:15 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
> Enable vfio-pci devices to be saved and restored across an exec restart
> of qemu.
> 
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in cpr state.
> 
> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
> at a different VA after exec.  DMA to already-mapped pages continues.  Save
> the msi message area as part of vfio-pci vmstate, save the interrupt and
> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
> vfio descriptors.  The flag is not cleared earlier because the descriptors
> should not persist across miscellaneous fork and exec calls that may be
> performed during normal operation.
> 
> On qemu restart, vfio_realize() finds the saved descriptors, uses
> the descriptors, and notes that the device is being reused.  Device and
> iommu state is already configured, so operations in vfio_realize that
> would modify the configuration are skipped for a reused device, including
> vfio ioctl's and writes to PCI configuration space.  The result is that
> vfio_realize constructs qemu data structures that reflect the current
> state of the device.  However, the reconstruction is not complete until
> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
> state.  It rebuilds vector data structures and attaches the interrupts to
> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
> which walks the flattened ranges of the vfio_address_spaces and calls
> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
> starts the VM and suppresses vfio pci device reset.
> 
> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> support.  Part 3 adds INTX support.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  MAINTAINERS                   |   1 +
>  hw/pci/pci.c                  |  10 ++++
>  hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   1 +
>  include/hw/pci/pci.h          |   1 +
>  include/hw/vfio/vfio-common.h |   8 +++
>  include/migration/cpr.h       |   3 ++
>  migration/cpr.c               |  10 +++-
>  migration/target.c            |  14 +++++
>  12 files changed, 324 insertions(+), 11 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cfe7480..feed239 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2992,6 +2992,7 @@ CPR
>  M: Steve Sistare <steven.sistare@oracle.com>
>  M: Mark Kanda <mark.kanda@oracle.com>
>  S: Maintained
> +F: hw/vfio/cpr.c
>  F: include/migration/cpr.h
>  F: migration/cpr.c
>  F: qapi/cpr.json
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 0fd21e1..e35df4f 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>  {
>      int r;
>  
> +    /*
> +     * A reused vfio-pci device is already configured, so do not reset it
> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
> +     * updated with new state in cpr-load with no ill effects.
> +     */
> +    if (dev->reused) {
> +        return;
> +    }
> +
>      pci_device_deassert_intx(dev);
>      assert(dev->irq_state == 0);
>  


Hmm that's a weird thing to do. I suspect this works because
"reused" means something like "in the process of being restored"?
Because clearly, we do not want to skip this part e.g. when
guest resets the device.
So a better name could be called for, but really I don't
love how vfio gets to poke at internal PCI state.
I'd rather we found a way just not to call this function.
If we can't, maybe an explicit API, and make it
actually say what it's doing?


> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 5b87f95..90f66ad 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -31,6 +31,7 @@
>  #include "exec/memory.h"
>  #include "exec/ram_addr.h"
>  #include "hw/hw.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/range.h"
> @@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> +    assert(!container->reused);
> +
>      if (iotlb && container->dirty_pages_supported &&
>          vfio_devices_all_running_and_saving(container)) {
>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> @@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>  {
>      struct vfio_iommu_type1_dma_map map = {
>          .argsz = sizeof(map),
> -        .flags = VFIO_DMA_MAP_FLAG_READ,
>          .vaddr = (__u64)(uintptr_t)vaddr,
>          .iova = iova,
>          .size = size,
>      };
>  
> +    /*
> +     * Set the new vaddr for any mappings registered during cpr-load.
> +     * Reused is cleared thereafter.
> +     */
> +    if (container->reused) {
> +        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
> +        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +            goto fail;
> +        }
> +        return 0;
> +    }
> +
> +    map.flags = VFIO_DMA_MAP_FLAG_READ;
>      if (!readonly) {
>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>      }
> @@ -516,7 +531,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          return 0;
>      }
>  
> -    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
> +fail:
> +    error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
> +        (container->reused ? "VADDR" : ""), iova, size, vaddr, strerror(errno));
>      return -errno;
>  }
>  
> @@ -865,6 +882,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    vfio_container_region_add(container, section);
> +}
> +
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section)
> +{
>      hwaddr iova, end;
>      Int128 llend, llsize;
>      void *vaddr;
> @@ -985,6 +1008,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          int iommu_idx;
>  
>          trace_vfio_listener_region_add_iommu(iova, end);
> +
>          /*
>           * FIXME: For VFIO iommu types which have KVM acceleration to
>           * avoid bouncing all map/unmaps through qemu this way, this
> @@ -1459,6 +1483,12 @@ static void vfio_listener_release(VFIOContainer *container)
>      }
>  }
>  
> +void vfio_listener_register(VFIOContainer *container)
> +{
> +    container->listener = vfio_memory_listener;
> +    memory_listener_register(&container->listener, container->space->as);
> +}
> +
>  static struct vfio_info_cap_header *
>  vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
>  {
> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>  {
>      int iommu_type, ret;
>  
> +    /*
> +     * If container is reused, just set its type and skip the ioctls, as the
> +     * container and group are already configured in the kernel.
> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> +     * If you ever add new types or spapr cpr support, kind reader, please
> +     * also implement VFIO_GET_IOMMU.
> +     */
> +    if (container->reused) {
> +        container->iommu_type = VFIO_TYPE1v2_IOMMU;
> +        return 0;
> +    }
> +
>      iommu_type = vfio_get_iommu_type(container, errp);
>      if (iommu_type < 0) {
>          return iommu_type;
> @@ -1982,9 +2024,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  {
>      VFIOContainer *container;
>      int ret, fd;
> +    bool reused;
>      VFIOAddressSpace *space;
>  
>      space = vfio_get_address_space(as);
> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
> +    reused = (fd > 0);
>  
>      /*
>       * VFIO is currently incompatible with discarding of RAM insofar as the
> @@ -2017,8 +2062,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       * details once we know which type of IOMMU we are using.
>       */
>  
> +    /*
> +     * If the container is reused, then the group is already attached in the
> +     * kernel.  If a container with matching fd is found, then update the
> +     * userland group list and return.  It not, then after the loop, create
> +     * the container struct and group list.
> +     */
> +
>      QLIST_FOREACH(container, &space->containers, next) {
> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +        if ((reused && container->fd == fd) ||
> +            !ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>              ret = vfio_ram_block_discard_disable(container, true);
>              if (ret) {
>                  error_setg_errno(errp, -ret,
> @@ -2032,12 +2085,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>              }
>              group->container = container;
>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> -            vfio_kvm_device_add_group(group);
> +            if (!reused) {
> +                vfio_kvm_device_add_group(group);
> +                cpr_save_fd("vfio_container_for_group", group->groupid,
> +                            container->fd);
> +            }
>              return 0;
>          }
>      }
>  
> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> +    if (!reused) {
> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> +    }
> +
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>          ret = -errno;
> @@ -2055,6 +2115,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      container = g_malloc0(sizeof(*container));
>      container->space = space;
>      container->fd = fd;
> +    container->reused = reused;
>      container->error = NULL;
>      container->dirty_pages_supported = false;
>      container->dma_max_mappings = 0;
> @@ -2181,9 +2242,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      group->container = container;
>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>  
> -    container->listener = vfio_memory_listener;
> -
> -    memory_listener_register(&container->listener, container->space->as);
> +    /*
> +     * If reused, register the listener later, after all state that may
> +     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
> +     * the listener will invoke its callback on each flat section and call
> +     * vfio_dma_map to supply the new vaddr, and the calls will match the
> +     * mappings remembered by the kernel.
> +     */
> +    if (!reused) {
> +        vfio_listener_register(container);
> +    }
>  
>      if (container->error) {
>          ret = -1;
> @@ -2193,6 +2261,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      container->initialized = true;
> +    if (!reused) {
> +        cpr_save_fd("vfio_container_for_group", group->groupid, fd);
> +    }
>  
>      return 0;
>  listener_release_exit:
> @@ -2222,6 +2293,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  
>      QLIST_REMOVE(group, container_next);
>      group->container = NULL;
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
>  
>      /*
>       * Explicitly release the listener first before unset container,
> @@ -2270,6 +2342,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      VFIOGroup *group;
>      char path[32];
>      struct vfio_group_status status = { .argsz = sizeof(status) };
> +    bool reused;
>  
>      QLIST_FOREACH(group, &vfio_group_list, next) {
>          if (group->groupid == groupid) {
> @@ -2287,7 +2360,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      group = g_malloc0(sizeof(*group));
>  
>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open_old(path, O_RDWR);
> +
> +    group->fd = cpr_find_fd("vfio_group", groupid);
> +    reused = (group->fd >= 0);
> +    if (!reused) {
> +        group->fd = qemu_open_old(path, O_RDWR);
> +    }
> +
>      if (group->fd < 0) {
>          error_setg_errno(errp, errno, "failed to open %s", path);
>          goto free_group_exit;
> @@ -2321,6 +2400,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  
>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>  
> +    if (!reused) {
> +        cpr_save_fd("vfio_group", groupid, group->fd);
> +    }
> +
>      return group;
>  
>  close_fd_exit:
> @@ -2345,6 +2428,7 @@ void vfio_put_group(VFIOGroup *group)
>      vfio_disconnect_container(group);
>      QLIST_REMOVE(group, next);
>      trace_vfio_put_group(group->fd);
> +    cpr_delete_fd("vfio_group", group->groupid);
>      close(group->fd);
>      g_free(group);
>  
> @@ -2358,8 +2442,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>  {
>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>      int ret, fd;
> +    bool reused;
> +
> +    fd = cpr_find_fd(name, 0);
> +    reused = (fd >= 0);
> +    if (!reused) {
> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    }
>  
> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "error getting device from group %d",
>                           group->groupid);
> @@ -2404,6 +2494,10 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>      vbasedev->num_irqs = dev_info.num_irqs;
>      vbasedev->num_regions = dev_info.num_regions;
>      vbasedev->flags = dev_info.flags;
> +    vbasedev->reused = reused;
> +    if (!reused) {
> +        cpr_save_fd(name, 0, fd);
> +    }
>  
>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>                            dev_info.num_irqs);
> @@ -2420,6 +2514,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>      QLIST_REMOVE(vbasedev, next);
>      vbasedev->group = NULL;
>      trace_vfio_put_base_device(vbasedev->fd);
> +    cpr_delete_fd(vbasedev->name, 0);
>      close(vbasedev->fd);
>  }
>  
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> new file mode 100644
> index 0000000..2c39cd5
> --- /dev/null
> +++ b/hw/vfio/cpr.c
> @@ -0,0 +1,94 @@
> +/*
> + * Copyright (c) 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +#include "hw/vfio/vfio-common.h"
> +#include "sysemu/kvm.h"
> +#include "qapi/error.h"
> +#include "trace.h"
> +
> +static int
> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> +{
> +    struct vfio_iommu_type1_dma_unmap unmap = {
> +        .argsz = sizeof(unmap),
> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
> +        .iova = 0,
> +        .size = 0,
> +    };
> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> +        return -errno;
> +    }
> +    return 0;
> +}
> +
> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> +{
> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
> +                         "or VFIO_UNMAP_ALL");
> +        return false;
> +    } else {
> +        return true;
> +    }
> +}
> +
> +/*
> + * Verify that all containers support CPR, and unmap all dma vaddr's.
> + */
> +int vfio_cpr_save(Error **errp)
> +{
> +    ERRP_GUARD();
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_is_cpr_capable(container, errp)) {
> +                return -1;
> +            }
> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
> +                return -1;
> +            }
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +/*
> + * Register the listener for each container, which causes its callback to be
> + * invoked for every flat section.  The callback will see that the container
> + * is reused, and call vfo_dma_map with the new vaddr.
> + */
> +int vfio_cpr_load(Error **errp)
> +{
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_is_cpr_capable(container, errp)) {
> +                return -1;
> +            }
> +            vfio_listener_register(container);
> +            container->reused = false;
> +        }
> +    }
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vbasedev->reused = false;
> +        }
> +    }
> +    return 0;
> +}
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index da9af29..e247b2b 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>    'migration.c',
>  ))
>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> +  'cpr.c',
>    'display.c',
>    'pci-quirks.c',
>    'pci.c',
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a90cce2..acac8a7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -30,6 +30,7 @@
>  #include "hw/qdev-properties-system.h"
>  #include "migration/vmstate.h"
>  #include "qapi/qmp/qdict.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/module.h"
> @@ -2926,6 +2927,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          vfio_put_group(group);
>          goto error;
>      }
> +    pdev->reused = vdev->vbasedev.reused;
>  
>      vfio_populate_device(vdev, &err);
>      if (err) {
> @@ -3195,6 +3197,11 @@ static void vfio_pci_reset(DeviceState *dev)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>  
> +    /* Do not reset the device during qemu_system_reset prior to cpr-load */
> +    if (vdev->pdev.reused) {
> +        return;
> +    }
> +
>      trace_vfio_pci_reset(vdev->vbasedev.name);
>  
>      vfio_pci_pre_reset(vdev);
> @@ -3302,6 +3309,75 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +static void vfio_merge_config(VFIOPCIDevice *vdev)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    g_autofree uint8_t *phys_config = g_malloc(size);
> +    uint32_t mask;
> +    int ret, i;
> +
> +    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
> +    if (ret < size) {
> +        ret = ret < 0 ? errno : EFAULT;
> +        error_report("failed to read device config space: %s", strerror(ret));
> +        return;
> +    }
> +
> +    for (i = 0; i < size; i++) {
> +        mask = vdev->emulated_config_bits[i];
> +        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
> +    }
> +}
> +
> +/*
> + * The kernel may change non-emulated config bits.  Exclude them from the
> + * changed-bits check in get_pci_config_device.
> + */
> +static int vfio_pci_pre_load(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    int i;
> +
> +    for (i = 0; i < size; i++) {
> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_pci_post_load(void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    vfio_merge_config(vdev);
> +
> +    pdev->reused = false;
> +
> +    return 0;
> +}
> +
> +static bool vfio_pci_needed(void *opaque)
> +{
> +    return cpr_get_mode() == CPR_MODE_RESTART;
> +}
> +
> +static const VMStateDescription vfio_pci_vmstate = {
> +    .name = "vfio-pci",
> +    .unmigratable = 1,
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .pre_load = vfio_pci_pre_load,
> +    .post_load = vfio_pci_post_load,
> +    .needed = vfio_pci_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -3309,6 +3385,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  
>      dc->reset = vfio_pci_reset;
>      device_class_set_props(dc, vfio_pci_dev_properties);
> +    dc->vmsd = &vfio_pci_vmstate;
>      dc->desc = "VFIO-based PCI device assignment";
>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>      pdc->realize = vfio_realize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0ef1b5f..63dd0fe 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  vfio_dma_unmap_overflow_workaround(void) ""
> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>  
>  # platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index cc63dd4..8557e82 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -361,6 +361,7 @@ struct PCIDevice {
>      /* ID of standby device in net_failover pair */
>      char *failover_pair_id;
>      uint32_t acpi_index;
> +    bool reused;
>  };
>  
>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1641753..bc23c29 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -85,6 +85,7 @@ typedef struct VFIOContainer {
>      Error *error;
>      bool initialized;
>      bool dirty_pages_supported;
> +    bool reused;
>      uint64_t dirty_pgsizes;
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
> @@ -136,6 +137,7 @@ typedef struct VFIODevice {
>      bool no_mmap;
>      bool ram_block_discard_allowed;
>      bool enable_migration;
> +    bool reused;
>      VFIODeviceOps *ops;
>      unsigned int num_irqs;
>      unsigned int num_regions;
> @@ -212,6 +214,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>  void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
> +int vfio_cpr_save(Error **errp);
> +int vfio_cpr_load(Error **errp);
> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp);
>  
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
> @@ -236,6 +241,9 @@ struct vfio_info_cap_header *
>  vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
> +void vfio_listener_register(VFIOContainer *container);
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section);
>  
>  int vfio_spapr_create_window(VFIOContainer *container,
>                               MemoryRegionSection *section,
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index a4da24e..a4007cf 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
>  int cpr_state_load(Error **errp);
>  void cpr_state_print(void);
>  
> +int cpr_vfio_save(Error **errp);
> +int cpr_vfio_load(Error **errp);
> +
>  #endif
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 37eca66..cee82cf 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -7,6 +7,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "exec/memory.h"
> +#include "hw/vfio/vfio-common.h"
>  #include "io/channel-buffer.h"
>  #include "io/channel-file.h"
>  #include "migration.h"
> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>          return;
>      }
> -
> +    if (cpr_vfio_save(errp)) {
> +        return;
> +    }
>      cpr_walk_fd(preserve_fd, 0);
>      if (cpr_state_save(errp)) {
>          return;
> @@ -139,6 +142,11 @@ void qmp_cpr_load(const char *filename, Error **errp)
>          goto out;
>      }
>  
> +    if (cpr_get_mode() == CPR_MODE_RESTART &&
> +        cpr_vfio_load(errp)) {
> +        goto out;
> +    }
> +
>      state = global_state_get_runstate();
>      if (state == RUN_STATE_RUNNING) {
>          vm_start();
> diff --git a/migration/target.c b/migration/target.c
> index 4390bf0..984bc9e 100644
> --- a/migration/target.c
> +++ b/migration/target.c
> @@ -8,6 +8,7 @@
>  #include "qemu/osdep.h"
>  #include "qapi/qapi-types-migration.h"
>  #include "migration.h"
> +#include "migration/cpr.h"
>  #include CONFIG_DEVICES
>  
>  #ifdef CONFIG_VFIO
> @@ -22,8 +23,21 @@ void populate_vfio_info(MigrationInfo *info)
>          info->vfio->transferred = vfio_mig_bytes_transferred();
>      }
>  }
> +
> +int cpr_vfio_save(Error **errp)
> +{
> +    return vfio_cpr_save(errp);
> +}
> +
> +int cpr_vfio_load(Error **errp)
> +{
> +    return vfio_cpr_load(errp);
> +}
> +
>  #else
>  
>  void populate_vfio_info(MigrationInfo *info) {}
> +int cpr_vfio_save(Error **errp) { return 0; }
> +int cpr_vfio_load(Error **errp) { return 0; }
>  
>  #endif /* CONFIG_VFIO */
> -- 
> 1.8.3.1



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 17/29] pci: export functions for cpr
  2021-12-22 23:07   ` Michael S. Tsirkin
@ 2022-01-05 17:22     ` Steven Sistare
  2022-01-05 20:16       ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-01-05 17:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 12/22/2021 6:07 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 22, 2021 at 11:05:22AM -0800, Steve Sistare wrote:
>> Export msix_is_pending, msix_init_vector_notifiers, and pci_update_mappings
>> for use by cpr.  No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> With things like that, I prefer when the API is exported
> together with the patch that uses it.
> This was I can see why we are exporting these APIs.
> Esp wrt pci_update_mappings, it's designed as an
> internal API.

Hi Michael, thanks very much for reviewing these patches.

Serendipitously, I stopped calling pci_update_mappings from vfio code earlier
in the series.  I will revert its scope.

I would prefer to keep this patch separate from the use of these functions in
"vfio-pci cpr part 2 msi", to make the latter smaller and easier to understand.
How about if I say more in this commit message? :

  Export msix_is_pending and msix_init_vector_notifiers for use in vfio cpr.
  Both are needed in the vfio-pci post-load function during cpr-load.
  msix_is_pending is checked to enable the PBA memory region.
  msix_init_vector_notifiers is called to register notifier callbacks, without
  the other side effects of msix_set_vector_notifiers.

- Steve

>> ---
>>  hw/pci/msix.c         | 20 ++++++++++++++------
>>  hw/pci/pci.c          |  3 +--
>>  include/hw/pci/msix.h |  5 +++++
>>  include/hw/pci/pci.h  |  1 +
>>  4 files changed, 21 insertions(+), 8 deletions(-)
>>
>> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
>> index ae9331c..73f4259 100644
>> --- a/hw/pci/msix.c
>> +++ b/hw/pci/msix.c
>> @@ -64,7 +64,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
>>      return dev->msix_pba + vector / 8;
>>  }
>>  
>> -static int msix_is_pending(PCIDevice *dev, int vector)
>> +int msix_is_pending(PCIDevice *dev, unsigned int vector)
>>  {
>>      return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
>>  }
>> @@ -579,6 +579,17 @@ static void msix_unset_notifier_for_vector(PCIDevice *dev, unsigned int vector)
>>      dev->msix_vector_release_notifier(dev, vector);
>>  }
>>  
>> +void msix_init_vector_notifiers(PCIDevice *dev,
>> +                                MSIVectorUseNotifier use_notifier,
>> +                                MSIVectorReleaseNotifier release_notifier,
>> +                                MSIVectorPollNotifier poll_notifier)
>> +{
>> +    assert(use_notifier && release_notifier);
>> +    dev->msix_vector_use_notifier = use_notifier;
>> +    dev->msix_vector_release_notifier = release_notifier;
>> +    dev->msix_vector_poll_notifier = poll_notifier;
>> +}
>> +
>>  int msix_set_vector_notifiers(PCIDevice *dev,
>>                                MSIVectorUseNotifier use_notifier,
>>                                MSIVectorReleaseNotifier release_notifier,
>> @@ -586,11 +597,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
>>  {
>>      int vector, ret;
>>  
>> -    assert(use_notifier && release_notifier);
>> -
>> -    dev->msix_vector_use_notifier = use_notifier;
>> -    dev->msix_vector_release_notifier = release_notifier;
>> -    dev->msix_vector_poll_notifier = poll_notifier;
>> +    msix_init_vector_notifiers(dev, use_notifier, release_notifier,
>> +                               poll_notifier);
>>  
>>      if ((dev->config[dev->msix_cap + MSIX_CONTROL_OFFSET] &
>>          (MSIX_ENABLE_MASK | MSIX_MASKALL_MASK)) == MSIX_ENABLE_MASK) {
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index e5993c1..0fd21e1 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -225,7 +225,6 @@ static const TypeInfo pcie_bus_info = {
>>  };
>>  
>>  static PCIBus *pci_find_bus_nr(PCIBus *bus, int bus_num);
>> -static void pci_update_mappings(PCIDevice *d);
>>  static void pci_irq_handler(void *opaque, int irq_num, int level);
>>  static void pci_add_option_rom(PCIDevice *pdev, bool is_default_rom, Error **);
>>  static void pci_del_option_rom(PCIDevice *pdev);
>> @@ -1366,7 +1365,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
>>      return new_addr;
>>  }
>>  
>> -static void pci_update_mappings(PCIDevice *d)
>> +void pci_update_mappings(PCIDevice *d)
>>  {
>>      PCIIORegion *r;
>>      int i;
>> diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
>> index 4c4a60c..46606cf 100644
>> --- a/include/hw/pci/msix.h
>> +++ b/include/hw/pci/msix.h
>> @@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
>>  bool msix_is_masked(PCIDevice *dev, unsigned vector);
>>  void msix_set_pending(PCIDevice *dev, unsigned vector);
>>  void msix_clr_pending(PCIDevice *dev, int vector);
>> +int msix_is_pending(PCIDevice *dev, unsigned vector);
>>  
>>  int msix_vector_use(PCIDevice *dev, unsigned vector);
>>  void msix_vector_unuse(PCIDevice *dev, unsigned vector);
>> @@ -41,6 +42,10 @@ void msix_notify(PCIDevice *dev, unsigned vector);
>>  
>>  void msix_reset(PCIDevice *dev);
>>  
>> +void msix_init_vector_notifiers(PCIDevice *dev,
>> +                                MSIVectorUseNotifier use_notifier,
>> +                                MSIVectorReleaseNotifier release_notifier,
>> +                                MSIVectorPollNotifier poll_notifier);
>>  int msix_set_vector_notifiers(PCIDevice *dev,
>>                                MSIVectorUseNotifier use_notifier,
>>                                MSIVectorReleaseNotifier release_notifier,
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index e7cdf2d..cc63dd4 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -910,5 +910,6 @@ extern const VMStateDescription vmstate_pci_device;
>>  
>>  MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
>>  void pci_set_power(PCIDevice *pci_dev, bool state);
>> +void pci_update_mappings(PCIDevice *d);
>>  
>>  #endif
>> -- 
>> 1.8.3.1
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2021-12-22 23:15   ` Michael S. Tsirkin
@ 2022-01-05 17:24     ` Steven Sistare
  2022-01-05 21:14       ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-01-05 17:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
>> Enable vfio-pci devices to be saved and restored across an exec restart
>> of qemu.
>>
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in cpr state.
>>
>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
>> at a different VA after exec.  DMA to already-mapped pages continues.  Save
>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>> vfio descriptors.  The flag is not cleared earlier because the descriptors
>> should not persist across miscellaneous fork and exec calls that may be
>> performed during normal operation.
>>
>> On qemu restart, vfio_realize() finds the saved descriptors, uses
>> the descriptors, and notes that the device is being reused.  Device and
>> iommu state is already configured, so operations in vfio_realize that
>> would modify the configuration are skipped for a reused device, including
>> vfio ioctl's and writes to PCI configuration space.  The result is that
>> vfio_realize constructs qemu data structures that reflect the current
>> state of the device.  However, the reconstruction is not complete until
>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>> state.  It rebuilds vector data structures and attaches the interrupts to
>> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
>> which walks the flattened ranges of the vfio_address_spaces and calls
>> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
>> starts the VM and suppresses vfio pci device reset.
>>
>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
>> support.  Part 3 adds INTX support.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  MAINTAINERS                   |   1 +
>>  hw/pci/pci.c                  |  10 ++++
>>  hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
>>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
>>  hw/vfio/meson.build           |   1 +
>>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
>>  hw/vfio/trace-events          |   1 +
>>  include/hw/pci/pci.h          |   1 +
>>  include/hw/vfio/vfio-common.h |   8 +++
>>  include/migration/cpr.h       |   3 ++
>>  migration/cpr.c               |  10 +++-
>>  migration/target.c            |  14 +++++
>>  12 files changed, 324 insertions(+), 11 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index cfe7480..feed239 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2992,6 +2992,7 @@ CPR
>>  M: Steve Sistare <steven.sistare@oracle.com>
>>  M: Mark Kanda <mark.kanda@oracle.com>
>>  S: Maintained
>> +F: hw/vfio/cpr.c
>>  F: include/migration/cpr.h
>>  F: migration/cpr.c
>>  F: qapi/cpr.json
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 0fd21e1..e35df4f 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>>  {
>>      int r;
>>  
>> +    /*
>> +     * A reused vfio-pci device is already configured, so do not reset it
>> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
>> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
>> +     * updated with new state in cpr-load with no ill effects.
>> +     */
>> +    if (dev->reused) {
>> +        return;
>> +    }
>> +
>>      pci_device_deassert_intx(dev);
>>      assert(dev->irq_state == 0);
>>  
> 
> 
> Hmm that's a weird thing to do. I suspect this works because
> "reused" means something like "in the process of being restored"?
> Because clearly, we do not want to skip this part e.g. when
> guest resets the device.

Exactly.  vfio_realize sets the flag if it detects the device is reused during
a restart, and vfio_pci_post_load clears the reused flag.

> So a better name could be called for, but really I don't
> love how vfio gets to poke at internal PCI state.
> I'd rather we found a way just not to call this function.
> If we can't, maybe an explicit API, and make it
> actually say what it's doing?

How about:

pci_set_restore(PCIDevice *dev) { dev->restore = true; }
pci_clr_restore(PCIDevice *dev) { dev->restore = false; }

vfio_realize()
  pci_set_restore(pdev)

vfio_pci_post_load()
  pci_clr_restore(pdev)

pci_do_device_reset()
    if (dev->restore)
        return;

- Steve
 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 17/29] pci: export functions for cpr
  2022-01-05 17:22     ` Steven Sistare
@ 2022-01-05 20:16       ` Michael S. Tsirkin
  2022-01-06 22:48         ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2022-01-05 20:16 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Wed, Jan 05, 2022 at 12:22:25PM -0500, Steven Sistare wrote:
> On 12/22/2021 6:07 PM, Michael S. Tsirkin wrote:
> > On Wed, Dec 22, 2021 at 11:05:22AM -0800, Steve Sistare wrote:
> >> Export msix_is_pending, msix_init_vector_notifiers, and pci_update_mappings
> >> for use by cpr.  No functional change.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > 
> > With things like that, I prefer when the API is exported
> > together with the patch that uses it.
> > This was I can see why we are exporting these APIs.
> > Esp wrt pci_update_mappings, it's designed as an
> > internal API.
> 
> Hi Michael, thanks very much for reviewing these patches.
> 
> Serendipitously, I stopped calling pci_update_mappings from vfio code earlier
> in the series.  I will revert its scope.
> 
> I would prefer to keep this patch separate from the use of these functions in
> "vfio-pci cpr part 2 msi", to make the latter smaller and easier to understand.
> How about if I say more in this commit message? :
> 
>   Export msix_is_pending and msix_init_vector_notifiers for use in vfio cpr.
>   Both are needed in the vfio-pci post-load function during cpr-load.
>   msix_is_pending is checked to enable the PBA memory region.
>   msix_init_vector_notifiers is called to register notifier callbacks, without
>   the other side effects of msix_set_vector_notifiers.
> 
> - Steve

Well the reason the side effects are there is to avoid losing events,
no? I'd like to figure out a bit better why we don't need them, and when
should users call msix_init_vector_notifiers versus
msix_set_vector_notifiers.

> >> ---
> >>  hw/pci/msix.c         | 20 ++++++++++++++------
> >>  hw/pci/pci.c          |  3 +--
> >>  include/hw/pci/msix.h |  5 +++++
> >>  include/hw/pci/pci.h  |  1 +
> >>  4 files changed, 21 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> >> index ae9331c..73f4259 100644
> >> --- a/hw/pci/msix.c
> >> +++ b/hw/pci/msix.c
> >> @@ -64,7 +64,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
> >>      return dev->msix_pba + vector / 8;
> >>  }
> >>  
> >> -static int msix_is_pending(PCIDevice *dev, int vector)
> >> +int msix_is_pending(PCIDevice *dev, unsigned int vector)
> >>  {
> >>      return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
> >>  }
> >> @@ -579,6 +579,17 @@ static void msix_unset_notifier_for_vector(PCIDevice *dev, unsigned int vector)
> >>      dev->msix_vector_release_notifier(dev, vector);
> >>  }
> >>  
> >> +void msix_init_vector_notifiers(PCIDevice *dev,
> >> +                                MSIVectorUseNotifier use_notifier,
> >> +                                MSIVectorReleaseNotifier release_notifier,
> >> +                                MSIVectorPollNotifier poll_notifier)
> >> +{
> >> +    assert(use_notifier && release_notifier);
> >> +    dev->msix_vector_use_notifier = use_notifier;
> >> +    dev->msix_vector_release_notifier = release_notifier;
> >> +    dev->msix_vector_poll_notifier = poll_notifier;
> >> +}
> >> +
> >>  int msix_set_vector_notifiers(PCIDevice *dev,
> >>                                MSIVectorUseNotifier use_notifier,
> >>                                MSIVectorReleaseNotifier release_notifier,
> >> @@ -586,11 +597,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
> >>  {
> >>      int vector, ret;
> >>  
> >> -    assert(use_notifier && release_notifier);
> >> -
> >> -    dev->msix_vector_use_notifier = use_notifier;
> >> -    dev->msix_vector_release_notifier = release_notifier;
> >> -    dev->msix_vector_poll_notifier = poll_notifier;
> >> +    msix_init_vector_notifiers(dev, use_notifier, release_notifier,
> >> +                               poll_notifier);
> >>  
> >>      if ((dev->config[dev->msix_cap + MSIX_CONTROL_OFFSET] &
> >>          (MSIX_ENABLE_MASK | MSIX_MASKALL_MASK)) == MSIX_ENABLE_MASK) {
> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> index e5993c1..0fd21e1 100644
> >> --- a/hw/pci/pci.c
> >> +++ b/hw/pci/pci.c
> >> @@ -225,7 +225,6 @@ static const TypeInfo pcie_bus_info = {
> >>  };
> >>  
> >>  static PCIBus *pci_find_bus_nr(PCIBus *bus, int bus_num);
> >> -static void pci_update_mappings(PCIDevice *d);
> >>  static void pci_irq_handler(void *opaque, int irq_num, int level);
> >>  static void pci_add_option_rom(PCIDevice *pdev, bool is_default_rom, Error **);
> >>  static void pci_del_option_rom(PCIDevice *pdev);
> >> @@ -1366,7 +1365,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
> >>      return new_addr;
> >>  }
> >>  
> >> -static void pci_update_mappings(PCIDevice *d)
> >> +void pci_update_mappings(PCIDevice *d)
> >>  {
> >>      PCIIORegion *r;
> >>      int i;
> >> diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
> >> index 4c4a60c..46606cf 100644
> >> --- a/include/hw/pci/msix.h
> >> +++ b/include/hw/pci/msix.h
> >> @@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
> >>  bool msix_is_masked(PCIDevice *dev, unsigned vector);
> >>  void msix_set_pending(PCIDevice *dev, unsigned vector);
> >>  void msix_clr_pending(PCIDevice *dev, int vector);
> >> +int msix_is_pending(PCIDevice *dev, unsigned vector);
> >>  
> >>  int msix_vector_use(PCIDevice *dev, unsigned vector);
> >>  void msix_vector_unuse(PCIDevice *dev, unsigned vector);
> >> @@ -41,6 +42,10 @@ void msix_notify(PCIDevice *dev, unsigned vector);
> >>  
> >>  void msix_reset(PCIDevice *dev);
> >>  
> >> +void msix_init_vector_notifiers(PCIDevice *dev,
> >> +                                MSIVectorUseNotifier use_notifier,
> >> +                                MSIVectorReleaseNotifier release_notifier,
> >> +                                MSIVectorPollNotifier poll_notifier);
> >>  int msix_set_vector_notifiers(PCIDevice *dev,
> >>                                MSIVectorUseNotifier use_notifier,
> >>                                MSIVectorReleaseNotifier release_notifier,
> >> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> >> index e7cdf2d..cc63dd4 100644
> >> --- a/include/hw/pci/pci.h
> >> +++ b/include/hw/pci/pci.h
> >> @@ -910,5 +910,6 @@ extern const VMStateDescription vmstate_pci_device;
> >>  
> >>  MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
> >>  void pci_set_power(PCIDevice *pci_dev, bool state);
> >> +void pci_update_mappings(PCIDevice *d);
> >>  
> >>  #endif
> >> -- 
> >> 1.8.3.1
> > 



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2022-01-05 17:24     ` Steven Sistare
@ 2022-01-05 21:14       ` Michael S. Tsirkin
  2022-01-05 21:40         ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2022-01-05 21:14 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> > On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
> >> Enable vfio-pci devices to be saved and restored across an exec restart
> >> of qemu.
> >>
> >> At vfio creation time, save the value of vfio container, group, and device
> >> descriptors in cpr state.
> >>
> >> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> >> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
> >> at a different VA after exec.  DMA to already-mapped pages continues.  Save
> >> the msi message area as part of vfio-pci vmstate, save the interrupt and
> >> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
> >> vfio descriptors.  The flag is not cleared earlier because the descriptors
> >> should not persist across miscellaneous fork and exec calls that may be
> >> performed during normal operation.
> >>
> >> On qemu restart, vfio_realize() finds the saved descriptors, uses
> >> the descriptors, and notes that the device is being reused.  Device and
> >> iommu state is already configured, so operations in vfio_realize that
> >> would modify the configuration are skipped for a reused device, including
> >> vfio ioctl's and writes to PCI configuration space.  The result is that
> >> vfio_realize constructs qemu data structures that reflect the current
> >> state of the device.  However, the reconstruction is not complete until
> >> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
> >> state.  It rebuilds vector data structures and attaches the interrupts to
> >> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
> >> which walks the flattened ranges of the vfio_address_spaces and calls
> >> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
> >> starts the VM and suppresses vfio pci device reset.
> >>
> >> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> >> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> >> support.  Part 3 adds INTX support.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  MAINTAINERS                   |   1 +
> >>  hw/pci/pci.c                  |  10 ++++
> >>  hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
> >>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
> >>  hw/vfio/meson.build           |   1 +
> >>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
> >>  hw/vfio/trace-events          |   1 +
> >>  include/hw/pci/pci.h          |   1 +
> >>  include/hw/vfio/vfio-common.h |   8 +++
> >>  include/migration/cpr.h       |   3 ++
> >>  migration/cpr.c               |  10 +++-
> >>  migration/target.c            |  14 +++++
> >>  12 files changed, 324 insertions(+), 11 deletions(-)
> >>  create mode 100644 hw/vfio/cpr.c
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index cfe7480..feed239 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -2992,6 +2992,7 @@ CPR
> >>  M: Steve Sistare <steven.sistare@oracle.com>
> >>  M: Mark Kanda <mark.kanda@oracle.com>
> >>  S: Maintained
> >> +F: hw/vfio/cpr.c
> >>  F: include/migration/cpr.h
> >>  F: migration/cpr.c
> >>  F: qapi/cpr.json
> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> index 0fd21e1..e35df4f 100644
> >> --- a/hw/pci/pci.c
> >> +++ b/hw/pci/pci.c
> >> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
> >>  {
> >>      int r;
> >>  
> >> +    /*
> >> +     * A reused vfio-pci device is already configured, so do not reset it
> >> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
> >> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
> >> +     * updated with new state in cpr-load with no ill effects.
> >> +     */
> >> +    if (dev->reused) {
> >> +        return;
> >> +    }
> >> +
> >>      pci_device_deassert_intx(dev);
> >>      assert(dev->irq_state == 0);
> >>  
> > 
> > 
> > Hmm that's a weird thing to do. I suspect this works because
> > "reused" means something like "in the process of being restored"?
> > Because clearly, we do not want to skip this part e.g. when
> > guest resets the device.
> 
> Exactly.  vfio_realize sets the flag if it detects the device is reused during
> a restart, and vfio_pci_post_load clears the reused flag.
> 
> > So a better name could be called for, but really I don't
> > love how vfio gets to poke at internal PCI state.
> > I'd rather we found a way just not to call this function.
> > If we can't, maybe an explicit API, and make it
> > actually say what it's doing?
> 
> How about:
> 
> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
> 
> vfio_realize()
>   pci_set_restore(pdev)
> 
> vfio_pci_post_load()
>   pci_clr_restore(pdev)
> 
> pci_do_device_reset()
>     if (dev->restore)
>         return;
> 
> - Steve


Not too bad. I'd like a better definition of what dev->restore is
exactly and to add them in comments near where it
is defined and used.

E.g. does this mean "device is being restored because of qemu restart"?

Do we need a per device flag for this thing or would a global
"qemu restart in progress" flag be enough?

-- 
MST



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2022-01-05 21:14       ` Michael S. Tsirkin
@ 2022-01-05 21:40         ` Steven Sistare
  2022-01-05 23:09           ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-01-05 21:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
>> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
>>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
>>>> Enable vfio-pci devices to be saved and restored across an exec restart
>>>> of qemu.
>>>>
>>>> At vfio creation time, save the value of vfio container, group, and device
>>>> descriptors in cpr state.
>>>>
>>>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>>>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
>>>> at a different VA after exec.  DMA to already-mapped pages continues.  Save
>>>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>>>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>>>> vfio descriptors.  The flag is not cleared earlier because the descriptors
>>>> should not persist across miscellaneous fork and exec calls that may be
>>>> performed during normal operation.
>>>>
>>>> On qemu restart, vfio_realize() finds the saved descriptors, uses
>>>> the descriptors, and notes that the device is being reused.  Device and
>>>> iommu state is already configured, so operations in vfio_realize that
>>>> would modify the configuration are skipped for a reused device, including
>>>> vfio ioctl's and writes to PCI configuration space.  The result is that
>>>> vfio_realize constructs qemu data structures that reflect the current
>>>> state of the device.  However, the reconstruction is not complete until
>>>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>>>> state.  It rebuilds vector data structures and attaches the interrupts to
>>>> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
>>>> which walks the flattened ranges of the vfio_address_spaces and calls
>>>> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
>>>> starts the VM and suppresses vfio pci device reset.
>>>>
>>>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>>>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
>>>> support.  Part 3 adds INTX support.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>  MAINTAINERS                   |   1 +
>>>>  hw/pci/pci.c                  |  10 ++++
>>>>  hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
>>>>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
>>>>  hw/vfio/meson.build           |   1 +
>>>>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
>>>>  hw/vfio/trace-events          |   1 +
>>>>  include/hw/pci/pci.h          |   1 +
>>>>  include/hw/vfio/vfio-common.h |   8 +++
>>>>  include/migration/cpr.h       |   3 ++
>>>>  migration/cpr.c               |  10 +++-
>>>>  migration/target.c            |  14 +++++
>>>>  12 files changed, 324 insertions(+), 11 deletions(-)
>>>>  create mode 100644 hw/vfio/cpr.c
>>>>
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index cfe7480..feed239 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -2992,6 +2992,7 @@ CPR
>>>>  M: Steve Sistare <steven.sistare@oracle.com>
>>>>  M: Mark Kanda <mark.kanda@oracle.com>
>>>>  S: Maintained
>>>> +F: hw/vfio/cpr.c
>>>>  F: include/migration/cpr.h
>>>>  F: migration/cpr.c
>>>>  F: qapi/cpr.json
>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>> index 0fd21e1..e35df4f 100644
>>>> --- a/hw/pci/pci.c
>>>> +++ b/hw/pci/pci.c
>>>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>>>>  {
>>>>      int r;
>>>>  
>>>> +    /*
>>>> +     * A reused vfio-pci device is already configured, so do not reset it
>>>> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
>>>> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
>>>> +     * updated with new state in cpr-load with no ill effects.
>>>> +     */
>>>> +    if (dev->reused) {
>>>> +        return;
>>>> +    }
>>>> +
>>>>      pci_device_deassert_intx(dev);
>>>>      assert(dev->irq_state == 0);
>>>>  
>>>
>>>
>>> Hmm that's a weird thing to do. I suspect this works because
>>> "reused" means something like "in the process of being restored"?
>>> Because clearly, we do not want to skip this part e.g. when
>>> guest resets the device.
>>
>> Exactly.  vfio_realize sets the flag if it detects the device is reused during
>> a restart, and vfio_pci_post_load clears the reused flag.
>>
>>> So a better name could be called for, but really I don't
>>> love how vfio gets to poke at internal PCI state.
>>> I'd rather we found a way just not to call this function.
>>> If we can't, maybe an explicit API, and make it
>>> actually say what it's doing?
>>
>> How about:
>>
>> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
>> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
>>
>> vfio_realize()
>>   pci_set_restore(pdev)
>>
>> vfio_pci_post_load()
>>   pci_clr_restore(pdev)
>>
>> pci_do_device_reset()
>>     if (dev->restore)
>>         return;
>>
>> - Steve
> 
> 
> Not too bad. I'd like a better definition of what dev->restore is
> exactly and to add them in comments near where it
> is defined and used.

Will do.

> E.g. does this mean "device is being restored because of qemu restart"?
> 
> Do we need a per device flag for this thing or would a global
> "qemu restart in progress" flag be enough?

A global flag (or function, which already exists) would suppress reset for all
PCI devices, not just vfio-pci.  I am concerned that for some devices, vmstate 
load may implicitly depend on the device having been reset for correctness, by 
virtue of some fields being initialized in the reset function.

- Steve


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2022-01-05 21:40         ` Steven Sistare
@ 2022-01-05 23:09           ` Michael S. Tsirkin
  2022-01-05 23:24             ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2022-01-05 23:09 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
> >> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
> >>>> Enable vfio-pci devices to be saved and restored across an exec restart
> >>>> of qemu.
> >>>>
> >>>> At vfio creation time, save the value of vfio container, group, and device
> >>>> descriptors in cpr state.
> >>>>
> >>>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> >>>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
> >>>> at a different VA after exec.  DMA to already-mapped pages continues.  Save
> >>>> the msi message area as part of vfio-pci vmstate, save the interrupt and
> >>>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
> >>>> vfio descriptors.  The flag is not cleared earlier because the descriptors
> >>>> should not persist across miscellaneous fork and exec calls that may be
> >>>> performed during normal operation.
> >>>>
> >>>> On qemu restart, vfio_realize() finds the saved descriptors, uses
> >>>> the descriptors, and notes that the device is being reused.  Device and
> >>>> iommu state is already configured, so operations in vfio_realize that
> >>>> would modify the configuration are skipped for a reused device, including
> >>>> vfio ioctl's and writes to PCI configuration space.  The result is that
> >>>> vfio_realize constructs qemu data structures that reflect the current
> >>>> state of the device.  However, the reconstruction is not complete until
> >>>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
> >>>> state.  It rebuilds vector data structures and attaches the interrupts to
> >>>> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
> >>>> which walks the flattened ranges of the vfio_address_spaces and calls
> >>>> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
> >>>> starts the VM and suppresses vfio pci device reset.
> >>>>
> >>>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> >>>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> >>>> support.  Part 3 adds INTX support.
> >>>>
> >>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>> ---
> >>>>  MAINTAINERS                   |   1 +
> >>>>  hw/pci/pci.c                  |  10 ++++
> >>>>  hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
> >>>>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
> >>>>  hw/vfio/meson.build           |   1 +
> >>>>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
> >>>>  hw/vfio/trace-events          |   1 +
> >>>>  include/hw/pci/pci.h          |   1 +
> >>>>  include/hw/vfio/vfio-common.h |   8 +++
> >>>>  include/migration/cpr.h       |   3 ++
> >>>>  migration/cpr.c               |  10 +++-
> >>>>  migration/target.c            |  14 +++++
> >>>>  12 files changed, 324 insertions(+), 11 deletions(-)
> >>>>  create mode 100644 hw/vfio/cpr.c
> >>>>
> >>>> diff --git a/MAINTAINERS b/MAINTAINERS
> >>>> index cfe7480..feed239 100644
> >>>> --- a/MAINTAINERS
> >>>> +++ b/MAINTAINERS
> >>>> @@ -2992,6 +2992,7 @@ CPR
> >>>>  M: Steve Sistare <steven.sistare@oracle.com>
> >>>>  M: Mark Kanda <mark.kanda@oracle.com>
> >>>>  S: Maintained
> >>>> +F: hw/vfio/cpr.c
> >>>>  F: include/migration/cpr.h
> >>>>  F: migration/cpr.c
> >>>>  F: qapi/cpr.json
> >>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >>>> index 0fd21e1..e35df4f 100644
> >>>> --- a/hw/pci/pci.c
> >>>> +++ b/hw/pci/pci.c
> >>>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
> >>>>  {
> >>>>      int r;
> >>>>  
> >>>> +    /*
> >>>> +     * A reused vfio-pci device is already configured, so do not reset it
> >>>> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
> >>>> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
> >>>> +     * updated with new state in cpr-load with no ill effects.
> >>>> +     */
> >>>> +    if (dev->reused) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>>      pci_device_deassert_intx(dev);
> >>>>      assert(dev->irq_state == 0);
> >>>>  
> >>>
> >>>
> >>> Hmm that's a weird thing to do. I suspect this works because
> >>> "reused" means something like "in the process of being restored"?
> >>> Because clearly, we do not want to skip this part e.g. when
> >>> guest resets the device.
> >>
> >> Exactly.  vfio_realize sets the flag if it detects the device is reused during
> >> a restart, and vfio_pci_post_load clears the reused flag.
> >>
> >>> So a better name could be called for, but really I don't
> >>> love how vfio gets to poke at internal PCI state.
> >>> I'd rather we found a way just not to call this function.
> >>> If we can't, maybe an explicit API, and make it
> >>> actually say what it's doing?
> >>
> >> How about:
> >>
> >> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
> >> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
> >>
> >> vfio_realize()
> >>   pci_set_restore(pdev)
> >>
> >> vfio_pci_post_load()
> >>   pci_clr_restore(pdev)
> >>
> >> pci_do_device_reset()
> >>     if (dev->restore)
> >>         return;
> >>
> >> - Steve
> > 
> > 
> > Not too bad. I'd like a better definition of what dev->restore is
> > exactly and to add them in comments near where it
> > is defined and used.
> 
> Will do.
> 
> > E.g. does this mean "device is being restored because of qemu restart"?
> > 
> > Do we need a per device flag for this thing or would a global
> > "qemu restart in progress" flag be enough?
> 
> A global flag (or function, which already exists) would suppress reset for all
> PCI devices, not just vfio-pci.  I am concerned that for some devices, vmstate 
> load may implicitly depend on the device having been reset for correctness, by 
> virtue of some fields being initialized in the reset function.
> 
> - Steve

So just so I understand, how do these other devices work with restart?
Do they use the save/loadvm machinery? And the reason vfio doesn't
is because it generally does not support savevm/loadvm?

-- 
MST



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2022-01-05 23:09           ` Michael S. Tsirkin
@ 2022-01-05 23:24             ` Steven Sistare
  2022-01-06  9:12               ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-01-05 23:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 1/5/2022 6:09 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
>> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
>>>> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
>>>>>> Enable vfio-pci devices to be saved and restored across an exec restart
>>>>>> of qemu.
>>>>>>
>>>>>> At vfio creation time, save the value of vfio container, group, and device
>>>>>> descriptors in cpr state.
>>>>>>
>>>>>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>>>>>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
>>>>>> at a different VA after exec.  DMA to already-mapped pages continues.  Save
>>>>>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>>>>>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>>>>>> vfio descriptors.  The flag is not cleared earlier because the descriptors
>>>>>> should not persist across miscellaneous fork and exec calls that may be
>>>>>> performed during normal operation.
>>>>>>
>>>>>> On qemu restart, vfio_realize() finds the saved descriptors, uses
>>>>>> the descriptors, and notes that the device is being reused.  Device and
>>>>>> iommu state is already configured, so operations in vfio_realize that
>>>>>> would modify the configuration are skipped for a reused device, including
>>>>>> vfio ioctl's and writes to PCI configuration space.  The result is that
>>>>>> vfio_realize constructs qemu data structures that reflect the current
>>>>>> state of the device.  However, the reconstruction is not complete until
>>>>>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>>>>>> state.  It rebuilds vector data structures and attaches the interrupts to
>>>>>> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
>>>>>> which walks the flattened ranges of the vfio_address_spaces and calls
>>>>>> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
>>>>>> starts the VM and suppresses vfio pci device reset.
>>>>>>
>>>>>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>>>>>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
>>>>>> support.  Part 3 adds INTX support.
>>>>>>
>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>> ---
>>>>>>  MAINTAINERS                   |   1 +
>>>>>>  hw/pci/pci.c                  |  10 ++++
>>>>>>  hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
>>>>>>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
>>>>>>  hw/vfio/meson.build           |   1 +
>>>>>>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
>>>>>>  hw/vfio/trace-events          |   1 +
>>>>>>  include/hw/pci/pci.h          |   1 +
>>>>>>  include/hw/vfio/vfio-common.h |   8 +++
>>>>>>  include/migration/cpr.h       |   3 ++
>>>>>>  migration/cpr.c               |  10 +++-
>>>>>>  migration/target.c            |  14 +++++
>>>>>>  12 files changed, 324 insertions(+), 11 deletions(-)
>>>>>>  create mode 100644 hw/vfio/cpr.c
>>>>>>
>>>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>>>> index cfe7480..feed239 100644
>>>>>> --- a/MAINTAINERS
>>>>>> +++ b/MAINTAINERS
>>>>>> @@ -2992,6 +2992,7 @@ CPR
>>>>>>  M: Steve Sistare <steven.sistare@oracle.com>
>>>>>>  M: Mark Kanda <mark.kanda@oracle.com>
>>>>>>  S: Maintained
>>>>>> +F: hw/vfio/cpr.c
>>>>>>  F: include/migration/cpr.h
>>>>>>  F: migration/cpr.c
>>>>>>  F: qapi/cpr.json
>>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>>>> index 0fd21e1..e35df4f 100644
>>>>>> --- a/hw/pci/pci.c
>>>>>> +++ b/hw/pci/pci.c
>>>>>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>>>>>>  {
>>>>>>      int r;
>>>>>>  
>>>>>> +    /*
>>>>>> +     * A reused vfio-pci device is already configured, so do not reset it
>>>>>> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
>>>>>> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
>>>>>> +     * updated with new state in cpr-load with no ill effects.
>>>>>> +     */
>>>>>> +    if (dev->reused) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>>      pci_device_deassert_intx(dev);
>>>>>>      assert(dev->irq_state == 0);
>>>>>>  
>>>>>
>>>>>
>>>>> Hmm that's a weird thing to do. I suspect this works because
>>>>> "reused" means something like "in the process of being restored"?
>>>>> Because clearly, we do not want to skip this part e.g. when
>>>>> guest resets the device.
>>>>
>>>> Exactly.  vfio_realize sets the flag if it detects the device is reused during
>>>> a restart, and vfio_pci_post_load clears the reused flag.
>>>>
>>>>> So a better name could be called for, but really I don't
>>>>> love how vfio gets to poke at internal PCI state.
>>>>> I'd rather we found a way just not to call this function.
>>>>> If we can't, maybe an explicit API, and make it
>>>>> actually say what it's doing?
>>>>
>>>> How about:
>>>>
>>>> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
>>>> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
>>>>
>>>> vfio_realize()
>>>>   pci_set_restore(pdev)
>>>>
>>>> vfio_pci_post_load()
>>>>   pci_clr_restore(pdev)
>>>>
>>>> pci_do_device_reset()
>>>>     if (dev->restore)
>>>>         return;
>>>>
>>>> - Steve
>>>
>>>
>>> Not too bad. I'd like a better definition of what dev->restore is
>>> exactly and to add them in comments near where it
>>> is defined and used.
>>
>> Will do.
>>
>>> E.g. does this mean "device is being restored because of qemu restart"?
>>>
>>> Do we need a per device flag for this thing or would a global
>>> "qemu restart in progress" flag be enough?
>>
>> A global flag (or function, which already exists) would suppress reset for all
>> PCI devices, not just vfio-pci.  I am concerned that for some devices, vmstate 
>> load may implicitly depend on the device having been reset for correctness, by 
>> virtue of some fields being initialized in the reset function.
>>
>> - Steve
> 
> So just so I understand, how do these other devices work with restart?
> Do they use the save/loadvm machinery? And the reason vfio doesn't
> is because it generally does not support savevm/loadvm?

They all use save/loadvm.  vfio-pci also uses save/loadvm to preserve its soft state,
plus it preserves its device descriptors.  The only bit we are skipping is the reset
function for vfio-pci, because the hardware device is actively processing dma and 
interrupts, and they would be lost.  Reset is called unconditionally for all devices 
during qemu startup, prior to loadvm, by the path qdev_machine_creation_done() ->
qemu_system_reset().

- Steve


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2022-01-05 23:24             ` Steven Sistare
@ 2022-01-06  9:12               ` Michael S. Tsirkin
  2022-01-06 19:13                 ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2022-01-06  9:12 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Wed, Jan 05, 2022 at 06:24:25PM -0500, Steven Sistare wrote:
> On 1/5/2022 6:09 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
> >> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
> >>>> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
> >>>>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
> >>>>>> Enable vfio-pci devices to be saved and restored across an exec restart
> >>>>>> of qemu.
> >>>>>>
> >>>>>> At vfio creation time, save the value of vfio container, group, and device
> >>>>>> descriptors in cpr state.
> >>>>>>
> >>>>>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> >>>>>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
> >>>>>> at a different VA after exec.  DMA to already-mapped pages continues.  Save
> >>>>>> the msi message area as part of vfio-pci vmstate, save the interrupt and
> >>>>>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
> >>>>>> vfio descriptors.  The flag is not cleared earlier because the descriptors
> >>>>>> should not persist across miscellaneous fork and exec calls that may be
> >>>>>> performed during normal operation.
> >>>>>>
> >>>>>> On qemu restart, vfio_realize() finds the saved descriptors, uses
> >>>>>> the descriptors, and notes that the device is being reused.  Device and
> >>>>>> iommu state is already configured, so operations in vfio_realize that
> >>>>>> would modify the configuration are skipped for a reused device, including
> >>>>>> vfio ioctl's and writes to PCI configuration space.  The result is that
> >>>>>> vfio_realize constructs qemu data structures that reflect the current
> >>>>>> state of the device.  However, the reconstruction is not complete until
> >>>>>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
> >>>>>> state.  It rebuilds vector data structures and attaches the interrupts to
> >>>>>> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
> >>>>>> which walks the flattened ranges of the vfio_address_spaces and calls
> >>>>>> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
> >>>>>> starts the VM and suppresses vfio pci device reset.
> >>>>>>
> >>>>>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> >>>>>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> >>>>>> support.  Part 3 adds INTX support.
> >>>>>>
> >>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>>>> ---
> >>>>>>  MAINTAINERS                   |   1 +
> >>>>>>  hw/pci/pci.c                  |  10 ++++
> >>>>>>  hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
> >>>>>>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
> >>>>>>  hw/vfio/meson.build           |   1 +
> >>>>>>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
> >>>>>>  hw/vfio/trace-events          |   1 +
> >>>>>>  include/hw/pci/pci.h          |   1 +
> >>>>>>  include/hw/vfio/vfio-common.h |   8 +++
> >>>>>>  include/migration/cpr.h       |   3 ++
> >>>>>>  migration/cpr.c               |  10 +++-
> >>>>>>  migration/target.c            |  14 +++++
> >>>>>>  12 files changed, 324 insertions(+), 11 deletions(-)
> >>>>>>  create mode 100644 hw/vfio/cpr.c
> >>>>>>
> >>>>>> diff --git a/MAINTAINERS b/MAINTAINERS
> >>>>>> index cfe7480..feed239 100644
> >>>>>> --- a/MAINTAINERS
> >>>>>> +++ b/MAINTAINERS
> >>>>>> @@ -2992,6 +2992,7 @@ CPR
> >>>>>>  M: Steve Sistare <steven.sistare@oracle.com>
> >>>>>>  M: Mark Kanda <mark.kanda@oracle.com>
> >>>>>>  S: Maintained
> >>>>>> +F: hw/vfio/cpr.c
> >>>>>>  F: include/migration/cpr.h
> >>>>>>  F: migration/cpr.c
> >>>>>>  F: qapi/cpr.json
> >>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >>>>>> index 0fd21e1..e35df4f 100644
> >>>>>> --- a/hw/pci/pci.c
> >>>>>> +++ b/hw/pci/pci.c
> >>>>>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
> >>>>>>  {
> >>>>>>      int r;
> >>>>>>  
> >>>>>> +    /*
> >>>>>> +     * A reused vfio-pci device is already configured, so do not reset it
> >>>>>> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
> >>>>>> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
> >>>>>> +     * updated with new state in cpr-load with no ill effects.
> >>>>>> +     */
> >>>>>> +    if (dev->reused) {
> >>>>>> +        return;
> >>>>>> +    }
> >>>>>> +
> >>>>>>      pci_device_deassert_intx(dev);
> >>>>>>      assert(dev->irq_state == 0);
> >>>>>>  
> >>>>>
> >>>>>
> >>>>> Hmm that's a weird thing to do. I suspect this works because
> >>>>> "reused" means something like "in the process of being restored"?
> >>>>> Because clearly, we do not want to skip this part e.g. when
> >>>>> guest resets the device.
> >>>>
> >>>> Exactly.  vfio_realize sets the flag if it detects the device is reused during
> >>>> a restart, and vfio_pci_post_load clears the reused flag.
> >>>>
> >>>>> So a better name could be called for, but really I don't
> >>>>> love how vfio gets to poke at internal PCI state.
> >>>>> I'd rather we found a way just not to call this function.
> >>>>> If we can't, maybe an explicit API, and make it
> >>>>> actually say what it's doing?
> >>>>
> >>>> How about:
> >>>>
> >>>> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
> >>>> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
> >>>>
> >>>> vfio_realize()
> >>>>   pci_set_restore(pdev)
> >>>>
> >>>> vfio_pci_post_load()
> >>>>   pci_clr_restore(pdev)
> >>>>
> >>>> pci_do_device_reset()
> >>>>     if (dev->restore)
> >>>>         return;
> >>>>
> >>>> - Steve
> >>>
> >>>
> >>> Not too bad. I'd like a better definition of what dev->restore is
> >>> exactly and to add them in comments near where it
> >>> is defined and used.
> >>
> >> Will do.
> >>
> >>> E.g. does this mean "device is being restored because of qemu restart"?
> >>>
> >>> Do we need a per device flag for this thing or would a global
> >>> "qemu restart in progress" flag be enough?
> >>
> >> A global flag (or function, which already exists) would suppress reset for all
> >> PCI devices, not just vfio-pci.  I am concerned that for some devices, vmstate 
> >> load may implicitly depend on the device having been reset for correctness, by 
> >> virtue of some fields being initialized in the reset function.
> >>
> >> - Steve

I took a look and I don't really see any cases like this.
I think pci_qdev_realize will initialize the pci core to a correct state,
pci_do_device_reset isn't necessary right after realize.
It seems safe to just skip it for all devices unconditionally.
A bunch of devices do depend on reset to init them correctly,
e.g. hw/ide/piix.c sets pci status in piix_ide_reset.
But pci core does not seem to.


> > So just so I understand, how do these other devices work with restart?
> > Do they use the save/loadvm machinery? And the reason vfio doesn't
> > is because it generally does not support savevm/loadvm?
> 
> They all use save/loadvm.  vfio-pci also uses save/loadvm to preserve its soft state,
> plus it preserves its device descriptors.  The only bit we are skipping is the reset
> function for vfio-pci, because the hardware device is actively processing dma and 
> interrupts, and they would be lost.  Reset is called unconditionally for all devices 
> during qemu startup, prior to loadvm, by the path qdev_machine_creation_done() ->
> qemu_system_reset().
> 
> - Steve



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2022-01-06  9:12               ` Michael S. Tsirkin
@ 2022-01-06 19:13                 ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-01-06 19:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 1/6/2022 4:12 AM, Michael S. Tsirkin wrote:
> On Wed, Jan 05, 2022 at 06:24:25PM -0500, Steven Sistare wrote:
>> On 1/5/2022 6:09 PM, Michael S. Tsirkin wrote:
>>> On Wed, Jan 05, 2022 at 04:40:43PM -0500, Steven Sistare wrote:
>>>> On 1/5/2022 4:14 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Jan 05, 2022 at 12:24:21PM -0500, Steven Sistare wrote:
>>>>>> On 12/22/2021 6:15 PM, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Dec 22, 2021 at 11:05:24AM -0800, Steve Sistare wrote:
>>>>>>>> Enable vfio-pci devices to be saved and restored across an exec restart
>>>>>>>> of qemu.
>>>>>>>>
>>>>>>>> At vfio creation time, save the value of vfio container, group, and device
>>>>>>>> descriptors in cpr state.
>>>>>>>>
>>>>>>>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>>>>>>>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
>>>>>>>> at a different VA after exec.  DMA to already-mapped pages continues.  Save
>>>>>>>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>>>>>>>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>>>>>>>> vfio descriptors.  The flag is not cleared earlier because the descriptors
>>>>>>>> should not persist across miscellaneous fork and exec calls that may be
>>>>>>>> performed during normal operation.
>>>>>>>>
>>>>>>>> On qemu restart, vfio_realize() finds the saved descriptors, uses
>>>>>>>> the descriptors, and notes that the device is being reused.  Device and
>>>>>>>> iommu state is already configured, so operations in vfio_realize that
>>>>>>>> would modify the configuration are skipped for a reused device, including
>>>>>>>> vfio ioctl's and writes to PCI configuration space.  The result is that
>>>>>>>> vfio_realize constructs qemu data structures that reflect the current
>>>>>>>> state of the device.  However, the reconstruction is not complete until
>>>>>>>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>>>>>>>> state.  It rebuilds vector data structures and attaches the interrupts to
>>>>>>>> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
>>>>>>>> which walks the flattened ranges of the vfio_address_spaces and calls
>>>>>>>> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
>>>>>>>> starts the VM and suppresses vfio pci device reset.
>>>>>>>>
>>>>>>>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>>>>>>>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
>>>>>>>> support.  Part 3 adds INTX support.
>>>>>>>>
>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>> ---
>>>>>>>>  MAINTAINERS                   |   1 +
>>>>>>>>  hw/pci/pci.c                  |  10 ++++
>>>>>>>>  hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
>>>>>>>>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
>>>>>>>>  hw/vfio/meson.build           |   1 +
>>>>>>>>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
>>>>>>>>  hw/vfio/trace-events          |   1 +
>>>>>>>>  include/hw/pci/pci.h          |   1 +
>>>>>>>>  include/hw/vfio/vfio-common.h |   8 +++
>>>>>>>>  include/migration/cpr.h       |   3 ++
>>>>>>>>  migration/cpr.c               |  10 +++-
>>>>>>>>  migration/target.c            |  14 +++++
>>>>>>>>  12 files changed, 324 insertions(+), 11 deletions(-)
>>>>>>>>  create mode 100644 hw/vfio/cpr.c
>>>>>>>>
>>>>>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>>>>>> index cfe7480..feed239 100644
>>>>>>>> --- a/MAINTAINERS
>>>>>>>> +++ b/MAINTAINERS
>>>>>>>> @@ -2992,6 +2992,7 @@ CPR
>>>>>>>>  M: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>  M: Mark Kanda <mark.kanda@oracle.com>
>>>>>>>>  S: Maintained
>>>>>>>> +F: hw/vfio/cpr.c
>>>>>>>>  F: include/migration/cpr.h
>>>>>>>>  F: migration/cpr.c
>>>>>>>>  F: qapi/cpr.json
>>>>>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>>>>>> index 0fd21e1..e35df4f 100644
>>>>>>>> --- a/hw/pci/pci.c
>>>>>>>> +++ b/hw/pci/pci.c
>>>>>>>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>>>>>>>>  {
>>>>>>>>      int r;
>>>>>>>>  
>>>>>>>> +    /*
>>>>>>>> +     * A reused vfio-pci device is already configured, so do not reset it
>>>>>>>> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
>>>>>>>> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
>>>>>>>> +     * updated with new state in cpr-load with no ill effects.
>>>>>>>> +     */
>>>>>>>> +    if (dev->reused) {
>>>>>>>> +        return;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>>      pci_device_deassert_intx(dev);
>>>>>>>>      assert(dev->irq_state == 0);
>>>>>>>>  
>>>>>>>
>>>>>>>
>>>>>>> Hmm that's a weird thing to do. I suspect this works because
>>>>>>> "reused" means something like "in the process of being restored"?
>>>>>>> Because clearly, we do not want to skip this part e.g. when
>>>>>>> guest resets the device.
>>>>>>
>>>>>> Exactly.  vfio_realize sets the flag if it detects the device is reused during
>>>>>> a restart, and vfio_pci_post_load clears the reused flag.
>>>>>>
>>>>>>> So a better name could be called for, but really I don't
>>>>>>> love how vfio gets to poke at internal PCI state.
>>>>>>> I'd rather we found a way just not to call this function.
>>>>>>> If we can't, maybe an explicit API, and make it
>>>>>>> actually say what it's doing?
>>>>>>
>>>>>> How about:
>>>>>>
>>>>>> pci_set_restore(PCIDevice *dev) { dev->restore = true; }
>>>>>> pci_clr_restore(PCIDevice *dev) { dev->restore = false; }
>>>>>>
>>>>>> vfio_realize()
>>>>>>   pci_set_restore(pdev)
>>>>>>
>>>>>> vfio_pci_post_load()
>>>>>>   pci_clr_restore(pdev)
>>>>>>
>>>>>> pci_do_device_reset()
>>>>>>     if (dev->restore)
>>>>>>         return;
>>>>>>
>>>>>> - Steve
>>>>>
>>>>>
>>>>> Not too bad. I'd like a better definition of what dev->restore is
>>>>> exactly and to add them in comments near where it
>>>>> is defined and used.
>>>>
>>>> Will do.
>>>>
>>>>> E.g. does this mean "device is being restored because of qemu restart"?
>>>>>
>>>>> Do we need a per device flag for this thing or would a global
>>>>> "qemu restart in progress" flag be enough?
>>>>
>>>> A global flag (or function, which already exists) would suppress reset for all
>>>> PCI devices, not just vfio-pci.  I am concerned that for some devices, vmstate 
>>>> load may implicitly depend on the device having been reset for correctness, by 
>>>> virtue of some fields being initialized in the reset function.
>>>>
>>>> - Steve
> 
> I took a look and I don't really see any cases like this.
> I think pci_qdev_realize will initialize the pci core to a correct state,
> pci_do_device_reset isn't necessary right after realize.
> It seems safe to just skip it for all devices unconditionally.
> A bunch of devices do depend on reset to init them correctly,
> e.g. hw/ide/piix.c sets pci status in piix_ide_reset.
> But pci core does not seem to.

Cool.  After comparing pci_do_device_reset to pci_qdev_realize -> do_pci_register_device,
I concur.

This will do the trick.  Mode is set before reset is called, and cleared in cpr-load.

--------------------------
#include "migration/cpr.h"

static void pci_do_device_reset(PCIDevice *dev)
{
    int r;

    /*
     * A PCI device that is resuming for cpr is already configured, so do
     * not reset it here when we are called from qemu_system_reset prior to
     * cpr-load, else interrupts may be lost for vfio-pci devices.  It is
     * safe to skip this reset for all PCI devices, because cpr-load will set
     * all fields that would have been set here.
     */
    if (cpr_get_mode() == CPR_MODE_RESTART) {
        return;
    }
-----------------------------

- Steve

>>> So just so I understand, how do these other devices work with restart?
>>> Do they use the save/loadvm machinery? And the reason vfio doesn't
>>> is because it generally does not support savevm/loadvm?
>>
>> They all use save/loadvm.  vfio-pci also uses save/loadvm to preserve its soft state,
>> plus it preserves its device descriptors.  The only bit we are skipping is the reset
>> function for vfio-pci, because the hardware device is actively processing dma and 
>> interrupts, and they would be lost.  Reset is called unconditionally for all devices 
>> during qemu startup, prior to loadvm, by the path qdev_machine_creation_done() ->
>> qemu_system_reset().
>>
>> - Steve
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 17/29] pci: export functions for cpr
  2022-01-05 20:16       ` Michael S. Tsirkin
@ 2022-01-06 22:48         ` Steven Sistare
  2022-01-07 10:03           ` Michael S. Tsirkin
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-01-06 22:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 1/5/2022 3:16 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 05, 2022 at 12:22:25PM -0500, Steven Sistare wrote:
>> On 12/22/2021 6:07 PM, Michael S. Tsirkin wrote:
>>> On Wed, Dec 22, 2021 at 11:05:22AM -0800, Steve Sistare wrote:
>>>> Export msix_is_pending, msix_init_vector_notifiers, and pci_update_mappings
>>>> for use by cpr.  No functional change.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>
>>> With things like that, I prefer when the API is exported
>>> together with the patch that uses it.
>>> This was I can see why we are exporting these APIs.
>>> Esp wrt pci_update_mappings, it's designed as an
>>> internal API.
>>
>> Hi Michael, thanks very much for reviewing these patches.
>>
>> Serendipitously, I stopped calling pci_update_mappings from vfio code earlier
>> in the series.  I will revert its scope.
>>
>> I would prefer to keep this patch separate from the use of these functions in
>> "vfio-pci cpr part 2 msi", to make the latter smaller and easier to understand.
>> How about if I say more in this commit message? :
>>
>>   Export msix_is_pending and msix_init_vector_notifiers for use in vfio cpr.
>>   Both are needed in the vfio-pci post-load function during cpr-load.
>>   msix_is_pending is checked to enable the PBA memory region.
>>   msix_init_vector_notifiers is called to register notifier callbacks, without
>>   the other side effects of msix_set_vector_notifiers.
>>
>> - Steve
> 
> Well the reason the side effects are there is to avoid losing events,
> no? I'd like to figure out a bit better why we don't need them,

Currently I do not call vfio_msix_vector_do_use during resume, but
instead execute a subset of its actions in vfio_claim_vectors, which is
defined in vfio-cpr: cpr part 2.

> and when should users call msix_init_vector_notifiers versus
> msix_set_vector_notifiers.

If I call msix_set_vector_notifiers, it calls the use notifier
vfio_msix_vector_use, which calls vfio_msix_vector_do_use.  The latter
gets confused and breaks the vectors because vector-related fields are
only partially initialized.  The details are unimportant, because --

Instead of adding msix_init_vector_notifiers, I will call
msix_set_vector_notifiers, but bail from vfio_msix_vector_do_use if resuming.
Tested and works.

Thus this patch becomes simply "pci: export msix_is_pending".  I can keep it,
or fold it into "vfio-pci: cpr part 2 (msi)".  Your call.

- Steve

>>>> ---
>>>>  hw/pci/msix.c         | 20 ++++++++++++++------
>>>>  hw/pci/pci.c          |  3 +--
>>>>  include/hw/pci/msix.h |  5 +++++
>>>>  include/hw/pci/pci.h  |  1 +
>>>>  4 files changed, 21 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
>>>> index ae9331c..73f4259 100644
>>>> --- a/hw/pci/msix.c
>>>> +++ b/hw/pci/msix.c
>>>> @@ -64,7 +64,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
>>>>      return dev->msix_pba + vector / 8;
>>>>  }
>>>>  
>>>> -static int msix_is_pending(PCIDevice *dev, int vector)
>>>> +int msix_is_pending(PCIDevice *dev, unsigned int vector)
>>>>  {
>>>>      return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
>>>>  }
>>>> @@ -579,6 +579,17 @@ static void msix_unset_notifier_for_vector(PCIDevice *dev, unsigned int vector)
>>>>      dev->msix_vector_release_notifier(dev, vector);
>>>>  }
>>>>  
>>>> +void msix_init_vector_notifiers(PCIDevice *dev,
>>>> +                                MSIVectorUseNotifier use_notifier,
>>>> +                                MSIVectorReleaseNotifier release_notifier,
>>>> +                                MSIVectorPollNotifier poll_notifier)
>>>> +{
>>>> +    assert(use_notifier && release_notifier);
>>>> +    dev->msix_vector_use_notifier = use_notifier;
>>>> +    dev->msix_vector_release_notifier = release_notifier;
>>>> +    dev->msix_vector_poll_notifier = poll_notifier;
>>>> +}
>>>> +
>>>>  int msix_set_vector_notifiers(PCIDevice *dev,
>>>>                                MSIVectorUseNotifier use_notifier,
>>>>                                MSIVectorReleaseNotifier release_notifier,
>>>> @@ -586,11 +597,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
>>>>  {
>>>>      int vector, ret;
>>>>  
>>>> -    assert(use_notifier && release_notifier);
>>>> -
>>>> -    dev->msix_vector_use_notifier = use_notifier;
>>>> -    dev->msix_vector_release_notifier = release_notifier;
>>>> -    dev->msix_vector_poll_notifier = poll_notifier;
>>>> +    msix_init_vector_notifiers(dev, use_notifier, release_notifier,
>>>> +                               poll_notifier);
>>>>  
>>>>      if ((dev->config[dev->msix_cap + MSIX_CONTROL_OFFSET] &
>>>>          (MSIX_ENABLE_MASK | MSIX_MASKALL_MASK)) == MSIX_ENABLE_MASK) {
>>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>>> index e5993c1..0fd21e1 100644
>>>> --- a/hw/pci/pci.c
>>>> +++ b/hw/pci/pci.c
>>>> @@ -225,7 +225,6 @@ static const TypeInfo pcie_bus_info = {
>>>>  };
>>>>  
>>>>  static PCIBus *pci_find_bus_nr(PCIBus *bus, int bus_num);
>>>> -static void pci_update_mappings(PCIDevice *d);
>>>>  static void pci_irq_handler(void *opaque, int irq_num, int level);
>>>>  static void pci_add_option_rom(PCIDevice *pdev, bool is_default_rom, Error **);
>>>>  static void pci_del_option_rom(PCIDevice *pdev);
>>>> @@ -1366,7 +1365,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
>>>>      return new_addr;
>>>>  }
>>>>  
>>>> -static void pci_update_mappings(PCIDevice *d)
>>>> +void pci_update_mappings(PCIDevice *d)
>>>>  {
>>>>      PCIIORegion *r;
>>>>      int i;
>>>> diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
>>>> index 4c4a60c..46606cf 100644
>>>> --- a/include/hw/pci/msix.h
>>>> +++ b/include/hw/pci/msix.h
>>>> @@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
>>>>  bool msix_is_masked(PCIDevice *dev, unsigned vector);
>>>>  void msix_set_pending(PCIDevice *dev, unsigned vector);
>>>>  void msix_clr_pending(PCIDevice *dev, int vector);
>>>> +int msix_is_pending(PCIDevice *dev, unsigned vector);
>>>>  
>>>>  int msix_vector_use(PCIDevice *dev, unsigned vector);
>>>>  void msix_vector_unuse(PCIDevice *dev, unsigned vector);
>>>> @@ -41,6 +42,10 @@ void msix_notify(PCIDevice *dev, unsigned vector);
>>>>  
>>>>  void msix_reset(PCIDevice *dev);
>>>>  
>>>> +void msix_init_vector_notifiers(PCIDevice *dev,
>>>> +                                MSIVectorUseNotifier use_notifier,
>>>> +                                MSIVectorReleaseNotifier release_notifier,
>>>> +                                MSIVectorPollNotifier poll_notifier);
>>>>  int msix_set_vector_notifiers(PCIDevice *dev,
>>>>                                MSIVectorUseNotifier use_notifier,
>>>>                                MSIVectorReleaseNotifier release_notifier,
>>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>>> index e7cdf2d..cc63dd4 100644
>>>> --- a/include/hw/pci/pci.h
>>>> +++ b/include/hw/pci/pci.h
>>>> @@ -910,5 +910,6 @@ extern const VMStateDescription vmstate_pci_device;
>>>>  
>>>>  MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
>>>>  void pci_set_power(PCIDevice *pci_dev, bool state);
>>>> +void pci_update_mappings(PCIDevice *d);
>>>>  
>>>>  #endif
>>>> -- 
>>>> 1.8.3.1
>>>
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 17/29] pci: export functions for cpr
  2022-01-06 22:48         ` Steven Sistare
@ 2022-01-07 10:03           ` Michael S. Tsirkin
  0 siblings, 0 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2022-01-07 10:03 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Thu, Jan 06, 2022 at 05:48:28PM -0500, Steven Sistare wrote:
> On 1/5/2022 3:16 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 05, 2022 at 12:22:25PM -0500, Steven Sistare wrote:
> >> On 12/22/2021 6:07 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Dec 22, 2021 at 11:05:22AM -0800, Steve Sistare wrote:
> >>>> Export msix_is_pending, msix_init_vector_notifiers, and pci_update_mappings
> >>>> for use by cpr.  No functional change.
> >>>>
> >>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>
> >>> With things like that, I prefer when the API is exported
> >>> together with the patch that uses it.
> >>> This was I can see why we are exporting these APIs.
> >>> Esp wrt pci_update_mappings, it's designed as an
> >>> internal API.
> >>
> >> Hi Michael, thanks very much for reviewing these patches.
> >>
> >> Serendipitously, I stopped calling pci_update_mappings from vfio code earlier
> >> in the series.  I will revert its scope.
> >>
> >> I would prefer to keep this patch separate from the use of these functions in
> >> "vfio-pci cpr part 2 msi", to make the latter smaller and easier to understand.
> >> How about if I say more in this commit message? :
> >>
> >>   Export msix_is_pending and msix_init_vector_notifiers for use in vfio cpr.
> >>   Both are needed in the vfio-pci post-load function during cpr-load.
> >>   msix_is_pending is checked to enable the PBA memory region.
> >>   msix_init_vector_notifiers is called to register notifier callbacks, without
> >>   the other side effects of msix_set_vector_notifiers.
> >>
> >> - Steve
> > 
> > Well the reason the side effects are there is to avoid losing events,
> > no? I'd like to figure out a bit better why we don't need them,
> 
> Currently I do not call vfio_msix_vector_do_use during resume, but
> instead execute a subset of its actions in vfio_claim_vectors, which is
> defined in vfio-cpr: cpr part 2.
> 
> > and when should users call msix_init_vector_notifiers versus
> > msix_set_vector_notifiers.
> 
> If I call msix_set_vector_notifiers, it calls the use notifier
> vfio_msix_vector_use, which calls vfio_msix_vector_do_use.  The latter
> gets confused and breaks the vectors because vector-related fields are
> only partially initialized.  The details are unimportant, because --
> 
> Instead of adding msix_init_vector_notifiers, I will call
> msix_set_vector_notifiers, but bail from vfio_msix_vector_do_use if resuming.
> Tested and works.
> 
> Thus this patch becomes simply "pci: export msix_is_pending".  I can keep it,
> or fold it into "vfio-pci: cpr part 2 (msi)".  Your call.
> 
> - Steve

Keep it, it's ok.

> >>>> ---
> >>>>  hw/pci/msix.c         | 20 ++++++++++++++------
> >>>>  hw/pci/pci.c          |  3 +--
> >>>>  include/hw/pci/msix.h |  5 +++++
> >>>>  include/hw/pci/pci.h  |  1 +
> >>>>  4 files changed, 21 insertions(+), 8 deletions(-)
> >>>>
> >>>> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> >>>> index ae9331c..73f4259 100644
> >>>> --- a/hw/pci/msix.c
> >>>> +++ b/hw/pci/msix.c
> >>>> @@ -64,7 +64,7 @@ static uint8_t *msix_pending_byte(PCIDevice *dev, int vector)
> >>>>      return dev->msix_pba + vector / 8;
> >>>>  }
> >>>>  
> >>>> -static int msix_is_pending(PCIDevice *dev, int vector)
> >>>> +int msix_is_pending(PCIDevice *dev, unsigned int vector)
> >>>>  {
> >>>>      return *msix_pending_byte(dev, vector) & msix_pending_mask(vector);
> >>>>  }
> >>>> @@ -579,6 +579,17 @@ static void msix_unset_notifier_for_vector(PCIDevice *dev, unsigned int vector)
> >>>>      dev->msix_vector_release_notifier(dev, vector);
> >>>>  }
> >>>>  
> >>>> +void msix_init_vector_notifiers(PCIDevice *dev,
> >>>> +                                MSIVectorUseNotifier use_notifier,
> >>>> +                                MSIVectorReleaseNotifier release_notifier,
> >>>> +                                MSIVectorPollNotifier poll_notifier)
> >>>> +{
> >>>> +    assert(use_notifier && release_notifier);
> >>>> +    dev->msix_vector_use_notifier = use_notifier;
> >>>> +    dev->msix_vector_release_notifier = release_notifier;
> >>>> +    dev->msix_vector_poll_notifier = poll_notifier;
> >>>> +}
> >>>> +
> >>>>  int msix_set_vector_notifiers(PCIDevice *dev,
> >>>>                                MSIVectorUseNotifier use_notifier,
> >>>>                                MSIVectorReleaseNotifier release_notifier,
> >>>> @@ -586,11 +597,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
> >>>>  {
> >>>>      int vector, ret;
> >>>>  
> >>>> -    assert(use_notifier && release_notifier);
> >>>> -
> >>>> -    dev->msix_vector_use_notifier = use_notifier;
> >>>> -    dev->msix_vector_release_notifier = release_notifier;
> >>>> -    dev->msix_vector_poll_notifier = poll_notifier;
> >>>> +    msix_init_vector_notifiers(dev, use_notifier, release_notifier,
> >>>> +                               poll_notifier);
> >>>>  
> >>>>      if ((dev->config[dev->msix_cap + MSIX_CONTROL_OFFSET] &
> >>>>          (MSIX_ENABLE_MASK | MSIX_MASKALL_MASK)) == MSIX_ENABLE_MASK) {
> >>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >>>> index e5993c1..0fd21e1 100644
> >>>> --- a/hw/pci/pci.c
> >>>> +++ b/hw/pci/pci.c
> >>>> @@ -225,7 +225,6 @@ static const TypeInfo pcie_bus_info = {
> >>>>  };
> >>>>  
> >>>>  static PCIBus *pci_find_bus_nr(PCIBus *bus, int bus_num);
> >>>> -static void pci_update_mappings(PCIDevice *d);
> >>>>  static void pci_irq_handler(void *opaque, int irq_num, int level);
> >>>>  static void pci_add_option_rom(PCIDevice *pdev, bool is_default_rom, Error **);
> >>>>  static void pci_del_option_rom(PCIDevice *pdev);
> >>>> @@ -1366,7 +1365,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
> >>>>      return new_addr;
> >>>>  }
> >>>>  
> >>>> -static void pci_update_mappings(PCIDevice *d)
> >>>> +void pci_update_mappings(PCIDevice *d)
> >>>>  {
> >>>>      PCIIORegion *r;
> >>>>      int i;
> >>>> diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
> >>>> index 4c4a60c..46606cf 100644
> >>>> --- a/include/hw/pci/msix.h
> >>>> +++ b/include/hw/pci/msix.h
> >>>> @@ -32,6 +32,7 @@ int msix_present(PCIDevice *dev);
> >>>>  bool msix_is_masked(PCIDevice *dev, unsigned vector);
> >>>>  void msix_set_pending(PCIDevice *dev, unsigned vector);
> >>>>  void msix_clr_pending(PCIDevice *dev, int vector);
> >>>> +int msix_is_pending(PCIDevice *dev, unsigned vector);
> >>>>  
> >>>>  int msix_vector_use(PCIDevice *dev, unsigned vector);
> >>>>  void msix_vector_unuse(PCIDevice *dev, unsigned vector);
> >>>> @@ -41,6 +42,10 @@ void msix_notify(PCIDevice *dev, unsigned vector);
> >>>>  
> >>>>  void msix_reset(PCIDevice *dev);
> >>>>  
> >>>> +void msix_init_vector_notifiers(PCIDevice *dev,
> >>>> +                                MSIVectorUseNotifier use_notifier,
> >>>> +                                MSIVectorReleaseNotifier release_notifier,
> >>>> +                                MSIVectorPollNotifier poll_notifier);
> >>>>  int msix_set_vector_notifiers(PCIDevice *dev,
> >>>>                                MSIVectorUseNotifier use_notifier,
> >>>>                                MSIVectorReleaseNotifier release_notifier,
> >>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> >>>> index e7cdf2d..cc63dd4 100644
> >>>> --- a/include/hw/pci/pci.h
> >>>> +++ b/include/hw/pci/pci.h
> >>>> @@ -910,5 +910,6 @@ extern const VMStateDescription vmstate_pci_device;
> >>>>  
> >>>>  MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
> >>>>  void pci_set_power(PCIDevice *pci_dev, bool state);
> >>>> +void pci_update_mappings(PCIDevice *d);
> >>>>  
> >>>>  #endif
> >>>> -- 
> >>>> 1.8.3.1
> >>>
> > 



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 00/29] Live Update
  2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
                   ` (28 preceding siblings ...)
  2021-12-22 19:05 ` [PATCH V7 29/29] cpr: only-cpr-capable option Steve Sistare
@ 2022-01-07 18:45 ` Steven Sistare
  2022-02-18 13:36   ` Steven Sistare
  29 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-01-07 18:45 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Markus Armbruster, Eric Blake,
	qemu-devel, Zheng Chuan, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Hi Dave,
  It has been a long time since we chatted about this series.  The vfio
patches have been updated with feedback from Alex and are close to being 
final (I think).  Could you take another look at the patches that you care 
about?  To refresh your memory, you last reviewed V3 of the series, and I 
made significant changes to address your comments.  The cover letter lists 
the changes in V4, V5, V6, and V7.

Best wishes for the new year,
- Steve

On 12/22/2021 2:05 PM, Steve Sistare wrote:
> Provide the cpr-save, cpr-exec, and cpr-load commands for live update.
> These save and restore VM state, with minimal guest pause time, so that
> qemu may be updated to a new version in between.
> 
> cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
> any type of guest image and block device, but the caller must not modify
> guest block devices between cpr-save and cpr-load.  It supports two modes:
> reboot and restart.
> 
> In reboot mode, the caller invokes cpr-save and then terminates qemu.
> The caller may then update the host kernel and system software and reboot.
> The caller resumes the guest by running qemu with the same arguments as the
> original process and invoking cpr-load.  To use this mode, guest ram must be
> mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
> PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.
> 
> The reboot mode supports vfio devices if the caller first suspends the
> guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
> guest drivers' suspend methods flush outstanding requests and re-initialize
> the devices, and thus there is no device state to save and restore.
> 
> Restart mode preserves the guest VM across a restart of the qemu process.
> After cpr-save, the caller passes qemu command-line arguments to cpr-exec,
> which directly exec's the new qemu binary.  The arguments must include -S
> so new qemu starts in a paused state and waits for the cpr-load command.
> The restart mode supports vfio devices by preserving the vfio container,
> group, device, and event descriptors across the qemu re-exec, and by
> updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
> VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
> and integrated in Linux kernel 5.12.
> 
> To use the restart mode, qemu must be started with the memfd-alloc option,
> which allocates guest ram using memfd_create.  The memfd's are saved to
> the environment and kept open across exec, after which they are found from
> the environment and re-mmap'd.  Hence guest ram is preserved in place,
> albeit with new virtual addresses in the qemu process.
> 
> The caller resumes the guest by invoking cpr-load, which loads state from
> the file. If the VM was running at cpr-save time, then VM execution resumes.
> If the VM was suspended at cpr-save time (reboot mode), then the caller must
> issue a system_wakeup command to resume.
> 
> The first patches add reboot mode:
>   - memory: qemu_check_ram_volatile
>   - migration: fix populate_vfio_info
>   - migration: qemu file wrappers
>   - migration: simplify savevm
>   - vl: start on wakeup request
>   - cpr: reboot mode
>   - cpr: reboot HMP interfaces
> 
> The next patches add restart mode:
>   - memory: flat section iterator
>   - oslib: qemu_clear_cloexec
>   - machine: memfd-alloc option
>   - qapi: list utility functions
>   - vl: helper to request re-exec
>   - cpr: preserve extra state
>   - cpr: restart mode
>   - cpr: restart HMP interfaces
>   - hostmem-memfd: cpr for memory-backend-memfd
> 
> The next patches add vfio support for restart mode:
>   - pci: export functions for cpr
>   - vfio-pci: refactor for cpr
>   - vfio-pci: cpr part 1 (fd and dma)
>   - vfio-pci: cpr part 2 (msi)
>   - vfio-pci: cpr part 3 (intx)
>   - vfio-pci: recover from unmap-all-vaddr failure
> 
> The next patches preserve various descriptor-based backend devices across
> cprexec:
>   - loader: suppress rom_reset during cpr
>   - vhost: reset vhost devices for cpr
>   - chardev: cpr framework
>   - chardev: cpr for simple devices
>   - chardev: cpr for pty
>   - chardev: cpr for sockets
>   - cpr: only-cpr-capable option
> 
> Here is an example of updating qemu from v4.2.0 to v4.2.1 using
> restart mode.  The software update is performed while the guest is
> running to minimize downtime.
> 
> window 1                                        | window 2
>                                                 |
> # qemu-system-x86_64 ...                        |
> QEMU 4.2.0 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: running                              |
>                                                 | # yum update qemu
> (qemu) cpr-save /tmp/qemu.sav restart           |
> (qemu) cpr-exec qemu-system-x86_64 -S ...       |
> QEMU 4.2.1 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: paused (prelaunch)                   |
> (qemu) cpr-load /tmp/qemu.sav                   |
> (qemu) info status                              |
> VM status: running                              |
> 
> 
> Here is an example of updating the host kernel using reboot mode.
> 
> window 1                                        | window 2
>                                                 |
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: running                              |
>                                                 | # yum update kernel-uek
> (qemu) cpr-save /tmp/qemu.sav reboot            |
> (qemu) quit                                     |
>                                                 |
> # systemctl kexec                               |
> kexec_core: Starting new kernel                 |
> ...                                             |
>                                                 |
> # qemu-system-x86_64 -S mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...            |
> (qemu) info status                              |
> VM status: paused (prelaunch)                   |
> (qemu) cpr-load /tmp/qemu.sav                   |
> (qemu) info status                              |
> VM status: running                              |
> 
> Changes from V1 to V2:
>   - revert vmstate infrastructure changes
>   - refactor cpr functions into new files
>   - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to
>     preserve memory.
>   - add framework to filter chardev's that support cpr
>   - save and restore vfio eventfd's
>   - modify cprinfo QMP interface
>   - incorporate misc review feedback
>   - remove unrelated and unneeded patches
>   - refactor all patches into a shorter and easier to review series
> 
> Changes from V2 to V3:
>   - rebase to qemu 6.0.0
>   - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
>   - change memfd-alloc to a machine option
>   - Use qio_channel_socket_new_fd instead of adding qio_channel_socket_new_fd
>   - close monitor socket during cpr
>   - fix a few unreported bugs
>   - support memory-backend-memfd
> 
> Changes from V3 to V4:
>   - split reboot mode into separate patches
>   - add cprexec command
>   - delete QEMU_START_FREEZE, argv_main, and /usr/bin/qemu-exec
>   - add more checks for vfio and cpr compatibility, and recover after errors
>   - save vfio pci config in vmstate
>   - rename {setenv,getenv}_event_fd to {save,load}_event_fd
>   - use qemu_strtol
>   - change 6.0 references to 6.1
>   - use strerror(), use EXIT_FAILURE, remove period from error messages
>   - distribute MAINTAINERS additions to each patch
> 
> Changes from V4 to V5:
>   - rebase to master
> 
> Changes from V5 to V6:
>   vfio:
>   - delete redundant bus_master_enable_region in vfio_pci_post_load
>   - delete unmap.size warning
>   - fix phys_config memory leak
>   - add INTX support
>   - add vfio_named_notifier_init() helper
>   Other:
>   - 6.1 -> 6.2
>   - rename file -> filename in qapi
>   - delete cprinfo.  qapi introspection serves the same purpose.
>   - rename cprsave, cprexec, cprload -> cpr-save, cpr-exec, cpr-load
>   - improve documentation in qapi/cpr.json
>   - rename qemu_ram_volatile -> qemu_ram_check_volatile, and use
>     qemu_ram_foreach_block
>   - rename handle -> opaque
>   - use ERRP_GUARD
>   - use g_autoptr and g_autofree, and glib allocation functions
>   - conform to error conventions for bool and int function return values
>     and function names.
>   - remove word "error" in error messages
>   - rename as_flat_walk and its callback, and add comments.
>   - rename qemu_clr_cloexec -> qemu_clear_cloexec
>   - rename close-on-cpr -> reopen-on-cpr
>   - add strList utility functions
>   - factor out start on wakeup request to a separate patch
>   - deleted unnecessary layer (cprsave etc) and squashed QMP patches
>   - conditionally compile for CONFIG_VFIO
> 
> Changes from V6 to V7:
>   vfio:
>   - convert all event fd's to named event fd's with the same lifecycle and
>     delete vfio_pci_pre_save
>   - use vfio listener callback for updating vaddr and
>     defer listener registration
>   - update vaddr in vfio_dma_map
>   - simplify iommu_type derivation
>   - refactor recovery from unmap-all-vaddr failure to a separate patch
>   - add vfio_pci_pre_load to handle non-emulated config bits
>   - do not call VFIO_GROUP_SET_CONTAINER if reused
>   - add comments for vfio cpr
>   Other:
>   - suppress rom_reset during cpr
>   - more robust management of cpr mode
>   - delete chardev fd's iff !reopen_on_cpr
> 
> Steve Sistare (26):
>   memory: qemu_check_ram_volatile
>   migration: fix populate_vfio_info
>   migration: qemu file wrappers
>   migration: simplify savevm
>   vl: start on wakeup request
>   cpr: reboot mode
>   memory: flat section iterator
>   oslib: qemu_clear_cloexec
>   machine: memfd-alloc option
>   qapi: list utility functions
>   vl: helper to request re-exec
>   cpr: preserve extra state
>   cpr: restart mode
>   cpr: restart HMP interfaces
>   hostmem-memfd: cpr for memory-backend-memfd
>   pci: export functions for cpr
>   vfio-pci: refactor for cpr
>   vfio-pci: cpr part 1 (fd and dma)
>   vfio-pci: cpr part 2 (msi)
>   vfio-pci: cpr part 3 (intx)
>   vfio-pci: recover from unmap-all-vaddr failure
>   loader: suppress rom_reset during cpr
>   chardev: cpr framework
>   chardev: cpr for simple devices
>   chardev: cpr for pty
>   cpr: only-cpr-capable option
> 
> Mark Kanda, Steve Sistare (3):
>   cpr: reboot HMP interfaces
>   vhost: reset vhost devices for cpr
>   chardev: cpr for sockets
> 
>  MAINTAINERS                   |  12 ++
>  backends/hostmem-memfd.c      |  21 +--
>  chardev/char-mux.c            |   1 +
>  chardev/char-null.c           |   1 +
>  chardev/char-pty.c            |  16 +-
>  chardev/char-serial.c         |   1 +
>  chardev/char-socket.c         |  39 +++++
>  chardev/char-stdio.c          |   8 +
>  chardev/char.c                |  45 +++++-
>  gdbstub.c                     |   1 +
>  hmp-commands.hx               |  50 ++++++
>  hw/core/loader.c              |   4 +-
>  hw/core/machine.c             |  19 +++
>  hw/pci/msix.c                 |  20 ++-
>  hw/pci/pci.c                  |  13 +-
>  hw/vfio/common.c              | 184 ++++++++++++++++++---
>  hw/vfio/cpr.c                 | 129 +++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 | 368 +++++++++++++++++++++++++++++++++++++-----
>  hw/vfio/trace-events          |   1 +
>  hw/virtio/vhost.c             |  11 ++
>  include/chardev/char.h        |   6 +
>  include/exec/memory.h         |  39 +++++
>  include/hw/boards.h           |   1 +
>  include/hw/pci/msix.h         |   5 +
>  include/hw/pci/pci.h          |   2 +
>  include/hw/vfio/vfio-common.h |  10 ++
>  include/hw/virtio/vhost.h     |   1 +
>  include/migration/cpr.h       |  31 ++++
>  include/monitor/hmp.h         |   3 +
>  include/qapi/util.h           |  28 ++++
>  include/qemu/osdep.h          |   1 +
>  include/sysemu/runstate.h     |   2 +
>  include/sysemu/sysemu.h       |   1 +
>  migration/cpr-state.c         | 228 ++++++++++++++++++++++++++
>  migration/cpr.c               | 167 +++++++++++++++++++
>  migration/meson.build         |   2 +
>  migration/migration.c         |   5 +
>  migration/qemu-file-channel.c |  36 +++++
>  migration/qemu-file-channel.h |   6 +
>  migration/savevm.c            |  21 +--
>  migration/target.c            |  24 ++-
>  migration/trace-events        |   5 +
>  monitor/hmp-cmds.c            |  68 ++++----
>  monitor/hmp.c                 |   3 +
>  monitor/qmp.c                 |   3 +
>  qapi/char.json                |   7 +-
>  qapi/cpr.json                 |  76 +++++++++
>  qapi/meson.build              |   1 +
>  qapi/qapi-schema.json         |   1 +
>  qapi/qapi-util.c              |  37 +++++
>  qemu-options.hx               |  40 ++++-
>  softmmu/globals.c             |   1 +
>  softmmu/memory.c              |  46 ++++++
>  softmmu/physmem.c             |  55 +++++--
>  softmmu/runstate.c            |  38 ++++-
>  softmmu/vl.c                  |  18 ++-
>  stubs/cpr-state.c             |  15 ++
>  stubs/cpr.c                   |   3 +
>  stubs/meson.build             |   2 +
>  trace-events                  |   1 +
>  util/oslib-posix.c            |   9 ++
>  util/oslib-win32.c            |   4 +
>  util/qemu-config.c            |   4 +
>  64 files changed, 1852 insertions(+), 149 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
>  create mode 100644 include/migration/cpr.h
>  create mode 100644 migration/cpr-state.c
>  create mode 100644 migration/cpr.c
>  create mode 100644 qapi/cpr.json
>  create mode 100644 stubs/cpr-state.c
>  create mode 100644 stubs/cpr.c
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2021-12-22 19:05 ` [PATCH V7 10/29] machine: memfd-alloc option Steve Sistare
@ 2022-02-18  8:05   ` Guoyi Tu
  2022-03-03 15:55     ` Steven Sistare
  2022-02-24 17:56   ` Dr. David Alan Gilbert
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 96+ messages in thread
From: Guoyi Tu @ 2022-02-18  8:05 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel, tugy
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On Wed, 2021-12-22 at 11:05 -0800, Steve Sistare wrote:
> Allocate anonymous memory using memfd_create if the memfd-alloc
> machine
> option is set.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/core/machine.c   | 19 +++++++++++++++++++
>  include/hw/boards.h |  1 +
>  qemu-options.hx     |  6 ++++++
>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++-----
> ----
>  softmmu/vl.c        |  1 +
>  trace-events        |  1 +
>  util/qemu-config.c  |  4 ++++
>  7 files changed, 70 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 53a99ab..7739d88 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj,
> bool value, Error **errp)
>      ms->mem_merge = value;
>  }
>  
> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    return ms->memfd_alloc;
> +}
> +
> +static void machine_set_memfd_alloc(Object *obj, bool value, Error
> **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    ms->memfd_alloc = value;
> +}
> +
>  static bool machine_get_usb(Object *obj, Error **errp)
>  {
>      MachineState *ms = MACHINE(obj);
> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc,
> void *data)
>      object_class_property_set_description(oc, "mem-merge",
>          "Enable/disable memory merge support");
>  
> +    object_class_property_add_bool(oc, "memfd-alloc",
> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
> +    object_class_property_set_description(oc, "memfd-alloc",
> +        "Enable/disable allocating anonymous memory using
> memfd_create");
> +
>      object_class_property_add_bool(oc, "usb",
>          machine_get_usb, machine_set_usb);
>      object_class_property_set_description(oc, "usb",
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 9c1c190..a57d7a0 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -327,6 +327,7 @@ struct MachineState {
>      char *dt_compatible;
>      bool dump_guest_core;
>      bool mem_merge;
> +    bool memfd_alloc;
>      bool usb;
>      bool usb_disabled;
>      char *firmware;
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 7d47510..33c8173 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>      "                vmport=on|off|auto controls emulation of vmport
> (default: auto)\n"
>      "                dump-guest-core=on|off include guest memory in
> a core dump (default=on)\n"
>      "                mem-merge=on|off controls memory merge support
> (default: on)\n"
> +    "                memfd-alloc=on|off controls allocating
> anonymous guest RAM using memfd_create (default: off)\n"
>      "                aes-key-wrap=on|off controls support for AES
> key wrapping (default=on)\n"
>      "                dea-key-wrap=on|off controls support for DEA
> key wrapping (default=on)\n"
>      "                suppress-vmdesc=on|off disables self-describing 
> migration (default=off)\n"
> @@ -76,6 +77,11 @@ SRST
>          supported by the host, de-duplicates identical memory pages
>          among VMs instances (enabled by default).
>  
> +    ``memfd-alloc=on|off``
> +        Enables or disables allocation of anonymous guest RAM using
> +        memfd_create.  Any associated memory-backend objects are
> created with
> +        share=on.  The memfd-alloc default is off.
> +
>      ``aes-key-wrap=on|off``
>          Enables or disables AES key wrapping support on s390-ccw
> hosts.
>          This feature controls whether AES wrapping keys will be
> created
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 3524c04..95e2b49 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -41,6 +41,7 @@
>  #include "qemu/config-file.h"
>  #include "qemu/error-report.h"
>  #include "qemu/qemu-print.h"
> +#include "qemu/memfd.h"
>  #include "exec/memory.h"
>  #include "exec/ioport.h"
>  #include "sysemu/dma.h"
> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock
> *new_block, Error **errp)
>      const bool shared = qemu_ram_is_shared(new_block);
>      RAMBlock *block;
>      RAMBlock *last_block = NULL;
> +    struct MemoryRegion *mr = new_block->mr;
>      ram_addr_t old_ram_size, new_ram_size;
>      Error *err = NULL;
> +    const char *name;
> +    void *addr = 0;
> +    size_t maxlen;
> +    MachineState *ms = MACHINE(qdev_get_machine());
>  
>      old_ram_size = last_ram_page();
>  
>      qemu_mutex_lock_ramlist();
> -    new_block->offset = find_ram_offset(new_block->max_length);
> +    maxlen = new_block->max_length;
> +    new_block->offset = find_ram_offset(maxlen);
>  
>      if (!new_block->host) {
>          if (xen_enabled()) {
> -            xen_ram_alloc(new_block->offset, new_block->max_length,
> -                          new_block->mr, &err);
> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr,
> &err);
>              if (err) {
>                  error_propagate(errp, err);
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
>          } else {
> -            new_block->host = qemu_anon_ram_alloc(new_block-
> >max_length,
> -                                                  &new_block->mr-
> >align,
> -                                                  shared,
> noreserve);
> -            if (!new_block->host) {
> +            name = memory_region_name(mr);
> +            if (ms->memfd_alloc) {
> +                Object *parent = &mr->parent_obj;
> +                int mfd = -1;          /* placeholder until next
> patch */
> +                mr->align = QEMU_VMALLOC_ALIGN;
> +                if (mfd < 0) {
> +                    mfd = qemu_memfd_create(name, maxlen + mr-
> >align,
> +                                            0, 0, 0, &err);
> +                    if (mfd < 0) {

the error message should be propagated

--
Guoyi Tu

> +                        return;
> +                    }
> +                }
> +                qemu_set_cloexec(mfd);
> +                /* The memory backend already set its desired flags.
> */
> +                if (!object_dynamic_cast(parent,
> TYPE_MEMORY_BACKEND)) {
> +                    new_block->flags |= RAM_SHARED;
> +                }
> +                addr = file_ram_alloc(new_block, maxlen, mfd,
> +                                      false, false, 0, errp);
> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
> +            } else {
> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
> +                                           shared, noreserve);
> +            }
> +
> +            if (!addr) {
>                  error_setg_errno(errp, errno,
>                                   "cannot set up guest memory '%s'",
> -                                 memory_region_name(new_block->mr));
> +                                 name);
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
> -            memory_try_enable_merging(new_block->host, new_block-
> >max_length);
> +            memory_try_enable_merging(addr, maxlen);
> +            new_block->host = addr;
>          }
>      }
>  
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 620a1f1..ab3648a 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState
> *ms, const char *path)
>          object_property_set_str(obj, "mem-path", path,
> &error_fatal);
>      }
>      object_property_set_int(obj, "size", ms->ram_size,
> &error_fatal);
> +    object_property_set_bool(obj, "share", ms->memfd_alloc,
> &error_fatal);
>      object_property_add_child(object_get_objects_root(), mc-
> >default_ram_id,
>                                obj);
>      /* Ensure backend's memory region name is equal to mc-
> >default_ram_id */
> diff --git a/trace-events b/trace-events
> index a637a61..770a9ac 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void
> *hva, size_t length, bool need_
>  # accel/tcg/cputlb.c
>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr,
> unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd)
> "%s size %zu ptr %p fd %d"
>  
>  # gdbstub.c
>  gdbstub_op_start(const char *device) "Starting gdbstub using device
> %s"
> diff --git a/util/qemu-config.c b/util/qemu-config.c
> index 436ab63..3606e5c 100644
> --- a/util/qemu-config.c
> +++ b/util/qemu-config.c
> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
>              .type = QEMU_OPT_BOOL,
>              .help = "enable/disable memory merge support",
>          },{
> +            .name = "memfd-alloc",
> +            .type = QEMU_OPT_BOOL,
> +            .help = "enable/disable memfd_create for anonymous
> memory",
> +        },{
>              .name = "usb",
>              .type = QEMU_OPT_BOOL,
>              .help = "Set on/off to enable/disable usb",
-- 
Guoyi Tu <tugy@chinatelecom.cn>



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 28/29] chardev: cpr for sockets
  2021-12-22 19:05 ` [PATCH V7 28/29] chardev: cpr for sockets Steve Sistare
@ 2022-02-18  9:03   ` Guoyi Tu
  2022-03-03 15:55     ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Guoyi Tu @ 2022-02-18  9:03 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel, tugy
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On Wed, 2021-12-22 at 11:05 -0800, Steve Sistare wrote:
> Save accepted socket fds before cpr-save, and look for them after
> cpr-load.
> in the environment after cpr-load.  Reject cpr-exec if a socket
> enables
> the TLS or websocket option.  Allow a monitor socket by closing it on
> exec.
> 
> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  chardev/char-socket.c | 35 +++++++++++++++++++++++++++++++++++
>  monitor/hmp.c         |  3 +++
>  monitor/qmp.c         |  3 +++
>  3 files changed, 41 insertions(+)
> 
> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> index d619088..c111e17 100644
> --- a/chardev/char-socket.c
> +++ b/chardev/char-socket.c
> @@ -26,6 +26,7 @@
>  #include "chardev/char.h"
>  #include "io/channel-socket.h"
>  #include "io/channel-websock.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/module.h"
>  #include "qemu/option.h"
> @@ -358,6 +359,10 @@ static void tcp_chr_free_connection(Chardev
> *chr)
>      SocketChardev *s = SOCKET_CHARDEV(chr);
>      int i;
>  
> +    if (!chr->reopen_on_cpr) {
> +        cpr_delete_fd(chr->label, 0);
> +    }
> +
>      if (s->read_msgfds_num) {
>          for (i = 0; i < s->read_msgfds_num; i++) {
>              close(s->read_msgfds[i]);
> @@ -920,6 +925,10 @@ static void tcp_chr_accept(QIONetListener
> *listener,
>                                 QIO_CHANNEL(cioc));
>      }
>      tcp_chr_new_client(chr, cioc);
> +
> +    if (s->sioc && !chr->reopen_on_cpr) {

Is it necessary check if the device has QEMU_CHAR_FEATURE_CPR feature
here? In my opinion, fd should not be saved if device don't support
cpr.

> +        cpr_save_fd(chr->label, 0, s->sioc->fd);
> +    }
>  }
>  
>  
> @@ -1175,6 +1184,26 @@ static gboolean
> socket_reconnect_timeout(gpointer opaque)
>      return false;
>  }
>  
> +static int load_char_socket_fd(Chardev *chr, Error **errp)
> +{
> +    SocketChardev *sockchar = SOCKET_CHARDEV(chr);
> +    QIOChannelSocket *sioc;
> +    const char *label = chr->label;
> +    int fd = cpr_find_fd(label, 0);
> +
> +    if (fd != -1) {
> +        sockchar = SOCKET_CHARDEV(chr);
> +        sioc = qio_channel_socket_new_fd(fd, errp);
> +        if (sioc) {
> +            tcp_chr_accept(sockchar->listener, sioc, chr);
> +            object_unref(OBJECT(sioc));
> +        } else {
> +            error_setg(errp, "could not restore socket for %s",
> label);
> +            return -1;
> +        }
> +    }
> +    return 0;
> +}
>  
>  static int qmp_chardev_open_socket_server(Chardev *chr,
>                                            bool is_telnet,
> @@ -1385,6 +1414,10 @@ static void qmp_chardev_open_socket(Chardev
> *chr,
>      }
>      s->registered_yank = true;
>  
> +    if (!s->tls_creds && !s->is_websock) {
> +        qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
> +    }
> +
>      /* be isn't opened until we get a connection */
>      *be_opened = false;
>  
> @@ -1400,6 +1433,8 @@ static void qmp_chardev_open_socket(Chardev
> *chr,
>              return;
>          }
>      }
> +
> +    load_char_socket_fd(chr, errp);
>  }
>  
>  static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend
> *backend,
> diff --git a/monitor/hmp.c b/monitor/hmp.c
> index b20737e..a425894 100644
> --- a/monitor/hmp.c
> +++ b/monitor/hmp.c
> @@ -1484,4 +1484,7 @@ void monitor_init_hmp(Chardev *chr, bool
> use_readline, Error **errp)
>      qemu_chr_fe_set_handlers(&mon->common.chr, monitor_can_read,
> monitor_read,
>                               monitor_event, NULL, &mon->common,
> NULL, true);
>      monitor_list_append(&mon->common);
> +
> +    /* monitor cannot yet be preserved across cpr */
> +    chr->reopen_on_cpr = true;
>  }
> diff --git a/monitor/qmp.c b/monitor/qmp.c
> index 092c527..0043459 100644
> --- a/monitor/qmp.c
> +++ b/monitor/qmp.c
> @@ -535,4 +535,7 @@ void monitor_init_qmp(Chardev *chr, bool pretty,
> Error **errp)
>                                   NULL, &mon->common, NULL, true);
>          monitor_list_append(&mon->common);
>      }
> +
> +    /* Monitor cannot yet be preserved across cpr */
> +    chr->reopen_on_cpr = true;
>  }



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 29/29] cpr: only-cpr-capable option
  2021-12-22 19:05 ` [PATCH V7 29/29] cpr: only-cpr-capable option Steve Sistare
@ 2022-02-18  9:43   ` Guoyi Tu
  2022-03-03 15:54     ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Guoyi Tu @ 2022-02-18  9:43 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel, tugy
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On Wed, 2021-12-22 at 11:05 -0800, Steve Sistare wrote:
> Add the only-cpr-capable option, which causes qemu to exit with an
> error
> if any devices that are not capable of cpr are added.  This
> guarantees that
> a cpr-exec operation will not fail with an unsupported device error.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  MAINTAINERS             |  1 +
>  chardev/char-socket.c   |  4 ++++
>  hw/vfio/common.c        |  6 ++++++
>  include/sysemu/sysemu.h |  1 +
>  migration/migration.c   |  5 +++++
>  qemu-options.hx         |  8 ++++++++
>  softmmu/globals.c       |  1 +
>  softmmu/physmem.c       |  5 +++++
>  softmmu/vl.c            | 14 +++++++++++++-
>  stubs/cpr.c             |  3 +++
>  stubs/meson.build       |  1 +
>  11 files changed, 48 insertions(+), 1 deletion(-)
>  create mode 100644 stubs/cpr.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index feed239..af5abc3 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2998,6 +2998,7 @@ F: migration/cpr.c
>  F: qapi/cpr.json
>  F: migration/cpr-state.c
>  F: stubs/cpr-state.c
> +F: stubs/cpr.c
>  
>  Record/replay
>  M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> index c111e17..a4513a7 100644
> --- a/chardev/char-socket.c
> +++ b/chardev/char-socket.c
> @@ -34,6 +34,7 @@
>  #include "qapi/clone-visitor.h"
>  #include "qapi/qapi-visit-sockets.h"
>  #include "qemu/yank.h"
> +#include "sysemu/sysemu.h"
>  
>  #include "chardev/char-io.h"
>  #include "chardev/char-socket.h"
> @@ -1416,6 +1417,9 @@ static void qmp_chardev_open_socket(Chardev
> *chr,
>  
>      if (!s->tls_creds && !s->is_websock) {
>          qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
> +    } else if (only_cpr_capable) {
> +        error_setg(errp, "error: socket %s is not cpr capable due to
> %s option",
> +                   chr->label, (s->tls_creds ? "TLS" :
> "websocket"));

Should the error be ignored if reopen-on-cpr is set.


>      }
>  
>      /* be isn't opened until we get a connection */
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index f2b4a81..605ffbb 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -38,6 +38,7 @@
>  #include "sysemu/kvm.h"
>  #include "sysemu/reset.h"
>  #include "sysemu/runstate.h"
> +#include "sysemu/sysemu.h"
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/migration.h"
> @@ -1923,12 +1924,17 @@ static void
> vfio_put_address_space(VFIOAddressSpace *space)
>  static int vfio_get_iommu_type(VFIOContainer *container,
>                                 Error **errp)
>  {
> +    ERRP_GUARD();
>      int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
>                            VFIO_SPAPR_TCE_v2_IOMMU,
> VFIO_SPAPR_TCE_IOMMU };
>      int i;
>  
>      for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
>          if (ioctl(container->fd, VFIO_CHECK_EXTENSION,
> iommu_types[i])) {
> +            if (only_cpr_capable && !vfio_is_cpr_capable(container,
> errp)) {
> +                error_prepend(errp, "only-cpr-capable is specified:
> ");
> +                return -EINVAL;
> +            }
>              return iommu_types[i];
>          }
>      }
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 8fae667..6241c20 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -9,6 +9,7 @@
>  /* vl.c */
>  
>  extern int only_migratable;
> +extern bool only_cpr_capable;
>  extern const char *qemu_name;
>  extern QemuUUID qemu_uuid;
>  extern bool qemu_uuid_set;
> diff --git a/migration/migration.c b/migration/migration.c
> index 3de11ae..f08db0d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1257,6 +1257,11 @@ static bool migrate_caps_check(bool *cap_list,
>          return false;
>      }
>  
> +    if (cap_list[MIGRATION_CAPABILITY_X_COLO] && only_cpr_capable) {
> +        error_setg(errp, "x-colo is not compatible with -only-cpr-
> capable");
> +        return false;
> +    }
> +
>      return true;
>  }
>  
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 1859b55..0cbf2e3 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -4434,6 +4434,14 @@ SRST
>      an unmigratable state.
>  ERST
>  
> +DEF("only-cpr-capable", 0, QEMU_OPTION_only_cpr_capable, \
> +    "-only-cpr-capable    allow only cpr capable devices\n",
> QEMU_ARCH_ALL)
> +SRST
> +``-only-cpr-capable``
> +    Only allow cpr capable devices, which guarantees that cpr-save
> and
> +    cpr-exec will not fail with an unsupported device error.
> +ERST
> +
>  DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
>      "-nodefaults     don't create default devices\n", QEMU_ARCH_ALL)
>  SRST
> diff --git a/softmmu/globals.c b/softmmu/globals.c
> index 7d0fc81..a18fd8d 100644
> --- a/softmmu/globals.c
> +++ b/softmmu/globals.c
> @@ -59,6 +59,7 @@ int boot_menu;
>  bool boot_strict;
>  uint8_t *boot_splash_filedata;
>  int only_migratable; /* turn it off unless user states otherwise */
> +bool only_cpr_capable;
>  int icount_align_option;
>  
>  /* The bytes in qemu_uuid are in the order specified by RFC4122,
> _not_ in the
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index e227195..e7869f8 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -47,6 +47,7 @@
>  #include "sysemu/dma.h"
>  #include "sysemu/hostmem.h"
>  #include "sysemu/hw_accel.h"
> +#include "sysemu/sysemu.h"
>  #include "sysemu/xen-mapcache.h"
>  #include "trace/trace-root.h"
>  
> @@ -2010,6 +2011,10 @@ static void ram_block_add(RAMBlock *new_block,
> Error **errp)
>                  addr = file_ram_alloc(new_block, maxlen, mfd,
>                                        false, false, 0, errp);
>                  trace_anon_memfd_alloc(name, maxlen, addr, mfd);
> +            } else if (only_cpr_capable) {
> +                error_setg(errp,
> +                    "only-cpr-capable requires -machine memfd-
> alloc=on");
> +                return;
>              } else {
>                  addr = qemu_anon_ram_alloc(maxlen, &mr->align,
>                                             shared, noreserve);
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 4319e1a..f14e29e 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -2743,11 +2743,20 @@ void qmp_x_exit_preconfig(Error **errp)
>      qemu_create_cli_devices();
>      qemu_machine_creation_done();
>  
> +    if (only_cpr_capable && !qemu_chr_is_cpr_capable(errp)) {
> +        ;    /* not reached due to error_fatal */
> +    }
> +
>      if (loadvm) {
>          load_snapshot(loadvm, NULL, false, NULL, &error_fatal);
>      }
>      if (replay_mode != REPLAY_MODE_NONE) {
> -        replay_vmstate_init();
> +        if (only_cpr_capable) {
> +            error_setg(errp, "replay is not compatible with -only-
> cpr-capable");
> +            /* not reached due to error_fatal */
> +        } else {
> +            replay_vmstate_init();
> +        }
>      }
>  
>      if (incoming) {
> @@ -3507,6 +3516,9 @@ void qemu_init(int argc, char **argv, char
> **envp)
>              case QEMU_OPTION_only_migratable:
>                  only_migratable = 1;
>                  break;
> +            case QEMU_OPTION_only_cpr_capable:
> +                only_cpr_capable = true;
> +                break;
>              case QEMU_OPTION_nodefaults:
>                  has_defaults = 0;
>                  break;
> diff --git a/stubs/cpr.c b/stubs/cpr.c
> new file mode 100644
> index 0000000..aaa189e
> --- /dev/null
> +++ b/stubs/cpr.c
> @@ -0,0 +1,3 @@
> +#include "qemu/osdep.h"
> +
> +bool only_cpr_capable;
> diff --git a/stubs/meson.build b/stubs/meson.build
> index 9565c7d..4c9c4ea 100644
> --- a/stubs/meson.build
> +++ b/stubs/meson.build
> @@ -4,6 +4,7 @@ stub_ss.add(files('blk-exp-close-all.c'))
>  stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
>  stub_ss.add(files('change-state-handler.c'))
>  stub_ss.add(files('cmos.c'))
> +stub_ss.add(files('cpr.c'))
>  stub_ss.add(files('cpr-state.c'))
>  stub_ss.add(files('cpu-get-clock.c'))
>  stub_ss.add(files('cpus-get-virtual-clock.c'))

The only-cpr-capable option is a good way to prevent qemu from starting
if some device don't support cpr. But if this option is not provided,
the user still can perform cpr-xxx operation even there are devices
don't support cpr, in this case, the exec() will fail and the original
process cannot recovery.

How about introducing a cpr blocker (as migration blocker does) to
prevent the user from performing cpr-xxx operaton to address the
problem

--
Guoyi Tu




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 00/29] Live Update
  2022-01-07 18:45 ` [PATCH V7 00/29] Live Update Steven Sistare
@ 2022-02-18 13:36   ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-02-18 13:36 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Markus Armbruster, Eric Blake,
	qemu-devel, Zheng Chuan, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Please? - Steve

On 1/7/2022 1:45 PM, Steven Sistare wrote:
> Hi Dave,
>   It has been a long time since we chatted about this series.  The vfio
> patches have been updated with feedback from Alex and are close to being 
> final (I think).  Could you take another look at the patches that you care 
> about?  To refresh your memory, you last reviewed V3 of the series, and I 
> made significant changes to address your comments.  The cover letter lists 
> the changes in V4, V5, V6, and V7.
> 
> Best wishes for the new year,
> - Steve
> 
> On 12/22/2021 2:05 PM, Steve Sistare wrote:
>> Provide the cpr-save, cpr-exec, and cpr-load commands for live update.
>> These save and restore VM state, with minimal guest pause time, so that
>> qemu may be updated to a new version in between.
>>
>> cpr-save stops the VM and saves vmstate to an ordinary file.  It supports
>> any type of guest image and block device, but the caller must not modify
>> guest block devices between cpr-save and cpr-load.  It supports two modes:
>> reboot and restart.
>>
>> In reboot mode, the caller invokes cpr-save and then terminates qemu.
>> The caller may then update the host kernel and system software and reboot.
>> The caller resumes the guest by running qemu with the same arguments as the
>> original process and invoking cpr-load.  To use this mode, guest ram must be
>> mapped to a persistent shared memory file such as /dev/dax0.0, or /dev/shm
>> PKRAM as proposed in https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.
>>
>> The reboot mode supports vfio devices if the caller first suspends the
>> guest, such as by issuing guest-suspend-ram to the qemu guest agent.  The
>> guest drivers' suspend methods flush outstanding requests and re-initialize
>> the devices, and thus there is no device state to save and restore.
>>
>> Restart mode preserves the guest VM across a restart of the qemu process.
>> After cpr-save, the caller passes qemu command-line arguments to cpr-exec,
>> which directly exec's the new qemu binary.  The arguments must include -S
>> so new qemu starts in a paused state and waits for the cpr-load command.
>> The restart mode supports vfio devices by preserving the vfio container,
>> group, device, and event descriptors across the qemu re-exec, and by
>> updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
>> VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
>> and integrated in Linux kernel 5.12.
>>
>> To use the restart mode, qemu must be started with the memfd-alloc option,
>> which allocates guest ram using memfd_create.  The memfd's are saved to
>> the environment and kept open across exec, after which they are found from
>> the environment and re-mmap'd.  Hence guest ram is preserved in place,
>> albeit with new virtual addresses in the qemu process.
>>
>> The caller resumes the guest by invoking cpr-load, which loads state from
>> the file. If the VM was running at cpr-save time, then VM execution resumes.
>> If the VM was suspended at cpr-save time (reboot mode), then the caller must
>> issue a system_wakeup command to resume.
>>
>> The first patches add reboot mode:
>>   - memory: qemu_check_ram_volatile
>>   - migration: fix populate_vfio_info
>>   - migration: qemu file wrappers
>>   - migration: simplify savevm
>>   - vl: start on wakeup request
>>   - cpr: reboot mode
>>   - cpr: reboot HMP interfaces
>>
>> The next patches add restart mode:
>>   - memory: flat section iterator
>>   - oslib: qemu_clear_cloexec
>>   - machine: memfd-alloc option
>>   - qapi: list utility functions
>>   - vl: helper to request re-exec
>>   - cpr: preserve extra state
>>   - cpr: restart mode
>>   - cpr: restart HMP interfaces
>>   - hostmem-memfd: cpr for memory-backend-memfd
>>
>> The next patches add vfio support for restart mode:
>>   - pci: export functions for cpr
>>   - vfio-pci: refactor for cpr
>>   - vfio-pci: cpr part 1 (fd and dma)
>>   - vfio-pci: cpr part 2 (msi)
>>   - vfio-pci: cpr part 3 (intx)
>>   - vfio-pci: recover from unmap-all-vaddr failure
>>
>> The next patches preserve various descriptor-based backend devices across
>> cprexec:
>>   - loader: suppress rom_reset during cpr
>>   - vhost: reset vhost devices for cpr
>>   - chardev: cpr framework
>>   - chardev: cpr for simple devices
>>   - chardev: cpr for pty
>>   - chardev: cpr for sockets
>>   - cpr: only-cpr-capable option
>>
>> Here is an example of updating qemu from v4.2.0 to v4.2.1 using
>> restart mode.  The software update is performed while the guest is
>> running to minimize downtime.
>>
>> window 1                                        | window 2
>>                                                 |
>> # qemu-system-x86_64 ...                        |
>> QEMU 4.2.0 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: running                              |
>>                                                 | # yum update qemu
>> (qemu) cpr-save /tmp/qemu.sav restart           |
>> (qemu) cpr-exec qemu-system-x86_64 -S ...       |
>> QEMU 4.2.1 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: paused (prelaunch)                   |
>> (qemu) cpr-load /tmp/qemu.sav                   |
>> (qemu) info status                              |
>> VM status: running                              |
>>
>>
>> Here is an example of updating the host kernel using reboot mode.
>>
>> window 1                                        | window 2
>>                                                 |
>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: running                              |
>>                                                 | # yum update kernel-uek
>> (qemu) cpr-save /tmp/qemu.sav reboot            |
>> (qemu) quit                                     |
>>                                                 |
>> # systemctl kexec                               |
>> kexec_core: Starting new kernel                 |
>> ...                                             |
>>                                                 |
>> # qemu-system-x86_64 -S mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...            |
>> (qemu) info status                              |
>> VM status: paused (prelaunch)                   |
>> (qemu) cpr-load /tmp/qemu.sav                   |
>> (qemu) info status                              |
>> VM status: running                              |
>>
>> Changes from V1 to V2:
>>   - revert vmstate infrastructure changes
>>   - refactor cpr functions into new files
>>   - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to
>>     preserve memory.
>>   - add framework to filter chardev's that support cpr
>>   - save and restore vfio eventfd's
>>   - modify cprinfo QMP interface
>>   - incorporate misc review feedback
>>   - remove unrelated and unneeded patches
>>   - refactor all patches into a shorter and easier to review series
>>
>> Changes from V2 to V3:
>>   - rebase to qemu 6.0.0
>>   - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
>>   - change memfd-alloc to a machine option
>>   - Use qio_channel_socket_new_fd instead of adding qio_channel_socket_new_fd
>>   - close monitor socket during cpr
>>   - fix a few unreported bugs
>>   - support memory-backend-memfd
>>
>> Changes from V3 to V4:
>>   - split reboot mode into separate patches
>>   - add cprexec command
>>   - delete QEMU_START_FREEZE, argv_main, and /usr/bin/qemu-exec
>>   - add more checks for vfio and cpr compatibility, and recover after errors
>>   - save vfio pci config in vmstate
>>   - rename {setenv,getenv}_event_fd to {save,load}_event_fd
>>   - use qemu_strtol
>>   - change 6.0 references to 6.1
>>   - use strerror(), use EXIT_FAILURE, remove period from error messages
>>   - distribute MAINTAINERS additions to each patch
>>
>> Changes from V4 to V5:
>>   - rebase to master
>>
>> Changes from V5 to V6:
>>   vfio:
>>   - delete redundant bus_master_enable_region in vfio_pci_post_load
>>   - delete unmap.size warning
>>   - fix phys_config memory leak
>>   - add INTX support
>>   - add vfio_named_notifier_init() helper
>>   Other:
>>   - 6.1 -> 6.2
>>   - rename file -> filename in qapi
>>   - delete cprinfo.  qapi introspection serves the same purpose.
>>   - rename cprsave, cprexec, cprload -> cpr-save, cpr-exec, cpr-load
>>   - improve documentation in qapi/cpr.json
>>   - rename qemu_ram_volatile -> qemu_ram_check_volatile, and use
>>     qemu_ram_foreach_block
>>   - rename handle -> opaque
>>   - use ERRP_GUARD
>>   - use g_autoptr and g_autofree, and glib allocation functions
>>   - conform to error conventions for bool and int function return values
>>     and function names.
>>   - remove word "error" in error messages
>>   - rename as_flat_walk and its callback, and add comments.
>>   - rename qemu_clr_cloexec -> qemu_clear_cloexec
>>   - rename close-on-cpr -> reopen-on-cpr
>>   - add strList utility functions
>>   - factor out start on wakeup request to a separate patch
>>   - deleted unnecessary layer (cprsave etc) and squashed QMP patches
>>   - conditionally compile for CONFIG_VFIO
>>
>> Changes from V6 to V7:
>>   vfio:
>>   - convert all event fd's to named event fd's with the same lifecycle and
>>     delete vfio_pci_pre_save
>>   - use vfio listener callback for updating vaddr and
>>     defer listener registration
>>   - update vaddr in vfio_dma_map
>>   - simplify iommu_type derivation
>>   - refactor recovery from unmap-all-vaddr failure to a separate patch
>>   - add vfio_pci_pre_load to handle non-emulated config bits
>>   - do not call VFIO_GROUP_SET_CONTAINER if reused
>>   - add comments for vfio cpr
>>   Other:
>>   - suppress rom_reset during cpr
>>   - more robust management of cpr mode
>>   - delete chardev fd's iff !reopen_on_cpr
>>
>> Steve Sistare (26):
>>   memory: qemu_check_ram_volatile
>>   migration: fix populate_vfio_info
>>   migration: qemu file wrappers
>>   migration: simplify savevm
>>   vl: start on wakeup request
>>   cpr: reboot mode
>>   memory: flat section iterator
>>   oslib: qemu_clear_cloexec
>>   machine: memfd-alloc option
>>   qapi: list utility functions
>>   vl: helper to request re-exec
>>   cpr: preserve extra state
>>   cpr: restart mode
>>   cpr: restart HMP interfaces
>>   hostmem-memfd: cpr for memory-backend-memfd
>>   pci: export functions for cpr
>>   vfio-pci: refactor for cpr
>>   vfio-pci: cpr part 1 (fd and dma)
>>   vfio-pci: cpr part 2 (msi)
>>   vfio-pci: cpr part 3 (intx)
>>   vfio-pci: recover from unmap-all-vaddr failure
>>   loader: suppress rom_reset during cpr
>>   chardev: cpr framework
>>   chardev: cpr for simple devices
>>   chardev: cpr for pty
>>   cpr: only-cpr-capable option
>>
>> Mark Kanda, Steve Sistare (3):
>>   cpr: reboot HMP interfaces
>>   vhost: reset vhost devices for cpr
>>   chardev: cpr for sockets
>>
>>  MAINTAINERS                   |  12 ++
>>  backends/hostmem-memfd.c      |  21 +--
>>  chardev/char-mux.c            |   1 +
>>  chardev/char-null.c           |   1 +
>>  chardev/char-pty.c            |  16 +-
>>  chardev/char-serial.c         |   1 +
>>  chardev/char-socket.c         |  39 +++++
>>  chardev/char-stdio.c          |   8 +
>>  chardev/char.c                |  45 +++++-
>>  gdbstub.c                     |   1 +
>>  hmp-commands.hx               |  50 ++++++
>>  hw/core/loader.c              |   4 +-
>>  hw/core/machine.c             |  19 +++
>>  hw/pci/msix.c                 |  20 ++-
>>  hw/pci/pci.c                  |  13 +-
>>  hw/vfio/common.c              | 184 ++++++++++++++++++---
>>  hw/vfio/cpr.c                 | 129 +++++++++++++++
>>  hw/vfio/meson.build           |   1 +
>>  hw/vfio/pci.c                 | 368 +++++++++++++++++++++++++++++++++++++-----
>>  hw/vfio/trace-events          |   1 +
>>  hw/virtio/vhost.c             |  11 ++
>>  include/chardev/char.h        |   6 +
>>  include/exec/memory.h         |  39 +++++
>>  include/hw/boards.h           |   1 +
>>  include/hw/pci/msix.h         |   5 +
>>  include/hw/pci/pci.h          |   2 +
>>  include/hw/vfio/vfio-common.h |  10 ++
>>  include/hw/virtio/vhost.h     |   1 +
>>  include/migration/cpr.h       |  31 ++++
>>  include/monitor/hmp.h         |   3 +
>>  include/qapi/util.h           |  28 ++++
>>  include/qemu/osdep.h          |   1 +
>>  include/sysemu/runstate.h     |   2 +
>>  include/sysemu/sysemu.h       |   1 +
>>  migration/cpr-state.c         | 228 ++++++++++++++++++++++++++
>>  migration/cpr.c               | 167 +++++++++++++++++++
>>  migration/meson.build         |   2 +
>>  migration/migration.c         |   5 +
>>  migration/qemu-file-channel.c |  36 +++++
>>  migration/qemu-file-channel.h |   6 +
>>  migration/savevm.c            |  21 +--
>>  migration/target.c            |  24 ++-
>>  migration/trace-events        |   5 +
>>  monitor/hmp-cmds.c            |  68 ++++----
>>  monitor/hmp.c                 |   3 +
>>  monitor/qmp.c                 |   3 +
>>  qapi/char.json                |   7 +-
>>  qapi/cpr.json                 |  76 +++++++++
>>  qapi/meson.build              |   1 +
>>  qapi/qapi-schema.json         |   1 +
>>  qapi/qapi-util.c              |  37 +++++
>>  qemu-options.hx               |  40 ++++-
>>  softmmu/globals.c             |   1 +
>>  softmmu/memory.c              |  46 ++++++
>>  softmmu/physmem.c             |  55 +++++--
>>  softmmu/runstate.c            |  38 ++++-
>>  softmmu/vl.c                  |  18 ++-
>>  stubs/cpr-state.c             |  15 ++
>>  stubs/cpr.c                   |   3 +
>>  stubs/meson.build             |   2 +
>>  trace-events                  |   1 +
>>  util/oslib-posix.c            |   9 ++
>>  util/oslib-win32.c            |   4 +
>>  util/qemu-config.c            |   4 +
>>  64 files changed, 1852 insertions(+), 149 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>  create mode 100644 include/migration/cpr.h
>>  create mode 100644 migration/cpr-state.c
>>  create mode 100644 migration/cpr.c
>>  create mode 100644 qapi/cpr.json
>>  create mode 100644 stubs/cpr-state.c
>>  create mode 100644 stubs/cpr.c
>>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2021-12-22 19:05 ` [PATCH V7 10/29] machine: memfd-alloc option Steve Sistare
  2022-02-18  8:05   ` Guoyi Tu
@ 2022-02-24 17:56   ` Dr. David Alan Gilbert
  2022-03-03 15:56     ` Steven Sistare
  2022-03-03 17:21   ` Michael S. Tsirkin
  2022-03-11  9:54   ` David Hildenbrand
  3 siblings, 1 reply; 96+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-24 17:56 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Markus Armbruster, Zheng Chuan, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Allocate anonymous memory using memfd_create if the memfd-alloc machine
> option is set.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

So other than the minor error nit that Guoyi spotted, I think this is
pretty good,  one other comment below:

> ---
>  hw/core/machine.c   | 19 +++++++++++++++++++
>  include/hw/boards.h |  1 +
>  qemu-options.hx     |  6 ++++++
>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>  softmmu/vl.c        |  1 +
>  trace-events        |  1 +
>  util/qemu-config.c  |  4 ++++
>  7 files changed, 70 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 53a99ab..7739d88 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>      ms->mem_merge = value;
>  }
>  
> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    return ms->memfd_alloc;
> +}
> +
> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    ms->memfd_alloc = value;
> +}
> +
>  static bool machine_get_usb(Object *obj, Error **errp)
>  {
>      MachineState *ms = MACHINE(obj);
> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>      object_class_property_set_description(oc, "mem-merge",
>          "Enable/disable memory merge support");
>  
> +    object_class_property_add_bool(oc, "memfd-alloc",
> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
> +    object_class_property_set_description(oc, "memfd-alloc",
> +        "Enable/disable allocating anonymous memory using memfd_create");
> +
>      object_class_property_add_bool(oc, "usb",
>          machine_get_usb, machine_set_usb);
>      object_class_property_set_description(oc, "usb",
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 9c1c190..a57d7a0 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -327,6 +327,7 @@ struct MachineState {
>      char *dt_compatible;
>      bool dump_guest_core;
>      bool mem_merge;
> +    bool memfd_alloc;
>      bool usb;
>      bool usb_disabled;
>      char *firmware;
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 7d47510..33c8173 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>      "                mem-merge=on|off controls memory merge support (default: on)\n"
> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
> @@ -76,6 +77,11 @@ SRST
>          supported by the host, de-duplicates identical memory pages
>          among VMs instances (enabled by default).
>  
> +    ``memfd-alloc=on|off``
> +        Enables or disables allocation of anonymous guest RAM using
> +        memfd_create.  Any associated memory-backend objects are created with
> +        share=on.  The memfd-alloc default is off.
> +
>      ``aes-key-wrap=on|off``
>          Enables or disables AES key wrapping support on s390-ccw hosts.
>          This feature controls whether AES wrapping keys will be created
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 3524c04..95e2b49 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -41,6 +41,7 @@
>  #include "qemu/config-file.h"
>  #include "qemu/error-report.h"
>  #include "qemu/qemu-print.h"
> +#include "qemu/memfd.h"
>  #include "exec/memory.h"
>  #include "exec/ioport.h"
>  #include "sysemu/dma.h"
> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>      const bool shared = qemu_ram_is_shared(new_block);
>      RAMBlock *block;
>      RAMBlock *last_block = NULL;
> +    struct MemoryRegion *mr = new_block->mr;
>      ram_addr_t old_ram_size, new_ram_size;
>      Error *err = NULL;
> +    const char *name;
> +    void *addr = 0;
> +    size_t maxlen;

You could move some of these down to the top of the block you're using
them.

> +    MachineState *ms = MACHINE(qdev_get_machine());
>  
>      old_ram_size = last_ram_page();
>  
>      qemu_mutex_lock_ramlist();
> -    new_block->offset = find_ram_offset(new_block->max_length);
> +    maxlen = new_block->max_length;
> +    new_block->offset = find_ram_offset(maxlen);
>  
>      if (!new_block->host) {
>          if (xen_enabled()) {
> -            xen_ram_alloc(new_block->offset, new_block->max_length,
> -                          new_block->mr, &err);
> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
>              if (err) {
>                  error_propagate(errp, err);
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
>          } else {
> -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
> -                                                  &new_block->mr->align,
> -                                                  shared, noreserve);
> -            if (!new_block->host) {
> +            name = memory_region_name(mr);
> +            if (ms->memfd_alloc) {
> +                Object *parent = &mr->parent_obj;
> +                int mfd = -1;          /* placeholder until next patch */
> +                mr->align = QEMU_VMALLOC_ALIGN;
> +                if (mfd < 0) {
> +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
> +                                            0, 0, 0, &err);
> +                    if (mfd < 0) {
> +                        return;
> +                    }
> +                }
> +                qemu_set_cloexec(mfd);
> +                /* The memory backend already set its desired flags. */
> +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
> +                    new_block->flags |= RAM_SHARED;
> +                }
> +                addr = file_ram_alloc(new_block, maxlen, mfd,
> +                                      false, false, 0, errp);
> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
> +            } else {
> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
> +                                           shared, noreserve);
> +            }
> +
> +            if (!addr) {
>                  error_setg_errno(errp, errno,
>                                   "cannot set up guest memory '%s'",
> -                                 memory_region_name(new_block->mr));
> +                                 name);
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
> -            memory_try_enable_merging(new_block->host, new_block->max_length);
> +            memory_try_enable_merging(addr, maxlen);
> +            new_block->host = addr;
>          }
>      }
>  
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 620a1f1..ab3648a 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
>          object_property_set_str(obj, "mem-path", path, &error_fatal);
>      }
>      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
> +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
>                                obj);
>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
> diff --git a/trace-events b/trace-events
> index a637a61..770a9ac 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
>  # accel/tcg/cputlb.c
>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
>  
>  # gdbstub.c
>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
> diff --git a/util/qemu-config.c b/util/qemu-config.c
> index 436ab63..3606e5c 100644
> --- a/util/qemu-config.c
> +++ b/util/qemu-config.c
> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
>              .type = QEMU_OPT_BOOL,
>              .help = "enable/disable memory merge support",
>          },{
> +            .name = "memfd-alloc",
> +            .type = QEMU_OPT_BOOL,
> +            .help = "enable/disable memfd_create for anonymous memory",
> +        },{
>              .name = "usb",
>              .type = QEMU_OPT_BOOL,
>              .help = "Set on/off to enable/disable usb",
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 03/29] migration: qemu file wrappers
  2021-12-22 19:05 ` [PATCH V7 03/29] migration: qemu file wrappers Steve Sistare
@ 2022-02-24 18:21   ` Dr. David Alan Gilbert
  2022-03-03 15:55     ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-24 18:21 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Philippe Mathieu-Daudé,
	Alex Bennée

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Add qemu_file_open and qemu_fd_open to create QEMUFile objects for unix
> files and file descriptors.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  migration/qemu-file-channel.c | 36 ++++++++++++++++++++++++++++++++++++
>  migration/qemu-file-channel.h |  6 ++++++
>  2 files changed, 42 insertions(+)
> 
> diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
> index bb5a575..afb16d7 100644
> --- a/migration/qemu-file-channel.c
> +++ b/migration/qemu-file-channel.c
> @@ -27,8 +27,10 @@
>  #include "qemu-file.h"
>  #include "io/channel-socket.h"
>  #include "io/channel-tls.h"
> +#include "io/channel-file.h"
>  #include "qemu/iov.h"
>  #include "qemu/yank.h"
> +#include "qapi/error.h"
>  #include "yank_functions.h"
>  
>  
> @@ -192,3 +194,37 @@ QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
>      object_ref(OBJECT(ioc));
>      return qemu_fopen_ops(ioc, &channel_output_ops, true);
>  }
> +
> +QEMUFile *qemu_file_open(const char *path, int flags, int mode,
> +                         const char *name, Error **errp)

Can you please make that qemu_fopen_file

> +{
> +    g_autoptr(QIOChannelFile) fioc = NULL;
> +    QIOChannel *ioc;
> +    QEMUFile *f;
> +
> +    if (flags & O_RDWR) {
> +        error_setg(errp, "qemu_file_open %s: O_RDWR not supported", path);
> +        return NULL;
> +    }
> +
> +    fioc = qio_channel_file_new_path(path, flags, mode, errp);
> +    if (!fioc) {
> +        return NULL;
> +    }
> +
> +    ioc = QIO_CHANNEL(fioc);
> +    qio_channel_set_name(ioc, name);
> +    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
> +                             qemu_fopen_channel_input(ioc);
> +    return f;
> +}
> +
> +QEMUFile *qemu_fd_open(int fd, bool writable, const char *name)
> +{

Can you please make that qemu_fopen_fd

> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);

Can you use qio_channel_new_fd for that? Then it creates either
a socket or file subclass depending what type of fd is passed
(and gives you a QIOChannel without needing to cast).

> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
> +    QEMUFile *f = writable ? qemu_fopen_channel_output(ioc) :
> +                             qemu_fopen_channel_input(ioc);
> +    qio_channel_set_name(ioc, name);
> +    return f;
> +}
> diff --git a/migration/qemu-file-channel.h b/migration/qemu-file-channel.h
> index 0028a09..324ae2d 100644
> --- a/migration/qemu-file-channel.h
> +++ b/migration/qemu-file-channel.h
> @@ -29,4 +29,10 @@
>  
>  QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc);
>  QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc);
> +
> +QEMUFile *qemu_file_open(const char *path, int flags, int mode,
> +                         const char *name, Error **errp);
> +
> +QEMUFile *qemu_fd_open(int fd, bool writable, const char *name);
> +
>  #endif
> -- 
> 1.8.3.1
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 04/29] migration: simplify savevm
  2021-12-22 19:05 ` [PATCH V7 04/29] migration: simplify savevm Steve Sistare
@ 2022-02-24 18:25   ` Dr. David Alan Gilbert
  2022-03-03 15:55     ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-24 18:25 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Alex Bennée

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Use qemu_file_open to simplify a few functions in savevm.c.
> No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

So I think this is mostly OK, but a couple of minor tidyups below;
so with the tidies and the renames from the previous patch:

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/savevm.c | 21 +++++++--------------
>  1 file changed, 7 insertions(+), 14 deletions(-)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 0bef031..c71d525 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2910,8 +2910,9 @@ bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
>  void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
>                                  Error **errp)
>  {
> +    const char *ioc_name = "migration-xen-save-state";
> +    int flags = O_WRONLY | O_CREAT | O_TRUNC;

I don't see why to take these (or the matching ones in load) as separate
variables; just keep them as is, and be parameters.

>      QEMUFile *f;
> -    QIOChannelFile *ioc;
>      int saved_vm_running;
>      int ret;
>  
> @@ -2925,14 +2926,10 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
>      vm_stop(RUN_STATE_SAVE_VM);
>      global_state_store_running();
>  
> -    ioc = qio_channel_file_new_path(filename, O_WRONLY | O_CREAT | O_TRUNC,
> -                                    0660, errp);
> -    if (!ioc) {
> +    f = qemu_file_open(filename, flags, 0660, ioc_name, errp);
> +    if (!f) {
>          goto the_end;
>      }
> -    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-save-state");
> -    f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
> -    object_unref(OBJECT(ioc));
>      ret = qemu_save_device_state(f);
>      if (ret < 0 || qemu_fclose(f) < 0) {
>          error_setg(errp, QERR_IO_ERROR);
> @@ -2960,8 +2957,8 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
>  
>  void qmp_xen_load_devices_state(const char *filename, Error **errp)
>  {
> +    const char *ioc_name = "migration-xen-load-state";
>      QEMUFile *f;
> -    QIOChannelFile *ioc;
>      int ret;
>  
>      /* Guest must be paused before loading the device state; the RAM state
> @@ -2973,14 +2970,10 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
>      }
>      vm_stop(RUN_STATE_RESTORE_VM);
>  
> -    ioc = qio_channel_file_new_path(filename, O_RDONLY | O_BINARY, 0, errp);
> -    if (!ioc) {
> +    f = qemu_file_open(filename, O_RDONLY | O_BINARY, 0, ioc_name, errp);
> +    if (!f) {
>          return;
>      }
> -    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-load-state");
> -    f = qemu_fopen_channel_input(QIO_CHANNEL(ioc));
> -    object_unref(OBJECT(ioc));
> -
>      ret = qemu_loadvm_state(f);
>      qemu_fclose(f);
>      if (ret < 0) {
> -- 
> 1.8.3.1
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 01/29] memory: qemu_check_ram_volatile
  2021-12-22 19:05 ` [PATCH V7 01/29] memory: qemu_check_ram_volatile Steve Sistare
@ 2022-02-24 18:28   ` Dr. David Alan Gilbert
  2022-03-03 15:55     ` Steven Sistare
  2022-03-04 12:47   ` Philippe Mathieu-Daudé
  1 sibling, 1 reply; 96+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-24 18:28 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Philippe Mathieu-Daudé,
	Alex Bennée

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Add a function that returns an error if any ram_list block represents
> volatile memory.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/exec/memory.h |  8 ++++++++
>  softmmu/memory.c      | 26 ++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 20f1b27..137f5f3 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -2981,6 +2981,14 @@ bool ram_block_discard_is_disabled(void);
>   */
>  bool ram_block_discard_is_required(void);
>  
> +/**
> + * qemu_ram_check_volatile: return 1 if any memory regions are writable and not
> + * backed by shared memory, else return 0.
> + *
> + * @errp: returned error message identifying the first volatile region found.
> + */
> +int qemu_check_ram_volatile(Error **errp);
> +
>  #endif
>  
>  #endif
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 7340e19..30b2f68 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -2837,6 +2837,32 @@ void memory_global_dirty_log_stop(unsigned int flags)
>      memory_global_dirty_log_do_stop(flags);
>  }
>  
> +static int check_volatile(RAMBlock *rb, void *opaque)
> +{
> +    MemoryRegion *mr = rb->mr;
> +
> +    if (mr &&
> +        memory_region_is_ram(mr) &&
> +        !memory_region_is_ram_device(mr) &&
> +        !memory_region_is_rom(mr) &&
> +        (rb->fd == -1 || !qemu_ram_is_shared(rb))) {
> +        *(const char **)opaque = memory_region_name(mr);
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +int qemu_check_ram_volatile(Error **errp)
> +{
> +    char *name;

Does that need to be const char *name for safety since you're casting
it to it below?

Other than that,


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> +
> +    if (qemu_ram_foreach_block(check_volatile, &name)) {
> +        error_setg(errp, "Memory region %s is volatile", name);
> +        return -1;
> +    }
> +    return 0;
> +}
> +
>  static void listener_add_address_space(MemoryListener *listener,
>                                         AddressSpace *as)
>  {
> -- 
> 1.8.3.1
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 02/29] migration: fix populate_vfio_info
  2021-12-22 19:05 ` [PATCH V7 02/29] migration: fix populate_vfio_info Steve Sistare
@ 2022-02-24 18:42   ` Peter Maydell
  2022-03-03 15:55     ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Peter Maydell @ 2022-02-24 18:42 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On Wed, 22 Dec 2021 at 19:45, Steve Sistare <steven.sistare@oracle.com> wrote:
>
> Include CONFIG_DEVICES so that populate_vfio_info is instantiated for
> CONFIG_VFIO.

The commit message says "include CONFIG_DEVICES"...

> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  migration/target.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/migration/target.c b/migration/target.c
> index 907ebf0..4390bf0 100644
> --- a/migration/target.c
> +++ b/migration/target.c
> @@ -8,18 +8,22 @@
>  #include "qemu/osdep.h"
>  #include "qapi/qapi-types-migration.h"
>  #include "migration.h"
> +#include CONFIG_DEVICES

...and the code change does do that, but...

>
>  #ifdef CONFIG_VFIO
> +
>  #include "hw/vfio/vfio-common.h"
> -#endif
>
>  void populate_vfio_info(MigrationInfo *info)
>  {
> -#ifdef CONFIG_VFIO
>      if (vfio_mig_active()) {
>          info->has_vfio = true;
>          info->vfio = g_malloc0(sizeof(*info->vfio));
>          info->vfio->transferred = vfio_mig_bytes_transferred();
>      }
> -#endif
>  }
> +#else
> +
> +void populate_vfio_info(MigrationInfo *info) {}
> +
> +#endif /* CONFIG_VFIO */

...it also seems to be making a no-change-of-behaviour rewrite
of the rest of the file. Is there a reason I'm missing for doing
that ?

thanks
-- PMM


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 05/29] vl: start on wakeup request
  2021-12-22 19:05 ` [PATCH V7 05/29] vl: start on wakeup request Steve Sistare
@ 2022-02-24 18:51   ` Dr. David Alan Gilbert
  2022-03-03 15:56     ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-24 18:51 UTC (permalink / raw)
  To: Steve Sistare, Markus Armbruster
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Zheng Chuan, Alex Williamson, Paolo Bonzini,
	Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrange,
	Alex Bennée

* Steve Sistare (steven.sistare@oracle.com) wrote:
> If qemu starts and loads a VM in the suspended state, then a later wakeup
> request will set the state to running, which is not sufficient to initialize
> the vm, as vm_start was never called during this invocation of qemu.  See
> qemu_system_wakeup_request().
> 
> Define the start_on_wakeup_requested() hook to cause vm_start() to be called
> when processing the wakeup request.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/sysemu/runstate.h |  1 +
>  softmmu/runstate.c        | 17 ++++++++++++++++-
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
> index a535691..b655c7b 100644
> --- a/include/sysemu/runstate.h
> +++ b/include/sysemu/runstate.h
> @@ -51,6 +51,7 @@ void qemu_system_reset_request(ShutdownCause reason);
>  void qemu_system_suspend_request(void);
>  void qemu_register_suspend_notifier(Notifier *notifier);
>  bool qemu_wakeup_suspend_enabled(void);
> +void qemu_system_start_on_wakeup_request(void);
>  void qemu_system_wakeup_request(WakeupReason reason, Error **errp);
>  void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
>  void qemu_register_wakeup_notifier(Notifier *notifier);
> diff --git a/softmmu/runstate.c b/softmmu/runstate.c
> index 10d9b73..3d344c9 100644
> --- a/softmmu/runstate.c
> +++ b/softmmu/runstate.c
> @@ -115,6 +115,8 @@ static const RunStateTransition runstate_transitions_def[] = {
>      { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING },
>      { RUN_STATE_PRELAUNCH, RUN_STATE_FINISH_MIGRATE },
>      { RUN_STATE_PRELAUNCH, RUN_STATE_INMIGRATE },
> +    { RUN_STATE_PRELAUNCH, RUN_STATE_SUSPENDED },
> +    { RUN_STATE_PRELAUNCH, RUN_STATE_PAUSED },

This seems separate? Is this the bit that allows you to load the VM into
suspended?
But I note you're allowing PAUSED or SUSPENDED here, but the wake up
code only handles suspended - is that expected?

>      { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
>      { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PAUSED },
> @@ -335,6 +337,7 @@ void vm_state_notify(bool running, RunState state)
>      }
>  }
>  
> +static bool start_on_wakeup_requested;
>  static ShutdownCause reset_requested;
>  static ShutdownCause shutdown_requested;
>  static int shutdown_signal;
> @@ -562,6 +565,11 @@ void qemu_register_suspend_notifier(Notifier *notifier)
>      notifier_list_add(&suspend_notifiers, notifier);
>  }
>  
> +void qemu_system_start_on_wakeup_request(void)
> +{
> +    start_on_wakeup_requested = true;
> +}

Markus: Is this OK, or should this actually be another runstate
(PRELAUNCH_SUSPENDED??? or the like??) - is there an interaction here
with the commandline change ideas for a build-the-guest at runtime?

Dave

>  void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
>  {
>      trace_system_wakeup_request(reason);
> @@ -574,7 +582,14 @@ void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
>      if (!(wakeup_reason_mask & (1 << reason))) {
>          return;
>      }
> -    runstate_set(RUN_STATE_RUNNING);
> +
> +    if (start_on_wakeup_requested) {
> +        start_on_wakeup_requested = false;
> +        vm_start();
> +    } else {
> +        runstate_set(RUN_STATE_RUNNING);
> +    }
> +
>      wakeup_reason = reason;
>      qemu_notify_event();
>  }
> -- 
> 1.8.3.1
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 29/29] cpr: only-cpr-capable option
  2022-02-18  9:43   ` Guoyi Tu
@ 2022-03-03 15:54     ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-03 15:54 UTC (permalink / raw)
  To: Guoyi Tu, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 2/18/2022 4:43 AM, Guoyi Tu wrote:
> On Wed, 2021-12-22 at 11:05 -0800, Steve Sistare wrote:
>> Add the only-cpr-capable option, which causes qemu to exit with an
>> error
>> if any devices that are not capable of cpr are added.  This
>> guarantees that
>> a cpr-exec operation will not fail with an unsupported device error.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  MAINTAINERS             |  1 +
>>  chardev/char-socket.c   |  4 ++++
>>  hw/vfio/common.c        |  6 ++++++
>>  include/sysemu/sysemu.h |  1 +
>>  migration/migration.c   |  5 +++++
>>  qemu-options.hx         |  8 ++++++++
>>  softmmu/globals.c       |  1 +
>>  softmmu/physmem.c       |  5 +++++
>>  softmmu/vl.c            | 14 +++++++++++++-
>>  stubs/cpr.c             |  3 +++
>>  stubs/meson.build       |  1 +
>>  11 files changed, 48 insertions(+), 1 deletion(-)
>>  create mode 100644 stubs/cpr.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index feed239..af5abc3 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2998,6 +2998,7 @@ F: migration/cpr.c
>>  F: qapi/cpr.json
>>  F: migration/cpr-state.c
>>  F: stubs/cpr-state.c
>> +F: stubs/cpr.c
>>  
>>  Record/replay
>>  M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
>> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
>> index c111e17..a4513a7 100644
>> --- a/chardev/char-socket.c
>> +++ b/chardev/char-socket.c
>> @@ -34,6 +34,7 @@
>>  #include "qapi/clone-visitor.h"
>>  #include "qapi/qapi-visit-sockets.h"
>>  #include "qemu/yank.h"
>> +#include "sysemu/sysemu.h"
>>  
>>  #include "chardev/char-io.h"
>>  #include "chardev/char-socket.h"
>> @@ -1416,6 +1417,9 @@ static void qmp_chardev_open_socket(Chardev
>> *chr,
>>  
>>      if (!s->tls_creds && !s->is_websock) {
>>          qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
>> +    } else if (only_cpr_capable) {
>> +        error_setg(errp, "error: socket %s is not cpr capable due to
>> %s option",
>> +                   chr->label, (s->tls_creds ? "TLS" :
>> "websocket"));
> 
> Should the error be ignored if reopen-on-cpr is set.

Yes!  Good catch, thanks.

>>      }
>>  
>>      /* be isn't opened until we get a connection */
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index f2b4a81..605ffbb 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -38,6 +38,7 @@
>>  #include "sysemu/kvm.h"
>>  #include "sysemu/reset.h"
>>  #include "sysemu/runstate.h"
>> +#include "sysemu/sysemu.h"
>>  #include "trace.h"
>>  #include "qapi/error.h"
>>  #include "migration/migration.h"
>> @@ -1923,12 +1924,17 @@ static void
>> vfio_put_address_space(VFIOAddressSpace *space)
>>  static int vfio_get_iommu_type(VFIOContainer *container,
>>                                 Error **errp)
>>  {
>> +    ERRP_GUARD();
>>      int iommu_types[] = { VFIO_TYPE1v2_IOMMU, VFIO_TYPE1_IOMMU,
>>                            VFIO_SPAPR_TCE_v2_IOMMU,
>> VFIO_SPAPR_TCE_IOMMU };
>>      int i;
>>  
>>      for (i = 0; i < ARRAY_SIZE(iommu_types); i++) {
>>          if (ioctl(container->fd, VFIO_CHECK_EXTENSION,
>> iommu_types[i])) {
>> +            if (only_cpr_capable && !vfio_is_cpr_capable(container,
>> errp)) {
>> +                error_prepend(errp, "only-cpr-capable is specified:
>> ");
>> +                return -EINVAL;
>> +            }
>>              return iommu_types[i];
>>          }
>>      }
>> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
>> index 8fae667..6241c20 100644
>> --- a/include/sysemu/sysemu.h
>> +++ b/include/sysemu/sysemu.h
>> @@ -9,6 +9,7 @@
>>  /* vl.c */
>>  
>>  extern int only_migratable;
>> +extern bool only_cpr_capable;
>>  extern const char *qemu_name;
>>  extern QemuUUID qemu_uuid;
>>  extern bool qemu_uuid_set;
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 3de11ae..f08db0d 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1257,6 +1257,11 @@ static bool migrate_caps_check(bool *cap_list,
>>          return false;
>>      }
>>  
>> +    if (cap_list[MIGRATION_CAPABILITY_X_COLO] && only_cpr_capable) {
>> +        error_setg(errp, "x-colo is not compatible with -only-cpr-
>> capable");
>> +        return false;
>> +    }
>> +
>>      return true;
>>  }
>>  
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 1859b55..0cbf2e3 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -4434,6 +4434,14 @@ SRST
>>      an unmigratable state.
>>  ERST
>>  
>> +DEF("only-cpr-capable", 0, QEMU_OPTION_only_cpr_capable, \
>> +    "-only-cpr-capable    allow only cpr capable devices\n",
>> QEMU_ARCH_ALL)
>> +SRST
>> +``-only-cpr-capable``
>> +    Only allow cpr capable devices, which guarantees that cpr-save
>> and
>> +    cpr-exec will not fail with an unsupported device error.
>> +ERST
>> +
>>  DEF("nodefaults", 0, QEMU_OPTION_nodefaults, \
>>      "-nodefaults     don't create default devices\n", QEMU_ARCH_ALL)
>>  SRST
>> diff --git a/softmmu/globals.c b/softmmu/globals.c
>> index 7d0fc81..a18fd8d 100644
>> --- a/softmmu/globals.c
>> +++ b/softmmu/globals.c
>> @@ -59,6 +59,7 @@ int boot_menu;
>>  bool boot_strict;
>>  uint8_t *boot_splash_filedata;
>>  int only_migratable; /* turn it off unless user states otherwise */
>> +bool only_cpr_capable;
>>  int icount_align_option;
>>  
>>  /* The bytes in qemu_uuid are in the order specified by RFC4122,
>> _not_ in the
>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>> index e227195..e7869f8 100644
>> --- a/softmmu/physmem.c
>> +++ b/softmmu/physmem.c
>> @@ -47,6 +47,7 @@
>>  #include "sysemu/dma.h"
>>  #include "sysemu/hostmem.h"
>>  #include "sysemu/hw_accel.h"
>> +#include "sysemu/sysemu.h"
>>  #include "sysemu/xen-mapcache.h"
>>  #include "trace/trace-root.h"
>>  
>> @@ -2010,6 +2011,10 @@ static void ram_block_add(RAMBlock *new_block,
>> Error **errp)
>>                  addr = file_ram_alloc(new_block, maxlen, mfd,
>>                                        false, false, 0, errp);
>>                  trace_anon_memfd_alloc(name, maxlen, addr, mfd);
>> +            } else if (only_cpr_capable) {
>> +                error_setg(errp,
>> +                    "only-cpr-capable requires -machine memfd-
>> alloc=on");
>> +                return;
>>              } else {
>>                  addr = qemu_anon_ram_alloc(maxlen, &mr->align,
>>                                             shared, noreserve);
>> diff --git a/softmmu/vl.c b/softmmu/vl.c
>> index 4319e1a..f14e29e 100644
>> --- a/softmmu/vl.c
>> +++ b/softmmu/vl.c
>> @@ -2743,11 +2743,20 @@ void qmp_x_exit_preconfig(Error **errp)
>>      qemu_create_cli_devices();
>>      qemu_machine_creation_done();
>>  
>> +    if (only_cpr_capable && !qemu_chr_is_cpr_capable(errp)) {
>> +        ;    /* not reached due to error_fatal */
>> +    }
>> +
>>      if (loadvm) {
>>          load_snapshot(loadvm, NULL, false, NULL, &error_fatal);
>>      }
>>      if (replay_mode != REPLAY_MODE_NONE) {
>> -        replay_vmstate_init();
>> +        if (only_cpr_capable) {
>> +            error_setg(errp, "replay is not compatible with -only-
>> cpr-capable");
>> +            /* not reached due to error_fatal */
>> +        } else {
>> +            replay_vmstate_init();
>> +        }
>>      }
>>  
>>      if (incoming) {
>> @@ -3507,6 +3516,9 @@ void qemu_init(int argc, char **argv, char
>> **envp)
>>              case QEMU_OPTION_only_migratable:
>>                  only_migratable = 1;
>>                  break;
>> +            case QEMU_OPTION_only_cpr_capable:
>> +                only_cpr_capable = true;
>> +                break;
>>              case QEMU_OPTION_nodefaults:
>>                  has_defaults = 0;
>>                  break;
>> diff --git a/stubs/cpr.c b/stubs/cpr.c
>> new file mode 100644
>> index 0000000..aaa189e
>> --- /dev/null
>> +++ b/stubs/cpr.c
>> @@ -0,0 +1,3 @@
>> +#include "qemu/osdep.h"
>> +
>> +bool only_cpr_capable;
>> diff --git a/stubs/meson.build b/stubs/meson.build
>> index 9565c7d..4c9c4ea 100644
>> --- a/stubs/meson.build
>> +++ b/stubs/meson.build
>> @@ -4,6 +4,7 @@ stub_ss.add(files('blk-exp-close-all.c'))
>>  stub_ss.add(files('blockdev-close-all-bdrv-states.c'))
>>  stub_ss.add(files('change-state-handler.c'))
>>  stub_ss.add(files('cmos.c'))
>> +stub_ss.add(files('cpr.c'))
>>  stub_ss.add(files('cpr-state.c'))
>>  stub_ss.add(files('cpu-get-clock.c'))
>>  stub_ss.add(files('cpus-get-virtual-clock.c'))
> 
> The only-cpr-capable option is a good way to prevent qemu from starting
> if some device don't support cpr. But if this option is not provided,
> the user still can perform cpr-xxx operation even there are devices
> don't support cpr, in this case, the exec() will fail and the original
> process cannot recovery.
> 
> How about introducing a cpr blocker (as migration blocker does) to
> prevent the user from performing cpr-xxx operaton to address the
> problem

Sure.  I will add a call to qemu_chr_is_cpr_capable() in cpr_save().

Thanks very much for your careful review of the chardev patches.

- Steve



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 28/29] chardev: cpr for sockets
  2022-02-18  9:03   ` Guoyi Tu
@ 2022-03-03 15:55     ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-03 15:55 UTC (permalink / raw)
  To: Guoyi Tu, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 2/18/2022 4:03 AM, Guoyi Tu wrote:
> On Wed, 2021-12-22 at 11:05 -0800, Steve Sistare wrote:
>> Save accepted socket fds before cpr-save, and look for them after
>> cpr-load.
>> in the environment after cpr-load.  Reject cpr-exec if a socket
>> enables
>> the TLS or websocket option.  Allow a monitor socket by closing it on
>> exec.
>>
>> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  chardev/char-socket.c | 35 +++++++++++++++++++++++++++++++++++
>>  monitor/hmp.c         |  3 +++
>>  monitor/qmp.c         |  3 +++
>>  3 files changed, 41 insertions(+)
>>
>> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
>> index d619088..c111e17 100644
>> --- a/chardev/char-socket.c
>> +++ b/chardev/char-socket.c
>> @@ -26,6 +26,7 @@
>>  #include "chardev/char.h"
>>  #include "io/channel-socket.h"
>>  #include "io/channel-websock.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/module.h"
>>  #include "qemu/option.h"
>> @@ -358,6 +359,10 @@ static void tcp_chr_free_connection(Chardev
>> *chr)
>>      SocketChardev *s = SOCKET_CHARDEV(chr);
>>      int i;
>>  
>> +    if (!chr->reopen_on_cpr) {
>> +        cpr_delete_fd(chr->label, 0);
>> +    }
>> +
>>      if (s->read_msgfds_num) {
>>          for (i = 0; i < s->read_msgfds_num; i++) {
>>              close(s->read_msgfds[i]);
>> @@ -920,6 +925,10 @@ static void tcp_chr_accept(QIONetListener
>> *listener,
>>                                 QIO_CHANNEL(cioc));
>>      }
>>      tcp_chr_new_client(chr, cioc);
>> +
>> +    if (s->sioc && !chr->reopen_on_cpr) {
> 
> Is it necessary check if the device has QEMU_CHAR_FEATURE_CPR feature
> here? In my opinion, fd should not be saved if device don't support
> cpr.

OK.  I'll add a new boolean member to CharDev that controls whether or not
to use cpr fd's:

    qemu_char_open()
        chr->cpr_enabled = (!chr->reopen_on_cpr && 
                            qemu_chr_has_feature(chr, QEMU_CHAR_FEATURE_CPR));

    tcp_chr_accept()
        if (s->sioc && chr->cpr_enabled) {
            cpr_save_fd(chr->label, 0, s->sioc->fd);
        }

... and test it at other places as well.

- Steve

>> +        cpr_save_fd(chr->label, 0, s->sioc->fd);
>> +    }
>>  }
>>  
>>  
>> @@ -1175,6 +1184,26 @@ static gboolean
>> socket_reconnect_timeout(gpointer opaque)
>>      return false;
>>  }
>>  
>> +static int load_char_socket_fd(Chardev *chr, Error **errp)
>> +{
>> +    SocketChardev *sockchar = SOCKET_CHARDEV(chr);
>> +    QIOChannelSocket *sioc;
>> +    const char *label = chr->label;
>> +    int fd = cpr_find_fd(label, 0);
>> +
>> +    if (fd != -1) {
>> +        sockchar = SOCKET_CHARDEV(chr);
>> +        sioc = qio_channel_socket_new_fd(fd, errp);
>> +        if (sioc) {
>> +            tcp_chr_accept(sockchar->listener, sioc, chr);
>> +            object_unref(OBJECT(sioc));
>> +        } else {
>> +            error_setg(errp, "could not restore socket for %s",
>> label);
>> +            return -1;
>> +        }
>> +    }
>> +    return 0;
>> +}
>>  
>>  static int qmp_chardev_open_socket_server(Chardev *chr,
>>                                            bool is_telnet,
>> @@ -1385,6 +1414,10 @@ static void qmp_chardev_open_socket(Chardev
>> *chr,
>>      }
>>      s->registered_yank = true;
>>  
>> +    if (!s->tls_creds && !s->is_websock) {
>> +        qemu_chr_set_feature(chr, QEMU_CHAR_FEATURE_CPR);
>> +    }
>> +
>>      /* be isn't opened until we get a connection */
>>      *be_opened = false;
>>  
>> @@ -1400,6 +1433,8 @@ static void qmp_chardev_open_socket(Chardev
>> *chr,
>>              return;
>>          }
>>      }
>> +
>> +    load_char_socket_fd(chr, errp);
>>  }
>>  
>>  static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend
>> *backend,


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-02-18  8:05   ` Guoyi Tu
@ 2022-03-03 15:55     ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-03 15:55 UTC (permalink / raw)
  To: Guoyi Tu, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, Dr. David Alan Gilbert,
	Eric Blake, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 2/18/2022 3:05 AM, Guoyi Tu wrote:
> On Wed, 2021-12-22 at 11:05 -0800, Steve Sistare wrote:
>> Allocate anonymous memory using memfd_create if the memfd-alloc
>> machine
>> option is set.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  hw/core/machine.c   | 19 +++++++++++++++++++
>>  include/hw/boards.h |  1 +
>>  qemu-options.hx     |  6 ++++++
>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++-----
>> ----
>>  softmmu/vl.c        |  1 +
>>  trace-events        |  1 +
>>  util/qemu-config.c  |  4 ++++
>>  7 files changed, 70 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 53a99ab..7739d88 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj,
>> bool value, Error **errp)
>>      ms->mem_merge = value;
>>  }
>>  
>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    return ms->memfd_alloc;
>> +}
>> +
>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error
>> **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    ms->memfd_alloc = value;
>> +}
>> +
>>  static bool machine_get_usb(Object *obj, Error **errp)
>>  {
>>      MachineState *ms = MACHINE(obj);
>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc,
>> void *data)
>>      object_class_property_set_description(oc, "mem-merge",
>>          "Enable/disable memory merge support");
>>  
>> +    object_class_property_add_bool(oc, "memfd-alloc",
>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
>> +    object_class_property_set_description(oc, "memfd-alloc",
>> +        "Enable/disable allocating anonymous memory using
>> memfd_create");
>> +
>>      object_class_property_add_bool(oc, "usb",
>>          machine_get_usb, machine_set_usb);
>>      object_class_property_set_description(oc, "usb",
>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>> index 9c1c190..a57d7a0 100644
>> --- a/include/hw/boards.h
>> +++ b/include/hw/boards.h
>> @@ -327,6 +327,7 @@ struct MachineState {
>>      char *dt_compatible;
>>      bool dump_guest_core;
>>      bool mem_merge;
>> +    bool memfd_alloc;
>>      bool usb;
>>      bool usb_disabled;
>>      char *firmware;
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 7d47510..33c8173 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>      "                vmport=on|off|auto controls emulation of vmport
>> (default: auto)\n"
>>      "                dump-guest-core=on|off include guest memory in
>> a core dump (default=on)\n"
>>      "                mem-merge=on|off controls memory merge support
>> (default: on)\n"
>> +    "                memfd-alloc=on|off controls allocating
>> anonymous guest RAM using memfd_create (default: off)\n"
>>      "                aes-key-wrap=on|off controls support for AES
>> key wrapping (default=on)\n"
>>      "                dea-key-wrap=on|off controls support for DEA
>> key wrapping (default=on)\n"
>>      "                suppress-vmdesc=on|off disables self-describing 
>> migration (default=off)\n"
>> @@ -76,6 +77,11 @@ SRST
>>          supported by the host, de-duplicates identical memory pages
>>          among VMs instances (enabled by default).
>>  
>> +    ``memfd-alloc=on|off``
>> +        Enables or disables allocation of anonymous guest RAM using
>> +        memfd_create.  Any associated memory-backend objects are
>> created with
>> +        share=on.  The memfd-alloc default is off.
>> +
>>      ``aes-key-wrap=on|off``
>>          Enables or disables AES key wrapping support on s390-ccw
>> hosts.
>>          This feature controls whether AES wrapping keys will be
>> created
>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>> index 3524c04..95e2b49 100644
>> --- a/softmmu/physmem.c
>> +++ b/softmmu/physmem.c
>> @@ -41,6 +41,7 @@
>>  #include "qemu/config-file.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/qemu-print.h"
>> +#include "qemu/memfd.h"
>>  #include "exec/memory.h"
>>  #include "exec/ioport.h"
>>  #include "sysemu/dma.h"
>> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock
>> *new_block, Error **errp)
>>      const bool shared = qemu_ram_is_shared(new_block);
>>      RAMBlock *block;
>>      RAMBlock *last_block = NULL;
>> +    struct MemoryRegion *mr = new_block->mr;
>>      ram_addr_t old_ram_size, new_ram_size;
>>      Error *err = NULL;
>> +    const char *name;
>> +    void *addr = 0;
>> +    size_t maxlen;
>> +    MachineState *ms = MACHINE(qdev_get_machine());
>>  
>>      old_ram_size = last_ram_page();
>>  
>>      qemu_mutex_lock_ramlist();
>> -    new_block->offset = find_ram_offset(new_block->max_length);
>> +    maxlen = new_block->max_length;
>> +    new_block->offset = find_ram_offset(maxlen);
>>  
>>      if (!new_block->host) {
>>          if (xen_enabled()) {
>> -            xen_ram_alloc(new_block->offset, new_block->max_length,
>> -                          new_block->mr, &err);
>> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr,
>> &err);
>>              if (err) {
>>                  error_propagate(errp, err);
>>                  qemu_mutex_unlock_ramlist();
>>                  return;
>>              }
>>          } else {
>> -            new_block->host = qemu_anon_ram_alloc(new_block-
>>> max_length,
>> -                                                  &new_block->mr-
>>> align,
>> -                                                  shared,
>> noreserve);
>> -            if (!new_block->host) {
>> +            name = memory_region_name(mr);
>> +            if (ms->memfd_alloc) {
>> +                Object *parent = &mr->parent_obj;
>> +                int mfd = -1;          /* placeholder until next
>> patch */
>> +                mr->align = QEMU_VMALLOC_ALIGN;
>> +                if (mfd < 0) {
>> +                    mfd = qemu_memfd_create(name, maxlen + mr-
>>> align,
>> +                                            0, 0, 0, &err);
>> +                    if (mfd < 0) {
> 
> the error message should be propagated
> 
> Guoyi Tu
> 
>> +                        return;
>> +                    }

Will do, thanks, by setting errp directly:
     mfd = qemu_memfd_create(name, maxlen + mr->align, 0, 0, 0, errp);

- Steve


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 01/29] memory: qemu_check_ram_volatile
  2022-02-24 18:28   ` Dr. David Alan Gilbert
@ 2022-03-03 15:55     ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-03 15:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Philippe Mathieu-Daudé,
	Alex Bennée

On 2/24/2022 1:28 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Add a function that returns an error if any ram_list block represents
>> volatile memory.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  include/exec/memory.h |  8 ++++++++
>>  softmmu/memory.c      | 26 ++++++++++++++++++++++++++
>>  2 files changed, 34 insertions(+)
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 20f1b27..137f5f3 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -2981,6 +2981,14 @@ bool ram_block_discard_is_disabled(void);
>>   */
>>  bool ram_block_discard_is_required(void);
>>  
>> +/**
>> + * qemu_ram_check_volatile: return 1 if any memory regions are writable and not
>> + * backed by shared memory, else return 0.
>> + *
>> + * @errp: returned error message identifying the first volatile region found.
>> + */
>> +int qemu_check_ram_volatile(Error **errp);
>> +
>>  #endif
>>  
>>  #endif
>> diff --git a/softmmu/memory.c b/softmmu/memory.c
>> index 7340e19..30b2f68 100644
>> --- a/softmmu/memory.c
>> +++ b/softmmu/memory.c
>> @@ -2837,6 +2837,32 @@ void memory_global_dirty_log_stop(unsigned int flags)
>>      memory_global_dirty_log_do_stop(flags);
>>  }
>>  
>> +static int check_volatile(RAMBlock *rb, void *opaque)
>> +{
>> +    MemoryRegion *mr = rb->mr;
>> +
>> +    if (mr &&
>> +        memory_region_is_ram(mr) &&
>> +        !memory_region_is_ram_device(mr) &&
>> +        !memory_region_is_rom(mr) &&
>> +        (rb->fd == -1 || !qemu_ram_is_shared(rb))) {
>> +        *(const char **)opaque = memory_region_name(mr);
>> +        return -1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +int qemu_check_ram_volatile(Error **errp)
>> +{
>> +    char *name;
> 
> Does that need to be const char *name for safety since you're casting
> it to it below?

Will do, thanks.

- Steve

> Other than that,
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
>> +
>> +    if (qemu_ram_foreach_block(check_volatile, &name)) {
>> +        error_setg(errp, "Memory region %s is volatile", name);
>> +        return -1;
>> +    }
>> +    return 0;
>> +}
>> +
>>  static void listener_add_address_space(MemoryListener *listener,
>>                                         AddressSpace *as)
>>  {
>> -- 
>> 1.8.3.1
>>
>>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 02/29] migration: fix populate_vfio_info
  2022-02-24 18:42   ` Peter Maydell
@ 2022-03-03 15:55     ` Steven Sistare
  2022-03-03 16:21       ` Peter Maydell
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-03-03 15:55 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 2/24/2022 1:42 PM, Peter Maydell wrote:
> On Wed, 22 Dec 2021 at 19:45, Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>> Include CONFIG_DEVICES so that populate_vfio_info is instantiated for
>> CONFIG_VFIO.
> 
> The commit message says "include CONFIG_DEVICES"...
> 
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  migration/target.c | 10 +++++++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/migration/target.c b/migration/target.c
>> index 907ebf0..4390bf0 100644
>> --- a/migration/target.c
>> +++ b/migration/target.c
>> @@ -8,18 +8,22 @@
>>  #include "qemu/osdep.h"
>>  #include "qapi/qapi-types-migration.h"
>>  #include "migration.h"
>> +#include CONFIG_DEVICES
> 
> ...and the code change does do that, but...
> 
>>
>>  #ifdef CONFIG_VFIO
>> +
>>  #include "hw/vfio/vfio-common.h"
>> -#endif
>>
>>  void populate_vfio_info(MigrationInfo *info)
>>  {
>> -#ifdef CONFIG_VFIO
>>      if (vfio_mig_active()) {
>>          info->has_vfio = true;
>>          info->vfio = g_malloc0(sizeof(*info->vfio));
>>          info->vfio->transferred = vfio_mig_bytes_transferred();
>>      }
>> -#endif
>>  }
>> +#else
>> +
>> +void populate_vfio_info(MigrationInfo *info) {}
>> +
>> +#endif /* CONFIG_VFIO */
> 
> ...it also seems to be making a no-change-of-behaviour rewrite
> of the rest of the file. Is there a reason I'm missing for doing
> that ?
> 
> thanks
> -- PMM

I'll change the commit message to explain:

    Include CONFIG_DEVICES so that populate_vfio_info is instantiated for
    CONFIG_VFIO, and refactor so only one ifdef is needed when new functions
    are added in a later patch. 
  
The later patch is "vfio-pci: cpr part 1 (fd and dma)"

- Steve


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 03/29] migration: qemu file wrappers
  2022-02-24 18:21   ` Dr. David Alan Gilbert
@ 2022-03-03 15:55     ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-03 15:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Philippe Mathieu-Daudé,
	Alex Bennée

On 2/24/2022 1:21 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Add qemu_file_open and qemu_fd_open to create QEMUFile objects for unix
>> files and file descriptors.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  migration/qemu-file-channel.c | 36 ++++++++++++++++++++++++++++++++++++
>>  migration/qemu-file-channel.h |  6 ++++++
>>  2 files changed, 42 insertions(+)
>>
>> diff --git a/migration/qemu-file-channel.c b/migration/qemu-file-channel.c
>> index bb5a575..afb16d7 100644
>> --- a/migration/qemu-file-channel.c
>> +++ b/migration/qemu-file-channel.c
>> @@ -27,8 +27,10 @@
>>  #include "qemu-file.h"
>>  #include "io/channel-socket.h"
>>  #include "io/channel-tls.h"
>> +#include "io/channel-file.h"
>>  #include "qemu/iov.h"
>>  #include "qemu/yank.h"
>> +#include "qapi/error.h"
>>  #include "yank_functions.h"
>>  
>>  
>> @@ -192,3 +194,37 @@ QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc)
>>      object_ref(OBJECT(ioc));
>>      return qemu_fopen_ops(ioc, &channel_output_ops, true);
>>  }
>> +
>> +QEMUFile *qemu_file_open(const char *path, int flags, int mode,
>> +                         const char *name, Error **errp)
> 
> Can you please make that qemu_fopen_file

Will do.

>> +{
>> +    g_autoptr(QIOChannelFile) fioc = NULL;
>> +    QIOChannel *ioc;
>> +    QEMUFile *f;
>> +
>> +    if (flags & O_RDWR) {
>> +        error_setg(errp, "qemu_file_open %s: O_RDWR not supported", path);
>> +        return NULL;
>> +    }
>> +
>> +    fioc = qio_channel_file_new_path(path, flags, mode, errp);
>> +    if (!fioc) {
>> +        return NULL;
>> +    }
>> +
>> +    ioc = QIO_CHANNEL(fioc);
>> +    qio_channel_set_name(ioc, name);
>> +    f = (flags & O_WRONLY) ? qemu_fopen_channel_output(ioc) :
>> +                             qemu_fopen_channel_input(ioc);
>> +    return f;
>> +}
>> +
>> +QEMUFile *qemu_fd_open(int fd, bool writable, const char *name)
>> +{
> 
> Can you please make that qemu_fopen_fd

Will do.

>>> +    g_autoptr(QIOChannelFile) fioc = qio_channel_file_new_fd(fd);
> 
> Can you use qio_channel_new_fd for that? Then it creates either
> a socket or file subclass depending what type of fd is passed
> (and gives you a QIOChannel without needing to cast).

The downside is that we must pass and check an errp, which will only be
set for a socket, and this function qemu_fopen_fd is never intended to
be used for sockets.  The file case never fails.  IMO the current code
is better.  Are you OK with keeping it?

- Steve

>> +    QIOChannel *ioc = QIO_CHANNEL(fioc);
>> +    QEMUFile *f = writable ? qemu_fopen_channel_output(ioc) :
>> +                             qemu_fopen_channel_input(ioc);
>> +    qio_channel_set_name(ioc, name);
>> +    return f;
>> +}
>> diff --git a/migration/qemu-file-channel.h b/migration/qemu-file-channel.h
>> index 0028a09..324ae2d 100644
>> --- a/migration/qemu-file-channel.h
>> +++ b/migration/qemu-file-channel.h
>> @@ -29,4 +29,10 @@
>>  
>>  QEMUFile *qemu_fopen_channel_input(QIOChannel *ioc);
>>  QEMUFile *qemu_fopen_channel_output(QIOChannel *ioc);
>> +
>> +QEMUFile *qemu_file_open(const char *path, int flags, int mode,
>> +                         const char *name, Error **errp);
>> +
>> +QEMUFile *qemu_fd_open(int fd, bool writable, const char *name);
>> +
>>  #endif
>> -- 
>> 1.8.3.1
>>
>>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 04/29] migration: simplify savevm
  2022-02-24 18:25   ` Dr. David Alan Gilbert
@ 2022-03-03 15:55     ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-03 15:55 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Markus Armbruster, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Alex Bennée

On 2/24/2022 1:25 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Use qemu_file_open to simplify a few functions in savevm.c.
>> No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> So I think this is mostly OK, but a couple of minor tidyups below;
> so with the tidies and the renames from the previous patch:
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Cool, thanks.

>> ---
>>  migration/savevm.c | 21 +++++++--------------
>>  1 file changed, 7 insertions(+), 14 deletions(-)
>>
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 0bef031..c71d525 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2910,8 +2910,9 @@ bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
>>  void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
>>                                  Error **errp)
>>  {
>> +    const char *ioc_name = "migration-xen-save-state";
>> +    int flags = O_WRONLY | O_CREAT | O_TRUNC;
> 
> I don't see why to take these (or the matching ones in load) as separate
> variables; just keep them as is, and be parameters.

Will do.

- Steve

>>      QEMUFile *f;
>> -    QIOChannelFile *ioc;
>>      int saved_vm_running;
>>      int ret;
>>  
>> @@ -2925,14 +2926,10 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
>>      vm_stop(RUN_STATE_SAVE_VM);
>>      global_state_store_running();
>>  
>> -    ioc = qio_channel_file_new_path(filename, O_WRONLY | O_CREAT | O_TRUNC,
>> -                                    0660, errp);
>> -    if (!ioc) {
>> +    f = qemu_file_open(filename, flags, 0660, ioc_name, errp);
>> +    if (!f) {
>>          goto the_end;
>>      }
>> -    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-save-state");
>> -    f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
>> -    object_unref(OBJECT(ioc));
>>      ret = qemu_save_device_state(f);
>>      if (ret < 0 || qemu_fclose(f) < 0) {
>>          error_setg(errp, QERR_IO_ERROR);
>> @@ -2960,8 +2957,8 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
>>  
>>  void qmp_xen_load_devices_state(const char *filename, Error **errp)
>>  {
>> +    const char *ioc_name = "migration-xen-load-state";
>>      QEMUFile *f;
>> -    QIOChannelFile *ioc;
>>      int ret;
>>  
>>      /* Guest must be paused before loading the device state; the RAM state
>> @@ -2973,14 +2970,10 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
>>      }
>>      vm_stop(RUN_STATE_RESTORE_VM);
>>  
>> -    ioc = qio_channel_file_new_path(filename, O_RDONLY | O_BINARY, 0, errp);
>> -    if (!ioc) {
>> +    f = qemu_file_open(filename, O_RDONLY | O_BINARY, 0, ioc_name, errp);
>> +    if (!f) {
>>          return;
>>      }
>> -    qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-load-state");
>> -    f = qemu_fopen_channel_input(QIO_CHANNEL(ioc));
>> -    object_unref(OBJECT(ioc));
>> -
>>      ret = qemu_loadvm_state(f);
>>      qemu_fclose(f);
>>      if (ret < 0) {
>> -- 
>> 1.8.3.1
>>
>>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-02-24 17:56   ` Dr. David Alan Gilbert
@ 2022-03-03 15:56     ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-03 15:56 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Jason Zeng,
	Alex Bennée, Juan Quintela, qemu-devel, Eric Blake,
	Markus Armbruster, Zheng Chuan, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 2/24/2022 12:56 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
>> option is set.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> So other than the minor error nit that Guoyi spotted, I think this is
> pretty good,  one other comment below:
> 
>> ---
>>  hw/core/machine.c   | 19 +++++++++++++++++++
>>  include/hw/boards.h |  1 +
>>  qemu-options.hx     |  6 ++++++
>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>>  softmmu/vl.c        |  1 +
>>  trace-events        |  1 +
>>  util/qemu-config.c  |  4 ++++
>>  7 files changed, 70 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 53a99ab..7739d88 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>      ms->mem_merge = value;
>>  }
>>  
>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    return ms->memfd_alloc;
>> +}
>> +
>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    ms->memfd_alloc = value;
>> +}
>> +
>>  static bool machine_get_usb(Object *obj, Error **errp)
>>  {
>>      MachineState *ms = MACHINE(obj);
>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>      object_class_property_set_description(oc, "mem-merge",
>>          "Enable/disable memory merge support");
>>  
>> +    object_class_property_add_bool(oc, "memfd-alloc",
>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
>> +    object_class_property_set_description(oc, "memfd-alloc",
>> +        "Enable/disable allocating anonymous memory using memfd_create");
>> +
>>      object_class_property_add_bool(oc, "usb",
>>          machine_get_usb, machine_set_usb);
>>      object_class_property_set_description(oc, "usb",
>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>> index 9c1c190..a57d7a0 100644
>> --- a/include/hw/boards.h
>> +++ b/include/hw/boards.h
>> @@ -327,6 +327,7 @@ struct MachineState {
>>      char *dt_compatible;
>>      bool dump_guest_core;
>>      bool mem_merge;
>> +    bool memfd_alloc;
>>      bool usb;
>>      bool usb_disabled;
>>      char *firmware;
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 7d47510..33c8173 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
>>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
>>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
>>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
>> @@ -76,6 +77,11 @@ SRST
>>          supported by the host, de-duplicates identical memory pages
>>          among VMs instances (enabled by default).
>>  
>> +    ``memfd-alloc=on|off``
>> +        Enables or disables allocation of anonymous guest RAM using
>> +        memfd_create.  Any associated memory-backend objects are created with
>> +        share=on.  The memfd-alloc default is off.
>> +
>>      ``aes-key-wrap=on|off``
>>          Enables or disables AES key wrapping support on s390-ccw hosts.
>>          This feature controls whether AES wrapping keys will be created
>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>> index 3524c04..95e2b49 100644
>> --- a/softmmu/physmem.c
>> +++ b/softmmu/physmem.c
>> @@ -41,6 +41,7 @@
>>  #include "qemu/config-file.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/qemu-print.h"
>> +#include "qemu/memfd.h"
>>  #include "exec/memory.h"
>>  #include "exec/ioport.h"
>>  #include "sysemu/dma.h"
>> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>      const bool shared = qemu_ram_is_shared(new_block);
>>      RAMBlock *block;
>>      RAMBlock *last_block = NULL;
>> +    struct MemoryRegion *mr = new_block->mr;
>>      ram_addr_t old_ram_size, new_ram_size;
>>      Error *err = NULL;
>> +    const char *name;
>> +    void *addr = 0;
>> +    size_t maxlen;
> 
> You could move some of these down to the top of the block you're using
> them.

Will do.

One question:  I added this to shorten lines and make my code additions more readable:

    size_t maxlen;
    maxlen = new_block->max_length;

However, I did not change new_block->max_length to maxlen in the second half of the
function which I did not modify, to avoid noise in the patch that is unrelated to cpr.
Making those changes would shorten a few multi-liners.  What is your preference -- 
make those changes or not?

- Steve

>> +    MachineState *ms = MACHINE(qdev_get_machine());
>>  
>>      old_ram_size = last_ram_page();
>>  
>>      qemu_mutex_lock_ramlist();
>> -    new_block->offset = find_ram_offset(new_block->max_length);
>> +    maxlen = new_block->max_length;
>> +    new_block->offset = find_ram_offset(maxlen);
>>  
>>      if (!new_block->host) {
>>          if (xen_enabled()) {
>> -            xen_ram_alloc(new_block->offset, new_block->max_length,
>> -                          new_block->mr, &err);
>> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
>>              if (err) {
>>                  error_propagate(errp, err);
>>                  qemu_mutex_unlock_ramlist();
>>                  return;
>>              }
>>          } else {
>> -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
>> -                                                  &new_block->mr->align,
>> -                                                  shared, noreserve);
>> -            if (!new_block->host) {
>> +            name = memory_region_name(mr);
>> +            if (ms->memfd_alloc) {
>> +                Object *parent = &mr->parent_obj;
>> +                int mfd = -1;          /* placeholder until next patch */
>> +                mr->align = QEMU_VMALLOC_ALIGN;
>> +                if (mfd < 0) {
>> +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
>> +                                            0, 0, 0, &err);
>> +                    if (mfd < 0) {
>> +                        return;
>> +                    }
>> +                }
>> +                qemu_set_cloexec(mfd);
>> +                /* The memory backend already set its desired flags. */
>> +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
>> +                    new_block->flags |= RAM_SHARED;
>> +                }
>> +                addr = file_ram_alloc(new_block, maxlen, mfd,
>> +                                      false, false, 0, errp);
>> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
>> +            } else {
>> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
>> +                                           shared, noreserve);
>> +            }
>> +
>> +            if (!addr) {
>>                  error_setg_errno(errp, errno,
>>                                   "cannot set up guest memory '%s'",
>> -                                 memory_region_name(new_block->mr));
>> +                                 name);
>>                  qemu_mutex_unlock_ramlist();
>>                  return;
>>              }
>> -            memory_try_enable_merging(new_block->host, new_block->max_length);
>> +            memory_try_enable_merging(addr, maxlen);
>> +            new_block->host = addr;
>>          }
>>      }
>>  
>> diff --git a/softmmu/vl.c b/softmmu/vl.c
>> index 620a1f1..ab3648a 100644
>> --- a/softmmu/vl.c
>> +++ b/softmmu/vl.c
>> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
>>          object_property_set_str(obj, "mem-path", path, &error_fatal);
>>      }
>>      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
>> +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
>>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
>>                                obj);
>>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
>> diff --git a/trace-events b/trace-events
>> index a637a61..770a9ac 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
>>  # accel/tcg/cputlb.c
>>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
>>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
>> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
>>  
>>  # gdbstub.c
>>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
>> diff --git a/util/qemu-config.c b/util/qemu-config.c
>> index 436ab63..3606e5c 100644
>> --- a/util/qemu-config.c
>> +++ b/util/qemu-config.c
>> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
>>              .type = QEMU_OPT_BOOL,
>>              .help = "enable/disable memory merge support",
>>          },{
>> +            .name = "memfd-alloc",
>> +            .type = QEMU_OPT_BOOL,
>> +            .help = "enable/disable memfd_create for anonymous memory",
>> +        },{
>>              .name = "usb",
>>              .type = QEMU_OPT_BOOL,
>>              .help = "Set on/off to enable/disable usb",
>> -- 
>> 1.8.3.1
>>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 05/29] vl: start on wakeup request
  2022-02-24 18:51   ` Dr. David Alan Gilbert
@ 2022-03-03 15:56     ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-03 15:56 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Markus Armbruster
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Zheng Chuan, Alex Williamson, Paolo Bonzini,
	Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrange,
	Alex Bennée

On 2/24/2022 1:51 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> If qemu starts and loads a VM in the suspended state, then a later wakeup
>> request will set the state to running, which is not sufficient to initialize
>> the vm, as vm_start was never called during this invocation of qemu.  See
>> qemu_system_wakeup_request().
>>
>> Define the start_on_wakeup_requested() hook to cause vm_start() to be called
>> when processing the wakeup request.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  include/sysemu/runstate.h |  1 +
>>  softmmu/runstate.c        | 17 ++++++++++++++++-
>>  2 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
>> index a535691..b655c7b 100644
>> --- a/include/sysemu/runstate.h
>> +++ b/include/sysemu/runstate.h
>> @@ -51,6 +51,7 @@ void qemu_system_reset_request(ShutdownCause reason);
>>  void qemu_system_suspend_request(void);
>>  void qemu_register_suspend_notifier(Notifier *notifier);
>>  bool qemu_wakeup_suspend_enabled(void);
>> +void qemu_system_start_on_wakeup_request(void);
>>  void qemu_system_wakeup_request(WakeupReason reason, Error **errp);
>>  void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
>>  void qemu_register_wakeup_notifier(Notifier *notifier);
>> diff --git a/softmmu/runstate.c b/softmmu/runstate.c
>> index 10d9b73..3d344c9 100644
>> --- a/softmmu/runstate.c
>> +++ b/softmmu/runstate.c
>> @@ -115,6 +115,8 @@ static const RunStateTransition runstate_transitions_def[] = {
>>      { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING },
>>      { RUN_STATE_PRELAUNCH, RUN_STATE_FINISH_MIGRATE },
>>      { RUN_STATE_PRELAUNCH, RUN_STATE_INMIGRATE },
>> +    { RUN_STATE_PRELAUNCH, RUN_STATE_SUSPENDED },
>> +    { RUN_STATE_PRELAUNCH, RUN_STATE_PAUSED },
> 
> This seems separate? 

The RUN_STATE_SUSPENDED line is required for start on wake to work after qemu restarts.

> Is this the bit that allows you to load the VM into suspended?

Yes.

> But I note you're allowing PAUSED or SUSPENDED here, but the wake up
> code only handles suspended - is that expected?

The RUN_STATE_PAUSED line does not belong here. I will move it to  "cpr: reboot mode".

- Steve

>>      { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
>>      { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PAUSED },
>> @@ -335,6 +337,7 @@ void vm_state_notify(bool running, RunState state)
>>      }
>>  }
>>  
>> +static bool start_on_wakeup_requested;
>>  static ShutdownCause reset_requested;
>>  static ShutdownCause shutdown_requested;
>>  static int shutdown_signal;
>> @@ -562,6 +565,11 @@ void qemu_register_suspend_notifier(Notifier *notifier)
>>      notifier_list_add(&suspend_notifiers, notifier);
>>  }
>>  
>> +void qemu_system_start_on_wakeup_request(void)
>> +{
>> +    start_on_wakeup_requested = true;
>> +}
> 
> Markus: Is this OK, or should this actually be another runstate
> (PRELAUNCH_SUSPENDED??? or the like??) - is there an interaction here
> with the commandline change ideas for a build-the-guest at runtime?
> 
> Dave
> 
>>  void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
>>  {
>>      trace_system_wakeup_request(reason);
>> @@ -574,7 +582,14 @@ void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
>>      if (!(wakeup_reason_mask & (1 << reason))) {
>>          return;
>>      }
>> -    runstate_set(RUN_STATE_RUNNING);
>> +
>> +    if (start_on_wakeup_requested) {
>> +        start_on_wakeup_requested = false;
>> +        vm_start();
>> +    } else {
>> +        runstate_set(RUN_STATE_RUNNING);
>> +    }
>> +
>>      wakeup_reason = reason;
>>      qemu_notify_event();
>>  }
>> -- 
>> 1.8.3.1
>>
>>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 02/29] migration: fix populate_vfio_info
  2022-03-03 15:55     ` Steven Sistare
@ 2022-03-03 16:21       ` Peter Maydell
  2022-03-03 16:38         ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Peter Maydell @ 2022-03-03 16:21 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On Thu, 3 Mar 2022 at 15:55, Steven Sistare <steven.sistare@oracle.com> wrote:
>
> On 2/24/2022 1:42 PM, Peter Maydell wrote:
> > ...it also seems to be making a no-change-of-behaviour rewrite
> > of the rest of the file. Is there a reason I'm missing for doing
> > that ?

> I'll change the commit message to explain:
>
>     Include CONFIG_DEVICES so that populate_vfio_info is instantiated for
>     CONFIG_VFIO, and refactor so only one ifdef is needed when new functions
>     are added in a later patch.
>
> The later patch is "vfio-pci: cpr part 1 (fd and dma)"

I'd prefer it if you split this patch into two patches; these two changes
aren't related.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 02/29] migration: fix populate_vfio_info
  2022-03-03 16:21       ` Peter Maydell
@ 2022-03-03 16:38         ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-03 16:38 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 3/3/2022 11:21 AM, Peter Maydell wrote:
> On Thu, 3 Mar 2022 at 15:55, Steven Sistare <steven.sistare@oracle.com> wrote:
>>
>> On 2/24/2022 1:42 PM, Peter Maydell wrote:
>>> ...it also seems to be making a no-change-of-behaviour rewrite
>>> of the rest of the file. Is there a reason I'm missing for doing
>>> that ?
> 
>> I'll change the commit message to explain:
>>
>>     Include CONFIG_DEVICES so that populate_vfio_info is instantiated for
>>     CONFIG_VFIO, and refactor so only one ifdef is needed when new functions
>>     are added in a later patch.
>>
>> The later patch is "vfio-pci: cpr part 1 (fd and dma)"
> 
> I'd prefer it if you split this patch into two patches; these two changes
> aren't related.
> 
> thanks
> -- PMM


Will do - Steve


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2021-12-22 19:05 ` [PATCH V7 10/29] machine: memfd-alloc option Steve Sistare
  2022-02-18  8:05   ` Guoyi Tu
  2022-02-24 17:56   ` Dr. David Alan Gilbert
@ 2022-03-03 17:21   ` Michael S. Tsirkin
  2022-03-04 10:41     ` Igor Mammedov
  2022-03-11 10:25     ` David Hildenbrand
  2022-03-11  9:54   ` David Hildenbrand
  3 siblings, 2 replies; 96+ messages in thread
From: Michael S. Tsirkin @ 2022-03-03 17:21 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Jason Zeng, Alex Bennée,
	qemu-devel, Eric Blake, Dr. David Alan Gilbert, Zheng Chuan,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:
> Allocate anonymous memory using memfd_create if the memfd-alloc machine
> option is set.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/core/machine.c   | 19 +++++++++++++++++++
>  include/hw/boards.h |  1 +
>  qemu-options.hx     |  6 ++++++
>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>  softmmu/vl.c        |  1 +
>  trace-events        |  1 +
>  util/qemu-config.c  |  4 ++++
>  7 files changed, 70 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 53a99ab..7739d88 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>      ms->mem_merge = value;
>  }
>  
> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    return ms->memfd_alloc;
> +}
> +
> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    ms->memfd_alloc = value;
> +}
> +
>  static bool machine_get_usb(Object *obj, Error **errp)
>  {
>      MachineState *ms = MACHINE(obj);
> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>      object_class_property_set_description(oc, "mem-merge",
>          "Enable/disable memory merge support");
>  
> +    object_class_property_add_bool(oc, "memfd-alloc",
> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
> +    object_class_property_set_description(oc, "memfd-alloc",
> +        "Enable/disable allocating anonymous memory using memfd_create");
> +
>      object_class_property_add_bool(oc, "usb",
>          machine_get_usb, machine_set_usb);
>      object_class_property_set_description(oc, "usb",
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 9c1c190..a57d7a0 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -327,6 +327,7 @@ struct MachineState {
>      char *dt_compatible;
>      bool dump_guest_core;
>      bool mem_merge;
> +    bool memfd_alloc;
>      bool usb;
>      bool usb_disabled;
>      char *firmware;
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 7d47510..33c8173 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>      "                mem-merge=on|off controls memory merge support (default: on)\n"
> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"

Question: are there any disadvantages associated with using
memfd_create? I guess we are using up an fd, but that seems minor.  Any
reason not to set to on by default? maybe with a fallback option to
disable that?

I am concerned that it's actually a kind of memory backend, this flag
seems to instead be closer to the deprecated mem-prealloc. E.g.
it does not work with a mem path, does it?


>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
> @@ -76,6 +77,11 @@ SRST
>          supported by the host, de-duplicates identical memory pages
>          among VMs instances (enabled by default).
>  
> +    ``memfd-alloc=on|off``
> +        Enables or disables allocation of anonymous guest RAM using
> +        memfd_create.  Any associated memory-backend objects are created with
> +        share=on.  The memfd-alloc default is off.
> +
>      ``aes-key-wrap=on|off``
>          Enables or disables AES key wrapping support on s390-ccw hosts.
>          This feature controls whether AES wrapping keys will be created
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 3524c04..95e2b49 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -41,6 +41,7 @@
>  #include "qemu/config-file.h"
>  #include "qemu/error-report.h"
>  #include "qemu/qemu-print.h"
> +#include "qemu/memfd.h"
>  #include "exec/memory.h"
>  #include "exec/ioport.h"
>  #include "sysemu/dma.h"
> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>      const bool shared = qemu_ram_is_shared(new_block);
>      RAMBlock *block;
>      RAMBlock *last_block = NULL;
> +    struct MemoryRegion *mr = new_block->mr;
>      ram_addr_t old_ram_size, new_ram_size;
>      Error *err = NULL;
> +    const char *name;
> +    void *addr = 0;
> +    size_t maxlen;
> +    MachineState *ms = MACHINE(qdev_get_machine());
>  
>      old_ram_size = last_ram_page();
>  
>      qemu_mutex_lock_ramlist();
> -    new_block->offset = find_ram_offset(new_block->max_length);
> +    maxlen = new_block->max_length;
> +    new_block->offset = find_ram_offset(maxlen);
>  
>      if (!new_block->host) {
>          if (xen_enabled()) {
> -            xen_ram_alloc(new_block->offset, new_block->max_length,
> -                          new_block->mr, &err);
> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
>              if (err) {
>                  error_propagate(errp, err);
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
>          } else {
> -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
> -                                                  &new_block->mr->align,
> -                                                  shared, noreserve);
> -            if (!new_block->host) {
> +            name = memory_region_name(mr);
> +            if (ms->memfd_alloc) {
> +                Object *parent = &mr->parent_obj;
> +                int mfd = -1;          /* placeholder until next patch */
> +                mr->align = QEMU_VMALLOC_ALIGN;
> +                if (mfd < 0) {
> +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
> +                                            0, 0, 0, &err);
> +                    if (mfd < 0) {
> +                        return;
> +                    }
> +                }
> +                qemu_set_cloexec(mfd);
> +                /* The memory backend already set its desired flags. */
> +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
> +                    new_block->flags |= RAM_SHARED;
> +                }
> +                addr = file_ram_alloc(new_block, maxlen, mfd,
> +                                      false, false, 0, errp);
> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
> +            } else {
> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
> +                                           shared, noreserve);
> +            }
> +
> +            if (!addr) {
>                  error_setg_errno(errp, errno,
>                                   "cannot set up guest memory '%s'",
> -                                 memory_region_name(new_block->mr));
> +                                 name);
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
> -            memory_try_enable_merging(new_block->host, new_block->max_length);
> +            memory_try_enable_merging(addr, maxlen);
> +            new_block->host = addr;
>          }
>      }
>  
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 620a1f1..ab3648a 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
>          object_property_set_str(obj, "mem-path", path, &error_fatal);
>      }
>      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
> +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
>                                obj);
>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
> diff --git a/trace-events b/trace-events
> index a637a61..770a9ac 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
>  # accel/tcg/cputlb.c
>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
>  
>  # gdbstub.c
>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
> diff --git a/util/qemu-config.c b/util/qemu-config.c
> index 436ab63..3606e5c 100644
> --- a/util/qemu-config.c
> +++ b/util/qemu-config.c
> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
>              .type = QEMU_OPT_BOOL,
>              .help = "enable/disable memory merge support",
>          },{
> +            .name = "memfd-alloc",
> +            .type = QEMU_OPT_BOOL,
> +            .help = "enable/disable memfd_create for anonymous memory",
> +        },{
>              .name = "usb",
>              .type = QEMU_OPT_BOOL,
>              .help = "Set on/off to enable/disable usb",
> -- 
> 1.8.3.1



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 18/29] vfio-pci: refactor for cpr
  2021-12-22 19:05 ` [PATCH V7 18/29] vfio-pci: refactor " Steve Sistare
@ 2022-03-03 23:21   ` Alex Williamson
  2022-03-07 14:42     ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2022-03-03 23:21 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Paolo Bonzini,
	Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrange,
	Alex Bennée, Markus Armbruster

On Wed, 22 Dec 2021 11:05:23 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> +    if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
...
> +    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
...
> +    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
...
> +    ret = vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
...
> +        vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
...
> +    vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
...
> +    const char *name = "kvm_interrupt";
...
> +    if (vfio_notifier_init(vdev, &vector->kvm_interrupt, name, nr)) {
...
> +        vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, name, nr);
...
> +        vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, name, nr);
...
> +    vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
...
> +    if (vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr)) {
...
> +        if (vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i)) {
...
> +            vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
...
> +            vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
...
> +    if (vfio_notifier_init(vdev, &vdev->err_notifier, "err", 0)) {
...
> +        vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
...
> +    vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
...
> +    if (vfio_notifier_init(vdev, &vdev->req_notifier, "req", 0)) {
...
> +        vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
...
> +    vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);

Something seems to have gone astray with "err" and "req" vs
"err_notifier" and "req_notifier".  The pattern is broken.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-03 17:21   ` Michael S. Tsirkin
@ 2022-03-04 10:41     ` Igor Mammedov
  2022-03-07 14:41       ` Steven Sistare
  2022-03-11 10:25     ` David Hildenbrand
  1 sibling, 1 reply; 96+ messages in thread
From: Igor Mammedov @ 2022-03-04 10:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Zeng, Juan Quintela, Eric Blake,
	Philippe Mathieu-Daudé,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Daniel P. Berrange, Alex Bennée,
	Markus Armbruster

On Thu, 3 Mar 2022 12:21:15 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:
> > Allocate anonymous memory using memfd_create if the memfd-alloc machine
> > option is set.
> > 
> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > ---
> >  hw/core/machine.c   | 19 +++++++++++++++++++
> >  include/hw/boards.h |  1 +
> >  qemu-options.hx     |  6 ++++++
> >  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
> >  softmmu/vl.c        |  1 +
> >  trace-events        |  1 +
> >  util/qemu-config.c  |  4 ++++
> >  7 files changed, 70 insertions(+), 9 deletions(-)
> > 
> > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > index 53a99ab..7739d88 100644
> > --- a/hw/core/machine.c
> > +++ b/hw/core/machine.c
> > @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
> >      ms->mem_merge = value;
> >  }
> >  
> > +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
> > +{
> > +    MachineState *ms = MACHINE(obj);
> > +
> > +    return ms->memfd_alloc;
> > +}
> > +
> > +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
> > +{
> > +    MachineState *ms = MACHINE(obj);
> > +
> > +    ms->memfd_alloc = value;
> > +}
> > +
> >  static bool machine_get_usb(Object *obj, Error **errp)
> >  {
> >      MachineState *ms = MACHINE(obj);
> > @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
> >      object_class_property_set_description(oc, "mem-merge",
> >          "Enable/disable memory merge support");
> >  
> > +    object_class_property_add_bool(oc, "memfd-alloc",
> > +        machine_get_memfd_alloc, machine_set_memfd_alloc);
> > +    object_class_property_set_description(oc, "memfd-alloc",
> > +        "Enable/disable allocating anonymous memory using memfd_create");
> > +
> >      object_class_property_add_bool(oc, "usb",
> >          machine_get_usb, machine_set_usb);
> >      object_class_property_set_description(oc, "usb",
> > diff --git a/include/hw/boards.h b/include/hw/boards.h
> > index 9c1c190..a57d7a0 100644
> > --- a/include/hw/boards.h
> > +++ b/include/hw/boards.h
> > @@ -327,6 +327,7 @@ struct MachineState {
> >      char *dt_compatible;
> >      bool dump_guest_core;
> >      bool mem_merge;
> > +    bool memfd_alloc;
> >      bool usb;
> >      bool usb_disabled;
> >      char *firmware;
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 7d47510..33c8173 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
> >      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
> >      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
> >      "                mem-merge=on|off controls memory merge support (default: on)\n"
> > +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"  
> 
> Question: are there any disadvantages associated with using
> memfd_create? I guess we are using up an fd, but that seems minor.  Any
> reason not to set to on by default? maybe with a fallback option to
> disable that?
> 
> I am concerned that it's actually a kind of memory backend, this flag
> seems to instead be closer to the deprecated mem-prealloc. E.g.
> it does not work with a mem path, does it?

(mem path and mem-prealloc are transparently aliased to used memory backend
if I recall it right.)

Steve,

For allocating guest RAM, we switched exclusively to using memory-backends
including initial guest RAM (-m size option) and we have hostmem-memfd
that uses memfd_create() and I'd rather avoid adding random knobs to machine
for tweaking how RAM should be allocated, we have memory backends for this,
so this patch begs the question: why hostmem-memfd is not sufficient?
(patch description is rather lacking on rationale behind the patch)


> 
> 
> >      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
> >      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
> >      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
> > @@ -76,6 +77,11 @@ SRST
> >          supported by the host, de-duplicates identical memory pages
> >          among VMs instances (enabled by default).
> >  
> > +    ``memfd-alloc=on|off``
> > +        Enables or disables allocation of anonymous guest RAM using
> > +        memfd_create.  Any associated memory-backend objects are created with
> > +        share=on.  The memfd-alloc default is off.
> > +
> >      ``aes-key-wrap=on|off``
> >          Enables or disables AES key wrapping support on s390-ccw hosts.
> >          This feature controls whether AES wrapping keys will be created
> > diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> > index 3524c04..95e2b49 100644
> > --- a/softmmu/physmem.c
> > +++ b/softmmu/physmem.c
> > @@ -41,6 +41,7 @@
> >  #include "qemu/config-file.h"
> >  #include "qemu/error-report.h"
> >  #include "qemu/qemu-print.h"
> > +#include "qemu/memfd.h"
> >  #include "exec/memory.h"
> >  #include "exec/ioport.h"
> >  #include "sysemu/dma.h"
> > @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> >      const bool shared = qemu_ram_is_shared(new_block);
> >      RAMBlock *block;
> >      RAMBlock *last_block = NULL;
> > +    struct MemoryRegion *mr = new_block->mr;
> >      ram_addr_t old_ram_size, new_ram_size;
> >      Error *err = NULL;
> > +    const char *name;
> > +    void *addr = 0;
> > +    size_t maxlen;
> > +    MachineState *ms = MACHINE(qdev_get_machine());
> >  
> >      old_ram_size = last_ram_page();
> >  
> >      qemu_mutex_lock_ramlist();
> > -    new_block->offset = find_ram_offset(new_block->max_length);
> > +    maxlen = new_block->max_length;
> > +    new_block->offset = find_ram_offset(maxlen);
> >  
> >      if (!new_block->host) {
> >          if (xen_enabled()) {
> > -            xen_ram_alloc(new_block->offset, new_block->max_length,
> > -                          new_block->mr, &err);
> > +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
> >              if (err) {
> >                  error_propagate(errp, err);
> >                  qemu_mutex_unlock_ramlist();
> >                  return;
> >              }
> >          } else {
> > -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
> > -                                                  &new_block->mr->align,
> > -                                                  shared, noreserve);
> > -            if (!new_block->host) {
> > +            name = memory_region_name(mr);
> > +            if (ms->memfd_alloc) {
> > +                Object *parent = &mr->parent_obj;
> > +                int mfd = -1;          /* placeholder until next patch */
> > +                mr->align = QEMU_VMALLOC_ALIGN;
> > +                if (mfd < 0) {
> > +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
> > +                                            0, 0, 0, &err);
> > +                    if (mfd < 0) {
> > +                        return;
> > +                    }
> > +                }
> > +                qemu_set_cloexec(mfd);
> > +                /* The memory backend already set its desired flags. */
> > +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
> > +                    new_block->flags |= RAM_SHARED;
> > +                }
> > +                addr = file_ram_alloc(new_block, maxlen, mfd,
> > +                                      false, false, 0, errp);
> > +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
> > +            } else {
> > +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
> > +                                           shared, noreserve);
> > +            }
> > +
> > +            if (!addr) {
> >                  error_setg_errno(errp, errno,
> >                                   "cannot set up guest memory '%s'",
> > -                                 memory_region_name(new_block->mr));
> > +                                 name);
> >                  qemu_mutex_unlock_ramlist();
> >                  return;
> >              }
> > -            memory_try_enable_merging(new_block->host, new_block->max_length);
> > +            memory_try_enable_merging(addr, maxlen);
> > +            new_block->host = addr;
> >          }
> >      }
> >  
> > diff --git a/softmmu/vl.c b/softmmu/vl.c
> > index 620a1f1..ab3648a 100644
> > --- a/softmmu/vl.c
> > +++ b/softmmu/vl.c
> > @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
> >          object_property_set_str(obj, "mem-path", path, &error_fatal);
> >      }
> >      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
> > +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
> >      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
> >                                obj);
> >      /* Ensure backend's memory region name is equal to mc->default_ram_id */
> > diff --git a/trace-events b/trace-events
> > index a637a61..770a9ac 100644
> > --- a/trace-events
> > +++ b/trace-events
> > @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
> >  # accel/tcg/cputlb.c
> >  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
> >  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
> > +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
> >  
> >  # gdbstub.c
> >  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
> > diff --git a/util/qemu-config.c b/util/qemu-config.c
> > index 436ab63..3606e5c 100644
> > --- a/util/qemu-config.c
> > +++ b/util/qemu-config.c
> > @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
> >              .type = QEMU_OPT_BOOL,
> >              .help = "enable/disable memory merge support",
> >          },{
> > +            .name = "memfd-alloc",
> > +            .type = QEMU_OPT_BOOL,
> > +            .help = "enable/disable memfd_create for anonymous memory",
> > +        },{
> >              .name = "usb",
> >              .type = QEMU_OPT_BOOL,
> >              .help = "Set on/off to enable/disable usb",
> > -- 
> > 1.8.3.1  
> 
> 



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 01/29] memory: qemu_check_ram_volatile
  2021-12-22 19:05 ` [PATCH V7 01/29] memory: qemu_check_ram_volatile Steve Sistare
  2022-02-24 18:28   ` Dr. David Alan Gilbert
@ 2022-03-04 12:47   ` Philippe Mathieu-Daudé
  1 sibling, 0 replies; 96+ messages in thread
From: Philippe Mathieu-Daudé @ 2022-03-04 12:47 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	Dr. David Alan Gilbert, Markus Armbruster, Zheng Chuan,
	Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée

On 22/12/21 20:05, Steve Sistare wrote:
> Add a function that returns an error if any ram_list block represents
> volatile memory.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   include/exec/memory.h |  8 ++++++++
>   softmmu/memory.c      | 26 ++++++++++++++++++++++++++
>   2 files changed, 34 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 20f1b27..137f5f3 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -2981,6 +2981,14 @@ bool ram_block_discard_is_disabled(void);
>    */
>   bool ram_block_discard_is_required(void);
>   
> +/**
> + * qemu_ram_check_volatile: return 1 if any memory regions are writable and not
> + * backed by shared memory, else return 0.
> + *
> + * @errp: returned error message identifying the first volatile region found.

This doesn't seem a good usage of the Error API. This is not an error
actually, but the expected result. If you want to return the first
or all, better use an explicit argument for them. Returning the first
is odd however. Is it useful for the user? If so, we want to return
them all, eventually in a GArray/GPtrArray, and return the MemoryRegion
handle, not its name. Otherwise if it is only useful for developers I'd
simply log the volatile MR name in a trace event.

Then we get:

   bool ram_block_is_volatile(void);

Or

   bool qemu_ram_is_volatile(void);

> + */
> +int qemu_check_ram_volatile(Error **errp);


> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 7340e19..30b2f68 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -2837,6 +2837,32 @@ void memory_global_dirty_log_stop(unsigned int flags)
>       memory_global_dirty_log_do_stop(flags);
>   }
>   
> +static int check_volatile(RAMBlock *rb, void *opaque)

If using the 'qemu_ram_is_volatile' name for the public API,
this one could be 'static bool ram_block_is_volatile(...)'.

> +{
> +    MemoryRegion *mr = rb->mr;
> +
> +    if (mr &&
> +        memory_region_is_ram(mr) &&
> +        !memory_region_is_ram_device(mr) &&
> +        !memory_region_is_rom(mr) &&
> +        (rb->fd == -1 || !qemu_ram_is_shared(rb))) {
> +        *(const char **)opaque = memory_region_name(mr);
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +int qemu_check_ram_volatile(Error **errp)
> +{
> +    char *name;
> +
> +    if (qemu_ram_foreach_block(check_volatile, &name)) {
> +        error_setg(errp, "Memory region %s is volatile", name);
> +        return -1;
> +    }
> +    return 0;
> +}

Regards,

Phil.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 08/29] memory: flat section iterator
  2021-12-22 19:05 ` [PATCH V7 08/29] memory: flat section iterator Steve Sistare
@ 2022-03-04 12:48   ` Philippe Mathieu-Daudé
  2022-03-07 14:42     ` Steven Sistare
  2022-03-09 14:18   ` Marc-André Lureau
  1 sibling, 1 reply; 96+ messages in thread
From: Philippe Mathieu-Daudé @ 2022-03-04 12:48 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	Dr. David Alan Gilbert, Markus Armbruster, Zheng Chuan,
	Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée

On 22/12/21 20:05, Steve Sistare wrote:
> Add an iterator over the sections of a flattened address space.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   include/exec/memory.h | 31 +++++++++++++++++++++++++++++++
>   softmmu/memory.c      | 20 ++++++++++++++++++++
>   2 files changed, 51 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 137f5f3..9660475 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -2338,6 +2338,37 @@ void memory_region_set_ram_discard_manager(MemoryRegion *mr,
>                                              RamDiscardManager *rdm);
>   
>   /**
> + * memory_region_section_cb: callback for address_space_flat_for_each_section()
> + *
> + * @s: MemoryRegionSection of the range

Nitpicking, can we name this @mrs?

> + * @opaque: data pointer passed to address_space_flat_for_each_section()
> + * @errp: error message, returned to the address_space_flat_for_each_section
> + *        caller.
> + *
> + * Returns: non-zero to stop the iteration, and 0 to continue.  The same
> + * non-zero value is returned to the address_space_flat_for_each_section caller.
> + */
> +
> +typedef int (*memory_region_section_cb)(MemoryRegionSection *s,
> +                                        void *opaque,
> +                                        Error **errp);


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-04 10:41     ` Igor Mammedov
@ 2022-03-07 14:41       ` Steven Sistare
  2022-03-08  6:50         ` Michael S. Tsirkin
  2022-03-11 10:08         ` Daniel P. Berrangé
  0 siblings, 2 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-07 14:41 UTC (permalink / raw)
  To: Igor Mammedov, Michael S. Tsirkin
  Cc: Jason Zeng, Juan Quintela, Eric Blake,
	Philippe Mathieu-Daudé,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Alex Bennée, Markus Armbruster

On 3/4/2022 5:41 AM, Igor Mammedov wrote:
> On Thu, 3 Mar 2022 12:21:15 -0500
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
>> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:
>>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
>>> option is set.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>  hw/core/machine.c   | 19 +++++++++++++++++++
>>>  include/hw/boards.h |  1 +
>>>  qemu-options.hx     |  6 ++++++
>>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>>>  softmmu/vl.c        |  1 +
>>>  trace-events        |  1 +
>>>  util/qemu-config.c  |  4 ++++
>>>  7 files changed, 70 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>> index 53a99ab..7739d88 100644
>>> --- a/hw/core/machine.c
>>> +++ b/hw/core/machine.c
>>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>>      ms->mem_merge = value;
>>>  }
>>>  
>>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
>>> +{
>>> +    MachineState *ms = MACHINE(obj);
>>> +
>>> +    return ms->memfd_alloc;
>>> +}
>>> +
>>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
>>> +{
>>> +    MachineState *ms = MACHINE(obj);
>>> +
>>> +    ms->memfd_alloc = value;
>>> +}
>>> +
>>>  static bool machine_get_usb(Object *obj, Error **errp)
>>>  {
>>>      MachineState *ms = MACHINE(obj);
>>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>>      object_class_property_set_description(oc, "mem-merge",
>>>          "Enable/disable memory merge support");
>>>  
>>> +    object_class_property_add_bool(oc, "memfd-alloc",
>>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
>>> +    object_class_property_set_description(oc, "memfd-alloc",
>>> +        "Enable/disable allocating anonymous memory using memfd_create");
>>> +
>>>      object_class_property_add_bool(oc, "usb",
>>>          machine_get_usb, machine_set_usb);
>>>      object_class_property_set_description(oc, "usb",
>>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>>> index 9c1c190..a57d7a0 100644
>>> --- a/include/hw/boards.h
>>> +++ b/include/hw/boards.h
>>> @@ -327,6 +327,7 @@ struct MachineState {
>>>      char *dt_compatible;
>>>      bool dump_guest_core;
>>>      bool mem_merge;
>>> +    bool memfd_alloc;
>>>      bool usb;
>>>      bool usb_disabled;
>>>      char *firmware;
>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>> index 7d47510..33c8173 100644
>>> --- a/qemu-options.hx
>>> +++ b/qemu-options.hx
>>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
>>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"  
>>
>> Question: are there any disadvantages associated with using
>> memfd_create? I guess we are using up an fd, but that seems minor.  Any
>> reason not to set to on by default? maybe with a fallback option to
>> disable that?

Old Linux host kernels, circa 4.1, do not support huge pages for shared memory.
Also, the tunable to enable huge pages for share memory is different than for
anon memory, so there could be performance loss if it is not set correctly.
    /sys/kernel/mm/transparent_hugepage/enabled
    vs
    /sys/kernel/mm/transparent_hugepage/shmem_enabled

It might make sense to use memfd_create by default for the secondary segments.

>> I am concerned that it's actually a kind of memory backend, this flag
>> seems to instead be closer to the deprecated mem-prealloc. E.g.
>> it does not work with a mem path, does it?

One can still define a memory backend with mempath to create the main ram segment,
though it must be some form of shared to work with live update.  Indeed, I would 
expect most users to specify an explicit memory backend for it.  The secondary
segments would still use memfd_create.

> (mem path and mem-prealloc are transparently aliased to used memory backend
> if I recall it right.)
> 
> Steve,
> 
> For allocating guest RAM, we switched exclusively to using memory-backends
> including initial guest RAM (-m size option) and we have hostmem-memfd
> that uses memfd_create() and I'd rather avoid adding random knobs to machine
> for tweaking how RAM should be allocated, we have memory backends for this,
> so this patch begs the question: why hostmem-memfd is not sufficient?
> (patch description is rather lacking on rationale behind the patch)

There is currently no way to specify memory backends for the secondary memory
segments (vram, roms, etc), and IMO it would be onerous to specify a backend for
each of them.  On x86_64, these include pc.bios, vga.vram, pc.rom, vga.rom,
/rom@etc/acpi/tables, /rom@etc/table-loader, /rom@etc/acpi/rsdp.

- Steve

>>>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
>>>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
>>>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
>>> @@ -76,6 +77,11 @@ SRST
>>>          supported by the host, de-duplicates identical memory pages
>>>          among VMs instances (enabled by default).
>>>  
>>> +    ``memfd-alloc=on|off``
>>> +        Enables or disables allocation of anonymous guest RAM using
>>> +        memfd_create.  Any associated memory-backend objects are created with
>>> +        share=on.  The memfd-alloc default is off.
>>> +
>>>      ``aes-key-wrap=on|off``
>>>          Enables or disables AES key wrapping support on s390-ccw hosts.
>>>          This feature controls whether AES wrapping keys will be created
>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>>> index 3524c04..95e2b49 100644
>>> --- a/softmmu/physmem.c
>>> +++ b/softmmu/physmem.c
>>> @@ -41,6 +41,7 @@
>>>  #include "qemu/config-file.h"
>>>  #include "qemu/error-report.h"
>>>  #include "qemu/qemu-print.h"
>>> +#include "qemu/memfd.h"
>>>  #include "exec/memory.h"
>>>  #include "exec/ioport.h"
>>>  #include "sysemu/dma.h"
>>> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>      const bool shared = qemu_ram_is_shared(new_block);
>>>      RAMBlock *block;
>>>      RAMBlock *last_block = NULL;
>>> +    struct MemoryRegion *mr = new_block->mr;
>>>      ram_addr_t old_ram_size, new_ram_size;
>>>      Error *err = NULL;
>>> +    const char *name;
>>> +    void *addr = 0;
>>> +    size_t maxlen;
>>> +    MachineState *ms = MACHINE(qdev_get_machine());
>>>  
>>>      old_ram_size = last_ram_page();
>>>  
>>>      qemu_mutex_lock_ramlist();
>>> -    new_block->offset = find_ram_offset(new_block->max_length);
>>> +    maxlen = new_block->max_length;
>>> +    new_block->offset = find_ram_offset(maxlen);
>>>  
>>>      if (!new_block->host) {
>>>          if (xen_enabled()) {
>>> -            xen_ram_alloc(new_block->offset, new_block->max_length,
>>> -                          new_block->mr, &err);
>>> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
>>>              if (err) {
>>>                  error_propagate(errp, err);
>>>                  qemu_mutex_unlock_ramlist();
>>>                  return;
>>>              }
>>>          } else {
>>> -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
>>> -                                                  &new_block->mr->align,
>>> -                                                  shared, noreserve);
>>> -            if (!new_block->host) {
>>> +            name = memory_region_name(mr);
>>> +            if (ms->memfd_alloc) {
>>> +                Object *parent = &mr->parent_obj;
>>> +                int mfd = -1;          /* placeholder until next patch */
>>> +                mr->align = QEMU_VMALLOC_ALIGN;
>>> +                if (mfd < 0) {
>>> +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
>>> +                                            0, 0, 0, &err);
>>> +                    if (mfd < 0) {
>>> +                        return;
>>> +                    }
>>> +                }
>>> +                qemu_set_cloexec(mfd);
>>> +                /* The memory backend already set its desired flags. */
>>> +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
>>> +                    new_block->flags |= RAM_SHARED;
>>> +                }
>>> +                addr = file_ram_alloc(new_block, maxlen, mfd,
>>> +                                      false, false, 0, errp);
>>> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
>>> +            } else {
>>> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
>>> +                                           shared, noreserve);
>>> +            }
>>> +
>>> +            if (!addr) {
>>>                  error_setg_errno(errp, errno,
>>>                                   "cannot set up guest memory '%s'",
>>> -                                 memory_region_name(new_block->mr));
>>> +                                 name);
>>>                  qemu_mutex_unlock_ramlist();
>>>                  return;
>>>              }
>>> -            memory_try_enable_merging(new_block->host, new_block->max_length);
>>> +            memory_try_enable_merging(addr, maxlen);
>>> +            new_block->host = addr;
>>>          }
>>>      }
>>>  
>>> diff --git a/softmmu/vl.c b/softmmu/vl.c
>>> index 620a1f1..ab3648a 100644
>>> --- a/softmmu/vl.c
>>> +++ b/softmmu/vl.c
>>> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
>>>          object_property_set_str(obj, "mem-path", path, &error_fatal);
>>>      }
>>>      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
>>> +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
>>>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
>>>                                obj);
>>>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
>>> diff --git a/trace-events b/trace-events
>>> index a637a61..770a9ac 100644
>>> --- a/trace-events
>>> +++ b/trace-events
>>> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
>>>  # accel/tcg/cputlb.c
>>>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
>>>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
>>> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
>>>  
>>>  # gdbstub.c
>>>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
>>> diff --git a/util/qemu-config.c b/util/qemu-config.c
>>> index 436ab63..3606e5c 100644
>>> --- a/util/qemu-config.c
>>> +++ b/util/qemu-config.c
>>> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
>>>              .type = QEMU_OPT_BOOL,
>>>              .help = "enable/disable memory merge support",
>>>          },{
>>> +            .name = "memfd-alloc",
>>> +            .type = QEMU_OPT_BOOL,
>>> +            .help = "enable/disable memfd_create for anonymous memory",
>>> +        },{
>>>              .name = "usb",
>>>              .type = QEMU_OPT_BOOL,
>>>              .help = "Set on/off to enable/disable usb",
>>> -- 
>>> 1.8.3.1  
>>
>>
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 08/29] memory: flat section iterator
  2022-03-04 12:48   ` Philippe Mathieu-Daudé
@ 2022-03-07 14:42     ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-07 14:42 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, qemu-devel
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	Dr. David Alan Gilbert, Markus Armbruster, Zheng Chuan,
	Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée

On 3/4/2022 7:48 AM, Philippe Mathieu-Daudé wrote:
> On 22/12/21 20:05, Steve Sistare wrote:
>> Add an iterator over the sections of a flattened address space.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   include/exec/memory.h | 31 +++++++++++++++++++++++++++++++
>>   softmmu/memory.c      | 20 ++++++++++++++++++++
>>   2 files changed, 51 insertions(+)
>>
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 137f5f3..9660475 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -2338,6 +2338,37 @@ void memory_region_set_ram_discard_manager(MemoryRegion *mr,
>>                                              RamDiscardManager *rdm);
>>     /**
>> + * memory_region_section_cb: callback for address_space_flat_for_each_section()
>> + *
>> + * @s: MemoryRegionSection of the range
> 
> Nitpicking, can we name this @mrs?

Sure thing - Steve

>> + * @opaque: data pointer passed to address_space_flat_for_each_section()
>> + * @errp: error message, returned to the address_space_flat_for_each_section
>> + *        caller.
>> + *
>> + * Returns: non-zero to stop the iteration, and 0 to continue.  The same
>> + * non-zero value is returned to the address_space_flat_for_each_section caller.
>> + */
>> +
>> +typedef int (*memory_region_section_cb)(MemoryRegionSection *s,
>> +                                        void *opaque,
>> +                                        Error **errp);


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 18/29] vfio-pci: refactor for cpr
  2022-03-03 23:21   ` Alex Williamson
@ 2022-03-07 14:42     ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-07 14:42 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Paolo Bonzini,
	Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrange,
	Alex Bennée, Markus Armbruster

On 3/3/2022 6:21 PM, Alex Williamson wrote:
> On Wed, 22 Dec 2021 11:05:23 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> +    if (vfio_notifier_init(vdev, &vdev->intx.unmask, "intx-unmask", 0)) {
> ...
>> +    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
> ...
>> +    vfio_notifier_cleanup(vdev, &vdev->intx.unmask, "intx-unmask", 0);
> ...
>> +    ret = vfio_notifier_init(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
> ...
>> +        vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
> ...
>> +    vfio_notifier_cleanup(vdev, &vdev->intx.interrupt, "intx-interrupt", 0);
> ...
>> +    const char *name = "kvm_interrupt";
> ...
>> +    if (vfio_notifier_init(vdev, &vector->kvm_interrupt, name, nr)) {
> ...
>> +        vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, name, nr);
> ...
>> +        vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, name, nr);
> ...
>> +    vfio_notifier_cleanup(vdev, &vector->kvm_interrupt, "kvm_interrupt", nr);
> ...
>> +    if (vfio_notifier_init(vdev, &vector->interrupt, "interrupt", nr)) {
> ...
>> +        if (vfio_notifier_init(vdev, &vector->interrupt, "interrupt", i)) {
> ...
>> +            vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
> ...
>> +            vfio_notifier_cleanup(vdev, &vector->interrupt, "interrupt", i);
> ...
>> +    if (vfio_notifier_init(vdev, &vdev->err_notifier, "err", 0)) {
> ...
>> +        vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
> ...
>> +    vfio_notifier_cleanup(vdev, &vdev->err_notifier, "err_notifier", 0);
> ...
>> +    if (vfio_notifier_init(vdev, &vdev->req_notifier, "req", 0)) {
> ...
>> +        vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
> ...
>> +    vfio_notifier_cleanup(vdev, &vdev->req_notifier, "req_notifier", 0);
> 
> Something seems to have gone astray with "err" and "req" vs
> "err_notifier" and "req_notifier".  The pattern is broken.  Thanks,
> 
> Alex

Super catch, thanks.  Will fix:
  "err" -> "err_notifier"
  "req" -> "req_notifier"

- Steve



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2021-12-22 19:05 ` [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
  2021-12-22 23:15   ` Michael S. Tsirkin
@ 2022-03-07 22:16   ` Alex Williamson
  2022-03-10 15:00     ` Steven Sistare
  1 sibling, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2022-03-07 22:16 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Paolo Bonzini,
	Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On Wed, 22 Dec 2021 11:05:24 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Enable vfio-pci devices to be saved and restored across an exec restart
> of qemu.
> 
> At vfio creation time, save the value of vfio container, group, and device
> descriptors in cpr state.
> 
> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
> at a different VA after exec.  DMA to already-mapped pages continues.  Save
> the msi message area as part of vfio-pci vmstate, save the interrupt and
> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
> vfio descriptors.  The flag is not cleared earlier because the descriptors
> should not persist across miscellaneous fork and exec calls that may be
> performed during normal operation.
> 
> On qemu restart, vfio_realize() finds the saved descriptors, uses
> the descriptors, and notes that the device is being reused.  Device and
> iommu state is already configured, so operations in vfio_realize that
> would modify the configuration are skipped for a reused device, including
> vfio ioctl's and writes to PCI configuration space.  The result is that
> vfio_realize constructs qemu data structures that reflect the current
> state of the device.  However, the reconstruction is not complete until
> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
> state.  It rebuilds vector data structures and attaches the interrupts to
> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
> which walks the flattened ranges of the vfio_address_spaces and calls
> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
> starts the VM and suppresses vfio pci device reset.
> 
> This functionality is delivered by 3 patches for clarity.  Part 1 handles
> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
> support.  Part 3 adds INTX support.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  MAINTAINERS                   |   1 +
>  hw/pci/pci.c                  |  10 ++++
>  hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
>  hw/vfio/meson.build           |   1 +
>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
>  hw/vfio/trace-events          |   1 +
>  include/hw/pci/pci.h          |   1 +
>  include/hw/vfio/vfio-common.h |   8 +++
>  include/migration/cpr.h       |   3 ++
>  migration/cpr.c               |  10 +++-
>  migration/target.c            |  14 +++++
>  12 files changed, 324 insertions(+), 11 deletions(-)
>  create mode 100644 hw/vfio/cpr.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cfe7480..feed239 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2992,6 +2992,7 @@ CPR
>  M: Steve Sistare <steven.sistare@oracle.com>
>  M: Mark Kanda <mark.kanda@oracle.com>
>  S: Maintained
> +F: hw/vfio/cpr.c
>  F: include/migration/cpr.h
>  F: migration/cpr.c
>  F: qapi/cpr.json
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 0fd21e1..e35df4f 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>  {
>      int r;
>  
> +    /*
> +     * A reused vfio-pci device is already configured, so do not reset it
> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
> +     * updated with new state in cpr-load with no ill effects.
> +     */
> +    if (dev->reused) {
> +        return;
> +    }
> +
>      pci_device_deassert_intx(dev);
>      assert(dev->irq_state == 0);
>  
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 5b87f95..90f66ad 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -31,6 +31,7 @@
>  #include "exec/memory.h"
>  #include "exec/ram_addr.h"
>  #include "hw/hw.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/range.h"
> @@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> +    assert(!container->reused);
> +
>      if (iotlb && container->dirty_pages_supported &&
>          vfio_devices_all_running_and_saving(container)) {
>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
> @@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>  {
>      struct vfio_iommu_type1_dma_map map = {
>          .argsz = sizeof(map),
> -        .flags = VFIO_DMA_MAP_FLAG_READ,
>          .vaddr = (__u64)(uintptr_t)vaddr,
>          .iova = iova,
>          .size = size,
>      };
>  
> +    /*
> +     * Set the new vaddr for any mappings registered during cpr-load.
> +     * Reused is cleared thereafter.
> +     */
> +    if (container->reused) {
> +        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
> +        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +            goto fail;
> +        }
> +        return 0;
> +    }
> +
> +    map.flags = VFIO_DMA_MAP_FLAG_READ;
>      if (!readonly) {
>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>      }
> @@ -516,7 +531,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          return 0;
>      }
>  
> -    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
> +fail:
> +    error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
> +        (container->reused ? "VADDR" : ""), iova, size, vaddr, strerror(errno));
>      return -errno;
>  }
>  
> @@ -865,6 +882,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>                                       MemoryRegionSection *section)
>  {
>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
> +    vfio_container_region_add(container, section);
> +}
> +
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section)
> +{
>      hwaddr iova, end;
>      Int128 llend, llsize;
>      void *vaddr;
> @@ -985,6 +1008,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          int iommu_idx;
>  
>          trace_vfio_listener_region_add_iommu(iova, end);
> +
>          /*
>           * FIXME: For VFIO iommu types which have KVM acceleration to
>           * avoid bouncing all map/unmaps through qemu this way, this
> @@ -1459,6 +1483,12 @@ static void vfio_listener_release(VFIOContainer *container)
>      }
>  }
>  
> +void vfio_listener_register(VFIOContainer *container)
> +{
> +    container->listener = vfio_memory_listener;
> +    memory_listener_register(&container->listener, container->space->as);
> +}
> +
>  static struct vfio_info_cap_header *
>  vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
>  {
> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>  {
>      int iommu_type, ret;
>  
> +    /*
> +     * If container is reused, just set its type and skip the ioctls, as the
> +     * container and group are already configured in the kernel.
> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> +     * If you ever add new types or spapr cpr support, kind reader, please
> +     * also implement VFIO_GET_IOMMU.
> +     */

VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
problem is that vfio_iommu_type1_check_extension() should actually base
some of the details on the instantiated vfio_iommu, ex.

	switch (arg) {
	case VFIO_TYPE1_IOMMU:
		return (iommu && iommu->v2) ? 0 : 1;
	case VFIO_UNMAP_ALL:
	case VFIO_UPDATE_VADDR:
	case VFIO_TYPE1v2_IOMMU:
		return (iommu && !iommu->v2) ? 0 : 1;
	case VFIO_TYPE1_NESTING_IOMMU:
		return (iommu && !iommu->nesting) ? 0 : 1;
	...

We can't support v1 if we've already set a v2 container and vice versa.
There are probably some corner cases and compatibility to puzzle
through, but I wouldn't think we need a new ioctl to check this.


> +    if (container->reused) {
> +        container->iommu_type = VFIO_TYPE1v2_IOMMU;
> +        return 0;
> +    }
> +
>      iommu_type = vfio_get_iommu_type(container, errp);
>      if (iommu_type < 0) {
>          return iommu_type;
> @@ -1982,9 +2024,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>  {
>      VFIOContainer *container;
>      int ret, fd;
> +    bool reused;
>      VFIOAddressSpace *space;
>  
>      space = vfio_get_address_space(as);
> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
> +    reused = (fd > 0);
>  
>      /*
>       * VFIO is currently incompatible with discarding of RAM insofar as the
> @@ -2017,8 +2062,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>       * details once we know which type of IOMMU we are using.
>       */
>  
> +    /*
> +     * If the container is reused, then the group is already attached in the
> +     * kernel.  If a container with matching fd is found, then update the
> +     * userland group list and return.  It not, then after the loop, create

s/It/If/

> +     * the container struct and group list.
> +     */
> +
>      QLIST_FOREACH(container, &space->containers, next) {
> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +        if ((reused && container->fd == fd) ||
> +            !ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {


We can have multiple containers, so this can still call the ioctl when
reused = true.  I think it still works, but it's a bit ugly, we're
relying on the ioctl failing when the container is already set for the
group.  Does this need to be something like:

        if (reused) {
            if (container->fd != fd) {
                continue;
            }
        } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
            continue;
        }

>              ret = vfio_ram_block_discard_disable(container, true);
>              if (ret) {
>                  error_setg_errno(errp, -ret,
> @@ -2032,12 +2085,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>              }
>              group->container = container;
>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> -            vfio_kvm_device_add_group(group);
> +            if (!reused) {
> +                vfio_kvm_device_add_group(group);
> +                cpr_save_fd("vfio_container_for_group", group->groupid,
> +                            container->fd);
> +            }
>              return 0;
>          }
>      }
>  
> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> +    if (!reused) {
> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
> +    }
> +
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>          ret = -errno;
> @@ -2055,6 +2115,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      container = g_malloc0(sizeof(*container));
>      container->space = space;
>      container->fd = fd;
> +    container->reused = reused;
>      container->error = NULL;
>      container->dirty_pages_supported = false;
>      container->dma_max_mappings = 0;
> @@ -2181,9 +2242,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      group->container = container;
>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>  
> -    container->listener = vfio_memory_listener;
> -
> -    memory_listener_register(&container->listener, container->space->as);
> +    /*
> +     * If reused, register the listener later, after all state that may
> +     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
> +     * the listener will invoke its callback on each flat section and call
> +     * vfio_dma_map to supply the new vaddr, and the calls will match the
> +     * mappings remembered by the kernel.
> +     */
> +    if (!reused) {
> +        vfio_listener_register(container);
> +    }
>  
>      if (container->error) {
>          ret = -1;
> @@ -2193,6 +2261,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>      }
>  
>      container->initialized = true;
> +    if (!reused) {
> +        cpr_save_fd("vfio_container_for_group", group->groupid, fd);
> +    }
>  
>      return 0;
>  listener_release_exit:
> @@ -2222,6 +2293,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>  
>      QLIST_REMOVE(group, container_next);
>      group->container = NULL;
> +    cpr_delete_fd("vfio_container_for_group", group->groupid);

Did you consider having cpr_save_fd() do a find_name() and update/no-op
if found so that we can casually call cpr_save_fd() without nesting it
in a branch the same as done for cpr_delete_fd()?

>  
>      /*
>       * Explicitly release the listener first before unset container,
> @@ -2270,6 +2342,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      VFIOGroup *group;
>      char path[32];
>      struct vfio_group_status status = { .argsz = sizeof(status) };
> +    bool reused;
>  
>      QLIST_FOREACH(group, &vfio_group_list, next) {
>          if (group->groupid == groupid) {
> @@ -2287,7 +2360,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>      group = g_malloc0(sizeof(*group));
>  
>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
> -    group->fd = qemu_open_old(path, O_RDWR);
> +
> +    group->fd = cpr_find_fd("vfio_group", groupid);
> +    reused = (group->fd >= 0);
> +    if (!reused) {
> +        group->fd = qemu_open_old(path, O_RDWR);
> +    }
> +
>      if (group->fd < 0) {
>          error_setg_errno(errp, errno, "failed to open %s", path);
>          goto free_group_exit;
> @@ -2321,6 +2400,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>  
>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>  
> +    if (!reused) {
> +        cpr_save_fd("vfio_group", groupid, group->fd);
> +    }

If cpr_save_fd() were idempotent as above, we wouldn't need the reused
variable here and the previous chunk could be simplified.  It might
even suggest a function like "cpr_find_or_open_fd()".

> +
>      return group;
>  
>  close_fd_exit:
> @@ -2345,6 +2428,7 @@ void vfio_put_group(VFIOGroup *group)
>      vfio_disconnect_container(group);
>      QLIST_REMOVE(group, next);
>      trace_vfio_put_group(group->fd);
> +    cpr_delete_fd("vfio_group", group->groupid);
>      close(group->fd);
>      g_free(group);
>  
> @@ -2358,8 +2442,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>  {
>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>      int ret, fd;
> +    bool reused;
> +
> +    fd = cpr_find_fd(name, 0);
> +    reused = (fd >= 0);
> +    if (!reused) {
> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    }
>  
> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>      if (fd < 0) {
>          error_setg_errno(errp, errno, "error getting device from group %d",
>                           group->groupid);
> @@ -2404,6 +2494,10 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>      vbasedev->num_irqs = dev_info.num_irqs;
>      vbasedev->num_regions = dev_info.num_regions;
>      vbasedev->flags = dev_info.flags;
> +    vbasedev->reused = reused;
> +    if (!reused) {
> +        cpr_save_fd(name, 0, fd);
> +    }

Another cleanup here if we didn't need to tiptoe around cpr_save_fd().

>  
>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>                            dev_info.num_irqs);
> @@ -2420,6 +2514,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>      QLIST_REMOVE(vbasedev, next);
>      vbasedev->group = NULL;
>      trace_vfio_put_base_device(vbasedev->fd);
> +    cpr_delete_fd(vbasedev->name, 0);
>      close(vbasedev->fd);
>  }
>  
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> new file mode 100644
> index 0000000..2c39cd5
> --- /dev/null
> +++ b/hw/vfio/cpr.c
> @@ -0,0 +1,94 @@
> +/*
> + * Copyright (c) 2021 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include <sys/ioctl.h>
> +#include <linux/vfio.h>
> +#include "hw/vfio/vfio-common.h"
> +#include "sysemu/kvm.h"
> +#include "qapi/error.h"
> +#include "trace.h"
> +
> +static int
> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
> +{
> +    struct vfio_iommu_type1_dma_unmap unmap = {
> +        .argsz = sizeof(unmap),
> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
> +        .iova = 0,
> +        .size = 0,
> +    };
> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
> +        return -errno;
> +    }
> +    return 0;
> +}
> +
> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> +{
> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
> +                         "or VFIO_UNMAP_ALL");
> +        return false;
> +    } else {
> +        return true;
> +    }
> +}

We could have minimally used this where we assumed a TYPE1v2 container.

> +
> +/*
> + * Verify that all containers support CPR, and unmap all dma vaddr's.
> + */
> +int vfio_cpr_save(Error **errp)
> +{
> +    ERRP_GUARD();
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_is_cpr_capable(container, errp)) {
> +                return -1;
> +            }
> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
> +                return -1;
> +            }
> +        }
> +    }

Seems like we ought to validate all containers support CPR before we
start blasting vaddrs.  It looks like qmp_cpr_exec() simply returns if
this fails with no attempt to unwind!  Yikes!  Wouldn't we need to
replay the listeners to remap the vaddrs in case of an error?

> +
> +    return 0;
> +}
> +
> +/*
> + * Register the listener for each container, which causes its callback to be
> + * invoked for every flat section.  The callback will see that the container
> + * is reused, and call vfo_dma_map with the new vaddr.
> + */
> +int vfio_cpr_load(Error **errp)
> +{
> +    VFIOAddressSpace *space;
> +    VFIOContainer *container;
> +    VFIOGroup *group;
> +    VFIODevice *vbasedev;
> +
> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> +        QLIST_FOREACH(container, &space->containers, next) {
> +            if (!vfio_is_cpr_capable(container, errp)) {
> +                return -1;
> +            }
> +            vfio_listener_register(container);
> +            container->reused = false;
> +        }
> +    }
> +    QLIST_FOREACH(group, &vfio_group_list, next) {
> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
> +            vbasedev->reused = false;
> +        }
> +    }
> +    return 0;
> +}
> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
> index da9af29..e247b2b 100644
> --- a/hw/vfio/meson.build
> +++ b/hw/vfio/meson.build
> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>    'migration.c',
>  ))
>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
> +  'cpr.c',
>    'display.c',
>    'pci-quirks.c',
>    'pci.c',
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a90cce2..acac8a7 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -30,6 +30,7 @@
>  #include "hw/qdev-properties-system.h"
>  #include "migration/vmstate.h"
>  #include "qapi/qmp/qdict.h"
> +#include "migration/cpr.h"
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qemu/module.h"
> @@ -2926,6 +2927,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          vfio_put_group(group);
>          goto error;
>      }
> +    pdev->reused = vdev->vbasedev.reused;
>  
>      vfio_populate_device(vdev, &err);
>      if (err) {
> @@ -3195,6 +3197,11 @@ static void vfio_pci_reset(DeviceState *dev)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>  
> +    /* Do not reset the device during qemu_system_reset prior to cpr-load */
> +    if (vdev->pdev.reused) {
> +        return;
> +    }
> +
>      trace_vfio_pci_reset(vdev->vbasedev.name);
>  
>      vfio_pci_pre_reset(vdev);
> @@ -3302,6 +3309,75 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +static void vfio_merge_config(VFIOPCIDevice *vdev)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    g_autofree uint8_t *phys_config = g_malloc(size);
> +    uint32_t mask;
> +    int ret, i;
> +
> +    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
> +    if (ret < size) {
> +        ret = ret < 0 ? errno : EFAULT;
> +        error_report("failed to read device config space: %s", strerror(ret));
> +        return;
> +    }
> +
> +    for (i = 0; i < size; i++) {
> +        mask = vdev->emulated_config_bits[i];
> +        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
> +    }
> +}

IIUC, we get a copy of config space from the vfio device and for each
byte, we keep what we have in emulated config space for emulated bits
and fill in from the device for non-emulated bits.  Meanwhile,
vfio_pci_read_config() doesn't ever return non-emulated bits from
emulated config space, so what specifically are we accomplishing here?

> +
> +/*
> + * The kernel may change non-emulated config bits.  Exclude them from the
> + * changed-bits check in get_pci_config_device.
> + */
> +static int vfio_pci_pre_load(void *opaque)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
> +    int i;
> +
> +    for (i = 0; i < size; i++) {
> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
> +    }
> +
> +    return 0;
> +}

The previous function seemed like maybe an attempt to make non-emulated
bits in emulated config space consistent for testing, but here we're
masking all non-emulated bits out of that mask.  Why do we need to do
both?

> +
> +static int vfio_pci_post_load(void *opaque, int version_id)
> +{
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    vfio_merge_config(vdev);
> +
> +    pdev->reused = false;
> +
> +    return 0;
> +}
> +
> +static bool vfio_pci_needed(void *opaque)
> +{
> +    return cpr_get_mode() == CPR_MODE_RESTART;
> +}
> +
> +static const VMStateDescription vfio_pci_vmstate = {
> +    .name = "vfio-pci",
> +    .unmigratable = 1,
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .pre_load = vfio_pci_pre_load,
> +    .post_load = vfio_pci_post_load,
> +    .needed = vfio_pci_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -3309,6 +3385,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  
>      dc->reset = vfio_pci_reset;
>      device_class_set_props(dc, vfio_pci_dev_properties);
> +    dc->vmsd = &vfio_pci_vmstate;
>      dc->desc = "VFIO-based PCI device assignment";
>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>      pdc->realize = vfio_realize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0ef1b5f..63dd0fe 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  vfio_dma_unmap_overflow_workaround(void) ""
> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>  
>  # platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index cc63dd4..8557e82 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -361,6 +361,7 @@ struct PCIDevice {
>      /* ID of standby device in net_failover pair */
>      char *failover_pair_id;
>      uint32_t acpi_index;
> +    bool reused;
>  };
>  
>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1641753..bc23c29 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -85,6 +85,7 @@ typedef struct VFIOContainer {
>      Error *error;
>      bool initialized;
>      bool dirty_pages_supported;
> +    bool reused;
>      uint64_t dirty_pgsizes;
>      uint64_t max_dirty_bitmap_size;
>      unsigned long pgsizes;
> @@ -136,6 +137,7 @@ typedef struct VFIODevice {
>      bool no_mmap;
>      bool ram_block_discard_allowed;
>      bool enable_migration;
> +    bool reused;
>      VFIODeviceOps *ops;
>      unsigned int num_irqs;
>      unsigned int num_regions;
> @@ -212,6 +214,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>  void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
>                      VFIODevice *vbasedev, Error **errp);
> +int vfio_cpr_save(Error **errp);
> +int vfio_cpr_load(Error **errp);
> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp);
>  
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
> @@ -236,6 +241,9 @@ struct vfio_info_cap_header *
>  vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>  #endif
>  extern const MemoryListener vfio_prereg_listener;
> +void vfio_listener_register(VFIOContainer *container);
> +void vfio_container_region_add(VFIOContainer *container,
> +                               MemoryRegionSection *section);
>  
>  int vfio_spapr_create_window(VFIOContainer *container,
>                               MemoryRegionSection *section,
> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> index a4da24e..a4007cf 100644
> --- a/include/migration/cpr.h
> +++ b/include/migration/cpr.h
> @@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
>  int cpr_state_load(Error **errp);
>  void cpr_state_print(void);
>  
> +int cpr_vfio_save(Error **errp);
> +int cpr_vfio_load(Error **errp);
> +
>  #endif
> diff --git a/migration/cpr.c b/migration/cpr.c
> index 37eca66..cee82cf 100644
> --- a/migration/cpr.c
> +++ b/migration/cpr.c
> @@ -7,6 +7,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "exec/memory.h"
> +#include "hw/vfio/vfio-common.h"
>  #include "io/channel-buffer.h"
>  #include "io/channel-file.h"
>  #include "migration.h"
> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>          return;
>      }
> -
> +    if (cpr_vfio_save(errp)) {
> +        return;
> +    }

Why is vfio so unique that it needs separate handlers versus other
devices?  Thanks,

Alex


>      cpr_walk_fd(preserve_fd, 0);
>      if (cpr_state_save(errp)) {
>          return;
> @@ -139,6 +142,11 @@ void qmp_cpr_load(const char *filename, Error **errp)
>          goto out;
>      }
>  
> +    if (cpr_get_mode() == CPR_MODE_RESTART &&
> +        cpr_vfio_load(errp)) {
> +        goto out;
> +    }
> +
>      state = global_state_get_runstate();
>      if (state == RUN_STATE_RUNNING) {
>          vm_start();
> diff --git a/migration/target.c b/migration/target.c
> index 4390bf0..984bc9e 100644
> --- a/migration/target.c
> +++ b/migration/target.c
> @@ -8,6 +8,7 @@
>  #include "qemu/osdep.h"
>  #include "qapi/qapi-types-migration.h"
>  #include "migration.h"
> +#include "migration/cpr.h"
>  #include CONFIG_DEVICES
>  
>  #ifdef CONFIG_VFIO
> @@ -22,8 +23,21 @@ void populate_vfio_info(MigrationInfo *info)
>          info->vfio->transferred = vfio_mig_bytes_transferred();
>      }
>  }
> +
> +int cpr_vfio_save(Error **errp)
> +{
> +    return vfio_cpr_save(errp);
> +}
> +
> +int cpr_vfio_load(Error **errp)
> +{
> +    return vfio_cpr_load(errp);
> +}
> +
>  #else
>  
>  void populate_vfio_info(MigrationInfo *info) {}
> +int cpr_vfio_save(Error **errp) { return 0; }
> +int cpr_vfio_load(Error **errp) { return 0; }
>  
>  #endif /* CONFIG_VFIO */



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-07 14:41       ` Steven Sistare
@ 2022-03-08  6:50         ` Michael S. Tsirkin
  2022-03-08  7:20           ` Igor Mammedov
  2022-03-11 10:08         ` Daniel P. Berrangé
  1 sibling, 1 reply; 96+ messages in thread
From: Michael S. Tsirkin @ 2022-03-08  6:50 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Jason Zeng, Markus Armbruster, Juan Quintela, Eric Blake,
	Philippe Mathieu-Daudé,
	Daniel P. Berrange, qemu-devel, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Igor Mammedov, Alex Bennée, Dr. David Alan Gilbert

On Mon, Mar 07, 2022 at 09:41:44AM -0500, Steven Sistare wrote:
> On 3/4/2022 5:41 AM, Igor Mammedov wrote:
> > On Thu, 3 Mar 2022 12:21:15 -0500
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> >> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:
> >>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
> >>> option is set.
> >>>
> >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>> ---
> >>>  hw/core/machine.c   | 19 +++++++++++++++++++
> >>>  include/hw/boards.h |  1 +
> >>>  qemu-options.hx     |  6 ++++++
> >>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
> >>>  softmmu/vl.c        |  1 +
> >>>  trace-events        |  1 +
> >>>  util/qemu-config.c  |  4 ++++
> >>>  7 files changed, 70 insertions(+), 9 deletions(-)
> >>>
> >>> diff --git a/hw/core/machine.c b/hw/core/machine.c
> >>> index 53a99ab..7739d88 100644
> >>> --- a/hw/core/machine.c
> >>> +++ b/hw/core/machine.c
> >>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
> >>>      ms->mem_merge = value;
> >>>  }
> >>>  
> >>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
> >>> +{
> >>> +    MachineState *ms = MACHINE(obj);
> >>> +
> >>> +    return ms->memfd_alloc;
> >>> +}
> >>> +
> >>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
> >>> +{
> >>> +    MachineState *ms = MACHINE(obj);
> >>> +
> >>> +    ms->memfd_alloc = value;
> >>> +}
> >>> +
> >>>  static bool machine_get_usb(Object *obj, Error **errp)
> >>>  {
> >>>      MachineState *ms = MACHINE(obj);
> >>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
> >>>      object_class_property_set_description(oc, "mem-merge",
> >>>          "Enable/disable memory merge support");
> >>>  
> >>> +    object_class_property_add_bool(oc, "memfd-alloc",
> >>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
> >>> +    object_class_property_set_description(oc, "memfd-alloc",
> >>> +        "Enable/disable allocating anonymous memory using memfd_create");
> >>> +
> >>>      object_class_property_add_bool(oc, "usb",
> >>>          machine_get_usb, machine_set_usb);
> >>>      object_class_property_set_description(oc, "usb",
> >>> diff --git a/include/hw/boards.h b/include/hw/boards.h
> >>> index 9c1c190..a57d7a0 100644
> >>> --- a/include/hw/boards.h
> >>> +++ b/include/hw/boards.h
> >>> @@ -327,6 +327,7 @@ struct MachineState {
> >>>      char *dt_compatible;
> >>>      bool dump_guest_core;
> >>>      bool mem_merge;
> >>> +    bool memfd_alloc;
> >>>      bool usb;
> >>>      bool usb_disabled;
> >>>      char *firmware;
> >>> diff --git a/qemu-options.hx b/qemu-options.hx
> >>> index 7d47510..33c8173 100644
> >>> --- a/qemu-options.hx
> >>> +++ b/qemu-options.hx
> >>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
> >>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
> >>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
> >>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
> >>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"  
> >>
> >> Question: are there any disadvantages associated with using
> >> memfd_create? I guess we are using up an fd, but that seems minor.  Any
> >> reason not to set to on by default? maybe with a fallback option to
> >> disable that?
> 
> Old Linux host kernels, circa 4.1, do not support huge pages for shared memory.
> Also, the tunable to enable huge pages for share memory is different than for
> anon memory, so there could be performance loss if it is not set correctly.
>     /sys/kernel/mm/transparent_hugepage/enabled
>     vs
>     /sys/kernel/mm/transparent_hugepage/shmem_enabled

I guess we can test this when launching the VM, and select
a good default.

> It might make sense to use memfd_create by default for the secondary segments.

Well there's also KSM now you mention it.

> >> I am concerned that it's actually a kind of memory backend, this flag
> >> seems to instead be closer to the deprecated mem-prealloc. E.g.
> >> it does not work with a mem path, does it?
> 
> One can still define a memory backend with mempath to create the main ram segment,
> though it must be some form of shared to work with live update.  Indeed, I would 
> expect most users to specify an explicit memory backend for it.  The secondary
> segments would still use memfd_create.
> 
> > (mem path and mem-prealloc are transparently aliased to used memory backend
> > if I recall it right.)
> > 
> > Steve,
> > 
> > For allocating guest RAM, we switched exclusively to using memory-backends
> > including initial guest RAM (-m size option) and we have hostmem-memfd
> > that uses memfd_create() and I'd rather avoid adding random knobs to machine
> > for tweaking how RAM should be allocated, we have memory backends for this,
> > so this patch begs the question: why hostmem-memfd is not sufficient?
> > (patch description is rather lacking on rationale behind the patch)
> 
> There is currently no way to specify memory backends for the secondary memory
> segments (vram, roms, etc), and IMO it would be onerous to specify a backend for
> each of them.  On x86_64, these include pc.bios, vga.vram, pc.rom, vga.rom,
> /rom@etc/acpi/tables, /rom@etc/table-loader, /rom@etc/acpi/rsdp.
> 
> - Steve
> 
> >>>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
> >>>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
> >>>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
> >>> @@ -76,6 +77,11 @@ SRST
> >>>          supported by the host, de-duplicates identical memory pages
> >>>          among VMs instances (enabled by default).
> >>>  
> >>> +    ``memfd-alloc=on|off``
> >>> +        Enables or disables allocation of anonymous guest RAM using
> >>> +        memfd_create.  Any associated memory-backend objects are created with
> >>> +        share=on.  The memfd-alloc default is off.
> >>> +
> >>>      ``aes-key-wrap=on|off``
> >>>          Enables or disables AES key wrapping support on s390-ccw hosts.
> >>>          This feature controls whether AES wrapping keys will be created
> >>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> >>> index 3524c04..95e2b49 100644
> >>> --- a/softmmu/physmem.c
> >>> +++ b/softmmu/physmem.c
> >>> @@ -41,6 +41,7 @@
> >>>  #include "qemu/config-file.h"
> >>>  #include "qemu/error-report.h"
> >>>  #include "qemu/qemu-print.h"
> >>> +#include "qemu/memfd.h"
> >>>  #include "exec/memory.h"
> >>>  #include "exec/ioport.h"
> >>>  #include "sysemu/dma.h"
> >>> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> >>>      const bool shared = qemu_ram_is_shared(new_block);
> >>>      RAMBlock *block;
> >>>      RAMBlock *last_block = NULL;
> >>> +    struct MemoryRegion *mr = new_block->mr;
> >>>      ram_addr_t old_ram_size, new_ram_size;
> >>>      Error *err = NULL;
> >>> +    const char *name;
> >>> +    void *addr = 0;
> >>> +    size_t maxlen;
> >>> +    MachineState *ms = MACHINE(qdev_get_machine());
> >>>  
> >>>      old_ram_size = last_ram_page();
> >>>  
> >>>      qemu_mutex_lock_ramlist();
> >>> -    new_block->offset = find_ram_offset(new_block->max_length);
> >>> +    maxlen = new_block->max_length;
> >>> +    new_block->offset = find_ram_offset(maxlen);
> >>>  
> >>>      if (!new_block->host) {
> >>>          if (xen_enabled()) {
> >>> -            xen_ram_alloc(new_block->offset, new_block->max_length,
> >>> -                          new_block->mr, &err);
> >>> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
> >>>              if (err) {
> >>>                  error_propagate(errp, err);
> >>>                  qemu_mutex_unlock_ramlist();
> >>>                  return;
> >>>              }
> >>>          } else {
> >>> -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
> >>> -                                                  &new_block->mr->align,
> >>> -                                                  shared, noreserve);
> >>> -            if (!new_block->host) {
> >>> +            name = memory_region_name(mr);
> >>> +            if (ms->memfd_alloc) {
> >>> +                Object *parent = &mr->parent_obj;
> >>> +                int mfd = -1;          /* placeholder until next patch */
> >>> +                mr->align = QEMU_VMALLOC_ALIGN;
> >>> +                if (mfd < 0) {
> >>> +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
> >>> +                                            0, 0, 0, &err);
> >>> +                    if (mfd < 0) {
> >>> +                        return;
> >>> +                    }
> >>> +                }
> >>> +                qemu_set_cloexec(mfd);
> >>> +                /* The memory backend already set its desired flags. */
> >>> +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
> >>> +                    new_block->flags |= RAM_SHARED;
> >>> +                }
> >>> +                addr = file_ram_alloc(new_block, maxlen, mfd,
> >>> +                                      false, false, 0, errp);
> >>> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
> >>> +            } else {
> >>> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
> >>> +                                           shared, noreserve);
> >>> +            }
> >>> +
> >>> +            if (!addr) {
> >>>                  error_setg_errno(errp, errno,
> >>>                                   "cannot set up guest memory '%s'",
> >>> -                                 memory_region_name(new_block->mr));
> >>> +                                 name);
> >>>                  qemu_mutex_unlock_ramlist();
> >>>                  return;
> >>>              }
> >>> -            memory_try_enable_merging(new_block->host, new_block->max_length);
> >>> +            memory_try_enable_merging(addr, maxlen);
> >>> +            new_block->host = addr;
> >>>          }
> >>>      }
> >>>  
> >>> diff --git a/softmmu/vl.c b/softmmu/vl.c
> >>> index 620a1f1..ab3648a 100644
> >>> --- a/softmmu/vl.c
> >>> +++ b/softmmu/vl.c
> >>> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
> >>>          object_property_set_str(obj, "mem-path", path, &error_fatal);
> >>>      }
> >>>      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
> >>> +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
> >>>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
> >>>                                obj);
> >>>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
> >>> diff --git a/trace-events b/trace-events
> >>> index a637a61..770a9ac 100644
> >>> --- a/trace-events
> >>> +++ b/trace-events
> >>> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
> >>>  # accel/tcg/cputlb.c
> >>>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
> >>>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
> >>> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
> >>>  
> >>>  # gdbstub.c
> >>>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
> >>> diff --git a/util/qemu-config.c b/util/qemu-config.c
> >>> index 436ab63..3606e5c 100644
> >>> --- a/util/qemu-config.c
> >>> +++ b/util/qemu-config.c
> >>> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
> >>>              .type = QEMU_OPT_BOOL,
> >>>              .help = "enable/disable memory merge support",
> >>>          },{
> >>> +            .name = "memfd-alloc",
> >>> +            .type = QEMU_OPT_BOOL,
> >>> +            .help = "enable/disable memfd_create for anonymous memory",
> >>> +        },{
> >>>              .name = "usb",
> >>>              .type = QEMU_OPT_BOOL,
> >>>              .help = "Set on/off to enable/disable usb",
> >>> -- 
> >>> 1.8.3.1  
> >>
> >>
> > 



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-08  6:50         ` Michael S. Tsirkin
@ 2022-03-08  7:20           ` Igor Mammedov
  2022-03-10 15:36             ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Igor Mammedov @ 2022-03-08  7:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Zeng, Juan Quintela, Eric Blake,
	Philippe Mathieu-Daudé,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Steven Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Daniel P. Berrange, Alex Bennée,
	Markus Armbruster

On Tue, 8 Mar 2022 01:50:11 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Mar 07, 2022 at 09:41:44AM -0500, Steven Sistare wrote:
> > On 3/4/2022 5:41 AM, Igor Mammedov wrote:  
> > > On Thu, 3 Mar 2022 12:21:15 -0500
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > >> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:  
> > >>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
> > >>> option is set.
> > >>>
> > >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > >>> ---
> > >>>  hw/core/machine.c   | 19 +++++++++++++++++++
> > >>>  include/hw/boards.h |  1 +
> > >>>  qemu-options.hx     |  6 ++++++
> > >>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
> > >>>  softmmu/vl.c        |  1 +
> > >>>  trace-events        |  1 +
> > >>>  util/qemu-config.c  |  4 ++++
> > >>>  7 files changed, 70 insertions(+), 9 deletions(-)
> > >>>
> > >>> diff --git a/hw/core/machine.c b/hw/core/machine.c
> > >>> index 53a99ab..7739d88 100644
> > >>> --- a/hw/core/machine.c
> > >>> +++ b/hw/core/machine.c
> > >>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
> > >>>      ms->mem_merge = value;
> > >>>  }
> > >>>  
> > >>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
> > >>> +{
> > >>> +    MachineState *ms = MACHINE(obj);
> > >>> +
> > >>> +    return ms->memfd_alloc;
> > >>> +}
> > >>> +
> > >>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
> > >>> +{
> > >>> +    MachineState *ms = MACHINE(obj);
> > >>> +
> > >>> +    ms->memfd_alloc = value;
> > >>> +}
> > >>> +
> > >>>  static bool machine_get_usb(Object *obj, Error **errp)
> > >>>  {
> > >>>      MachineState *ms = MACHINE(obj);
> > >>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
> > >>>      object_class_property_set_description(oc, "mem-merge",
> > >>>          "Enable/disable memory merge support");
> > >>>  
> > >>> +    object_class_property_add_bool(oc, "memfd-alloc",
> > >>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
> > >>> +    object_class_property_set_description(oc, "memfd-alloc",
> > >>> +        "Enable/disable allocating anonymous memory using memfd_create");
> > >>> +
> > >>>      object_class_property_add_bool(oc, "usb",
> > >>>          machine_get_usb, machine_set_usb);
> > >>>      object_class_property_set_description(oc, "usb",
> > >>> diff --git a/include/hw/boards.h b/include/hw/boards.h
> > >>> index 9c1c190..a57d7a0 100644
> > >>> --- a/include/hw/boards.h
> > >>> +++ b/include/hw/boards.h
> > >>> @@ -327,6 +327,7 @@ struct MachineState {
> > >>>      char *dt_compatible;
> > >>>      bool dump_guest_core;
> > >>>      bool mem_merge;
> > >>> +    bool memfd_alloc;
> > >>>      bool usb;
> > >>>      bool usb_disabled;
> > >>>      char *firmware;
> > >>> diff --git a/qemu-options.hx b/qemu-options.hx
> > >>> index 7d47510..33c8173 100644
> > >>> --- a/qemu-options.hx
> > >>> +++ b/qemu-options.hx
> > >>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
> > >>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
> > >>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
> > >>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
> > >>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"    
> > >>
> > >> Question: are there any disadvantages associated with using
> > >> memfd_create? I guess we are using up an fd, but that seems minor.  Any
> > >> reason not to set to on by default? maybe with a fallback option to
> > >> disable that?  
> > 
> > Old Linux host kernels, circa 4.1, do not support huge pages for shared memory.
> > Also, the tunable to enable huge pages for share memory is different than for
> > anon memory, so there could be performance loss if it is not set correctly.
> >     /sys/kernel/mm/transparent_hugepage/enabled
> >     vs
> >     /sys/kernel/mm/transparent_hugepage/shmem_enabled  
> 
> I guess we can test this when launching the VM, and select
> a good default.
> 
> > It might make sense to use memfd_create by default for the secondary segments.  
> 
> Well there's also KSM now you mention it.

then another quest, is there downside to always using memfd_create
without any knobs being involved?

> 
> > >> I am concerned that it's actually a kind of memory backend, this flag
> > >> seems to instead be closer to the deprecated mem-prealloc. E.g.
> > >> it does not work with a mem path, does it?  
> > 
> > One can still define a memory backend with mempath to create the main ram segment,
> > though it must be some form of shared to work with live update.  Indeed, I would 
> > expect most users to specify an explicit memory backend for it.  The secondary
> > segments would still use memfd_create.
> >   
> > > (mem path and mem-prealloc are transparently aliased to used memory backend
> > > if I recall it right.)
> > > 
> > > Steve,
> > > 
> > > For allocating guest RAM, we switched exclusively to using memory-backends
> > > including initial guest RAM (-m size option) and we have hostmem-memfd
> > > that uses memfd_create() and I'd rather avoid adding random knobs to machine
> > > for tweaking how RAM should be allocated, we have memory backends for this,
> > > so this patch begs the question: why hostmem-memfd is not sufficient?
> > > (patch description is rather lacking on rationale behind the patch)  
> > 
> > There is currently no way to specify memory backends for the secondary memory
> > segments (vram, roms, etc), and IMO it would be onerous to specify a backend for
> > each of them.  On x86_64, these include pc.bios, vga.vram, pc.rom, vga.rom,
> > /rom@etc/acpi/tables, /rom@etc/table-loader, /rom@etc/acpi/rsdp.
> > 
> > - Steve
> >   
> > >>>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
> > >>>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
> > >>>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
> > >>> @@ -76,6 +77,11 @@ SRST
> > >>>          supported by the host, de-duplicates identical memory pages
> > >>>          among VMs instances (enabled by default).
> > >>>  
> > >>> +    ``memfd-alloc=on|off``
> > >>> +        Enables or disables allocation of anonymous guest RAM using
> > >>> +        memfd_create.  Any associated memory-backend objects are created with
> > >>> +        share=on.  The memfd-alloc default is off.
> > >>> +
> > >>>      ``aes-key-wrap=on|off``
> > >>>          Enables or disables AES key wrapping support on s390-ccw hosts.
> > >>>          This feature controls whether AES wrapping keys will be created
> > >>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> > >>> index 3524c04..95e2b49 100644
> > >>> --- a/softmmu/physmem.c
> > >>> +++ b/softmmu/physmem.c
> > >>> @@ -41,6 +41,7 @@
> > >>>  #include "qemu/config-file.h"
> > >>>  #include "qemu/error-report.h"
> > >>>  #include "qemu/qemu-print.h"
> > >>> +#include "qemu/memfd.h"
> > >>>  #include "exec/memory.h"
> > >>>  #include "exec/ioport.h"
> > >>>  #include "sysemu/dma.h"
> > >>> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> > >>>      const bool shared = qemu_ram_is_shared(new_block);
> > >>>      RAMBlock *block;
> > >>>      RAMBlock *last_block = NULL;
> > >>> +    struct MemoryRegion *mr = new_block->mr;
> > >>>      ram_addr_t old_ram_size, new_ram_size;
> > >>>      Error *err = NULL;
> > >>> +    const char *name;
> > >>> +    void *addr = 0;
> > >>> +    size_t maxlen;
> > >>> +    MachineState *ms = MACHINE(qdev_get_machine());
> > >>>  
> > >>>      old_ram_size = last_ram_page();
> > >>>  
> > >>>      qemu_mutex_lock_ramlist();
> > >>> -    new_block->offset = find_ram_offset(new_block->max_length);
> > >>> +    maxlen = new_block->max_length;
> > >>> +    new_block->offset = find_ram_offset(maxlen);
> > >>>  
> > >>>      if (!new_block->host) {
> > >>>          if (xen_enabled()) {
> > >>> -            xen_ram_alloc(new_block->offset, new_block->max_length,
> > >>> -                          new_block->mr, &err);
> > >>> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
> > >>>              if (err) {
> > >>>                  error_propagate(errp, err);
> > >>>                  qemu_mutex_unlock_ramlist();
> > >>>                  return;
> > >>>              }
> > >>>          } else {
> > >>> -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
> > >>> -                                                  &new_block->mr->align,
> > >>> -                                                  shared, noreserve);
> > >>> -            if (!new_block->host) {
> > >>> +            name = memory_region_name(mr);
> > >>> +            if (ms->memfd_alloc) {
> > >>> +                Object *parent = &mr->parent_obj;
> > >>> +                int mfd = -1;          /* placeholder until next patch */
> > >>> +                mr->align = QEMU_VMALLOC_ALIGN;
> > >>> +                if (mfd < 0) {
> > >>> +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
> > >>> +                                            0, 0, 0, &err);
> > >>> +                    if (mfd < 0) {
> > >>> +                        return;
> > >>> +                    }
> > >>> +                }
> > >>> +                qemu_set_cloexec(mfd);
> > >>> +                /* The memory backend already set its desired flags. */
> > >>> +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
> > >>> +                    new_block->flags |= RAM_SHARED;
> > >>> +                }
> > >>> +                addr = file_ram_alloc(new_block, maxlen, mfd,
> > >>> +                                      false, false, 0, errp);
> > >>> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
> > >>> +            } else {
> > >>> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
> > >>> +                                           shared, noreserve);
> > >>> +            }
> > >>> +
> > >>> +            if (!addr) {
> > >>>                  error_setg_errno(errp, errno,
> > >>>                                   "cannot set up guest memory '%s'",
> > >>> -                                 memory_region_name(new_block->mr));
> > >>> +                                 name);
> > >>>                  qemu_mutex_unlock_ramlist();
> > >>>                  return;
> > >>>              }
> > >>> -            memory_try_enable_merging(new_block->host, new_block->max_length);
> > >>> +            memory_try_enable_merging(addr, maxlen);
> > >>> +            new_block->host = addr;
> > >>>          }
> > >>>      }
> > >>>  
> > >>> diff --git a/softmmu/vl.c b/softmmu/vl.c
> > >>> index 620a1f1..ab3648a 100644
> > >>> --- a/softmmu/vl.c
> > >>> +++ b/softmmu/vl.c
> > >>> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
> > >>>          object_property_set_str(obj, "mem-path", path, &error_fatal);
> > >>>      }
> > >>>      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
> > >>> +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
> > >>>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
> > >>>                                obj);
> > >>>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
> > >>> diff --git a/trace-events b/trace-events
> > >>> index a637a61..770a9ac 100644
> > >>> --- a/trace-events
> > >>> +++ b/trace-events
> > >>> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
> > >>>  # accel/tcg/cputlb.c
> > >>>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
> > >>>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
> > >>> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
> > >>>  
> > >>>  # gdbstub.c
> > >>>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
> > >>> diff --git a/util/qemu-config.c b/util/qemu-config.c
> > >>> index 436ab63..3606e5c 100644
> > >>> --- a/util/qemu-config.c
> > >>> +++ b/util/qemu-config.c
> > >>> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
> > >>>              .type = QEMU_OPT_BOOL,
> > >>>              .help = "enable/disable memory merge support",
> > >>>          },{
> > >>> +            .name = "memfd-alloc",
> > >>> +            .type = QEMU_OPT_BOOL,
> > >>> +            .help = "enable/disable memfd_create for anonymous memory",
> > >>> +        },{
> > >>>              .name = "usb",
> > >>>              .type = QEMU_OPT_BOOL,
> > >>>              .help = "Set on/off to enable/disable usb",
> > >>> -- 
> > >>> 1.8.3.1    
> > >>
> > >>  
> > >   
> 



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 11/29] qapi: list utility functions
  2021-12-22 19:05 ` [PATCH V7 11/29] qapi: list utility functions Steve Sistare
@ 2022-03-09 14:11   ` Marc-André Lureau
  2022-03-11 16:45     ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Marc-André Lureau @ 2022-03-09 14:11 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin, QEMU,
	Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Paolo Bonzini, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

[-- Attachment #1: Type: text/plain, Size: 5586 bytes --]

Hi

On Wed, Dec 22, 2021 at 11:42 PM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Generalize strList_from_comma_list() to take any delimiter character,
> rename
> as strList_from_string(), and move it to qapi/util.c.  Also add
> strv_from_strList() and QAPI_LIST_LENGTH().
>

Looks like you could easily split, and add some tests.


>
> No functional change.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/qapi/util.h | 28 ++++++++++++++++++++++++++++
>  monitor/hmp-cmds.c  | 29 ++---------------------------
>  qapi/qapi-util.c    | 37 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 67 insertions(+), 27 deletions(-)
>
> diff --git a/include/qapi/util.h b/include/qapi/util.h
> index 81a2b13..c249108 100644
> --- a/include/qapi/util.h
> +++ b/include/qapi/util.h
> @@ -22,6 +22,8 @@ typedef struct QEnumLookup {
>      const int size;
>  } QEnumLookup;
>
> +struct strList;
> +
>  const char *qapi_enum_lookup(const QEnumLookup *lookup, int val);
>  int qapi_enum_parse(const QEnumLookup *lookup, const char *buf,
>                      int def, Error **errp);
> @@ -31,6 +33,19 @@ bool qapi_bool_parse(const char *name, const char
> *value, bool *obj,
>  int parse_qapi_name(const char *name, bool complete);
>
>  /*
> + * Produce and return a NULL-terminated array of strings from @args.
> + * All strings are g_strdup'd.
> + */
> +char **strv_from_strList(const struct strList *args);
>
+
>

I'd suggest to use the dedicated glib type GStrv


> +/*
> + * Produce a strList from the character delimited string @in.
> + * All strings are g_strdup'd.
> + * A NULL or empty input string returns NULL.
> + */
> +struct strList *strList_from_string(const char *in, char delim);
> +
> +/*
>   * For any GenericList @list, insert @element at the front.
>   *
>   * Note that this macro evaluates @element exactly once, so it is safe
> @@ -56,4 +71,17 @@ int parse_qapi_name(const char *name, bool complete);
>      (tail) = &(*(tail))->next; \
>  } while (0)
>
> +/*
> + * For any GenericList @list, return its length.
> + */
> +#define QAPI_LIST_LENGTH(list) \
> +    ({ \
> +        int len = 0; \
> +        typeof(list) elem; \
> +        for (elem = list; elem != NULL; elem = elem->next) { \
> +            len++; \
> +        } \
> +        len; \
> +    })
> +
>  #endif
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index b8c22da..5ca8b4b 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -43,6 +43,7 @@
>  #include "qapi/qapi-commands-run-state.h"
>  #include "qapi/qapi-commands-tpm.h"
>  #include "qapi/qapi-commands-ui.h"
> +#include "qapi/util.h"
>  #include "qapi/qapi-visit-net.h"
>  #include "qapi/qapi-visit-migration.h"
>  #include "qapi/qmp/qdict.h"
> @@ -70,32 +71,6 @@ bool hmp_handle_error(Monitor *mon, Error *err)
>      return false;
>  }
>
> -/*
> - * Produce a strList from a comma separated list.
> - * A NULL or empty input string return NULL.
> - */
> -static strList *strList_from_comma_list(const char *in)
> -{
> -    strList *res = NULL;
> -    strList **tail = &res;
> -
> -    while (in && in[0]) {
> -        char *comma = strchr(in, ',');
> -        char *value;
> -
> -        if (comma) {
> -            value = g_strndup(in, comma - in);
> -            in = comma + 1; /* skip the , */
> -        } else {
> -            value = g_strdup(in);
> -            in = NULL;
> -        }
> -        QAPI_LIST_APPEND(tail, value);
> -    }
> -
> -    return res;
> -}
> -
>  void hmp_info_name(Monitor *mon, const QDict *qdict)
>  {
>      NameInfo *info;
> @@ -1103,7 +1078,7 @@ void hmp_announce_self(Monitor *mon, const QDict
> *qdict)
>                                              migrate_announce_params());
>
>      qapi_free_strList(params->interfaces);
> -    params->interfaces = strList_from_comma_list(interfaces_str);
> +    params->interfaces = strList_from_string(interfaces_str, ',');
>      params->has_interfaces = params->interfaces != NULL;
>      params->id = g_strdup(id);
>      params->has_id = !!params->id;
> diff --git a/qapi/qapi-util.c b/qapi/qapi-util.c
> index fda7044..edd51b3 100644
> --- a/qapi/qapi-util.c
> +++ b/qapi/qapi-util.c
> @@ -15,6 +15,7 @@
>  #include "qapi/error.h"
>  #include "qemu/ctype.h"
>  #include "qapi/qmp/qerror.h"
> +#include "qapi/qapi-builtin-types.h"
>
>  CompatPolicy compat_policy;
>
> @@ -152,3 +153,39 @@ int parse_qapi_name(const char *str, bool complete)
>      }
>      return p - str;
>  }
> +
> +char **strv_from_strList(const strList *args)
> +{
> +    const strList *arg;
> +    int i = 0;
> +    char **argv = g_malloc((QAPI_LIST_LENGTH(args) + 1) * sizeof(char *));
> +
> +    for (arg = args; arg != NULL; arg = arg->next) {
> +        argv[i++] = g_strdup(arg->value);
> +    }
> +    argv[i] = NULL;
> +
> +    return argv;
> +}
> +
> +strList *strList_from_string(const char *in, char delim)
> +{
> +    strList *res = NULL;
> +    strList **tail = &res;
> +
> +    while (in && in[0]) {
> +        char *next = strchr(in, delim);
> +        char *value;
> +
> +        if (next) {
> +            value = g_strndup(in, next - in);
> +            in = next + 1; /* skip the delim */
> +        } else {
> +            value = g_strdup(in);
> +            in = NULL;
> +        }
> +        QAPI_LIST_APPEND(tail, value);
> +    }
> +
> +    return res;
> +}
> --
> 1.8.3.1
>
>
>

-- 
Marc-André Lureau

[-- Attachment #2: Type: text/html, Size: 7310 bytes --]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 12/29] vl: helper to request re-exec
  2021-12-22 19:05 ` [PATCH V7 12/29] vl: helper to request re-exec Steve Sistare
@ 2022-03-09 14:16   ` Marc-André Lureau
  2022-03-11 16:45     ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Marc-André Lureau @ 2022-03-09 14:16 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin, QEMU,
	Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Paolo Bonzini, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

[-- Attachment #1: Type: text/plain, Size: 3279 bytes --]

On Wed, Dec 22, 2021 at 11:52 PM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Add a qemu_system_exec_request() hook that causes the main loop to exit and
> re-exec qemu using the specified arguments.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/sysemu/runstate.h |  1 +
>  softmmu/runstate.c        | 21 +++++++++++++++++++++
>  2 files changed, 22 insertions(+)
>
> diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
> index b655c7b..198211b 100644
> --- a/include/sysemu/runstate.h
> +++ b/include/sysemu/runstate.h
> @@ -57,6 +57,7 @@ void qemu_system_wakeup_enable(WakeupReason reason, bool
> enabled);
>  void qemu_register_wakeup_notifier(Notifier *notifier);
>  void qemu_register_wakeup_support(void);
>  void qemu_system_shutdown_request(ShutdownCause reason);
> +void qemu_system_exec_request(const strList *args);
>  void qemu_system_powerdown_request(void);
>  void qemu_register_powerdown_notifier(Notifier *notifier);
>  void qemu_register_shutdown_notifier(Notifier *notifier);
> diff --git a/softmmu/runstate.c b/softmmu/runstate.c
> index 3d344c9..309a4bf 100644
> --- a/softmmu/runstate.c
> +++ b/softmmu/runstate.c
> @@ -38,6 +38,7 @@
>  #include "monitor/monitor.h"
>  #include "net/net.h"
>  #include "net/vhost_net.h"
> +#include "qapi/util.h"
>  #include "qapi/error.h"
>  #include "qapi/qapi-commands-run-state.h"
>  #include "qapi/qapi-events-run-state.h"
> @@ -355,6 +356,7 @@ static NotifierList wakeup_notifiers =
>  static NotifierList shutdown_notifiers =
>      NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
>  static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
> +static char **exec_argv;
>
>  ShutdownCause qemu_shutdown_requested_get(void)
>  {
> @@ -371,6 +373,11 @@ static int qemu_shutdown_requested(void)
>      return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
>  }
>
> +static int qemu_exec_requested(void)
> +{
> +    return exec_argv != NULL;
> +}
> +
>  static void qemu_kill_report(void)
>  {
>      if (!qtest_driver() && shutdown_signal) {
> @@ -641,6 +648,13 @@ void qemu_system_shutdown_request(ShutdownCause
> reason)
>      qemu_notify_event();
>  }
>
> +void qemu_system_exec_request(const strList *args)
> +{
> +    exec_argv = strv_from_strList(args);
>

I would rather make it take a GStrv, since that's what it actually uses.

I would also check if argv[0] is set (or document the expected behaviour).


> +    shutdown_requested = 1;
> +    qemu_notify_event();
> +}
> +
>  static void qemu_system_powerdown(void)
>  {
>      qapi_event_send_powerdown();
> @@ -689,6 +703,13 @@ static bool main_loop_should_exit(void)
>      }
>      request = qemu_shutdown_requested();
>      if (request) {
> +
> +        if (qemu_exec_requested()) {
> +            execvp(exec_argv[0], exec_argv);
> +            error_report("execvp %s failed: %s", exec_argv[0],
> strerror(errno));
> +            g_strfreev(exec_argv);
> +            exec_argv = NULL;
> +        }
>          qemu_kill_report();
>          qemu_system_shutdown(request);
>          if (shutdown_action == SHUTDOWN_ACTION_PAUSE) {
> --
> 1.8.3.1
>
>
>

-- 
Marc-André Lureau

[-- Attachment #2: Type: text/html, Size: 4293 bytes --]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 08/29] memory: flat section iterator
  2021-12-22 19:05 ` [PATCH V7 08/29] memory: flat section iterator Steve Sistare
  2022-03-04 12:48   ` Philippe Mathieu-Daudé
@ 2022-03-09 14:18   ` Marc-André Lureau
  1 sibling, 0 replies; 96+ messages in thread
From: Marc-André Lureau @ 2022-03-09 14:18 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin, QEMU,
	Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Paolo Bonzini, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

[-- Attachment #1: Type: text/plain, Size: 3336 bytes --]

Hi

On Thu, Dec 23, 2021 at 12:17 AM Steve Sistare <steven.sistare@oracle.com>
wrote:

> Add an iterator over the sections of a flattened address space.
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/exec/memory.h | 31 +++++++++++++++++++++++++++++++
>  softmmu/memory.c      | 20 ++++++++++++++++++++
>  2 files changed, 51 insertions(+)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 137f5f3..9660475 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -2338,6 +2338,37 @@ void
> memory_region_set_ram_discard_manager(MemoryRegion *mr,
>                                             RamDiscardManager *rdm);
>
>  /**
> + * memory_region_section_cb: callback for
> address_space_flat_for_each_section()
> + *
> + * @s: MemoryRegionSection of the range
> + * @opaque: data pointer passed to address_space_flat_for_each_section()
> + * @errp: error message, returned to the
> address_space_flat_for_each_section
> + *        caller.
> + *
> + * Returns: non-zero to stop the iteration, and 0 to continue.  The same
> + * non-zero value is returned to the address_space_flat_for_each_section
> caller.
> + */
> +
> +typedef int (*memory_region_section_cb)(MemoryRegionSection *s,
> +                                        void *opaque,
> +                                        Error **errp);
> +
> +/**
> + * address_space_flat_for_each_section: walk the ranges in the address
> space
> + * flat view and call @func for each.  Return 0 on success, else return
> non-zero
> + * with a message in @errp.
> + *
> + * @as: target address space
> + * @func: callback function
> + * @opaque: passed to @func
> + * @errp: passed to @func
> + */
> +int address_space_flat_for_each_section(AddressSpace *as,
> +                                        memory_region_section_cb func,
> +                                        void *opaque,
> +                                        Error **errp);
> +
> +/**
>   * memory_region_find: translate an address/size relative to a
>   * MemoryRegion into a #MemoryRegionSection.
>   *
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 30b2f68..40f3522 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -2663,6 +2663,26 @@ bool memory_region_is_mapped(MemoryRegion *mr)
>      return mr->container ? true : false;
>  }
>
> +int address_space_flat_for_each_section(AddressSpace *as,
> +                                        memory_region_section_cb func,
> +                                        void *opaque,
> +                                        Error **errp)
> +{
> +    FlatView *view = address_space_get_flatview(as);
> +    FlatRange *fr;
> +    int ret;
> +
> +    FOR_EACH_FLAT_RANGE(fr, view) {
> +        MemoryRegionSection section = section_from_flat_range(fr, view);
> +        ret = func(&section, opaque, errp);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
>  /* Same as memory_region_find, but it does not add a reference to the
>   * returned region.  It must be called from an RCU critical section.
>   */
> --
> 1.8.3.1
>
>
>
lgtm,

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>


-- 
Marc-André Lureau

[-- Attachment #2: Type: text/html, Size: 4282 bytes --]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2022-03-07 22:16   ` Alex Williamson
@ 2022-03-10 15:00     ` Steven Sistare
  2022-03-10 18:35       ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-03-10 15:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Paolo Bonzini,
	Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 3/7/2022 5:16 PM, Alex Williamson wrote:
> On Wed, 22 Dec 2021 11:05:24 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Enable vfio-pci devices to be saved and restored across an exec restart
>> of qemu.
>>
>> At vfio creation time, save the value of vfio container, group, and device
>> descriptors in cpr state.
>>
>> In cpr-save and cpr-exec, suspend the use of virtual addresses in DMA
>> mappings with VFIO_DMA_UNMAP_FLAG_VADDR, because guest ram will be remapped
>> at a different VA after exec.  DMA to already-mapped pages continues.  Save
>> the msi message area as part of vfio-pci vmstate, save the interrupt and
>> notifier eventfd's in cpr state, and clear the close-on-exec flag for the
>> vfio descriptors.  The flag is not cleared earlier because the descriptors
>> should not persist across miscellaneous fork and exec calls that may be
>> performed during normal operation.
>>
>> On qemu restart, vfio_realize() finds the saved descriptors, uses
>> the descriptors, and notes that the device is being reused.  Device and
>> iommu state is already configured, so operations in vfio_realize that
>> would modify the configuration are skipped for a reused device, including
>> vfio ioctl's and writes to PCI configuration space.  The result is that
>> vfio_realize constructs qemu data structures that reflect the current
>> state of the device.  However, the reconstruction is not complete until
>> cpr-load is called. cpr-load loads the msi data and finds eventfds in cpr
>> state.  It rebuilds vector data structures and attaches the interrupts to
>> the new KVM instance.  cpr-load then invokes the main vfio listener callback,
>> which walks the flattened ranges of the vfio_address_spaces and calls
>> VFIO_DMA_MAP_FLAG_VADDR to inform the kernel of the new VA's.  Lastly, it
>> starts the VM and suppresses vfio pci device reset.
>>
>> This functionality is delivered by 3 patches for clarity.  Part 1 handles
>> device file descriptors and DMA.  Part 2 adds eventfd and MSI/MSI-X vector
>> support.  Part 3 adds INTX support.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  MAINTAINERS                   |   1 +
>>  hw/pci/pci.c                  |  10 ++++
>>  hw/vfio/common.c              | 115 ++++++++++++++++++++++++++++++++++++++----
>>  hw/vfio/cpr.c                 |  94 ++++++++++++++++++++++++++++++++++
>>  hw/vfio/meson.build           |   1 +
>>  hw/vfio/pci.c                 |  77 ++++++++++++++++++++++++++++
>>  hw/vfio/trace-events          |   1 +
>>  include/hw/pci/pci.h          |   1 +
>>  include/hw/vfio/vfio-common.h |   8 +++
>>  include/migration/cpr.h       |   3 ++
>>  migration/cpr.c               |  10 +++-
>>  migration/target.c            |  14 +++++
>>  12 files changed, 324 insertions(+), 11 deletions(-)
>>  create mode 100644 hw/vfio/cpr.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index cfe7480..feed239 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2992,6 +2992,7 @@ CPR
>>  M: Steve Sistare <steven.sistare@oracle.com>
>>  M: Mark Kanda <mark.kanda@oracle.com>
>>  S: Maintained
>> +F: hw/vfio/cpr.c
>>  F: include/migration/cpr.h
>>  F: migration/cpr.c
>>  F: qapi/cpr.json
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 0fd21e1..e35df4f 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -307,6 +307,16 @@ static void pci_do_device_reset(PCIDevice *dev)
>>  {
>>      int r;
>>  
>> +    /*
>> +     * A reused vfio-pci device is already configured, so do not reset it
>> +     * during qemu_system_reset prior to cpr-load, else interrupts may be
>> +     * lost.  By contrast, pure-virtual pci devices may be reset here and
>> +     * updated with new state in cpr-load with no ill effects.
>> +     */
>> +    if (dev->reused) {
>> +        return;
>> +    }
>> +
>>      pci_device_deassert_intx(dev);
>>      assert(dev->irq_state == 0);
>>  
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 5b87f95..90f66ad 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -31,6 +31,7 @@
>>  #include "exec/memory.h"
>>  #include "exec/ram_addr.h"
>>  #include "hw/hw.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/range.h"
>> @@ -459,6 +460,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>          .size = size,
>>      };
>>  
>> +    assert(!container->reused);
>> +
>>      if (iotlb && container->dirty_pages_supported &&
>>          vfio_devices_all_running_and_saving(container)) {
>>          return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
>> @@ -495,12 +498,24 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>  {
>>      struct vfio_iommu_type1_dma_map map = {
>>          .argsz = sizeof(map),
>> -        .flags = VFIO_DMA_MAP_FLAG_READ,
>>          .vaddr = (__u64)(uintptr_t)vaddr,
>>          .iova = iova,
>>          .size = size,
>>      };
>>  
>> +    /*
>> +     * Set the new vaddr for any mappings registered during cpr-load.
>> +     * Reused is cleared thereafter.
>> +     */
>> +    if (container->reused) {
>> +        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
>> +        if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
>> +            goto fail;
>> +        }
>> +        return 0;
>> +    }
>> +
>> +    map.flags = VFIO_DMA_MAP_FLAG_READ;
>>      if (!readonly) {
>>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>>      }
>> @@ -516,7 +531,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>          return 0;
>>      }
>>  
>> -    error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
>> +fail:
>> +    error_report("vfio_dma_map %s (iova %lu, size %ld, va %p): %s",
>> +        (container->reused ? "VADDR" : ""), iova, size, vaddr, strerror(errno));
>>      return -errno;
>>  }
>>  
>> @@ -865,6 +882,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>                                       MemoryRegionSection *section)
>>  {
>>      VFIOContainer *container = container_of(listener, VFIOContainer, listener);
>> +    vfio_container_region_add(container, section);
>> +}
>> +
>> +void vfio_container_region_add(VFIOContainer *container,
>> +                               MemoryRegionSection *section)
>> +{
>>      hwaddr iova, end;
>>      Int128 llend, llsize;
>>      void *vaddr;
>> @@ -985,6 +1008,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>          int iommu_idx;
>>  
>>          trace_vfio_listener_region_add_iommu(iova, end);
>> +
>>          /*
>>           * FIXME: For VFIO iommu types which have KVM acceleration to
>>           * avoid bouncing all map/unmaps through qemu this way, this
>> @@ -1459,6 +1483,12 @@ static void vfio_listener_release(VFIOContainer *container)
>>      }
>>  }
>>  
>> +void vfio_listener_register(VFIOContainer *container)
>> +{
>> +    container->listener = vfio_memory_listener;
>> +    memory_listener_register(&container->listener, container->space->as);
>> +}
>> +
>>  static struct vfio_info_cap_header *
>>  vfio_get_cap(void *ptr, uint32_t cap_offset, uint16_t id)
>>  {
>> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>>  {
>>      int iommu_type, ret;
>>  
>> +    /*
>> +     * If container is reused, just set its type and skip the ioctls, as the
>> +     * container and group are already configured in the kernel.
>> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
>> +     * If you ever add new types or spapr cpr support, kind reader, please
>> +     * also implement VFIO_GET_IOMMU.
>> +     */
> 
> VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
> problem is that vfio_iommu_type1_check_extension() should actually base
> some of the details on the instantiated vfio_iommu, ex.
> 
> 	switch (arg) {
> 	case VFIO_TYPE1_IOMMU:
> 		return (iommu && iommu->v2) ? 0 : 1;
> 	case VFIO_UNMAP_ALL:
> 	case VFIO_UPDATE_VADDR:
> 	case VFIO_TYPE1v2_IOMMU:
> 		return (iommu && !iommu->v2) ? 0 : 1;
> 	case VFIO_TYPE1_NESTING_IOMMU:
> 		return (iommu && !iommu->nesting) ? 0 : 1;
> 	...
> 
> We can't support v1 if we've already set a v2 container and vice versa.
> There are probably some corner cases and compatibility to puzzle
> through, but I wouldn't think we need a new ioctl to check this.

That change makes sense, and may be worth while on its own merits, but does not
solve the problem, which is that qemu will not be able to infer iommu_type in
the future if new types are added.  Given:
  * a new kernel supporting shiny new TYPE1v3
  * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it has no
    knowledge of v3
  * live update to qemu which supports v3, which will be listed first in vfio_get_iommu_type.

Then the new qemu has no way to infer iommu_type.  If it has code that makes 
decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in vfio_container_region_add,
or vfio_ram_block_discard_disable, or ...), then new qemu cannot function correctly.

For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the same time our
hypothetical future developer adds TYPE1v3.  The current inability to ask the kernel
"what are you" about a container feels like a bug to me.

>> +    if (container->reused) {
>> +        container->iommu_type = VFIO_TYPE1v2_IOMMU;
>> +        return 0;
>> +    }
>> +
>>      iommu_type = vfio_get_iommu_type(container, errp);
>>      if (iommu_type < 0) {
>>          return iommu_type;
>> @@ -1982,9 +2024,12 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>  {
>>      VFIOContainer *container;
>>      int ret, fd;
>> +    bool reused;
>>      VFIOAddressSpace *space;
>>  
>>      space = vfio_get_address_space(as);
>> +    fd = cpr_find_fd("vfio_container_for_group", group->groupid);
>> +    reused = (fd > 0);
>>  
>>      /*
>>       * VFIO is currently incompatible with discarding of RAM insofar as the
>> @@ -2017,8 +2062,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>       * details once we know which type of IOMMU we are using.
>>       */
>>  
>> +    /*
>> +     * If the container is reused, then the group is already attached in the
>> +     * kernel.  If a container with matching fd is found, then update the
>> +     * userland group list and return.  It not, then after the loop, create
> 
> s/It/If/

Check.

>> +     * the container struct and group list.
>> +     */
>> +
>>      QLIST_FOREACH(container, &space->containers, next) {
>> -        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>> +        if ((reused && container->fd == fd) ||
>> +            !ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> 
> 
> We can have multiple containers, so this can still call the ioctl when
> reused = true.  I think it still works, but it's a bit ugly, we're
> relying on the ioctl failing when the container is already set for the
> group.  Does this need to be something like:
> 
>         if (reused) {
>             if (container->fd != fd) {
>                 continue;
>             }
>         } else if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>             continue;
>         }

Better, thanks.

>>              ret = vfio_ram_block_discard_disable(container, true);
>>              if (ret) {
>>                  error_setg_errno(errp, -ret,
>> @@ -2032,12 +2085,19 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>              }
>>              group->container = container;
>>              QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>> -            vfio_kvm_device_add_group(group);
>> +            if (!reused) {
>> +                vfio_kvm_device_add_group(group);
>> +                cpr_save_fd("vfio_container_for_group", group->groupid,
>> +                            container->fd);
>> +            }
>>              return 0;
>>          }
>>      }
>>  
>> -    fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>> +    if (!reused) {
>> +        fd = qemu_open_old("/dev/vfio/vfio", O_RDWR);
>> +    }
>> +
>>      if (fd < 0) {
>>          error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
>>          ret = -errno;
>> @@ -2055,6 +2115,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      container = g_malloc0(sizeof(*container));
>>      container->space = space;
>>      container->fd = fd;
>> +    container->reused = reused;
>>      container->error = NULL;
>>      container->dirty_pages_supported = false;
>>      container->dma_max_mappings = 0;
>> @@ -2181,9 +2242,16 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      group->container = container;
>>      QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>>  
>> -    container->listener = vfio_memory_listener;
>> -
>> -    memory_listener_register(&container->listener, container->space->as);
>> +    /*
>> +     * If reused, register the listener later, after all state that may
>> +     * affect regions and mapping boundaries has been cpr-load'ed.  Later,
>> +     * the listener will invoke its callback on each flat section and call
>> +     * vfio_dma_map to supply the new vaddr, and the calls will match the
>> +     * mappings remembered by the kernel.
>> +     */
>> +    if (!reused) {
>> +        vfio_listener_register(container);
>> +    }
>>  
>>      if (container->error) {
>>          ret = -1;
>> @@ -2193,6 +2261,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
>>      }
>>  
>>      container->initialized = true;
>> +    if (!reused) {
>> +        cpr_save_fd("vfio_container_for_group", group->groupid, fd);
>> +    }
>>  
>>      return 0;
>>  listener_release_exit:
>> @@ -2222,6 +2293,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
>>  
>>      QLIST_REMOVE(group, container_next);
>>      group->container = NULL;
>> +    cpr_delete_fd("vfio_container_for_group", group->groupid);
> 
> Did you consider having cpr_save_fd() do a find_name() and update/no-op
> if found so that we can casually call cpr_save_fd() without nesting it
> in a branch the same as done for cpr_delete_fd()?

I did, but decided it was better to keep cpr_state simple and let higher layers decide
whether the extra search is helpful.  At some call sites, the cpr_save_fd is already 
in a conditional scope with other work. And, in vfio, the "if (reused)" is slightly more
efficient than find_name.  Small reasons, but justified IMO.

>>      /*
>>       * Explicitly release the listener first before unset container,
>> @@ -2270,6 +2342,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>      VFIOGroup *group;
>>      char path[32];
>>      struct vfio_group_status status = { .argsz = sizeof(status) };
>> +    bool reused;
>>  
>>      QLIST_FOREACH(group, &vfio_group_list, next) {
>>          if (group->groupid == groupid) {
>> @@ -2287,7 +2360,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>      group = g_malloc0(sizeof(*group));
>>  
>>      snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
>> -    group->fd = qemu_open_old(path, O_RDWR);
>> +
>> +    group->fd = cpr_find_fd("vfio_group", groupid);
>> +    reused = (group->fd >= 0);
>> +    if (!reused) {
>> +        group->fd = qemu_open_old(path, O_RDWR);
>> +    }
>> +
>>      if (group->fd < 0) {
>>          error_setg_errno(errp, errno, "failed to open %s", path);
>>          goto free_group_exit;
>> @@ -2321,6 +2400,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
>>  
>>      QLIST_INSERT_HEAD(&vfio_group_list, group, next);
>>  
>> +    if (!reused) {
>> +        cpr_save_fd("vfio_group", groupid, group->fd);
>> +    }
> 
> If cpr_save_fd() were idempotent as above, we wouldn't need the reused
> variable here and the previous chunk could be simplified.  It might
> even suggest a function like "cpr_find_or_open_fd()".

Sure, I will add a new function.

>> +
>>      return group;
>>  
>>  close_fd_exit:
>> @@ -2345,6 +2428,7 @@ void vfio_put_group(VFIOGroup *group)
>>      vfio_disconnect_container(group);
>>      QLIST_REMOVE(group, next);
>>      trace_vfio_put_group(group->fd);
>> +    cpr_delete_fd("vfio_group", group->groupid);
>>      close(group->fd);
>>      g_free(group);
>>  
>> @@ -2358,8 +2442,14 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>>  {
>>      struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
>>      int ret, fd;
>> +    bool reused;
>> +
>> +    fd = cpr_find_fd(name, 0);
>> +    reused = (fd >= 0);
>> +    if (!reused) {
>> +        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>> +    }
>>  
>> -    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
>>      if (fd < 0) {
>>          error_setg_errno(errp, errno, "error getting device from group %d",
>>                           group->groupid);
>> @@ -2404,6 +2494,10 @@ int vfio_get_device(VFIOGroup *group, const char *name,
>>      vbasedev->num_irqs = dev_info.num_irqs;
>>      vbasedev->num_regions = dev_info.num_regions;
>>      vbasedev->flags = dev_info.flags;
>> +    vbasedev->reused = reused;
>> +    if (!reused) {
>> +        cpr_save_fd(name, 0, fd);
>> +    }
> 
> Another cleanup here if we didn't need to tiptoe around cpr_save_fd().

Yes, I will find and fix all such callsites.

>>      trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
>>                            dev_info.num_irqs);
>> @@ -2420,6 +2514,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
>>      QLIST_REMOVE(vbasedev, next);
>>      vbasedev->group = NULL;
>>      trace_vfio_put_base_device(vbasedev->fd);
>> +    cpr_delete_fd(vbasedev->name, 0);
>>      close(vbasedev->fd);
>>  }
>>  
>> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
>> new file mode 100644
>> index 0000000..2c39cd5
>> --- /dev/null
>> +++ b/hw/vfio/cpr.c
>> @@ -0,0 +1,94 @@
>> +/*
>> + * Copyright (c) 2021 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <sys/ioctl.h>
>> +#include <linux/vfio.h>
>> +#include "hw/vfio/vfio-common.h"
>> +#include "sysemu/kvm.h"
>> +#include "qapi/error.h"
>> +#include "trace.h"
>> +
>> +static int
>> +vfio_dma_unmap_vaddr_all(VFIOContainer *container, Error **errp)
>> +{
>> +    struct vfio_iommu_type1_dma_unmap unmap = {
>> +        .argsz = sizeof(unmap),
>> +        .flags = VFIO_DMA_UNMAP_FLAG_VADDR | VFIO_DMA_UNMAP_FLAG_ALL,
>> +        .iova = 0,
>> +        .size = 0,
>> +    };
>> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>> +        error_setg_errno(errp, errno, "vfio_dma_unmap_vaddr_all");
>> +        return -errno;
>> +    }
>> +    return 0;
>> +}
>> +
>> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
>> +{
>> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
>> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
>> +                         "or VFIO_UNMAP_ALL");
>> +        return false;
>> +    } else {
>> +        return true;
>> +    }
>> +}
> 
> We could have minimally used this where we assumed a TYPE1v2 container.

Are you referring to vfio_init_container (discussed above)?
Are you suggesting that, if reused is true, we validate those extensions are
present, before setting iommu_type = VFIO_TYPE1v2_IOMMU?

>> +
>> +/*
>> + * Verify that all containers support CPR, and unmap all dma vaddr's.
>> + */
>> +int vfio_cpr_save(Error **errp)
>> +{
>> +    ERRP_GUARD();
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (!vfio_is_cpr_capable(container, errp)) {
>> +                return -1;
>> +            }
>> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
>> +                return -1;
>> +            }
>> +        }
>> +    }
> 
> Seems like we ought to validate all containers support CPR before we
> start blasting vaddrs.  It looks like qmp_cpr_exec() simply returns if
> this fails with no attempt to unwind!  Yikes!  Wouldn't we need to
> replay the listeners to remap the vaddrs in case of an error?

Already done.  I refactored that code into a separate patch to tease out some
of the complexity:
  vfio-pci: recover from unmap-all-vaddr failure

>> +
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Register the listener for each container, which causes its callback to be
>> + * invoked for every flat section.  The callback will see that the container
>> + * is reused, and call vfo_dma_map with the new vaddr.
>> + */
>> +int vfio_cpr_load(Error **errp)
>> +{
>> +    VFIOAddressSpace *space;
>> +    VFIOContainer *container;
>> +    VFIOGroup *group;
>> +    VFIODevice *vbasedev;
>> +
>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> +        QLIST_FOREACH(container, &space->containers, next) {
>> +            if (!vfio_is_cpr_capable(container, errp)) {
>> +                return -1;
>> +            }
>> +            vfio_listener_register(container);
>> +            container->reused = false;
>> +        }
>> +    }
>> +    QLIST_FOREACH(group, &vfio_group_list, next) {
>> +        QLIST_FOREACH(vbasedev, &group->device_list, next) {
>> +            vbasedev->reused = false;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
>> index da9af29..e247b2b 100644
>> --- a/hw/vfio/meson.build
>> +++ b/hw/vfio/meson.build
>> @@ -5,6 +5,7 @@ vfio_ss.add(files(
>>    'migration.c',
>>  ))
>>  vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
>> +  'cpr.c',
>>    'display.c',
>>    'pci-quirks.c',
>>    'pci.c',
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index a90cce2..acac8a7 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -30,6 +30,7 @@
>>  #include "hw/qdev-properties-system.h"
>>  #include "migration/vmstate.h"
>>  #include "qapi/qmp/qdict.h"
>> +#include "migration/cpr.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/main-loop.h"
>>  #include "qemu/module.h"
>> @@ -2926,6 +2927,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>>          vfio_put_group(group);
>>          goto error;
>>      }
>> +    pdev->reused = vdev->vbasedev.reused;
>>  
>>      vfio_populate_device(vdev, &err);
>>      if (err) {
>> @@ -3195,6 +3197,11 @@ static void vfio_pci_reset(DeviceState *dev)
>>  {
>>      VFIOPCIDevice *vdev = VFIO_PCI(dev);
>>  
>> +    /* Do not reset the device during qemu_system_reset prior to cpr-load */
>> +    if (vdev->pdev.reused) {
>> +        return;
>> +    }
>> +
>>      trace_vfio_pci_reset(vdev->vbasedev.name);
>>  
>>      vfio_pci_pre_reset(vdev);
>> @@ -3302,6 +3309,75 @@ static Property vfio_pci_dev_properties[] = {
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> +static void vfio_merge_config(VFIOPCIDevice *vdev)
>> +{
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
>> +    g_autofree uint8_t *phys_config = g_malloc(size);
>> +    uint32_t mask;
>> +    int ret, i;
>> +
>> +    ret = pread(vdev->vbasedev.fd, phys_config, size, vdev->config_offset);
>> +    if (ret < size) {
>> +        ret = ret < 0 ? errno : EFAULT;
>> +        error_report("failed to read device config space: %s", strerror(ret));
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < size; i++) {
>> +        mask = vdev->emulated_config_bits[i];
>> +        pdev->config[i] = (pdev->config[i] & mask) | (phys_config[i] & ~mask);
>> +    }
>> +}
> 
> IIUC, we get a copy of config space from the vfio device and for each
> byte, we keep what we have in emulated config space for emulated bits
> and fill in from the device for non-emulated bits.  Meanwhile,
> vfio_pci_read_config() doesn't ever return non-emulated bits from
> emulated config space, so what specifically are we accomplishing here?

Nothing, apparently.  I speculated that pdev->config[] could be used as a cache 
for the kernel config for non-emulated bits.  I'll delete it.

>> +
>> +/*
>> + * The kernel may change non-emulated config bits.  Exclude them from the
>> + * changed-bits check in get_pci_config_device.
>> + */
>> +static int vfio_pci_pre_load(void *opaque)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +    int size = MIN(pci_config_size(pdev), vdev->config_size);
>> +    int i;
>> +
>> +    for (i = 0; i < size; i++) {
>> +        pdev->cmask[i] &= vdev->emulated_config_bits[i];
>> +    }
>> +
>> +    return 0;
>> +}
> 
> The previous function seemed like maybe an attempt to make non-emulated
> bits in emulated config space consistent for testing, but here we're
> masking all non-emulated bits out of that mask.  Why do we need to do
> both?

We do need this one, as I was triggerring errors in get_pci_config_device.

>> +
>> +static int vfio_pci_post_load(void *opaque, int version_id)
>> +{
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    vfio_merge_config(vdev);
>> +
>> +    pdev->reused = false;
>> +
>> +    return 0;
>> +}
>> +
>> +static bool vfio_pci_needed(void *opaque)
>> +{
>> +    return cpr_get_mode() == CPR_MODE_RESTART;
>> +}
>> +
>> +static const VMStateDescription vfio_pci_vmstate = {
>> +    .name = "vfio-pci",
>> +    .unmigratable = 1,
>> +    .version_id = 0,
>> +    .minimum_version_id = 0,
>> +    .pre_load = vfio_pci_pre_load,
>> +    .post_load = vfio_pci_post_load,
>> +    .needed = vfio_pci_needed,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_END_OF_LIST()
>> +    }
>> +};
>> +
>>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>  {
>>      DeviceClass *dc = DEVICE_CLASS(klass);
>> @@ -3309,6 +3385,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>  
>>      dc->reset = vfio_pci_reset;
>>      device_class_set_props(dc, vfio_pci_dev_properties);
>> +    dc->vmsd = &vfio_pci_vmstate;
>>      dc->desc = "VFIO-based PCI device assignment";
>>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>>      pdc->realize = vfio_realize;
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 0ef1b5f..63dd0fe 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -118,6 +118,7 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>>  vfio_dma_unmap_overflow_workaround(void) ""
>> +vfio_region_remap(const char *name, int fd, uint64_t iova_start, uint64_t iova_end, void *vaddr) "%s fd %d 0x%"PRIx64" - 0x%"PRIx64" [%p]"
>>  
>>  # platform.c
>>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index cc63dd4..8557e82 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -361,6 +361,7 @@ struct PCIDevice {
>>      /* ID of standby device in net_failover pair */
>>      char *failover_pair_id;
>>      uint32_t acpi_index;
>> +    bool reused;
>>  };
>>  
>>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 1641753..bc23c29 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -85,6 +85,7 @@ typedef struct VFIOContainer {
>>      Error *error;
>>      bool initialized;
>>      bool dirty_pages_supported;
>> +    bool reused;
>>      uint64_t dirty_pgsizes;
>>      uint64_t max_dirty_bitmap_size;
>>      unsigned long pgsizes;
>> @@ -136,6 +137,7 @@ typedef struct VFIODevice {
>>      bool no_mmap;
>>      bool ram_block_discard_allowed;
>>      bool enable_migration;
>> +    bool reused;
>>      VFIODeviceOps *ops;
>>      unsigned int num_irqs;
>>      unsigned int num_regions;
>> @@ -212,6 +214,9 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>>  void vfio_put_group(VFIOGroup *group);
>>  int vfio_get_device(VFIOGroup *group, const char *name,
>>                      VFIODevice *vbasedev, Error **errp);
>> +int vfio_cpr_save(Error **errp);
>> +int vfio_cpr_load(Error **errp);
>> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp);
>>  
>>  extern const MemoryRegionOps vfio_region_ops;
>>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>> @@ -236,6 +241,9 @@ struct vfio_info_cap_header *
>>  vfio_get_device_info_cap(struct vfio_device_info *info, uint16_t id);
>>  #endif
>>  extern const MemoryListener vfio_prereg_listener;
>> +void vfio_listener_register(VFIOContainer *container);
>> +void vfio_container_region_add(VFIOContainer *container,
>> +                               MemoryRegionSection *section);
>>  
>>  int vfio_spapr_create_window(VFIOContainer *container,
>>                               MemoryRegionSection *section,
>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>> index a4da24e..a4007cf 100644
>> --- a/include/migration/cpr.h
>> +++ b/include/migration/cpr.h
>> @@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
>>  int cpr_state_load(Error **errp);
>>  void cpr_state_print(void);
>>  
>> +int cpr_vfio_save(Error **errp);
>> +int cpr_vfio_load(Error **errp);
>> +
>>  #endif
>> diff --git a/migration/cpr.c b/migration/cpr.c
>> index 37eca66..cee82cf 100644
>> --- a/migration/cpr.c
>> +++ b/migration/cpr.c
>> @@ -7,6 +7,7 @@
>>  
>>  #include "qemu/osdep.h"
>>  #include "exec/memory.h"
>> +#include "hw/vfio/vfio-common.h"
>>  #include "io/channel-buffer.h"
>>  #include "io/channel-file.h"
>>  #include "migration.h"
>> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>>          return;
>>      }
>> -
>> +    if (cpr_vfio_save(errp)) {
>> +        return;
>> +    }
> 
> Why is vfio so unique that it needs separate handlers versus other
> devices?  Thanks,

In earlier patches these functions fiddled with more objects, but at this point
they are simple enough to convert to pre_save and post_load vmstate handlers for
the container and group objects.  However, we would still need to call special 
functons for vfio from qmp_cpr_exec:

  * validate all containers support CPR before we start blasting vaddrs
    However, I could validate all, in every call to pre_save for each container.
    That would be less efficient, but fits the vmstate model.

  * restore all vaddr's if qemu_save_device_state fails.
    However, I could recover for all containers inside pre_save when one container fails.
    Feels strange touching all objects in a function for one, but there is no real
    downside.

Currently the logic for both of those is in vfio_cpr_save (in the patch 
"vfio-pci: recover from unmap-all-vaddr failure").

So, convert to pre_save and post_load handlers or not?

- Steve

>>      cpr_walk_fd(preserve_fd, 0);
>>      if (cpr_state_save(errp)) {
>>          return;
>> @@ -139,6 +142,11 @@ void qmp_cpr_load(const char *filename, Error **errp)
>>          goto out;
>>      }
>>  
>> +    if (cpr_get_mode() == CPR_MODE_RESTART &&
>> +        cpr_vfio_load(errp)) {
>> +        goto out;
>> +    }
>> +
>>      state = global_state_get_runstate();
>>      if (state == RUN_STATE_RUNNING) {
>>          vm_start();
>> diff --git a/migration/target.c b/migration/target.c
>> index 4390bf0..984bc9e 100644
>> --- a/migration/target.c
>> +++ b/migration/target.c
>> @@ -8,6 +8,7 @@
>>  #include "qemu/osdep.h"
>>  #include "qapi/qapi-types-migration.h"
>>  #include "migration.h"
>> +#include "migration/cpr.h"
>>  #include CONFIG_DEVICES
>>  
>>  #ifdef CONFIG_VFIO
>> @@ -22,8 +23,21 @@ void populate_vfio_info(MigrationInfo *info)
>>          info->vfio->transferred = vfio_mig_bytes_transferred();
>>      }
>>  }
>> +
>> +int cpr_vfio_save(Error **errp)
>> +{
>> +    return vfio_cpr_save(errp);
>> +}
>> +
>> +int cpr_vfio_load(Error **errp)
>> +{
>> +    return vfio_cpr_load(errp);
>> +}
>> +
>>  #else
>>  
>>  void populate_vfio_info(MigrationInfo *info) {}
>> +int cpr_vfio_save(Error **errp) { return 0; }
>> +int cpr_vfio_load(Error **errp) { return 0; }
>>  
>>  #endif /* CONFIG_VFIO */
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-08  7:20           ` Igor Mammedov
@ 2022-03-10 15:36             ` Steven Sistare
  2022-03-10 16:00               ` Igor Mammedov
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-03-10 15:36 UTC (permalink / raw)
  To: Igor Mammedov, Michael S. Tsirkin
  Cc: Jason Zeng, Juan Quintela, Eric Blake,
	Philippe Mathieu-Daudé,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Alex Bennée, Markus Armbruster

On 3/8/2022 2:20 AM, Igor Mammedov wrote:
> On Tue, 8 Mar 2022 01:50:11 -0500
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
>> On Mon, Mar 07, 2022 at 09:41:44AM -0500, Steven Sistare wrote:
>>> On 3/4/2022 5:41 AM, Igor Mammedov wrote:  
>>>> On Thu, 3 Mar 2022 12:21:15 -0500
>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>   
>>>>> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:  
>>>>>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
>>>>>> option is set.
>>>>>>
>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>> ---
>>>>>>  hw/core/machine.c   | 19 +++++++++++++++++++
>>>>>>  include/hw/boards.h |  1 +
>>>>>>  qemu-options.hx     |  6 ++++++
>>>>>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>>>>>>  softmmu/vl.c        |  1 +
>>>>>>  trace-events        |  1 +
>>>>>>  util/qemu-config.c  |  4 ++++
>>>>>>  7 files changed, 70 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>>>>> index 53a99ab..7739d88 100644
>>>>>> --- a/hw/core/machine.c
>>>>>> +++ b/hw/core/machine.c
>>>>>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>>>>>      ms->mem_merge = value;
>>>>>>  }
>>>>>>  
>>>>>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
>>>>>> +{
>>>>>> +    MachineState *ms = MACHINE(obj);
>>>>>> +
>>>>>> +    return ms->memfd_alloc;
>>>>>> +}
>>>>>> +
>>>>>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
>>>>>> +{
>>>>>> +    MachineState *ms = MACHINE(obj);
>>>>>> +
>>>>>> +    ms->memfd_alloc = value;
>>>>>> +}
>>>>>> +
>>>>>>  static bool machine_get_usb(Object *obj, Error **errp)
>>>>>>  {
>>>>>>      MachineState *ms = MACHINE(obj);
>>>>>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>>>>>      object_class_property_set_description(oc, "mem-merge",
>>>>>>          "Enable/disable memory merge support");
>>>>>>  
>>>>>> +    object_class_property_add_bool(oc, "memfd-alloc",
>>>>>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
>>>>>> +    object_class_property_set_description(oc, "memfd-alloc",
>>>>>> +        "Enable/disable allocating anonymous memory using memfd_create");
>>>>>> +
>>>>>>      object_class_property_add_bool(oc, "usb",
>>>>>>          machine_get_usb, machine_set_usb);
>>>>>>      object_class_property_set_description(oc, "usb",
>>>>>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>>>>>> index 9c1c190..a57d7a0 100644
>>>>>> --- a/include/hw/boards.h
>>>>>> +++ b/include/hw/boards.h
>>>>>> @@ -327,6 +327,7 @@ struct MachineState {
>>>>>>      char *dt_compatible;
>>>>>>      bool dump_guest_core;
>>>>>>      bool mem_merge;
>>>>>> +    bool memfd_alloc;
>>>>>>      bool usb;
>>>>>>      bool usb_disabled;
>>>>>>      char *firmware;
>>>>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>>>>> index 7d47510..33c8173 100644
>>>>>> --- a/qemu-options.hx
>>>>>> +++ b/qemu-options.hx
>>>>>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>>>>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>>>>>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>>>>>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
>>>>>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"    
>>>>>
>>>>> Question: are there any disadvantages associated with using
>>>>> memfd_create? I guess we are using up an fd, but that seems minor.  Any
>>>>> reason not to set to on by default? maybe with a fallback option to
>>>>> disable that?  
>>>
>>> Old Linux host kernels, circa 4.1, do not support huge pages for shared memory.
>>> Also, the tunable to enable huge pages for share memory is different than for
>>> anon memory, so there could be performance loss if it is not set correctly.
>>>     /sys/kernel/mm/transparent_hugepage/enabled
>>>     vs
>>>     /sys/kernel/mm/transparent_hugepage/shmem_enabled  
>>
>> I guess we can test this when launching the VM, and select
>> a good default.
>>
>>> It might make sense to use memfd_create by default for the secondary segments.  
>>
>> Well there's also KSM now you mention it.
> 
> then another quest, is there downside to always using memfd_create
> without any knobs being involved?

Lower performance if small pages are used (but Michael suggests qemu could 
automatically check the tunable and use anon memory instead)

KSM (same page merging) is not supported for shared memory, so ram_block_add ->
memory_try_enable_merging will not enable it.

In both cases, I expect the degradation would be negligible if memfd_create is
only automatically applied to the secondary segments, which are typically small.
But, someone's secondary segment could be larger, and it is time consuming to
prove innocence when someone claims your change caused their performance regression.

- Steve

>>>>> I am concerned that it's actually a kind of memory backend, this flag
>>>>> seems to instead be closer to the deprecated mem-prealloc. E.g.
>>>>> it does not work with a mem path, does it?  
>>>
>>> One can still define a memory backend with mempath to create the main ram segment,
>>> though it must be some form of shared to work with live update.  Indeed, I would 
>>> expect most users to specify an explicit memory backend for it.  The secondary
>>> segments would still use memfd_create.
>>>   
>>>> (mem path and mem-prealloc are transparently aliased to used memory backend
>>>> if I recall it right.)
>>>>
>>>> Steve,
>>>>
>>>> For allocating guest RAM, we switched exclusively to using memory-backends
>>>> including initial guest RAM (-m size option) and we have hostmem-memfd
>>>> that uses memfd_create() and I'd rather avoid adding random knobs to machine
>>>> for tweaking how RAM should be allocated, we have memory backends for this,
>>>> so this patch begs the question: why hostmem-memfd is not sufficient?
>>>> (patch description is rather lacking on rationale behind the patch)  
>>>
>>> There is currently no way to specify memory backends for the secondary memory
>>> segments (vram, roms, etc), and IMO it would be onerous to specify a backend for
>>> each of them.  On x86_64, these include pc.bios, vga.vram, pc.rom, vga.rom,
>>> /rom@etc/acpi/tables, /rom@etc/table-loader, /rom@etc/acpi/rsdp.
>>>
>>> - Steve
>>>   
>>>>>>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
>>>>>>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
>>>>>>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
>>>>>> @@ -76,6 +77,11 @@ SRST
>>>>>>          supported by the host, de-duplicates identical memory pages
>>>>>>          among VMs instances (enabled by default).
>>>>>>  
>>>>>> +    ``memfd-alloc=on|off``
>>>>>> +        Enables or disables allocation of anonymous guest RAM using
>>>>>> +        memfd_create.  Any associated memory-backend objects are created with
>>>>>> +        share=on.  The memfd-alloc default is off.
>>>>>> +
>>>>>>      ``aes-key-wrap=on|off``
>>>>>>          Enables or disables AES key wrapping support on s390-ccw hosts.
>>>>>>          This feature controls whether AES wrapping keys will be created
>>>>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>>>>>> index 3524c04..95e2b49 100644
>>>>>> --- a/softmmu/physmem.c
>>>>>> +++ b/softmmu/physmem.c
>>>>>> @@ -41,6 +41,7 @@
>>>>>>  #include "qemu/config-file.h"
>>>>>>  #include "qemu/error-report.h"
>>>>>>  #include "qemu/qemu-print.h"
>>>>>> +#include "qemu/memfd.h"
>>>>>>  #include "exec/memory.h"
>>>>>>  #include "exec/ioport.h"
>>>>>>  #include "sysemu/dma.h"
>>>>>> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>>>>      const bool shared = qemu_ram_is_shared(new_block);
>>>>>>      RAMBlock *block;
>>>>>>      RAMBlock *last_block = NULL;
>>>>>> +    struct MemoryRegion *mr = new_block->mr;
>>>>>>      ram_addr_t old_ram_size, new_ram_size;
>>>>>>      Error *err = NULL;
>>>>>> +    const char *name;
>>>>>> +    void *addr = 0;
>>>>>> +    size_t maxlen;
>>>>>> +    MachineState *ms = MACHINE(qdev_get_machine());
>>>>>>  
>>>>>>      old_ram_size = last_ram_page();
>>>>>>  
>>>>>>      qemu_mutex_lock_ramlist();
>>>>>> -    new_block->offset = find_ram_offset(new_block->max_length);
>>>>>> +    maxlen = new_block->max_length;
>>>>>> +    new_block->offset = find_ram_offset(maxlen);
>>>>>>  
>>>>>>      if (!new_block->host) {
>>>>>>          if (xen_enabled()) {
>>>>>> -            xen_ram_alloc(new_block->offset, new_block->max_length,
>>>>>> -                          new_block->mr, &err);
>>>>>> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
>>>>>>              if (err) {
>>>>>>                  error_propagate(errp, err);
>>>>>>                  qemu_mutex_unlock_ramlist();
>>>>>>                  return;
>>>>>>              }
>>>>>>          } else {
>>>>>> -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
>>>>>> -                                                  &new_block->mr->align,
>>>>>> -                                                  shared, noreserve);
>>>>>> -            if (!new_block->host) {
>>>>>> +            name = memory_region_name(mr);
>>>>>> +            if (ms->memfd_alloc) {
>>>>>> +                Object *parent = &mr->parent_obj;
>>>>>> +                int mfd = -1;          /* placeholder until next patch */
>>>>>> +                mr->align = QEMU_VMALLOC_ALIGN;
>>>>>> +                if (mfd < 0) {
>>>>>> +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
>>>>>> +                                            0, 0, 0, &err);
>>>>>> +                    if (mfd < 0) {
>>>>>> +                        return;
>>>>>> +                    }
>>>>>> +                }
>>>>>> +                qemu_set_cloexec(mfd);
>>>>>> +                /* The memory backend already set its desired flags. */
>>>>>> +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
>>>>>> +                    new_block->flags |= RAM_SHARED;
>>>>>> +                }
>>>>>> +                addr = file_ram_alloc(new_block, maxlen, mfd,
>>>>>> +                                      false, false, 0, errp);
>>>>>> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
>>>>>> +            } else {
>>>>>> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
>>>>>> +                                           shared, noreserve);
>>>>>> +            }
>>>>>> +
>>>>>> +            if (!addr) {
>>>>>>                  error_setg_errno(errp, errno,
>>>>>>                                   "cannot set up guest memory '%s'",
>>>>>> -                                 memory_region_name(new_block->mr));
>>>>>> +                                 name);
>>>>>>                  qemu_mutex_unlock_ramlist();
>>>>>>                  return;
>>>>>>              }
>>>>>> -            memory_try_enable_merging(new_block->host, new_block->max_length);
>>>>>> +            memory_try_enable_merging(addr, maxlen);
>>>>>> +            new_block->host = addr;
>>>>>>          }
>>>>>>      }
>>>>>>  
>>>>>> diff --git a/softmmu/vl.c b/softmmu/vl.c
>>>>>> index 620a1f1..ab3648a 100644
>>>>>> --- a/softmmu/vl.c
>>>>>> +++ b/softmmu/vl.c
>>>>>> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
>>>>>>          object_property_set_str(obj, "mem-path", path, &error_fatal);
>>>>>>      }
>>>>>>      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
>>>>>> +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
>>>>>>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
>>>>>>                                obj);
>>>>>>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
>>>>>> diff --git a/trace-events b/trace-events
>>>>>> index a637a61..770a9ac 100644
>>>>>> --- a/trace-events
>>>>>> +++ b/trace-events
>>>>>> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
>>>>>>  # accel/tcg/cputlb.c
>>>>>>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
>>>>>>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
>>>>>> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
>>>>>>  
>>>>>>  # gdbstub.c
>>>>>>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
>>>>>> diff --git a/util/qemu-config.c b/util/qemu-config.c
>>>>>> index 436ab63..3606e5c 100644
>>>>>> --- a/util/qemu-config.c
>>>>>> +++ b/util/qemu-config.c
>>>>>> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
>>>>>>              .type = QEMU_OPT_BOOL,
>>>>>>              .help = "enable/disable memory merge support",
>>>>>>          },{
>>>>>> +            .name = "memfd-alloc",
>>>>>> +            .type = QEMU_OPT_BOOL,
>>>>>> +            .help = "enable/disable memfd_create for anonymous memory",
>>>>>> +        },{
>>>>>>              .name = "usb",
>>>>>>              .type = QEMU_OPT_BOOL,
>>>>>>              .help = "Set on/off to enable/disable usb",
>>>>>> -- 
>>>>>> 1.8.3.1    
>>>>>
>>>>>  
>>>>   
>>
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-10 15:36             ` Steven Sistare
@ 2022-03-10 16:00               ` Igor Mammedov
  2022-03-10 17:28                 ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Igor Mammedov @ 2022-03-10 16:00 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	David Hildenbrand, qemu-devel, Dr. David Alan Gilbert,
	Zheng Chuan, Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On Thu, 10 Mar 2022 10:36:08 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 3/8/2022 2:20 AM, Igor Mammedov wrote:
> > On Tue, 8 Mar 2022 01:50:11 -0500
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> >> On Mon, Mar 07, 2022 at 09:41:44AM -0500, Steven Sistare wrote:  
> >>> On 3/4/2022 5:41 AM, Igor Mammedov wrote:    
> >>>> On Thu, 3 Mar 2022 12:21:15 -0500
> >>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>     
> >>>>> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:    
> >>>>>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
> >>>>>> option is set.
> >>>>>>
> >>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>>>> ---
> >>>>>>  hw/core/machine.c   | 19 +++++++++++++++++++
> >>>>>>  include/hw/boards.h |  1 +
> >>>>>>  qemu-options.hx     |  6 ++++++
> >>>>>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
> >>>>>>  softmmu/vl.c        |  1 +
> >>>>>>  trace-events        |  1 +
> >>>>>>  util/qemu-config.c  |  4 ++++
> >>>>>>  7 files changed, 70 insertions(+), 9 deletions(-)
> >>>>>>
> >>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
> >>>>>> index 53a99ab..7739d88 100644
> >>>>>> --- a/hw/core/machine.c
> >>>>>> +++ b/hw/core/machine.c
> >>>>>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
> >>>>>>      ms->mem_merge = value;
> >>>>>>  }
> >>>>>>  
> >>>>>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
> >>>>>> +{
> >>>>>> +    MachineState *ms = MACHINE(obj);
> >>>>>> +
> >>>>>> +    return ms->memfd_alloc;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
> >>>>>> +{
> >>>>>> +    MachineState *ms = MACHINE(obj);
> >>>>>> +
> >>>>>> +    ms->memfd_alloc = value;
> >>>>>> +}
> >>>>>> +
> >>>>>>  static bool machine_get_usb(Object *obj, Error **errp)
> >>>>>>  {
> >>>>>>      MachineState *ms = MACHINE(obj);
> >>>>>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
> >>>>>>      object_class_property_set_description(oc, "mem-merge",
> >>>>>>          "Enable/disable memory merge support");
> >>>>>>  
> >>>>>> +    object_class_property_add_bool(oc, "memfd-alloc",
> >>>>>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
> >>>>>> +    object_class_property_set_description(oc, "memfd-alloc",
> >>>>>> +        "Enable/disable allocating anonymous memory using memfd_create");
> >>>>>> +
> >>>>>>      object_class_property_add_bool(oc, "usb",
> >>>>>>          machine_get_usb, machine_set_usb);
> >>>>>>      object_class_property_set_description(oc, "usb",
> >>>>>> diff --git a/include/hw/boards.h b/include/hw/boards.h
> >>>>>> index 9c1c190..a57d7a0 100644
> >>>>>> --- a/include/hw/boards.h
> >>>>>> +++ b/include/hw/boards.h
> >>>>>> @@ -327,6 +327,7 @@ struct MachineState {
> >>>>>>      char *dt_compatible;
> >>>>>>      bool dump_guest_core;
> >>>>>>      bool mem_merge;
> >>>>>> +    bool memfd_alloc;
> >>>>>>      bool usb;
> >>>>>>      bool usb_disabled;
> >>>>>>      char *firmware;
> >>>>>> diff --git a/qemu-options.hx b/qemu-options.hx
> >>>>>> index 7d47510..33c8173 100644
> >>>>>> --- a/qemu-options.hx
> >>>>>> +++ b/qemu-options.hx
> >>>>>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
> >>>>>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
> >>>>>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
> >>>>>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
> >>>>>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"      
> >>>>>
> >>>>> Question: are there any disadvantages associated with using
> >>>>> memfd_create? I guess we are using up an fd, but that seems minor.  Any
> >>>>> reason not to set to on by default? maybe with a fallback option to
> >>>>> disable that?    
> >>>
> >>> Old Linux host kernels, circa 4.1, do not support huge pages for shared memory.
> >>> Also, the tunable to enable huge pages for share memory is different than for
> >>> anon memory, so there could be performance loss if it is not set correctly.
> >>>     /sys/kernel/mm/transparent_hugepage/enabled
> >>>     vs
> >>>     /sys/kernel/mm/transparent_hugepage/shmem_enabled    
> >>
> >> I guess we can test this when launching the VM, and select
> >> a good default.
> >>  
> >>> It might make sense to use memfd_create by default for the secondary segments.    
> >>
> >> Well there's also KSM now you mention it.  
> > 
> > then another quest, is there downside to always using memfd_create
> > without any knobs being involved?  
> 
> Lower performance if small pages are used (but Michael suggests qemu could 
> automatically check the tunable and use anon memory instead)
> 
> KSM (same page merging) is not supported for shared memory, so ram_block_add ->
> memory_try_enable_merging will not enable it.
> 
> In both cases, I expect the degradation would be negligible if memfd_create is
> only automatically applied to the secondary segments, which are typically small.
> But, someone's secondary segment could be larger, and it is time consuming to
> prove innocence when someone claims your change caused their performance regression.

Adding David as memory subsystem maintainer, maybe he will a better
idea instead of introducing global knob that would also magically alter 
backends' behavior despite of its their configured settings.



> - Steve
> 
> >>>>> I am concerned that it's actually a kind of memory backend, this flag
> >>>>> seems to instead be closer to the deprecated mem-prealloc. E.g.
> >>>>> it does not work with a mem path, does it?    
> >>>
> >>> One can still define a memory backend with mempath to create the main ram segment,
> >>> though it must be some form of shared to work with live update.  Indeed, I would 
> >>> expect most users to specify an explicit memory backend for it.  The secondary
> >>> segments would still use memfd_create.
> >>>     
> >>>> (mem path and mem-prealloc are transparently aliased to used memory backend
> >>>> if I recall it right.)
> >>>>
> >>>> Steve,
> >>>>
> >>>> For allocating guest RAM, we switched exclusively to using memory-backends
> >>>> including initial guest RAM (-m size option) and we have hostmem-memfd
> >>>> that uses memfd_create() and I'd rather avoid adding random knobs to machine
> >>>> for tweaking how RAM should be allocated, we have memory backends for this,
> >>>> so this patch begs the question: why hostmem-memfd is not sufficient?
> >>>> (patch description is rather lacking on rationale behind the patch)    
> >>>
> >>> There is currently no way to specify memory backends for the secondary memory
> >>> segments (vram, roms, etc), and IMO it would be onerous to specify a backend for
> >>> each of them.  On x86_64, these include pc.bios, vga.vram, pc.rom, vga.rom,
> >>> /rom@etc/acpi/tables, /rom@etc/table-loader, /rom@etc/acpi/rsdp.

MemoryRegion is not the only place where state is stored.
If we only talk about fwcfg entries state, it can also reference
plain malloced memory allocated elsewhere or make a deep copy internally.
Similarly devices also may store state outside of RamBlock framework.

How are you dealing with that?

> >>>
> >>> - Steve
> >>>     
> >>>>>>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
> >>>>>>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
> >>>>>>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
> >>>>>> @@ -76,6 +77,11 @@ SRST
> >>>>>>          supported by the host, de-duplicates identical memory pages
> >>>>>>          among VMs instances (enabled by default).
> >>>>>>  
> >>>>>> +    ``memfd-alloc=on|off``
> >>>>>> +        Enables or disables allocation of anonymous guest RAM using
> >>>>>> +        memfd_create.  Any associated memory-backend objects are created with
> >>>>>> +        share=on.  The memfd-alloc default is off.
> >>>>>> +
> >>>>>>      ``aes-key-wrap=on|off``
> >>>>>>          Enables or disables AES key wrapping support on s390-ccw hosts.
> >>>>>>          This feature controls whether AES wrapping keys will be created
> >>>>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> >>>>>> index 3524c04..95e2b49 100644
> >>>>>> --- a/softmmu/physmem.c
> >>>>>> +++ b/softmmu/physmem.c
> >>>>>> @@ -41,6 +41,7 @@
> >>>>>>  #include "qemu/config-file.h"
> >>>>>>  #include "qemu/error-report.h"
> >>>>>>  #include "qemu/qemu-print.h"
> >>>>>> +#include "qemu/memfd.h"
> >>>>>>  #include "exec/memory.h"
> >>>>>>  #include "exec/ioport.h"
> >>>>>>  #include "sysemu/dma.h"
> >>>>>> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
> >>>>>>      const bool shared = qemu_ram_is_shared(new_block);
> >>>>>>      RAMBlock *block;
> >>>>>>      RAMBlock *last_block = NULL;
> >>>>>> +    struct MemoryRegion *mr = new_block->mr;
> >>>>>>      ram_addr_t old_ram_size, new_ram_size;
> >>>>>>      Error *err = NULL;
> >>>>>> +    const char *name;
> >>>>>> +    void *addr = 0;
> >>>>>> +    size_t maxlen;
> >>>>>> +    MachineState *ms = MACHINE(qdev_get_machine());
> >>>>>>  
> >>>>>>      old_ram_size = last_ram_page();
> >>>>>>  
> >>>>>>      qemu_mutex_lock_ramlist();
> >>>>>> -    new_block->offset = find_ram_offset(new_block->max_length);
> >>>>>> +    maxlen = new_block->max_length;
> >>>>>> +    new_block->offset = find_ram_offset(maxlen);
> >>>>>>  
> >>>>>>      if (!new_block->host) {
> >>>>>>          if (xen_enabled()) {
> >>>>>> -            xen_ram_alloc(new_block->offset, new_block->max_length,
> >>>>>> -                          new_block->mr, &err);
> >>>>>> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
> >>>>>>              if (err) {
> >>>>>>                  error_propagate(errp, err);
> >>>>>>                  qemu_mutex_unlock_ramlist();
> >>>>>>                  return;
> >>>>>>              }
> >>>>>>          } else {
> >>>>>> -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
> >>>>>> -                                                  &new_block->mr->align,
> >>>>>> -                                                  shared, noreserve);
> >>>>>> -            if (!new_block->host) {
> >>>>>> +            name = memory_region_name(mr);
> >>>>>> +            if (ms->memfd_alloc) {
> >>>>>> +                Object *parent = &mr->parent_obj;
> >>>>>> +                int mfd = -1;          /* placeholder until next patch */
> >>>>>> +                mr->align = QEMU_VMALLOC_ALIGN;
> >>>>>> +                if (mfd < 0) {
> >>>>>> +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
> >>>>>> +                                            0, 0, 0, &err);
> >>>>>> +                    if (mfd < 0) {
> >>>>>> +                        return;
> >>>>>> +                    }
> >>>>>> +                }
> >>>>>> +                qemu_set_cloexec(mfd);
> >>>>>> +                /* The memory backend already set its desired flags. */
> >>>>>> +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
> >>>>>> +                    new_block->flags |= RAM_SHARED;
> >>>>>> +                }
> >>>>>> +                addr = file_ram_alloc(new_block, maxlen, mfd,
> >>>>>> +                                      false, false, 0, errp);
> >>>>>> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
> >>>>>> +            } else {
> >>>>>> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
> >>>>>> +                                           shared, noreserve);
> >>>>>> +            }
> >>>>>> +
> >>>>>> +            if (!addr) {
> >>>>>>                  error_setg_errno(errp, errno,
> >>>>>>                                   "cannot set up guest memory '%s'",
> >>>>>> -                                 memory_region_name(new_block->mr));
> >>>>>> +                                 name);
> >>>>>>                  qemu_mutex_unlock_ramlist();
> >>>>>>                  return;
> >>>>>>              }
> >>>>>> -            memory_try_enable_merging(new_block->host, new_block->max_length);
> >>>>>> +            memory_try_enable_merging(addr, maxlen);
> >>>>>> +            new_block->host = addr;
> >>>>>>          }
> >>>>>>      }
> >>>>>>  
> >>>>>> diff --git a/softmmu/vl.c b/softmmu/vl.c
> >>>>>> index 620a1f1..ab3648a 100644
> >>>>>> --- a/softmmu/vl.c
> >>>>>> +++ b/softmmu/vl.c
> >>>>>> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
> >>>>>>          object_property_set_str(obj, "mem-path", path, &error_fatal);
> >>>>>>      }
> >>>>>>      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
> >>>>>> +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
> >>>>>>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
> >>>>>>                                obj);
> >>>>>>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
> >>>>>> diff --git a/trace-events b/trace-events
> >>>>>> index a637a61..770a9ac 100644
> >>>>>> --- a/trace-events
> >>>>>> +++ b/trace-events
> >>>>>> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
> >>>>>>  # accel/tcg/cputlb.c
> >>>>>>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
> >>>>>>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
> >>>>>> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
> >>>>>>  
> >>>>>>  # gdbstub.c
> >>>>>>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
> >>>>>> diff --git a/util/qemu-config.c b/util/qemu-config.c
> >>>>>> index 436ab63..3606e5c 100644
> >>>>>> --- a/util/qemu-config.c
> >>>>>> +++ b/util/qemu-config.c
> >>>>>> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
> >>>>>>              .type = QEMU_OPT_BOOL,
> >>>>>>              .help = "enable/disable memory merge support",
> >>>>>>          },{
> >>>>>> +            .name = "memfd-alloc",
> >>>>>> +            .type = QEMU_OPT_BOOL,
> >>>>>> +            .help = "enable/disable memfd_create for anonymous memory",
> >>>>>> +        },{
> >>>>>>              .name = "usb",
> >>>>>>              .type = QEMU_OPT_BOOL,
> >>>>>>              .help = "Set on/off to enable/disable usb",
> >>>>>> -- 
> >>>>>> 1.8.3.1      
> >>>>>
> >>>>>    
> >>>>     
> >>  
> >   
> 



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-10 16:00               ` Igor Mammedov
@ 2022-03-10 17:28                 ` Steven Sistare
  2022-03-10 18:18                   ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-03-10 17:28 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	David Hildenbrand, qemu-devel, Dr. David Alan Gilbert,
	Zheng Chuan, Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 3/10/2022 11:00 AM, Igor Mammedov wrote:
> On Thu, 10 Mar 2022 10:36:08 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 3/8/2022 2:20 AM, Igor Mammedov wrote:
>>> On Tue, 8 Mar 2022 01:50:11 -0500
>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>   
>>>> On Mon, Mar 07, 2022 at 09:41:44AM -0500, Steven Sistare wrote:  
>>>>> On 3/4/2022 5:41 AM, Igor Mammedov wrote:    
>>>>>> On Thu, 3 Mar 2022 12:21:15 -0500
>>>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>     
>>>>>>> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:    
>>>>>>>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
>>>>>>>> option is set.
>>>>>>>>
>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>> ---
>>>>>>>>  hw/core/machine.c   | 19 +++++++++++++++++++
>>>>>>>>  include/hw/boards.h |  1 +
>>>>>>>>  qemu-options.hx     |  6 ++++++
>>>>>>>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>>>>>>>>  softmmu/vl.c        |  1 +
>>>>>>>>  trace-events        |  1 +
>>>>>>>>  util/qemu-config.c  |  4 ++++
>>>>>>>>  7 files changed, 70 insertions(+), 9 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>>>>>>> index 53a99ab..7739d88 100644
>>>>>>>> --- a/hw/core/machine.c
>>>>>>>> +++ b/hw/core/machine.c
>>>>>>>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>>>>>>>      ms->mem_merge = value;
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
>>>>>>>> +{
>>>>>>>> +    MachineState *ms = MACHINE(obj);
>>>>>>>> +
>>>>>>>> +    return ms->memfd_alloc;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
>>>>>>>> +{
>>>>>>>> +    MachineState *ms = MACHINE(obj);
>>>>>>>> +
>>>>>>>> +    ms->memfd_alloc = value;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static bool machine_get_usb(Object *obj, Error **errp)
>>>>>>>>  {
>>>>>>>>      MachineState *ms = MACHINE(obj);
>>>>>>>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>>>>>>>      object_class_property_set_description(oc, "mem-merge",
>>>>>>>>          "Enable/disable memory merge support");
>>>>>>>>  
>>>>>>>> +    object_class_property_add_bool(oc, "memfd-alloc",
>>>>>>>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
>>>>>>>> +    object_class_property_set_description(oc, "memfd-alloc",
>>>>>>>> +        "Enable/disable allocating anonymous memory using memfd_create");
>>>>>>>> +
>>>>>>>>      object_class_property_add_bool(oc, "usb",
>>>>>>>>          machine_get_usb, machine_set_usb);
>>>>>>>>      object_class_property_set_description(oc, "usb",
>>>>>>>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>>>>>>>> index 9c1c190..a57d7a0 100644
>>>>>>>> --- a/include/hw/boards.h
>>>>>>>> +++ b/include/hw/boards.h
>>>>>>>> @@ -327,6 +327,7 @@ struct MachineState {
>>>>>>>>      char *dt_compatible;
>>>>>>>>      bool dump_guest_core;
>>>>>>>>      bool mem_merge;
>>>>>>>> +    bool memfd_alloc;
>>>>>>>>      bool usb;
>>>>>>>>      bool usb_disabled;
>>>>>>>>      char *firmware;
>>>>>>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>>>>>>> index 7d47510..33c8173 100644
>>>>>>>> --- a/qemu-options.hx
>>>>>>>> +++ b/qemu-options.hx
>>>>>>>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>>>>>>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>>>>>>>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>>>>>>>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
>>>>>>>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"      
>>>>>>>
>>>>>>> Question: are there any disadvantages associated with using
>>>>>>> memfd_create? I guess we are using up an fd, but that seems minor.  Any
>>>>>>> reason not to set to on by default? maybe with a fallback option to
>>>>>>> disable that?    
>>>>>
>>>>> Old Linux host kernels, circa 4.1, do not support huge pages for shared memory.
>>>>> Also, the tunable to enable huge pages for share memory is different than for
>>>>> anon memory, so there could be performance loss if it is not set correctly.
>>>>>     /sys/kernel/mm/transparent_hugepage/enabled
>>>>>     vs
>>>>>     /sys/kernel/mm/transparent_hugepage/shmem_enabled    
>>>>
>>>> I guess we can test this when launching the VM, and select
>>>> a good default.
>>>>  
>>>>> It might make sense to use memfd_create by default for the secondary segments.    
>>>>
>>>> Well there's also KSM now you mention it.  
>>>
>>> then another quest, is there downside to always using memfd_create
>>> without any knobs being involved?  
>>
>> Lower performance if small pages are used (but Michael suggests qemu could 
>> automatically check the tunable and use anon memory instead)
>>
>> KSM (same page merging) is not supported for shared memory, so ram_block_add ->
>> memory_try_enable_merging will not enable it.
>>
>> In both cases, I expect the degradation would be negligible if memfd_create is
>> only automatically applied to the secondary segments, which are typically small.
>> But, someone's secondary segment could be larger, and it is time consuming to
>> prove innocence when someone claims your change caused their performance regression.
> 
> Adding David as memory subsystem maintainer, maybe he will a better
> idea instead of introducing global knob that would also magically alter 
> backends' behavior despite of its their configured settings.

OK, in ram_block_add I can set the RAM_SHARED flag based on the memory-backend object's
shared flag.  I already set the latter in create_default_memdev when memfd-alloc is
specified.  With that change, we do not override configured settings.  Users can no longer
use memory-backend-ram for CPR, and must change all memory-backend-ram to memory-backend-memfd
in the command-line arguments.  That is fine.

With that change, are you OK with this patch?

- Steve

>>>>>>> I am concerned that it's actually a kind of memory backend, this flag
>>>>>>> seems to instead be closer to the deprecated mem-prealloc. E.g.
>>>>>>> it does not work with a mem path, does it?    
>>>>>
>>>>> One can still define a memory backend with mempath to create the main ram segment,
>>>>> though it must be some form of shared to work with live update.  Indeed, I would 
>>>>> expect most users to specify an explicit memory backend for it.  The secondary
>>>>> segments would still use memfd_create.
>>>>>     
>>>>>> (mem path and mem-prealloc are transparently aliased to used memory backend
>>>>>> if I recall it right.)
>>>>>>
>>>>>> Steve,
>>>>>>
>>>>>> For allocating guest RAM, we switched exclusively to using memory-backends
>>>>>> including initial guest RAM (-m size option) and we have hostmem-memfd
>>>>>> that uses memfd_create() and I'd rather avoid adding random knobs to machine
>>>>>> for tweaking how RAM should be allocated, we have memory backends for this,
>>>>>> so this patch begs the question: why hostmem-memfd is not sufficient?
>>>>>> (patch description is rather lacking on rationale behind the patch)    
>>>>>
>>>>> There is currently no way to specify memory backends for the secondary memory
>>>>> segments (vram, roms, etc), and IMO it would be onerous to specify a backend for
>>>>> each of them.  On x86_64, these include pc.bios, vga.vram, pc.rom, vga.rom,
>>>>> /rom@etc/acpi/tables, /rom@etc/table-loader, /rom@etc/acpi/rsdp.
> 
> MemoryRegion is not the only place where state is stored.
> If we only talk about fwcfg entries state, it can also reference
> plain malloced memory allocated elsewhere or make a deep copy internally.
> Similarly devices also may store state outside of RamBlock framework.
> 
> How are you dealing with that?
> 
>>>>>
>>>>> - Steve
>>>>>     
>>>>>>>>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
>>>>>>>>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
>>>>>>>>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
>>>>>>>> @@ -76,6 +77,11 @@ SRST
>>>>>>>>          supported by the host, de-duplicates identical memory pages
>>>>>>>>          among VMs instances (enabled by default).
>>>>>>>>  
>>>>>>>> +    ``memfd-alloc=on|off``
>>>>>>>> +        Enables or disables allocation of anonymous guest RAM using
>>>>>>>> +        memfd_create.  Any associated memory-backend objects are created with
>>>>>>>> +        share=on.  The memfd-alloc default is off.
>>>>>>>> +
>>>>>>>>      ``aes-key-wrap=on|off``
>>>>>>>>          Enables or disables AES key wrapping support on s390-ccw hosts.
>>>>>>>>          This feature controls whether AES wrapping keys will be created
>>>>>>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>>>>>>>> index 3524c04..95e2b49 100644
>>>>>>>> --- a/softmmu/physmem.c
>>>>>>>> +++ b/softmmu/physmem.c
>>>>>>>> @@ -41,6 +41,7 @@
>>>>>>>>  #include "qemu/config-file.h"
>>>>>>>>  #include "qemu/error-report.h"
>>>>>>>>  #include "qemu/qemu-print.h"
>>>>>>>> +#include "qemu/memfd.h"
>>>>>>>>  #include "exec/memory.h"
>>>>>>>>  #include "exec/ioport.h"
>>>>>>>>  #include "sysemu/dma.h"
>>>>>>>> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>>>>>>      const bool shared = qemu_ram_is_shared(new_block);
>>>>>>>>      RAMBlock *block;
>>>>>>>>      RAMBlock *last_block = NULL;
>>>>>>>> +    struct MemoryRegion *mr = new_block->mr;
>>>>>>>>      ram_addr_t old_ram_size, new_ram_size;
>>>>>>>>      Error *err = NULL;
>>>>>>>> +    const char *name;
>>>>>>>> +    void *addr = 0;
>>>>>>>> +    size_t maxlen;
>>>>>>>> +    MachineState *ms = MACHINE(qdev_get_machine());
>>>>>>>>  
>>>>>>>>      old_ram_size = last_ram_page();
>>>>>>>>  
>>>>>>>>      qemu_mutex_lock_ramlist();
>>>>>>>> -    new_block->offset = find_ram_offset(new_block->max_length);
>>>>>>>> +    maxlen = new_block->max_length;
>>>>>>>> +    new_block->offset = find_ram_offset(maxlen);
>>>>>>>>  
>>>>>>>>      if (!new_block->host) {
>>>>>>>>          if (xen_enabled()) {
>>>>>>>> -            xen_ram_alloc(new_block->offset, new_block->max_length,
>>>>>>>> -                          new_block->mr, &err);
>>>>>>>> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
>>>>>>>>              if (err) {
>>>>>>>>                  error_propagate(errp, err);
>>>>>>>>                  qemu_mutex_unlock_ramlist();
>>>>>>>>                  return;
>>>>>>>>              }
>>>>>>>>          } else {
>>>>>>>> -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
>>>>>>>> -                                                  &new_block->mr->align,
>>>>>>>> -                                                  shared, noreserve);
>>>>>>>> -            if (!new_block->host) {
>>>>>>>> +            name = memory_region_name(mr);
>>>>>>>> +            if (ms->memfd_alloc) {
>>>>>>>> +                Object *parent = &mr->parent_obj;
>>>>>>>> +                int mfd = -1;          /* placeholder until next patch */
>>>>>>>> +                mr->align = QEMU_VMALLOC_ALIGN;
>>>>>>>> +                if (mfd < 0) {
>>>>>>>> +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
>>>>>>>> +                                            0, 0, 0, &err);
>>>>>>>> +                    if (mfd < 0) {
>>>>>>>> +                        return;
>>>>>>>> +                    }
>>>>>>>> +                }
>>>>>>>> +                qemu_set_cloexec(mfd);
>>>>>>>> +                /* The memory backend already set its desired flags. */
>>>>>>>> +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
>>>>>>>> +                    new_block->flags |= RAM_SHARED;
>>>>>>>> +                }
>>>>>>>> +                addr = file_ram_alloc(new_block, maxlen, mfd,
>>>>>>>> +                                      false, false, 0, errp);
>>>>>>>> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
>>>>>>>> +            } else {
>>>>>>>> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
>>>>>>>> +                                           shared, noreserve);
>>>>>>>> +            }
>>>>>>>> +
>>>>>>>> +            if (!addr) {
>>>>>>>>                  error_setg_errno(errp, errno,
>>>>>>>>                                   "cannot set up guest memory '%s'",
>>>>>>>> -                                 memory_region_name(new_block->mr));
>>>>>>>> +                                 name);
>>>>>>>>                  qemu_mutex_unlock_ramlist();
>>>>>>>>                  return;
>>>>>>>>              }
>>>>>>>> -            memory_try_enable_merging(new_block->host, new_block->max_length);
>>>>>>>> +            memory_try_enable_merging(addr, maxlen);
>>>>>>>> +            new_block->host = addr;
>>>>>>>>          }
>>>>>>>>      }
>>>>>>>>  
>>>>>>>> diff --git a/softmmu/vl.c b/softmmu/vl.c
>>>>>>>> index 620a1f1..ab3648a 100644
>>>>>>>> --- a/softmmu/vl.c
>>>>>>>> +++ b/softmmu/vl.c
>>>>>>>> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
>>>>>>>>          object_property_set_str(obj, "mem-path", path, &error_fatal);
>>>>>>>>      }
>>>>>>>>      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
>>>>>>>> +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
>>>>>>>>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
>>>>>>>>                                obj);
>>>>>>>>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
>>>>>>>> diff --git a/trace-events b/trace-events
>>>>>>>> index a637a61..770a9ac 100644
>>>>>>>> --- a/trace-events
>>>>>>>> +++ b/trace-events
>>>>>>>> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
>>>>>>>>  # accel/tcg/cputlb.c
>>>>>>>>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
>>>>>>>>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
>>>>>>>> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
>>>>>>>>  
>>>>>>>>  # gdbstub.c
>>>>>>>>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
>>>>>>>> diff --git a/util/qemu-config.c b/util/qemu-config.c
>>>>>>>> index 436ab63..3606e5c 100644
>>>>>>>> --- a/util/qemu-config.c
>>>>>>>> +++ b/util/qemu-config.c
>>>>>>>> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
>>>>>>>>              .type = QEMU_OPT_BOOL,
>>>>>>>>              .help = "enable/disable memory merge support",
>>>>>>>>          },{
>>>>>>>> +            .name = "memfd-alloc",
>>>>>>>> +            .type = QEMU_OPT_BOOL,
>>>>>>>> +            .help = "enable/disable memfd_create for anonymous memory",
>>>>>>>> +        },{
>>>>>>>>              .name = "usb",
>>>>>>>>              .type = QEMU_OPT_BOOL,
>>>>>>>>              .help = "Set on/off to enable/disable usb",
>>>>>>>> -- 
>>>>>>>> 1.8.3.1      
>>>>>>>
>>>>>>>    
>>>>>>     
>>>>  
>>>   
>>
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-10 17:28                 ` Steven Sistare
@ 2022-03-10 18:18                   ` Steven Sistare
  2022-03-11  9:42                     ` Igor Mammedov
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-03-10 18:18 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	David Hildenbrand, qemu-devel, Dr. David Alan Gilbert,
	Zheng Chuan, Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 3/10/2022 12:28 PM, Steven Sistare wrote:
> On 3/10/2022 11:00 AM, Igor Mammedov wrote:
>> On Thu, 10 Mar 2022 10:36:08 -0500
>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>
>>> On 3/8/2022 2:20 AM, Igor Mammedov wrote:
>>>> On Tue, 8 Mar 2022 01:50:11 -0500
>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>   
>>>>> On Mon, Mar 07, 2022 at 09:41:44AM -0500, Steven Sistare wrote:  
>>>>>> On 3/4/2022 5:41 AM, Igor Mammedov wrote:    
>>>>>>> On Thu, 3 Mar 2022 12:21:15 -0500
>>>>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>>     
>>>>>>>> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:    
>>>>>>>>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
>>>>>>>>> option is set.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>> ---
>>>>>>>>>  hw/core/machine.c   | 19 +++++++++++++++++++
>>>>>>>>>  include/hw/boards.h |  1 +
>>>>>>>>>  qemu-options.hx     |  6 ++++++
>>>>>>>>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>>>>>>>>>  softmmu/vl.c        |  1 +
>>>>>>>>>  trace-events        |  1 +
>>>>>>>>>  util/qemu-config.c  |  4 ++++
>>>>>>>>>  7 files changed, 70 insertions(+), 9 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>>>>>>>> index 53a99ab..7739d88 100644
>>>>>>>>> --- a/hw/core/machine.c
>>>>>>>>> +++ b/hw/core/machine.c
>>>>>>>>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>>>>>>>>      ms->mem_merge = value;
>>>>>>>>>  }
>>>>>>>>>  
>>>>>>>>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
>>>>>>>>> +{
>>>>>>>>> +    MachineState *ms = MACHINE(obj);
>>>>>>>>> +
>>>>>>>>> +    return ms->memfd_alloc;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
>>>>>>>>> +{
>>>>>>>>> +    MachineState *ms = MACHINE(obj);
>>>>>>>>> +
>>>>>>>>> +    ms->memfd_alloc = value;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>  static bool machine_get_usb(Object *obj, Error **errp)
>>>>>>>>>  {
>>>>>>>>>      MachineState *ms = MACHINE(obj);
>>>>>>>>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>>>>>>>>      object_class_property_set_description(oc, "mem-merge",
>>>>>>>>>          "Enable/disable memory merge support");
>>>>>>>>>  
>>>>>>>>> +    object_class_property_add_bool(oc, "memfd-alloc",
>>>>>>>>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
>>>>>>>>> +    object_class_property_set_description(oc, "memfd-alloc",
>>>>>>>>> +        "Enable/disable allocating anonymous memory using memfd_create");
>>>>>>>>> +
>>>>>>>>>      object_class_property_add_bool(oc, "usb",
>>>>>>>>>          machine_get_usb, machine_set_usb);
>>>>>>>>>      object_class_property_set_description(oc, "usb",
>>>>>>>>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>>>>>>>>> index 9c1c190..a57d7a0 100644
>>>>>>>>> --- a/include/hw/boards.h
>>>>>>>>> +++ b/include/hw/boards.h
>>>>>>>>> @@ -327,6 +327,7 @@ struct MachineState {
>>>>>>>>>      char *dt_compatible;
>>>>>>>>>      bool dump_guest_core;
>>>>>>>>>      bool mem_merge;
>>>>>>>>> +    bool memfd_alloc;
>>>>>>>>>      bool usb;
>>>>>>>>>      bool usb_disabled;
>>>>>>>>>      char *firmware;
>>>>>>>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>>>>>>>> index 7d47510..33c8173 100644
>>>>>>>>> --- a/qemu-options.hx
>>>>>>>>> +++ b/qemu-options.hx
>>>>>>>>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>>>>>>>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>>>>>>>>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>>>>>>>>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
>>>>>>>>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"      
>>>>>>>>
>>>>>>>> Question: are there any disadvantages associated with using
>>>>>>>> memfd_create? I guess we are using up an fd, but that seems minor.  Any
>>>>>>>> reason not to set to on by default? maybe with a fallback option to
>>>>>>>> disable that?    
>>>>>>
>>>>>> Old Linux host kernels, circa 4.1, do not support huge pages for shared memory.
>>>>>> Also, the tunable to enable huge pages for share memory is different than for
>>>>>> anon memory, so there could be performance loss if it is not set correctly.
>>>>>>     /sys/kernel/mm/transparent_hugepage/enabled
>>>>>>     vs
>>>>>>     /sys/kernel/mm/transparent_hugepage/shmem_enabled    
>>>>>
>>>>> I guess we can test this when launching the VM, and select
>>>>> a good default.
>>>>>  
>>>>>> It might make sense to use memfd_create by default for the secondary segments.    
>>>>>
>>>>> Well there's also KSM now you mention it.  
>>>>
>>>> then another quest, is there downside to always using memfd_create
>>>> without any knobs being involved?  
>>>
>>> Lower performance if small pages are used (but Michael suggests qemu could 
>>> automatically check the tunable and use anon memory instead)
>>>
>>> KSM (same page merging) is not supported for shared memory, so ram_block_add ->
>>> memory_try_enable_merging will not enable it.
>>>
>>> In both cases, I expect the degradation would be negligible if memfd_create is
>>> only automatically applied to the secondary segments, which are typically small.
>>> But, someone's secondary segment could be larger, and it is time consuming to
>>> prove innocence when someone claims your change caused their performance regression.
>>
>> Adding David as memory subsystem maintainer, maybe he will a better
>> idea instead of introducing global knob that would also magically alter 
>> backends' behavior despite of its their configured settings.
> 
> OK, in ram_block_add I can set the RAM_SHARED flag based on the memory-backend object's
> shared flag.  I already set the latter in create_default_memdev when memfd-alloc is
> specified.  With that change, we do not override configured settings.  Users can no longer
> use memory-backend-ram for CPR, and must change all memory-backend-ram to memory-backend-memfd
> in the command-line arguments.  That is fine.
> 
> With that change, are you OK with this patch?

Sorry, I mis-read my own code in ram_block_add.  The existing code is correct and does 
not alter any backend's behavior.   It only sets the shared flag when the ram is *not* 
being allocated for a backend:

                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
                    new_block->flags |= RAM_SHARED;
                }

- Steve

>>>>>>>> I am concerned that it's actually a kind of memory backend, this flag
>>>>>>>> seems to instead be closer to the deprecated mem-prealloc. E.g.
>>>>>>>> it does not work with a mem path, does it?    
>>>>>>
>>>>>> One can still define a memory backend with mempath to create the main ram segment,
>>>>>> though it must be some form of shared to work with live update.  Indeed, I would 
>>>>>> expect most users to specify an explicit memory backend for it.  The secondary
>>>>>> segments would still use memfd_create.
>>>>>>     
>>>>>>> (mem path and mem-prealloc are transparently aliased to used memory backend
>>>>>>> if I recall it right.)
>>>>>>>
>>>>>>> Steve,
>>>>>>>
>>>>>>> For allocating guest RAM, we switched exclusively to using memory-backends
>>>>>>> including initial guest RAM (-m size option) and we have hostmem-memfd
>>>>>>> that uses memfd_create() and I'd rather avoid adding random knobs to machine
>>>>>>> for tweaking how RAM should be allocated, we have memory backends for this,
>>>>>>> so this patch begs the question: why hostmem-memfd is not sufficient?
>>>>>>> (patch description is rather lacking on rationale behind the patch)    
>>>>>>
>>>>>> There is currently no way to specify memory backends for the secondary memory
>>>>>> segments (vram, roms, etc), and IMO it would be onerous to specify a backend for
>>>>>> each of them.  On x86_64, these include pc.bios, vga.vram, pc.rom, vga.rom,
>>>>>> /rom@etc/acpi/tables, /rom@etc/table-loader, /rom@etc/acpi/rsdp.
>>
>> MemoryRegion is not the only place where state is stored.
>> If we only talk about fwcfg entries state, it can also reference
>> plain malloced memory allocated elsewhere or make a deep copy internally.
>> Similarly devices also may store state outside of RamBlock framework.
>>
>> How are you dealing with that?
>>
>>>>>>
>>>>>> - Steve
>>>>>>     
>>>>>>>>>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
>>>>>>>>>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
>>>>>>>>>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
>>>>>>>>> @@ -76,6 +77,11 @@ SRST
>>>>>>>>>          supported by the host, de-duplicates identical memory pages
>>>>>>>>>          among VMs instances (enabled by default).
>>>>>>>>>  
>>>>>>>>> +    ``memfd-alloc=on|off``
>>>>>>>>> +        Enables or disables allocation of anonymous guest RAM using
>>>>>>>>> +        memfd_create.  Any associated memory-backend objects are created with
>>>>>>>>> +        share=on.  The memfd-alloc default is off.
>>>>>>>>> +
>>>>>>>>>      ``aes-key-wrap=on|off``
>>>>>>>>>          Enables or disables AES key wrapping support on s390-ccw hosts.
>>>>>>>>>          This feature controls whether AES wrapping keys will be created
>>>>>>>>> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
>>>>>>>>> index 3524c04..95e2b49 100644
>>>>>>>>> --- a/softmmu/physmem.c
>>>>>>>>> +++ b/softmmu/physmem.c
>>>>>>>>> @@ -41,6 +41,7 @@
>>>>>>>>>  #include "qemu/config-file.h"
>>>>>>>>>  #include "qemu/error-report.h"
>>>>>>>>>  #include "qemu/qemu-print.h"
>>>>>>>>> +#include "qemu/memfd.h"
>>>>>>>>>  #include "exec/memory.h"
>>>>>>>>>  #include "exec/ioport.h"
>>>>>>>>>  #include "sysemu/dma.h"
>>>>>>>>> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>>>>>>>>      const bool shared = qemu_ram_is_shared(new_block);
>>>>>>>>>      RAMBlock *block;
>>>>>>>>>      RAMBlock *last_block = NULL;
>>>>>>>>> +    struct MemoryRegion *mr = new_block->mr;
>>>>>>>>>      ram_addr_t old_ram_size, new_ram_size;
>>>>>>>>>      Error *err = NULL;
>>>>>>>>> +    const char *name;
>>>>>>>>> +    void *addr = 0;
>>>>>>>>> +    size_t maxlen;
>>>>>>>>> +    MachineState *ms = MACHINE(qdev_get_machine());
>>>>>>>>>  
>>>>>>>>>      old_ram_size = last_ram_page();
>>>>>>>>>  
>>>>>>>>>      qemu_mutex_lock_ramlist();
>>>>>>>>> -    new_block->offset = find_ram_offset(new_block->max_length);
>>>>>>>>> +    maxlen = new_block->max_length;
>>>>>>>>> +    new_block->offset = find_ram_offset(maxlen);
>>>>>>>>>  
>>>>>>>>>      if (!new_block->host) {
>>>>>>>>>          if (xen_enabled()) {
>>>>>>>>> -            xen_ram_alloc(new_block->offset, new_block->max_length,
>>>>>>>>> -                          new_block->mr, &err);
>>>>>>>>> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
>>>>>>>>>              if (err) {
>>>>>>>>>                  error_propagate(errp, err);
>>>>>>>>>                  qemu_mutex_unlock_ramlist();
>>>>>>>>>                  return;
>>>>>>>>>              }
>>>>>>>>>          } else {
>>>>>>>>> -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
>>>>>>>>> -                                                  &new_block->mr->align,
>>>>>>>>> -                                                  shared, noreserve);
>>>>>>>>> -            if (!new_block->host) {
>>>>>>>>> +            name = memory_region_name(mr);
>>>>>>>>> +            if (ms->memfd_alloc) {
>>>>>>>>> +                Object *parent = &mr->parent_obj;
>>>>>>>>> +                int mfd = -1;          /* placeholder until next patch */
>>>>>>>>> +                mr->align = QEMU_VMALLOC_ALIGN;
>>>>>>>>> +                if (mfd < 0) {
>>>>>>>>> +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
>>>>>>>>> +                                            0, 0, 0, &err);
>>>>>>>>> +                    if (mfd < 0) {
>>>>>>>>> +                        return;
>>>>>>>>> +                    }
>>>>>>>>> +                }
>>>>>>>>> +                qemu_set_cloexec(mfd);
>>>>>>>>> +                /* The memory backend already set its desired flags. */
>>>>>>>>> +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
>>>>>>>>> +                    new_block->flags |= RAM_SHARED;
>>>>>>>>> +                }
>>>>>>>>> +                addr = file_ram_alloc(new_block, maxlen, mfd,
>>>>>>>>> +                                      false, false, 0, errp);
>>>>>>>>> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
>>>>>>>>> +            } else {
>>>>>>>>> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
>>>>>>>>> +                                           shared, noreserve);
>>>>>>>>> +            }
>>>>>>>>> +
>>>>>>>>> +            if (!addr) {
>>>>>>>>>                  error_setg_errno(errp, errno,
>>>>>>>>>                                   "cannot set up guest memory '%s'",
>>>>>>>>> -                                 memory_region_name(new_block->mr));
>>>>>>>>> +                                 name);
>>>>>>>>>                  qemu_mutex_unlock_ramlist();
>>>>>>>>>                  return;
>>>>>>>>>              }
>>>>>>>>> -            memory_try_enable_merging(new_block->host, new_block->max_length);
>>>>>>>>> +            memory_try_enable_merging(addr, maxlen);
>>>>>>>>> +            new_block->host = addr;
>>>>>>>>>          }
>>>>>>>>>      }
>>>>>>>>>  
>>>>>>>>> diff --git a/softmmu/vl.c b/softmmu/vl.c
>>>>>>>>> index 620a1f1..ab3648a 100644
>>>>>>>>> --- a/softmmu/vl.c
>>>>>>>>> +++ b/softmmu/vl.c
>>>>>>>>> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
>>>>>>>>>          object_property_set_str(obj, "mem-path", path, &error_fatal);
>>>>>>>>>      }
>>>>>>>>>      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
>>>>>>>>> +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
>>>>>>>>>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
>>>>>>>>>                                obj);
>>>>>>>>>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
>>>>>>>>> diff --git a/trace-events b/trace-events
>>>>>>>>> index a637a61..770a9ac 100644
>>>>>>>>> --- a/trace-events
>>>>>>>>> +++ b/trace-events
>>>>>>>>> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
>>>>>>>>>  # accel/tcg/cputlb.c
>>>>>>>>>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
>>>>>>>>>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
>>>>>>>>> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
>>>>>>>>>  
>>>>>>>>>  # gdbstub.c
>>>>>>>>>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
>>>>>>>>> diff --git a/util/qemu-config.c b/util/qemu-config.c
>>>>>>>>> index 436ab63..3606e5c 100644
>>>>>>>>> --- a/util/qemu-config.c
>>>>>>>>> +++ b/util/qemu-config.c
>>>>>>>>> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
>>>>>>>>>              .type = QEMU_OPT_BOOL,
>>>>>>>>>              .help = "enable/disable memory merge support",
>>>>>>>>>          },{
>>>>>>>>> +            .name = "memfd-alloc",
>>>>>>>>> +            .type = QEMU_OPT_BOOL,
>>>>>>>>> +            .help = "enable/disable memfd_create for anonymous memory",
>>>>>>>>> +        },{
>>>>>>>>>              .name = "usb",
>>>>>>>>>              .type = QEMU_OPT_BOOL,
>>>>>>>>>              .help = "Set on/off to enable/disable usb",
>>>>>>>>> -- 
>>>>>>>>> 1.8.3.1      
>>>>>>>>
>>>>>>>>    
>>>>>>>     
>>>>>  
>>>>   
>>>
>>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2022-03-10 15:00     ` Steven Sistare
@ 2022-03-10 18:35       ` Alex Williamson
  2022-03-10 19:55         ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2022-03-10 18:35 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Paolo Bonzini,
	Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On Thu, 10 Mar 2022 10:00:29 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 3/7/2022 5:16 PM, Alex Williamson wrote:
> > On Wed, 22 Dec 2021 11:05:24 -0800
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> >> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
> >>  {
> >>      int iommu_type, ret;
> >>  
> >> +    /*
> >> +     * If container is reused, just set its type and skip the ioctls, as the
> >> +     * container and group are already configured in the kernel.
> >> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> >> +     * If you ever add new types or spapr cpr support, kind reader, please
> >> +     * also implement VFIO_GET_IOMMU.
> >> +     */  
> > 
> > VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
> > problem is that vfio_iommu_type1_check_extension() should actually base
> > some of the details on the instantiated vfio_iommu, ex.
> > 
> > 	switch (arg) {
> > 	case VFIO_TYPE1_IOMMU:
> > 		return (iommu && iommu->v2) ? 0 : 1;
> > 	case VFIO_UNMAP_ALL:
> > 	case VFIO_UPDATE_VADDR:
> > 	case VFIO_TYPE1v2_IOMMU:
> > 		return (iommu && !iommu->v2) ? 0 : 1;
> > 	case VFIO_TYPE1_NESTING_IOMMU:
> > 		return (iommu && !iommu->nesting) ? 0 : 1;
> > 	...
> > 
> > We can't support v1 if we've already set a v2 container and vice versa.
> > There are probably some corner cases and compatibility to puzzle
> > through, but I wouldn't think we need a new ioctl to check this.  
> 
> That change makes sense, and may be worth while on its own merits, but does not
> solve the problem, which is that qemu will not be able to infer iommu_type in
> the future if new types are added.  Given:
>   * a new kernel supporting shiny new TYPE1v3
>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it has no
>     knowledge of v3
>   * live update to qemu which supports v3, which will be listed first in vfio_get_iommu_type.
> 
> Then the new qemu has no way to infer iommu_type.  If it has code that makes 
> decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in vfio_container_region_add,
> or vfio_ram_block_discard_disable, or ...), then new qemu cannot function correctly.
> 
> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the same time our
> hypothetical future developer adds TYPE1v3.  The current inability to ask the kernel
> "what are you" about a container feels like a bug to me.

Hmm, I don't think the kernel has an innate responsibility to remind
the user of a configuration that they've already made.  But I also
don't follow your TYPE1v3 example.  If we added such a type, I imagine
the switch would change to:

	switch (arg)
	case VFIO_TYPE1_IOMMU:
		return (iommu && (iommu->v2 || iommu->v3) ? 0 : 1;
	case VFIO_UNMAP_ALL:
	case VFIO_UPDATE_VADDR:
		return (iommu && !(iommu-v2 || iommu->v3) ? 0 : 1;
	case VFIO_TYPE1v2_IOMMU:
		return (iommu && !iommu-v2) ? 0 : 1;
	case VFIO_TYPE1v3_IOMMU:
		return (iommu && !iommu->v3) ? 0 : 1;
	...

How would that not allow exactly the scenario described, ie. new QEMU
can see that old QEMU left it a v2 IOMMU.

...
> >> +
> >> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
> >> +{
> >> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
> >> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
> >> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
> >> +                         "or VFIO_UNMAP_ALL");
> >> +        return false;
> >> +    } else {
> >> +        return true;
> >> +    }
> >> +}  
> > 
> > We could have minimally used this where we assumed a TYPE1v2 container.  
> 
> Are you referring to vfio_init_container (discussed above)?
> Are you suggesting that, if reused is true, we validate those extensions are
> present, before setting iommu_type = VFIO_TYPE1v2_IOMMU?

Yeah, though maybe it's not sufficiently precise to be worthwhile given
the current kernel behavior.

> >> +
> >> +/*
> >> + * Verify that all containers support CPR, and unmap all dma vaddr's.
> >> + */
> >> +int vfio_cpr_save(Error **errp)
> >> +{
> >> +    ERRP_GUARD();
> >> +    VFIOAddressSpace *space;
> >> +    VFIOContainer *container;
> >> +
> >> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
> >> +        QLIST_FOREACH(container, &space->containers, next) {
> >> +            if (!vfio_is_cpr_capable(container, errp)) {
> >> +                return -1;
> >> +            }
> >> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
> >> +                return -1;
> >> +            }
> >> +        }
> >> +    }  
> > 
> > Seems like we ought to validate all containers support CPR before we
> > start blasting vaddrs.  It looks like qmp_cpr_exec() simply returns if
> > this fails with no attempt to unwind!  Yikes!  Wouldn't we need to
> > replay the listeners to remap the vaddrs in case of an error?  
> 
> Already done.  I refactored that code into a separate patch to tease out some
> of the complexity:
>   vfio-pci: recover from unmap-all-vaddr failure

Sorry, didn't get to that one til after I'd sent comments here.

...
> >> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
> >> index a4da24e..a4007cf 100644
> >> --- a/include/migration/cpr.h
> >> +++ b/include/migration/cpr.h
> >> @@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
> >>  int cpr_state_load(Error **errp);
> >>  void cpr_state_print(void);
> >>  
> >> +int cpr_vfio_save(Error **errp);
> >> +int cpr_vfio_load(Error **errp);
> >> +
> >>  #endif
> >> diff --git a/migration/cpr.c b/migration/cpr.c
> >> index 37eca66..cee82cf 100644
> >> --- a/migration/cpr.c
> >> +++ b/migration/cpr.c
> >> @@ -7,6 +7,7 @@
> >>  
> >>  #include "qemu/osdep.h"
> >>  #include "exec/memory.h"
> >> +#include "hw/vfio/vfio-common.h"
> >>  #include "io/channel-buffer.h"
> >>  #include "io/channel-file.h"
> >>  #include "migration.h"
> >> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
> >>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
> >>          return;
> >>      }
> >> -
> >> +    if (cpr_vfio_save(errp)) {
> >> +        return;
> >> +    }  
> > 
> > Why is vfio so unique that it needs separate handlers versus other
> > devices?  Thanks,  
> 
> In earlier patches these functions fiddled with more objects, but at this point
> they are simple enough to convert to pre_save and post_load vmstate handlers for
> the container and group objects.  However, we would still need to call special 
> functons for vfio from qmp_cpr_exec:
> 
>   * validate all containers support CPR before we start blasting vaddrs
>     However, I could validate all, in every call to pre_save for each container.
>     That would be less efficient, but fits the vmstate model.

Would it be a better option to mirror the migration blocker support, ie.
any device that doesn't support cpr registers a blocker and generic
code only needs to keep track of whether any blockers are registered.

>   * restore all vaddr's if qemu_save_device_state fails.
>     However, I could recover for all containers inside pre_save when one container fails.
>     Feels strange touching all objects in a function for one, but there is no real
>     downside.

I'm not as familiar as I should be with migration callbacks, thanks to
mostly not supporting it for vfio devices, but it seems strange to me
that there's no existing callback or notifier per device to propagate
save failure.  Do we not at least get some sort of resume callback in
that case?

As an alternative, maybe each container could register a vm change
handler that would trigger reloading vaddrs if we move to a running
state and a flag on the container indicates vaddrs were invalidated?
Thanks,

Alex



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2022-03-10 18:35       ` Alex Williamson
@ 2022-03-10 19:55         ` Steven Sistare
  2022-03-10 22:30           ` Alex Williamson
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-03-10 19:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Paolo Bonzini,
	Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 3/10/2022 1:35 PM, Alex Williamson wrote:
> On Thu, 10 Mar 2022 10:00:29 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 3/7/2022 5:16 PM, Alex Williamson wrote:
>>> On Wed, 22 Dec 2021 11:05:24 -0800
>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
>>>>  {
>>>>      int iommu_type, ret;
>>>>  
>>>> +    /*
>>>> +     * If container is reused, just set its type and skip the ioctls, as the
>>>> +     * container and group are already configured in the kernel.
>>>> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
>>>> +     * If you ever add new types or spapr cpr support, kind reader, please
>>>> +     * also implement VFIO_GET_IOMMU.
>>>> +     */  
>>>
>>> VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
>>> problem is that vfio_iommu_type1_check_extension() should actually base
>>> some of the details on the instantiated vfio_iommu, ex.
>>>
>>> 	switch (arg) {
>>> 	case VFIO_TYPE1_IOMMU:
>>> 		return (iommu && iommu->v2) ? 0 : 1;
>>> 	case VFIO_UNMAP_ALL:
>>> 	case VFIO_UPDATE_VADDR:
>>> 	case VFIO_TYPE1v2_IOMMU:
>>> 		return (iommu && !iommu->v2) ? 0 : 1;
>>> 	case VFIO_TYPE1_NESTING_IOMMU:
>>> 		return (iommu && !iommu->nesting) ? 0 : 1;
>>> 	...
>>>
>>> We can't support v1 if we've already set a v2 container and vice versa.
>>> There are probably some corner cases and compatibility to puzzle
>>> through, but I wouldn't think we need a new ioctl to check this.  
>>
>> That change makes sense, and may be worth while on its own merits, but does not
>> solve the problem, which is that qemu will not be able to infer iommu_type in
>> the future if new types are added.  Given:
>>   * a new kernel supporting shiny new TYPE1v3
>>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it has no
>>     knowledge of v3
>>   * live update to qemu which supports v3, which will be listed first in vfio_get_iommu_type.
>>
>> Then the new qemu has no way to infer iommu_type.  If it has code that makes 
>> decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in vfio_container_region_add,
>> or vfio_ram_block_discard_disable, or ...), then new qemu cannot function correctly.
>>
>> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the same time our
>> hypothetical future developer adds TYPE1v3.  The current inability to ask the kernel
>> "what are you" about a container feels like a bug to me.
> 
> Hmm, I don't think the kernel has an innate responsibility to remind
> the user of a configuration that they've already made.  

No, but it can make userland cleaner.  For example, CRIU checkpoint/restart queries
the kernel to save process state, and later makes syscalls to restore it.  Where the
kernel does not export sufficient information, CRIU must provide interpose libraries
so it can remember state internally on its way to the kernel.  And applications must
link against the interpose libraries.

> But I also
> don't follow your TYPE1v3 example.  If we added such a type, I imagine
> the switch would change to:
> 
> 	switch (arg)
> 	case VFIO_TYPE1_IOMMU:
> 		return (iommu && (iommu->v2 || iommu->v3) ? 0 : 1;
> 	case VFIO_UNMAP_ALL:
> 	case VFIO_UPDATE_VADDR:
> 		return (iommu && !(iommu-v2 || iommu->v3) ? 0 : 1;
> 	case VFIO_TYPE1v2_IOMMU:
> 		return (iommu && !iommu-v2) ? 0 : 1;
> 	case VFIO_TYPE1v3_IOMMU:
> 		return (iommu && !iommu->v3) ? 0 : 1;
> 	...
> 
> How would that not allow exactly the scenario described, ie. new QEMU
> can see that old QEMU left it a v2 IOMMU.

OK, that works as long as the switch returns true for all options before
VFIO_SET_IOMMU is called.  I guess your test for "iommu" above does that,
which I missed before.  If we are on the same page now, I will modify my
comment "please also implement VFIO_GET_IOMMU".

> ...
>>>> +
>>>> +bool vfio_is_cpr_capable(VFIOContainer *container, Error **errp)
>>>> +{
>>>> +    if (!ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UPDATE_VADDR) ||
>>>> +        !ioctl(container->fd, VFIO_CHECK_EXTENSION, VFIO_UNMAP_ALL)) {
>>>> +        error_setg(errp, "VFIO container does not support VFIO_UPDATE_VADDR "
>>>> +                         "or VFIO_UNMAP_ALL");
>>>> +        return false;
>>>> +    } else {
>>>> +        return true;
>>>> +    }
>>>> +}  
>>>
>>> We could have minimally used this where we assumed a TYPE1v2 container.  
>>
>> Are you referring to vfio_init_container (discussed above)?
>> Are you suggesting that, if reused is true, we validate those extensions are
>> present, before setting iommu_type = VFIO_TYPE1v2_IOMMU?
> 
> Yeah, though maybe it's not sufficiently precise to be worthwhile given
> the current kernel behavior.
> 
>>>> +
>>>> +/*
>>>> + * Verify that all containers support CPR, and unmap all dma vaddr's.
>>>> + */
>>>> +int vfio_cpr_save(Error **errp)
>>>> +{
>>>> +    ERRP_GUARD();
>>>> +    VFIOAddressSpace *space;
>>>> +    VFIOContainer *container;
>>>> +
>>>> +    QLIST_FOREACH(space, &vfio_address_spaces, list) {
>>>> +        QLIST_FOREACH(container, &space->containers, next) {
>>>> +            if (!vfio_is_cpr_capable(container, errp)) {
>>>> +                return -1;
>>>> +            }
>>>> +            if (vfio_dma_unmap_vaddr_all(container, errp)) {
>>>> +                return -1;
>>>> +            }
>>>> +        }
>>>> +    }  
>>>
>>> Seems like we ought to validate all containers support CPR before we
>>> start blasting vaddrs.  It looks like qmp_cpr_exec() simply returns if
>>> this fails with no attempt to unwind!  Yikes!  Wouldn't we need to
>>> replay the listeners to remap the vaddrs in case of an error?  
>>
>> Already done.  I refactored that code into a separate patch to tease out some
>> of the complexity:
>>   vfio-pci: recover from unmap-all-vaddr failure
> 
> Sorry, didn't get to that one til after I'd sent comments here.
> 
> ...
>>>> diff --git a/include/migration/cpr.h b/include/migration/cpr.h
>>>> index a4da24e..a4007cf 100644
>>>> --- a/include/migration/cpr.h
>>>> +++ b/include/migration/cpr.h
>>>> @@ -25,4 +25,7 @@ int cpr_state_save(Error **errp);
>>>>  int cpr_state_load(Error **errp);
>>>>  void cpr_state_print(void);
>>>>  
>>>> +int cpr_vfio_save(Error **errp);
>>>> +int cpr_vfio_load(Error **errp);
>>>> +
>>>>  #endif
>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>> index 37eca66..cee82cf 100644
>>>> --- a/migration/cpr.c
>>>> +++ b/migration/cpr.c
>>>> @@ -7,6 +7,7 @@
>>>>  
>>>>  #include "qemu/osdep.h"
>>>>  #include "exec/memory.h"
>>>> +#include "hw/vfio/vfio-common.h"
>>>>  #include "io/channel-buffer.h"
>>>>  #include "io/channel-file.h"
>>>>  #include "migration.h"
>>>> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>>>>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>>>>          return;
>>>>      }
>>>> -
>>>> +    if (cpr_vfio_save(errp)) {
>>>> +        return;
>>>> +    }  
>>>
>>> Why is vfio so unique that it needs separate handlers versus other
>>> devices?  Thanks,  
>>
>> In earlier patches these functions fiddled with more objects, but at this point
>> they are simple enough to convert to pre_save and post_load vmstate handlers for
>> the container and group objects.  However, we would still need to call special 
>> functons for vfio from qmp_cpr_exec:
>>
>>   * validate all containers support CPR before we start blasting vaddrs
>>     However, I could validate all, in every call to pre_save for each container.
>>     That would be less efficient, but fits the vmstate model.
> 
> Would it be a better option to mirror the migration blocker support, ie.
> any device that doesn't support cpr registers a blocker and generic
> code only needs to keep track of whether any blockers are registered.

We cannot specifically use migrate_add_blocker(), because it is checked in
the migration specific function migrate_prepare(), in a layer of functions 
above the simpler qemu_save_device_state() used in cpr.  But yes, we could
do something similar for vfio.  Increment a global counter in vfio_realize
if the container does not support cpr, and decrement it when the container is
destroyed.  pre_save could just check the counter.

>>   * restore all vaddr's if qemu_save_device_state fails.
>>     However, I could recover for all containers inside pre_save when one container fails.
>>     Feels strange touching all objects in a function for one, but there is no real
>>     downside.
> 
> I'm not as familiar as I should be with migration callbacks, thanks to
> mostly not supporting it for vfio devices, but it seems strange to me
> that there's no existing callback or notifier per device to propagate
> save failure.  Do we not at least get some sort of resume callback in
> that case?

We do not:
    struct VMStateDescription {
        int (*pre_load)(void *opaque);
        int (*post_load)(void *opaque, int version_id);
        int (*pre_save)(void *opaque);
        int (*post_save)(void *opaque);

The handler returns an error, which stops further saves and is propagated back
to the top level caller qemu_save_device_state().

The vast majority of handlers do not have side effects, with no need to unwind 
anything on failure.

This raises another point.  If pre_save succeeds for all the containers,
but fails for some non-vfio object, then the overall operation is abandoned,
but we do not restore the vaddr's.  To plug that hole, we need to call the
unwind code from qmp_cpr_save, or implement your alternative below.

> As an alternative, maybe each container could register a vm change
> handler that would trigger reloading vaddrs if we move to a running
> state and a flag on the container indicates vaddrs were invalidated?
> Thanks,

That works and is modular, but I dislike that it adds checks on the
happy path for a case that will rarely happen, and it pushes recovery from
failure further away from the original failure, which would make debugging
cascading failures more difficult.

- Steve


 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2022-03-10 19:55         ` Steven Sistare
@ 2022-03-10 22:30           ` Alex Williamson
  2022-03-11 16:22             ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Alex Williamson @ 2022-03-10 22:30 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Paolo Bonzini,
	Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On Thu, 10 Mar 2022 14:55:50 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 3/10/2022 1:35 PM, Alex Williamson wrote:
> > On Thu, 10 Mar 2022 10:00:29 -0500
> > Steven Sistare <steven.sistare@oracle.com> wrote:
> >   
> >> On 3/7/2022 5:16 PM, Alex Williamson wrote:  
> >>> On Wed, 22 Dec 2021 11:05:24 -0800
> >>> Steve Sistare <steven.sistare@oracle.com> wrote:  
> >>>> @@ -1878,6 +1908,18 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
> >>>>  {
> >>>>      int iommu_type, ret;
> >>>>  
> >>>> +    /*
> >>>> +     * If container is reused, just set its type and skip the ioctls, as the
> >>>> +     * container and group are already configured in the kernel.
> >>>> +     * VFIO_TYPE1v2_IOMMU is the only type that supports reuse/cpr.
> >>>> +     * If you ever add new types or spapr cpr support, kind reader, please
> >>>> +     * also implement VFIO_GET_IOMMU.
> >>>> +     */    
> >>>
> >>> VFIO_CHECK_EXTENSION should be able to tell us this, right?  Maybe the
> >>> problem is that vfio_iommu_type1_check_extension() should actually base
> >>> some of the details on the instantiated vfio_iommu, ex.
> >>>
> >>> 	switch (arg) {
> >>> 	case VFIO_TYPE1_IOMMU:
> >>> 		return (iommu && iommu->v2) ? 0 : 1;
> >>> 	case VFIO_UNMAP_ALL:
> >>> 	case VFIO_UPDATE_VADDR:
> >>> 	case VFIO_TYPE1v2_IOMMU:
> >>> 		return (iommu && !iommu->v2) ? 0 : 1;
> >>> 	case VFIO_TYPE1_NESTING_IOMMU:
> >>> 		return (iommu && !iommu->nesting) ? 0 : 1;
> >>> 	...
> >>>
> >>> We can't support v1 if we've already set a v2 container and vice versa.
> >>> There are probably some corner cases and compatibility to puzzle
> >>> through, but I wouldn't think we need a new ioctl to check this.    
> >>
> >> That change makes sense, and may be worth while on its own merits, but does not
> >> solve the problem, which is that qemu will not be able to infer iommu_type in
> >> the future if new types are added.  Given:
> >>   * a new kernel supporting shiny new TYPE1v3
> >>   * old qemu starts and selects TYPE1v2 in vfio_get_iommu_type because it has no
> >>     knowledge of v3
> >>   * live update to qemu which supports v3, which will be listed first in vfio_get_iommu_type.
> >>
> >> Then the new qemu has no way to infer iommu_type.  If it has code that makes 
> >> decisions based on iommu_type (eg, VFIO_SPAPR_TCE_v2_IOMMU in vfio_container_region_add,
> >> or vfio_ram_block_discard_disable, or ...), then new qemu cannot function correctly.
> >>
> >> For that, VFIO_GET_IOMMU would be the cleanest solution, to be added the same time our
> >> hypothetical future developer adds TYPE1v3.  The current inability to ask the kernel
> >> "what are you" about a container feels like a bug to me.  
> > 
> > Hmm, I don't think the kernel has an innate responsibility to remind
> > the user of a configuration that they've already made.    
> 
> No, but it can make userland cleaner.  For example, CRIU checkpoint/restart queries
> the kernel to save process state, and later makes syscalls to restore it.  Where the
> kernel does not export sufficient information, CRIU must provide interpose libraries
> so it can remember state internally on its way to the kernel.  And applications must
> link against the interpose libraries.

The counter argument is that it bloats the kernel to add interfaces to
report back things that userspace should already know.  Which has more
exploit vectors, a new kernel ioctl or yet another userspace library?
 
> > But I also
> > don't follow your TYPE1v3 example.  If we added such a type, I imagine
> > the switch would change to:
> > 
> > 	switch (arg)
> > 	case VFIO_TYPE1_IOMMU:
> > 		return (iommu && (iommu->v2 || iommu->v3) ? 0 : 1;
> > 	case VFIO_UNMAP_ALL:
> > 	case VFIO_UPDATE_VADDR:
> > 		return (iommu && !(iommu-v2 || iommu->v3) ? 0 : 1;
> > 	case VFIO_TYPE1v2_IOMMU:
> > 		return (iommu && !iommu-v2) ? 0 : 1;
> > 	case VFIO_TYPE1v3_IOMMU:
> > 		return (iommu && !iommu->v3) ? 0 : 1;
> > 	...
> > 
> > How would that not allow exactly the scenario described, ie. new QEMU
> > can see that old QEMU left it a v2 IOMMU.  
> 
> OK, that works as long as the switch returns true for all options before
> VFIO_SET_IOMMU is called.  I guess your test for "iommu" above does that,
> which I missed before.  If we are on the same page now, I will modify my
> comment "please also implement VFIO_GET_IOMMU".

Yes, in the above all extensions are supported before the container
type is set, then once set only the relevant extensions are available.

...
> >>>> diff --git a/migration/cpr.c b/migration/cpr.c
> >>>> index 37eca66..cee82cf 100644
> >>>> --- a/migration/cpr.c
> >>>> +++ b/migration/cpr.c
> >>>> @@ -7,6 +7,7 @@
> >>>>  
> >>>>  #include "qemu/osdep.h"
> >>>>  #include "exec/memory.h"
> >>>> +#include "hw/vfio/vfio-common.h"
> >>>>  #include "io/channel-buffer.h"
> >>>>  #include "io/channel-file.h"
> >>>>  #include "migration.h"
> >>>> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
> >>>>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
> >>>>          return;
> >>>>      }
> >>>> -
> >>>> +    if (cpr_vfio_save(errp)) {
> >>>> +        return;
> >>>> +    }    
> >>>
> >>> Why is vfio so unique that it needs separate handlers versus other
> >>> devices?  Thanks,    
> >>
> >> In earlier patches these functions fiddled with more objects, but at this point
> >> they are simple enough to convert to pre_save and post_load vmstate handlers for
> >> the container and group objects.  However, we would still need to call special 
> >> functons for vfio from qmp_cpr_exec:
> >>
> >>   * validate all containers support CPR before we start blasting vaddrs
> >>     However, I could validate all, in every call to pre_save for each container.
> >>     That would be less efficient, but fits the vmstate model.  
> > 
> > Would it be a better option to mirror the migration blocker support, ie.
> > any device that doesn't support cpr registers a blocker and generic
> > code only needs to keep track of whether any blockers are registered.  
> 
> We cannot specifically use migrate_add_blocker(), because it is checked in
> the migration specific function migrate_prepare(), in a layer of functions 
> above the simpler qemu_save_device_state() used in cpr.  But yes, we could
> do something similar for vfio.  Increment a global counter in vfio_realize
> if the container does not support cpr, and decrement it when the container is
> destroyed.  pre_save could just check the counter.

Right, not suggesting to piggyback on migrate_add_blocker() only to use
a similar mechanism.  Only drivers that can't support cpr need register
a blocker but testing for blockers is done generically, not just for
vfio devices.

> >>   * restore all vaddr's if qemu_save_device_state fails.
> >>     However, I could recover for all containers inside pre_save when one container fails.
> >>     Feels strange touching all objects in a function for one, but there is no real
> >>     downside.  
> > 
> > I'm not as familiar as I should be with migration callbacks, thanks to
> > mostly not supporting it for vfio devices, but it seems strange to me
> > that there's no existing callback or notifier per device to propagate
> > save failure.  Do we not at least get some sort of resume callback in
> > that case?  
> 
> We do not:
>     struct VMStateDescription {
>         int (*pre_load)(void *opaque);
>         int (*post_load)(void *opaque, int version_id);
>         int (*pre_save)(void *opaque);
>         int (*post_save)(void *opaque);
> 
> The handler returns an error, which stops further saves and is propagated back
> to the top level caller qemu_save_device_state().
> 
> The vast majority of handlers do not have side effects, with no need to unwind 
> anything on failure.
> 
> This raises another point.  If pre_save succeeds for all the containers,
> but fails for some non-vfio object, then the overall operation is abandoned,
> but we do not restore the vaddr's.  To plug that hole, we need to call the
> unwind code from qmp_cpr_save, or implement your alternative below.

We're trying to reuse migration interfaces, are we also triggering
migration state change notifiers?  ie.
MIGRATION_STATUS_{CANCELLING,CANCELLED,FAILED}  We already hook vfio
devices supporting migration into that notifier to tell the driver to
move the device back to the running state on failure, which seems a bit
unique to vfio devices.  Containers could maybe register their own
callbacks.

> > As an alternative, maybe each container could register a vm change
> > handler that would trigger reloading vaddrs if we move to a running
> > state and a flag on the container indicates vaddrs were invalidated?
> > Thanks,  
> 
> That works and is modular, but I dislike that it adds checks on the
> happy path for a case that will rarely happen, and it pushes recovery from
> failure further away from the original failure, which would make debugging
> cascading failures more difficult.

Would using the migration notifier move us sufficiently closer to the
failure point?  Otherwise I think you're talking about unwinding all
the containers when any one fails, where you didn't like that object
overreach, or maybe adding an optional callback... but I wonder if the
above notifier essentially already does that.

In any case, I think we have options to either implement new or use
existing notifier-like functionality to avoid all these vfio specific
callouts.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-10 18:18                   ` Steven Sistare
@ 2022-03-11  9:42                     ` Igor Mammedov
  2022-03-29 17:43                       ` Steven Sistare
  0 siblings, 1 reply; 96+ messages in thread
From: Igor Mammedov @ 2022-03-11  9:42 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	David Hildenbrand, qemu-devel, Dr. David Alan Gilbert,
	Zheng Chuan, Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On Thu, 10 Mar 2022 13:18:35 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 3/10/2022 12:28 PM, Steven Sistare wrote:
> > On 3/10/2022 11:00 AM, Igor Mammedov wrote:  
> >> On Thu, 10 Mar 2022 10:36:08 -0500
> >> Steven Sistare <steven.sistare@oracle.com> wrote:
> >>  
> >>> On 3/8/2022 2:20 AM, Igor Mammedov wrote:  
> >>>> On Tue, 8 Mar 2022 01:50:11 -0500
> >>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>     
> >>>>> On Mon, Mar 07, 2022 at 09:41:44AM -0500, Steven Sistare wrote:    
> >>>>>> On 3/4/2022 5:41 AM, Igor Mammedov wrote:      
> >>>>>>> On Thu, 3 Mar 2022 12:21:15 -0500
> >>>>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >>>>>>>       
> >>>>>>>> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:      
> >>>>>>>>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
> >>>>>>>>> option is set.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>>>>>>> ---
> >>>>>>>>>  hw/core/machine.c   | 19 +++++++++++++++++++
> >>>>>>>>>  include/hw/boards.h |  1 +
> >>>>>>>>>  qemu-options.hx     |  6 ++++++
> >>>>>>>>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
> >>>>>>>>>  softmmu/vl.c        |  1 +
> >>>>>>>>>  trace-events        |  1 +
> >>>>>>>>>  util/qemu-config.c  |  4 ++++
> >>>>>>>>>  7 files changed, 70 insertions(+), 9 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
> >>>>>>>>> index 53a99ab..7739d88 100644
> >>>>>>>>> --- a/hw/core/machine.c
> >>>>>>>>> +++ b/hw/core/machine.c
> >>>>>>>>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
> >>>>>>>>>      ms->mem_merge = value;
> >>>>>>>>>  }
> >>>>>>>>>  
> >>>>>>>>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
> >>>>>>>>> +{
> >>>>>>>>> +    MachineState *ms = MACHINE(obj);
> >>>>>>>>> +
> >>>>>>>>> +    return ms->memfd_alloc;
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
> >>>>>>>>> +{
> >>>>>>>>> +    MachineState *ms = MACHINE(obj);
> >>>>>>>>> +
> >>>>>>>>> +    ms->memfd_alloc = value;
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>>  static bool machine_get_usb(Object *obj, Error **errp)
> >>>>>>>>>  {
> >>>>>>>>>      MachineState *ms = MACHINE(obj);
> >>>>>>>>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
> >>>>>>>>>      object_class_property_set_description(oc, "mem-merge",
> >>>>>>>>>          "Enable/disable memory merge support");
> >>>>>>>>>  
> >>>>>>>>> +    object_class_property_add_bool(oc, "memfd-alloc",
> >>>>>>>>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
> >>>>>>>>> +    object_class_property_set_description(oc, "memfd-alloc",
> >>>>>>>>> +        "Enable/disable allocating anonymous memory using memfd_create");
> >>>>>>>>> +
> >>>>>>>>>      object_class_property_add_bool(oc, "usb",
> >>>>>>>>>          machine_get_usb, machine_set_usb);
> >>>>>>>>>      object_class_property_set_description(oc, "usb",
> >>>>>>>>> diff --git a/include/hw/boards.h b/include/hw/boards.h
> >>>>>>>>> index 9c1c190..a57d7a0 100644
> >>>>>>>>> --- a/include/hw/boards.h
> >>>>>>>>> +++ b/include/hw/boards.h
> >>>>>>>>> @@ -327,6 +327,7 @@ struct MachineState {
> >>>>>>>>>      char *dt_compatible;
> >>>>>>>>>      bool dump_guest_core;
> >>>>>>>>>      bool mem_merge;
> >>>>>>>>> +    bool memfd_alloc;
> >>>>>>>>>      bool usb;
> >>>>>>>>>      bool usb_disabled;
> >>>>>>>>>      char *firmware;
> >>>>>>>>> diff --git a/qemu-options.hx b/qemu-options.hx
> >>>>>>>>> index 7d47510..33c8173 100644
> >>>>>>>>> --- a/qemu-options.hx
> >>>>>>>>> +++ b/qemu-options.hx
> >>>>>>>>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
> >>>>>>>>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
> >>>>>>>>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
> >>>>>>>>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
> >>>>>>>>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"        
> >>>>>>>>
> >>>>>>>> Question: are there any disadvantages associated with using
> >>>>>>>> memfd_create? I guess we are using up an fd, but that seems minor.  Any
> >>>>>>>> reason not to set to on by default? maybe with a fallback option to
> >>>>>>>> disable that?      
> >>>>>>
> >>>>>> Old Linux host kernels, circa 4.1, do not support huge pages for shared memory.
> >>>>>> Also, the tunable to enable huge pages for share memory is different than for
> >>>>>> anon memory, so there could be performance loss if it is not set correctly.
> >>>>>>     /sys/kernel/mm/transparent_hugepage/enabled
> >>>>>>     vs
> >>>>>>     /sys/kernel/mm/transparent_hugepage/shmem_enabled      
> >>>>>
> >>>>> I guess we can test this when launching the VM, and select
> >>>>> a good default.
> >>>>>    
> >>>>>> It might make sense to use memfd_create by default for the secondary segments.      
> >>>>>
> >>>>> Well there's also KSM now you mention it.    
> >>>>
> >>>> then another quest, is there downside to always using memfd_create
> >>>> without any knobs being involved?    
> >>>
> >>> Lower performance if small pages are used (but Michael suggests qemu could 
> >>> automatically check the tunable and use anon memory instead)
> >>>
> >>> KSM (same page merging) is not supported for shared memory, so ram_block_add ->
> >>> memory_try_enable_merging will not enable it.
> >>>
> >>> In both cases, I expect the degradation would be negligible if memfd_create is
> >>> only automatically applied to the secondary segments, which are typically small.
> >>> But, someone's secondary segment could be larger, and it is time consuming to
> >>> prove innocence when someone claims your change caused their performance regression.  
> >>
> >> Adding David as memory subsystem maintainer, maybe he will a better
> >> idea instead of introducing global knob that would also magically alter 
> >> backends' behavior despite of its their configured settings.  
> > 
> > OK, in ram_block_add I can set the RAM_SHARED flag based on the memory-backend object's
> > shared flag.  I already set the latter in create_default_memdev when memfd-alloc is
> > specified.  With that change, we do not override configured settings.  Users can no longer
> > use memory-backend-ram for CPR, and must change all memory-backend-ram to memory-backend-memfd
> > in the command-line arguments.  That is fine.
> > 
> > With that change, are you OK with this patch?  
> 
> Sorry, I mis-read my own code in ram_block_add.  The existing code is correct and does 
> not alter any backend's behavior.   It only sets the shared flag when the ram is *not* 
> being allocated for a backend:
> 
>                 if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
>                     new_block->flags |= RAM_SHARED;
>                 }
> 

ok, maybe instead of introducing a generic option, introduce the high level
feature one that turns this and other necessary quirks for it to work (i.e.
something like live-update=on|off).
That will not make QEMU internals any better but at least it will hide obscure
memfd-alloc from users.
Is there a patch that makes QEMU error out if backend without
shared=on is used?

Also, can you answer question below, pls
or point to a patch in series that takes care of that invariant?

[...]

> >>>>>> There is currently no way to specify memory backends for the secondary memory
> >>>>>> segments (vram, roms, etc), and IMO it would be onerous to specify a backend for
> >>>>>> each of them.  On x86_64, these include pc.bios, vga.vram, pc.rom, vga.rom,
> >>>>>> /rom@etc/acpi/tables, /rom@etc/table-loader, /rom@etc/acpi/rsdp.  
> >>
> >> MemoryRegion is not the only place where state is stored.
> >> If we only talk about fwcfg entries state, it can also reference
> >> plain malloced memory allocated elsewhere or make a deep copy internally.
> >> Similarly devices also may store state outside of RamBlock framework.
> >>
> >> How are you dealing with that?
[...]



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2021-12-22 19:05 ` [PATCH V7 10/29] machine: memfd-alloc option Steve Sistare
                     ` (2 preceding siblings ...)
  2022-03-03 17:21   ` Michael S. Tsirkin
@ 2022-03-11  9:54   ` David Hildenbrand
  3 siblings, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2022-03-11  9:54 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	Dr. David Alan Gilbert, Markus Armbruster, Zheng Chuan,
	Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée

On 22.12.21 20:05, Steve Sistare wrote:
> Allocate anonymous memory using memfd_create if the memfd-alloc machine
> option is set.

Hi,

late to the party (thanks Igor for CCing)

... in which case it's no longer anonymous memory (because it's now
MAP_SHARED). So you're converting all private memory to shared memory.

For example, memory ballooning will no longer work as expected. There is
no shared zeropage. KSM won't work. This brings a lot of "surprises".


This patch begs for a proper description why this is required and why we
cannot simply let the user handle that by properly using
memory-backend-memfd manually.

Especially the "memfd-alloc option" doesn't even express to a user
what's actually happening and what the implications are.


Long story short: this patch description has to be seriously extended.

> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/core/machine.c   | 19 +++++++++++++++++++
>  include/hw/boards.h |  1 +
>  qemu-options.hx     |  6 ++++++
>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>  softmmu/vl.c        |  1 +
>  trace-events        |  1 +
>  util/qemu-config.c  |  4 ++++
>  7 files changed, 70 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 53a99ab..7739d88 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>      ms->mem_merge = value;
>  }
>  
> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    return ms->memfd_alloc;
> +}
> +
> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
> +{
> +    MachineState *ms = MACHINE(obj);
> +
> +    ms->memfd_alloc = value;
> +}
> +
>  static bool machine_get_usb(Object *obj, Error **errp)
>  {
>      MachineState *ms = MACHINE(obj);
> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>      object_class_property_set_description(oc, "mem-merge",
>          "Enable/disable memory merge support");
>  
> +    object_class_property_add_bool(oc, "memfd-alloc",
> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
> +    object_class_property_set_description(oc, "memfd-alloc",
> +        "Enable/disable allocating anonymous memory using memfd_create");
> +
>      object_class_property_add_bool(oc, "usb",
>          machine_get_usb, machine_set_usb);
>      object_class_property_set_description(oc, "usb",
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 9c1c190..a57d7a0 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -327,6 +327,7 @@ struct MachineState {
>      char *dt_compatible;
>      bool dump_guest_core;
>      bool mem_merge;
> +    bool memfd_alloc;
>      bool usb;
>      bool usb_disabled;
>      char *firmware;
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 7d47510..33c8173 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>      "                mem-merge=on|off controls memory merge support (default: on)\n"
> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
>      "                aes-key-wrap=on|off controls support for AES key wrapping (default=on)\n"
>      "                dea-key-wrap=on|off controls support for DEA key wrapping (default=on)\n"
>      "                suppress-vmdesc=on|off disables self-describing migration (default=off)\n"
> @@ -76,6 +77,11 @@ SRST
>          supported by the host, de-duplicates identical memory pages
>          among VMs instances (enabled by default).
>  
> +    ``memfd-alloc=on|off``
> +        Enables or disables allocation of anonymous guest RAM using
> +        memfd_create.  Any associated memory-backend objects are created with
> +        share=on.  The memfd-alloc default is off.
> +
>      ``aes-key-wrap=on|off``
>          Enables or disables AES key wrapping support on s390-ccw hosts.
>          This feature controls whether AES wrapping keys will be created
> diff --git a/softmmu/physmem.c b/softmmu/physmem.c
> index 3524c04..95e2b49 100644
> --- a/softmmu/physmem.c
> +++ b/softmmu/physmem.c
> @@ -41,6 +41,7 @@
>  #include "qemu/config-file.h"
>  #include "qemu/error-report.h"
>  #include "qemu/qemu-print.h"
> +#include "qemu/memfd.h"
>  #include "exec/memory.h"
>  #include "exec/ioport.h"
>  #include "sysemu/dma.h"
> @@ -1964,35 +1965,63 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>      const bool shared = qemu_ram_is_shared(new_block);
>      RAMBlock *block;
>      RAMBlock *last_block = NULL;
> +    struct MemoryRegion *mr = new_block->mr;
>      ram_addr_t old_ram_size, new_ram_size;
>      Error *err = NULL;
> +    const char *name;
> +    void *addr = 0;
> +    size_t maxlen;
> +    MachineState *ms = MACHINE(qdev_get_machine());
>  
>      old_ram_size = last_ram_page();
>  
>      qemu_mutex_lock_ramlist();
> -    new_block->offset = find_ram_offset(new_block->max_length);
> +    maxlen = new_block->max_length;
> +    new_block->offset = find_ram_offset(maxlen);
>  
>      if (!new_block->host) {
>          if (xen_enabled()) {
> -            xen_ram_alloc(new_block->offset, new_block->max_length,
> -                          new_block->mr, &err);
> +            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
>              if (err) {
>                  error_propagate(errp, err);
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
>          } else {
> -            new_block->host = qemu_anon_ram_alloc(new_block->max_length,
> -                                                  &new_block->mr->align,
> -                                                  shared, noreserve);
> -            if (!new_block->host) {
> +            name = memory_region_name(mr);
> +            if (ms->memfd_alloc) {
> +                Object *parent = &mr->parent_obj;
> +                int mfd = -1;          /* placeholder until next patch */
> +                mr->align = QEMU_VMALLOC_ALIGN;
> +                if (mfd < 0) {
> +                    mfd = qemu_memfd_create(name, maxlen + mr->align,
> +                                            0, 0, 0, &err);
> +                    if (mfd < 0) {
> +                        return;
> +                    }
> +                }
> +                qemu_set_cloexec(mfd);
> +                /* The memory backend already set its desired flags. */
> +                if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
> +                    new_block->flags |= RAM_SHARED;
> +                }
> +                addr = file_ram_alloc(new_block, maxlen, mfd,
> +                                      false, false, 0, errp);
> +                trace_anon_memfd_alloc(name, maxlen, addr, mfd);
> +            } else {
> +                addr = qemu_anon_ram_alloc(maxlen, &mr->align,
> +                                           shared, noreserve);
> +            }
> +
> +            if (!addr) {
>                  error_setg_errno(errp, errno,
>                                   "cannot set up guest memory '%s'",
> -                                 memory_region_name(new_block->mr));
> +                                 name);
>                  qemu_mutex_unlock_ramlist();
>                  return;
>              }
> -            memory_try_enable_merging(new_block->host, new_block->max_length);
> +            memory_try_enable_merging(addr, maxlen);
> +            new_block->host = addr;
>          }
>      }
>  
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 620a1f1..ab3648a 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -2440,6 +2440,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
>          object_property_set_str(obj, "mem-path", path, &error_fatal);
>      }
>      object_property_set_int(obj, "size", ms->ram_size, &error_fatal);
> +    object_property_set_bool(obj, "share", ms->memfd_alloc, &error_fatal);
>      object_property_add_child(object_get_objects_root(), mc->default_ram_id,
>                                obj);
>      /* Ensure backend's memory region name is equal to mc->default_ram_id */
> diff --git a/trace-events b/trace-events
> index a637a61..770a9ac 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -45,6 +45,7 @@ ram_block_discard_range(const char *rbname, void *hva, size_t length, bool need_
>  # accel/tcg/cputlb.c
>  memory_notdirty_write_access(uint64_t vaddr, uint64_t ram_addr, unsigned size) "0x%" PRIx64 " ram_addr 0x%" PRIx64 " size %u"
>  memory_notdirty_set_dirty(uint64_t vaddr) "0x%" PRIx64
> +anon_memfd_alloc(const char *name, size_t size, void *ptr, int fd) "%s size %zu ptr %p fd %d"
>  
>  # gdbstub.c
>  gdbstub_op_start(const char *device) "Starting gdbstub using device %s"
> diff --git a/util/qemu-config.c b/util/qemu-config.c
> index 436ab63..3606e5c 100644
> --- a/util/qemu-config.c
> +++ b/util/qemu-config.c
> @@ -207,6 +207,10 @@ static QemuOptsList machine_opts = {
>              .type = QEMU_OPT_BOOL,
>              .help = "enable/disable memory merge support",
>          },{
> +            .name = "memfd-alloc",
> +            .type = QEMU_OPT_BOOL,
> +            .help = "enable/disable memfd_create for anonymous memory",
> +        },{
>              .name = "usb",
>              .type = QEMU_OPT_BOOL,
>              .help = "Set on/off to enable/disable usb",


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-07 14:41       ` Steven Sistare
  2022-03-08  6:50         ` Michael S. Tsirkin
@ 2022-03-11 10:08         ` Daniel P. Berrangé
  1 sibling, 0 replies; 96+ messages in thread
From: Daniel P. Berrangé @ 2022-03-11 10:08 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Igor Mammedov, Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On Mon, Mar 07, 2022 at 09:41:44AM -0500, Steven Sistare wrote:
> On 3/4/2022 5:41 AM, Igor Mammedov wrote:
> > On Thu, 3 Mar 2022 12:21:15 -0500
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> >> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:
> >>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
> >>> option is set.
> >>>
> >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>> ---
> >>>  hw/core/machine.c   | 19 +++++++++++++++++++
> >>>  include/hw/boards.h |  1 +
> >>>  qemu-options.hx     |  6 ++++++
> >>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
> >>>  softmmu/vl.c        |  1 +
> >>>  trace-events        |  1 +
> >>>  util/qemu-config.c  |  4 ++++
> >>>  7 files changed, 70 insertions(+), 9 deletions(-)
> >>>
> >>> diff --git a/hw/core/machine.c b/hw/core/machine.c
> >>> index 53a99ab..7739d88 100644
> >>> --- a/hw/core/machine.c
> >>> +++ b/hw/core/machine.c
> >>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
> >>>      ms->mem_merge = value;
> >>>  }
> >>>  
> >>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
> >>> +{
> >>> +    MachineState *ms = MACHINE(obj);
> >>> +
> >>> +    return ms->memfd_alloc;
> >>> +}
> >>> +
> >>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
> >>> +{
> >>> +    MachineState *ms = MACHINE(obj);
> >>> +
> >>> +    ms->memfd_alloc = value;
> >>> +}
> >>> +
> >>>  static bool machine_get_usb(Object *obj, Error **errp)
> >>>  {
> >>>      MachineState *ms = MACHINE(obj);
> >>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
> >>>      object_class_property_set_description(oc, "mem-merge",
> >>>          "Enable/disable memory merge support");
> >>>  
> >>> +    object_class_property_add_bool(oc, "memfd-alloc",
> >>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
> >>> +    object_class_property_set_description(oc, "memfd-alloc",
> >>> +        "Enable/disable allocating anonymous memory using memfd_create");
> >>> +
> >>>      object_class_property_add_bool(oc, "usb",
> >>>          machine_get_usb, machine_set_usb);
> >>>      object_class_property_set_description(oc, "usb",
> >>> diff --git a/include/hw/boards.h b/include/hw/boards.h
> >>> index 9c1c190..a57d7a0 100644
> >>> --- a/include/hw/boards.h
> >>> +++ b/include/hw/boards.h
> >>> @@ -327,6 +327,7 @@ struct MachineState {
> >>>      char *dt_compatible;
> >>>      bool dump_guest_core;
> >>>      bool mem_merge;
> >>> +    bool memfd_alloc;
> >>>      bool usb;
> >>>      bool usb_disabled;
> >>>      char *firmware;
> >>> diff --git a/qemu-options.hx b/qemu-options.hx
> >>> index 7d47510..33c8173 100644
> >>> --- a/qemu-options.hx
> >>> +++ b/qemu-options.hx
> >>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
> >>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
> >>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
> >>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
> >>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"  
> >>
> >> Question: are there any disadvantages associated with using
> >> memfd_create? I guess we are using up an fd, but that seems minor.  Any
> >> reason not to set to on by default? maybe with a fallback option to
> >> disable that?
> 
> Old Linux host kernels, circa 4.1, do not support huge pages for shared memory.

That doesn't matter, as we don't support any distros with kernels that old

   https://www.qemu.org/docs/master/about/build-platforms.html

We can assume something around kernel 4.18 I believe.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-03 17:21   ` Michael S. Tsirkin
  2022-03-04 10:41     ` Igor Mammedov
@ 2022-03-11 10:25     ` David Hildenbrand
  1 sibling, 0 replies; 96+ messages in thread
From: David Hildenbrand @ 2022-03-11 10:25 UTC (permalink / raw)
  To: Michael S. Tsirkin, Steve Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake,
	Philippe Mathieu-Daudé,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Daniel P. Berrange, Alex Bennée, Markus Armbruster

On 03.03.22 18:21, Michael S. Tsirkin wrote:
> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:
>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
>> option is set.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  hw/core/machine.c   | 19 +++++++++++++++++++
>>  include/hw/boards.h |  1 +
>>  qemu-options.hx     |  6 ++++++
>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>>  softmmu/vl.c        |  1 +
>>  trace-events        |  1 +
>>  util/qemu-config.c  |  4 ++++
>>  7 files changed, 70 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 53a99ab..7739d88 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>      ms->mem_merge = value;
>>  }
>>  
>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    return ms->memfd_alloc;
>> +}
>> +
>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(obj);
>> +
>> +    ms->memfd_alloc = value;
>> +}
>> +
>>  static bool machine_get_usb(Object *obj, Error **errp)
>>  {
>>      MachineState *ms = MACHINE(obj);
>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>      object_class_property_set_description(oc, "mem-merge",
>>          "Enable/disable memory merge support");
>>  
>> +    object_class_property_add_bool(oc, "memfd-alloc",
>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
>> +    object_class_property_set_description(oc, "memfd-alloc",
>> +        "Enable/disable allocating anonymous memory using memfd_create");
>> +
>>      object_class_property_add_bool(oc, "usb",
>>          machine_get_usb, machine_set_usb);
>>      object_class_property_set_description(oc, "usb",
>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>> index 9c1c190..a57d7a0 100644
>> --- a/include/hw/boards.h
>> +++ b/include/hw/boards.h
>> @@ -327,6 +327,7 @@ struct MachineState {
>>      char *dt_compatible;
>>      bool dump_guest_core;
>>      bool mem_merge;
>> +    bool memfd_alloc;
>>      bool usb;
>>      bool usb_disabled;
>>      char *firmware;
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 7d47510..33c8173 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"
> 
> Question: are there any disadvantages associated with using
> memfd_create? I guess we are using up an fd, but that seems minor.  Any
> reason not to set to on by default? maybe with a fallback option to
> disable that?
> 
> I am concerned that it's actually a kind of memory backend, this flag
> seems to instead be closer to the deprecated mem-prealloc. E.g.
> it does not work with a mem path, does it?

We had a RH-internal discssuion some time ago, here is my writeup (note
the TMPFS/SHMEM discussion):

--- snip ---

In QEMU, we specify the type of guest RAM via
* -object memory-backend-ram,...
* -object memory-backend-file,...
* -object memory-backend-memfd,...

We can specify whether to share the memory (share=on -- MAP_SHARED),
or whether to keep modifications local to QEMU (share=off -- MAP_PRIVATE).

Using "share=off" (or using the default) with files/memfd can have some
serious side-effects.

ALERT: "share=off" is the default in QEMU for memory-backend-ram and
memory-backend-file. "share=on" is the default in QEMU only for
memory-backend-memfd.


I. MAP_SHARED vs. MAP_PRIVATE

MAP_SHARED: when reading, read file content; when writing, modify file
             content.
MAP_PRIVATE: when reading, read file content, except if there was a
              local/private change. When writing, keep change
              local/private and don't modify file content.


MAP_PRIVATE sounds like a snapshot, however, in some cases it really
behaves differently -- especially with tmpfs/shmem and when QEMU
discards memory (e.g., with virtio-balloon or during postcopy live
migration).

There is some connection between MAP_PRIVATE and NUMA bindings that I
have yet to fully explore. We could have issues with some MAP_SHARED
mappings and NUMA bindings (IOW: policy getting ignored).


II Impact on different memory backends/types

II.1. Anonymous memory:

Usage: -object memory-backend-ram,...

We really want "share=off" in 99.99% of all cases. Shared anonymous RAM
-- i.e., sharing RAM with your child processes -- does not really apply
to QEMU and there are some cases that are broken in QEMU [1]; there is
only a single use case in the context of RDMA -- whereby we only need
shared anonymous memory to make mremap() work, not for actually sharing
RAM with someone else.

II.2. TMPFS/SHMEM

Usage: -object memory-backend-memfd,...
        -object memory-backend-file,mem-path=/dev/shm/FILE,...

We really want "share=on" in 99.99999% of all cases. There is a serious
issue when using private mappings on an empty shmem file, whereby we can
get a double memory consumption. The issue is that even when reading
via a private mapping, we will allocate memory for the actual file (==
RAM for tmpfs) -- even if it's just allocating blocks filled with zero.

So doing a -object memory-backend-file,mem-path=/dev/shm/FILE will in
the worst case consume 4G, even though we have an anonymous file -- *we
have to use share=on*.

II.3. Hugetlb

Usage: -object memory-backend-memfd,hugetlb=on,hugetlbsize=2M,...
        -object memory-backend-file,mem-path=/dev/shm/FILE,...

We usually want "share=on". However, there seems to be nothing wrong
about using "memory-backend-memfd" -- IOW an anonymous file; it works as
expected in my tests (fallocate() behaves in weird ways, but that's a
different story).

II.4. "Ordinary" Files

Usage: -object memory-backend-file,mem-path=/some/file,...

We usually want "share=on" in 99.9% of all cases, to have
modifications go back to the file -- for example, for the "big file" use
case where we want to use the actual file storage as memory backend (for
example, when swapping is not desired), such that we can use the page
cache where possible, but writeback the file content to disk when under
memory pressure.

5. DAX/PMEM

Usage: -object memory-backend-file,mem-path=/dev/dax,...

We want "share=on" in 99.99999% of all cases when using dax/pmem in an
emulated NVDIMM for our guest. We want the changes to go back to
dax/pmem a.k.a. the actual NVDIMM (not some mixture of pmem and system RAM).


III. MAP_PRIVATE vs. virtio-balloon and postcopy live migration

Dave told me about a use case where we

a) Start a VM with a MAP_SHARED file as guest RAM until it is booted up
b) Save the VM state, *excluding guest RAM"
c) Start multiple VMs using the VM state and the MAP_PRIVATE file as
guest RAM

This is essentially a fast "guest snapshot". But beware if you end up
discarding memory in QEMU via ram_block_discard_range(), e.g., via
virtio-balloon or via postcopy live migration.

In QEMU, we always discard file content and modified pages in private
mappings.

Problem: If one VM discards memory, it will modify the snapshot. The
snapshot will be broken. New VMs and running VMs will be affected!

Note: We cannot easily teach QEMU to not modify file content when
discarding memory of private mappings. This would break postcopy live in
some cases completely.

--- snip ---

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma)
  2022-03-10 22:30           ` Alex Williamson
@ 2022-03-11 16:22             ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-11 16:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	qemu-devel, Dr. David Alan Gilbert, Zheng Chuan, Paolo Bonzini,
	Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 3/10/2022 5:30 PM, Alex Williamson wrote:
> On Thu, 10 Mar 2022 14:55:50 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 3/10/2022 1:35 PM, Alex Williamson wrote:
>>> On Thu, 10 Mar 2022 10:00:29 -0500
>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>   
>>>> On 3/7/2022 5:16 PM, Alex Williamson wrote:  
>>>>> On Wed, 22 Dec 2021 11:05:24 -0800
>>>>> Steve Sistare <steven.sistare@oracle.com> wrote:  
>>>>> [...]
>>>>>> diff --git a/migration/cpr.c b/migration/cpr.c
>>>>>> index 37eca66..cee82cf 100644
>>>>>> --- a/migration/cpr.c
>>>>>> +++ b/migration/cpr.c
>>>>>> @@ -7,6 +7,7 @@
>>>>>>  
>>>>>>  #include "qemu/osdep.h"
>>>>>>  #include "exec/memory.h"
>>>>>> +#include "hw/vfio/vfio-common.h"
>>>>>>  #include "io/channel-buffer.h"
>>>>>>  #include "io/channel-file.h"
>>>>>>  #include "migration.h"
>>>>>> @@ -101,7 +102,9 @@ void qmp_cpr_exec(strList *args, Error **errp)
>>>>>>          error_setg(errp, "cpr-exec requires cpr-save with restart mode");
>>>>>>          return;
>>>>>>      }
>>>>>> -
>>>>>> +    if (cpr_vfio_save(errp)) {
>>>>>> +        return;
>>>>>> +    }    
>>>>>
>>>>> Why is vfio so unique that it needs separate handlers versus other
>>>>> devices?  Thanks,    
>>>>
>>>> In earlier patches these functions fiddled with more objects, but at this point
>>>> they are simple enough to convert to pre_save and post_load vmstate handlers for
>>>> the container and group objects.  However, we would still need to call special 
>>>> functons for vfio from qmp_cpr_exec:
>>>>
>>>>   * validate all containers support CPR before we start blasting vaddrs
>>>>     However, I could validate all, in every call to pre_save for each container.
>>>>     That would be less efficient, but fits the vmstate model.  
>>>
>>> Would it be a better option to mirror the migration blocker support, ie.
>>> any device that doesn't support cpr registers a blocker and generic
>>> code only needs to keep track of whether any blockers are registered.  
>>
>> We cannot specifically use migrate_add_blocker(), because it is checked in
>> the migration specific function migrate_prepare(), in a layer of functions 
>> above the simpler qemu_save_device_state() used in cpr.  But yes, we could
>> do something similar for vfio.  Increment a global counter in vfio_realize
>> if the container does not support cpr, and decrement it when the container is
>> destroyed.  pre_save could just check the counter.
> 
> Right, not suggesting to piggyback on migrate_add_blocker() only to use
> a similar mechanism.  Only drivers that can't support cpr need register
> a blocker but testing for blockers is done generically, not just for
> vfio devices.
> 
>>>>   * restore all vaddr's if qemu_save_device_state fails.
>>>>     However, I could recover for all containers inside pre_save when one container fails.
>>>>     Feels strange touching all objects in a function for one, but there is no real
>>>>     downside.  
>>>
>>> I'm not as familiar as I should be with migration callbacks, thanks to
>>> mostly not supporting it for vfio devices, but it seems strange to me
>>> that there's no existing callback or notifier per device to propagate
>>> save failure.  Do we not at least get some sort of resume callback in
>>> that case?  
>>
>> We do not:
>>     struct VMStateDescription {
>>         int (*pre_load)(void *opaque);
>>         int (*post_load)(void *opaque, int version_id);
>>         int (*pre_save)(void *opaque);
>>         int (*post_save)(void *opaque);
>>
>> The handler returns an error, which stops further saves and is propagated back
>> to the top level caller qemu_save_device_state().
>>
>> The vast majority of handlers do not have side effects, with no need to unwind 
>> anything on failure.
>>
>> This raises another point.  If pre_save succeeds for all the containers,
>> but fails for some non-vfio object, then the overall operation is abandoned,
>> but we do not restore the vaddr's.  To plug that hole, we need to call the
>> unwind code from qmp_cpr_save, or implement your alternative below.
> 
> We're trying to reuse migration interfaces, are we also triggering
> migration state change notifiers?  ie.
> MIGRATION_STATUS_{CANCELLING,CANCELLED,FAILED}  

No. That happens in the migration layer which we do not use.

> We already hook vfio
> devices supporting migration into that notifier to tell the driver to
> move the device back to the running state on failure, which seems a bit
> unique to vfio devices.  Containers could maybe register their own
> callbacks.
> 
>>> As an alternative, maybe each container could register a vm change
>>> handler that would trigger reloading vaddrs if we move to a running
>>> state and a flag on the container indicates vaddrs were invalidated?
>>> Thanks,  
>>
>> That works and is modular, but I dislike that it adds checks on the
>> happy path for a case that will rarely happen, and it pushes recovery from
>> failure further away from the original failure, which would make debugging
>> cascading failures more difficult.
> 
> Would using the migration notifier move us sufficiently closer to the
> failure point?  Otherwise I think you're talking about unwinding all
> the containers when any one fails, where you didn't like that object
> overreach, or maybe adding an optional callback... but I wonder if the
> above notifier essentially already does that.
> 
> In any case, I think we have options to either implement new or use
> existing notifier-like functionality to avoid all these vfio specific
> callouts.  Thanks,

Yes, defining a cpr notifier for failure and cleanup is a good solution.
I'll work on that and a cpr blocker.  I'll use the latter for vfio and
the chardevs.

- Steve


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 12/29] vl: helper to request re-exec
  2022-03-09 14:16   ` Marc-André Lureau
@ 2022-03-11 16:45     ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-11 16:45 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin, QEMU,
	Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Paolo Bonzini, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 3/9/2022 9:16 AM, Marc-André Lureau wrote:
> On Wed, Dec 22, 2021 at 11:52 PM Steve Sistare <steven.sistare@oracle.com <mailto:steven.sistare@oracle.com>> wrote:
> 
>     Add a qemu_system_exec_request() hook that causes the main loop to exit and
>     re-exec qemu using the specified arguments.
> 
>     Signed-off-by: Steve Sistare <steven.sistare@oracle.com <mailto:steven.sistare@oracle.com>>
>     ---
>      include/sysemu/runstate.h |  1 +
>      softmmu/runstate.c        | 21 +++++++++++++++++++++
>      2 files changed, 22 insertions(+)
> 
>     diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
>     index b655c7b..198211b 100644
>     --- a/include/sysemu/runstate.h
>     +++ b/include/sysemu/runstate.h
>     @@ -57,6 +57,7 @@ void qemu_system_wakeup_enable(WakeupReason reason, bool enabled);
>      void qemu_register_wakeup_notifier(Notifier *notifier);
>      void qemu_register_wakeup_support(void);
>      void qemu_system_shutdown_request(ShutdownCause reason);
>     +void qemu_system_exec_request(const strList *args);
>      void qemu_system_powerdown_request(void);
>      void qemu_register_powerdown_notifier(Notifier *notifier);
>      void qemu_register_shutdown_notifier(Notifier *notifier);
>     diff --git a/softmmu/runstate.c b/softmmu/runstate.c
>     index 3d344c9..309a4bf 100644
>     --- a/softmmu/runstate.c
>     +++ b/softmmu/runstate.c
>     @@ -38,6 +38,7 @@
>      #include "monitor/monitor.h"
>      #include "net/net.h"
>      #include "net/vhost_net.h"
>     +#include "qapi/util.h"
>      #include "qapi/error.h"
>      #include "qapi/qapi-commands-run-state.h"
>      #include "qapi/qapi-events-run-state.h"
>     @@ -355,6 +356,7 @@ static NotifierList wakeup_notifiers =
>      static NotifierList shutdown_notifiers =
>          NOTIFIER_LIST_INITIALIZER(shutdown_notifiers);
>      static uint32_t wakeup_reason_mask = ~(1 << QEMU_WAKEUP_REASON_NONE);
>     +static char **exec_argv;
> 
>      ShutdownCause qemu_shutdown_requested_get(void)
>      {
>     @@ -371,6 +373,11 @@ static int qemu_shutdown_requested(void)
>          return qatomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
>      }
> 
>     +static int qemu_exec_requested(void)
>     +{
>     +    return exec_argv != NULL;
>     +}
>     +
>      static void qemu_kill_report(void)
>      {
>          if (!qtest_driver() && shutdown_signal) {
>     @@ -641,6 +648,13 @@ void qemu_system_shutdown_request(ShutdownCause reason)
>          qemu_notify_event();
>      }
> 
>     +void qemu_system_exec_request(const strList *args)
>     +{
>     +    exec_argv = strv_from_strList(args);
> 
> 
> I would rather make it take a GStrv, since that's what it actually uses.
> 
> I would also check if argv[0] is set (or document the expected behaviour).

Will do, thanks.

- Steve

>     +    shutdown_requested = 1;
>     +    qemu_notify_event();
>     +}
>     +
>      static void qemu_system_powerdown(void)
>      {
>          qapi_event_send_powerdown();
>     @@ -689,6 +703,13 @@ static bool main_loop_should_exit(void)
>          }
>          request = qemu_shutdown_requested();
>          if (request) {
>     +
>     +        if (qemu_exec_requested()) {
>     +            execvp(exec_argv[0], exec_argv);
>     +            error_report("execvp %s failed: %s", exec_argv[0], strerror(errno));
>     +            g_strfreev(exec_argv);
>     +            exec_argv = NULL;
>     +        }
>              qemu_kill_report();
>              qemu_system_shutdown(request);
>              if (shutdown_action == SHUTDOWN_ACTION_PAUSE) {
>     -- 
>     1.8.3.1
> 
> 
> 
> 
> -- 
> Marc-André Lureau


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 11/29] qapi: list utility functions
  2022-03-09 14:11   ` Marc-André Lureau
@ 2022-03-11 16:45     ` Steven Sistare
  2022-03-11 21:59       ` Marc-André Lureau
  0 siblings, 1 reply; 96+ messages in thread
From: Steven Sistare @ 2022-03-11 16:45 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin, QEMU,
	Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Paolo Bonzini, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 3/9/2022 9:11 AM, Marc-André Lureau wrote:
> Hi
> 
> On Wed, Dec 22, 2021 at 11:42 PM Steve Sistare <steven.sistare@oracle.com <mailto:steven.sistare@oracle.com>> wrote:
> 
>     Generalize strList_from_comma_list() to take any delimiter character, rename
>     as strList_from_string(), and move it to qapi/util.c.  Also add
>     strv_from_strList() and QAPI_LIST_LENGTH().
> 
> Looks like you could easily split, and add some tests.

Will do.  
I don't see any tests that include qapi/util.h, so this will be a new test file.

For the split, how about:
  patch: qapi: strList_from_string
  patch: qapi: strv_from_strList
  patch: qapi: QAPI_LIST_LENGTH
  patch: qapi: unit tests for lists

Or do you prefer that unit tests be pushed with each function's patch?

>     No functional change.
> 
>     Signed-off-by: Steve Sistare <steven.sistare@oracle.com <mailto:steven.sistare@oracle.com>>
>     ---
>      include/qapi/util.h | 28 ++++++++++++++++++++++++++++
>      monitor/hmp-cmds.c  | 29 ++---------------------------
>      qapi/qapi-util.c    | 37 +++++++++++++++++++++++++++++++++++++
>      3 files changed, 67 insertions(+), 27 deletions(-)
> 
>     diff --git a/include/qapi/util.h b/include/qapi/util.h
>     index 81a2b13..c249108 100644
>     --- a/include/qapi/util.h
>     +++ b/include/qapi/util.h
>     @@ -22,6 +22,8 @@ typedef struct QEnumLookup {
>          const int size;
>      } QEnumLookup;
> 
>     +struct strList;
>     +
>      const char *qapi_enum_lookup(const QEnumLookup *lookup, int val);
>      int qapi_enum_parse(const QEnumLookup *lookup, const char *buf,
>                          int def, Error **errp);
>     @@ -31,6 +33,19 @@ bool qapi_bool_parse(const char *name, const char *value, bool *obj,
>      int parse_qapi_name(const char *name, bool complete);
> 
>      /*
>     + * Produce and return a NULL-terminated array of strings from @args.
>     + * All strings are g_strdup'd.
>     + */
>     +char **strv_from_strList(const struct strList *args);
> 
>     +
> 
> I'd suggest to use the dedicated glib type GStrv

Will do, here and in related code.

- Steve

>     +/*
>     + * Produce a strList from the character delimited string @in.
>     + * All strings are g_strdup'd.
>     + * A NULL or empty input string returns NULL.
>     + */
>     +struct strList *strList_from_string(const char *in, char delim);
>     +
>     +/*
>       * For any GenericList @list, insert @element at the front.
>       *
>       * Note that this macro evaluates @element exactly once, so it is safe
>     @@ -56,4 +71,17 @@ int parse_qapi_name(const char *name, bool complete);
>          (tail) = &(*(tail))->next; \
>      } while (0)
> 
>     +/*
>     + * For any GenericList @list, return its length.
>     + */
>     +#define QAPI_LIST_LENGTH(list) \
>     +    ({ \
>     +        int len = 0; \
>     +        typeof(list) elem; \
>     +        for (elem = list; elem != NULL; elem = elem->next) { \
>     +            len++; \
>     +        } \
>     +        len; \
>     +    })
>     +
>      #endif
>     diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
>     index b8c22da..5ca8b4b 100644
>     --- a/monitor/hmp-cmds.c
>     +++ b/monitor/hmp-cmds.c
>     @@ -43,6 +43,7 @@
>      #include "qapi/qapi-commands-run-state.h"
>      #include "qapi/qapi-commands-tpm.h"
>      #include "qapi/qapi-commands-ui.h"
>     +#include "qapi/util.h"
>      #include "qapi/qapi-visit-net.h"
>      #include "qapi/qapi-visit-migration.h"
>      #include "qapi/qmp/qdict.h"
>     @@ -70,32 +71,6 @@ bool hmp_handle_error(Monitor *mon, Error *err)
>          return false;
>      }
> 
>     -/*
>     - * Produce a strList from a comma separated list.
>     - * A NULL or empty input string return NULL.
>     - */
>     -static strList *strList_from_comma_list(const char *in)
>     -{
>     -    strList *res = NULL;
>     -    strList **tail = &res;
>     -
>     -    while (in && in[0]) {
>     -        char *comma = strchr(in, ',');
>     -        char *value;
>     -
>     -        if (comma) {
>     -            value = g_strndup(in, comma - in);
>     -            in = comma + 1; /* skip the , */
>     -        } else {
>     -            value = g_strdup(in);
>     -            in = NULL;
>     -        }
>     -        QAPI_LIST_APPEND(tail, value);
>     -    }
>     -
>     -    return res;
>     -}
>     -
>      void hmp_info_name(Monitor *mon, const QDict *qdict)
>      {
>          NameInfo *info;
>     @@ -1103,7 +1078,7 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
>                                                  migrate_announce_params());
> 
>          qapi_free_strList(params->interfaces);
>     -    params->interfaces = strList_from_comma_list(interfaces_str);
>     +    params->interfaces = strList_from_string(interfaces_str, ',');
>          params->has_interfaces = params->interfaces != NULL;
>          params->id = g_strdup(id);
>          params->has_id = !!params->id;
>     diff --git a/qapi/qapi-util.c b/qapi/qapi-util.c
>     index fda7044..edd51b3 100644
>     --- a/qapi/qapi-util.c
>     +++ b/qapi/qapi-util.c
>     @@ -15,6 +15,7 @@
>      #include "qapi/error.h"
>      #include "qemu/ctype.h"
>      #include "qapi/qmp/qerror.h"
>     +#include "qapi/qapi-builtin-types.h"
> 
>      CompatPolicy compat_policy;
> 
>     @@ -152,3 +153,39 @@ int parse_qapi_name(const char *str, bool complete)
>          }
>          return p - str;
>      }
>     +
>     +char **strv_from_strList(const strList *args)
>     +{
>     +    const strList *arg;
>     +    int i = 0;
>     +    char **argv = g_malloc((QAPI_LIST_LENGTH(args) + 1) * sizeof(char *));
>     +
>     +    for (arg = args; arg != NULL; arg = arg->next) {
>     +        argv[i++] = g_strdup(arg->value);
>     +    }
>     +    argv[i] = NULL;
>     +
>     +    return argv;
>     +}
>     +
>     +strList *strList_from_string(const char *in, char delim)
>     +{
>     +    strList *res = NULL;
>     +    strList **tail = &res;
>     +
>     +    while (in && in[0]) {
>     +        char *next = strchr(in, delim);
>     +        char *value;
>     +
>     +        if (next) {
>     +            value = g_strndup(in, next - in);
>     +            in = next + 1; /* skip the delim */
>     +        } else {
>     +            value = g_strdup(in);
>     +            in = NULL;
>     +        }
>     +        QAPI_LIST_APPEND(tail, value);
>     +    }
>     +
>     +    return res;
>     +}
>     -- 
>     1.8.3.1
> 
> 
> 
> 
> -- 
> Marc-André Lureau


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 11/29] qapi: list utility functions
  2022-03-11 16:45     ` Steven Sistare
@ 2022-03-11 21:59       ` Marc-André Lureau
  0 siblings, 0 replies; 96+ messages in thread
From: Marc-André Lureau @ 2022-03-11 21:59 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin, QEMU,
	Dr. David Alan Gilbert, Zheng Chuan, Alex Williamson,
	Stefan Hajnoczi, Paolo Bonzini, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

[-- Attachment #1: Type: text/plain, Size: 7527 bytes --]

Hi

On Fri, Mar 11, 2022 at 8:46 PM Steven Sistare <steven.sistare@oracle.com>
wrote:

> On 3/9/2022 9:11 AM, Marc-André Lureau wrote:
> > Hi
> >
> > On Wed, Dec 22, 2021 at 11:42 PM Steve Sistare <
> steven.sistare@oracle.com <mailto:steven.sistare@oracle.com>> wrote:
> >
> >     Generalize strList_from_comma_list() to take any delimiter
> character, rename
> >     as strList_from_string(), and move it to qapi/util.c.  Also add
> >     strv_from_strList() and QAPI_LIST_LENGTH().
> >
> > Looks like you could easily split, and add some tests.
>
> Will do.
> I don't see any tests that include qapi/util.h, so this will be a new test
> file.
>
> For the split, how about:
>   patch: qapi: strList_from_string
>   patch: qapi: strv_from_strList
>   patch: qapi: QAPI_LIST_LENGTH
>   patch: qapi: unit tests for lists
>
>
Sure, that's fine


> Or do you prefer that unit tests be pushed with each function's patch?
>

I don't have a strong preference. I usually prefer new code coming with its
own test, but if the resulting patch becomes too large, or if the test
touches other related aspects, might be better off as different patch. Up
to you!


> >     No functional change.
> >
> >     Signed-off-by: Steve Sistare <steven.sistare@oracle.com <mailto:
> steven.sistare@oracle.com>>
> >     ---
> >      include/qapi/util.h | 28 ++++++++++++++++++++++++++++
> >      monitor/hmp-cmds.c  | 29 ++---------------------------
> >      qapi/qapi-util.c    | 37 +++++++++++++++++++++++++++++++++++++
> >      3 files changed, 67 insertions(+), 27 deletions(-)
> >
> >     diff --git a/include/qapi/util.h b/include/qapi/util.h
> >     index 81a2b13..c249108 100644
> >     --- a/include/qapi/util.h
> >     +++ b/include/qapi/util.h
> >     @@ -22,6 +22,8 @@ typedef struct QEnumLookup {
> >          const int size;
> >      } QEnumLookup;
> >
> >     +struct strList;
> >     +
> >      const char *qapi_enum_lookup(const QEnumLookup *lookup, int val);
> >      int qapi_enum_parse(const QEnumLookup *lookup, const char *buf,
> >                          int def, Error **errp);
> >     @@ -31,6 +33,19 @@ bool qapi_bool_parse(const char *name, const char
> *value, bool *obj,
> >      int parse_qapi_name(const char *name, bool complete);
> >
> >      /*
> >     + * Produce and return a NULL-terminated array of strings from @args.
> >     + * All strings are g_strdup'd.
> >     + */
> >     +char **strv_from_strList(const struct strList *args);
> >
> >     +
> >
> > I'd suggest to use the dedicated glib type GStrv
>
> Will do, here and in related code.
>

thanks


>
> - Steve
>
> >     +/*
> >     + * Produce a strList from the character delimited string @in.
> >     + * All strings are g_strdup'd.
> >     + * A NULL or empty input string returns NULL.
> >     + */
> >     +struct strList *strList_from_string(const char *in, char delim);
> >     +
> >     +/*
> >       * For any GenericList @list, insert @element at the front.
> >       *
> >       * Note that this macro evaluates @element exactly once, so it is
> safe
> >     @@ -56,4 +71,17 @@ int parse_qapi_name(const char *name, bool
> complete);
> >          (tail) = &(*(tail))->next; \
> >      } while (0)
> >
> >     +/*
> >     + * For any GenericList @list, return its length.
> >     + */
> >     +#define QAPI_LIST_LENGTH(list) \
> >     +    ({ \
> >     +        int len = 0; \
> >     +        typeof(list) elem; \
> >     +        for (elem = list; elem != NULL; elem = elem->next) { \
> >     +            len++; \
> >     +        } \
> >     +        len; \
> >     +    })
> >     +
> >      #endif
> >     diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> >     index b8c22da..5ca8b4b 100644
> >     --- a/monitor/hmp-cmds.c
> >     +++ b/monitor/hmp-cmds.c
> >     @@ -43,6 +43,7 @@
> >      #include "qapi/qapi-commands-run-state.h"
> >      #include "qapi/qapi-commands-tpm.h"
> >      #include "qapi/qapi-commands-ui.h"
> >     +#include "qapi/util.h"
> >      #include "qapi/qapi-visit-net.h"
> >      #include "qapi/qapi-visit-migration.h"
> >      #include "qapi/qmp/qdict.h"
> >     @@ -70,32 +71,6 @@ bool hmp_handle_error(Monitor *mon, Error *err)
> >          return false;
> >      }
> >
> >     -/*
> >     - * Produce a strList from a comma separated list.
> >     - * A NULL or empty input string return NULL.
> >     - */
> >     -static strList *strList_from_comma_list(const char *in)
> >     -{
> >     -    strList *res = NULL;
> >     -    strList **tail = &res;
> >     -
> >     -    while (in && in[0]) {
> >     -        char *comma = strchr(in, ',');
> >     -        char *value;
> >     -
> >     -        if (comma) {
> >     -            value = g_strndup(in, comma - in);
> >     -            in = comma + 1; /* skip the , */
> >     -        } else {
> >     -            value = g_strdup(in);
> >     -            in = NULL;
> >     -        }
> >     -        QAPI_LIST_APPEND(tail, value);
> >     -    }
> >     -
> >     -    return res;
> >     -}
> >     -
> >      void hmp_info_name(Monitor *mon, const QDict *qdict)
> >      {
> >          NameInfo *info;
> >     @@ -1103,7 +1078,7 @@ void hmp_announce_self(Monitor *mon, const
> QDict *qdict)
> >
>  migrate_announce_params());
> >
> >          qapi_free_strList(params->interfaces);
> >     -    params->interfaces = strList_from_comma_list(interfaces_str);
> >     +    params->interfaces = strList_from_string(interfaces_str, ',');
> >          params->has_interfaces = params->interfaces != NULL;
> >          params->id = g_strdup(id);
> >          params->has_id = !!params->id;
> >     diff --git a/qapi/qapi-util.c b/qapi/qapi-util.c
> >     index fda7044..edd51b3 100644
> >     --- a/qapi/qapi-util.c
> >     +++ b/qapi/qapi-util.c
> >     @@ -15,6 +15,7 @@
> >      #include "qapi/error.h"
> >      #include "qemu/ctype.h"
> >      #include "qapi/qmp/qerror.h"
> >     +#include "qapi/qapi-builtin-types.h"
> >
> >      CompatPolicy compat_policy;
> >
> >     @@ -152,3 +153,39 @@ int parse_qapi_name(const char *str, bool
> complete)
> >          }
> >          return p - str;
> >      }
> >     +
> >     +char **strv_from_strList(const strList *args)
> >     +{
> >     +    const strList *arg;
> >     +    int i = 0;
> >     +    char **argv = g_malloc((QAPI_LIST_LENGTH(args) + 1) *
> sizeof(char *));
> >     +
> >     +    for (arg = args; arg != NULL; arg = arg->next) {
> >     +        argv[i++] = g_strdup(arg->value);
> >     +    }
> >     +    argv[i] = NULL;
> >     +
> >     +    return argv;
> >     +}
> >     +
> >     +strList *strList_from_string(const char *in, char delim)
> >     +{
> >     +    strList *res = NULL;
> >     +    strList **tail = &res;
> >     +
> >     +    while (in && in[0]) {
> >     +        char *next = strchr(in, delim);
> >     +        char *value;
> >     +
> >     +        if (next) {
> >     +            value = g_strndup(in, next - in);
> >     +            in = next + 1; /* skip the delim */
> >     +        } else {
> >     +            value = g_strdup(in);
> >     +            in = NULL;
> >     +        }
> >     +        QAPI_LIST_APPEND(tail, value);
> >     +    }
> >     +
> >     +    return res;
> >     +}
> >     --
> >     1.8.3.1
> >
> >
> >
> >
> > --
> > Marc-André Lureau
>


-- 
Marc-André Lureau

[-- Attachment #2: Type: text/html, Size: 10653 bytes --]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH V7 10/29] machine: memfd-alloc option
  2022-03-11  9:42                     ` Igor Mammedov
@ 2022-03-29 17:43                       ` Steven Sistare
  0 siblings, 0 replies; 96+ messages in thread
From: Steven Sistare @ 2022-03-29 17:43 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Jason Zeng, Juan Quintela, Eric Blake, Michael S. Tsirkin,
	David Hildenbrand, qemu-devel, Dr. David Alan Gilbert,
	Zheng Chuan, Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Daniel P. Berrange,
	Philippe Mathieu-Daudé,
	Alex Bennée, Markus Armbruster

On 3/11/2022 4:42 AM, Igor Mammedov wrote:
> On Thu, 10 Mar 2022 13:18:35 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 3/10/2022 12:28 PM, Steven Sistare wrote:
>>> On 3/10/2022 11:00 AM, Igor Mammedov wrote:  
>>>> On Thu, 10 Mar 2022 10:36:08 -0500
>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>  
>>>>> On 3/8/2022 2:20 AM, Igor Mammedov wrote:  
>>>>>> On Tue, 8 Mar 2022 01:50:11 -0500
>>>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>     
>>>>>>> On Mon, Mar 07, 2022 at 09:41:44AM -0500, Steven Sistare wrote:    
>>>>>>>> On 3/4/2022 5:41 AM, Igor Mammedov wrote:      
>>>>>>>>> On Thu, 3 Mar 2022 12:21:15 -0500
>>>>>>>>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>>>>>       
>>>>>>>>>> On Wed, Dec 22, 2021 at 11:05:15AM -0800, Steve Sistare wrote:      
>>>>>>>>>>> Allocate anonymous memory using memfd_create if the memfd-alloc machine
>>>>>>>>>>> option is set.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>>> ---
>>>>>>>>>>>  hw/core/machine.c   | 19 +++++++++++++++++++
>>>>>>>>>>>  include/hw/boards.h |  1 +
>>>>>>>>>>>  qemu-options.hx     |  6 ++++++
>>>>>>>>>>>  softmmu/physmem.c   | 47 ++++++++++++++++++++++++++++++++++++++---------
>>>>>>>>>>>  softmmu/vl.c        |  1 +
>>>>>>>>>>>  trace-events        |  1 +
>>>>>>>>>>>  util/qemu-config.c  |  4 ++++
>>>>>>>>>>>  7 files changed, 70 insertions(+), 9 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>>>>>>>>>> index 53a99ab..7739d88 100644
>>>>>>>>>>> --- a/hw/core/machine.c
>>>>>>>>>>> +++ b/hw/core/machine.c
>>>>>>>>>>> @@ -392,6 +392,20 @@ static void machine_set_mem_merge(Object *obj, bool value, Error **errp)
>>>>>>>>>>>      ms->mem_merge = value;
>>>>>>>>>>>  }
>>>>>>>>>>>  
>>>>>>>>>>> +static bool machine_get_memfd_alloc(Object *obj, Error **errp)
>>>>>>>>>>> +{
>>>>>>>>>>> +    MachineState *ms = MACHINE(obj);
>>>>>>>>>>> +
>>>>>>>>>>> +    return ms->memfd_alloc;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static void machine_set_memfd_alloc(Object *obj, bool value, Error **errp)
>>>>>>>>>>> +{
>>>>>>>>>>> +    MachineState *ms = MACHINE(obj);
>>>>>>>>>>> +
>>>>>>>>>>> +    ms->memfd_alloc = value;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>>  static bool machine_get_usb(Object *obj, Error **errp)
>>>>>>>>>>>  {
>>>>>>>>>>>      MachineState *ms = MACHINE(obj);
>>>>>>>>>>> @@ -829,6 +843,11 @@ static void machine_class_init(ObjectClass *oc, void *data)
>>>>>>>>>>>      object_class_property_set_description(oc, "mem-merge",
>>>>>>>>>>>          "Enable/disable memory merge support");
>>>>>>>>>>>  
>>>>>>>>>>> +    object_class_property_add_bool(oc, "memfd-alloc",
>>>>>>>>>>> +        machine_get_memfd_alloc, machine_set_memfd_alloc);
>>>>>>>>>>> +    object_class_property_set_description(oc, "memfd-alloc",
>>>>>>>>>>> +        "Enable/disable allocating anonymous memory using memfd_create");
>>>>>>>>>>> +
>>>>>>>>>>>      object_class_property_add_bool(oc, "usb",
>>>>>>>>>>>          machine_get_usb, machine_set_usb);
>>>>>>>>>>>      object_class_property_set_description(oc, "usb",
>>>>>>>>>>> diff --git a/include/hw/boards.h b/include/hw/boards.h
>>>>>>>>>>> index 9c1c190..a57d7a0 100644
>>>>>>>>>>> --- a/include/hw/boards.h
>>>>>>>>>>> +++ b/include/hw/boards.h
>>>>>>>>>>> @@ -327,6 +327,7 @@ struct MachineState {
>>>>>>>>>>>      char *dt_compatible;
>>>>>>>>>>>      bool dump_guest_core;
>>>>>>>>>>>      bool mem_merge;
>>>>>>>>>>> +    bool memfd_alloc;
>>>>>>>>>>>      bool usb;
>>>>>>>>>>>      bool usb_disabled;
>>>>>>>>>>>      char *firmware;
>>>>>>>>>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>>>>>>>>>> index 7d47510..33c8173 100644
>>>>>>>>>>> --- a/qemu-options.hx
>>>>>>>>>>> +++ b/qemu-options.hx
>>>>>>>>>>> @@ -30,6 +30,7 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
>>>>>>>>>>>      "                vmport=on|off|auto controls emulation of vmport (default: auto)\n"
>>>>>>>>>>>      "                dump-guest-core=on|off include guest memory in a core dump (default=on)\n"
>>>>>>>>>>>      "                mem-merge=on|off controls memory merge support (default: on)\n"
>>>>>>>>>>> +    "                memfd-alloc=on|off controls allocating anonymous guest RAM using memfd_create (default: off)\n"        
>>>>>>>>>>
>>>>>>>>>> Question: are there any disadvantages associated with using
>>>>>>>>>> memfd_create? I guess we are using up an fd, but that seems minor.  Any
>>>>>>>>>> reason not to set to on by default? maybe with a fallback option to
>>>>>>>>>> disable that?      
>>>>>>>>
>>>>>>>> Old Linux host kernels, circa 4.1, do not support huge pages for shared memory.
>>>>>>>> Also, the tunable to enable huge pages for share memory is different than for
>>>>>>>> anon memory, so there could be performance loss if it is not set correctly.
>>>>>>>>     /sys/kernel/mm/transparent_hugepage/enabled
>>>>>>>>     vs
>>>>>>>>     /sys/kernel/mm/transparent_hugepage/shmem_enabled      
>>>>>>>
>>>>>>> I guess we can test this when launching the VM, and select
>>>>>>> a good default.
>>>>>>>    
>>>>>>>> It might make sense to use memfd_create by default for the secondary segments.      
>>>>>>>
>>>>>>> Well there's also KSM now you mention it.    
>>>>>>
>>>>>> then another quest, is there downside to always using memfd_create
>>>>>> without any knobs being involved?    
>>>>>
>>>>> Lower performance if small pages are used (but Michael suggests qemu could 
>>>>> automatically check the tunable and use anon memory instead)
>>>>>
>>>>> KSM (same page merging) is not supported for shared memory, so ram_block_add ->
>>>>> memory_try_enable_merging will not enable it.
>>>>>
>>>>> In both cases, I expect the degradation would be negligible if memfd_create is
>>>>> only automatically applied to the secondary segments, which are typically small.
>>>>> But, someone's secondary segment could be larger, and it is time consuming to
>>>>> prove innocence when someone claims your change caused their performance regression.  
>>>>
>>>> Adding David as memory subsystem maintainer, maybe he will a better
>>>> idea instead of introducing global knob that would also magically alter 
>>>> backends' behavior despite of its their configured settings.  
>>>
>>> OK, in ram_block_add I can set the RAM_SHARED flag based on the memory-backend object's
>>> shared flag.  I already set the latter in create_default_memdev when memfd-alloc is
>>> specified.  With that change, we do not override configured settings.  Users can no longer
>>> use memory-backend-ram for CPR, and must change all memory-backend-ram to memory-backend-memfd
>>> in the command-line arguments.  That is fine.
>>>
>>> With that change, are you OK with this patch?  
>>
>> Sorry, I mis-read my own code in ram_block_add.  The existing code is correct and does 
>> not alter any backend's behavior.   It only sets the shared flag when the ram is *not* 
>> being allocated for a backend:
>>
>>                 if (!object_dynamic_cast(parent, TYPE_MEMORY_BACKEND)) {
>>                     new_block->flags |= RAM_SHARED;
>>                 }
>>
> 
> ok, maybe instead of introducing a generic option, introduce the high level
> feature one that turns this and other necessary quirks for it to work (i.e.
> something like live-update=on|off).
> That will not make QEMU internals any better but at least it will hide obscure
> memfd-alloc from users.

That occurred to me, but during this review, a few folks have said it would be useful
to expose memfd-alloc directly.  And, I don't think that hiding memfd-alloc under
another flag is helpful, as some users will still want to understand how enabling cpr 
affects the VM environment.  I would still be documenting the memfd behavior for the
hypothetical new flag.

I currently document memfd-alloc in the following places.  If you think this is 
confusing or incomplete, please let me know:

qemu-options.hx

@item memfd-alloc=on|off
Enables or disables allocation of anonymous guest RAM using
memfd_create.  Any associated memory-backend objects are created with
share=on.  The memfd-alloc default is off.

hmp-commands.hx

@item cpr-save @var{filename} @var{mode}
...
If @var{mode} is 'restart', the checkpoint remains valid after restarting qemu
using a subsequent cpr-exec.  All guest RAM objects must be shared.  The
share=on property is required for memory created with an explicit -object
option, and the memfd-alloc machine property is required for memory that is
implicitly created. 

And this error message in a few places for the only-cpr-capable command line option:

"only-cpr-capable requires -machine memfd-alloc=on"

> Is there a patch that makes QEMU error out if backend without
> shared=on is used?

No.  I will add that check, thanks.

> Also, can you answer question below, pls
> or point to a patch in series that takes care of that invariant?
> 
> [...]
> 
>>>>>>>> There is currently no way to specify memory backends for the secondary memory
>>>>>>>> segments (vram, roms, etc), and IMO it would be onerous to specify a backend for
>>>>>>>> each of them.  On x86_64, these include pc.bios, vga.vram, pc.rom, vga.rom,
>>>>>>>> /rom@etc/acpi/tables, /rom@etc/table-loader, /rom@etc/acpi/rsdp.  
>>>>
>>>> MemoryRegion is not the only place where state is stored.
>>>> If we only talk about fwcfg entries state, it can also reference
>>>> plain malloced memory allocated elsewhere or make a deep copy internally.
>>>> Similarly devices also may store state outside of RamBlock framework.
>>>>
>>>> How are you dealing with that?

Sorry, I missed this before.
fwcfg defines vmstate handlers that save and restore all state to a file across the cpr
operation, similar to live migration.  In general, if it works for live migration, then
it works for cpr.  If you find a counter-example, please let me know.

- Steve


^ permalink raw reply	[flat|nested] 96+ messages in thread

end of thread, other threads:[~2022-03-29 17:50 UTC | newest]

Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-22 19:05 [PATCH V7 00/29] Live Update Steve Sistare
2021-12-22 19:05 ` [PATCH V7 01/29] memory: qemu_check_ram_volatile Steve Sistare
2022-02-24 18:28   ` Dr. David Alan Gilbert
2022-03-03 15:55     ` Steven Sistare
2022-03-04 12:47   ` Philippe Mathieu-Daudé
2021-12-22 19:05 ` [PATCH V7 02/29] migration: fix populate_vfio_info Steve Sistare
2022-02-24 18:42   ` Peter Maydell
2022-03-03 15:55     ` Steven Sistare
2022-03-03 16:21       ` Peter Maydell
2022-03-03 16:38         ` Steven Sistare
2021-12-22 19:05 ` [PATCH V7 03/29] migration: qemu file wrappers Steve Sistare
2022-02-24 18:21   ` Dr. David Alan Gilbert
2022-03-03 15:55     ` Steven Sistare
2021-12-22 19:05 ` [PATCH V7 04/29] migration: simplify savevm Steve Sistare
2022-02-24 18:25   ` Dr. David Alan Gilbert
2022-03-03 15:55     ` Steven Sistare
2021-12-22 19:05 ` [PATCH V7 05/29] vl: start on wakeup request Steve Sistare
2022-02-24 18:51   ` Dr. David Alan Gilbert
2022-03-03 15:56     ` Steven Sistare
2021-12-22 19:05 ` [PATCH V7 06/29] cpr: reboot mode Steve Sistare
2021-12-22 19:05 ` [PATCH V7 07/29] cpr: reboot HMP interfaces Steve Sistare
2021-12-22 19:05 ` [PATCH V7 08/29] memory: flat section iterator Steve Sistare
2022-03-04 12:48   ` Philippe Mathieu-Daudé
2022-03-07 14:42     ` Steven Sistare
2022-03-09 14:18   ` Marc-André Lureau
2021-12-22 19:05 ` [PATCH V7 09/29] oslib: qemu_clear_cloexec Steve Sistare
2021-12-22 19:05 ` [PATCH V7 10/29] machine: memfd-alloc option Steve Sistare
2022-02-18  8:05   ` Guoyi Tu
2022-03-03 15:55     ` Steven Sistare
2022-02-24 17:56   ` Dr. David Alan Gilbert
2022-03-03 15:56     ` Steven Sistare
2022-03-03 17:21   ` Michael S. Tsirkin
2022-03-04 10:41     ` Igor Mammedov
2022-03-07 14:41       ` Steven Sistare
2022-03-08  6:50         ` Michael S. Tsirkin
2022-03-08  7:20           ` Igor Mammedov
2022-03-10 15:36             ` Steven Sistare
2022-03-10 16:00               ` Igor Mammedov
2022-03-10 17:28                 ` Steven Sistare
2022-03-10 18:18                   ` Steven Sistare
2022-03-11  9:42                     ` Igor Mammedov
2022-03-29 17:43                       ` Steven Sistare
2022-03-11 10:08         ` Daniel P. Berrangé
2022-03-11 10:25     ` David Hildenbrand
2022-03-11  9:54   ` David Hildenbrand
2021-12-22 19:05 ` [PATCH V7 11/29] qapi: list utility functions Steve Sistare
2022-03-09 14:11   ` Marc-André Lureau
2022-03-11 16:45     ` Steven Sistare
2022-03-11 21:59       ` Marc-André Lureau
2021-12-22 19:05 ` [PATCH V7 12/29] vl: helper to request re-exec Steve Sistare
2022-03-09 14:16   ` Marc-André Lureau
2022-03-11 16:45     ` Steven Sistare
2021-12-22 19:05 ` [PATCH V7 13/29] cpr: preserve extra state Steve Sistare
2021-12-22 19:05 ` [PATCH V7 14/29] cpr: restart mode Steve Sistare
2021-12-22 19:05 ` [PATCH V7 15/29] cpr: restart HMP interfaces Steve Sistare
2021-12-22 19:05 ` [PATCH V7 16/29] hostmem-memfd: cpr for memory-backend-memfd Steve Sistare
2021-12-22 19:05 ` [PATCH V7 17/29] pci: export functions for cpr Steve Sistare
2021-12-22 23:07   ` Michael S. Tsirkin
2022-01-05 17:22     ` Steven Sistare
2022-01-05 20:16       ` Michael S. Tsirkin
2022-01-06 22:48         ` Steven Sistare
2022-01-07 10:03           ` Michael S. Tsirkin
2021-12-22 19:05 ` [PATCH V7 18/29] vfio-pci: refactor " Steve Sistare
2022-03-03 23:21   ` Alex Williamson
2022-03-07 14:42     ` Steven Sistare
2021-12-22 19:05 ` [PATCH V7 19/29] vfio-pci: cpr part 1 (fd and dma) Steve Sistare
2021-12-22 23:15   ` Michael S. Tsirkin
2022-01-05 17:24     ` Steven Sistare
2022-01-05 21:14       ` Michael S. Tsirkin
2022-01-05 21:40         ` Steven Sistare
2022-01-05 23:09           ` Michael S. Tsirkin
2022-01-05 23:24             ` Steven Sistare
2022-01-06  9:12               ` Michael S. Tsirkin
2022-01-06 19:13                 ` Steven Sistare
2022-03-07 22:16   ` Alex Williamson
2022-03-10 15:00     ` Steven Sistare
2022-03-10 18:35       ` Alex Williamson
2022-03-10 19:55         ` Steven Sistare
2022-03-10 22:30           ` Alex Williamson
2022-03-11 16:22             ` Steven Sistare
2021-12-22 19:05 ` [PATCH V7 20/29] vfio-pci: cpr part 2 (msi) Steve Sistare
2021-12-22 19:05 ` [PATCH V7 21/29] vfio-pci: cpr part 3 (intx) Steve Sistare
2021-12-22 19:05 ` [PATCH V7 22/29] vfio-pci: recover from unmap-all-vaddr failure Steve Sistare
2021-12-22 19:05 ` [PATCH V7 23/29] vhost: reset vhost devices for cpr Steve Sistare
2021-12-22 19:05 ` [PATCH V7 24/29] loader: suppress rom_reset during cpr Steve Sistare
2021-12-22 19:05 ` [PATCH V7 25/29] chardev: cpr framework Steve Sistare
2021-12-22 19:05 ` [PATCH V7 26/29] chardev: cpr for simple devices Steve Sistare
2021-12-22 19:05 ` [PATCH V7 27/29] chardev: cpr for pty Steve Sistare
2021-12-22 19:05 ` [PATCH V7 28/29] chardev: cpr for sockets Steve Sistare
2022-02-18  9:03   ` Guoyi Tu
2022-03-03 15:55     ` Steven Sistare
2021-12-22 19:05 ` [PATCH V7 29/29] cpr: only-cpr-capable option Steve Sistare
2022-02-18  9:43   ` Guoyi Tu
2022-03-03 15:54     ` Steven Sistare
2022-01-07 18:45 ` [PATCH V7 00/29] Live Update Steven Sistare
2022-02-18 13:36   ` Steven Sistare

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.