* [PATCH V1 00/32] Live Update
@ 2020-07-30 15:14 Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 01/32] savevm: add vmstate handler iterators Steve Sistare
                   ` (35 more replies)
  0 siblings, 36 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Improve and extend the qemu functions that save and restore VM state so a
guest may be suspended and resumed with minimal pause time.  qemu may be
updated to a new version in between.

The first set of patches adds the cprsave and cprload commands to save and
restore VM state, and allow the host kernel to be updated and rebooted in
between.  The VM must create guest RAM in a persistent shared memory file,
such as /dev/dax0.0 or persistent /dev/shm PKRAM as proposed in
https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
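
To illustrate why guest RAM must live in a shared file mapping, here is a
minimal sketch, not code from this series.  It assumes an ordinary file on a
filesystem that supports ftruncate(); a device such as /dev/dax0.0 would be
opened and mapped without the resize step.  The helper name map_shared_file
is hypothetical.  Stores made through a MAP_SHARED mapping land in the file's
pages, so a later process that maps the same file sees them, whereas a
MAP_PRIVATE or anonymous mapping would lose them when the process exits.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map 'size' bytes of the file at 'path' as shared,
 * writable memory.  Returns NULL on failure.
 */
static void *map_shared_file(const char *path, size_t size)
{
    void *p;
    int fd;

    assert(path != NULL && size > 0);

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        return NULL;
    }
    /* Make sure the file is large enough to back the whole mapping. */
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}
```

Whether the file contents also survive a kexec reboot depends on the backing
store, which is the role of DAX or the proposed PKRAM in the text above.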

cprsave stops the VCPUs and saves VM device state in a simple file, and
thus supports any type of guest image and block device.  The caller must
not modify the VM's block devices between cprsave and cprload.

cprsave and cprload support guests with vfio devices if the caller first
suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
The guest drivers' suspend methods flush outstanding requests and
re-initialize the devices, and thus there is no device state to save and
restore.

   1 savevm: add vmstate handler iterators
   2 savevm: VM handlers mode mask
   3 savevm: QMP command for cprsave
   4 savevm: HMP Command for cprsave
   5 savevm: QMP command for cprload
   6 savevm: HMP Command for cprload
   7 savevm: QMP command for cprinfo
   8 savevm: HMP command for cprinfo
   9 savevm: prevent cprsave if memory is volatile
  10 kvmclock: restore paused KVM clock
  11 cpu: disable ticks when suspended
  12 vl: pause option
  13 gdbstub: gdb support for suspended state

The next patches add a restart method that eliminates the persistent memory
constraint, and allows qemu to be updated across the restart, but does not
allow host reboot.  Anonymous memory segments used by the guest are
preserved across a re-exec of qemu, mapped at the same VA, via a proposed
madvise(MADV_DOEXEC) option in the Linux kernel.  See
https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/

  14 savevm: VMS_RESTART and cprsave restart
  15 vl: QEMU_START_FREEZE env var
  16 oslib: add qemu_clr_cloexec
  17 util: env var helpers
  18 osdep: import MADV_DOEXEC
  19 memory: ram_block_add cosmetic changes
  20 vl: add helper to request re-exec
  21 exec, memory: exec(3) to restart
  22 char: qio_channel_socket_accept reuse fd
  23 char: save/restore chardev socket fds
  24 ui: save/restore vnc socket fds
  25 char: save/restore chardev pty fds
  26 monitor: save/restore QMP negotiation status
  27 vhost: reset vhost devices upon cprsave
  28 char: restore terminal on restart

The next patches extend the restart method to save and restore vfio-pci
state, eliminating the requirement for a guest agent.  The vfio container,
group, and device descriptors are preserved across the qemu re-exec.

  29 pci: export pci_update_mappings
  30 vfio-pci: save and restore
  31 vfio-pci: trace pci config
  32 vfio-pci: improved tracing
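
As background for the descriptor-preserving patches above, the sketch below
shows the general POSIX technique for keeping a descriptor open across
exec(3): clear its FD_CLOEXEC flag, then record the descriptor number
somewhere the new process image can find it.  It is an illustration under
assumptions, not the code in this series.  The helper names and the use of an
environment variable to carry the number are hypothetical; the patches
"oslib: add qemu_clr_cloexec" and "util: env var helpers" contain the actual
implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: clear FD_CLOEXEC so fd stays open across exec(3). */
static int clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/*
 * Hypothetical helper: make fd inheritable and publish its number in the
 * environment under 'name', so the post-exec process can look it up.
 */
static int publish_fd(const char *name, int fd)
{
    char buf[16];

    assert(name != NULL && fd >= 0);

    if (clear_cloexec(fd) < 0) {
        return -1;
    }
    snprintf(buf, sizeof(buf), "%d", fd);
    return setenv(name, buf, 1);
}

/* After exec: recover a published descriptor, or -1 if none was saved. */
static int recover_fd(const char *name)
{
    const char *val = getenv(name);

    return val ? atoi(val) : -1;
}
```

The old process would call publish_fd() before exec, and the new process
would call recover_fd() with the same name instead of reopening the device.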

Here is an example of updating qemu from v4.2.0 to v4.2.1 using
"cprsave restart".  The software update is performed while the guest is
running to minimize downtime.

window 1				| window 2
					|
# qemu-system-x86_64 ... 		|
QEMU 4.2.0 monitor - type 'help' ...	|
(qemu) info status			|
VM status: running			|
					| # yum update qemu
(qemu) cprsave /tmp/qemu.sav restart	|
QEMU 4.2.1 monitor - type 'help' ...	|
(qemu) info status			|
VM status: paused (prelaunch)		|
(qemu) cprload /tmp/qemu.sav		|
(qemu) info status			|
VM status: running			|


Here is an example of updating the host kernel using "cprsave reboot".

window 1					| window 2
						|
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...		|
(qemu) info status				|
VM status: running				|
						| # yum update kernel-uek
(qemu) cprsave /tmp/qemu.sav reboot		|
						|
# systemctl kexec				|
kexec_core: Starting new kernel			|
...						|
						|
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...		|
(qemu) info status				|
VM status: paused (prelaunch)			|
(qemu) cprload /tmp/qemu.sav			|
(qemu) info status				|
VM status: running				|


Mark Kanda (5):
  char: qio_channel_socket_accept reuse fd
  char: save/restore chardev socket fds
  ui: save/restore vnc socket fds
  monitor: save/restore QMP negotiation status
  vhost: reset vhost devices upon cprsave

Steve Sistare (27):
  savevm: add vmstate handler iterators
  savevm: VM handlers mode mask
  savevm: QMP command for cprsave
  savevm: HMP Command for cprsave
  savevm: QMP command for cprload
  savevm: HMP Command for cprload
  savevm: QMP command for cprinfo
  savevm: HMP command for cprinfo
  savevm: prevent cprsave if memory is volatile
  kvmclock: restore paused KVM clock
  cpu: disable ticks when suspended
  vl: pause option
  gdbstub: gdb support for suspended state
  savevm: VMS_RESTART and cprsave restart
  vl: QEMU_START_FREEZE env var
  oslib: add qemu_clr_cloexec
  util: env var helpers
  osdep: import MADV_DOEXEC
  memory: ram_block_add cosmetic changes
  vl: add helper to request re-exec
  exec, memory: exec(3) to restart
  char: save/restore chardev pty fds
  char: restore terminal on restart
  pci: export pci_update_mappings
  vfio-pci: save and restore
  vfio-pci: trace pci config
  vfio-pci: improved tracing

 MAINTAINERS                    |   7 ++
 accel/kvm/kvm-all.c            |   8 +-
 accel/kvm/trace-events         |   3 +-
 chardev/char-pty.c             |  38 +++++--
 chardev/char-socket.c          |  35 ++++++
 chardev/char-stdio.c           |   7 ++
 chardev/char.c                 |  16 +++
 exec.c                         |  88 +++++++++++++--
 gdbstub.c                      |  11 +-
 hmp-commands.hx                |  46 ++++++++
 hw/i386/kvm/clock.c            |   6 +-
 hw/pci/msix.c                  |   1 +
 hw/pci/pci.c                   |  17 +--
 hw/pci/trace-events            |   5 +-
 hw/vfio/common.c               | 115 ++++++++++++++++----
 hw/vfio/pci.c                  | 179 ++++++++++++++++++++++++++++++-
 hw/vfio/platform.c             |   2 +-
 hw/vfio/trace-events           |  11 +-
 hw/virtio/vhost.c              |  12 +++
 include/chardev/char.h         |   8 ++
 include/exec/memory.h          |   4 +
 include/hw/pci/pci.h           |   2 +
 include/hw/vfio/vfio-common.h  |   4 +-
 include/io/channel-socket.h    |   3 +-
 include/migration/register.h   |   3 +
 include/migration/vmstate.h    |  11 ++
 include/monitor/hmp.h          |   3 +
 include/qemu/cutils.h          |   1 +
 include/qemu/env.h             |  31 ++++++
 include/qemu/osdep.h           |   8 ++
 include/sysemu/sysemu.h        |  10 ++
 io/channel-socket.c            |  12 ++-
 io/net-listener.c              |   4 +-
 migration/block.c              |   1 +
 migration/migration.c          |   4 +-
 migration/ram.c                |   1 +
 migration/savevm.c             | 237 ++++++++++++++++++++++++++++++++++++-----
 migration/savevm.h             |   4 +-
 monitor/hmp-cmds.c             |  28 +++++
 monitor/qmp-cmds.c             |  16 +++
 monitor/qmp.c                  |  42 ++++++++
 qapi/migration.json            |  35 ++++++
 qapi/pragma.json               |   1 +
 qemu-options.hx                |   9 ++
 scsi/qemu-pr-helper.c          |   2 +-
 softmmu/vl.c                   |  65 ++++++++++-
 tests/qtest/tpm-emu.c          |   2 +-
 tests/test-char.c              |   2 +-
 tests/test-io-channel-socket.c |   4 +-
 trace-events                   |   2 +
 ui/vnc.c                       | 153 +++++++++++++++++++++-----
 util/Makefile.objs             |   2 +-
 util/env.c                     | 132 +++++++++++++++++++++++
 util/oslib-posix.c             |   9 ++
 util/oslib-win32.c             |   4 +
 55 files changed, 1331 insertions(+), 135 deletions(-)
 create mode 100644 include/qemu/env.h
 create mode 100644 util/env.c

-- 
1.8.3.1




* [PATCH V1 01/32] savevm: add vmstate handler iterators
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-11 16:24   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 02/32] savevm: VM handlers mode mask Steve Sistare
                   ` (34 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Provide the SAVEVM_FOREACH and SAVEVM_FORALL macros to loop over all save
VM state handlers.  The former will filter handlers based on the operation
in the later patch "savevm: VM handlers mode mask".  The latter loops over
all handlers.

No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/savevm.c | 57 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 19 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 45c9dd9..a07fcad 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -266,6 +266,25 @@ static SaveState savevm_state = {
     .global_section_id = 0,
 };
 
+/*
+ * The FOREACH macros will filter handlers based on the current operation when
+ * additional conditions are added in a subsequent patch.
+ */
+
+#define SAVEVM_FOREACH(se, entry)                                    \
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
+
+#define SAVEVM_FOREACH_SAFE(se, entry, new_se)                       \
+    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)   \
+
+/* The FORALL macros unconditionally loop over all handlers. */
+
+#define SAVEVM_FORALL(se, entry)                                     \
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
+
+#define SAVEVM_FORALL_SAFE(se, entry, new_se)                        \
+    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)
+
 static bool should_validate_capability(int capability)
 {
     assert(capability >= 0 && capability < MIGRATION_CAPABILITY__MAX);
@@ -673,7 +692,7 @@ static uint32_t calculate_new_instance_id(const char *idstr)
     SaveStateEntry *se;
     uint32_t instance_id = 0;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FORALL(se, entry) {
         if (strcmp(idstr, se->idstr) == 0
             && instance_id <= se->instance_id) {
             instance_id = se->instance_id + 1;
@@ -689,7 +708,7 @@ static int calculate_compat_instance_id(const char *idstr)
     SaveStateEntry *se;
     int instance_id = 0;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FORALL(se, entry) {
         if (!se->compat) {
             continue;
         }
@@ -803,7 +822,7 @@ void unregister_savevm(VMStateIf *obj, const char *idstr, void *opaque)
     }
     pstrcat(id, sizeof(id), idstr);
 
-    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
+    SAVEVM_FORALL_SAFE(se, entry, new_se) {
         if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
             savevm_state_handler_remove(se);
             g_free(se->compat);
@@ -867,7 +886,7 @@ void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
 {
     SaveStateEntry *se, *new_se;
 
-    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
+    SAVEVM_FORALL_SAFE(se, entry, new_se) {
         if (se->vmsd == vmsd && se->opaque == opaque) {
             savevm_state_handler_remove(se);
             g_free(se->compat);
@@ -1119,7 +1138,7 @@ bool qemu_savevm_state_blocked(Error **errp)
 {
     SaveStateEntry *se;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FORALL(se, entry) {
         if (se->vmsd && se->vmsd->unmigratable) {
             error_setg(errp, "State blocked by non-migratable device '%s'",
                        se->idstr);
@@ -1145,7 +1164,7 @@ bool qemu_savevm_state_guest_unplug_pending(void)
 {
     SaveStateEntry *se;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->vmsd && se->vmsd->dev_unplug_pending &&
             se->vmsd->dev_unplug_pending(se->opaque)) {
             return true;
@@ -1162,7 +1181,7 @@ void qemu_savevm_state_setup(QEMUFile *f)
     int ret;
 
     trace_savevm_state_setup();
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->save_setup) {
             continue;
         }
@@ -1193,7 +1212,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
 
     trace_savevm_state_resume_prepare();
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->resume_prepare) {
             continue;
         }
@@ -1223,7 +1242,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
     int ret = 1;
 
     trace_savevm_state_iterate();
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->save_live_iterate) {
             continue;
         }
@@ -1291,7 +1310,7 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f)
     SaveStateEntry *se;
     int ret;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->save_live_complete_postcopy) {
             continue;
         }
@@ -1324,7 +1343,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
     SaveStateEntry *se;
     int ret;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops ||
             (in_postcopy && se->ops->has_postcopy &&
              se->ops->has_postcopy(se->opaque)) ||
@@ -1366,7 +1385,7 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
     vmdesc = qjson_new();
     json_prop_int(vmdesc, "page_size", qemu_target_page_size());
     json_start_array(vmdesc, "devices");
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
 
         if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
             continue;
@@ -1476,7 +1495,7 @@ void qemu_savevm_state_pending(QEMUFile *f, uint64_t threshold_size,
     *res_postcopy_only = 0;
 
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->save_live_pending) {
             continue;
         }
@@ -1501,7 +1520,7 @@ void qemu_savevm_state_cleanup(void)
     }
 
     trace_savevm_state_cleanup();
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->ops && se->ops->save_cleanup) {
             se->ops->save_cleanup(se->opaque);
         }
@@ -1580,7 +1599,7 @@ int qemu_save_device_state(QEMUFile *f)
     }
     cpu_synchronize_all_states();
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         int ret;
 
         if (se->is_ram) {
@@ -1612,7 +1631,7 @@ static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id)
 {
     SaveStateEntry *se;
 
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FORALL(se, entry) {
         if (!strcmp(se->idstr, idstr) &&
             (instance_id == se->instance_id ||
              instance_id == se->alias_id))
@@ -2334,7 +2353,7 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
     }
 
     trace_qemu_loadvm_state_section_partend(section_id);
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->load_section_id == section_id) {
             break;
         }
@@ -2400,7 +2419,7 @@ static int qemu_loadvm_state_setup(QEMUFile *f)
     int ret;
 
     trace_loadvm_state_setup();
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (!se->ops || !se->ops->load_setup) {
             continue;
         }
@@ -2425,7 +2444,7 @@ void qemu_loadvm_state_cleanup(void)
     SaveStateEntry *se;
 
     trace_loadvm_state_cleanup();
-    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+    SAVEVM_FOREACH(se, entry) {
         if (se->ops && se->ops->load_cleanup) {
             se->ops->load_cleanup(se->opaque);
         }
-- 
1.8.3.1




* [PATCH V1 02/32] savevm: VM handlers mode mask
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 01/32] savevm: add vmstate handler iterators Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 03/32] savevm: QMP command for cprsave Steve Sistare
                   ` (33 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add a new mode argument to qemu_savevm_state() and qemu_loadvm_state() that
can customize the operation.  Define the VMS_MIGRATE and VMS_SNAPSHOT modes
for the existing live migration and snapshot capabilities.

Provide a mode mask for vmstate handlers.  A handler is only processed by
SAVEVM_FOREACH if its mask includes the savevm_state.mode.  Unmodified
handler declarations have a zero mask field, which implicitly enables the
handler for all modes.

No functional change for the VMS_MIGRATE and VMS_SNAPSHOT modes.
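
The filtering rule described above can be shown in isolation.  The following
is a standalone sketch, not the code in migration/savevm.c: it uses a plain
singly-linked list instead of QTAILQ, and the names MODE_*, struct handler,
effective_mask, HANDLER_FOREACH and count_selected are hypothetical.  It keeps
the two properties of the design: a zero mask means "all modes", and the
iterator is a loop followed by a trailing if.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mode bits, mirroring the layout used by this series. */
#define MODE_MIGRATE   (1u << 1)
#define MODE_SNAPSHOT  (1u << 2)
#define MODE_REBOOT    (1u << 3)
#define MODE_ALL       (~0u)

struct handler {
    const char *name;
    unsigned mode_mask;         /* 0 means "enabled for every mode" */
    struct handler *next;
};

/* A zero mask implicitly selects all modes. */
static unsigned effective_mask(const struct handler *h)
{
    assert(h != NULL);
    return h->mode_mask ? h->mode_mask : MODE_ALL;
}

/* Visit only the handlers whose mask overlaps the requested mode. */
#define HANDLER_FOREACH(h, head, mode)                      \
    for ((h) = (head); (h) != NULL; (h) = (h)->next)        \
        if ((mode) & effective_mask(h))

/* Count the handlers that would run for the given mode. */
static int count_selected(struct handler *head, unsigned mode)
{
    struct handler *h;
    int n = 0;

    HANDLER_FOREACH(h, head, mode) {
        n++;
    }
    return n;
}
```

One thing to watch with this macro shape: because it ends in an if, an else
written directly after the loop body binds to the hidden if rather than to
any outer statement.  A break inside the body still leaves the loop as usual.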

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/migration/register.h |  3 +++
 include/migration/vmstate.h  |  9 ++++++++
 migration/migration.c        |  4 ++--
 migration/savevm.c           | 51 +++++++++++++++++++++++++++++++++++---------
 migration/savevm.h           |  4 +++-
 5 files changed, 58 insertions(+), 13 deletions(-)

diff --git a/include/migration/register.h b/include/migration/register.h
index c1dcff0..c030a10 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -17,6 +17,9 @@
 #include "hw/vmstate-if.h"
 
 typedef struct SaveVMHandlers {
+    /* Mask of VMStateMode's that should use this handler */
+    unsigned mode_mask;
+
     /* This runs inside the iothread lock.  */
     SaveStateHandler *save_state;
 
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index f68ed7d..fa575f9 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -158,6 +158,12 @@ typedef enum {
     MIG_PRI_MAX,
 } MigrationPriority;
 
+typedef enum {
+    VMS_MIGRATE  = (1U << 1),
+    VMS_SNAPSHOT = (1U << 2),
+    VMS_MODE_ALL = ~0U
+} VMStateMode;
+
 struct VMStateField {
     const char *name;
     const char *err_hint;
@@ -182,6 +188,7 @@ struct VMStateDescription {
     int minimum_version_id;
     int minimum_version_id_old;
     MigrationPriority priority;
+    unsigned mode_mask;
     LoadStateHandler *load_state_old;
     int (*pre_load)(void *opaque);
     int (*post_load)(void *opaque, int version_id);
@@ -1215,4 +1222,6 @@ void vmstate_register_ram_global(struct MemoryRegion *memory);
 
 bool vmstate_check_only_migratable(const VMStateDescription *vmsd);
 
+void savevm_set_mode(VMStateMode mode);
+
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 2ed9923..e3d0899 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -465,7 +465,7 @@ static void process_incoming_migration_co(void *opaque)
     postcopy_state_set(POSTCOPY_INCOMING_NONE);
     migrate_set_state(&mis->state, MIGRATION_STATUS_NONE,
                       MIGRATION_STATUS_ACTIVE);
-    ret = qemu_loadvm_state(mis->from_src_file);
+    ret = qemu_loadvm_state(mis->from_src_file, VMS_MIGRATE);
 
     ps = postcopy_state_get();
     trace_process_incoming_migration_co_end(ret, ps);
@@ -3414,7 +3414,7 @@ static void *migration_thread(void *opaque)
 
     object_ref(OBJECT(s));
     update_iteration_initial_status(s);
-
+    savevm_set_mode(VMS_MIGRATE);
     qemu_savevm_state_header(s->to_dst_file);
 
     /*
diff --git a/migration/savevm.c b/migration/savevm.c
index a07fcad..ce02b6b 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -256,6 +256,7 @@ typedef struct SaveState {
     const char *name;
     uint32_t target_page_bits;
     uint32_t caps_count;
+    VMStateMode mode;
     MigrationCapability *capabilities;
     QemuUUID uuid;
 } SaveState;
@@ -266,16 +267,15 @@ static SaveState savevm_state = {
     .global_section_id = 0,
 };
 
-/*
- * The FOREACH macros will filter handlers based on the current operation when
- * additional conditions are added in a subsequent patch.
- */
+/* The FOREACH macros filter handlers based on the current operation. */
 
 #define SAVEVM_FOREACH(se, entry)                                    \
     QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
+        if (savevm_state.mode & mode_mask(se))
 
 #define SAVEVM_FOREACH_SAFE(se, entry, new_se)                       \
     QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)   \
+        if (savevm_state.mode & mode_mask(se))
 
 /* The FORALL macros unconditionally loop over all handlers. */
 
@@ -285,6 +285,33 @@ static SaveState savevm_state = {
 #define SAVEVM_FORALL_SAFE(se, entry, new_se)                        \
     QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)
 
+/*
+ * Set the current mode to be used for filtering savevm handlers in
+ * SAVEVM_FOREACH.
+ */
+void savevm_set_mode(VMStateMode mode)
+{
+    savevm_state.mode = mode;
+}
+
+/*
+ * A savevm handler is selected in SAVEVM_FOREACH if its mask overlaps the
+ * current mode.  The mask is defined by either the new vmsd interface or the
+ * legacy ops interface.  If the mask is zero, it implicitly includes all modes.
+ */
+static inline unsigned mode_mask(SaveStateEntry *se)
+{
+    const VMStateDescription *vmsd = se->vmsd;
+    unsigned mask = 0;
+
+    if (vmsd) {
+        mask = vmsd->mode_mask;
+    } else if (se->ops) {
+        mask = se->ops->mode_mask;
+    }
+    return mask ? mask : VMS_MODE_ALL;
+}
+
 static bool should_validate_capability(int capability)
 {
     assert(capability >= 0 && capability < MIGRATION_CAPABILITY__MAX);
@@ -1527,12 +1554,14 @@ void qemu_savevm_state_cleanup(void)
     }
 }
 
-static int qemu_savevm_state(QEMUFile *f, Error **errp)
+static int qemu_savevm_state(QEMUFile *f, VMStateMode mode, Error **errp)
 {
     int ret;
     MigrationState *ms = migrate_get_current();
     MigrationStatus status;
 
+    savevm_set_mode(mode);
+
     if (migration_is_running(ms->state)) {
         error_setg(errp, QERR_MIGRATION_ACTIVE);
         return -EINVAL;
@@ -2557,13 +2586,14 @@ out:
     return ret;
 }
 
-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, VMStateMode mode)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
     Error *local_err = NULL;
     int ret;
 
-    if (qemu_savevm_state_blocked(&local_err)) {
+    if ((mode & (VMS_SNAPSHOT | VMS_MIGRATE)) &&
+        qemu_savevm_state_blocked(&local_err)) {
         error_report_err(local_err);
         return -EINVAL;
     }
@@ -2736,7 +2766,7 @@ int save_snapshot(const char *name, Error **errp)
         error_setg(errp, "Could not open VM state file");
         goto the_end;
     }
-    ret = qemu_savevm_state(f, errp);
+    ret = qemu_savevm_state(f, VMS_SNAPSHOT, errp);
     vm_state_size = qemu_ftell(f);
     ret2 = qemu_fclose(f);
     if (ret < 0) {
@@ -2785,6 +2815,7 @@ void qmp_xen_save_devices_state(const char *filename, bool has_live, bool live,
     int saved_vm_running;
     int ret;
 
+    savevm_set_mode(VMS_MIGRATE);
     if (!has_live) {
         /* live default to true so old version of Xen tool stack can have a
          * successfull live migration */
@@ -2850,7 +2881,7 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
     f = qemu_fopen_channel_input(QIO_CHANNEL(ioc));
     object_unref(OBJECT(ioc));
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, VMS_MIGRATE);
     qemu_fclose(f);
     if (ret < 0) {
         error_setg(errp, QERR_IO_ERROR);
@@ -2928,7 +2959,7 @@ int load_snapshot(const char *name, Error **errp)
     mis->from_src_file = f;
 
     aio_context_acquire(aio_context);
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, VMS_SNAPSHOT);
     migration_incoming_state_destroy();
     aio_context_release(aio_context);
 
diff --git a/migration/savevm.h b/migration/savevm.h
index ba64a7e..4b7ce91 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -14,6 +14,8 @@
 #ifndef MIGRATION_SAVEVM_H
 #define MIGRATION_SAVEVM_H
 
+#include "migration/vmstate.h"
+
 #define QEMU_VM_FILE_MAGIC           0x5145564d
 #define QEMU_VM_FILE_VERSION_COMPAT  0x00000002
 #define QEMU_VM_FILE_VERSION         0x00000003
@@ -60,7 +62,7 @@ void qemu_savevm_send_colo_enable(QEMUFile *f);
 void qemu_savevm_live_state(QEMUFile *f);
 int qemu_save_device_state(QEMUFile *f);
 
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, VMStateMode mode);
 void qemu_loadvm_state_cleanup(void);
 int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
-- 
1.8.3.1




* [PATCH V1 03/32] savevm: QMP command for cprsave
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 01/32] savevm: add vmstate handler iterators Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 02/32] savevm: VM handlers mode mask Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 16:12   ` Eric Blake
  2020-09-11 16:43   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 04/32] savevm: HMP Command " Steve Sistare
                   ` (32 subsequent siblings)
  35 siblings, 2 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

To enable live reboot, provide the cprsave QMP command and the VMS_REBOOT
vmstate-saving operation, which saves the state of the virtual machine in a
simple file.

Syntax:
  {'command':'cprsave', 'data':{'file':'str', 'mode':'str'}}

  The mode argument must be 'reboot'.  Additional modes will be defined in
  the future.
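
  Since more modes are expected later, the string-to-mode mapping could be
  kept in a small table, as in the sketch below.  This is only an
  illustration of the idea, not the parsing code added by this patch; the
  names MODE_REBOOT, mode_names and parse_mode are hypothetical, and the bit
  value simply mirrors the one this patch assigns to VMS_REBOOT.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the VMS_REBOOT bit defined by this patch. */
#define MODE_REBOOT (1u << 3)

struct mode_name {
    const char *name;
    unsigned mode;
};

/* Supporting a new mode means adding one row here. */
static const struct mode_name mode_names[] = {
    { "reboot", MODE_REBOOT },
};

/* Return the mode bit for 'name', or 0 if the name is not recognized. */
static unsigned parse_mode(const char *name)
{
    size_t i;

    assert(name != NULL);

    for (i = 0; i < sizeof(mode_names) / sizeof(mode_names[0]); i++) {
        if (strcmp(name, mode_names[i].name) == 0) {
            return mode_names[i].mode;
        }
    }
    return 0;
}
```

  A caller would treat a zero return as a bad mode and report an error before
  touching the checkpoint file.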

Unlike the savevm command, cprsave supports any type of guest image and
block device.  cprsave stops the VM so that guest ram and block devices are
not modified after state is saved.  Guest ram must be mapped to a persistent
memory file such as /dev/dax0.0.  The ram object vmstate handler and block
device handler do not apply to VMS_REBOOT, so restrict them to VMS_MIGRATE
or VMS_SNAPSHOT.  After cprsave completes successfully, qemu exits.

After issuing cprsave, the caller may update qemu, update the host kernel,
reboot, start qemu using the same arguments as the original process, and
issue the cprload command to restore the guest.  cprload is added by
subsequent patches.

If the caller suspends the guest instead of stopping the VM, such as by
issuing guest-suspend-ram to the qemu guest agent, then cprsave and cprload
support guests with vfio devices.  The guest drivers' suspend methods flush
outstanding requests and re-initialize the devices, and thus there is no
device state to save and restore.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
---
 include/migration/vmstate.h |  1 +
 include/sysemu/sysemu.h     |  2 ++
 migration/block.c           |  1 +
 migration/ram.c             |  1 +
 migration/savevm.c          | 59 +++++++++++++++++++++++++++++++++++++++++++++
 monitor/qmp-cmds.c          |  6 +++++
 qapi/migration.json         | 14 +++++++++++
 7 files changed, 84 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index fa575f9..c58551a 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -161,6 +161,7 @@ typedef enum {
 typedef enum {
     VMS_MIGRATE  = (1U << 1),
     VMS_SNAPSHOT = (1U << 2),
+    VMS_REBOOT   = (1U << 3),
     VMS_MODE_ALL = ~0U
 } VMStateMode;
 
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 4b6a5c4..6fe86e6 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -24,6 +24,8 @@ extern bool machine_init_done;
 void qemu_add_machine_init_done_notifier(Notifier *notify);
 void qemu_remove_machine_init_done_notifier(Notifier *notify);
 
+void save_cpr_snapshot(const char *file, const char *mode, Error **errp);
+
 extern int autostart;
 
 typedef enum {
diff --git a/migration/block.c b/migration/block.c
index 737b649..a69accb 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -1023,6 +1023,7 @@ static SaveVMHandlers savevm_block_handlers = {
     .load_state = block_load,
     .save_cleanup = block_migration_cleanup,
     .is_active = block_is_active,
+    .mode_mask = VMS_MIGRATE | VMS_SNAPSHOT,
 };
 
 void blk_mig_init(void)
diff --git a/migration/ram.c b/migration/ram.c
index 76d4fee..f0d5d9f 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3795,6 +3795,7 @@ static SaveVMHandlers savevm_ram_handlers = {
     .load_setup = ram_load_setup,
     .load_cleanup = ram_load_cleanup,
     .resume_prepare = ram_resume_prepare,
+    .mode_mask = VMS_MIGRATE | VMS_SNAPSHOT,
 };
 
 void ram_mig_init(void)
diff --git a/migration/savevm.c b/migration/savevm.c
index ce02b6b..ff1a46e 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2680,6 +2680,65 @@ int qemu_load_device_state(QEMUFile *f)
     return 0;
 }
 
+static QEMUFile *qf_file_open(const char *filename, int flags, int mode,
+                              Error **errp)
+{
+    QIOChannel *ioc;
+    int fd = qemu_open(filename, flags, mode);
+
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "%s(%s)", __func__, filename);
+        return NULL;
+    }
+
+    ioc = QIO_CHANNEL(qio_channel_file_new_fd(fd));
+
+    if (flags & O_WRONLY) {
+        return qemu_fopen_channel_output(ioc);
+    }
+
+    return qemu_fopen_channel_input(ioc);
+}
+
+void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
+{
+    int ret = 0;
+    QEMUFile *f;
+    VMStateMode op;
+
+    if (!strcmp(mode, "reboot")) {
+        op = VMS_REBOOT;
+    } else {
+        error_setg(errp, "cprsave: bad mode %s", mode);
+        return;
+    }
+
+    f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, errp);
+    if (!f) {
+        return;
+    }
+
+    ret = global_state_store();
+    if (ret) {
+        error_setg(errp, "Error saving global state");
+        qemu_fclose(f);
+        return;
+    }
+
+    vm_stop(RUN_STATE_SAVE_VM);
+
+    ret = qemu_savevm_state(f, op, errp);
+    if ((ret < 0) && !*errp) {
+        error_setg(errp, "qemu_savevm_state failed");
+    }
+    qemu_fclose(f);
+
+    if (op == VMS_REBOOT) {
+        no_shutdown = 0;
+        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+    }
+}
+
 int save_snapshot(const char *name, Error **errp)
 {
     BlockDriverState *bs, *bs1;
diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
index 864cbfa..9ec7b88 100644
--- a/monitor/qmp-cmds.c
+++ b/monitor/qmp-cmds.c
@@ -35,6 +35,7 @@
 #include "qapi/qapi-commands-machine.h"
 #include "qapi/qapi-commands-misc.h"
 #include "qapi/qapi-commands-ui.h"
+#include "qapi/qapi-commands-migration.h"
 #include "qapi/qmp/qerror.h"
 #include "hw/mem/memory-device.h"
 #include "hw/acpi/acpi_dev_interface.h"
@@ -161,6 +162,11 @@ void qmp_cont(Error **errp)
     }
 }
 
+void qmp_cprsave(const char *file, const char *mode, Error **errp)
+{
+    save_cpr_snapshot(file, mode, errp);
+}
+
 void qmp_system_wakeup(Error **errp)
 {
     if (!qemu_wakeup_suspend_enabled()) {
diff --git a/qapi/migration.json b/qapi/migration.json
index d500055..b61df1d 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1621,3 +1621,17 @@
 ##
 { 'event': 'UNPLUG_PRIMARY',
   'data': { 'device-id': 'str' } }
+
+##
+# @cprsave:
+#
+# Create a checkpoint of the virtual machine device state in @file.
+# Guest RAM and guest block device blocks are not saved.
+#
+# @file: name of checkpoint file
+# @mode: 'reboot' : checkpoint can be cprload'ed after a host kexec reboot.
+#
+# Since 5.0
+##
+{ 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'str' } }
+
-- 
1.8.3.1




* [PATCH V1 04/32] savevm: HMP Command for cprsave
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (2 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 03/32] savevm: QMP command for cprsave Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-11 16:57   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 05/32] savevm: QMP command for cprload Steve Sistare
                   ` (31 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Enable HMP access to the cprsave QMP command.

Usage: cprsave <filename> <mode>

Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hmp-commands.hx       | 18 ++++++++++++++++++
 include/monitor/hmp.h |  1 +
 monitor/hmp-cmds.c    | 10 ++++++++++
 3 files changed, 29 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 60f395c..c8defd9 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -354,6 +354,24 @@ SRST
 ERST
 
     {
+        .name       = "cprsave",
+        .args_type  = "file:s,mode:s",
+        .params     = "file 'reboot'",
+        .help       = "create a checkpoint of the VM in file",
+        .cmd        = hmp_cprsave,
+    },
+
+SRST
+``cprsave`` *file* *mode*
+  Stop VCPUs, create a checkpoint of the whole virtual machine and save it
+  in *file*.
+  If *mode* is 'reboot', the checkpoint can be cprload'ed after a host kexec
+  reboot.
+  exec() /usr/bin/qemu-exec if it exists, else exec /usr/bin/qemu-system-x86_64,
+  passing all the original command line arguments.  The VCPUs remain paused.
+ERST
+
+    {
         .name       = "delvm",
         .args_type  = "name:s",
         .params     = "tag",
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index c986cfd..af8ee23 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -59,6 +59,7 @@ void hmp_balloon(Monitor *mon, const QDict *qdict);
 void hmp_loadvm(Monitor *mon, const QDict *qdict);
 void hmp_savevm(Monitor *mon, const QDict *qdict);
 void hmp_delvm(Monitor *mon, const QDict *qdict);
+void hmp_cprsave(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
 void hmp_migrate_incoming(Monitor *mon, const QDict *qdict);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index ae4b6a4..59196ed 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1139,6 +1139,16 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
     qapi_free_AnnounceParameters(params);
 }
 
+void hmp_cprsave(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+
+    qmp_cprsave(qdict_get_try_str(qdict, "file"),
+                qdict_get_try_str(qdict, "mode"),
+                &err);
+    hmp_handle_error(mon, err);
+}
+
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict)
 {
     qmp_migrate_cancel(NULL);
-- 
1.8.3.1




* [PATCH V1 05/32] savevm: QMP command for cprload
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (3 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 04/32] savevm: HMP Command " Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 16:14   ` Eric Blake
  2020-07-30 15:14 ` [PATCH V1 06/32] savevm: HMP Command " Steve Sistare
                   ` (30 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Provide the cprload QMP command.  The VM is created from the file produced
by the cprsave command.  Guest RAM is restored in-place from the shared
memory backend file, and guest block devices are used as is.  The contents
of such devices must not be modified between the cprsave and cprload
operations.  If the VM was running at cprsave time, then VM execution
resumes.

Syntax:
  {'command':'cprload', 'data':{'file':'str'}}
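
The run-state handling described above can be sketched as a small standalone
model.  This is an illustration only, not QEMU code: the ModelRunState enum,
the ModelVM struct and both functions are hypothetical names, and setting
vcpus_started stands in for a call to vm_start().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for QEMU's RunState values; not the real enum. */
typedef enum {
    MODEL_STATE_PRELAUNCH,
    MODEL_STATE_RUNNING,
    MODEL_STATE_PAUSED,
    MODEL_STATE_SUSPENDED,
} ModelRunState;

typedef struct {
    ModelRunState current;   /* state of the freshly started process */
    bool start_on_wake;      /* defer the first start to a wakeup request */
    bool vcpus_started;      /* true once the model "called" vm_start() */
} ModelVM;

/* Apply the run state recorded at cprsave time, after device state loads. */
static void model_restore_runstate(ModelVM *vm, ModelRunState saved)
{
    if (saved == MODEL_STATE_RUNNING) {
        vm->current = MODEL_STATE_RUNNING;
        vm->vcpus_started = true;
    } else {
        vm->current = saved;
        if (saved == MODEL_STATE_SUSPENDED) {
            vm->start_on_wake = true;
        }
    }
}

/* A later wakeup request performs the deferred start exactly once. */
static void model_wakeup(ModelVM *vm)
{
    if (vm->start_on_wake) {
        vm->start_on_wake = false;
        vm->vcpus_started = true;
    }
    vm->current = MODEL_STATE_RUNNING;
}
```

In this model a guest saved while suspended stays suspended after the load,
and its VCPUs start only when model_wakeup() is called.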

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
---
 include/sysemu/sysemu.h |  2 ++
 migration/savevm.c      | 34 ++++++++++++++++++++++++++++++++++
 monitor/qmp-cmds.c      |  5 +++++
 qapi/migration.json     | 11 +++++++++++
 softmmu/vl.c            | 15 ++++++++++++++-
 5 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 6fe86e6..5360da5 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -25,6 +25,7 @@ void qemu_add_machine_init_done_notifier(Notifier *notify);
 void qemu_remove_machine_init_done_notifier(Notifier *notify);
 
 void save_cpr_snapshot(const char *file, const char *mode, Error **errp);
+void load_cpr_snapshot(const char *file, Error **errp);
 
 extern int autostart;
 
@@ -53,6 +54,7 @@ extern uint8_t *boot_splash_filedata;
 extern bool enable_mlock;
 extern bool enable_cpu_pm;
 extern QEMUClockType rtc_clock;
+extern int start_on_wake;
 
 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
diff --git a/migration/savevm.c b/migration/savevm.c
index ff1a46e..1509173 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2948,6 +2948,40 @@ void qmp_xen_load_devices_state(const char *filename, Error **errp)
     migration_incoming_state_destroy();
 }
 
+void load_cpr_snapshot(const char *file, Error **errp)
+{
+    QEMUFile *f;
+    int ret;
+    RunState state;
+
+    if (runstate_is_running()) {
+        error_setg(errp, "cprload called for a running VM");
+        return;
+    }
+
+    f = qf_file_open(file, O_RDONLY, 0, errp);
+    if (!f) {
+        return;
+    }
+
+    ret = qemu_loadvm_state(f, VMS_REBOOT);
+    qemu_fclose(f);
+    if (ret < 0) {
+        error_setg(errp, "Error %d while loading VM state", ret);
+        return;
+    }
+
+    state = global_state_get_runstate();
+    if (state == RUN_STATE_RUNNING) {
+        vm_start();
+    } else {
+        runstate_set(state);
+        if (runstate_check(RUN_STATE_SUSPENDED)) {
+            start_on_wake = 1;
+        }
+    }
+}
+
 int load_snapshot(const char *name, Error **errp)
 {
     BlockDriverState *bs, *bs_vm_state;
diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
index 9ec7b88..81e6feb 100644
--- a/monitor/qmp-cmds.c
+++ b/monitor/qmp-cmds.c
@@ -167,6 +167,11 @@ void qmp_cprsave(const char *file, const char *mode, Error **errp)
     save_cpr_snapshot(file, mode, errp);
 }
 
+void qmp_cprload(const char *file, Error **errp)
+{
+    load_cpr_snapshot(file, errp);
+}
+
 void qmp_system_wakeup(Error **errp)
 {
     if (!qemu_wakeup_suspend_enabled()) {
diff --git a/qapi/migration.json b/qapi/migration.json
index b61df1d..ce4d32b 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1635,3 +1635,14 @@
 ##
 { 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'str' } }
 
+##
+# @cprload:
+#
+# Start virtual machine from checkpoint file that was created earlier using
+# the cprsave command.
+#
+# @file: name of checkpoint file
+#
+# Since 5.0
+##
+{ 'command': 'cprload', 'data': { 'file': 'str' } }
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 660537a..8478778 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -137,6 +137,7 @@ static time_t rtc_ref_start_datetime;
 static int rtc_realtime_clock_offset; /* used only with QEMU_CLOCK_REALTIME */
 static int rtc_host_datetime_offset = -1; /* valid & used only with
                                              RTC_BASE_DATETIME */
+int start_on_wake;
 QEMUClockType rtc_clock;
 int vga_interface_type = VGA_NONE;
 static DisplayOptions dpy;
@@ -602,6 +603,8 @@ static const RunStateTransition runstate_transitions_def[] = {
     { RUN_STATE_PRELAUNCH, RUN_STATE_RUNNING },
     { RUN_STATE_PRELAUNCH, RUN_STATE_FINISH_MIGRATE },
     { RUN_STATE_PRELAUNCH, RUN_STATE_INMIGRATE },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_SUSPENDED },
+    { RUN_STATE_PRELAUNCH, RUN_STATE_PAUSED },
 
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_PAUSED },
@@ -1519,7 +1522,17 @@ void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
     if (!(wakeup_reason_mask & (1 << reason))) {
         return;
     }
-    runstate_set(RUN_STATE_RUNNING);
+
+    /*
+     * Must call vm_start if it has never been called, to invoke the state
+     * change callbacks for the first time.
+     */
+    if (start_on_wake) {
+        start_on_wake = 0;
+        vm_start();
+    } else {
+        runstate_set(RUN_STATE_RUNNING);
+    }
     wakeup_reason = reason;
     qemu_notify_event();
 }
-- 
1.8.3.1




* [PATCH V1 06/32] savevm: HMP Command for cprload
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (4 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 05/32] savevm: QMP command for cprload Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 07/32] savevm: QMP command for cprinfo Steve Sistare
                   ` (29 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Enable HMP access to the cprload QMP command.

Usage: cprload <file>

Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hmp-commands.hx       | 13 +++++++++++++
 include/monitor/hmp.h |  1 +
 monitor/hmp-cmds.c    |  8 ++++++++
 3 files changed, 22 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index c8defd9..cb67150 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -372,6 +372,19 @@ SRST
 ERST
 
     {
+        .name       = "cprload",
+        .args_type  = "file:s",
+        .params     = "file",
+        .help       = "load VM checkpoint from file",
+        .cmd        = hmp_cprload,
+    },
+
+SRST
+``cprload`` *file*
+  Load a virtual machine from checkpoint file *file* and continue VCPUs.
+ERST
+
+    {
         .name       = "delvm",
         .args_type  = "name:s",
         .params     = "tag",
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index af8ee23..7b8cdfd 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -60,6 +60,7 @@ void hmp_loadvm(Monitor *mon, const QDict *qdict);
 void hmp_savevm(Monitor *mon, const QDict *qdict);
 void hmp_delvm(Monitor *mon, const QDict *qdict);
 void hmp_cprsave(Monitor *mon, const QDict *qdict);
+void hmp_cprload(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
 void hmp_migrate_incoming(Monitor *mon, const QDict *qdict);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index 59196ed..ba95737 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1149,6 +1149,14 @@ void hmp_cprsave(Monitor *mon, const QDict *qdict)
     hmp_handle_error(mon, err);
 }
 
+void hmp_cprload(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+
+    qmp_cprload(qdict_get_try_str(qdict, "file"), &err);
+    hmp_handle_error(mon, err);
+}
+
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict)
 {
     qmp_migrate_cancel(NULL);
-- 
1.8.3.1




* [PATCH V1 07/32] savevm: QMP command for cprinfo
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (5 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 06/32] savevm: HMP Command " Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 16:17   ` Eric Blake
  2020-07-30 15:14 ` [PATCH V1 08/32] savevm: HMP " Steve Sistare
                   ` (28 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Provide the cprinfo QMP command.  This returns a string with a space-
separated list of modes supported by cprsave, and can be used by clients
as a feature test to check if the running QEMU instance supports cprsave.

Syntax:
  {'command':'cprinfo', 'returns':'str'}
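
A client could run the feature test on the returned string roughly as
follows.  This is a minimal sketch: cpr_mode_supported() is a hypothetical
client-side helper, not part of this series, and it assumes the reply has
already been extracted from the QMP response as a plain C string.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `mode` appears as a whole word in the space-separated
 * `modes` string, such as the string returned by cprinfo.
 */
static bool cpr_mode_supported(const char *modes, const char *mode)
{
    size_t want = strlen(mode);
    const char *p = modes;

    if (want == 0) {
        return false;
    }
    while (*p) {
        size_t len;

        p += strspn(p, " ");      /* skip separators */
        len = strcspn(p, " ");    /* length of the next token */
        if (len == want && strncmp(p, mode, len) == 0) {
            return true;
        }
        p += len;
    }
    return false;
}
```

Matching whole tokens rather than using strstr() keeps a mode such as
"boot" from matching inside "reboot".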

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 monitor/qmp-cmds.c  | 5 +++++
 qapi/migration.json | 9 +++++++++
 qapi/pragma.json    | 1 +
 3 files changed, 15 insertions(+)

diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
index 81e6feb..8c400e6 100644
--- a/monitor/qmp-cmds.c
+++ b/monitor/qmp-cmds.c
@@ -162,6 +162,11 @@ void qmp_cont(Error **errp)
     }
 }
 
+char *qmp_cprinfo(Error **errp)
+{
+    return g_strdup("reboot");
+}
+
 void qmp_cprsave(const char *file, const char *mode, Error **errp)
 {
     save_cpr_snapshot(file, mode, errp);
diff --git a/qapi/migration.json b/qapi/migration.json
index ce4d32b..8190b16 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1623,6 +1623,15 @@
   'data': { 'device-id': 'str' } }
 
 ##
+# @cprinfo:
+#
+# Return a space-delimited list of modes supported by the cprsave command
+#
+# Since 5.0
+##
+{ 'command': 'cprinfo', 'returns': 'str' }
+
+##
 # @cprsave:
 #
 # Create a checkpoint of the virtual machine device state in @file.
diff --git a/qapi/pragma.json b/qapi/pragma.json
index cffae27..43bdb39 100644
--- a/qapi/pragma.json
+++ b/qapi/pragma.json
@@ -5,6 +5,7 @@
 { 'pragma': {
     # Commands allowed to return a non-dictionary:
     'returns-whitelist': [
+        'cprinfo',
         'human-monitor-command',
         'qom-get',
         'query-migrate-cache-size',
-- 
1.8.3.1




* [PATCH V1 08/32] savevm: HMP command for cprinfo
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (6 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 07/32] savevm: QMP command for cprinfo Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-11 17:27   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 09/32] savevm: prevent cprsave if memory is volatile Steve Sistare
                   ` (27 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Enable HMP access to the cprinfo QMP command.

Usage: cprinfo

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hmp-commands.hx       | 13 +++++++++++++
 include/monitor/hmp.h |  1 +
 monitor/hmp-cmds.c    | 10 ++++++++++
 3 files changed, 24 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index cb67150..7517876 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -354,6 +354,19 @@ SRST
 ERST
 
     {
+        .name       = "cprinfo",
+        .args_type  = "",
+        .params     = "",
+        .help       = "return list of modes supported by cprsave",
+        .cmd        = hmp_cprinfo,
+    },
+
+SRST
+``cprinfo``
+  Return a space-delimited list of modes supported by cprsave.
+ERST
+
+    {
         .name       = "cprsave",
         .args_type  = "file:s,mode:s",
         .params     = "file 'reboot'",
diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
index 7b8cdfd..919b9a9 100644
--- a/include/monitor/hmp.h
+++ b/include/monitor/hmp.h
@@ -59,6 +59,7 @@ void hmp_balloon(Monitor *mon, const QDict *qdict);
 void hmp_loadvm(Monitor *mon, const QDict *qdict);
 void hmp_savevm(Monitor *mon, const QDict *qdict);
 void hmp_delvm(Monitor *mon, const QDict *qdict);
+void hmp_cprinfo(Monitor *mon, const QDict *qdict);
 void hmp_cprsave(Monitor *mon, const QDict *qdict);
 void hmp_cprload(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
index ba95737..2f6af07 100644
--- a/monitor/hmp-cmds.c
+++ b/monitor/hmp-cmds.c
@@ -1139,6 +1139,16 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
     qapi_free_AnnounceParameters(params);
 }
 
+void hmp_cprinfo(Monitor *mon, const QDict *qdict)
+{
+    Error *err = NULL;
+    char *res = qmp_cprinfo(&err);
+
+    monitor_printf(mon, "%s\n", res);
+    g_free(res);
+    hmp_handle_error(mon, err);
+}
+
 void hmp_cprsave(Monitor *mon, const QDict *qdict)
 {
     Error *err = NULL;
-- 
1.8.3.1




* [PATCH V1 09/32] savevm: prevent cprsave if memory is volatile
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (7 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 08/32] savevm: HMP " Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-11 17:35   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 10/32] kvmclock: restore paused KVM clock Steve Sistare
                   ` (26 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

cprsave and cprload require that guest RAM be backed by an externally
visible shared file.  Check this in cprsave, and fail the command if any
writable guest RAM region is not backed that way.
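
The check can be pictured with the standalone model below.  It is a sketch
only: ModelRamBlock is a hypothetical, flattened view of a RAM block and its
memory region, and the two functions are illustrative names rather than QEMU
APIs.  The "pc.rom" exclusion mirrors the one made by name in this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, flattened view of one guest RAM block. */
typedef struct {
    const char *name;     /* memory region name, may be NULL */
    bool is_ram;
    bool is_ram_device;
    bool is_rom;
    int fd;               /* backing file descriptor, -1 if anonymous */
    bool shared;          /* mapped shared rather than private */
} ModelRamBlock;

/* A block is volatile if it is writable RAM without shared file backing. */
static bool model_block_is_volatile(const ModelRamBlock *b)
{
    if (!b->is_ram || b->is_ram_device || b->is_rom) {
        return false;
    }
    if (b->name && strcmp(b->name, "pc.rom") == 0) {
        return false;     /* option ROM shadow is excluded by name */
    }
    return b->fd == -1 || !b->shared;
}

/* Return the first volatile block, or NULL if the checkpoint may proceed. */
static const ModelRamBlock *model_find_volatile(const ModelRamBlock *blocks,
                                                size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (model_block_is_volatile(&blocks[i])) {
            return &blocks[i];
        }
    }
    return NULL;
}
```

Note that a file-backed block mapped private still counts as volatile in this
model, because writes to it never reach the file.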

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 exec.c                | 32 ++++++++++++++++++++++++++++++++
 include/exec/memory.h |  2 ++
 migration/savevm.c    |  4 ++++
 3 files changed, 38 insertions(+)

diff --git a/exec.c b/exec.c
index 6f381f9..02160e0 100644
--- a/exec.c
+++ b/exec.c
@@ -2726,6 +2726,38 @@ ram_addr_t qemu_ram_addr_from_host(void *ptr)
     return block->offset + offset;
 }
 
+/*
+ * Return true if any memory regions are writable and not backed by shared
+ * memory.  Exclude x86 option rom shadow "pc.rom" by name, even though it is
+ * writable.
+ */
+bool qemu_ram_volatile(Error **errp)
+{
+    RAMBlock *block;
+    MemoryRegion *mr;
+    bool ret = false;
+
+    rcu_read_lock();
+    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+        mr = block->mr;
+        if (mr &&
+            memory_region_is_ram(mr) &&
+            !memory_region_is_ram_device(mr) &&
+            !memory_region_is_rom(mr) &&
+            (!mr->name || strcmp(mr->name, "pc.rom")) &&
+            (block->fd == -1 || !qemu_ram_is_shared(block))) {
+
+            error_setg(errp, "Memory region %s is volatile",
+                       memory_region_name(mr));
+            ret = true;
+            break;
+        }
+    }
+
+    rcu_read_unlock();
+    return ret;
+}
+
 /* Generate a debug exception if a watchpoint has been hit.  */
 void cpu_check_watchpoint(CPUState *cpu, vaddr addr, vaddr len,
                           MemTxAttrs attrs, int flags, uintptr_t ra)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 307e527..6aafbb0 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2519,6 +2519,8 @@ bool ram_block_discard_is_disabled(void);
  */
 bool ram_block_discard_is_required(void);
 
+bool qemu_ram_volatile(Error **errp);
+
 #endif
 
 #endif
diff --git a/migration/savevm.c b/migration/savevm.c
index 1509173..f101039 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2713,6 +2713,10 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
         return;
     }
 
+    if (op == VMS_REBOOT && qemu_ram_volatile(errp)) {
+        return;
+    }
+
     f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, errp);
     if (!f) {
         return;
-- 
1.8.3.1




* [PATCH V1 10/32] kvmclock: restore paused KVM clock
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (8 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 09/32] savevm: prevent cprsave if memory is volatile Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-11 17:50   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 11/32] cpu: disable ticks when suspended Steve Sistare
                   ` (25 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

If the VM is paused when the KVM clock is serialized to a file, record
that the clock is valid, so the value will be reused rather than
overwritten after cprload with a new call to KVM_GET_CLOCK here:

kvmclock_vm_state_change()
    if (running)
        ...
    else
        if (s->clock_valid)
            return;         <-- instead, return here

        kvm_update_clock()
           kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data)  <-- overwritten
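
The effect of the flag can be shown with a simplified standalone model.  The
struct and function names are hypothetical, and the kernel_clock parameter
stands in for the value that KVM_GET_CLOCK would return.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t clock;       /* last value captured from the kernel */
    bool clock_valid;     /* true if `clock` must not be re-read */
} ModelKvmClock;

/* Stop-side handler: capture the kernel clock unless a valid value exists. */
static void model_clock_on_stop(ModelKvmClock *s, uint64_t kernel_clock)
{
    if (s->clock_valid) {
        return;                   /* keep the value restored by cprload */
    }
    s->clock = kernel_clock;      /* stands in for KVM_GET_CLOCK */
    s->clock_valid = true;
}

/* pre_save: a clock saved while the VM is not running is marked valid. */
static void model_clock_pre_save(ModelKvmClock *s, bool vm_running)
{
    if (!vm_running) {
        s->clock_valid = true;
    }
}
```

With the flag saved and restored, the stop-side handler in the new process
returns early and the clock value from the checkpoint survives.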

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/i386/kvm/clock.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index 6428335..161991a 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -285,18 +285,22 @@ static int kvmclock_pre_save(void *opaque)
     if (!s->runstate_paused) {
         kvm_update_clock(s);
     }
+    if (!runstate_is_running()) {
+        s->clock_valid = true;
+    }
 
     return 0;
 }
 
 static const VMStateDescription kvmclock_vmsd = {
     .name = "kvmclock",
-    .version_id = 1,
+    .version_id = 2,
     .minimum_version_id = 1,
     .pre_load = kvmclock_pre_load,
     .pre_save = kvmclock_pre_save,
     .fields = (VMStateField[]) {
         VMSTATE_UINT64(clock, KVMClockState),
+        VMSTATE_BOOL_V(clock_valid, KVMClockState, 2),
         VMSTATE_END_OF_LIST()
     },
     .subsections = (const VMStateDescription * []) {
-- 
1.8.3.1




* [PATCH V1 11/32] cpu: disable ticks when suspended
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (9 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 10/32] kvmclock: restore paused KVM clock Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-11 17:53   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 12/32] vl: pause option Steve Sistare
                   ` (24 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

After cprload, the guest console misbehaves: you must type 8 characters
before any are echoed to the terminal.  QEMU was not sending interrupts
to the guest because the QEMU_CLOCK_VIRTUAL timers_state.cpu_clock_offset
was stale.  The offset is usually updated at cprsave time by the path

  save_cpr_snapshot()
    vm_stop()
      do_vm_stop()
        if (runstate_is_running())
          cpu_disable_ticks();
            timers_state.cpu_clock_offset = cpu_get_clock_locked();

However, if the guest is in RUN_STATE_SUSPENDED, then cpu_disable_ticks is
not called.  Further, the earlier transition to suspended in
qemu_system_suspend did not disable ticks.  To fix, call cpu_disable_ticks
from save_cpr_snapshot.
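
Why the offset matters can be seen in a small model of the virtual clock.
This is a sketch, not the QEMU implementation: ModelTimers and its functions
are hypothetical names, and host_now is passed in explicitly so the example
can imitate a host clock that restarts from a small value after a reboot,
which is an assumption about the host clock source.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int64_t offset;       /* plays the role of cpu_clock_offset */
    bool ticks_enabled;
} ModelTimers;

/* Virtual time: frozen at `offset` while ticks are disabled. */
static int64_t model_get_clock(const ModelTimers *t, int64_t host_now)
{
    return t->ticks_enabled ? t->offset + host_now : t->offset;
}

static void model_enable_ticks(ModelTimers *t, int64_t host_now)
{
    if (!t->ticks_enabled) {
        t->offset -= host_now;
        t->ticks_enabled = true;
    }
}

/* Fold the elapsed host time into `offset` so it can be saved safely. */
static void model_disable_ticks(ModelTimers *t, int64_t host_now)
{
    if (t->ticks_enabled) {
        t->offset = model_get_clock(t, host_now);
        t->ticks_enabled = false;
    }
}
```

If the offset is saved while ticks are still enabled, it is relative to the
old host clock, and virtual time jumps when it is combined with the new one.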

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/savevm.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index f101039..00f493b 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2729,6 +2729,11 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
         return;
     }
 
+    /* Update timers_state before saving.  Suspend did not do so. */
+    if (runstate_check(RUN_STATE_SUSPENDED)) {
+        cpu_disable_ticks();
+    }
+
     vm_stop(RUN_STATE_SAVE_VM);
 
     ret = qemu_savevm_state(f, op, errp);
-- 
1.8.3.1




* [PATCH V1 12/32] vl: pause option
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (10 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 11/32] cpu: disable ticks when suspended Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 16:20   ` Eric Blake
  2020-07-30 17:03   ` Alex Bennée
  2020-07-30 15:14 ` [PATCH V1 13/32] gdbstub: gdb support for suspended state Steve Sistare
                   ` (23 subsequent siblings)
  35 siblings, 2 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Provide the -pause command-line parameter and the QEMU_PAUSE environment
variable to briefly pause QEMU in main and allow a developer to attach gdb.
Useful when the developer does not invoke QEMU directly, such as when using
libvirt.

Usage:
  qemu -pause <seconds>
  or
  export QEMU_PAUSE=<seconds>
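
The patch converts the value with atoi(), which silently yields 0 for
malformed input.  A stricter conversion might look like the sketch below;
parse_pause_seconds() is a hypothetical helper shown for illustration, not
what this patch implements.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Parse a non-negative decimal number of seconds.  Return false and leave
 * *seconds untouched if the text is empty, has trailing junk, is negative,
 * or does not fit in an int.
 */
static bool parse_pause_seconds(const char *text, unsigned int *seconds)
{
    char *end;
    long value;

    if (text == NULL || *text == '\0') {
        return false;
    }
    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 0 || value > INT_MAX) {
        return false;
    }
    *seconds = (unsigned int)value;
    return true;
}
```

A caller would report an error and skip the sleep() when the helper returns
false, instead of pausing for 0 seconds without comment.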

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 qemu-options.hx |  9 +++++++++
 softmmu/vl.c    | 15 ++++++++++++++-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/qemu-options.hx b/qemu-options.hx
index 708583b..8505cf2 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -3668,6 +3668,15 @@ SRST
     option is experimental.
 ERST
 
+DEF("pause", HAS_ARG, QEMU_OPTION_pause, \
+    "-pause secs    Pause for secs seconds on entry to main.\n", QEMU_ARCH_ALL)
+
+SRST
+``-pause secs``
+    Pause for a number of seconds on entry to main.  Useful for attaching
+    a debugger after QEMU has been launched by some other entity.
+ERST
+
 DEF("S", 0, QEMU_OPTION_S, \
     "-S              freeze CPU at startup (use 'c' to start execution)\n",
     QEMU_ARCH_ALL)
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 8478778..951994f 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2844,7 +2844,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
 
 void qemu_init(int argc, char **argv, char **envp)
 {
-    int i;
+    int i, seconds;
     int snapshot, linux_boot;
     const char *initrd_filename;
     const char *kernel_filename, *kernel_cmdline;
@@ -2882,6 +2882,13 @@ void qemu_init(int argc, char **argv, char **envp)
     QemuPluginList plugin_list = QTAILQ_HEAD_INITIALIZER(plugin_list);
     int mem_prealloc = 0; /* force preallocation of physical target memory */
 
+    if (getenv("QEMU_PAUSE")) {
+        seconds = atoi(getenv("QEMU_PAUSE"));
+        printf("Pausing %d seconds for debugger. QEMU PID is %d\n",
+               seconds, getpid());
+        sleep(seconds);
+    }
+
     os_set_line_buffering();
 
     error_init(argv[0]);
@@ -3204,6 +3211,12 @@ void qemu_init(int argc, char **argv, char **envp)
             case QEMU_OPTION_gdb:
                 add_device_config(DEV_GDB, optarg);
                 break;
+            case QEMU_OPTION_pause:
+                seconds = atoi(optarg);
+                printf("Pausing %d seconds for debugger. QEMU PID is %d\n",
+                            seconds, getpid());
+                sleep(seconds);
+                break;
             case QEMU_OPTION_L:
                 if (is_help_option(optarg)) {
                     list_data_dirs = true;
-- 
1.8.3.1




* [PATCH V1 13/32] gdbstub: gdb support for suspended state
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (11 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 12/32] vl: pause option Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-11 18:41   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart Steve Sistare
                   ` (22 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Modify the gdb server so that a continue command appears to resume
execution when in RUN_STATE_SUSPENDED: the next gdb prompt is not printed,
but instruction fetch does not actually resume.  While in this "fake"
running mode, a ctrl-C returns the user to the gdb prompt.
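
The intended behavior can be summarized with a toy state model.  All names
here are hypothetical, and the gdb_waiting flag is only an approximation of
"gdb has not been sent a stop report yet"; the real gdbstub tracks this
differently.

```c
#include <stdbool.h>

typedef enum {
    MODEL_RUNNING,
    MODEL_PAUSED,
    MODEL_SUSPENDED,
} ModelState;

typedef struct {
    ModelState state;
    bool gdb_waiting;     /* true while gdb believes the target is running */
} ModelGdb;

/* gdb "continue": only a non-suspended guest actually starts running. */
static void model_gdb_continue(ModelGdb *g)
{
    if (g->state != MODEL_SUSPENDED) {
        g->state = MODEL_RUNNING;     /* stands in for vm_start() */
    }
    g->gdb_waiting = true;            /* no prompt until a stop is reported */
}

/* ctrl-C from gdb: stop a running guest, or just report a stop. */
static void model_gdb_ctrl_c(ModelGdb *g)
{
    if (g->state == MODEL_RUNNING) {
        g->state = MODEL_PAUSED;      /* stands in for vm_stop() */
    }
    /* In the suspended case only the stop report is sent to gdb. */
    g->gdb_waiting = false;
}
```

In this model a suspended guest never leaves MODEL_SUSPENDED through gdb;
only the debugger's view of it changes.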

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 gdbstub.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/gdbstub.c b/gdbstub.c
index f3a318c..2f0d9ff 100644
--- a/gdbstub.c
+++ b/gdbstub.c
@@ -461,7 +461,9 @@ static inline void gdb_continue(void)
 #else
     if (!runstate_needs_reset()) {
         trace_gdbstub_op_continue();
-        vm_start();
+        if (!runstate_check(RUN_STATE_SUSPENDED)) {
+            vm_start();
+        }
     }
 #endif
 }
@@ -490,7 +492,7 @@ static int gdb_continue_partial(char *newstates)
     int flag = 0;
 
     if (!runstate_needs_reset()) {
-        if (vm_prepare_start()) {
+        if (!runstate_check(RUN_STATE_SUSPENDED) && vm_prepare_start()) {
             return 0;
         }
 
@@ -2835,6 +2837,9 @@ static void gdb_read_byte(uint8_t ch)
         /* when the CPU is running, we cannot do anything except stop
            it when receiving a char */
         vm_stop(RUN_STATE_PAUSED);
+    } else if (runstate_check(RUN_STATE_SUSPENDED) && ch == 3) {
+        /* Received ctrl-c from gdb */
+        gdb_vm_state_change(0, 0, RUN_STATE_PAUSED);
     } else
 #endif
     {
@@ -3282,6 +3287,8 @@ static void gdb_sigterm_handler(int signal)
 {
     if (runstate_is_running()) {
         vm_stop(RUN_STATE_PAUSED);
+    } else if (runstate_check(RUN_STATE_SUSPENDED)) {
+        gdb_vm_state_change(0, 0, RUN_STATE_PAUSED);
     }
 }
 #endif
-- 
1.8.3.1




* [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (12 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 13/32] gdbstub: gdb support for suspended state Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 16:22   ` Eric Blake
  2020-09-11 18:44   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 15/32] vl: QEMU_START_FREEZE env var Steve Sistare
                   ` (21 subsequent siblings)
  35 siblings, 2 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add the VMS_RESTART variant of vmstate, for use when upgrading qemu in place
on the same host without a reboot.  Invoke it using:
  cprsave <filename> restart

VMS_RESTART supports guest RAM mapped by private anonymous memory, whereas
VMS_REBOOT requires that guest RAM be mapped by persistent shared memory.
Subsequent patches complete the VMS_RESTART implementation.
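
A minimal sketch of the mode handling, assuming the bit values shown in the
VMStateMode enum of this patch.  The sketch_* names are hypothetical, and
the mask test only illustrates how a load that accepts both modes could
match either one.

```c
#include <string.h>

/* Bit values mirror VMS_REBOOT and VMS_RESTART in this patch. */
enum {
    SKETCH_VMS_REBOOT  = 1U << 3,
    SKETCH_VMS_RESTART = 1U << 4
};

/* Map a cprsave mode string to its mode bit; return 0 for an unknown mode. */
static unsigned sketch_parse_cpr_mode(const char *mode)
{
    if (!strcmp(mode, "reboot")) {
        return SKETCH_VMS_REBOOT;
    } else if (!strcmp(mode, "restart")) {
        return SKETCH_VMS_RESTART;
    }
    return 0;
}

/* A mask of accepted modes matches an operation if the two intersect. */
static int sketch_mode_matches(unsigned accepted_mask, unsigned op)
{
    return (accepted_mask & op) != 0;
}
```

With these definitions, a loader that accepts SKETCH_VMS_REBOOT |
SKETCH_VMS_RESTART matches a checkpoint saved in either mode.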

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hmp-commands.hx             | 4 +++-
 include/migration/vmstate.h | 1 +
 migration/savevm.c          | 4 +++-
 monitor/qmp-cmds.c          | 2 +-
 qapi/migration.json         | 1 +
 5 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 7517876..11a2089 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -369,7 +369,7 @@ ERST
     {
         .name       = "cprsave",
         .args_type  = "file:s,mode:s",
-        .params     = "file 'reboot'",
+        .params     = "file 'restart'|'reboot'",
         .help       = "create a checkpoint of the VM in file",
         .cmd        = hmp_cprsave,
     },
@@ -380,6 +380,8 @@ SRST
   in *file*.
   If *mode* is 'reboot', the checkpoint can be cprload'ed after a host kexec
   reboot.
+  If *mode* is 'restart', the checkpoint can be cprload'ed after restarting
+  qemu.
   exec() /usr/bin/qemu-exec if it exists, else exec /usr/bin/qemu-system-x86_64,
   passing all the original command line arguments.  The VCPUs remain paused.
 ERST
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index c58551a..8239b84 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -162,6 +162,7 @@ typedef enum {
     VMS_MIGRATE  = (1U << 1),
     VMS_SNAPSHOT = (1U << 2),
     VMS_REBOOT   = (1U << 3),
+    VMS_RESTART  = (1U << 4),
     VMS_MODE_ALL = ~0U
 } VMStateMode;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 00f493b..38cc63a 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2708,6 +2708,8 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
 
     if (!strcmp(mode, "reboot")) {
         op = VMS_REBOOT;
+    } else if (!strcmp(mode, "restart")) {
+        op = VMS_RESTART;
     } else {
         error_setg(errp, "cprsave: bad mode %s", mode);
         return;
@@ -2973,7 +2975,7 @@ void load_cpr_snapshot(const char *file, Error **errp)
         return;
     }
 
-    ret = qemu_loadvm_state(f, VMS_REBOOT);
+    ret = qemu_loadvm_state(f, VMS_REBOOT | VMS_RESTART);
     qemu_fclose(f);
     if (ret < 0) {
         error_setg(errp, "Error %d while loading VM state", ret);
diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
index 8c400e6..8a74c6e 100644
--- a/monitor/qmp-cmds.c
+++ b/monitor/qmp-cmds.c
@@ -164,7 +164,7 @@ void qmp_cont(Error **errp)
 
 char *qmp_cprinfo(Error **errp)
 {
-    return g_strdup("reboot");
+    return g_strdup("reboot restart");
 }
 
 void qmp_cprsave(const char *file, const char *mode, Error **errp)
diff --git a/qapi/migration.json b/qapi/migration.json
index 8190b16..d22992b 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1639,6 +1639,7 @@
 #
 # @file: name of checkpoint file
 # @mode: 'reboot' : checkpoint can be cprload'ed after a host kexec reboot.
+#        'restart': checkpoint can be cprload'ed after restarting qemu.
 #
 # Since 5.0
 ##
-- 
1.8.3.1




* [PATCH V1 15/32] vl: QEMU_START_FREEZE env var
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (13 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-11 18:49   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 16/32] oslib: add qemu_clr_cloexec Steve Sistare
                   ` (20 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

For qemu upgrade and restart, we will re-exec() qemu with the same argv.
However, qemu must start in a paused state and wait for the cprload command,
and the original argv might not contain the -S option.  To avoid modifying
argv, provide the QEMU_START_FREEZE environment variable.  If
QEMU_START_FREEZE is set, then set autostart=0, like the -S option.
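
The one-shot nature of the variable can be sketched as follows.  This is
an illustration under assumptions, not the vl.c change itself:
sketch_consume_env_flag() is a hypothetical helper, and clearing the
variable after reading it is what keeps it from leaking into a later exec.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for unsetenv() under strict ISO C modes */
#endif
#include <stdbool.h>
#include <stdlib.h>

/*
 * Return true if the named flag is present in the environment, and remove
 * it so that it is seen at most once.  Only presence matters; the value
 * may be empty.
 */
static bool sketch_consume_env_flag(const char *name)
{
    if (getenv(name) == NULL) {
        return false;
    }
    unsetenv(name);
    return true;
}
```

A caller would set its own equivalent of autostart to 0 when the helper
returns true; a second call for the same name then returns false.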

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 softmmu/vl.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/softmmu/vl.c b/softmmu/vl.c
index 951994f..7016e39 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -4501,6 +4501,11 @@ void qemu_init(int argc, char **argv, char **envp)
         exit(0);
     }
 
+    if (getenv("QEMU_START_FREEZE")) {
+        unsetenv("QEMU_START_FREEZE");
+        autostart = 0;
+    }
+
     if (incoming) {
         Error *local_err = NULL;
         qemu_start_incoming_migration(incoming, &local_err);
-- 
1.8.3.1




* [PATCH V1 16/32] oslib: add qemu_clr_cloexec
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (14 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 15/32] vl: QEMU_START_FREEZE env var Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-11 18:52   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 17/32] util: env var helpers Steve Sistare
                   ` (19 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add qemu_clr_cloexec(), the counterpart of qemu_set_cloexec().  It clears
FD_CLOEXEC on a file descriptor so that the descriptor stays open across
exec().  It is a no-op on Windows.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/qemu/osdep.h | 1 +
 util/oslib-posix.c   | 9 +++++++++
 util/oslib-win32.c   | 4 ++++
 3 files changed, 14 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 45c217a..bb28df1 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -551,6 +551,7 @@ static inline void qemu_timersub(const struct timeval *val1,
 #endif
 
 void qemu_set_cloexec(int fd);
+void qemu_clr_cloexec(int fd);
 
 /* Starting on QEMU 2.5, qemu_hw_version() returns "2.5+" by default
  * instead of QEMU_VERSION, so setting hw_version on MachineClass
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index d923674..28fee45 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -314,6 +314,15 @@ void qemu_set_cloexec(int fd)
     assert(f != -1);
 }
 
+void qemu_clr_cloexec(int fd)
+{
+    int f;
+    f = fcntl(fd, F_GETFD);
+    assert(f != -1);
+    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
+    assert(f != -1);
+}
+
 /*
  * Creates a pipe with FD_CLOEXEC set on both file descriptors
  */
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index 7eedbe5..e5d0c7c 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -254,6 +254,10 @@ void qemu_set_cloexec(int fd)
 {
 }
 
+void qemu_clr_cloexec(int fd)
+{
+}
+
 /* Offset between 1/1/1601 and 1/1/1970 in 100 nanosec units */
 #define _W32_FT_OFFSET (116444736000000000ULL)
 
-- 
1.8.3.1




* [PATCH V1 17/32] util: env var helpers
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (15 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 16/32] oslib: add qemu_clr_cloexec Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-11 19:00   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 18/32] osdep: import MADV_DOEXEC Steve Sistare
                   ` (18 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add functions for saving file descriptors and RAM extents in the
environment via setenv, and for reading them back via getenv.
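
For illustration, a file descriptor round trip through the environment
might look like the sketch below.  It is not the util/env.c code: the
SKETCH_FD_ prefix and both functions are hypothetical, and the 80-byte
name buffer simply mirrors the size used in this patch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for setenv() under strict ISO C modes */
#endif
#include <stdio.h>
#include <stdlib.h>

#define SKETCH_FD_PREFIX "SKETCH_FD_"   /* hypothetical; the patch uses QEMU_FD_ */

/* Record fd under <prefix><name> as a decimal string. */
static void sketch_setenv_fd(const char *name, int fd)
{
    char var[80], val[32];

    snprintf(var, sizeof(var), "%s%s", SKETCH_FD_PREFIX, name);
    snprintf(val, sizeof(val), "%d", fd);
    setenv(var, val, 1);
}

/* Read the fd recorded under <prefix><name>, or return -1 if there is none. */
static int sketch_getenv_fd(const char *name)
{
    char var[80];
    const char *val;

    snprintf(var, sizeof(var), "%s%s", SKETCH_FD_PREFIX, name);
    val = getenv(var);
    return val ? (int)strtol(val, NULL, 10) : -1;
}
```

The recorded number is only meaningful after exec if the descriptor itself
survives, which requires that FD_CLOEXEC be clear on it.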

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
---
 MAINTAINERS           |   7 +++
 include/qemu/cutils.h |   1 +
 include/qemu/env.h    |  31 ++++++++++++
 util/Makefile.objs    |   2 +-
 util/env.c            | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 172 insertions(+), 1 deletion(-)
 create mode 100644 include/qemu/env.h
 create mode 100644 util/env.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 3395abd..8d377a7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3115,3 +3115,10 @@ Performance Tools and Tests
 M: Ahmed Karaman <ahmedkhaledkaraman@gmail.com>
 S: Maintained
 F: scripts/performance/
+
+Environment variable helpers
+M: Steve Sistare <steven.sistare@oracle.com>
+M: Mark Kanda <mark.kanda@oracle.com>
+S: Maintained
+F: include/qemu/env.h
+F: util/env.c
diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
index eb59852..d4c7d70 100644
--- a/include/qemu/cutils.h
+++ b/include/qemu/cutils.h
@@ -1,6 +1,7 @@
 #ifndef QEMU_CUTILS_H
 #define QEMU_CUTILS_H
 
+#include "qemu/env.h"
 /**
  * pstrcpy:
  * @buf: buffer to copy string into
diff --git a/include/qemu/env.h b/include/qemu/env.h
new file mode 100644
index 0000000..53cc121
--- /dev/null
+++ b/include/qemu/env.h
@@ -0,0 +1,31 @@
+/*
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_ENV_H
+#define QEMU_ENV_H
+
+#define FD_PREFIX "QEMU_FD_"
+#define ADDR_PREFIX "QEMU_ADDR_"
+#define LEN_PREFIX "QEMU_LEN_"
+#define BOOL_PREFIX "QEMU_BOOL_"
+
+typedef int (*walkenv_cb)(const char *name, const char *val, void *handle);
+
+bool getenv_ram(const char *name, void **addrp, size_t *lenp);
+void setenv_ram(const char *name, void *addr, size_t len);
+void unsetenv_ram(const char *name);
+int getenv_fd(const char *name);
+void setenv_fd(const char *name, int fd);
+void unsetenv_fd(const char *name);
+bool getenv_bool(const char *name);
+void setenv_bool(const char *name, bool val);
+void unsetenv_bool(const char *name);
+int walkenv(const char *prefix, walkenv_cb cb, void *handle);
+void printenv(void);
+
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index cc5e371..d357932 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,4 @@
-util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
+util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o env.o
 util-obj-$(call lnot,$(CONFIG_ATOMIC64)) += atomic64.o
 util-obj-$(CONFIG_POSIX) += aio-posix.o
 util-obj-$(CONFIG_POSIX) += fdmon-poll.o
diff --git a/util/env.c b/util/env.c
new file mode 100644
index 0000000..0cc4a9f
--- /dev/null
+++ b/util/env.c
@@ -0,0 +1,132 @@
+/*
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/env.h"
+
+static uint64_t getenv_ulong(const char *prefix, const char *name, bool *found)
+{
+    char var[80], *val;
+    uint64_t res;
+
+    snprintf(var, sizeof(var), "%s%s", prefix, name);
+    val = getenv(var);
+    if (val) {
+        *found = true;
+        res = strtoull(val, NULL, 10);
+    } else {
+        *found = false;
+        res = 0;
+    }
+    return res;
+}
+
+static void setenv_ulong(const char *prefix, const char *name, uint64_t val)
+{
+    char var[80], val_str[80];
+    snprintf(var, sizeof(var), "%s%s", prefix, name);
+    snprintf(val_str, sizeof(val_str), "%"PRIu64, val);
+    setenv(var, val_str, 1);
+}
+
+static void unsetenv_ulong(const char *prefix, const char *name)
+{
+    char var[80];
+    snprintf(var, sizeof(var), "%s%s", prefix, name);
+    unsetenv(var);
+}
+
+bool getenv_ram(const char *name, void **addrp, size_t *lenp)
+{
+    bool found1, found2;
+    *addrp = (void *) getenv_ulong(ADDR_PREFIX, name, &found1);
+    *lenp = getenv_ulong(LEN_PREFIX, name, &found2);
+    assert(found1 == found2);
+    return found1;
+}
+
+void setenv_ram(const char *name, void *addr, size_t len)
+{
+    setenv_ulong(ADDR_PREFIX, name, (uint64_t)addr);
+    setenv_ulong(LEN_PREFIX, name, len);
+}
+
+void unsetenv_ram(const char *name)
+{
+    unsetenv_ulong(ADDR_PREFIX, name);
+    unsetenv_ulong(LEN_PREFIX, name);
+}
+
+int getenv_fd(const char *name)
+{
+    bool found;
+    int fd = getenv_ulong(FD_PREFIX, name, &found);
+    if (!found) {
+        fd = -1;
+    }
+    return fd;
+}
+
+void setenv_fd(const char *name, int fd)
+{
+    setenv_ulong(FD_PREFIX, name, fd);
+}
+
+void unsetenv_fd(const char *name)
+{
+    unsetenv_ulong(FD_PREFIX, name);
+}
+
+bool getenv_bool(const char *name)
+{
+    bool found;
+    bool val = getenv_ulong(BOOL_PREFIX, name, &found);
+    if (!found) {
+        val = -1;
+    }
+    return val;
+}
+
+void setenv_bool(const char *name, bool val)
+{
+    setenv_ulong(BOOL_PREFIX, name, val);
+}
+
+void unsetenv_bool(const char *name)
+{
+    unsetenv_ulong(BOOL_PREFIX, name);
+}
+
+int walkenv(const char *prefix, walkenv_cb cb, void *handle)
+{
+    char *str, name[128];
+    char **envp = environ;
+    size_t prefix_len = strlen(prefix);
+
+    while (*envp) {
+        str = *envp++;
+        if (!strncmp(str, prefix, prefix_len)) {
+            char *val = strchr(str, '=');
+            str += prefix_len;
+            strncpy(name, str, val - str);
+            name[val - str] = 0;
+            if (cb(name, val + 1, handle)) {
+                return 1;
+            }
+        }
+    }
+    return 0;
+}
+
+void printenv(void)
+{
+    char **ptr = environ;
+    while (*ptr) {
+        puts(*ptr++);
+    }
+}
-- 
1.8.3.1




* [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (16 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 17/32] util: env var helpers Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-08-17 18:30   ` Steven Sistare
  2020-07-30 15:14 ` [PATCH V1 19/32] memory: ram_block_add cosmetic changes Steve Sistare
                   ` (17 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Anonymous memory segments used by the guest are preserved across a re-exec
of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
in the Linux kernel. For the madvise patches, see:
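
The fallback pattern used for the new flag can be sketched as below.  The
SKETCH_* macros and sketch_madvise() are hypothetical names that imitate
the QEMU_MADV_* convention; on a kernel header without the proposed
MADV_DOEXEC, the flag maps to an invalid value that the wrapper rejects.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE           /* for madvise() under strict ISO C modes */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#define SKETCH_MADV_INVALID (-1)

/* Use the kernel's flag when the headers provide it, else mark it invalid. */
#ifdef MADV_DOEXEC
#define SKETCH_MADV_DOEXEC MADV_DOEXEC
#else
#define SKETCH_MADV_DOEXEC SKETCH_MADV_INVALID
#endif

/* Reject unsupported advice with EINVAL instead of calling the kernel. */
static int sketch_madvise(void *addr, size_t len, int advice)
{
    if (advice == SKETCH_MADV_INVALID) {
        errno = EINVAL;
        return -1;
    }
    return madvise(addr, len, advice);
}

/* Build-time answer to "was MADV_DOEXEC available in the headers?" */
static int sketch_doexec_supported(void)
{
    return SKETCH_MADV_DOEXEC != SKETCH_MADV_INVALID;
}
```

A caller can test sketch_doexec_supported() up front and refuse the
operation early, rather than failing part way through.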

https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/qemu/osdep.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index bb28df1..7ce555a 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -390,6 +390,11 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #else
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
 #endif
+#ifdef MADV_DOEXEC
+#define QEMU_MADV_DOEXEC MADV_DOEXEC
+#else
+#define QEMU_MADV_DOEXEC QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -403,6 +408,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_DOEXEC  QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -416,6 +422,7 @@ void qemu_anon_ram_free(void *ptr, size_t size);
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_DOEXEC  QEMU_MADV_INVALID
 
 #endif
 
-- 
1.8.3.1




* [PATCH V1 19/32] memory: ram_block_add cosmetic changes
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (17 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 18/32] osdep: import MADV_DOEXEC Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 20/32] vl: add helper to request re-exec Steve Sistare
                   ` (16 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Massage the code to simplify the later patch "exec, memory: exec(3) to
restart".

No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 exec.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/exec.c b/exec.c
index 02160e0..359e437 100644
--- a/exec.c
+++ b/exec.c
@@ -2233,32 +2233,37 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
     RAMBlock *last_block = NULL;
     ram_addr_t old_ram_size, new_ram_size;
     Error *err = NULL;
+    const char *name;
+    void *addr;
+    size_t maxlen;
 
     old_ram_size = last_ram_page();
 
     qemu_mutex_lock_ramlist();
-    new_block->offset = find_ram_offset(new_block->max_length);
+    maxlen = new_block->max_length;
+    new_block->offset = find_ram_offset(maxlen);
 
     if (!new_block->host) {
         if (xen_enabled()) {
-            xen_ram_alloc(new_block->offset, new_block->max_length,
-                          new_block->mr, &err);
+            xen_ram_alloc(new_block->offset, maxlen, new_block->mr, &err);
             if (err) {
                 error_propagate(errp, err);
                 qemu_mutex_unlock_ramlist();
                 return;
             }
         } else {
-            new_block->host = phys_mem_alloc(new_block->max_length,
-                                             &new_block->mr->align, shared);
-            if (!new_block->host) {
+            name = memory_region_name(new_block->mr);
+            addr = phys_mem_alloc(maxlen, &new_block->mr->align, shared);
+
+            if (!addr) {
                 error_setg_errno(errp, errno,
                                  "cannot set up guest memory '%s'",
-                                 memory_region_name(new_block->mr));
+                                 name);
                 qemu_mutex_unlock_ramlist();
                 return;
             }
-            memory_try_enable_merging(new_block->host, new_block->max_length);
+            memory_try_enable_merging(addr, maxlen);
+            new_block->host = addr;
         }
     }
 
-- 
1.8.3.1




* [PATCH V1 20/32] vl: add helper to request re-exec
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (18 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 19/32] memory: ram_block_add cosmetic changes Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 21/32] exec, memory: exec(3) to restart Steve Sistare
                   ` (15 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add a qemu_system_exec_request() helper that causes the main loop to exit
and re-exec qemu using the same initial arguments.  The main loop detects
the request via qemu_exec_requested().  If /usr/bin/qemu-exec exists and is
executable, exec that instead.  This is an optional site-specific trampoline
that may alter the environment before exec'ing the qemu binary.
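
The choice between the trampoline and the original binary can be sketched
as follows.  This is an illustration, not the vl.c code: both functions
are hypothetical, and the helper path is passed in rather than hard-coded.

```c
#include <stdio.h>
#include <unistd.h>

/* Prefer the helper if it exists and is executable, else use the fallback. */
static const char *sketch_choose_exec_path(const char *helper,
                                           const char *fallback)
{
    return access(helper, X_OK) == 0 ? helper : fallback;
}

/* Replace the current process image, reusing the saved argv unchanged. */
static void sketch_reexec(const char *helper, char **argv)
{
    execvp(sketch_choose_exec_path(helper, argv[0]), argv);
    /* execvp returns only on failure. */
    perror("execvp");
    _exit(1);
}
```

Because argv is passed through untouched, a trampoline sees the same
arguments the original process was started with.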

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/sysemu/sysemu.h |  1 +
 softmmu/vl.c            | 30 ++++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 5360da5..4dfc4ca 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -15,6 +15,7 @@ extern QemuUUID qemu_uuid;
 extern bool qemu_uuid_set;
 
 void qemu_add_data_dir(const char *path);
+void qemu_system_exec_request(void);
 
 void qemu_add_exit_notifier(Notifier *notify);
 void qemu_remove_exit_notifier(Notifier *notify);
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 7016e39..72f0e08 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -116,6 +116,7 @@
 
 #define MAX_VIRTIO_CONSOLES 1
 
+static char **argv_main;
 static const char *data_dir[16];
 static int data_dir_idx;
 const char *bios_name = NULL;
@@ -1296,6 +1297,7 @@ static ShutdownCause reset_requested;
 static ShutdownCause shutdown_requested;
 static int shutdown_signal;
 static pid_t shutdown_pid;
+static int exec_requested;
 static int powerdown_requested;
 static int debug_requested;
 static int suspend_requested;
@@ -1326,6 +1328,11 @@ static int qemu_shutdown_requested(void)
     return atomic_xchg(&shutdown_requested, SHUTDOWN_CAUSE_NONE);
 }
 
+static int qemu_exec_requested(void)
+{
+    return atomic_xchg(&exec_requested, 0);
+}
+
 static void qemu_kill_report(void)
 {
     if (!qtest_driver() && shutdown_signal) {
@@ -1582,6 +1589,13 @@ void qemu_system_shutdown_request(ShutdownCause reason)
     qemu_notify_event();
 }
 
+void qemu_system_exec_request(void)
+{
+    shutdown_requested = 1;
+    exec_requested = 1;
+    qemu_notify_event();
+}
+
 static void qemu_system_powerdown(void)
 {
     qapi_event_send_powerdown();
@@ -1617,6 +1631,16 @@ void qemu_system_debug_request(void)
     qemu_notify_event();
 }
 
+static void qemu_exec(void)
+{
+    const char *helper = "/usr/bin/qemu-exec";
+    const char *bin = !access(helper, X_OK) ? helper : argv_main[0];
+
+    execvp(bin, argv_main);
+    error_report("execvp failed, errno %d.", errno);
+    exit(1);
+}
+
 static bool main_loop_should_exit(void)
 {
     RunState r;
@@ -1637,6 +1661,11 @@ static bool main_loop_should_exit(void)
     }
     request = qemu_shutdown_requested();
     if (request) {
+
+        if (qemu_exec_requested()) {
+            qemu_exec();
+            /* not reached */
+        }
         qemu_kill_report();
         qemu_system_shutdown(request);
         if (no_shutdown) {
@@ -2891,6 +2920,7 @@ void qemu_init(int argc, char **argv, char **envp)
 
     os_set_line_buffering();
 
+    argv_main = argv;
     error_init(argv[0]);
     module_call_init(MODULE_INIT_TRACE);
 
-- 
1.8.3.1




* [PATCH V1 21/32] exec, memory: exec(3) to restart
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (19 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 20/32] vl: add helper to request re-exec Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 22/32] char: qio_channel_socket_accept reuse fd Steve Sistare
                   ` (14 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Use exec() to restart qemu to a potentially new version, while preserving
guest RAM.  The guest pauses briefly.

cprsave saves the address and length of RAM blocks to the environment via
setenv, tags the RAM with the new madvise(MADV_DOEXEC) option to preserve
it across exec, then exec()'s the (typically updated) qemu binary with the
original argv.

On qemu restart, ram_block_add() finds the env vars that describe preserved
RAM segments and does not reallocate them.
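
The lookup-or-allocate step can be sketched as below, under loud
assumptions: the SKETCH_ADDR_ and SKETCH_LEN_ prefixes and all functions
are hypothetical, and malloc() stands in for the real allocator.  Memory
from malloc() is not preserved across exec; in the series that is the job
of the proposed madvise(MADV_DOEXEC), so this sketch only demonstrates the
bookkeeping within one process.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for setenv() under strict ISO C modes */
#endif
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static bool sketch_getenv_u64(const char *prefix, const char *name,
                              uint64_t *out)
{
    char var[80];
    const char *val;

    snprintf(var, sizeof(var), "%s%s", prefix, name);
    val = getenv(var);
    if (!val) {
        return false;
    }
    *out = strtoull(val, NULL, 10);
    return true;
}

static void sketch_setenv_u64(const char *prefix, const char *name,
                              uint64_t v)
{
    char var[80], val[32];

    snprintf(var, sizeof(var), "%s%s", prefix, name);
    snprintf(val, sizeof(val), "%" PRIu64, v);
    setenv(var, val, 1);
}

/*
 * Return the recorded address for a named block if one exists, otherwise
 * allocate the block and record its extent.  A recorded length that does
 * not match is treated as an error and yields NULL.
 */
static void *sketch_ram_lookup_or_alloc(const char *name, size_t len,
                                        bool *reused)
{
    uint64_t addr, old_len;
    void *p;

    *reused = false;
    if (sketch_getenv_u64("SKETCH_ADDR_", name, &addr) &&
        sketch_getenv_u64("SKETCH_LEN_", name, &old_len)) {
        if (old_len != len) {
            return NULL;
        }
        *reused = true;
        return (void *)(uintptr_t)addr;
    }
    p = malloc(len);
    if (p) {
        sketch_setenv_u64("SKETCH_ADDR_", name, (uint64_t)(uintptr_t)p);
        sketch_setenv_u64("SKETCH_LEN_", name, len);
    }
    return p;
}
```

Calling the function twice for the same name returns the same address the
second time, with reused set to true.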

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 exec.c                | 36 ++++++++++++++++++++++++++++++++++--
 include/exec/memory.h |  2 ++
 migration/savevm.c    | 16 ++++++++++++++++
 3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/exec.c b/exec.c
index 359e437..5473c09 100644
--- a/exec.c
+++ b/exec.c
@@ -2235,7 +2235,7 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
     Error *err = NULL;
     const char *name;
     void *addr;
-    size_t maxlen;
+    size_t len, maxlen;
 
     old_ram_size = last_ram_page();
 
@@ -2253,7 +2253,12 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
             }
         } else {
             name = memory_region_name(new_block->mr);
-            addr = phys_mem_alloc(maxlen, &new_block->mr->align, shared);
+            if (getenv_ram(name, &addr, &len)) {
+                assert(len == maxlen);
+            } else {
+                addr = phys_mem_alloc(maxlen, &new_block->mr->align, shared);
+                setenv_ram(name, addr, maxlen);
+            }
 
             if (!addr) {
                 error_setg_errno(errp, errno,
@@ -2499,6 +2504,8 @@ void qemu_ram_free(RAMBlock *block)
         return;
     }
 
+    unsetenv_ram(memory_region_name(block->mr));
+
     if (block->host) {
         ram_block_notify_remove(block->host, block->max_length);
     }
@@ -2763,6 +2770,31 @@ bool qemu_ram_volatile(Error **errp)
     return ret;
 }
 
+static int preserve_ram(const char *name, const char *val, void *handle)
+{
+    void *addr;
+    size_t len;
+    Error **errp = handle;
+
+    getenv_ram(name, &addr, &len);
+    if (qemu_madvise(addr, len, QEMU_MADV_DOEXEC)) {
+        error_setg_errno(errp, errno,
+                         "MADV_DOEXEC failed on memory region %s", name);
+        return 1;
+    }
+    return 0;
+}
+
+
+int qemu_preserve_ram(Error **errp)
+{
+    int ret;
+    qemu_mutex_lock_ramlist();
+    ret = walkenv(ADDR_PREFIX, preserve_ram, errp);
+    qemu_mutex_unlock_ramlist();
+    return ret;
+}
+
 /* Generate a debug exception if a watchpoint has been hit.  */
 void cpu_check_watchpoint(CPUState *cpu, vaddr addr, vaddr len,
                           MemTxAttrs attrs, int flags, uintptr_t ra)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 6aafbb0..e2d297d 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -2521,6 +2521,8 @@ bool ram_block_discard_is_required(void);
 
 bool qemu_ram_volatile(Error **errp);
 
+int qemu_preserve_ram(Error **errp);
+
 #endif
 
 #endif
diff --git a/migration/savevm.c b/migration/savevm.c
index 38cc63a..2902006 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2719,6 +2719,16 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
         return;
     }
 
+    if (op == VMS_RESTART && QEMU_MADV_DOEXEC == QEMU_MADV_INVALID) {
+        error_setg(errp, "kernel does not support MADV_DOEXEC.");
+        return;
+    }
+
+    if (op == VMS_RESTART && xen_enabled()) {
+        error_setg(errp, "xen does not support cprsave restart");
+        return;
+    }
+
     f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, errp);
     if (!f) {
         return;
@@ -2747,6 +2757,12 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
     if (op == VMS_REBOOT) {
         no_shutdown = 0;
         qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+    } else if (op == VMS_RESTART) {
+        if (qemu_preserve_ram(errp)) {
+            return;
+        }
+        qemu_system_exec_request();
+        putenv((char *)"QEMU_START_FREEZE=");
     }
 }
 
-- 
1.8.3.1




* [PATCH V1 22/32] char: qio_channel_socket_accept reuse fd
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (20 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 21/32] exec, memory: exec(3) to restart Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-15 17:33   ` Dr. David Alan Gilbert
  2020-07-30 15:14 ` [PATCH V1 23/32] char: save/restore chardev socket fds Steve Sistare
                   ` (13 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

From: Mark Kanda <mark.kanda@oracle.com>

Add an fd argument to qio_channel_socket_accept.  If not -1, the channel
uses that fd instead of accepting a new socket connection.  All callers
pass -1 in this patch, so no functional change.
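
The new argument's contract can be sketched with plain sockets.  This is
an illustration only: sketch_accept_or_reuse() is a hypothetical function
that works on raw descriptors rather than QIOChannelSocket objects.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * If reuse_fd is not -1, adopt it as the connected socket.  Otherwise
 * accept a new connection on listen_fd, retrying if interrupted.
 * Returns the connected descriptor, or -1 with errno set.
 */
static int sketch_accept_or_reuse(int listen_fd, int reuse_fd)
{
    int fd;

    if (reuse_fd != -1) {
        return reuse_fd;
    }
    do {
        fd = accept(listen_fd, NULL, NULL);
    } while (fd < 0 && errno == EINTR);
    return fd;
}
```

Passing -1 keeps the existing behavior, which is why callers that do not
preserve a descriptor need no other change.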

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/io/channel-socket.h    |  3 ++-
 io/channel-socket.c            | 12 +++++++++---
 io/net-listener.c              |  4 ++--
 scsi/qemu-pr-helper.c          |  2 +-
 tests/qtest/tpm-emu.c          |  2 +-
 tests/test-char.c              |  2 +-
 tests/test-io-channel-socket.c |  4 ++--
 7 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
index 777ff59..0ffc560 100644
--- a/include/io/channel-socket.h
+++ b/include/io/channel-socket.h
@@ -248,6 +248,7 @@ qio_channel_socket_get_remote_address(QIOChannelSocket *ioc,
 /**
  * qio_channel_socket_accept:
  * @ioc: the socket channel object
+ * @reuse_fd: fd to reuse; -1 otherwise
  * @errp: pointer to a NULL-initialized error object
  *
  * If the socket represents a server, then this accepts
@@ -258,7 +259,7 @@ qio_channel_socket_get_remote_address(QIOChannelSocket *ioc,
  */
 QIOChannelSocket *
 qio_channel_socket_accept(QIOChannelSocket *ioc,
-                          Error **errp);
+                          int reuse_fd, Error **errp);
 
 
 #endif /* QIO_CHANNEL_SOCKET_H */
diff --git a/io/channel-socket.c b/io/channel-socket.c
index e1b4667..dde12bf 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -352,7 +352,7 @@ void qio_channel_socket_dgram_async(QIOChannelSocket *ioc,
 
 QIOChannelSocket *
 qio_channel_socket_accept(QIOChannelSocket *ioc,
-                          Error **errp)
+                          int reuse_fd, Error **errp)
 {
     QIOChannelSocket *cioc;
 
@@ -362,8 +362,14 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
 
  retry:
     trace_qio_channel_socket_accept(ioc);
-    cioc->fd = qemu_accept(ioc->fd, (struct sockaddr *)&cioc->remoteAddr,
-                           &cioc->remoteAddrLen);
+
+    if (reuse_fd != -1) {
+        cioc->fd = reuse_fd;
+    } else {
+        cioc->fd = qemu_accept(ioc->fd, (struct sockaddr *)&cioc->remoteAddr,
+                               &cioc->remoteAddrLen);
+    }
+
     if (cioc->fd < 0) {
         if (errno == EINTR) {
             goto retry;
diff --git a/io/net-listener.c b/io/net-listener.c
index 5d8a226..bbdea1e 100644
--- a/io/net-listener.c
+++ b/io/net-listener.c
@@ -45,7 +45,7 @@ static gboolean qio_net_listener_channel_func(QIOChannel *ioc,
     QIOChannelSocket *sioc;
 
     sioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
-                                     NULL);
+                                     -1, NULL);
     if (!sioc) {
         return TRUE;
     }
@@ -194,7 +194,7 @@ static gboolean qio_net_listener_wait_client_func(QIOChannel *ioc,
     QIOChannelSocket *sioc;
 
     sioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
-                                     NULL);
+                                     -1, NULL);
     if (!sioc) {
         return TRUE;
     }
diff --git a/scsi/qemu-pr-helper.c b/scsi/qemu-pr-helper.c
index 57ad830..0e6d683 100644
--- a/scsi/qemu-pr-helper.c
+++ b/scsi/qemu-pr-helper.c
@@ -800,7 +800,7 @@ static gboolean accept_client(QIOChannel *ioc, GIOCondition cond, gpointer opaqu
     PRHelperClient *prh;
 
     cioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
-                                     NULL);
+                                     -1, NULL);
     if (!cioc) {
         return TRUE;
     }
diff --git a/tests/qtest/tpm-emu.c b/tests/qtest/tpm-emu.c
index 2e8eb7b..19e5dab 100644
--- a/tests/qtest/tpm-emu.c
+++ b/tests/qtest/tpm-emu.c
@@ -83,7 +83,7 @@ void *tpm_emu_ctrl_thread(void *data)
     g_cond_signal(&s->data_cond);
 
     qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
-    ioc = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
+    ioc = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
     g_assert(ioc);
 
     {
diff --git a/tests/test-char.c b/tests/test-char.c
index 614bdac..1bb6ae0 100644
--- a/tests/test-char.c
+++ b/tests/test-char.c
@@ -884,7 +884,7 @@ char_socket_client_server_thread(gpointer data)
     QIOChannelSocket *cioc;
 
 retry:
-    cioc = qio_channel_socket_accept(ioc, &error_abort);
+    cioc = qio_channel_socket_accept(ioc, -1, &error_abort);
     g_assert_nonnull(cioc);
 
     if (char_socket_ping_pong(QIO_CHANNEL(cioc), NULL) != 0) {
diff --git a/tests/test-io-channel-socket.c b/tests/test-io-channel-socket.c
index d43083a..0d410cf 100644
--- a/tests/test-io-channel-socket.c
+++ b/tests/test-io-channel-socket.c
@@ -75,7 +75,7 @@ static void test_io_channel_setup_sync(SocketAddress *listen_addr,
     qio_channel_set_delay(*src, false);
 
     qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
-    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
+    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
     g_assert(*dst);
 
     test_io_channel_set_socket_bufs(*src, *dst);
@@ -143,7 +143,7 @@ static void test_io_channel_setup_async(SocketAddress *listen_addr,
     g_assert(!data.err);
 
     qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
-    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
+    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
     g_assert(*dst);
 
     qio_channel_set_delay(*src, false);
-- 
1.8.3.1




* [PATCH V1 23/32] char: save/restore chardev socket fds
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (21 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 22/32] char: qio_channel_socket_accept reuse fd Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 24/32] ui: save/restore vnc " Steve Sistare
                   ` (12 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

From: Mark Kanda <mark.kanda@oracle.com>

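Iterate through the character devices and, for each connected socket
chardev, save its fd in the environment before exec.  After exec, a socket
chardev that finds a saved fd reuses it instead of waiting for a new
connection.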

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
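Note: the general technique of passing an open fd across exec through the
environment can be sketched as below.  This is an illustration under
assumptions, not the helpers used by this series: demo_save_fd(),
demo_load_fd() and DEMO_FD_PREFIX are hypothetical names, and the sketch
clears FD_CLOEXEC at save time whereas the patch does so in a separate
walkenv() pass.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_FD_PREFIX "DEMO_FD_"   /* hypothetical; not QEMU's FD_PREFIX */

/* Record fd under DEMO_FD_<name> and let it survive exec(). */
static int demo_save_fd(const char *name, int fd)
{
    char key[128], val[16];
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0 || fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0) {
        return -1;
    }
    snprintf(key, sizeof(key), DEMO_FD_PREFIX "%s", name);
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(key, val, 1);
}

/* Return the fd recorded for name, or -1; the entry is consumed. */
static int demo_load_fd(const char *name)
{
    char key[128];
    const char *val;
    int fd;

    snprintf(key, sizeof(key), DEMO_FD_PREFIX "%s", name);
    val = getenv(key);
    if (!val) {
        return -1;
    }
    fd = atoi(val);             /* parse before the entry is removed */
    unsetenv(key);
    return fd;
}
```

The old process would call demo_save_fd() for each chardev label just
before exec; the new process calls demo_load_fd() with the same label and
falls back to its normal open path when -1 is returned.
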
 chardev/char-socket.c   | 35 +++++++++++++++++++++++++++++++++++
 chardev/char.c          | 14 ++++++++++++++
 include/chardev/char.h  |  5 +++++
 include/sysemu/sysemu.h |  1 +
 migration/savevm.c      |  8 ++++++++
 5 files changed, 63 insertions(+)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index ef62dbf..e08e7e1 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -36,6 +36,8 @@
 #include "qapi/qapi-visit-sockets.h"
 
 #include "chardev/char-io.h"
+#include "sysemu/sysemu.h"
+#include "qemu/cutils.h"
 
 /***********************************************************/
 /* TCP Net console */
@@ -400,6 +402,7 @@ static void tcp_chr_free_connection(Chardev *chr)
     SocketChardev *s = SOCKET_CHARDEV(chr);
     int i;
 
+    unsetenv_fd(chr->label);
     if (s->read_msgfds_num) {
         for (i = 0; i < s->read_msgfds_num; i++) {
             close(s->read_msgfds[i]);
@@ -1375,6 +1378,9 @@ static void qmp_chardev_open_socket(Chardev *chr,
             return;
         }
     }
+
+    load_char_socket_fd(chr);
+
 }
 
 static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend *backend,
@@ -1517,3 +1523,32 @@ static void register_types(void)
 }
 
 type_init(register_types);
+
+void save_char_socket_fd(Chardev *chr)
+{
+    SocketChardev *sockchar = SOCKET_CHARDEV(chr);
+
+    if (sockchar->sioc) {
+        setenv_fd(chr->label, sockchar->sioc->fd);
+    }
+}
+
+void load_char_socket_fd(Chardev *chr)
+{
+    SocketChardev *sockchar;
+    QIOChannelSocket *sioc;
+
+    int fd = getenv_fd(chr->label);
+
+    if (fd != -1) {
+        unsetenv_fd(chr->label);
+        sockchar = SOCKET_CHARDEV(chr);
+        sioc = qio_channel_socket_accept(*sockchar->listener->sioc, fd, NULL);
+        if (sioc) {
+            tcp_chr_accept(sockchar->listener, sioc, chr);
+        } else {
+            error_printf("error: could not restore socket for %s\n",
+                         chr->label);
+        }
+    }
+}
diff --git a/chardev/char.c b/chardev/char.c
index 77e7ec8..8fd54cc 100644
--- a/chardev/char.c
+++ b/chardev/char.c
@@ -34,6 +34,7 @@
 #include "qapi/qapi-commands-char.h"
 #include "qapi/qmp/qerror.h"
 #include "sysemu/replay.h"
+#include "sysemu/sysemu.h"
 #include "qemu/help_option.h"
 #include "qemu/module.h"
 #include "qemu/option.h"
@@ -1174,3 +1175,16 @@ static void register_types(void)
 }
 
 type_init(register_types);
+
+static int chardev_is_socket(Object *child, void *opaque)
+{
+    if (CHARDEV_IS_SOCKET(child)) {
+        save_char_socket_fd((Chardev *) child);
+    }
+    return 0;
+}
+
+void save_chardev_fds(void)
+{
+    object_child_foreach(get_chardevs_root(), chardev_is_socket, NULL);
+}
diff --git a/include/chardev/char.h b/include/chardev/char.h
index 00589a6..80a9cf8 100644
--- a/include/chardev/char.h
+++ b/include/chardev/char.h
@@ -250,6 +250,8 @@ int qemu_chr_wait_connected(Chardev *chr, Error **errp);
     object_dynamic_cast(OBJECT(chr), TYPE_CHARDEV_RINGBUF)
 #define CHARDEV_IS_PTY(chr) \
     object_dynamic_cast(OBJECT(chr), TYPE_CHARDEV_PTY)
+#define CHARDEV_IS_SOCKET(chr) \
+    object_dynamic_cast(OBJECT(chr), TYPE_CHARDEV_SOCKET)
 
 typedef struct ChardevClass {
     ObjectClass parent_class;
@@ -290,4 +292,7 @@ GSource *qemu_chr_timeout_add_ms(Chardev *chr, guint ms,
 /* console.c */
 void qemu_chr_parse_vc(QemuOpts *opts, ChardevBackend *backend, Error **errp);
 
+void save_char_socket_fd(Chardev *);
+void load_char_socket_fd(Chardev *);
+
 #endif
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 4dfc4ca..fa1a5c3 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -27,6 +27,7 @@ void qemu_remove_machine_init_done_notifier(Notifier *notify);
 
 void save_cpr_snapshot(const char *file, const char *mode, Error **errp);
 void load_cpr_snapshot(const char *file, Error **errp);
+void save_chardev_fds(void);
 
 extern int autostart;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 2902006..81f38c4 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2700,6 +2700,12 @@ static QEMUFile *qf_file_open(const char *filename, int flags, int mode,
     return qemu_fopen_channel_input(ioc);
 }
 
+static int preserve_fd(const char *name, const char *val, void *handle)
+{
+    qemu_clr_cloexec(atoi(val));
+    return 0;
+}
+
 void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
 {
     int ret = 0;
@@ -2761,6 +2767,8 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
         if (qemu_preserve_ram(errp)) {
             return;
         }
+        save_chardev_fds();
+        walkenv(FD_PREFIX, preserve_fd, 0);
         qemu_system_exec_request();
         putenv((char *)"QEMU_START_FREEZE=");
     }
-- 
1.8.3.1




* [PATCH V1 24/32] ui: save/restore vnc socket fds
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (22 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 23/32] char: save/restore chardev socket fds Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-31  9:06   ` Daniel P. Berrangé
  2020-07-30 15:14 ` [PATCH V1 25/32] char: save/restore chardev pty fds Steve Sistare
                   ` (11 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

From: Mark Kanda <mark.kanda@oracle.com>

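Iterate through the VNC displays and save the socket fd of the first
connected client of each display before exec.  After exec, reattach the
saved fd to its display without repeating the protocol handshake.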

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/sysemu/sysemu.h |   2 +
 migration/savevm.c      |   3 +
 ui/vnc.c                | 153 +++++++++++++++++++++++++++++++++++++++---------
 3 files changed, 130 insertions(+), 28 deletions(-)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index fa1a5c3..3e7bfee 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -28,6 +28,8 @@ void qemu_remove_machine_init_done_notifier(Notifier *notify);
 void save_cpr_snapshot(const char *file, const char *mode, Error **errp);
 void load_cpr_snapshot(const char *file, Error **errp);
 void save_chardev_fds(void);
+void save_vnc_fds(void);
+void load_vnc_fds(void);
 
 extern int autostart;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 81f38c4..35fafb7 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2768,6 +2768,7 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
             return;
         }
         save_chardev_fds();
+        save_vnc_fds();
         walkenv(FD_PREFIX, preserve_fd, 0);
         qemu_system_exec_request();
         putenv((char *)"QEMU_START_FREEZE=");
@@ -3015,6 +3016,8 @@ void load_cpr_snapshot(const char *file, Error **errp)
             start_on_wake = 1;
         }
     }
+
+    load_vnc_fds();
 }
 
 int load_snapshot(const char *name, Error **errp)
diff --git a/ui/vnc.c b/ui/vnc.c
index f006aa1..947ddf5 100644
--- a/ui/vnc.c
+++ b/ui/vnc.c
@@ -50,6 +50,7 @@
 #include "qom/object_interfaces.h"
 #include "qemu/cutils.h"
 #include "io/dns-resolver.h"
+#include "sysemu/sysemu.h"
 
 #define VNC_REFRESH_INTERVAL_BASE GUI_REFRESH_INTERVAL_DEFAULT
 #define VNC_REFRESH_INTERVAL_INC  50
@@ -2214,28 +2215,34 @@ static void set_pixel_format(VncState *vs, int bits_per_pixel,
     graphic_hw_update(vs->vd->dcl.con);
 }
 
-static void pixel_format_message (VncState *vs) {
+/*
+ * reuse - true if we are using an existing (already initialized)
+ * connection to a vnc client
+ */
+static void pixel_format_message(VncState *vs, bool reuse)
+{
     char pad[3] = { 0, 0, 0 };
 
     vs->client_pf = qemu_default_pixelformat(32);
 
-    vnc_write_u8(vs, vs->client_pf.bits_per_pixel); /* bits-per-pixel */
-    vnc_write_u8(vs, vs->client_pf.depth); /* depth */
+    if (!reuse) {
+        vnc_write_u8(vs, vs->client_pf.bits_per_pixel); /* bits-per-pixel */
+        vnc_write_u8(vs, vs->client_pf.depth); /* depth */
 
 #ifdef HOST_WORDS_BIGENDIAN
-    vnc_write_u8(vs, 1);             /* big-endian-flag */
+        vnc_write_u8(vs, 1);             /* big-endian-flag */
 #else
-    vnc_write_u8(vs, 0);             /* big-endian-flag */
+        vnc_write_u8(vs, 0);             /* big-endian-flag */
 #endif
-    vnc_write_u8(vs, 1);             /* true-color-flag */
-    vnc_write_u16(vs, vs->client_pf.rmax);     /* red-max */
-    vnc_write_u16(vs, vs->client_pf.gmax);     /* green-max */
-    vnc_write_u16(vs, vs->client_pf.bmax);     /* blue-max */
-    vnc_write_u8(vs, vs->client_pf.rshift);    /* red-shift */
-    vnc_write_u8(vs, vs->client_pf.gshift);    /* green-shift */
-    vnc_write_u8(vs, vs->client_pf.bshift);    /* blue-shift */
-    vnc_write(vs, pad, 3);           /* padding */
-
+        vnc_write_u8(vs, 1);             /* true-color-flag */
+        vnc_write_u16(vs, vs->client_pf.rmax);     /* red-max */
+        vnc_write_u16(vs, vs->client_pf.gmax);     /* green-max */
+        vnc_write_u16(vs, vs->client_pf.bmax);     /* blue-max */
+        vnc_write_u8(vs, vs->client_pf.rshift);    /* red-shift */
+        vnc_write_u8(vs, vs->client_pf.gshift);    /* green-shift */
+        vnc_write_u8(vs, vs->client_pf.bshift);    /* blue-shift */
+        vnc_write(vs, pad, 3);           /* padding */
+    }
     vnc_hextile_set_pixel_conversion(vs, 0);
     vs->write_pixels = vnc_write_pixels_copy;
 }
@@ -2252,7 +2259,7 @@ static void vnc_colordepth(VncState *vs)
                                pixman_image_get_width(vs->vd->server),
                                pixman_image_get_height(vs->vd->server),
                                VNC_ENCODING_WMVi);
-        pixel_format_message(vs);
+        pixel_format_message(vs, false);
         vnc_unlock_output(vs);
         vnc_flush(vs);
     } else {
@@ -2420,7 +2427,8 @@ static int protocol_client_msg(VncState *vs, uint8_t *data, size_t len)
     return 0;
 }
 
-static int protocol_client_init(VncState *vs, uint8_t *data, size_t len)
+static int protocol_client_init_base(VncState *vs, uint8_t *data, size_t len,
+                                     bool reuse)
 {
     char buf[1024];
     VncShareMode mode;
@@ -2495,10 +2503,11 @@ static int protocol_client_init(VncState *vs, uint8_t *data, size_t len)
            pixman_image_get_height(vs->vd->server) >= 0);
     vs->client_width = pixman_image_get_width(vs->vd->server);
     vs->client_height = pixman_image_get_height(vs->vd->server);
-    vnc_write_u16(vs, vs->client_width);
-    vnc_write_u16(vs, vs->client_height);
-
-    pixel_format_message(vs);
+    if (!reuse) {
+        vnc_write_u16(vs, vs->client_width);
+        vnc_write_u16(vs, vs->client_height);
+    }
+    pixel_format_message(vs, reuse);
 
     if (qemu_name) {
         size = snprintf(buf, sizeof(buf), "QEMU (%s)", qemu_name);
@@ -2509,9 +2518,11 @@ static int protocol_client_init(VncState *vs, uint8_t *data, size_t len)
         size = snprintf(buf, sizeof(buf), "QEMU");
     }
 
-    vnc_write_u32(vs, size);
-    vnc_write(vs, buf, size);
-    vnc_flush(vs);
+    if (!reuse) {
+        vnc_write_u32(vs, size);
+        vnc_write(vs, buf, size);
+        vnc_flush(vs);
+    }
 
     vnc_client_cache_auth(vs);
     vnc_qmp_event(vs, QAPI_EVENT_VNC_INITIALIZED);
@@ -2521,6 +2532,11 @@ static int protocol_client_init(VncState *vs, uint8_t *data, size_t len)
     return 0;
 }
 
+static int protocol_client_init(VncState *vs, uint8_t *data, size_t len)
+{
+    return protocol_client_init_base(vs, data, len, false);
+}
+
 void start_client_init(VncState *vs)
 {
     vnc_read_when(vs, protocol_client_init, 1);
@@ -3012,8 +3028,12 @@ static void vnc_refresh(DisplayChangeListener *dcl)
     }
 }
 
+/*
+ * reuse - true if we are using an existing (already initialized)
+ * connection to a vnc client
+ */
 static void vnc_connect(VncDisplay *vd, QIOChannelSocket *sioc,
-                        bool skipauth, bool websocket)
+                        bool skipauth, bool websocket, bool reuse)
 {
     VncState *vs = g_new0(VncState, 1);
     bool first_client = QTAILQ_EMPTY(&vd->clients);
@@ -3109,10 +3129,15 @@ static void vnc_connect(VncDisplay *vd, QIOChannelSocket *sioc,
 
     graphic_hw_update(vd->dcl.con);
 
-    if (!vs->websocket) {
+    if ((!vs->websocket) && !reuse) {
         vnc_start_protocol(vs);
     }
 
+    if (reuse) {
+        uint8_t data[1] = {0};
+        (void) protocol_client_init_base(vs, data, sizeof(data), true);
+    }
+
     if (vd->num_connecting > vd->connections_limit) {
         QTAILQ_FOREACH(vs, &vd->clients, next) {
             if (vs->share_mode == VNC_SHARE_MODE_CONNECTING) {
@@ -3143,7 +3168,7 @@ static void vnc_listen_io(QIONetListener *listener,
     qio_channel_set_name(QIO_CHANNEL(cioc),
                          isWebsock ? "vnc-ws-server" : "vnc-server");
     qio_channel_set_delay(QIO_CHANNEL(cioc), false);
-    vnc_connect(vd, cioc, false, isWebsock);
+    vnc_connect(vd, cioc, false, isWebsock, false);
 }
 
 static const DisplayChangeListenerOps dcl_ops = {
@@ -3733,7 +3758,7 @@ static int vnc_display_connect(VncDisplay *vd,
     if (qio_channel_socket_connect_sync(sioc, saddr[0], errp) < 0) {
         return -1;
     }
-    vnc_connect(vd, sioc, false, false);
+    vnc_connect(vd, sioc, false, false, false);
     object_unref(OBJECT(sioc));
     return 0;
 }
@@ -4057,7 +4082,7 @@ void vnc_display_add_client(const char *id, int csock, bool skipauth)
     sioc = qio_channel_socket_new_fd(csock, NULL);
     if (sioc) {
         qio_channel_set_name(QIO_CHANNEL(sioc), "vnc-server");
-        vnc_connect(vd, sioc, skipauth, false);
+        vnc_connect(vd, sioc, skipauth, false, false);
         object_unref(OBJECT(sioc));
     }
 }
@@ -4117,3 +4142,75 @@ static void vnc_register_config(void)
     qemu_add_opts(&qemu_vnc_opts);
 }
 opts_init(vnc_register_config);
+
+void save_vnc_fds(void)
+{
+    VncDisplay *vd;
+    VncState *vs;
+    int disp_num = 0;
+    char name[40];
+
+    QTAILQ_FOREACH(vd, &vnc_displays, next) {
+        QTAILQ_FOREACH(vs, &vd->clients, next) {
+            if (vs->sioc) {
+                snprintf(name, sizeof(name), "%s_%d", vs->sioc->parent.name,
+                         disp_num);
+                setenv_fd(name, vs->sioc->fd);
+                break;
+            }
+        }
+        disp_num++;
+    }
+}
+
+static void set_vnc_fd(char *name, QIOChannelSocket *cioc, VncDisplay *vd,
+                       bool isWebsock)
+{
+    VncState *vs;
+    QIOChannelSocket *sioc;
+
+    int fd = getenv_fd(name);
+    if (fd != -1) {
+        sioc = qio_channel_socket_accept(cioc, fd, NULL);
+        if (sioc) {
+            unsetenv_fd(name);
+            qio_channel_set_name(QIO_CHANNEL(sioc),
+                                 isWebsock ? "vnc-ws-server" : "vnc-server");
+
+            qio_channel_set_delay(QIO_CHANNEL(sioc), false);
+            vnc_connect(vd, sioc, false, isWebsock, true);
+            object_unref(OBJECT(sioc));
+
+            /* force update on all clients */
+            QTAILQ_FOREACH(vs, &vd->clients, next) {
+                vs->update = VNC_STATE_UPDATE_FORCE;
+            }
+        } else {
+            error_printf("Could not restore vnc channel %s; "
+                     "client must reconnect.\n", name);
+        }
+    }
+}
+
+void load_vnc_fds(void)
+{
+    VncDisplay *vd;
+    QIOChannelSocket *cioc = NULL;
+    int disp_num = 0;
+    char name[40];
+
+    QTAILQ_FOREACH(vd, &vnc_displays, next) {
+        if (vd->listener) {
+            cioc = *vd->listener->sioc;
+            snprintf(name, sizeof(name), "vnc-server_%d", disp_num);
+            set_vnc_fd(name, cioc, vd, false);
+        }
+
+        if (vd->wslistener) {
+            cioc = *vd->wslistener->sioc;
+            snprintf(name, sizeof(name), "vnc-ws-server_%d", disp_num);
+            set_vnc_fd(name, cioc, vd, true);
+        }
+        disp_num++;
+    }
+}
-- 
1.8.3.1




* [PATCH V1 25/32] char: save/restore chardev pty fds
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (23 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 24/32] ui: save/restore vnc " Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 26/32] monitor: save/restore QMP negotiation status Steve Sistare
                   ` (10 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Save and restore pty descriptors across cprsave and cprload.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 chardev/char-pty.c     | 38 +++++++++++++++++++++++++++-----------
 chardev/char.c         |  2 ++
 include/chardev/char.h |  1 +
 3 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/chardev/char-pty.c b/chardev/char-pty.c
index 1cc501a..0785429 100644
--- a/chardev/char-pty.c
+++ b/chardev/char-pty.c
@@ -30,6 +30,7 @@
 #include "qemu/sockets.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
+#include "qemu/cutils.h"
 #include "qemu/qemu-print.h"
 
 #include "chardev/char-io.h"
@@ -183,6 +184,16 @@ static void pty_chr_state(Chardev *chr, int connected)
     }
 }
 
+void save_char_pty_fd(Chardev *chr)
+{
+    PtyChardev *s = PTY_CHARDEV(chr);
+    QIOChannelFile *fioc = QIO_CHANNEL_FILE(s->ioc);
+
+    if (fioc) {
+        setenv_fd(chr->label, fioc->fd);
+    }
+}
+
 static void char_pty_finalize(Object *obj)
 {
     Chardev *chr = CHARDEV(obj);
@@ -204,18 +215,23 @@ static void char_pty_open(Chardev *chr,
     char pty_name[PATH_MAX];
     char *name;
 
-    master_fd = qemu_openpty_raw(&slave_fd, pty_name);
-    if (master_fd < 0) {
-        error_setg_errno(errp, errno, "Failed to create PTY");
-        return;
-    }
-
-    close(slave_fd);
-    qemu_set_nonblock(master_fd);
+    master_fd = getenv_fd(chr->label);
+    if (master_fd >= 0) {
+        unsetenv_fd(chr->label);
+        chr->filename = g_strdup_printf("pty:unknown");
+    } else {
+        master_fd = qemu_openpty_raw(&slave_fd, pty_name);
+        if (master_fd < 0) {
+            error_setg_errno(errp, errno, "Failed to create PTY");
+            return;
+        }
 
-    chr->filename = g_strdup_printf("pty:%s", pty_name);
-    qemu_printf("char device redirected to %s (label %s)\n",
-                pty_name, chr->label);
+        close(slave_fd);
+        qemu_set_nonblock(master_fd);
+        chr->filename = g_strdup_printf("pty:%s", pty_name);
+        qemu_printf("char device redirected to %s (label %s)\n",
+                    pty_name, chr->label);
+    }
 
     s = PTY_CHARDEV(chr);
     s->ioc = QIO_CHANNEL(qio_channel_file_new_fd(master_fd));
diff --git a/chardev/char.c b/chardev/char.c
index 8fd54cc..da75a04 100644
--- a/chardev/char.c
+++ b/chardev/char.c
@@ -1180,6 +1180,8 @@ static int chardev_is_socket(Object *child, void *opaque)
 {
     if (CHARDEV_IS_SOCKET(child)) {
         save_char_socket_fd((Chardev *) child);
+    } else if (CHARDEV_IS_PTY(child)) {
+        save_char_pty_fd((Chardev *) child);
     }
     return 0;
 }
diff --git a/include/chardev/char.h b/include/chardev/char.h
index 80a9cf8..c18bda8 100644
--- a/include/chardev/char.h
+++ b/include/chardev/char.h
@@ -294,5 +294,6 @@ void qemu_chr_parse_vc(QemuOpts *opts, ChardevBackend *backend, Error **errp);
 
 void save_char_socket_fd(Chardev *);
 void load_char_socket_fd(Chardev *);
+void save_char_pty_fd(Chardev *);
 
 #endif
-- 
1.8.3.1




* [PATCH V1 26/32] monitor: save/restore QMP negotiation status
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (24 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 25/32] char: save/restore chardev pty fds Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 27/32] vhost: reset vhost devices upon cprsave Steve Sistare
                   ` (9 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

From: Mark Kanda <mark.kanda@oracle.com>

Save and restore QMP compatibility negotiation status across cprsave and
cprload.

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
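Note: a flag carried in the environment has three states (true, false,
unset).  The sketch below shows one way to read it, assuming the lookup
helper returns -1 when the variable is unset, as the check in this patch
suggests.  demo_getenv_bool() and demo_take_flag() are hypothetical names,
not the helpers from qemu/env.h, which are not shown in this message.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical tri-state lookup: 1 or 0 if name is set, -1 if it is not.
 * The result must be held in an int; storing -1 in a bool yields true.
 */
static int demo_getenv_bool(const char *name)
{
    const char *val = getenv(name);

    if (!val) {
        return -1;
    }
    return strcmp(val, "1") == 0;
}

/* Consume the flag; an unset flag means "not negotiated". */
static bool demo_take_flag(const char *name)
{
    int ret = demo_getenv_bool(name);

    if (ret == -1) {
        return false;
    }
    unsetenv(name);
    return ret == 1;
}
```

Keeping the intermediate result in an int is the point of the sketch: it
is what lets the "unset" case be told apart from "true".
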
 include/sysemu/sysemu.h |  1 +
 migration/savevm.c      |  1 +
 monitor/qmp.c           | 42 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 44 insertions(+)

diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 3e7bfee..c5b2f24 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -30,6 +30,7 @@ void load_cpr_snapshot(const char *file, Error **errp);
 void save_chardev_fds(void);
 void save_vnc_fds(void);
 void load_vnc_fds(void);
+void save_qmp_negotiation_status(void);
 
 extern int autostart;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 35fafb7..225eaa6 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2770,6 +2770,7 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
         save_chardev_fds();
         save_vnc_fds();
         walkenv(FD_PREFIX, preserve_fd, 0);
+        save_qmp_negotiation_status();
         qemu_system_exec_request();
         putenv((char *)"QEMU_START_FREEZE=");
     }
diff --git a/monitor/qmp.c b/monitor/qmp.c
index d433cea..9944ce5 100644
--- a/monitor/qmp.c
+++ b/monitor/qmp.c
@@ -33,6 +33,8 @@
 #include "qapi/qmp/qlist.h"
 #include "qapi/qmp/qstring.h"
 #include "trace.h"
+#include "qemu/env.h"
+#include "sysemu/sysemu.h"
 
 struct QMPRequest {
     /* Owner of the request */
@@ -398,6 +400,21 @@ static void monitor_qmp_setup_handlers_bh(void *opaque)
     monitor_list_append(&mon->common);
 }
 
+static void setenv_qmp(const char *name, bool val)
+{
+    setenv_bool(name, val);
+}
+
+static bool getenv_qmp(const char *name)
+{
+    int ret = getenv_bool(name);
+    if (ret != -1) {
+        unsetenv_bool(name);
+        return ret;
+    }
+    return false;
+}
+
 void monitor_init_qmp(Chardev *chr, bool pretty, Error **errp)
 {
     MonitorQMP *mon = g_new0(MonitorQMP, 1);
@@ -438,4 +455,29 @@ void monitor_init_qmp(Chardev *chr, bool pretty, Error **errp)
                                  NULL, &mon->common, NULL, true);
         monitor_list_append(&mon->common);
     }
+
+    /*
+     * If a chr->label qmp env var is true, this is a restored qmp
+     * connection with capabilities negotiated.
+     */
+    if (getenv_qmp(chr->label) == true) {
+        mon->commands = &qmp_commands;
+    }
+}
+
+void save_qmp_negotiation_status(void)
+{
+    Monitor *mon;
+    MonitorQMP *qmp_mon;
+
+    QTAILQ_FOREACH(mon, &mon_list, entry) {
+        if (!monitor_is_qmp(mon)) {
+            continue;
+        }
+
+        qmp_mon = container_of(mon, MonitorQMP, common);
+        if (qmp_mon->commands == &qmp_commands) {
+            setenv_qmp(mon->chr.chr->label, true);
+        }
+    }
 }
-- 
1.8.3.1




* [PATCH V1 27/32] vhost: reset vhost devices upon cprsave
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (25 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 26/32] monitor: save/restore QMP negotiation status Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 28/32] char: restore terminal on restart Steve Sistare
                   ` (8 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

From: Mark Kanda <mark.kanda@oracle.com>

A vhost device is implicitly preserved across re-exec because its fd is not
closed, and the value of the fd is specified on the command line for the
new qemu to find.  However, new qemu issues a VHOST_SET_OWNER ioctl,
which fails because the device already has an owner.  To fix, reset the
owner prior to exec.

Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/virtio/vhost.c       | 12 ++++++++++++
 include/sysemu/sysemu.h |  1 +
 migration/savevm.c      |  1 +
 3 files changed, 14 insertions(+)

diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 1a1384e..d065b53 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -29,6 +29,7 @@
 #include "sysemu/dma.h"
 #include "sysemu/tcg.h"
 #include "trace.h"
+#include "sysemu/sysemu.h"
 
 /* enabled until disconnected backend stabilizes */
 #define _VHOST_DEBUG 1
@@ -1773,3 +1774,14 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
 
     return -1;
 }
+
+void reset_vhost_devices(void)
+{
+    struct vhost_dev *dev;
+
+    QLIST_FOREACH(dev, &vhost_devices, entry) {
+        if (dev->vhost_ops->vhost_reset_device(dev) < 0) {
+            VHOST_OPS_DEBUG("vhost_reset_device failed");
+        }
+    }
+}
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index c5b2f24..e19c15b 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -30,6 +30,7 @@ void load_cpr_snapshot(const char *file, Error **errp);
 void save_chardev_fds(void);
 void save_vnc_fds(void);
 void load_vnc_fds(void);
+void reset_vhost_devices(void);
 void save_qmp_negotiation_status(void);
 
 extern int autostart;
diff --git a/migration/savevm.c b/migration/savevm.c
index 225eaa6..732dfb5 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2770,6 +2770,7 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
         save_chardev_fds();
         save_vnc_fds();
         walkenv(FD_PREFIX, preserve_fd, 0);
+        reset_vhost_devices();
         save_qmp_negotiation_status();
         qemu_system_exec_request();
         putenv((char *)"QEMU_START_FREEZE=");
-- 
1.8.3.1




* [PATCH V1 28/32] char: restore terminal on restart
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (26 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 27/32] vhost: reset vhost devices upon cprsave Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 29/32] pci: export pci_update_mappings Steve Sistare
                   ` (7 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

If stdin is a char backend device, then restore the original stdin terminal
settings before re-exec'ing.  Otherwise, the new qemu sees the modified
settings as initial settings, and does not restore the true initial settings
when it exits.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
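For illustration only, a sketch of the save/modify/restore cycle that the
commit message relies on, using plain termios calls.  The sketch_* names
are hypothetical and do not exist in qemu; the real code path is
term_exit() in chardev/char-stdio.c.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <termios.h>
#include <unistd.h>

/*
 * Hypothetical helper: remember the terminal's current settings in *saved,
 * then disable echo and line buffering.  Returns 0 on success, or -1 if fd
 * is not a terminal (for example a pipe or a regular file).
 */
int sketch_term_enter_raw(int fd, struct termios *saved)
{
    struct termios raw;

    assert(saved != NULL);
    if (tcgetattr(fd, saved) < 0) {
        return -1;
    }
    raw = *saved;
    raw.c_lflag &= ~(tcflag_t)(ECHO | ICANON);
    return tcsetattr(fd, TCSANOW, &raw);
}

/*
 * Hypothetical helper: put back the settings captured above.  Calling this
 * before exec means the next image captures the true initial settings.
 */
int sketch_term_restore(int fd, const struct termios *saved)
{
    assert(saved != NULL);
    return tcsetattr(fd, TCSANOW, saved);
}
```

If the restore step is skipped before exec, the next call to
sketch_term_enter_raw() in the new image would record the already modified
settings as "original", which is the problem this patch avoids.
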
 chardev/char-stdio.c   | 7 +++++++
 include/chardev/char.h | 2 ++
 migration/savevm.c     | 2 ++
 3 files changed, 11 insertions(+)

diff --git a/chardev/char-stdio.c b/chardev/char-stdio.c
index 82eaebc..6481d08 100644
--- a/chardev/char-stdio.c
+++ b/chardev/char-stdio.c
@@ -119,6 +119,13 @@ static void qemu_chr_open_stdio(Chardev *chr,
 }
 #endif
 
+void qemu_term_exit(void)
+{
+#ifndef _WIN32
+    term_exit();
+#endif
+}
+
 static void qemu_chr_parse_stdio(QemuOpts *opts, ChardevBackend *backend,
                                  Error **errp)
 {
diff --git a/include/chardev/char.h b/include/chardev/char.h
index c18bda8..5fd3ecc 100644
--- a/include/chardev/char.h
+++ b/include/chardev/char.h
@@ -296,4 +296,6 @@ void save_char_socket_fd(Chardev *);
 void load_char_socket_fd(Chardev *);
 void save_char_pty_fd(Chardev *);
 
+void qemu_term_exit(void);
+
 #endif
diff --git a/migration/savevm.c b/migration/savevm.c
index 732dfb5..881dc13 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -32,6 +32,7 @@
 #include "migration.h"
 #include "migration/snapshot.h"
 #include "migration/vmstate.h"
+#include "chardev/char.h"
 #include "migration/misc.h"
 #include "migration/register.h"
 #include "migration/global_state.h"
@@ -2772,6 +2773,7 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
         walkenv(FD_PREFIX, preserve_fd, 0);
         reset_vhost_devices();
         save_qmp_negotiation_status();
+        qemu_term_exit();
         qemu_system_exec_request();
         putenv((char *)"QEMU_START_FREEZE=");
     }
-- 
1.8.3.1




* [PATCH V1 29/32] pci: export pci_update_mappings
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (27 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 28/32] char: restore terminal on restart Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 30/32] vfio-pci: save and restore Steve Sistare
                   ` (6 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Allow pci_update_mappings to be called from other modules.
No change in functionality.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/pci/pci.c         | 3 +--
 include/hw/pci/pci.h | 1 +
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index de0fae1..7343e00 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -216,7 +216,6 @@ static const TypeInfo pcie_bus_info = {
 };
 
 static PCIBus *pci_find_bus_nr(PCIBus *bus, int bus_num);
-static void pci_update_mappings(PCIDevice *d);
 static void pci_irq_handler(void *opaque, int irq_num, int level);
 static void pci_add_option_rom(PCIDevice *pdev, bool is_default_rom, Error **);
 static void pci_del_option_rom(PCIDevice *pdev);
@@ -1316,7 +1315,7 @@ static pcibus_t pci_bar_address(PCIDevice *d,
     return new_addr;
 }
 
-static void pci_update_mappings(PCIDevice *d)
+void pci_update_mappings(PCIDevice *d)
 {
     PCIIORegion *r;
     int i;
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index c1bf7d5..bd07c86 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -865,5 +865,6 @@ extern const VMStateDescription vmstate_pci_device;
 }
 
 MSIMessage pci_get_msi_message(PCIDevice *dev, int vector);
+void pci_update_mappings(PCIDevice *d);
 
 #endif
-- 
1.8.3.1




* [PATCH V1 30/32] vfio-pci: save and restore
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (28 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 29/32] pci: export pci_update_mappings Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-08-06 10:22   ` Jason Zeng
  2020-07-30 15:14 ` [PATCH V1 31/32] vfio-pci: trace pci config Steve Sistare
                   ` (5 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Enable vfio-pci devices to be saved and restored across an exec restart
of qemu.

At vfio creation time, save the value of vfio container, group, and device
descriptors in the environment.

In cprsave, save the msi message area as part of vfio-pci vmstate, and
clear the close-on-exec flag for the vfio descriptors.  The flag is not
cleared earlier because the descriptors should not persist across misc
fork and exec calls that may be performed during normal operation.

On qemu restart, vfio_realize() finds the descriptor env vars, uses
the descriptors, and notes that the device is being reused.  Device and
iommu state is already configured, so operations in vfio_realize that
would modify the configuration are skipped for a reused device, including
vfio ioctls and writes to PCI configuration space.  The result is that
vfio_realize constructs qemu data structures that reflect the current
state of the device.  However, the reconstruction is not complete until
cprload is called, and vfio_pci_post_load uses the msi data to rebuild
interrupt structures and attach the interrupts to the new KVM instance.
Lastly, vfio device reset is suppressed when the VM is started.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
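A rough sketch of the descriptor hand-off pattern described above: record
the fd number in the environment when it is first opened, clear
close-on-exec only at save time, and look the fd up again after exec.  The
sketch_* helpers are hypothetical stand-ins written for this note; they are
not the getenv_fd/setenv_fd helpers used by the series, whose
implementation is not shown in this patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Record fd under 'name' in the environment.  Returns 0 on success. */
int sketch_setenv_fd(const char *name, int fd)
{
    char val[16];

    assert(name != NULL && fd >= 0);
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(name, val, 1);
}

/* Return the fd recorded under 'name', or -1 if none is recorded. */
int sketch_getenv_fd(const char *name)
{
    const char *val = getenv(name);
    char *end;
    long fd;

    if (!val || !*val) {
        return -1;
    }
    fd = strtol(val, &end, 10);
    return (*end == '\0' && fd >= 0 && fd <= INT_MAX) ? (int)fd : -1;
}

/* Clear close-on-exec so that fd stays open in the exec'd image. */
int sketch_preserve_fd(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/*
 * Reuse a recorded fd if there is one, else open 'path' with close-on-exec
 * set and record it.  *reused tells the caller whether to skip device setup.
 */
int sketch_open_or_reuse(const char *name, const char *path, int *reused)
{
    int fd = sketch_getenv_fd(name);

    assert(reused != NULL);
    *reused = (fd >= 0);
    if (fd < 0) {
        fd = open(path, O_RDWR | O_CLOEXEC);
        if (fd >= 0) {
            sketch_setenv_fd(name, fd);
        }
    }
    return fd;
}
```

Keeping close-on-exec set until save time matches the reasoning in the
commit message: the descriptor should not leak into unrelated fork and exec
calls made during normal operation.
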
 hw/pci/pci.c                  |  4 ++
 hw/vfio/common.c              | 99 ++++++++++++++++++++++++++++++++++---------
 hw/vfio/pci.c                 | 79 ++++++++++++++++++++++++++++++++--
 hw/vfio/platform.c            |  2 +-
 include/hw/pci/pci.h          |  1 +
 include/hw/vfio/vfio-common.h |  4 +-
 migration/savevm.c            |  2 +-
 7 files changed, 163 insertions(+), 28 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 7343e00..c2e1509 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -291,6 +291,10 @@ static void pci_do_device_reset(PCIDevice *dev)
 {
     int r;
 
+    if (dev->reused) {
+        return;
+    }
+
     pci_device_deassert_intx(dev);
     assert(dev->irq_state == 0);
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 3335714..a51a093 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -37,6 +37,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "qemu/cutils.h"
 
 VFIOGroupList vfio_group_list =
     QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -299,6 +300,10 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
+    if (container->reused) {
+        return 0;
+    }
+
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
         /*
          * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -336,6 +341,10 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         .size = size,
     };
 
+    if (container->reused) {
+        return 0;
+    }
+
     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
     }
@@ -1179,25 +1188,27 @@ static int vfio_init_container(VFIOContainer *container, int group_fd,
         return iommu_type;
     }
 
-    ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
-    if (ret) {
-        error_setg_errno(errp, errno, "Failed to set group container");
-        return -errno;
-    }
+    if (!container->reused) {
+        ret = ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container->fd);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to set group container");
+            return -errno;
+        }
 
-    while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
-        if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
-            /*
-             * On sPAPR, despite the IOMMU subdriver always advertises v1 and
-             * v2, the running platform may not support v2 and there is no
-             * way to guess it until an IOMMU group gets added to the container.
-             * So in case it fails with v2, try v1 as a fallback.
-             */
-            iommu_type = VFIO_SPAPR_TCE_IOMMU;
-            continue;
+        while (ioctl(container->fd, VFIO_SET_IOMMU, iommu_type)) {
+            if (iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
+                /*
+                 * On sPAPR, despite the IOMMU subdriver always advertises v1
+                 * and v2, the running platform may not support v2 and there is
+                 * no way to guess it until an IOMMU group gets added to the
+                 * container. So in case it fails with v2, try v1 as a fallback.
+                 */
+                iommu_type = VFIO_SPAPR_TCE_IOMMU;
+                continue;
+            }
+            error_setg_errno(errp, errno, "Failed to set iommu for container");
+            return -errno;
         }
-        error_setg_errno(errp, errno, "Failed to set iommu for container");
-        return -errno;
     }
 
     container->iommu_type = iommu_type;
@@ -1210,6 +1221,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     VFIOContainer *container;
     int ret, fd;
     VFIOAddressSpace *space;
+    char name[40];
+    bool reused;
 
     space = vfio_get_address_space(as);
 
@@ -1254,7 +1267,13 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
         }
     }
 
-    fd = qemu_open("/dev/vfio/vfio", O_RDWR);
+    snprintf(name, sizeof(name), "vfio_container_%d", group->groupid);
+    fd = getenv_fd(name);
+    reused = (fd >= 0);
+    if (fd < 0) {
+        fd = qemu_open("/dev/vfio/vfio", O_RDWR);
+    }
+
     if (fd < 0) {
         error_setg_errno(errp, errno, "failed to open /dev/vfio/vfio");
         ret = -errno;
@@ -1272,6 +1291,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
     container = g_malloc0(sizeof(*container));
     container->space = space;
     container->fd = fd;
+    container->cid = group->groupid;
+    container->reused = reused;
     container->error = NULL;
     QLIST_INIT(&container->giommu_list);
     QLIST_INIT(&container->hostwin_list);
@@ -1395,6 +1416,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
 
     container->initialized = true;
 
+    if (!reused) {
+        setenv_fd(name, fd);
+    }
+
     return 0;
 listener_release_exit:
     QLIST_REMOVE(group, container_next);
@@ -1418,6 +1443,7 @@ put_space_exit:
 static void vfio_disconnect_container(VFIOGroup *group)
 {
     VFIOContainer *container = group->container;
+    char name[40];
 
     QLIST_REMOVE(group, container_next);
     group->container = NULL;
@@ -1450,6 +1476,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
         }
 
         trace_vfio_disconnect_container(container->fd);
+        snprintf(name, sizeof(name), "vfio_container_%d", container->cid);
+        unsetenv_fd(name);
         close(container->fd);
         g_free(container);
 
@@ -1462,6 +1490,7 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
     VFIOGroup *group;
     char path[32];
     struct vfio_group_status status = { .argsz = sizeof(status) };
+    bool reused;
 
     QLIST_FOREACH(group, &vfio_group_list, next) {
         if (group->groupid == groupid) {
@@ -1479,7 +1508,13 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
     group = g_malloc0(sizeof(*group));
 
     snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
-    group->fd = qemu_open(path, O_RDWR);
+
+    group->fd = getenv_fd(path);
+    reused = (group->fd >= 0);
+    if (group->fd < 0) {
+        group->fd = qemu_open(path, O_RDWR);
+    }
+
     if (group->fd < 0) {
         error_setg_errno(errp, errno, "failed to open %s", path);
         goto free_group_exit;
@@ -1513,6 +1548,10 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
 
     QLIST_INSERT_HEAD(&vfio_group_list, group, next);
 
+    if (!reused) {
+        setenv_fd(path, group->fd);
+    }
+
     return group;
 
 close_fd_exit:
@@ -1526,6 +1565,8 @@ free_group_exit:
 
 void vfio_put_group(VFIOGroup *group)
 {
+    char path[32];
+
     if (!group || !QLIST_EMPTY(&group->device_list)) {
         return;
     }
@@ -1537,6 +1578,8 @@ void vfio_put_group(VFIOGroup *group)
     vfio_disconnect_container(group);
     QLIST_REMOVE(group, next);
     trace_vfio_put_group(group->fd);
+    snprintf(path, sizeof(path), "/dev/vfio/%d", group->groupid);
+    unsetenv_fd(path);
     close(group->fd);
     g_free(group);
 
@@ -1546,12 +1589,18 @@ void vfio_put_group(VFIOGroup *group)
 }
 
 int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp)
+                    VFIODevice *vbasedev, bool *reusedp, Error **errp)
 {
     struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
     int ret, fd;
+    bool reused;
+
+    fd = getenv_fd(name);
+    reused = (fd >= 0);
+    if (fd < 0) {
+        fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    }
 
-    fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
     if (fd < 0) {
         error_setg_errno(errp, errno, "error getting device from group %d",
                          group->groupid);
@@ -1601,6 +1650,13 @@ int vfio_get_device(VFIOGroup *group, const char *name,
                           dev_info.num_irqs);
 
     vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
+
+    if (!reused) {
+        setenv_fd(name, fd);
+    }
+    if (reusedp) {
+        *reusedp = reused;
+    }
     return 0;
 }
 
@@ -1612,6 +1668,7 @@ void vfio_put_base_device(VFIODevice *vbasedev)
     QLIST_REMOVE(vbasedev, next);
     vbasedev->group = NULL;
     trace_vfio_put_base_device(vbasedev->fd);
+    unsetenv_fd(vbasedev->name);
     close(vbasedev->fd);
 }
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2e561c0..5743807 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -49,6 +49,7 @@
 
 static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
 static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
+static const VMStateDescription vfio_pci_vmstate;
 
 /*
  * Disabling BAR mmaping can be slow, but toggling it around INTx can
@@ -1585,6 +1586,14 @@ static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled)
     }
 }
 
+static void vfio_config_sync(VFIOPCIDevice *vdev, uint32_t offset, size_t len)
+{
+    if (pread(vdev->vbasedev.fd, vdev->pdev.config + offset, len,
+          vdev->config_offset + offset) != len) {
+        error_report("vfio_config_sync pread failed");
+    }
+}
+
 static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
 {
     VFIOBAR *bar = &vdev->bars[nr];
@@ -1626,6 +1635,7 @@ static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
 {
     VFIOBAR *bar = &vdev->bars[nr];
     char *name;
+    PCIDevice *pdev = &vdev->pdev;
 
     if (!bar->size) {
         return;
@@ -1646,6 +1656,9 @@ static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
     }
 
     pci_register_bar(&vdev->pdev, nr, bar->type, bar->mr);
+    if (pdev->reused) {
+        vfio_config_sync(vdev, pci_bar(pdev, nr), 8);
+    }
 }
 
 static void vfio_bars_register(VFIOPCIDevice *vdev)
@@ -2805,7 +2818,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
-    ret = vfio_get_device(group, vdev->vbasedev.name, &vdev->vbasedev, errp);
+    ret = vfio_get_device(group, vdev->vbasedev.name, &vdev->vbasedev,
+                          &pdev->reused, errp);
     if (ret) {
         vfio_put_group(group);
         goto error;
@@ -2972,9 +2986,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
                                              vfio_intx_routing_notifier);
         vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
         kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
-        ret = vfio_intx_enable(vdev, errp);
-        if (ret) {
-            goto out_deregister;
+        if (!pdev->reused) {
+            ret = vfio_intx_enable(vdev, errp);
+            if (ret) {
+                goto out_deregister;
+            }
         }
     }
 
@@ -3017,6 +3033,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
 
+    vfio_config_sync(vdev, pdev->msix_cap + PCI_MSIX_FLAGS, 2);
+    if (pdev->reused) {
+        pci_update_mappings(pdev);
+    }
+
     return;
 
 out_deregister:
@@ -3080,6 +3101,10 @@ static void vfio_pci_reset(DeviceState *dev)
 {
     VFIOPCIDevice *vdev = PCI_VFIO(dev);
 
+    if (vdev->pdev.reused) {
+        return;
+    }
+
     trace_vfio_pci_reset(vdev->vbasedev.name);
 
     vfio_pci_pre_reset(vdev);
@@ -3182,6 +3207,51 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static int vfio_pci_post_load(void *opaque, int version_id)
+{
+    int vector;
+    MSIMessage msg;
+    Error *err = 0;
+    VFIOPCIDevice *vdev = opaque;
+    PCIDevice *pdev = &vdev->pdev;
+
+    if (msix_enabled(pdev)) {
+        vfio_msix_enable(vdev);
+        pdev->msix_function_masked = false;
+
+        for (vector = 0; vector < vdev->pdev.msix_entries_nr; vector++) {
+            if (!msix_is_masked(pdev, vector)) {
+                msg = msix_get_message(pdev, vector);
+                vfio_msix_vector_use(pdev, vector, msg);
+            }
+        }
+
+    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
+        vfio_intx_enable(vdev, &err);
+        if (err) {
+            error_report_err(err);
+        }
+    }
+
+    vdev->vbasedev.group->container->reused = false;
+    vdev->pdev.reused = false;
+
+    return 0;
+}
+
+static const VMStateDescription vfio_pci_vmstate = {
+    .name = "vfio-pci",
+    .unmigratable = 1,
+    .mode_mask = VMS_RESTART,
+    .version_id = 0,
+    .minimum_version_id = 0,
+    .post_load = vfio_pci_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_MSIX(pdev, VFIOPCIDevice),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3189,6 +3259,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 
     dc->reset = vfio_pci_reset;
     device_class_set_props(dc, vfio_pci_dev_properties);
+    dc->vmsd = &vfio_pci_vmstate;
     dc->desc = "VFIO-based PCI device assignment";
     set_bit(DEVICE_CATEGORY_MISC, dc->categories);
     pdc->realize = vfio_realize;
diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index ac2cefc..e6e1a5d 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -592,7 +592,7 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
             return -EBUSY;
         }
     }
-    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
+    ret = vfio_get_device(group, vbasedev->name, vbasedev, 0, errp);
     if (ret) {
         vfio_put_group(group);
         return ret;
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index bd07c86..c926a24 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -358,6 +358,7 @@ struct PCIDevice {
 
     /* ID of standby device in net_failover pair */
     char *failover_pair_id;
+    bool reused;
 };
 
 void pci_register_bar(PCIDevice *pci_dev, int region_num,
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index c78f3ff..4e2a332 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -73,6 +73,8 @@ typedef struct VFIOContainer {
     unsigned iommu_type;
     Error *error;
     bool initialized;
+    bool reused;
+    int cid;
     unsigned long pgsizes;
     QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
     QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
@@ -177,7 +179,7 @@ void vfio_reset_handler(void *opaque);
 VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
 void vfio_put_group(VFIOGroup *group);
 int vfio_get_device(VFIOGroup *group, const char *name,
-                    VFIODevice *vbasedev, Error **errp);
+                    VFIODevice *vbasedev, bool *reused, Error **errp);
 
 extern const MemoryRegionOps vfio_region_ops;
 typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
diff --git a/migration/savevm.c b/migration/savevm.c
index 881dc13..2606cf0 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1568,7 +1568,7 @@ static int qemu_savevm_state(QEMUFile *f, VMStateMode mode, Error **errp)
         return -EINVAL;
     }
 
-    if (migrate_use_block()) {
+    if ((mode & (VMS_SNAPSHOT | VMS_MIGRATE)) && migrate_use_block()) {
         error_setg(errp, "Block migration and snapshots are incompatible");
         return -EINVAL;
     }
-- 
1.8.3.1




* [PATCH V1 31/32] vfio-pci: trace pci config
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (29 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 30/32] vfio-pci: save and restore Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-07-30 15:14 ` [PATCH V1 32/32] vfio-pci: improved tracing Steve Sistare
                   ` (4 subsequent siblings)
  35 siblings, 0 replies; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Add new trace points trace_vfio_pci_config and trace_vfio_msix_table to dump
PCI config space and MSI data.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
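As an aside, the row format used for the config dump can be sketched as a
standalone helper.  sketch_format_config_row is a hypothetical name and is
not part of this patch; it writes each row from the start of the caller's
buffer and bounds every write by the space that remains.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_WORDS_PER_ROW 8

/*
 * Hypothetical helper: format one row of config words as
 * "+<offset>: <word> <word> ...", padding zero words with blanks so that
 * non-zero words stand out.  Output is truncated to fit buflen and always
 * NUL-terminated.  Returns the length of the string stored in buf.
 */
size_t sketch_format_config_row(char *buf, size_t buflen, int offset,
                                const unsigned int *v)
{
    size_t used, room;
    int j, n;

    assert(buf != NULL && buflen > 0 && v != NULL);

    n = snprintf(buf, buflen, "+%3d:", offset);
    if (n < 0) {
        buf[0] = '\0';
        return 0;
    }
    used = (size_t)n < buflen ? (size_t)n : buflen - 1;

    for (j = 0; j < SKETCH_WORDS_PER_ROW && used < buflen - 1; j++) {
        room = buflen - used;
        n = snprintf(buf + used, room, v[j] ? " %08x" : " %8x", v[j]);
        if (n < 0) {
            break;
        }
        /* snprintf returns the untruncated length; clamp to what fit. */
        used += (size_t)n < room ? (size_t)n : room - 1;
    }
    return used;
}
```

A caller would invoke this once per group of 8 words and pass the
resulting string to the trace point.
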
 hw/vfio/pci.c        | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/trace-events |  2 ++
 2 files changed, 101 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5743807..f72e277 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2715,6 +2715,90 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     vdev->req_enabled = false;
 }
 
+/* To limit output, trace only this many bytes of config. */
+#define CONFIG_LEN 512
+
+static void vfio_dump_config(const char *name, int fd, off_t offset)
+{
+    int i, j, n, config[CONFIG_LEN / 4];
+    char buf[128];
+    const char *fmt;
+    char *ptr = buf;
+    int *v = config;
+    int len = sizeof(buf) - 1;
+
+#ifdef CONFIG_TRACE_DTRACE
+    if (!QEMU_VFIO_PCI_CONFIG_ENABLED()) {
+        return;
+    }
+#endif
+
+    if (pread(fd, &config, sizeof(config), offset) < 0) {
+        perror("pread");
+        return;
+    }
+
+    trace_vfio_pci_config(name);
+
+    for (i = 0; i < CONFIG_LEN; i += 32, v += 8) {
+        n = snprintf(buf, sizeof(buf), "+%3d:", i);
+        ptr = buf + n;
+        len = sizeof(buf) - n;
+        for (j = 0; j < 8; j++) {
+            fmt = v[j] ?  " %08x" : " %8x";
+            n = snprintf(ptr, len, fmt, v[j]);
+            ptr += n;
+            len -= n;
+        }
+        *ptr = 0;   /* terminate in case of truncation above */
+        trace_vfio_pci_config(buf);
+    }
+}
+
+static void vfio_dump_config_vdev(VFIOPCIDevice *vdev)
+{
+    vfio_dump_config(vdev->vbasedev.name, vdev->vbasedev.fd,
+                     vdev->config_offset);
+}
+
+static void vfio_dump_msix_vdev(VFIOPCIDevice *vdev)
+{
+    int i;
+    int *ptr = (int *) vdev->pdev.msix_table;
+
+    for (i = 0; i < vdev->pdev.msix_entries_nr; i++, ptr += 4) {
+        trace_vfio_msix_table(vdev->vbasedev.name, i,
+                              ptr[0], ptr[1], ptr[2], ptr[3]);
+    }
+}
+
+static void vfio_diff_config(VFIOPCIDevice *vdev)
+{
+    int i;
+    unsigned char config[CONFIG_LEN];
+    int n = sizeof(config);
+    unsigned char *c1 = (unsigned char *)config;
+    unsigned char *c2 = (unsigned char *)vdev->pdev.config;
+    char buf[128];
+
+#ifdef CONFIG_TRACE_DTRACE
+    if (!QEMU_VFIO_PCI_CONFIG_ENABLED()) {
+        return;
+    }
+#endif
+
+    if (pread(vdev->vbasedev.fd, &config, n, vdev->config_offset) != n) {
+        error_report("vfio_diff_config pread failed");
+    }
+    for (i = 0; i < CONFIG_LEN; i++) {
+        if (c1[i] != c2[i]) {
+            snprintf(buf, sizeof(buf),
+                     "config mismatch at %d: %x vs %x", i, c1[i], c2[i]);
+            trace_vfio_pci_config(buf);
+        }
+    }
+}
+
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = PCI_VFIO(pdev);
@@ -3037,6 +3121,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
     if (pdev->reused) {
         pci_update_mappings(pdev);
     }
+    vfio_diff_config(vdev);
+    vfio_dump_config_vdev(vdev);
+    vfio_dump_msix_vdev(vdev);
 
     return;
 
@@ -3207,6 +3294,15 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static int vfio_pci_pre_save(void *opaque)
+{
+    VFIOPCIDevice *vdev = opaque;
+
+    vfio_dump_config_vdev(vdev);
+    vfio_dump_msix_vdev(vdev);
+    return 0;
+}
+
 static int vfio_pci_post_load(void *opaque, int version_id)
 {
     int vector;
@@ -3226,6 +3322,8 @@ static int vfio_pci_post_load(void *opaque, int version_id)
             }
         }
 
+        vfio_dump_msix_vdev(vdev);
+
     } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
         vfio_intx_enable(vdev, &err);
         if (err) {
@@ -3246,6 +3344,7 @@ static const VMStateDescription vfio_pci_vmstate = {
     .version_id = 0,
     .minimum_version_id = 0,
     .post_load = vfio_pci_post_load,
+    .pre_save = vfio_pci_pre_save,
     .fields = (VMStateField[]) {
         VMSTATE_MSIX(pdev, VFIOPCIDevice),
         VMSTATE_END_OF_LIST()
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index b1ef55a..10d899c 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -47,6 +47,8 @@ vfio_pci_emulated_vendor_id(const char *name, uint16_t val) "%s 0x%04x"
 vfio_pci_emulated_device_id(const char *name, uint16_t val) "%s 0x%04x"
 vfio_pci_emulated_sub_vendor_id(const char *name, uint16_t val) "%s 0x%04x"
 vfio_pci_emulated_sub_device_id(const char *name, uint16_t val) "%s 0x%04x"
+vfio_msix_table(const char *name, int index, int x0, int x1, int x2, int x3) "%s MSI-X[%d] = { %x %x %x %x }"
+vfio_pci_config(const char *buf) "%s"
 
 # pci-quirks.c
 vfio_quirk_rom_blacklisted(const char *name, uint16_t vid, uint16_t did) "%s %04x:%04x"
-- 
1.8.3.1




* [PATCH V1 32/32] vfio-pci: improved tracing
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (30 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 31/32] vfio-pci: trace pci config Steve Sistare
@ 2020-07-30 15:14 ` Steve Sistare
  2020-09-15 18:49   ` Dr. David Alan Gilbert
  2020-07-30 16:52 ` [PATCH V1 00/32] Live Update Daniel P. Berrangé
                   ` (3 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Steve Sistare @ 2020-07-30 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Steve Sistare, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Print more info for existing trace points:
  trace_kvm_irqchip_add_msi_route
  trace_pci_update_mappings_del
  trace_pci_update_mappings_add

Add new trace points:
  trace_kvm_irqchip_assign_irqfd
  trace_msix_table_mmio_write
  trace_vfio_dma_unmap
  trace_vfio_dma_map
  trace_vfio_region
  trace_vfio_descriptors
  trace_ram_block_add

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 accel/kvm/kvm-all.c    |  8 ++++++--
 accel/kvm/trace-events |  3 ++-
 exec.c                 |  3 +++
 hw/pci/msix.c          |  1 +
 hw/pci/pci.c           | 10 ++++++----
 hw/pci/trace-events    |  5 +++--
 hw/vfio/common.c       | 16 +++++++++++++++-
 hw/vfio/pci.c          |  1 +
 hw/vfio/trace-events   |  9 ++++++---
 trace-events           |  2 ++
 10 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 63ef6af..5511ea7 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -46,6 +46,7 @@
 #include "sysemu/reset.h"
 
 #include "hw/boards.h"
+#include "trace-root.h"
 
 /* This check must be after config-host.h is included */
 #ifdef CONFIG_EVENTFD
@@ -1670,7 +1671,7 @@ int kvm_irqchip_add_msi_route(KVMState *s, int vector, PCIDevice *dev)
     }
 
     trace_kvm_irqchip_add_msi_route(dev ? dev->name : (char *)"N/A",
-                                    vector, virq);
+                                    vector, virq, msg.address, msg.data);
 
     kvm_add_routing_entry(s, &kroute);
     kvm_arch_add_msi_route_post(&kroute, vector, dev);
@@ -1717,6 +1718,7 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
 {
     int fd = event_notifier_get_fd(event);
     int rfd = resample ? event_notifier_get_fd(resample) : -1;
+    int ret;
 
     struct kvm_irqfd irqfd = {
         .fd = fd,
@@ -1758,7 +1760,9 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
         return -ENOSYS;
     }
 
-    return kvm_vm_ioctl(s, KVM_IRQFD, &irqfd);
+    ret = kvm_vm_ioctl(s, KVM_IRQFD, &irqfd);
+    trace_kvm_irqchip_assign_irqfd(fd, virq, rfd, ret);
+    return ret;
 }
 
 int kvm_irqchip_add_adapter_route(KVMState *s, AdapterInfo *adapter)
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index a68eb66..67a01e6 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -9,7 +9,8 @@ kvm_device_ioctl(int fd, int type, void *arg) "dev fd %d, type 0x%x, arg %p"
 kvm_failed_reg_get(uint64_t id, const char *msg) "Warning: Unable to retrieve ONEREG %" PRIu64 " from KVM: %s"
 kvm_failed_reg_set(uint64_t id, const char *msg) "Warning: Unable to set ONEREG %" PRIu64 " to KVM: %s"
 kvm_irqchip_commit_routes(void) ""
-kvm_irqchip_add_msi_route(char *name, int vector, int virq) "dev %s vector %d virq %d"
+kvm_irqchip_add_msi_route(char *name, int vector, int virq, uint64_t addr, uint32_t data) "%s, vector %d, virq %d, msg {addr 0x%"PRIx64", data 0x%x}"
+kvm_irqchip_assign_irqfd(int fd, int virq, int rfd, int status) "(fd=%d, virq=%d, rfd=%d) KVM_IRQFD returns %d"
 kvm_irqchip_update_msi_route(int virq) "Updating MSI route virq=%d"
 kvm_irqchip_release_virq(int virq) "virq %d"
 kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%" PRIx64 " val=0x%x assign: %d size: %d match: %d"
diff --git a/exec.c b/exec.c
index 5473c09..dd99ee0 100644
--- a/exec.c
+++ b/exec.c
@@ -2319,6 +2319,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
         }
         ram_block_notify_add(new_block->host, new_block->max_length);
     }
+    trace_ram_block_add(new_block->host, new_block->max_length,
+                        memory_region_name(new_block->mr),
+                        new_block->mr->readonly ? "ro" : "rw");
 }
 
 #ifdef CONFIG_POSIX
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index 67e34f3..65a2882 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -189,6 +189,7 @@ static void msix_table_mmio_write(void *opaque, hwaddr addr,
     int vector = addr / PCI_MSIX_ENTRY_SIZE;
     bool was_masked;
 
+    trace_msix_table_mmio_write(dev->name, addr, val, size);
     was_masked = msix_is_masked(dev, vector);
     pci_set_long(dev->msix_table + addr, val);
     msix_handle_mask_update(dev, vector, was_masked);
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index c2e1509..6142411 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1324,9 +1324,11 @@ void pci_update_mappings(PCIDevice *d)
     PCIIORegion *r;
     int i;
     pcibus_t new_addr;
+    const char *name;
 
     for(i = 0; i < PCI_NUM_REGIONS; i++) {
         r = &d->io_regions[i];
+        name = r->memory ? r->memory->name : "";
 
         /* this region isn't registered */
         if (!r->size)
@@ -1340,18 +1342,18 @@ void pci_update_mappings(PCIDevice *d)
 
         /* now do the real mapping */
         if (r->addr != PCI_BAR_UNMAPPED) {
-            trace_pci_update_mappings_del(d, pci_dev_bus_num(d),
+            trace_pci_update_mappings_del(d->name, pci_dev_bus_num(d),
                                           PCI_SLOT(d->devfn),
                                           PCI_FUNC(d->devfn),
-                                          i, r->addr, r->size);
+                                          i, r->addr, r->size, name);
             memory_region_del_subregion(r->address_space, r->memory);
         }
         r->addr = new_addr;
         if (r->addr != PCI_BAR_UNMAPPED) {
-            trace_pci_update_mappings_add(d, pci_dev_bus_num(d),
+            trace_pci_update_mappings_add(d->name, pci_dev_bus_num(d),
                                           PCI_SLOT(d->devfn),
                                           PCI_FUNC(d->devfn),
-                                          i, r->addr, r->size);
+                                          i, r->addr, r->size, name);
             memory_region_add_subregion_overlap(r->address_space,
                                                 r->addr, r->memory, 1);
         }
diff --git a/hw/pci/trace-events b/hw/pci/trace-events
index def4b39..6dd7015 100644
--- a/hw/pci/trace-events
+++ b/hw/pci/trace-events
@@ -1,8 +1,8 @@
 # See docs/devel/tracing.txt for syntax documentation.
 
 # pci.c
-pci_update_mappings_del(void *d, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
-pci_update_mappings_add(void *d, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
+pci_update_mappings_del(const char *dname, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size, const char *name) "%s %02x:%02x.%x [%d] 0x%"PRIx64", 0x%"PRIx64"B \"%s\""
+pci_update_mappings_add(const char *dname, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size, const char *name) "%s %02x:%02x.%x [%d] 0x%"PRIx64", 0x%"PRIx64"B \"%s\""
 
 # pci_host.c
 pci_cfg_read(const char *dev, unsigned devid, unsigned fnid, unsigned offs, unsigned val) "%s %02u:%u @0x%x -> 0x%x"
@@ -10,3 +10,4 @@ pci_cfg_write(const char *dev, unsigned devid, unsigned fnid, unsigned offs, uns
 
 # msix.c
 msix_write_config(char *name, bool enabled, bool masked) "dev %s enabled %d masked %d"
+msix_table_mmio_write(char *name, uint64_t addr, uint64_t val, unsigned size)  "(%s, @%"PRId64", 0x%"PRIx64", %dB)"
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a51a093..23c8bf3 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -304,6 +304,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
         return 0;
     }
 
+    trace_vfio_dma_unmap(container->fd, iova, size);
+
     while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
         /*
          * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
@@ -327,6 +329,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
         return -errno;
     }
 
+    if (unmap.size != size) {
+        error_printf("warn: VFIO_UNMAP_DMA(0x%" HWADDR_PRIx ", 0x" RAM_ADDR_FMT
+                     ") only unmaps 0x%llx\n", iova, size, unmap.size);
+    }
+
     return 0;
 }
 
@@ -345,6 +352,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
         return 0;
     }
 
+    trace_vfio_dma_map(container->fd, iova, size, vaddr,
+                       (readonly ? "r" : "rw"));
+
     if (!readonly) {
         map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
     }
@@ -985,7 +995,8 @@ int vfio_region_mmap(VFIORegion *region)
         trace_vfio_region_mmap(memory_region_name(&region->mmaps[i].mem),
                                region->mmaps[i].offset,
                                region->mmaps[i].offset +
-                               region->mmaps[i].size - 1);
+                               region->mmaps[i].size - 1,
+                               region->mmaps[i].mmap);
     }
 
     return 0;
@@ -1696,6 +1707,9 @@ retry:
         goto retry;
     }
 
+    trace_vfio_region(vbasedev->name, index, (*info)->offset, (*info)->size,
+                      (*info)->cap_offset, (*info)->flags);
+
     return 0;
 }
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index f72e277..d74e078 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -41,6 +41,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "migration/blocker.h"
+#include "trace-root.h"
 
 #define TYPE_VFIO_PCI "vfio-pci"
 #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 10d899c..83cd0a6 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -25,7 +25,7 @@ vfio_pci_size_rom(const char *name, int size) "%s ROM size 0x%x"
 vfio_vga_write(uint64_t addr, uint64_t data, int size) " (0x%"PRIx64", 0x%"PRIx64", %d)"
 vfio_vga_read(uint64_t addr, int size, uint64_t data) " (0x%"PRIx64", %d) = 0x%"PRIx64
 vfio_pci_read_config(const char *name, int addr, int len, int val) " (%s, @0x%x, len=0x%x) 0x%x"
-vfio_pci_write_config(const char *name, int addr, int val, int len) " (%s, @0x%x, 0x%x, len=0x%x)"
+vfio_pci_write_config(const char *name, int addr, int val, int len) "(%s, @0x%x, 0x%x, 0x%xB)"
 vfio_msi_setup(const char *name, int pos) "%s PCI MSI CAP @0x%x"
 vfio_msix_early_setup(const char *name, int pos, int table_bar, int offset, int entries) "%s PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, entries %d"
 vfio_check_pcie_flr(const char *name) "%s Supports FLR via PCIe cap"
@@ -37,7 +37,7 @@ vfio_pci_hot_reset_dep_devices(int domain, int bus, int slot, int function, int
 vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot reset: %s"
 vfio_populate_device_config(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
 vfio_populate_device_get_irq_info_failure(const char *errstr) "VFIO_DEVICE_GET_IRQ_INFO failure: %s"
-vfio_realize(const char *name, int group_id) " (%s) group %d"
+vfio_realize(const char *name, int group_id) "(%s) group %d"
 vfio_mdev(const char *name, bool is_mdev) " (%s) is_mdev %d"
 vfio_add_ext_cap_dropped(const char *name, uint16_t cap, uint16_t offset) "%s 0x%x@0x%x"
 vfio_pci_reset(const char *name) " (%s)"
@@ -109,7 +109,7 @@ vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions,
 vfio_put_base_device(int fd) "close vdev->fd=%d"
 vfio_region_setup(const char *dev, int index, const char *name, unsigned long flags, unsigned long offset, unsigned long size) "Device %s, region %d \"%s\", flags: 0x%lx, offset: 0x%lx, size: 0x%lx"
 vfio_region_mmap_fault(const char *name, int index, unsigned long offset, unsigned long size, int fault) "Region %s mmaps[%d], [0x%lx - 0x%lx], fault: %d"
-vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Region %s [0x%lx - 0x%lx]"
+vfio_region_mmap(const char *name, unsigned long offset, unsigned long end, void *addr) "%s [0x%lx - 0x%lx] maps to %p"
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
 vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
@@ -117,6 +117,9 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
 vfio_dma_unmap_overflow_workaround(void) ""
+vfio_dma_unmap(int fd, uint64_t iova, uint64_t size) "fd %d, iova 0x%"PRIx64", len 0x%"PRIx64
+vfio_dma_map(int fd, uint64_t iova, uint64_t size, void *addr, const char *access) "fd %d, iova 0x%"PRIx64", len 0x%"PRIx64", va %p, %s"
+vfio_region(const char *name, int index, uint64_t offset, uint64_t size, int cap_offset, int flags) "%s [%d]: +0x%"PRIx64", 0x%"PRIx64"B, cap +0x%x, flags 0x%x"
 
 # platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
diff --git a/trace-events b/trace-events
index 42107eb..98589a4 100644
--- a/trace-events
+++ b/trace-events
@@ -107,6 +107,8 @@ qmp_job_complete(void *job) "job %p"
 qmp_job_finalize(void *job) "job %p"
 qmp_job_dismiss(void *job) "job %p"
 
+# exec.c
+ram_block_add(void *host, uint64_t maxlen, const char *name, const char *mode) "host=%p, maxlen=0x%"PRIx64", mr = {name=%s, %s}"
 
 ### Guest events, keep at bottom
 
-- 
1.8.3.1




* Re: [PATCH V1 03/32] savevm: QMP command for cprsave
  2020-07-30 15:14 ` [PATCH V1 03/32] savevm: QMP command for cprsave Steve Sistare
@ 2020-07-30 16:12   ` Eric Blake
  2020-07-30 17:52     ` Steven Sistare
  2020-09-11 16:43   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 118+ messages in thread
From: Eric Blake @ 2020-07-30 16:12 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

On 7/30/20 10:14 AM, Steve Sistare wrote:
> To enable live reboot, provide the cprsave QMP command and the VMS_REBOOT
> vmstate-saving operation, which saves the state of the virtual machine in a
> simple file.
> 
> Syntax:
>    {'command':'cprsave', 'data':{'file':'str', 'mode':'str'}}
> 
>    The mode argument must be 'reboot'.  Additional modes will be defined in
>    the future.
> 

Focusing on just the UI:

> +++ b/qapi/migration.json
> @@ -1621,3 +1621,17 @@
>   ##
>   { 'event': 'UNPLUG_PRIMARY',
>     'data': { 'device-id': 'str' } }
> +
> +##
> +# @cprsave:
> +#
> +# Create a checkpoint of the virtual machine device state in @file.
> +# Guest RAM and guest block device blocks are not saved.
> +#
> +# @file: name of checkpoint file

Since you used qemu_open() in the code, this can include a 
'/dev/fdset/NNN' magic name for saving into a previously-passed-in file 
descriptor instead of directly opening a local file name.  That's a good 
thing, but I don't know if it needs explicit mention in the docs.

> +# @mode: 'reboot' : checkpoint can be cprload'ed after a host kexec reboot.
> +#
> +# Since 5.0

5.2 (you've missed 5.0 by a long shot, and even 5.1 is too late now).

> +##
> +{ 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'str' } }

'mode' should be an enum type, rather than an open-coded string:

{ 'enum': 'CprMode', 'data': ['reboot'] }
{ 'command': 'cprsave', 'data': {'file': 'str', 'mode': 'CprMode' } }
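
To illustrate why an enum helps: with a closed set of modes, an unknown
value can be rejected in one place rather than by string comparisons
scattered through the code.  The standalone C sketch below is only an
illustration.  The names CprMode, cpr_mode_names and cpr_mode_parse are
assumptions made for this example, not the code that the QAPI generator
emits.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of a QAPI enum with a single member, 'reboot'. */
typedef enum {
    CPR_MODE_REBOOT,
    CPR_MODE__MAX,
} CprMode;

/* Wire names, indexed by enum value. */
static const char *const cpr_mode_names[CPR_MODE__MAX] = {
    [CPR_MODE_REBOOT] = "reboot",
};

/* Return the enum value for name, or -1 if name is NULL or unknown. */
static int cpr_mode_parse(const char *name)
{
    if (name == NULL) {
        return -1;
    }
    for (int i = 0; i < CPR_MODE__MAX; i++) {
        if (strcmp(name, cpr_mode_names[i]) == 0) {
            return i;
        }
    }
    return -1;
}
```

Adding a mode later then means one new enum member and one new table
entry, and callers can switch on the value instead of comparing strings.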

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH V1 05/32] savevm: QMP command for cprload
  2020-07-30 15:14 ` [PATCH V1 05/32] savevm: QMP command for cprload Steve Sistare
@ 2020-07-30 16:14   ` Eric Blake
  2020-07-30 18:00     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Eric Blake @ 2020-07-30 16:14 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

On 7/30/20 10:14 AM, Steve Sistare wrote:
> Provide the cprload QMP command.  The VM is created from the file produced
> by the cprsave command.  Guest RAM is restored in-place from the shared
> memory backend file, and guest block devices are used as is.  The contents
> of such devices must not be modified between the cprsave and cprload
> operations.  If the VM was running at cprsave time, then VM execution
> resumes.

Is it always wise to unconditionally resume, or might this command need 
an additional optional knob that says what state (paused or running) to 
move into?

> 
> Syntax:
>    {'command':'cprload', 'data':{'file':'str'}}
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
> ---

> +++ b/qapi/migration.json
> @@ -1635,3 +1635,14 @@
>   ##
>   { 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'str' } }
>   
> +##
> +# @cprload:
> +#
> +# Start virtual machine from checkpoint file that was created earlier using
> +# the cprsave command.
> +#
> +# @file: name of checkpoint file
> +#
> +# Since 5.0

Another 5.2 instance. I'll quit pointing it out for the rest of the series.

> +##
> +{ 'command': 'cprload', 'data': { 'file': 'str' } }
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 660537a..8478778 100644

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH V1 07/32] savevm: QMP command for cprinfo
  2020-07-30 15:14 ` [PATCH V1 07/32] savevm: QMP command for cprinfo Steve Sistare
@ 2020-07-30 16:17   ` Eric Blake
  2020-07-30 18:02     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Eric Blake @ 2020-07-30 16:17 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

On 7/30/20 10:14 AM, Steve Sistare wrote:
> Provide the cprinfo QMP command.  This returns a string with a space-
> separated list of modes supported by cprsave, and can be used by clients
> as a feature test to check if the running QEMU instance supports cprsave.

When you've already got array support in the QMP language, why are you 
making the user parse a string into an array after the fact?

> 
> Syntax:
>    {'command':'cprinfo', 'returns':'str'}
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---

> +++ b/qapi/migration.json
> @@ -1623,6 +1623,15 @@
>     'data': { 'device-id': 'str' } }
>   
>   ##
> +# @cprinfo:
> +#
> +# Return a space-delimited list of modes supported by the cprsave command
> +#
> +# Since 5.0
> +##
> +{ 'command': 'cprinfo', 'returns': 'str' }

Returning a 'str' is non-extensible.  The fact that you had to edit the 
whitelist is proof that you should have done something better.  I recommend:

{ 'command': 'cprinfo', 'returns': { 'modes': [ 'CprMode' ] } }

using the CprMode enum I proposed earlier.

> +
> +##
>   # @cprsave:
>   #
>   # Create a checkpoint of the virtual machine device state in @file.
> diff --git a/qapi/pragma.json b/qapi/pragma.json
> index cffae27..43bdb39 100644
> --- a/qapi/pragma.json
> +++ b/qapi/pragma.json
> @@ -5,6 +5,7 @@
>   { 'pragma': {
>       # Commands allowed to return a non-dictionary:
>       'returns-whitelist': [
> +        'cprinfo',

This should not be needed.  Design the return value correctly in the 
first place.

>           'human-monitor-command',
>           'qom-get',
>           'query-migrate-cache-size',
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH V1 12/32] vl: pause option
  2020-07-30 15:14 ` [PATCH V1 12/32] vl: pause option Steve Sistare
@ 2020-07-30 16:20   ` Eric Blake
  2020-07-30 18:11     ` Steven Sistare
  2020-07-30 17:03   ` Alex Bennée
  1 sibling, 1 reply; 118+ messages in thread
From: Eric Blake @ 2020-07-30 16:20 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

On 7/30/20 10:14 AM, Steve Sistare wrote:
> Provide the -pause command-line parameter and the QEMU_PAUSE environment
> variable to briefly pause QEMU in main and allow a developer to attach gdb.
> Useful when the developer does not invoke QEMU directly, such as when using
> libvirt.

How would you set this option with libvirt?

It feels like you are trying to reinvent something that is already 
well-documented:

https://www.berrange.com/posts/2011/10/12/debugging-early-startup-of-kvm-with-gdb-when-launched-by-libvirtd/

> 
> Usage:
>    qemu -pause <seconds>
>    or
>    export QEMU_PAUSE=<seconds>
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>   qemu-options.hx |  9 +++++++++
>   softmmu/vl.c    | 15 ++++++++++++++-
>   2 files changed, 23 insertions(+), 1 deletion(-)

> @@ -3204,6 +3211,12 @@ void qemu_init(int argc, char **argv, char **envp)
>               case QEMU_OPTION_gdb:
>                   add_device_config(DEV_GDB, optarg);
>                   break;
> +            case QEMU_OPTION_pause:
> +                seconds = atoi(optarg);

atoi() cannot detect overflow.  You should never use it in robust 
parsing of untrusted input.
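
A minimal sketch of checked parsing, assuming plain strtol() from the C
library.  The helper name parse_pause_seconds is hypothetical and not
part of the patch.  QEMU's own tree has checked helpers of this kind in
util/cutils.c, which would likely be the natural choice in a real patch.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a non-negative decimal number of seconds from str into *out.
 * Returns 0 on success, or a negative errno value on empty input,
 * trailing characters, a negative value, or overflow.
 */
static int parse_pause_seconds(const char *str, unsigned int *out)
{
    char *end;
    long val;

    if (str == NULL || *str == '\0') {
        return -EINVAL;
    }

    errno = 0;
    val = strtol(str, &end, 10);
    if (end == str || *end != '\0') {
        return -EINVAL;         /* no digits, or trailing characters */
    }
    if (errno == ERANGE || val > INT_MAX) {
        return -ERANGE;         /* value does not fit */
    }
    if (val < 0) {
        return -EINVAL;         /* a negative pause makes no sense */
    }

    *out = (unsigned int)val;
    return 0;
}
```

Unlike atoi(), this reports "10x", an empty string and out-of-range
values as errors instead of silently returning some number.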

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart
  2020-07-30 15:14 ` [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart Steve Sistare
@ 2020-07-30 16:22   ` Eric Blake
  2020-07-30 18:14     ` Steven Sistare
  2020-09-11 18:44   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 118+ messages in thread
From: Eric Blake @ 2020-07-30 16:22 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

On 7/30/20 10:14 AM, Steve Sistare wrote:
> Add the VMS_RESTART variant of vmstate, for use when upgrading qemu in place
> on the same host without a reboot.  Invoke it using:
>    cprsave <filename> restart
> 
> VMS_RESTART supports guest ram mapped by private anonymous memory, versus
> VMS_REBOOT which requires that guest ram be mapped by persistent shared
> memory.  Subsequent patches complete its implementation.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---

> +++ b/qapi/migration.json
> @@ -1639,6 +1639,7 @@
>   #
>   # @file: name of checkpoint file
>   # @mode: 'reboot' : checkpoint can be cprload'ed after a host kexec reboot.
> +#        'restart': checkpoint can be cprload'ed after restarting qemu.

This should be a modification to an enum type (the 'CprMode' type I 
suggested earlier in the series).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH V1 00/32] Live Update
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (31 preceding siblings ...)
  2020-07-30 15:14 ` [PATCH V1 32/32] vfio-pci: improved tracing Steve Sistare
@ 2020-07-30 16:52 ` Daniel P. Berrangé
  2020-07-30 18:48   ` Steven Sistare
  2020-07-30 17:15 ` Paolo Bonzini
                   ` (2 subsequent siblings)
  35 siblings, 1 reply; 118+ messages in thread
From: Daniel P. Berrangé @ 2020-07-30 16:52 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Markus Armbruster, qemu-devel,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert

On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
> Improve and extend the qemu functions that save and restore VM state so a
> guest may be suspended and resumed with minimal pause time.  qemu may be
> updated to a new version in between.
> 
> The first set of patches adds the cprsave and cprload commands to save and
> restore VM state, and allow the host kernel to be updated and rebooted in
> between.  The VM must create guest RAM in a persistent shared memory file,
> such as /dev/dax0.0 or persistent /dev/shm PKRAM as proposed in
> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> 
> cprsave stops the VCPUs and saves VM device state in a simple file, and
> thus supports any type of guest image and block device.  The caller must
> not modify the VM's block devices between cprsave and cprload.
> 
> cprsave and cprload support guests with vfio devices if the caller first
> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> The guest drivers' suspend methods flush outstanding requests and re-
> initialize the devices, and thus there is no device state to save and
> restore.
> 
>    1 savevm: add vmstate handler iterators
>    2 savevm: VM handlers mode mask
>    3 savevm: QMP command for cprsave
>    4 savevm: HMP Command for cprsave
>    5 savevm: QMP command for cprload
>    6 savevm: HMP Command for cprload
>    7 savevm: QMP command for cprinfo
>    8 savevm: HMP command for cprinfo
>    9 savevm: prevent cprsave if memory is volatile
>   10 kvmclock: restore paused KVM clock
>   11 cpu: disable ticks when suspended
>   12 vl: pause option
>   13 gdbstub: gdb support for suspended state
> 
> The next patches add a restart method that eliminates the persistent memory
> constraint, and allows qemu to be updated across the restart, but does not
> allow host reboot.  Anonymous memory segments used by the guest are
> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> 
>   14 savevm: VMS_RESTART and cprsave restart
>   15 vl: QEMU_START_FREEZE env var
>   16 oslib: add qemu_clr_cloexec
>   17 util: env var helpers
>   18 osdep: import MADV_DOEXEC
>   19 memory: ram_block_add cosmetic changes
>   20 vl: add helper to request re-exec
>   21 exec, memory: exec(3) to restart
>   22 char: qio_channel_socket_accept reuse fd
>   23 char: save/restore chardev socket fds
>   24 ui: save/restore vnc socket fds
>   25 char: save/restore chardev pty fds

Keeping FDs open across re-exec is a nice trick, but how are you dealing
with the state associated with them, most especially the TLS encryption
state? AFAIK, there's no way to serialize/deserialize the TLS state that
GNUTLS maintains, and the patches don't show any sign of dealing with
this. IOW it looks like while the FD will be preserved, any TLS session
running on it will fail.
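
For reference, keeping a descriptor open across exec(3) amounts to
clearing its close-on-exec flag.  The sketch below shows only that
step.  The helper name keep_fd_across_exec is hypothetical and is not
the qemu_clr_cloexec implementation from this series.

```c
#include <fcntl.h>

/*
 * Clear FD_CLOEXEC on fd so the descriptor stays open across exec(3).
 * Returns 0 on success, -1 with errno set on failure.
 */
static int keep_fd_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}
```

This preserves only the kernel-side descriptor.  Anything the old
process image kept in user space for that descriptor, such as a TLS
session, is gone after exec unless it is saved and restored separately.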

I'm going to presume that you're probably just considering the TLS features
out of scope for your patch series.  It would be useful if you have any
info about this and other things you've considered out of scope for this
patch series.

I'm not seeing anything in the block layer about preserving open FDs, so
I presume you're just letting the block layer close and then re-open any
FDs it has?  This would have the side effect that any locks held on the
FDs are lost, so there's a potential race condition where another process
could acquire the lock and prevent the re-exec completing. That said this
is unavoidable, because the Linux kernel is completely broken wrt keeping
fcntl() locks held across a re-exec, always throwing away the locks if
more than 1 thread is running [1].

Regards,
Daniel

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1552621
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V1 12/32] vl: pause option
  2020-07-30 15:14 ` [PATCH V1 12/32] vl: pause option Steve Sistare
  2020-07-30 16:20   ` Eric Blake
@ 2020-07-30 17:03   ` Alex Bennée
  2020-07-30 18:14     ` Steven Sistare
  1 sibling, 1 reply; 118+ messages in thread
From: Alex Bennée @ 2020-07-30 17:03 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Markus Armbruster,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé


Steve Sistare <steven.sistare@oracle.com> writes:

> Provide the -pause command-line parameter and the QEMU_PAUSE environment
> variable to briefly pause QEMU in main and allow a developer to attach gdb.
> Useful when the developer does not invoke QEMU directly, such as when using
> libvirt.

How does this differ from -S?

>
> Usage:
>   qemu -pause <seconds>
>   or
>   export QEMU_PAUSE=<seconds>
>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  qemu-options.hx |  9 +++++++++
>  softmmu/vl.c    | 15 ++++++++++++++-
>  2 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 708583b..8505cf2 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -3668,6 +3668,15 @@ SRST
>      option is experimental.
>  ERST
>  
> +DEF("pause", HAS_ARG, QEMU_OPTION_pause, \
> +    "-pause secs    Pause for secs seconds on entry to main.\n", QEMU_ARCH_ALL)
> +
> +SRST
> +``--pause secs``
> +    Pause for a number of seconds on entry to main.  Useful for attaching
> +    a debugger after QEMU has been launched by some other entity.
> +ERST
> +

It seems like having an option to race with the debugger is just asking
for trouble.

>  DEF("S", 0, QEMU_OPTION_S, \
>      "-S              freeze CPU at startup (use 'c' to start execution)\n",
>      QEMU_ARCH_ALL)
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 8478778..951994f 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -2844,7 +2844,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
>  
>  void qemu_init(int argc, char **argv, char **envp)
>  {
> -    int i;
> +    int i, seconds;
>      int snapshot, linux_boot;
>      const char *initrd_filename;
>      const char *kernel_filename, *kernel_cmdline;
> @@ -2882,6 +2882,13 @@ void qemu_init(int argc, char **argv, char **envp)
>      QemuPluginList plugin_list = QTAILQ_HEAD_INITIALIZER(plugin_list);
>      int mem_prealloc = 0; /* force preallocation of physical target memory */
>  
> +    if (getenv("QEMU_PAUSE")) {
> +        seconds = atoi(getenv("QEMU_PAUSE"));
> +        printf("Pausing %d seconds for debugger. QEMU PID is %d\n",
> +               seconds, getpid());
> +        sleep(seconds);
> +    }
> +
>      os_set_line_buffering();
>  
>      error_init(argv[0]);
> @@ -3204,6 +3211,12 @@ void qemu_init(int argc, char **argv, char **envp)
>              case QEMU_OPTION_gdb:
>                  add_device_config(DEV_GDB, optarg);
>                  break;
> +            case QEMU_OPTION_pause:
> +                seconds = atoi(optarg);
> +                printf("Pausing %d seconds for debugger. QEMU PID is %d\n",
> +                            seconds, getpid());
> +                sleep(seconds);
> +                break;
>              case QEMU_OPTION_L:
>                  if (is_help_option(optarg)) {
>                      list_data_dirs = true;


-- 
Alex Bennée



* Re: [PATCH V1 00/32] Live Update
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (32 preceding siblings ...)
  2020-07-30 16:52 ` [PATCH V1 00/32] Live Update Daniel P. Berrangé
@ 2020-07-30 17:15 ` Paolo Bonzini
  2020-07-30 19:09   ` Steven Sistare
  2020-07-30 17:49 ` Dr. David Alan Gilbert
  2020-08-04 18:18 ` Steven Sistare
  35 siblings, 1 reply; 118+ messages in thread
From: Paolo Bonzini @ 2020-07-30 17:15 UTC (permalink / raw)
  To: Steve Sistare, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin,
	Philippe Mathieu-Daudé,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée

On 30/07/20 17:14, Steve Sistare wrote:
> The first set of patches adds the cprsave and cprload commands to save and
> restore VM state, and allow the host kernel to be updated and rebooted in
> between.  The VM must create guest RAM in a persistent shared memory file,
> such as /dev/dax0.0 or persistent /dev/shm PKRAM as proposed in
> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> 
> cprsave stops the VCPUs and saves VM device state in a simple file, and
> thus supports any type of guest image and block device.  The caller must
> not modify the VM's block devices between cprsave and cprload.

Stupid question, what does cpr stand for?  If it is checkpoint/restore,
please spell it out.  Also, how does the functionality compare to
xen-save-devices-state and xen-load-devices-state?

> cprsave and cprload support guests with vfio devices if the caller first
> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> The guest drivers' suspend methods flush outstanding requests and re-
> initialize the devices, and thus there is no device state to save and
> restore.

This probably should be allowed even for regular migration.  Can you
generalize the code as a separate series?

Thanks,

Paolo




* Re: [PATCH V1 00/32] Live Update
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (33 preceding siblings ...)
  2020-07-30 17:15 ` Paolo Bonzini
@ 2020-07-30 17:49 ` Dr. David Alan Gilbert
  2020-07-30 19:31   ` Steven Sistare
  2020-08-04 18:18 ` Steven Sistare
  35 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-30 17:49 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Improve and extend the qemu functions that save and restore VM state so a
> guest may be suspended and resumed with minimal pause time.  qemu may be
> updated to a new version in between.

Nice.

> The first set of patches adds the cprsave and cprload commands to save and
> restore VM state, and allow the host kernel to be updated and rebooted in
> between.  The VM must create guest RAM in a persistent shared memory file,
> such as /dev/dax0.0 or persistent /dev/shm PKRAM as proposed in
> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> 
> cprsave stops the VCPUs and saves VM device state in a simple file, and
> thus supports any type of guest image and block device.  The caller must
> not modify the VM's block devices between cprsave and cprload.

can I ask why you don't just add a migration flag to skip the devices
you don't want, and then do a migrate to a file?
(i.e. migrate "exec:cat > afile")
We already have the 'x-ignore-shared' capability that's used for doing
RAM snapshots of VMs; primarily I think for being able to start a VM
from a RAM snapshot as a fast VM start trick.
(There's also a xen_save_devices that does something similar).
If you backed the RAM as you say, enabled x-ignore-shared and then did:

   migrate "exec:cat > afile"

and restarted the destination with:

    migrate_incoming "exec:cat afile"

what is different (except the later stuff about the vfio magic and
chardevs).

Dave

> cprsave and cprload support guests with vfio devices if the caller first
> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> The guest drivers suspend methods flush outstanding requests and re-
> initialize the devices, and thus there is no device state to save and
> restore.
> 
>    1 savevm: add vmstate handler iterators
>    2 savevm: VM handlers mode mask
>    3 savevm: QMP command for cprsave
>    4 savevm: HMP Command for cprsave
>    5 savevm: QMP command for cprload
>    6 savevm: HMP Command for cprload
>    7 savevm: QMP command for cprinfo
>    8 savevm: HMP command for cprinfo
>    9 savevm: prevent cprsave if memory is volatile
>   10 kvmclock: restore paused KVM clock
>   11 cpu: disable ticks when suspended
>   12 vl: pause option
>   13 gdbstub: gdb support for suspended state
> 
> The next patches add a restart method that eliminates the persistent memory
> constraint, and allows qemu to be updated across the restart, but does not
> allow host reboot.  Anonymous memory segments used by the guest are
> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> 
>   14 savevm: VMS_RESTART and cprsave restart
>   15 vl: QEMU_START_FREEZE env var
>   16 oslib: add qemu_clr_cloexec
>   17 util: env var helpers
>   18 osdep: import MADV_DOEXEC
>   19 memory: ram_block_add cosmetic changes
>   20 vl: add helper to request re-exec
>   21 exec, memory: exec(3) to restart
>   22 char: qio_channel_socket_accept reuse fd
>   23 char: save/restore chardev socket fds
>   24 ui: save/restore vnc socket fds
>   25 char: save/restore chardev pty fds
>   26 monitor: save/restore QMP negotiation status
>   27 vhost: reset vhost devices upon cprsave
>   28 char: restore terminal on restart
> 
> The next patches extend the restart method to save and restore vfio-pci
> state, eliminating the requirement for a guest agent.  The vfio container,
> group, and device descriptors are preserved across the qemu re-exec.
> 
>   29 pci: export pci_update_mappings
>   30 vfio-pci: save and restore
>   31 vfio-pci: trace pci config
>   32 vfio-pci: improved tracing
> 
> Here is an example of updating qemu from v4.2.0 to v4.2.1 using 
> "cprload restart".  The software update is performed while the guest is
> running to minimize downtime.
> 
> window 1				| window 2
> 					|
> # qemu-system-x86_64 ... 		|
> QEMU 4.2.0 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: running			|
> 					| # yum update qemu
> (qemu) cprsave /tmp/qemu.sav restart	|
> QEMU 4.2.1 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: paused (prelaunch)		|
> (qemu) cprload /tmp/qemu.sav		|
> (qemu) info status			|
> VM status: running			|
> 
> 
> Here is an example of updating the host kernel using "cprload reboot"
> 
> window 1					| window 2
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: running				|
> 						| # yum update kernel-uek
> (qemu) cprsave /tmp/qemu.sav restart		|
> 						|
> # systemctl kexec				|
> kexec_core: Starting new kernel			|
> ...						|
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: paused (prelaunch)			|
> (qemu) cprload /tmp/qemu.sav			|
> (qemu) info status				|
> VM status: running				|
> 
> 
> Mark Kanda (5):
>   char: qio_channel_socket_accept reuse fd
>   char: save/restore chardev socket fds
>   ui: save/restore vnc socket fds
>   monitor: save/restore QMP negotiation status
>   vhost: reset vhost devices upon cprsave
> 
> Steve Sistare (27):
>   savevm: add vmstate handler iterators
>   savevm: VM handlers mode mask
>   savevm: QMP command for cprsave
>   savevm: HMP Command for cprsave
>   savevm: QMP command for cprload
>   savevm: HMP Command for cprload
>   savevm: QMP command for cprinfo
>   savevm: HMP command for cprinfo
>   savevm: prevent cprsave if memory is volatile
>   kvmclock: restore paused KVM clock
>   cpu: disable ticks when suspended
>   vl: pause option
>   gdbstub: gdb support for suspended state
>   savevm: VMS_RESTART and cprsave restart
>   vl: QEMU_START_FREEZE env var
>   oslib: add qemu_clr_cloexec
>   util: env var helpers
>   osdep: import MADV_DOEXEC
>   memory: ram_block_add cosmetic changes
>   vl: add helper to request re-exec
>   exec, memory: exec(3) to restart
>   char: save/restore chardev pty fds
>   char: restore terminal on restart
>   pci: export pci_update_mappings
>   vfio-pci: save and restore
>   vfio-pci: trace pci config
>   vfio-pci: improved tracing
> 
>  MAINTAINERS                    |   7 ++
>  accel/kvm/kvm-all.c            |   8 +-
>  accel/kvm/trace-events         |   3 +-
>  chardev/char-pty.c             |  38 +++++--
>  chardev/char-socket.c          |  35 ++++++
>  chardev/char-stdio.c           |   7 ++
>  chardev/char.c                 |  16 +++
>  exec.c                         |  88 +++++++++++++--
>  gdbstub.c                      |  11 +-
>  hmp-commands.hx                |  46 ++++++++
>  hw/i386/kvm/clock.c            |   6 +-
>  hw/pci/msix.c                  |   1 +
>  hw/pci/pci.c                   |  17 +--
>  hw/pci/trace-events            |   5 +-
>  hw/vfio/common.c               | 115 ++++++++++++++++----
>  hw/vfio/pci.c                  | 179 ++++++++++++++++++++++++++++++-
>  hw/vfio/platform.c             |   2 +-
>  hw/vfio/trace-events           |  11 +-
>  hw/virtio/vhost.c              |  12 +++
>  include/chardev/char.h         |   8 ++
>  include/exec/memory.h          |   4 +
>  include/hw/pci/pci.h           |   2 +
>  include/hw/vfio/vfio-common.h  |   4 +-
>  include/io/channel-socket.h    |   3 +-
>  include/migration/register.h   |   3 +
>  include/migration/vmstate.h    |  11 ++
>  include/monitor/hmp.h          |   3 +
>  include/qemu/cutils.h          |   1 +
>  include/qemu/env.h             |  31 ++++++
>  include/qemu/osdep.h           |   8 ++
>  include/sysemu/sysemu.h        |  10 ++
>  io/channel-socket.c            |  12 ++-
>  io/net-listener.c              |   4 +-
>  migration/block.c              |   1 +
>  migration/migration.c          |   4 +-
>  migration/ram.c                |   1 +
>  migration/savevm.c             | 237 ++++++++++++++++++++++++++++++++++++-----
>  migration/savevm.h             |   4 +-
>  monitor/hmp-cmds.c             |  28 +++++
>  monitor/qmp-cmds.c             |  16 +++
>  monitor/qmp.c                  |  42 ++++++++
>  qapi/migration.json            |  35 ++++++
>  qapi/pragma.json               |   1 +
>  qemu-options.hx                |   9 ++
>  scsi/qemu-pr-helper.c          |   2 +-
>  softmmu/vl.c                   |  65 ++++++++++-
>  tests/qtest/tpm-emu.c          |   2 +-
>  tests/test-char.c              |   2 +-
>  tests/test-io-channel-socket.c |   4 +-
>  trace-events                   |   2 +
>  ui/vnc.c                       | 153 +++++++++++++++++++++-----
>  util/Makefile.objs             |   2 +-
>  util/env.c                     | 132 +++++++++++++++++++++++
>  util/oslib-posix.c             |   9 ++
>  util/oslib-win32.c             |   4 +
>  55 files changed, 1331 insertions(+), 135 deletions(-)
>  create mode 100644 include/qemu/env.h
>  create mode 100644 util/env.c
> 
> -- 
> 1.8.3.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 03/32] savevm: QMP command for cprsave
  2020-07-30 16:12   ` Eric Blake
@ 2020-07-30 17:52     ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-07-30 17:52 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

On 7/30/2020 12:12 PM, Eric Blake wrote:
> On 7/30/20 10:14 AM, Steve Sistare wrote:
>> To enable live reboot, provide the cprsave QMP command and the VMS_REBOOT
>> vmstate-saving operation, which saves the state of the virtual machine in a
>> simple file.
>>
>> Syntax:
>>    {'command':'cprsave', 'data':{'file':'str', 'mode':'str'}}
>>
>>    The mode argument must be 'reboot'.  Additional modes will be defined in
>>    the future.
>>
> 
> Focusing on just the UI:
> 
>> +++ b/qapi/migration.json
>> @@ -1621,3 +1621,17 @@
>>   ##
>>   { 'event': 'UNPLUG_PRIMARY',
>>     'data': { 'device-id': 'str' } }
>> +
>> +##
>> +# @cprsave:
>> +#
>> +# Create a checkpoint of the virtual machine device state in @file.
>> +# Guest RAM and guest block device blocks are not saved.
>> +#
>> +# @file: name of checkpoint file
> 
> Since you used qemu_open() in the code, this can include a '/dev/fdset/NNN' magic name for saving into a previously-passed-in file descriptor instead of directly opening a local file name.  That's a good thing, but I don't know if it needs explicit mention in the docs.

OK, I'll look for other uses of file and fdset in the docs and see if it fits naturally here.

>> +# @mode: 'reboot' : checkpoint can be cprload'ed after a host kexec reboot.
>> +#
>> +# Since 5.0
> 
> 5.2 (you've missed 5.0 by a long shot, and even 5.1 is too late now).

Yup!  Will fix here and in the other patches, thanks.

>> +##
>> +{ 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'str' } }
> 
> 'mode' should be an enum type, rather than an open-coded string:
> 
> { 'enum': 'CprMode', 'data': ['reboot'] }
> { 'command': 'cprsave', 'data': {'file': 'str', 'mode': 'CprMode' } }

Will do, thanks for the syntax.
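
For illustration only, here is roughly what an enum buys over an open-coded
string, written as standalone C rather than generated QAPI code.  The names
CprMode, CPR_MODE_REBOOT, and cpr_mode_parse() are assumptions made for this
sketch, not necessarily what the QAPI generator emits:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the enum generated from the 'CprMode' type. */
typedef enum {
    CPR_MODE_REBOOT,
    CPR_MODE__MAX,      /* number of modes, not a valid mode itself */
} CprMode;

/* Table of accepted mode names, indexed by enum value. */
static const char *const cpr_mode_names[CPR_MODE__MAX] = {
    [CPR_MODE_REBOOT] = "reboot",
};

/* Return 0 and store the mode on a match; return -1 for an unknown string. */
static int cpr_mode_parse(const char *s, CprMode *mode)
{
    if (s == NULL) {
        return -1;
    }
    for (int i = 0; i < CPR_MODE__MAX; i++) {
        if (strcmp(s, cpr_mode_names[i]) == 0) {
            *mode = (CprMode)i;
            return 0;
        }
    }
    return -1;
}
```

With a closed set of values, an unknown mode is rejected in one place, and a
later mode only needs a new enum member and a new table entry.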

- Steve




* Re: [PATCH V1 05/32] savevm: QMP command for cprload
  2020-07-30 16:14   ` Eric Blake
@ 2020-07-30 18:00     ` Steven Sistare
  2020-09-11 17:18       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-07-30 18:00 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

On 7/30/2020 12:14 PM, Eric Blake wrote:
> On 7/30/20 10:14 AM, Steve Sistare wrote:
>> Provide the cprload QMP command.  The VM is created from the file produced
>> by the cprsave command.  Guest RAM is restored in-place from the shared
>> memory backend file, and guest block devices are used as is.  The contents
>> of such devices must not be modified between the cprsave and cprload
>> operations.  If the VM was running at cprsave time, then VM execution
>> resumes.
> 
> Is it always wise to unconditionally resume, or might this command need an additional optional knob that says what state (paused or running) to move into?

This can already be done.  Issue a stop command before cprsave, then cprload will finish in a
paused state.

Also, cprsave re-execs and leaves the guest in a paused state.  One can send
device add commands, then send cprload which continues.

>> Syntax:
>>    {'command':'cprload', 'data':{'file':'str'}}
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
>> ---
> 
>> +++ b/qapi/migration.json
>> @@ -1635,3 +1635,14 @@
>>   ##
>>   { 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'str' } }
>>   +##
>> +# @cprload:
>> +#
>> +# Start virtual machine from checkpoint file that was created earlier using
>> +# the cprsave command.
>> +#
>> +# @file: name of checkpoint file
>> +#
>> +# Since 5.0
> 
> another 5.2 instance. I'll quit pointing it out for the rest of the series.

Will find and fix all, thanks.

- Steve




* Re: [PATCH V1 07/32] savevm: QMP command for cprinfo
  2020-07-30 16:17   ` Eric Blake
@ 2020-07-30 18:02     ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-07-30 18:02 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

On 7/30/2020 12:17 PM, Eric Blake wrote:
> On 7/30/20 10:14 AM, Steve Sistare wrote:
>> Provide the cprinfo QMP command.  This returns a string with a space-
>> separated list of modes supported by cprsave, and can be used by clients
>> as a feature test to check if the running QEMU instance supports cprsave.
> 
> When you've already got array support in the QMP language, why are you making the user parse a string into an array after the fact?

Will fix as you suggest, thanks.  I had HMP on the brain - Steve

>> Syntax:
>>    {'command':'cprinfo', 'returns':'str'}
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
> 
>> +++ b/qapi/migration.json
>> @@ -1623,6 +1623,15 @@
>>     'data': { 'device-id': 'str' } }
>>     ##
>> +# @cprinfo:
>> +#
>> +# Return a space-delimited list of modes supported by the cprsave command
>> +#
>> +# Since 5.0
>> +##
>> +{ 'command': 'cprinfo', 'returns': 'str' }
> 
> Returning a 'str' is non-extensible.  The fact that you had to edit the whitelist is proof that you should have done something better.  I recommend:
> 
> { 'command': 'cprinfo', 'returns': { 'modes': [ 'CprMode' ] } }
> 
> using the CprMode enum I proposed earlier.
> 
>> +
>> +##
>>   # @cprsave:
>>   #
>>   # Create a checkpoint of the virtual machine device state in @file.
>> diff --git a/qapi/pragma.json b/qapi/pragma.json
>> index cffae27..43bdb39 100644
>> --- a/qapi/pragma.json
>> +++ b/qapi/pragma.json
>> @@ -5,6 +5,7 @@
>>   { 'pragma': {
>>       # Commands allowed to return a non-dictionary:
>>       'returns-whitelist': [
>> +        'cprinfo',
> 
> This should not be needed.  Design the return value correctly in the first place.
> 
>>           'human-monitor-command',
>>           'qom-get',
>>           'query-migrate-cache-size',
>>
> 



* Re: [PATCH V1 12/32] vl: pause option
  2020-07-30 16:20   ` Eric Blake
@ 2020-07-30 18:11     ` Steven Sistare
  2020-07-31 10:07       ` Daniel P. Berrangé
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-07-30 18:11 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

On 7/30/2020 12:20 PM, Eric Blake wrote:
> On 7/30/20 10:14 AM, Steve Sistare wrote:
>> Provide the -pause command-line parameter and the QEMU_PAUSE environment
>> variable to briefly pause QEMU in main and allow a developer to attach gdb.
>> Useful when the developer does not invoke QEMU directly, such as when using
>> libvirt.
> 
> How would you set this option with libvirt?

Add -pause in the qemu args in the xml.
 
> It feels like you are trying to reinvent something that is already well-documented:
> 
> https://www.berrange.com/posts/2011/10/12/debugging-early-startup-of-kvm-with-gdb-when-launched-by-libvirtd/

Too many steps to reach BINGO for my taste.  Easier is better.  Also, in our shop we start qemu 
in other ways, such as via services.

These new hooks helped me and my colleagues, and I hope others may also find them useful, 
but if not then we drop them.

>> Usage:
>>    qemu -pause <seconds>
>>    or
>>    export QEMU_PAUSE=<seconds>
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>   qemu-options.hx |  9 +++++++++
>>   softmmu/vl.c    | 15 ++++++++++++++-
>>   2 files changed, 23 insertions(+), 1 deletion(-)
> 
>> @@ -3204,6 +3211,12 @@ void qemu_init(int argc, char **argv, char **envp)
>>               case QEMU_OPTION_gdb:
>>                   add_device_config(DEV_GDB, optarg);
>>                   break;
>> +            case QEMU_OPTION_pause:
>> +                seconds = atoi(optarg);
> 
> atoi() cannot detect overflow.  You should never use it in robust parsing of untrusted input.

OK.
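
A minimal sketch of a stricter parse, assuming standard strtol() and a
hypothetical helper name parse_pause_seconds() that is not part of the patch.
Unlike atoi(), it rejects empty input, trailing garbage, negative values, and
out-of-range values:

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the patch: parse a non-negative number
 * of seconds.  Returns 0 on success and stores the value in *seconds;
 * returns -1 on empty input, trailing garbage, a negative value, or a
 * value that does not fit in an int.
 */
static int parse_pause_seconds(const char *s, unsigned int *seconds)
{
    char *end;
    long val;

    if (s == NULL || *s == '\0') {
        return -1;
    }
    errno = 0;
    val = strtol(s, &end, 10);
    /* ERANGE: overflow; end == s: no digits; *end: trailing characters. */
    if (errno == ERANGE || end == s || *end != '\0') {
        return -1;
    }
    if (val < 0 || val > INT_MAX) {
        return -1;
    }
    *seconds = (unsigned int)val;
    return 0;
}
```

The same helper could serve both the command-line option and the environment
variable, so the two paths would reject bad input identically.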

- Steve






* Re: [PATCH V1 12/32] vl: pause option
  2020-07-30 17:03   ` Alex Bennée
@ 2020-07-30 18:14     ` Steven Sistare
  2020-07-31  9:44       ` Alex Bennée
  2020-09-11 17:59       ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 118+ messages in thread
From: Steven Sistare @ 2020-07-30 18:14 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Markus Armbruster,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

On 7/30/2020 1:03 PM, Alex Bennée wrote:
> 
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Provide the -pause command-line parameter and the QEMU_PAUSE environment
>> variable to briefly pause QEMU in main and allow a developer to attach gdb.
>> Useful when the developer does not invoke QEMU directly, such as when using
>> libvirt.
> 
> How does this differ from -S?

The -S flag runs qemu to the main loop but does not start the guest.  Lots of code
that you may need to debug runs before you get there.

- Steve

>> Usage:
>>   qemu -pause <seconds>
>>   or
>>   export QEMU_PAUSE=<seconds>
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  qemu-options.hx |  9 +++++++++
>>  softmmu/vl.c    | 15 ++++++++++++++-
>>  2 files changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 708583b..8505cf2 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -3668,6 +3668,15 @@ SRST
>>      option is experimental.
>>  ERST
>>  
>> +DEF("pause", HAS_ARG, QEMU_OPTION_pause, \
>> +    "-pause secs    Pause for secs seconds on entry to main.\n", QEMU_ARCH_ALL)
>> +
>> +SRST
>> +``--pause secs``
>> +    Pause for a number of seconds on entry to main.  Useful for attaching
>> +    a debugger after QEMU has been launched by some other entity.
>> +ERST
>> +
> 
> It seems like having an option to race with the debugger is just asking
> for trouble.
> 
>>  DEF("S", 0, QEMU_OPTION_S, \
>>      "-S              freeze CPU at startup (use 'c' to start execution)\n",
>>      QEMU_ARCH_ALL)
>> diff --git a/softmmu/vl.c b/softmmu/vl.c
>> index 8478778..951994f 100644
>> --- a/softmmu/vl.c
>> +++ b/softmmu/vl.c
>> @@ -2844,7 +2844,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
>>  
>>  void qemu_init(int argc, char **argv, char **envp)
>>  {
>> -    int i;
>> +    int i, seconds;
>>      int snapshot, linux_boot;
>>      const char *initrd_filename;
>>      const char *kernel_filename, *kernel_cmdline;
>> @@ -2882,6 +2882,13 @@ void qemu_init(int argc, char **argv, char **envp)
>>      QemuPluginList plugin_list = QTAILQ_HEAD_INITIALIZER(plugin_list);
>>      int mem_prealloc = 0; /* force preallocation of physical target memory */
>>  
>> +    if (getenv("QEMU_PAUSE")) {
>> +        seconds = atoi(getenv("QEMU_PAUSE"));
>> +        printf("Pausing %d seconds for debugger. QEMU PID is %d\n",
>> +               seconds, getpid());
>> +        sleep(seconds);
>> +    }
>> +
>>      os_set_line_buffering();
>>  
>>      error_init(argv[0]);
>> @@ -3204,6 +3211,12 @@ void qemu_init(int argc, char **argv, char **envp)
>>              case QEMU_OPTION_gdb:
>>                  add_device_config(DEV_GDB, optarg);
>>                  break;
>> +            case QEMU_OPTION_pause:
>> +                seconds = atoi(optarg);
>> +                printf("Pausing %d seconds for debugger. QEMU PID is %d\n",
>> +                            seconds, getpid());
>> +                sleep(seconds);
>> +                break;
>>              case QEMU_OPTION_L:
>>                  if (is_help_option(optarg)) {
>>                      list_data_dirs = true;
> 
> 



* Re: [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart
  2020-07-30 16:22   ` Eric Blake
@ 2020-07-30 18:14     ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-07-30 18:14 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

On 7/30/2020 12:22 PM, Eric Blake wrote:
> On 7/30/20 10:14 AM, Steve Sistare wrote:
>> Add the VMS_RESTART variant of vmstate, for use when upgrading qemu in place
>> on the same host without a reboot.  Invoke it using:
>>    cprsave <filename> restart
>>
>> VMS_RESTART supports guest ram mapped by private anonymous memory, versus
>> VMS_REBOOT which requires that guest ram be mapped by persistent shared
>> memory.  Subsequent patches complete its implementation.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
> 
>> +++ b/qapi/migration.json
>> @@ -1639,6 +1639,7 @@
>>   #
>>   # @file: name of checkpoint file
>>   # @mode: 'reboot' : checkpoint can be cprload'ed after a host kexec reboot.
>> +#        'restart': checkpoint can be cprload'ed after restarting qemu.
> 
> This should be a modification to an enum type (the 'CprMode' type I suggested earlier in the series).

Will do - steve



* Re: [PATCH V1 00/32] Live Update
  2020-07-30 16:52 ` [PATCH V1 00/32] Live Update Daniel P. Berrangé
@ 2020-07-30 18:48   ` Steven Sistare
  2020-07-31  8:53     ` Daniel P. Berrangé
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-07-30 18:48 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Markus Armbruster, qemu-devel,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert

On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
>> Improve and extend the qemu functions that save and restore VM state so a
>> guest may be suspended and resumed with minimal pause time.  qemu may be
>> updated to a new version in between.
>>
>> The first set of patches adds the cprsave and cprload commands to save and
>> restore VM state, and allow the host kernel to be updated and rebooted in
>> between.  The VM must create guest RAM in a persistent shared memory file,
>> such as /dev/dax0.0 or persistent /dev/shm PKRAM as proposed in
>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
>>
>> cprsave stops the VCPUs and saves VM device state in a simple file, and
>> thus supports any type of guest image and block device.  The caller must
>> not modify the VM's block devices between cprsave and cprload.
>>
>> cprsave and cprload support guests with vfio devices if the caller first
>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>> The guest drivers suspend methods flush outstanding requests and re-
>> initialize the devices, and thus there is no device state to save and
>> restore.
>>
>>    1 savevm: add vmstate handler iterators
>>    2 savevm: VM handlers mode mask
>>    3 savevm: QMP command for cprsave
>>    4 savevm: HMP Command for cprsave
>>    5 savevm: QMP command for cprload
>>    6 savevm: HMP Command for cprload
>>    7 savevm: QMP command for cprinfo
>>    8 savevm: HMP command for cprinfo
>>    9 savevm: prevent cprsave if memory is volatile
>>   10 kvmclock: restore paused KVM clock
>>   11 cpu: disable ticks when suspended
>>   12 vl: pause option
>>   13 gdbstub: gdb support for suspended state
>>
>> The next patches add a restart method that eliminates the persistent memory
>> constraint, and allows qemu to be updated across the restart, but does not
>> allow host reboot.  Anonymous memory segments used by the guest are
>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>
>>   14 savevm: VMS_RESTART and cprsave restart
>>   15 vl: QEMU_START_FREEZE env var
>>   16 oslib: add qemu_clr_cloexec
>>   17 util: env var helpers
>>   18 osdep: import MADV_DOEXEC
>>   19 memory: ram_block_add cosmetic changes
>>   20 vl: add helper to request re-exec
>>   21 exec, memory: exec(3) to restart
>>   22 char: qio_channel_socket_accept reuse fd
>>   23 char: save/restore chardev socket fds
>>   24 ui: save/restore vnc socket fds
>>   25 char: save/restore chardev pty fds
> 
> Keeping FDs open across re-exec is a nice trick, but how are you dealing
> with the state associated with them, most especially the TLS encryption
> state ? AFAIK, there's no way to serialize/deserialize the TLS state that
> GNUTLS maintains, and the patches don't show any sign of dealing with
> this. IOW it looks like while the FD will be preserved, any TLS session
> running on it will fail.

I had not considered TLS.  If a non-qemu library maintains connection state, then
we won't be able to support it for live update until the library provides interfaces
to serialize the state.

For qemu objects, so far vmstate has been adequate to represent the devices with
descriptors that we preserve.

> I'm going to presume that you're probably just considering the TLS features
> out of scope for your patch series.  It would be useful if you have any
> info about this and other things you've considered out of scope for this
> patch series.

The descriptors covered in these patches are needed for our use case.  I realize
there are others that could perhaps be preserved, but we have not tried them.
Those descriptors are closed on exec as usual, and are reopened after exec. I
expect that we or others will support more over time.
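
For readers unfamiliar with the mechanism: a descriptor survives exec only if
its close-on-exec flag is cleared beforehand.  A minimal sketch of that step
using plain fcntl(), with a hypothetical helper name clear_cloexec(); the
helper in this series is qemu_clr_cloexec (patch 16) and may differ in detail:

```c
#include <fcntl.h>

/*
 * Hypothetical helper: clear FD_CLOEXEC so the descriptor stays open
 * across exec().  Returns 0 on success, -1 on error with errno set.
 */
static int clear_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1) {
        return -1;
    }
    /* Preserve any other descriptor flags; drop only close-on-exec. */
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) == -1 ? -1 : 0;
}
```

Clearing the flag only keeps the descriptor open.  The new process still has
to learn the descriptor number and rebuild whatever object state went with it.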

> I'm not seeing anything in the block layer about preserving open FDs, so
> I presume you're just letting the block layer close and then re-open any
> FDs it has ?  

Correct.

> This would have the side effect that any locks held on the
> FDs are lost, so there's a potential race condition where another process
> could acquire the lock and prevent the re-exec completing. That said this
> is unavoidable, because Linux kernel is completely broken wrt keeping
> fcntl() locks held across a re-exec, always throwing away the locks if
> more than 1 thread is running [1].

Ouch.

- Steve

> 
> Regards,
> Daniel
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1552621
> 



* Re: [PATCH V1 00/32] Live Update
  2020-07-30 17:15 ` Paolo Bonzini
@ 2020-07-30 19:09   ` Steven Sistare
  2020-07-30 21:39     ` Paolo Bonzini
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-07-30 19:09 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin,
	Philippe Mathieu-Daudé,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée

On 7/30/2020 1:15 PM, Paolo Bonzini wrote:
> On 30/07/20 17:14, Steve Sistare wrote:
>> The first set of patches adds the cprsave and cprload commands to save and
>> restore VM state, and allow the host kernel to be updated and rebooted in
>> between.  The VM must create guest RAM in a persistent shared memory file,
>> such as /dev/dax0.0 or persistent /dev/shm PKRAM as proposed in
>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
>>
>> cprsave stops the VCPUs and saves VM device state in a simple file, and
>> thus supports any type of guest image and block device.  The caller must
>> not modify the VM's block devices between cprsave and cprload.
> 
> Stupid question, what does cpr stand for?  If it is checkpoint/restore,

Checkpoint/restart.  An acronym from my HPC days.  I will spell it out.

> please spell it out.  Also, how does the functionality compare to
> xen-save-devices-state and xen-load-devices-state?

qmp_xen_save_devices_state serializes device state to a file which is loaded 
on the target for a live migration.  It performs some of the same actions
as cprsave/cprload but does not support live update-in-place.

>> cprsave and cprload support guests with vfio devices if the caller first
>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>> The guest drivers suspend methods flush outstanding requests and re-
>> initialize the devices, and thus there is no device state to save and
>> restore.
> 
> This probably should be allowed even for regular migration.  Can you
> generalize the code as a separate series?

Maybe.  I think that would be a distinct patch that ignores the vfio migration blocker 
if the state is suspended.  Plus a qemu agent call to do the suspend.  Needs more
thought.

- Steve

> 
> Thanks,
> 
> Paolo
> 



* Re: [PATCH V1 00/32] Live Update
  2020-07-30 17:49 ` Dr. David Alan Gilbert
@ 2020-07-30 19:31   ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-07-30 19:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, qemu-devel, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée

[-- Attachment #1: Type: text/plain, Size: 10222 bytes --]

On 7/30/2020 1:49 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Improve and extend the qemu functions that save and restore VM state so a
>> guest may be suspended and resumed with minimal pause time.  qemu may be
>> updated to a new version in between.
> 
> Nice.
> 
>> The first set of patches adds the cprsave and cprload commands to save and
>> restore VM state, and allow the host kernel to be updated and rebooted in
>> between.  The VM must create guest RAM in a persistent shared memory file,
>> such as /dev/dax0.0 or persistent /dev/shm PKRAM as proposed in
>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
>>
>> cprsave stops the VCPUs and saves VM device state in a simple file, and
>> thus supports any type of guest image and block device.  The caller must
>> not modify the VM's block devices between cprsave and cprload.
> 
> can I ask why you don't just add a migration flag to skip the devices
> you don't want, and then do a migrate to a file?
> (i.e. migrate "exec:cat > afile")
> We already have the 'x-ignore-shared' capability that's used for doing
> RAM snapshots of VMs; primarily I think for being able to start a VM
> from a RAM snapshot as a fast VM start trick.
> (There's also a xen_save_devices that does something similar).
> If you backed the RAM as you say, enabled x-ignore-shared and then did:
> 
>    migrate "exec:cat > afile"
> 
> and restarted the destination with:
> 
>     migrate_incoming "exec:cat afile"
> 
> what is different (except the later stuff about the vfio magic and
> chardevs).
> 
> Dave

Yes, I did consider whether to extend the migration syntax and implementation in
save_vmstate and load_vmstate, versus creating something new.  Those functions
handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
use case, and the cpr functions handle state that is n/a for the migration case.
I judged that a single function handling both would be less readable and
maintainable.  At their core all these routines call qemu_loadvm_state() and
qemu_savevm_state().  The surrounding code is mostly different.


Take a look at 
  savevm.c:save_vmstate()   vs   save_cpr_snapshot() attached
and
  savevm.c:load_vmstate()   vs   load_cpr_snapshot() attached

I attached the complete versions of the cpr functions because they are built up
over multiple patches in this series, thus hard to visualize in patch form.
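
As a rough illustration of the mode-mask idea that lets one save/load core
serve several callers, here is a minimal self-contained sketch.  It is not
the code from the series: the VMS_* values, the Handler struct, and
save_matching() are hypothetical.  Only the calling convention (cprsave
passes a single mode, cprload accepts VMS_REBOOT | VMS_RESTART) is taken
from the attached functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mode bits; the real values in the series may differ. */
enum {
    VMS_MIGRATE = 1 << 0,
    VMS_REBOOT  = 1 << 1,
    VMS_RESTART = 1 << 2,
};

/* Hypothetical handler record: a name plus the modes it takes part in. */
typedef struct Handler {
    const char *name;
    unsigned mode_mask;
} Handler;

/*
 * Visit every handler whose mask intersects 'mode' and return how many
 * were visited.  'visit' may be NULL when only the count is wanted.
 */
static size_t save_matching(const Handler *h, size_t n, unsigned mode,
                            void (*visit)(const Handler *))
{
    size_t count = 0;

    assert(h != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (h[i].mode_mask & mode) {
            if (visit) {
                visit(&h[i]);
            }
            count++;
        }
    }
    return count;
}
```

With a filter like this, the code surrounding the core loop can differ per
command while the handlers themselves declare which modes they support.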

- Steve

> 
>> cprsave and cprload support guests with vfio devices if the caller first
>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>> The guest drivers suspend methods flush outstanding requests and re-
>> initialize the devices, and thus there is no device state to save and
>> restore.
>>
>>    1 savevm: add vmstate handler iterators
>>    2 savevm: VM handlers mode mask
>>    3 savevm: QMP command for cprsave
>>    4 savevm: HMP Command for cprsave
>>    5 savevm: QMP command for cprload
>>    6 savevm: HMP Command for cprload
>>    7 savevm: QMP command for cprinfo
>>    8 savevm: HMP command for cprinfo
>>    9 savevm: prevent cprsave if memory is volatile
>>   10 kvmclock: restore paused KVM clock
>>   11 cpu: disable ticks when suspended
>>   12 vl: pause option
>>   13 gdbstub: gdb support for suspended state
>>
>> The next patches add a restart method that eliminates the persistent memory
>> constraint, and allows qemu to be updated across the restart, but does not
>> allow host reboot.  Anonymous memory segments used by the guest are
>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>
>>   14 savevm: VMS_RESTART and cprsave restart
>>   15 vl: QEMU_START_FREEZE env var
>>   16 oslib: add qemu_clr_cloexec
>>   17 util: env var helpers
>>   18 osdep: import MADV_DOEXEC
>>   19 memory: ram_block_add cosmetic changes
>>   20 vl: add helper to request re-exec
>>   21 exec, memory: exec(3) to restart
>>   22 char: qio_channel_socket_accept reuse fd
>>   23 char: save/restore chardev socket fds
>>   24 ui: save/restore vnc socket fds
>>   25 char: save/restore chardev pty fds
>>   26 monitor: save/restore QMP negotiation status
>>   27 vhost: reset vhost devices upon cprsave
>>   28 char: restore terminal on restart
>>
>> The next patches extend the restart method to save and restore vfio-pci
>> state, eliminating the requirement for a guest agent.  The vfio container,
>> group, and device descriptors are preserved across the qemu re-exec.
>>
>>   29 pci: export pci_update_mappings
>>   30 vfio-pci: save and restore
>>   31 vfio-pci: trace pci config
>>   32 vfio-pci: improved tracing
>>
>> Here is an example of updating qemu from v4.2.0 to v4.2.1 using 
>> "cprsave restart".  The software update is performed while the guest is
>> running to minimize downtime.
>>
>> window 1				| window 2
>> 					|
>> # qemu-system-x86_64 ... 		|
>> QEMU 4.2.0 monitor - type 'help' ...	|
>> (qemu) info status			|
>> VM status: running			|
>> 					| # yum update qemu
>> (qemu) cprsave /tmp/qemu.sav restart	|
>> QEMU 4.2.1 monitor - type 'help' ...	|
>> (qemu) info status			|
>> VM status: paused (prelaunch)		|
>> (qemu) cprload /tmp/qemu.sav		|
>> (qemu) info status			|
>> VM status: running			|
>>
>>
>> Here is an example of updating the host kernel using "cprsave reboot"
>>
>> window 1					| window 2
>> 						|
>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...		|
>> (qemu) info status				|
>> VM status: running				|
>> 						| # yum update kernel-uek
>> (qemu) cprsave /tmp/qemu.sav reboot		|
>> 						|
>> # systemctl kexec				|
>> kexec_core: Starting new kernel			|
>> ...						|
>> 						|
>> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
>> QEMU 4.2.1 monitor - type 'help' ...		|
>> (qemu) info status				|
>> VM status: paused (prelaunch)			|
>> (qemu) cprload /tmp/qemu.sav			|
>> (qemu) info status				|
>> VM status: running				|
>>
>>
>> Mark Kanda (5):
>>   char: qio_channel_socket_accept reuse fd
>>   char: save/restore chardev socket fds
>>   ui: save/restore vnc socket fds
>>   monitor: save/restore QMP negotiation status
>>   vhost: reset vhost devices upon cprsave
>>
>> Steve Sistare (27):
>>   savevm: add vmstate handler iterators
>>   savevm: VM handlers mode mask
>>   savevm: QMP command for cprsave
>>   savevm: HMP Command for cprsave
>>   savevm: QMP command for cprload
>>   savevm: HMP Command for cprload
>>   savevm: QMP command for cprinfo
>>   savevm: HMP command for cprinfo
>>   savevm: prevent cprsave if memory is volatile
>>   kvmclock: restore paused KVM clock
>>   cpu: disable ticks when suspended
>>   vl: pause option
>>   gdbstub: gdb support for suspended state
>>   savevm: VMS_RESTART and cprsave restart
>>   vl: QEMU_START_FREEZE env var
>>   oslib: add qemu_clr_cloexec
>>   util: env var helpers
>>   osdep: import MADV_DOEXEC
>>   memory: ram_block_add cosmetic changes
>>   vl: add helper to request re-exec
>>   exec, memory: exec(3) to restart
>>   char: save/restore chardev pty fds
>>   char: restore terminal on restart
>>   pci: export pci_update_mappings
>>   vfio-pci: save and restore
>>   vfio-pci: trace pci config
>>   vfio-pci: improved tracing
>>
>>  MAINTAINERS                    |   7 ++
>>  accel/kvm/kvm-all.c            |   8 +-
>>  accel/kvm/trace-events         |   3 +-
>>  chardev/char-pty.c             |  38 +++++--
>>  chardev/char-socket.c          |  35 ++++++
>>  chardev/char-stdio.c           |   7 ++
>>  chardev/char.c                 |  16 +++
>>  exec.c                         |  88 +++++++++++++--
>>  gdbstub.c                      |  11 +-
>>  hmp-commands.hx                |  46 ++++++++
>>  hw/i386/kvm/clock.c            |   6 +-
>>  hw/pci/msix.c                  |   1 +
>>  hw/pci/pci.c                   |  17 +--
>>  hw/pci/trace-events            |   5 +-
>>  hw/vfio/common.c               | 115 ++++++++++++++++----
>>  hw/vfio/pci.c                  | 179 ++++++++++++++++++++++++++++++-
>>  hw/vfio/platform.c             |   2 +-
>>  hw/vfio/trace-events           |  11 +-
>>  hw/virtio/vhost.c              |  12 +++
>>  include/chardev/char.h         |   8 ++
>>  include/exec/memory.h          |   4 +
>>  include/hw/pci/pci.h           |   2 +
>>  include/hw/vfio/vfio-common.h  |   4 +-
>>  include/io/channel-socket.h    |   3 +-
>>  include/migration/register.h   |   3 +
>>  include/migration/vmstate.h    |  11 ++
>>  include/monitor/hmp.h          |   3 +
>>  include/qemu/cutils.h          |   1 +
>>  include/qemu/env.h             |  31 ++++++
>>  include/qemu/osdep.h           |   8 ++
>>  include/sysemu/sysemu.h        |  10 ++
>>  io/channel-socket.c            |  12 ++-
>>  io/net-listener.c              |   4 +-
>>  migration/block.c              |   1 +
>>  migration/migration.c          |   4 +-
>>  migration/ram.c                |   1 +
>>  migration/savevm.c             | 237 ++++++++++++++++++++++++++++++++++++-----
>>  migration/savevm.h             |   4 +-
>>  monitor/hmp-cmds.c             |  28 +++++
>>  monitor/qmp-cmds.c             |  16 +++
>>  monitor/qmp.c                  |  42 ++++++++
>>  qapi/migration.json            |  35 ++++++
>>  qapi/pragma.json               |   1 +
>>  qemu-options.hx                |   9 ++
>>  scsi/qemu-pr-helper.c          |   2 +-
>>  softmmu/vl.c                   |  65 ++++++++++-
>>  tests/qtest/tpm-emu.c          |   2 +-
>>  tests/test-char.c              |   2 +-
>>  tests/test-io-channel-socket.c |   4 +-
>>  trace-events                   |   2 +
>>  ui/vnc.c                       | 153 +++++++++++++++++++++-----
>>  util/Makefile.objs             |   2 +-
>>  util/env.c                     | 132 +++++++++++++++++++++++
>>  util/oslib-posix.c             |   9 ++
>>  util/oslib-win32.c             |   4 +
>>  55 files changed, 1331 insertions(+), 135 deletions(-)
>>  create mode 100644 include/qemu/env.h
>>  create mode 100644 util/env.c
>>
>> -- 
>> 1.8.3.1
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

[-- Attachment #2: save_cpr_snapshot.c --]
[-- Type: text/plain, Size: 1947 bytes --]

void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
{
    int ret = 0;
    QEMUFile *f;
    VMStateMode op;

    if (!strcmp(mode, "reboot")) {
        op = VMS_REBOOT;
    } else if (!strcmp(mode, "restart")) {
        op = VMS_RESTART;
    } else {
        error_setg(errp, "cprsave: bad mode %s", mode);
        return;
    }

    if (op == VMS_REBOOT && qemu_ram_volatile(errp)) {
        return;
    }

    if (op == VMS_RESTART && QEMU_MADV_DOEXEC == QEMU_MADV_INVALID) {
        error_setg(errp, "kernel does not support MADV_DOEXEC.");
        return;
    }

    if (op == VMS_RESTART && xen_enabled()) {
        error_setg(errp, "xen does not support cprsave restart");
        return;
    }

    f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, errp);
    if (!f) {
        return;
    }

    ret = global_state_store();
    if (ret) {
        error_setg(errp, "Error saving global state");
        qemu_fclose(f);
        return;
    }

    /* Update timers_state before saving.  Suspend did not do so. */
    if (runstate_check(RUN_STATE_SUSPENDED)) {
        cpu_disable_ticks();
    }

    vm_stop(RUN_STATE_SAVE_VM);

    ret = qemu_savevm_state(f, op, errp);
    if ((ret < 0) && !*errp) {
        error_setg(errp, "qemu_savevm_state failed");
    }
    qemu_fclose(f);

    if (op == VMS_REBOOT) {
        no_shutdown = 0;
        qemu_system_shutdown_request();
    } else if (op == VMS_RESTART) {
        if (qemu_preserve_ram(errp)) {
            return;
        }
        save_chardev_fds();
        save_vnc_fds();
        save_named_fd("mntfd");          /* was received from qemu-cpr */
        save_named_fd("ctlfd");          /* was received from qemu-cpr */
        walkenv(FD_PREFIX, preserve_fd, 0);
        reset_vhost_devices();
        save_qmp_negotiation_status();
        qemu_term_exit();
        qemu_system_exec_request();
        putenv((char *)"QEMU_START_FREEZE=");
    }
}
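
The restart branch above depends on descriptors staying open across exec(3)
and on the new image learning their numbers.  A minimal sketch of that
mechanism follows, assuming the numbers are passed through environment
variables.  The sketch_* helpers and SKETCH_FD_PREFIX are hypothetical
stand-ins, not the qemu_clr_cloexec/setenv_fd/getenv_fd implementations
from the series.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical prefix; the series uses its own FD_PREFIX and helpers. */
#define SKETCH_FD_PREFIX "QEMU_FD_"

/* Clear FD_CLOEXEC so the descriptor stays open across exec(3). */
static int sketch_clr_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Record 'fd' under 'name' in the environment for the next image. */
static int sketch_setenv_fd(const char *name, int fd)
{
    char key[128], val[16];

    assert(name != NULL && fd >= 0);
    snprintf(key, sizeof(key), SKETCH_FD_PREFIX "%s", name);
    snprintf(val, sizeof(val), "%d", fd);
    if (sketch_clr_cloexec(fd) < 0) {
        return -1;
    }
    return setenv(key, val, 1);
}

/* Look up a descriptor saved by sketch_setenv_fd(); -1 if absent. */
static int sketch_getenv_fd(const char *name)
{
    char key[128];
    const char *val;
    char *end;
    long fd;

    snprintf(key, sizeof(key), SKETCH_FD_PREFIX "%s", name);
    val = getenv(key);
    if (!val || !*val) {
        return -1;
    }
    fd = strtol(val, &end, 10);
    return (*end == '\0' && fd >= 0) ? (int)fd : -1;
}
```

This only carries the descriptor number across the exec.  Any user-space
state layered on the descriptor still has to be saved and restored
separately.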


[-- Attachment #3: load_cpr_snapshot.c --]
[-- Type: text/plain, Size: 758 bytes --]

void load_cpr_snapshot(const char *file, Error **errp)
{
    QEMUFile *f;
    int ret;
    RunState state;

    if (runstate_is_running()) {
        error_setg(errp, "cprload called for a running VM");
        return;
    }

    f = qf_file_open(file, O_RDONLY, 0, errp);
    if (!f) {
        return;
    }

    ret = qemu_loadvm_state(f, VMS_REBOOT | VMS_RESTART);
    qemu_fclose(f);
    if (ret < 0) {
        error_setg(errp, "Error %d while loading VM state", ret);
        return;
    }

    state = global_state_get_runstate();
    if (state == RUN_STATE_RUNNING) {
        vm_start();
    } else {
        runstate_set(state);
        if (runstate_check(RUN_STATE_SUSPENDED)) {
            start_on_wake = 1;
        }
    }

    load_vnc_fds();
}


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH V1 00/32] Live Update
  2020-07-30 19:09   ` Steven Sistare
@ 2020-07-30 21:39     ` Paolo Bonzini
  2020-07-31 19:22       ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Paolo Bonzini @ 2020-07-30 21:39 UTC (permalink / raw)
  To: Steven Sistare, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin,
	Philippe Mathieu-Daudé,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée

On 30/07/20 21:09, Steven Sistare wrote:
>> please spell it out.  Also, how does the functionality compare to
>> xen-save-devices-state and xen-load-devices-state?
>
> qmp_xen_save_devices_state serializes device state to a file which is loaded 
> on the target for a live migration.  It performs some of the same actions
> as cprsave/cprload but does not support live update-in-place.

So it is a subset, can code be reused across both?  Also, live migration
across versions is supported, so can you describe the special
update-in-place support more precisely?  I am confused about the use
cases, which require (or try) to keep file descriptors across re-exec,
which are for kexec, and so on.

>>> cprsave and cprload support guests with vfio devices if the caller first
>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>>> The guest drivers suspend methods flush outstanding requests and re-
>>> initialize the devices, and thus there is no device state to save and
>>> restore.
>> This probably should be allowed even for regular migration.  Can you
>> generalize the code as a separate series?
>
> Maybe.  I think that would be a distinct patch that ignores the vfio migration blocker 
> if the state is suspended.  Plus a qemu agent call to do the suspend.  Needs more
> thought.

The agent already supports suspend, so that should be relatively easy.
Only the code to add/remove the VFIO migration blocker from a VM state
change notifier, or something like that, would be needed.

Paolo



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH V1 00/32] Live Update
  2020-07-30 18:48   ` Steven Sistare
@ 2020-07-31  8:53     ` Daniel P. Berrangé
  2020-07-31 15:27       ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Daniel P. Berrangé @ 2020-07-31  8:53 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Michael S. Tsirkin, Alex Bennée, Juan Quintela, qemu-devel,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Dr. David Alan Gilbert

On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote:
> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
> > On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
> >> Improve and extend the qemu functions that save and restore VM state so a
> >> guest may be suspended and resumed with minimal pause time.  qemu may be
> >> updated to a new version in between.
> >>
> >> The first set of patches adds the cprsave and cprload commands to save and
> >> restore VM state, and allow the host kernel to be updated and rebooted in
> >> between.  The VM must create guest RAM in a persistent shared memory file,
> >> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
> >> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> >>
> >> cprsave stops the VCPUs and saves VM device state in a simple file, and
> >> thus supports any type of guest image and block device.  The caller must
> >> not modify the VM's block devices between cprsave and cprload.
> >>
> >> cprsave and cprload support guests with vfio devices if the caller first
> >> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> >> The guest drivers suspend methods flush outstanding requests and re-
> >> initialize the devices, and thus there is no device state to save and
> >> restore.
> >>
> >>    1 savevm: add vmstate handler iterators
> >>    2 savevm: VM handlers mode mask
> >>    3 savevm: QMP command for cprsave
> >>    4 savevm: HMP Command for cprsave
> >>    5 savevm: QMP command for cprload
> >>    6 savevm: HMP Command for cprload
> >>    7 savevm: QMP command for cprinfo
> >>    8 savevm: HMP command for cprinfo
> >>    9 savevm: prevent cprsave if memory is volatile
> >>   10 kvmclock: restore paused KVM clock
> >>   11 cpu: disable ticks when suspended
> >>   12 vl: pause option
> >>   13 gdbstub: gdb support for suspended state
> >>
> >> The next patches add a restart method that eliminates the persistent memory
> >> constraint, and allows qemu to be updated across the restart, but does not
> >> allow host reboot.  Anonymous memory segments used by the guest are
> >> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> >> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> >> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> >>
> >>   14 savevm: VMS_RESTART and cprsave restart
> >>   15 vl: QEMU_START_FREEZE env var
> >>   16 oslib: add qemu_clr_cloexec
> >>   17 util: env var helpers
> >>   18 osdep: import MADV_DOEXEC
> >>   19 memory: ram_block_add cosmetic changes
> >>   20 vl: add helper to request re-exec
> >>   21 exec, memory: exec(3) to restart
> >>   22 char: qio_channel_socket_accept reuse fd
> >>   23 char: save/restore chardev socket fds
> >>   24 ui: save/restore vnc socket fds
> >>   25 char: save/restore chardev pty fds
> > 
> > Keeping FDs open across re-exec is a nice trick, but how are you dealing
> > with the state associated with them, most especially the TLS encryption
> > state ? AFAIK, there's no way to serialize/deserialize the TLS state that
> > GNUTLS maintains, and the patches don't show any sign of dealing with
> > this. IOW it looks like while the FD will be preserved, any TLS session
> > running on it will fail.
> 
> I had not considered TLS.  If a non-qemu library maintains connection state, then
> we won't be able to support it for live update until the library provides interfaces
> to serialize the state.
> 
> For qemu objects, so far vmstate has been adequate to represent the devices with
> descriptors that we preserve.

My main concern about this series is that there is an implicit assumption
that QEMU is *not* configured with certain features that are not handled.
If QEMU is using one of the unsupported features, I don't see anything in
the series which attempts to prevent the actions.

IOW, users can have an arbitrary QEMU config, attempt to use these new features,
the commands may well succeed, but the user is silently left with a broken QEMU.
Such silent failure modes are really undesirable as they'll lead to a never
ending stream of hard to diagnose bug reports for QEMU maintainers.

TLS is one example of this: the live upgrade will "succeed", but the TLS
connections will be totally non-functional.
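
To make the point concrete, a pre-check that refuses the operation up front
could look roughly like the sketch below.  The feature flags, the table of
unsupported features, and sketch_cprsave_precheck() are all hypothetical;
nothing like this exists in the series as posted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical feature flags describing the running configuration. */
enum {
    FEAT_VNC_TLS     = 1 << 0,
    FEAT_CHARDEV_TLS = 1 << 1,
    FEAT_PLAIN_VNC   = 1 << 2,
};

/* Features the live-update path cannot carry across a re-exec. */
static const struct {
    unsigned flag;
    const char *reason;
} sketch_unsupported[] = {
    { FEAT_VNC_TLS,     "VNC TLS session state cannot be serialized" },
    { FEAT_CHARDEV_TLS, "chardev TLS session state cannot be serialized" },
};

/*
 * Return 0 if 'active' contains no unsupported feature; otherwise copy
 * the first blocking reason into 'err' and return -1.
 */
static int sketch_cprsave_precheck(unsigned active, char *err, size_t len)
{
    size_t n = sizeof(sketch_unsupported) / sizeof(sketch_unsupported[0]);

    assert(err != NULL && len > 0);
    for (size_t i = 0; i < n; i++) {
        if (active & sketch_unsupported[i].flag) {
            snprintf(err, len, "cprsave blocked: %s",
                     sketch_unsupported[i].reason);
            return -1;
        }
    }
    err[0] = '\0';
    return 0;
}
```

The command would then fail with a clear message instead of leaving the
user with a QEMU that looks healthy but has dead connections.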

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH V1 24/32] ui: save/restore vnc socket fds
  2020-07-30 15:14 ` [PATCH V1 24/32] ui: save/restore vnc " Steve Sistare
@ 2020-07-31  9:06   ` Daniel P. Berrangé
  2020-07-31 16:51     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Daniel P. Berrangé @ 2020-07-31  9:06 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Markus Armbruster, qemu-devel,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert

On Thu, Jul 30, 2020 at 08:14:28AM -0700, Steve Sistare wrote:
> From: Mark Kanda <mark.kanda@oracle.com>
> 
> Iterate through the VNC displays and save/restore the socket fds.

This patch doesn't appear to do anything around the client state, so I
can't see how this will work in general.  E.g. QEMU is halfway through
receiving a message from the client, and we trigger re-exec.

The new QEMU is going to start up considering the VNC client to be in an
idle state, and will then read the second half of the message off the
client socket. Everything will go rapidly downhill from there.
Or the reverse: the server has sent a message, but this outbound
message is still in the buffer and has only been partially sent on the
wire. We re-exec and now we've lost the unsent part of the buffer.
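
As a sketch of the missing piece, the code would at least need to know
whether a client has a message in flight in either direction before it
re-execs.  The structure and function names below are hypothetical and do
not match the VncState fields in ui/vnc.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-client buffer bookkeeping, for illustration only. */
typedef struct SketchConn {
    size_t in_have;      /* bytes of the current inbound message received */
    size_t in_expect;    /* total bytes that message needs; 0 when idle */
    size_t out_pending;  /* bytes queued for the client but not yet sent */
} SketchConn;

/* A connection is quiescent when no message is in flight either way. */
static bool sketch_conn_quiescent(const SketchConn *c)
{
    assert(c != NULL);
    return c->in_have == 0 && c->in_expect == 0 && c->out_pending == 0;
}

/* No user-space buffered bytes are lost only if every client is quiescent. */
static bool sketch_can_reexec(const SketchConn *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!sketch_conn_quiescent(&c[i])) {
            return false;
        }
    }
    return true;
}
```

A check like this is a necessary condition only.  Negotiated protocol state
such as encodings and pixel format would still need to be carried over.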


> 
> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/sysemu/sysemu.h |   2 +
>  migration/savevm.c      |   3 +
>  ui/vnc.c                | 153 +++++++++++++++++++++++++++++++++++++++---------
>  3 files changed, 130 insertions(+), 28 deletions(-)
> 
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index fa1a5c3..3e7bfee 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -28,6 +28,8 @@ void qemu_remove_machine_init_done_notifier(Notifier *notify);
>  void save_cpr_snapshot(const char *file, const char *mode, Error **errp);
>  void load_cpr_snapshot(const char *file, Error **errp);
>  void save_chardev_fds(void);
> +void save_vnc_fds(void);
> +void load_vnc_fds(void);
>  
>  extern int autostart;
>  
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 81f38c4..35fafb7 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2768,6 +2768,7 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
>              return;
>          }
>          save_chardev_fds();
> +        save_vnc_fds();
>          walkenv(FD_PREFIX, preserve_fd, 0);
>          qemu_system_exec_request();
>          putenv((char *)"QEMU_START_FREEZE=");
> @@ -3015,6 +3016,8 @@ void load_cpr_snapshot(const char *file, Error **errp)
>              start_on_wake = 1;
>          }
>      }
> +
> +    load_vnc_fds();
>  }
>  
>  int load_snapshot(const char *name, Error **errp)
> diff --git a/ui/vnc.c b/ui/vnc.c
> index f006aa1..947ddf5 100644
> --- a/ui/vnc.c
> +++ b/ui/vnc.c
> @@ -50,6 +50,7 @@
>  #include "qom/object_interfaces.h"
>  #include "qemu/cutils.h"
>  #include "io/dns-resolver.h"
> +#include "sysemu/sysemu.h"
>  
>  #define VNC_REFRESH_INTERVAL_BASE GUI_REFRESH_INTERVAL_DEFAULT
>  #define VNC_REFRESH_INTERVAL_INC  50
> @@ -2214,28 +2215,34 @@ static void set_pixel_format(VncState *vs, int bits_per_pixel,
>      graphic_hw_update(vs->vd->dcl.con);
>  }
>  
> -static void pixel_format_message (VncState *vs) {
> +/*
> + * reuse - true if we are using an existing (already initialized)
> + * connection to a vnc client
> + */
> +static void pixel_format_message(VncState *vs, bool reuse)
> +{
>      char pad[3] = { 0, 0, 0 };
>  
>      vs->client_pf = qemu_default_pixelformat(32);
>  
> -    vnc_write_u8(vs, vs->client_pf.bits_per_pixel); /* bits-per-pixel */
> -    vnc_write_u8(vs, vs->client_pf.depth); /* depth */
> +    if (!reuse) {
> +        vnc_write_u8(vs, vs->client_pf.bits_per_pixel); /* bits-per-pixel */
> +        vnc_write_u8(vs, vs->client_pf.depth); /* depth */
>  
>  #ifdef HOST_WORDS_BIGENDIAN
> -    vnc_write_u8(vs, 1);             /* big-endian-flag */
> +        vnc_write_u8(vs, 1);             /* big-endian-flag */
>  #else
> -    vnc_write_u8(vs, 0);             /* big-endian-flag */
> +        vnc_write_u8(vs, 0);             /* big-endian-flag */
>  #endif
> -    vnc_write_u8(vs, 1);             /* true-color-flag */
> -    vnc_write_u16(vs, vs->client_pf.rmax);     /* red-max */
> -    vnc_write_u16(vs, vs->client_pf.gmax);     /* green-max */
> -    vnc_write_u16(vs, vs->client_pf.bmax);     /* blue-max */
> -    vnc_write_u8(vs, vs->client_pf.rshift);    /* red-shift */
> -    vnc_write_u8(vs, vs->client_pf.gshift);    /* green-shift */
> -    vnc_write_u8(vs, vs->client_pf.bshift);    /* blue-shift */
> -    vnc_write(vs, pad, 3);           /* padding */
> -
> +        vnc_write_u8(vs, 1);             /* true-color-flag */
> +        vnc_write_u16(vs, vs->client_pf.rmax);     /* red-max */
> +        vnc_write_u16(vs, vs->client_pf.gmax);     /* green-max */
> +        vnc_write_u16(vs, vs->client_pf.bmax);     /* blue-max */
> +        vnc_write_u8(vs, vs->client_pf.rshift);    /* red-shift */
> +        vnc_write_u8(vs, vs->client_pf.gshift);    /* green-shift */
> +        vnc_write_u8(vs, vs->client_pf.bshift);    /* blue-shift */
> +        vnc_write(vs, pad, 3);           /* padding */
> +    }
>      vnc_hextile_set_pixel_conversion(vs, 0);
>      vs->write_pixels = vnc_write_pixels_copy;
>  }
> @@ -2252,7 +2259,7 @@ static void vnc_colordepth(VncState *vs)
>                                 pixman_image_get_width(vs->vd->server),
>                                 pixman_image_get_height(vs->vd->server),
>                                 VNC_ENCODING_WMVi);
> -        pixel_format_message(vs);
> +        pixel_format_message(vs, false);
>          vnc_unlock_output(vs);
>          vnc_flush(vs);
>      } else {
> @@ -2420,7 +2427,8 @@ static int protocol_client_msg(VncState *vs, uint8_t *data, size_t len)
>      return 0;
>  }
>  
> -static int protocol_client_init(VncState *vs, uint8_t *data, size_t len)
> +static int protocol_client_init_base(VncState *vs, uint8_t *data, size_t len,
> +                                     bool reuse)
>  {
>      char buf[1024];
>      VncShareMode mode;
> @@ -2495,10 +2503,11 @@ static int protocol_client_init(VncState *vs, uint8_t *data, size_t len)
>             pixman_image_get_height(vs->vd->server) >= 0);
>      vs->client_width = pixman_image_get_width(vs->vd->server);
>      vs->client_height = pixman_image_get_height(vs->vd->server);
> -    vnc_write_u16(vs, vs->client_width);
> -    vnc_write_u16(vs, vs->client_height);
> -
> -    pixel_format_message(vs);
> +    if (!reuse) {
> +        vnc_write_u16(vs, vs->client_width);
> +        vnc_write_u16(vs, vs->client_height);
> +    }
> +    pixel_format_message(vs, reuse);
>  
>      if (qemu_name) {
>          size = snprintf(buf, sizeof(buf), "QEMU (%s)", qemu_name);
> @@ -2509,9 +2518,11 @@ static int protocol_client_init(VncState *vs, uint8_t *data, size_t len)
>          size = snprintf(buf, sizeof(buf), "QEMU");
>      }
>  
> -    vnc_write_u32(vs, size);
> -    vnc_write(vs, buf, size);
> -    vnc_flush(vs);
> +    if (!reuse) {
> +        vnc_write_u32(vs, size);
> +        vnc_write(vs, buf, size);
> +        vnc_flush(vs);
> +    }
>  
>      vnc_client_cache_auth(vs);
>      vnc_qmp_event(vs, QAPI_EVENT_VNC_INITIALIZED);
> @@ -2521,6 +2532,11 @@ static int protocol_client_init(VncState *vs, uint8_t *data, size_t len)
>      return 0;
>  }
>  
> +static int protocol_client_init(VncState *vs, uint8_t *data, size_t len)
> +{
> +    return protocol_client_init_base(vs, data, len, false);
> +}
> +
>  void start_client_init(VncState *vs)
>  {
>      vnc_read_when(vs, protocol_client_init, 1);
> @@ -3012,8 +3028,12 @@ static void vnc_refresh(DisplayChangeListener *dcl)
>      }
>  }
>  
> +/*
> + * reuse - true if we are using an existing (already initialized)
> + * connection to a vnc client
> + */
>  static void vnc_connect(VncDisplay *vd, QIOChannelSocket *sioc,
> -                        bool skipauth, bool websocket)
> +                        bool skipauth, bool websocket, bool reuse)
>  {
>      VncState *vs = g_new0(VncState, 1);
>      bool first_client = QTAILQ_EMPTY(&vd->clients);
> @@ -3109,10 +3129,15 @@ static void vnc_connect(VncDisplay *vd, QIOChannelSocket *sioc,
>  
>      graphic_hw_update(vd->dcl.con);
>  
> -    if (!vs->websocket) {
> +    if ((!vs->websocket) && !reuse) {
>          vnc_start_protocol(vs);
>      }
>  
> +    if (reuse) {
> +        uint8_t data[1] = {0};
> +        (void) protocol_client_init_base(vs, data, sizeof(data), true);
> +    }
> +
>      if (vd->num_connecting > vd->connections_limit) {
>          QTAILQ_FOREACH(vs, &vd->clients, next) {
>              if (vs->share_mode == VNC_SHARE_MODE_CONNECTING) {
> @@ -3143,7 +3168,7 @@ static void vnc_listen_io(QIONetListener *listener,
>      qio_channel_set_name(QIO_CHANNEL(cioc),
>                           isWebsock ? "vnc-ws-server" : "vnc-server");
>      qio_channel_set_delay(QIO_CHANNEL(cioc), false);
> -    vnc_connect(vd, cioc, false, isWebsock);
> +    vnc_connect(vd, cioc, false, isWebsock, false);
>  }
>  
>  static const DisplayChangeListenerOps dcl_ops = {
> @@ -3733,7 +3758,7 @@ static int vnc_display_connect(VncDisplay *vd,
>      if (qio_channel_socket_connect_sync(sioc, saddr[0], errp) < 0) {
>          return -1;
>      }
> -    vnc_connect(vd, sioc, false, false);
> +    vnc_connect(vd, sioc, false, false, false);
>      object_unref(OBJECT(sioc));
>      return 0;
>  }
> @@ -4057,7 +4082,7 @@ void vnc_display_add_client(const char *id, int csock, bool skipauth)
>      sioc = qio_channel_socket_new_fd(csock, NULL);
>      if (sioc) {
>          qio_channel_set_name(QIO_CHANNEL(sioc), "vnc-server");
> -        vnc_connect(vd, sioc, skipauth, false);
> +        vnc_connect(vd, sioc, skipauth, false, false);
>          object_unref(OBJECT(sioc));
>      }
>  }
> @@ -4117,3 +4142,75 @@ static void vnc_register_config(void)
>      qemu_add_opts(&qemu_vnc_opts);
>  }
>  opts_init(vnc_register_config);
> +
> +void save_vnc_fds(void)
> +{
> +    VncDisplay *vd;
> +    VncState *vs;
> +    int disp_num = 0;
> +    char name[40];
> +
> +    QTAILQ_FOREACH(vd, &vnc_displays, next) {
> +        QTAILQ_FOREACH(vs, &vd->clients, next) {
> +            if (vs->sioc) {
> +                snprintf(name, sizeof(name), "%s_%d", vs->sioc->parent.name,
> +                         disp_num);

'disp_num' is only updated by the outer loop. So if we have multiple
iterations of the inner loop, we'll have multiple FDs with the same
name that try to be stored. Presumably we'll lose all but the last.
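
One way to avoid the collision would be to put a per-client index in the
name as well.  A minimal sketch, where sketch_vnc_fd_name() and the
"%s_%d_%d" layout are hypothetical and not part of the patch:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build a per-client key from the channel name, the display index and
 * the client index within that display.  Returns 0 on success, -1 if
 * the result did not fit in 'buf'.
 */
static int sketch_vnc_fd_name(char *buf, size_t len, const char *chan,
                              int disp_num, int client_num)
{
    int n;

    assert(buf != NULL && chan != NULL);
    n = snprintf(buf, len, "%s_%d_%d", chan, disp_num, client_num);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

The load side would then have to iterate over client indices per listener
rather than looking up a single name.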

> +                setenv_fd(name, vs->sioc->fd);
> +                break;
> +            }
> +        }
> +        disp_num++;
> +    }
> +}
> +
> +static void set_vnc_fd(char *name, QIOChannelSocket *cioc, VncDisplay *vd,
> +                       bool isWebsock)
> +{
> +    VncState *vs;
> +    QIOChannelSocket *sioc;
> +
> +    int fd = getenv_fd(name);
> +    if (fd != -1) {
> +        sioc = qio_channel_socket_accept(cioc, fd, NULL);
> +        if (sioc) {
> +            unsetenv_fd(name);
> +            qio_channel_set_name(QIO_CHANNEL(sioc),
> +                                 isWebsock ? "vnc-ws-server" : "vnc-server");
> +
> +            qio_channel_set_delay(QIO_CHANNEL(sioc), false);
> +            vnc_connect(vd, sioc, false, isWebsock, true);
> +            object_unref(OBJECT(sioc));
> +
> +            /* force update on all clients */
> +            QTAILQ_FOREACH(vs, &vd->clients, next) {
> +                vs->update = VNC_STATE_UPDATE_FORCE;
> +            }
> +        } else {
> +            error_printf("Could not restore vnc channel %s; "
> +                     "client must reconnect.\n", name);
> +        }
> +    }
> +}
> +
> +void load_vnc_fds(void)
> +{
> +    VncDisplay *vd;
> +    QIOChannelSocket *cioc = NULL;
> +    int disp_num = 0;
> +    char name[40];
> +
> +    QTAILQ_FOREACH(vd, &vnc_displays, next) {
> +        if (vd->listener) {
> +            cioc = *vd->listener->sioc;
> +            snprintf(name, sizeof(name), "vnc-server_%d", disp_num);
> +            set_vnc_fd(name, cioc, vd, false);
> +        }
> +
> +        if (vd->wslistener) {
> +            cioc = *vd->wslistener->sioc;
> +            snprintf(name, sizeof(name), "vnc-ws-server_%d", disp_num);
> +            set_vnc_fd(name, cioc, vd, true);
> +        }
> +        disp_num++;

This only attempts to restore a single client for each listener,
despite trying (but failing) to save multiple clients.


In any case, as per my comment at the top of the patch, this whole
patch just looks broken as it is not doing anything with client
state.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V1 12/32] vl: pause option
  2020-07-30 18:14     ` Steven Sistare
@ 2020-07-31  9:44       ` Alex Bennée
  2020-09-11 17:59       ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 118+ messages in thread
From: Alex Bennée @ 2020-07-31  9:44 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Markus Armbruster,
	Juan Quintela, Dr. David Alan Gilbert, qemu-devel,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé


Steven Sistare <steven.sistare@oracle.com> writes:

> On 7/30/2020 1:03 PM, Alex Bennée wrote:
>> 
>> Steve Sistare <steven.sistare@oracle.com> writes:
>> 
>>> Provide the -pause command-line parameter and the QEMU_PAUSE environment
>>> variable to briefly pause QEMU in main and allow a developer to attach gdb.
>>> Useful when the developer does not invoke QEMU directly, such as when using
>>> libvirt.
>> 
>> How does this differ from -S?
>
> The -S flag runs qemu to the main loop but does not start the guest.  Lots of code
> that you may need to debug runs before you get there.

Right - so this is for attaching a debugger to QEMU itself, not using
the gdbstub? Why isn't this a problem the calling entity can solve by
the way it invoked QEMU?

We have similar sort of solutions for debugging our testcases:

  https://wiki.qemu.org/Features/QTest#Using_debugging_tools_under_the_test_harness

I still think:

>>> +DEF("pause", HAS_ARG, QEMU_OPTION_pause, \
>>> +    "-pause secs    Pause for secs seconds on entry to main.\n", QEMU_ARCH_ALL)
>>> +
>>> +SRST
>>> +``--pause secs``
>>> +    Pause for a number of seconds on entry to main.  Useful for attaching
>>> +    a debugger after QEMU has been launched by some other entity.
>>> +ERST
>>> +
>> 
>> It seems like having an option to race with the debugger is just asking
>> for trouble.

this makes the option problematic.

-- 
Alex Bennée



* Re: [PATCH V1 12/32] vl: pause option
  2020-07-30 18:11     ` Steven Sistare
@ 2020-07-31 10:07       ` Daniel P. Berrangé
  2020-07-31 15:18         ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Daniel P. Berrangé @ 2020-07-31 10:07 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Michael S. Tsirkin, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Bennée, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On Thu, Jul 30, 2020 at 02:11:19PM -0400, Steven Sistare wrote:
> On 7/30/2020 12:20 PM, Eric Blake wrote:
> > On 7/30/20 10:14 AM, Steve Sistare wrote:
> >> Provide the -pause command-line parameter and the QEMU_PAUSE environment
> >> variable to briefly pause QEMU in main and allow a developer to attach gdb.
> >> Useful when the developer does not invoke QEMU directly, such as when using
> >> libvirt.
> > 
> > How would you set this option with libvirt?
> 
> Add -pause in the qemu args in the xml.
>  
> > It feels like you are trying to reinvent something that is already well-documented:
> > 
> > https://www.berrange.com/posts/2011/10/12/debugging-early-startup-of-kvm-with-gdb-when-launched-by-libvirtd/
> 
> Too many steps to reach BINGO for my taste.  Easier is better.  Also, in our shop we start qemu 
> in other ways, such as via services.


A "sleep" is a pretty crude & unreliable way to get into debugging
though. It is racy for a start, but also QEMU has a bunch of stuff
that runs via ELF constructors before main() even starts.

So I feel like the thing that starts QEMU is better placed to provide
a way in for debugging.

e.g. the service launcher can send SIGSTOP to the child process immediately
before the execve(qemu) call.

Now the user can attach with the debugger, allow execution to continue,
and has the ability to debug *everything* right from the ELF constructors
onwards into main() and all that follows.
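
A minimal sketch of the launcher-side approach, assuming a launcher that
uses plain fork/exec.  The helper name launch_stopped() is hypothetical
and is not part of QEMU or libvirt; here the child stops itself with
raise(SIGSTOP) right before the exec, which has the same effect as the
launcher sending the signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical launcher helper (not part of QEMU or libvirt): fork a
 * child that stops itself just before exec'ing the target binary.
 * Returns the child's pid, or -1 if fork() failed.
 */
pid_t launch_stopped(const char *path, char *const argv[])
{
    assert(path != NULL && argv != NULL);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: stay stopped until something sends SIGCONT. */
        raise(SIGSTOP);
        execv(path, argv);
        _exit(127);     /* reached only if execv() failed */
    }
    return pid;
}
```

While the child is stopped, a debugger can be attached to the returned
pid.  Depending on the debugger and platform, the process may also need
an explicit SIGCONT before it runs on into the exec.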

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V1 12/32] vl: pause option
  2020-07-31 10:07       ` Daniel P. Berrangé
@ 2020-07-31 15:18         ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-07-31 15:18 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michael S. Tsirkin, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Alex Bennée, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On 7/31/2020 6:07 AM, Daniel P. Berrangé wrote:
> On Thu, Jul 30, 2020 at 02:11:19PM -0400, Steven Sistare wrote:
>> On 7/30/2020 12:20 PM, Eric Blake wrote:
>>> On 7/30/20 10:14 AM, Steve Sistare wrote:
>>>> Provide the -pause command-line parameter and the QEMU_PAUSE environment
>>>> variable to briefly pause QEMU in main and allow a developer to attach gdb.
>>>> Useful when the developer does not invoke QEMU directly, such as when using
>>>> libvirt.
>>>
>>> How would you set this option with libvirt?
>>
>> Add -pause in the qemu args in the xml.
>>  
>>> It feels like you are trying to reinvent something that is already well-documented:
>>>
>>> https://www.berrange.com/posts/2011/10/12/debugging-early-startup-of-kvm-with-gdb-when-launched-by-libvirtd/
>>
>> Too many steps to reach BINGO for my taste.  Easier is better.  Also, in our shop we start qemu 
>> in other ways, such as via services.
> 
> A "sleep" is a pretty crude & unreliable way to get into debugging
> though. It is racy for a start, but also QEMU has a bunch of stuff
> that runs via ELF constructors before main() even starts.
> 
> So I feel like the thing that starts QEMU is better placed to provide
> a way in for debugging.
> 
> eg the service launcher can send SIGSTOP to the child process immediately
> before the execve(qemu) call.
> 
> Now user can attach with the debugger, allow execution to continue,
> and has ability to debug *everything* right from the ELF constructors
> onwards into main() and all that follows.
> 
> Regards,
> Daniel

That is a nice solution for the launchers we can modify.
We could use your idea in place of the sleep in main:
    kill(getpid(), SIGSTOP);

Not quite as good as being able to debug the ELF constructors, but still helpful.

- Steve



* Re: [PATCH V1 00/32] Live Update
  2020-07-31  8:53     ` Daniel P. Berrangé
@ 2020-07-31 15:27       ` Steven Sistare
  2020-07-31 15:52         ` Daniel P. Berrangé
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-07-31 15:27 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michael S. Tsirkin, Alex Bennée, Juan Quintela, qemu-devel,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Dr. David Alan Gilbert

On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote:
> On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote:
>> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
>>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
>>>> Improve and extend the qemu functions that save and restore VM state so a
>>>> guest may be suspended and resumed with minimal pause time.  qemu may be
>>>> updated to a new version in between.
>>>>
>>>> The first set of patches adds the cprsave and cprload commands to save and
>>>> restore VM state, and allow the host kernel to be updated and rebooted in
>>>> between.  The VM must create guest RAM in a persistent shared memory file,
>>>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
>>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
>>>>
>>>> cprsave stops the VCPUs and saves VM device state in a simple file, and
>>>> thus supports any type of guest image and block device.  The caller must
>>>> not modify the VM's block devices between cprsave and cprload.
>>>>
>>>> cprsave and cprload support guests with vfio devices if the caller first
>>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>>>> The guest drivers suspend methods flush outstanding requests and re-
>>>> initialize the devices, and thus there is no device state to save and
>>>> restore.
>>>>
>>>>    1 savevm: add vmstate handler iterators
>>>>    2 savevm: VM handlers mode mask
>>>>    3 savevm: QMP command for cprsave
>>>>    4 savevm: HMP Command for cprsave
>>>>    5 savevm: QMP command for cprload
>>>>    6 savevm: HMP Command for cprload
>>>>    7 savevm: QMP command for cprinfo
>>>>    8 savevm: HMP command for cprinfo
>>>>    9 savevm: prevent cprsave if memory is volatile
>>>>   10 kvmclock: restore paused KVM clock
>>>>   11 cpu: disable ticks when suspended
>>>>   12 vl: pause option
>>>>   13 gdbstub: gdb support for suspended state
>>>>
>>>> The next patches add a restart method that eliminates the persistent memory
>>>> constraint, and allows qemu to be updated across the restart, but does not
>>>> allow host reboot.  Anonymous memory segments used by the guest are
>>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
>>>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>
>>>>   14 savevm: VMS_RESTART and cprsave restart
>>>>   15 vl: QEMU_START_FREEZE env var
>>>>   16 oslib: add qemu_clr_cloexec
>>>>   17 util: env var helpers
>>>>   18 osdep: import MADV_DOEXEC
>>>>   19 memory: ram_block_add cosmetic changes
>>>>   20 vl: add helper to request re-exec
>>>>   21 exec, memory: exec(3) to restart
>>>>   22 char: qio_channel_socket_accept reuse fd
>>>>   23 char: save/restore chardev socket fds
>>>>   24 ui: save/restore vnc socket fds
>>>>   25 char: save/restore chardev pty fds
>>>
>>> Keeping FDs open across re-exec is a nice trick, but how are you dealing
>>> with the state associated with them, most especially the TLS encryption
>>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that
>>> GNUTLS maintains, and the patches don't show any sign of dealing with
>>> this. IOW it looks like while the FD will be preserved, any TLS session
>>> running on it will fail.
>>
>> I had not considered TLS.  If a non-qemu library maintains connection state, then
>> we won't be able to support it for live update until the library provides interfaces
>> to serialize the state.
>>
>> For qemu objects, so far vmstate has been adequate to represent the devices with
>> descriptors that we preserve.
> 
> My main concern about this series is that there is an implicit assumption
> that QEMU is *not* configured with certain features that are not handled
> If QEMU is using one of the unsupported features, I don't see anything in
> the series which attempts to prevent the actions.
> 
> IOW, users can have an arbitrary QEMU config, attempt to use these new features,
> the commands may well succeed, but the user is silently left with a broken QEMU.
> Such silent failure modes are really undesirable as they'll lead to a never
> ending stream of hard to diagnose bug reports for QEMU maintainers.
> 
> TLS is one example of this, the live upgrade  will "succeed", but the TLS
> connections will be totally non-functional.

I agree with all your points and would like to do better in this area.  Other than hunting for 
every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions?
Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the
process, and block the cpr if we find an unrecognized fd.
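
A rough sketch of that fd check, assuming Linux and a readable
/proc/self/fd.  The function name find_unrecognized_fd() and the known[]
list are hypothetical; nothing like this exists in the series yet, and
how the list of supported fds would be gathered is left open.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical check, Linux-specific: walk /proc/self/fd and return the
 * first open descriptor that is not in known[].  Returns -1 if every
 * open descriptor is recognised, or -2 if /proc/self/fd cannot be read.
 */
int find_unrecognized_fd(const int *known, size_t n_known)
{
    assert(known != NULL || n_known == 0);

    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL) {
        return -2;
    }

    int scan_fd = dirfd(dir);   /* the descriptor used for this scan */
    int result = -1;
    struct dirent *de;

    while (result == -1 && (de = readdir(dir)) != NULL) {
        char *end;
        long fd = strtol(de->d_name, &end, 10);

        if (end == de->d_name || *end != '\0' || fd == scan_fd) {
            continue;           /* skip ".", ".." and our own DIR fd */
        }

        int recognised = 0;
        for (size_t i = 0; i < n_known; i++) {
            if (known[i] == fd) {
                recognised = 1;
                break;
            }
        }
        if (!recognised) {
            result = (int)fd;
        }
    }

    closedir(dir);
    return result;
}
```

A caller would refuse the cpr whenever the result is not -1.  Note this
only detects descriptors nobody claimed; it says nothing about state
held outside the kernel, such as the TLS session case above.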

- Steve



* Re: [PATCH V1 00/32] Live Update
  2020-07-31 15:27       ` Steven Sistare
@ 2020-07-31 15:52         ` Daniel P. Berrangé
  2020-07-31 17:20           ` Steven Sistare
  2020-08-11 19:08           ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 118+ messages in thread
From: Daniel P. Berrangé @ 2020-07-31 15:52 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Michael S. Tsirkin, Alex Bennée, Juan Quintela, qemu-devel,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Dr. David Alan Gilbert

On Fri, Jul 31, 2020 at 11:27:45AM -0400, Steven Sistare wrote:
> On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote:
> > On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote:
> >> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
> >>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
> >>>> Improve and extend the qemu functions that save and restore VM state so a
> >>>> guest may be suspended and resumed with minimal pause time.  qemu may be
> >>>> updated to a new version in between.
> >>>>
> >>>> The first set of patches adds the cprsave and cprload commands to save and
> >>>> restore VM state, and allow the host kernel to be updated and rebooted in
> >>>> between.  The VM must create guest RAM in a persistent shared memory file,
> >>>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
> >>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> >>>>
> >>>> cprsave stops the VCPUs and saves VM device state in a simple file, and
> >>>> thus supports any type of guest image and block device.  The caller must
> >>>> not modify the VM's block devices between cprsave and cprload.
> >>>>
> >>>> cprsave and cprload support guests with vfio devices if the caller first
> >>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> >>>> The guest drivers suspend methods flush outstanding requests and re-
> >>>> initialize the devices, and thus there is no device state to save and
> >>>> restore.
> >>>>
> >>>>    1 savevm: add vmstate handler iterators
> >>>>    2 savevm: VM handlers mode mask
> >>>>    3 savevm: QMP command for cprsave
> >>>>    4 savevm: HMP Command for cprsave
> >>>>    5 savevm: QMP command for cprload
> >>>>    6 savevm: HMP Command for cprload
> >>>>    7 savevm: QMP command for cprinfo
> >>>>    8 savevm: HMP command for cprinfo
> >>>>    9 savevm: prevent cprsave if memory is volatile
> >>>>   10 kvmclock: restore paused KVM clock
> >>>>   11 cpu: disable ticks when suspended
> >>>>   12 vl: pause option
> >>>>   13 gdbstub: gdb support for suspended state
> >>>>
> >>>> The next patches add a restart method that eliminates the persistent memory
> >>>> constraint, and allows qemu to be updated across the restart, but does not
> >>>> allow host reboot.  Anonymous memory segments used by the guest are
> >>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> >>>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> >>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> >>>>
> >>>>   14 savevm: VMS_RESTART and cprsave restart
> >>>>   15 vl: QEMU_START_FREEZE env var
> >>>>   16 oslib: add qemu_clr_cloexec
> >>>>   17 util: env var helpers
> >>>>   18 osdep: import MADV_DOEXEC
> >>>>   19 memory: ram_block_add cosmetic changes
> >>>>   20 vl: add helper to request re-exec
> >>>>   21 exec, memory: exec(3) to restart
> >>>>   22 char: qio_channel_socket_accept reuse fd
> >>>>   23 char: save/restore chardev socket fds
> >>>>   24 ui: save/restore vnc socket fds
> >>>>   25 char: save/restore chardev pty fds
> >>>
> >>> Keeping FDs open across re-exec is a nice trick, but how are you dealing
> >>> with the state associated with them, most especially the TLS encryption
> >>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that
> >>> GNUTLS maintains, and the patches don't show any sign of dealing with
> >>> this. IOW it looks like while the FD will be preserved, any TLS session
> >>> running on it will fail.
> >>
> >> I had not considered TLS.  If a non-qemu library maintains connection state, then
> >> we won't be able to support it for live update until the library provides interfaces
> >> to serialize the state.
> >>
> >> For qemu objects, so far vmstate has been adequate to represent the devices with
> >> descriptors that we preserve.
> > 
> > My main concern about this series is that there is an implicit assumption
> > that QEMU is *not* configured with certain features that are not handled
> > If QEMU is using one of the unsupported features, I don't see anything in
> > the series which attempts to prevent the actions.
> > 
> > IOW, users can have an arbitrary QEMU config, attempt to use these new features,
> > the commands may well succeed, but the user is silently left with a broken QEMU.
> > Such silent failure modes are really undesirable as they'll lead to a never
> > ending stream of hard to diagnose bug reports for QEMU maintainers.
> > 
> > TLS is one example of this, the live upgrade  will "succeed", but the TLS
> > connections will be totally non-functional.
> 
> I agree with all your points and would like to do better in this area.  Other than hunting for 
> every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions?
> Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the
> process, and block the cpr if we find an unrecognized fd.

There's no magic easy answer to this problem. Conceptually it is similar to
the problem of reliably migrating guest device state, but in this case we're
primarily concerned about the backends instead.

For migration we've got standardized interfaces that devices must implement
in order to correctly support migration serialization. There is also support
for devices to register migration "blockers" which prevent any use of the
migration feature when the device is present.

We lack this kind of concept for the backend, and that's what I think needs
to be tackled in a more thorough way.  There are quite a lot of backends,
but they're grouped into a reasonably small number of sets (UIs, chardevs,
blockdevs, net devs, etc). We need some standard interface that we can
plumb into all the backends, along with providing backends the ability to
block the re-exec. We should plumb the generic infrastructure into each of
the different types of backend, and make the default behaviour be to reject
the re-exec. Then we need to carefully consider specific backend impls
and allow the re-exec only in the very precise cases we can demonstrate
to be safe.

IOW, have a presumption that re-exec will *not* be permitted. Over time
we can make it work for an ever expanding set of use cases. 
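
To illustrate the default-deny idea, here is a minimal sketch.  All of
the names (Backend, backend_register, backend_allow_reexec,
reexec_first_blocker) are hypothetical and do not correspond to any
existing QEMU interface; the point is only that a backend blocks the
re-exec until it has been explicitly vetted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_BACKENDS 32

/* Hypothetical registry entry; not an existing QEMU type. */
typedef struct {
    const char *name;
    bool reexec_safe;   /* false until the backend is explicitly vetted */
} Backend;

static Backend backends[MAX_BACKENDS];
static size_t n_backends;

/* Register a backend.  It blocks re-exec by default. */
bool backend_register(const char *name)
{
    assert(name != NULL);
    if (n_backends == MAX_BACKENDS) {
        return false;
    }
    backends[n_backends].name = name;
    backends[n_backends].reexec_safe = false;
    n_backends++;
    return true;
}

/* Mark a registered backend as demonstrated safe for re-exec. */
bool backend_allow_reexec(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < n_backends; i++) {
        if (strcmp(backends[i].name, name) == 0) {
            backends[i].reexec_safe = true;
            return true;
        }
    }
    return false;
}

/* Name of the first backend that blocks re-exec, or NULL if none does. */
const char *reexec_first_blocker(void)
{
    for (size_t i = 0; i < n_backends; i++) {
        if (!backends[i].reexec_safe) {
            return backends[i].name;
        }
    }
    return NULL;
}
```

Returning the blocker's name, rather than a bare yes/no, lets the
command fail with a message that says which backend prevented the
re-exec.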


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V1 24/32] ui: save/restore vnc socket fds
  2020-07-31  9:06   ` Daniel P. Berrangé
@ 2020-07-31 16:51     ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-07-31 16:51 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Juan Quintela, Philippe Mathieu-Daudé,
	Michael S. Tsirkin, Markus Armbruster, qemu-devel,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Alex Bennée, Dr. David Alan Gilbert

On 7/31/2020 5:06 AM, Daniel P. Berrangé wrote:
> On Thu, Jul 30, 2020 at 08:14:28AM -0700, Steve Sistare wrote:
>> From: Mark Kanda <mark.kanda@oracle.com>
>>
>> Iterate through the VNC displays and save/restore the socket fds.
> 
> This patch doesn't appear to do anything around the client state, so I
> can't see how this will work in general.  eg QEMU is 1/2 way through
> receiving a message from the client, and we trigger re-exec.
> 
> The new QEMU is going to startup considering the VNC client is in an
> idle state, and will then read the 2nd 1/2 of the message off the
> client socket. Everything will go rapidly downhill from there.
> Or the reverse, the server has sent a message, but this outbound
> message is still in the buffer and only been partially sent on the
> wire. We re'exec and now we've lost the unsent part of the buffer.

Yes.  For partial messages in qemu object buffers, we need to add a draining phase
between exec-requested and exec, and complete all partial messages.

For kernel socket buffers, we should be OK.  If we are accurately preserving vnc
server state (which is the intent), then we can correctly respond to any client
requests that were sent to us pre-exec but read into qemu post-exec.

However, there is another icky issue with vnc.  It only works reliably with raw
encoding.  Compressed streams accumulate state on the client side which we cannot
match on the server when we create a new zlib stream after exec.  The vnc protocol
defines a per-stream reset flag in the compression control word, which sounds like it
should reset zlib state, but it does not for tigervnc.  I have not tried other clients.

vnc is one of the trickier patches in this series.  It may be wisest to close the connection
and require the client to reconnect.  The virtual framebuffer is preserved, so the same content
will be shown after reconnect.

- Steve




* Re: [PATCH V1 00/32] Live Update
  2020-07-31 15:52         ` Daniel P. Berrangé
@ 2020-07-31 17:20           ` Steven Sistare
  2020-08-11 19:08           ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-07-31 17:20 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michael S. Tsirkin, Alex Bennée, Juan Quintela, qemu-devel,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Dr. David Alan Gilbert

On 7/31/2020 11:52 AM, Daniel P. Berrangé wrote:
> On Fri, Jul 31, 2020 at 11:27:45AM -0400, Steven Sistare wrote:
>> On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote:
>>> On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote:
>>>> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
>>>>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
>>>>>> Improve and extend the qemu functions that save and restore VM state so a
>>>>>> guest may be suspended and resumed with minimal pause time.  qemu may be
>>>>>> updated to a new version in between.
>>>>>>
>>>>>> The first set of patches adds the cprsave and cprload commands to save and
>>>>>> restore VM state, and allow the host kernel to be updated and rebooted in
>>>>>> between.  The VM must create guest RAM in a persistent shared memory file,
>>>>>> such as /dev/dax0.0 or persistant /dev/shm PKRAM as proposed in 
>>>>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>
>>>>>> cprsave stops the VCPUs and saves VM device state in a simple file, and
>>>>>> thus supports any type of guest image and block device.  The caller must
>>>>>> not modify the VM's block devices between cprsave and cprload.
>>>>>>
>>>>>> cprsave and cprload support guests with vfio devices if the caller first
>>>>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>>>>>> The guest drivers suspend methods flush outstanding requests and re-
>>>>>> initialize the devices, and thus there is no device state to save and
>>>>>> restore.
>>>>>>
>>>>>>    1 savevm: add vmstate handler iterators
>>>>>>    2 savevm: VM handlers mode mask
>>>>>>    3 savevm: QMP command for cprsave
>>>>>>    4 savevm: HMP Command for cprsave
>>>>>>    5 savevm: QMP command for cprload
>>>>>>    6 savevm: HMP Command for cprload
>>>>>>    7 savevm: QMP command for cprinfo
>>>>>>    8 savevm: HMP command for cprinfo
>>>>>>    9 savevm: prevent cprsave if memory is volatile
>>>>>>   10 kvmclock: restore paused KVM clock
>>>>>>   11 cpu: disable ticks when suspended
>>>>>>   12 vl: pause option
>>>>>>   13 gdbstub: gdb support for suspended state
>>>>>>
>>>>>> The next patches add a restart method that eliminates the persistent memory
>>>>>> constraint, and allows qemu to be updated across the restart, but does not
>>>>>> allow host reboot.  Anonymous memory segments used by the guest are
>>>>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
>>>>>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>
>>>>>>   14 savevm: VMS_RESTART and cprsave restart
>>>>>>   15 vl: QEMU_START_FREEZE env var
>>>>>>   16 oslib: add qemu_clr_cloexec
>>>>>>   17 util: env var helpers
>>>>>>   18 osdep: import MADV_DOEXEC
>>>>>>   19 memory: ram_block_add cosmetic changes
>>>>>>   20 vl: add helper to request re-exec
>>>>>>   21 exec, memory: exec(3) to restart
>>>>>>   22 char: qio_channel_socket_accept reuse fd
>>>>>>   23 char: save/restore chardev socket fds
>>>>>>   24 ui: save/restore vnc socket fds
>>>>>>   25 char: save/restore chardev pty fds
>>>>>
>>>>> Keeping FDs open across re-exec is a nice trick, but how are you dealing
>>>>> with the state associated with them, most especially the TLS encryption
>>>>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that
>>>>> GNUTLS maintains, and the patches don't show any sign of dealing with
>>>>> this. IOW it looks like while the FD will be preserved, any TLS session
>>>>> running on it will fail.
>>>>
>>>> I had not considered TLS.  If a non-qemu library maintains connection state, then
>>>> we won't be able to support it for live update until the library provides interfaces
>>>> to serialize the state.
>>>>
>>>> For qemu objects, so far vmstate has been adequate to represent the devices with
>>>> descriptors that we preserve.
>>>
>>> My main concern about this series is that there is an implicit assumption
>>> that QEMU is *not* configured with certain features that are not handled
>>> If QEMU is using one of the unsupported features, I don't see anything in
>>> the series which attempts to prevent the actions.
>>>
>>> IOW, users can have an arbitrary QEMU config, attempt to use these new features,
>>> the commands may well succeed, but the user is silently left with a broken QEMU.
>>> Such silent failure modes are really undesirable as they'll lead to a never
>>> ending stream of hard to diagnose bug reports for QEMU maintainers.
>>>
>>> TLS is one example of this, the live upgrade  will "succeed", but the TLS
>>> connections will be totally non-functional.
>>
>> I agree with all your points and would like to do better in this area.  Other than hunting for 
>> every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions?
>> Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the
>> process, and block the cpr if we find an unrecognized fd.
> 
> There's no magic easy answer to this problem. Conceptually it is similar to
> the problem of reliably migrating guest device state, but in this case we're
> primarily concerned about the backends instead.
> 
> For migration we've got standardized interfaces that devices must implement
> in order to correctly support migration serialization. There is also support
> for devices to register migration "blockers" which prevent any use of the
> migration feature when the device is present.
> 
> We lack this kind of concept for the backend, and that's what I think needs
> to be tackled in a more thorough way.  There are quite alot of backends,
> but they're grouped into a reasonable small number of sets (UIs, chardevs,
> blockdevs, net devs, etc). We need some standard interface that we can
> plumb into all the backends, along with providing backends the ability to
> block the re-exec. If we plumb the generic infrastructure into each of the
> different types of backend, and make the default behaviour be to reject
> the re-exec. Then we need to carefull consider specific  backend impls
> and allow the re-exec only in the very precise cases we can demonstrate
> to be safe.
> 
> IOW, have a presumption that re-exec will *not* be permitted. Over time
> we can make it work for an ever expanding set of use cases. 

Actually, we could use the vmstate mode_mask field added in patch 2, and only allow the restart
mode for vmstate objects that have been vetted.  Currently an uninitialized mask (value 0)
enables the object for all modes, but we could change that.
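
As a sketch of the two rules, with hypothetical mode names and helper
functions (the actual mode_mask encoding is defined in patch 2 and may
differ):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mode bits, for illustration only. */
enum {
    MODE_SNAPSHOT = 1 << 0,
    MODE_MIGRATE  = 1 << 1,
    MODE_RESTART  = 1 << 2,
};

/* Rule as described above: an uninitialized mask (0) enables all modes. */
bool mode_enabled_permissive(unsigned mask, unsigned mode)
{
    assert(mode != 0);
    return mask == 0 || (mask & mode) != 0;
}

/* Possible stricter rule: an object must opt in to a mode explicitly. */
bool mode_enabled_strict(unsigned mask, unsigned mode)
{
    assert(mode != 0);
    return (mask & mode) != 0;
}
```

Under the strict rule an object that never set its mask is rejected for
restart, which matches the presumption that re-exec is not permitted
until a case has been vetted.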

- Steve



* Re: [PATCH V1 00/32] Live Update
  2020-07-30 21:39     ` Paolo Bonzini
@ 2020-07-31 19:22       ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-07-31 19:22 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin,
	Philippe Mathieu-Daudé,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Alex Bennée

On 7/30/2020 5:39 PM, Paolo Bonzini wrote:
> On 30/07/20 21:09, Steven Sistare wrote:
>>> please spell it out.  Also, how does the functionality compare to
>>> xen-save-devices-state and xen-load-devices-state?
>>
>> qmp_xen_save_devices_state serializes device state to a file which is loaded 
>> on the target for a live migration.  It performs some of the same actions
>> as cprsave/cprload but does not support live update-in-place.
> 
> So it is a subset, can code be reused across both?  

They use common subroutines, but their bodies check different conditions, so I
don't think merging would be an improvement.  We do provide a new helper 
qf_file_open() which could replace a handful of lines in both qmp_xen_save_devices_state 
and qmp_xen_load_devices_state.

> Also, live migration
> across versions is supported, so can you describe the special
> update-in-place support more precisely?  I am confused about the use
> cases, which require (or try) to keep file descriptors across re-exec,
> which are for kexec, and so on.

Sure. The first use case allows you to kexec reboot the host and update host
software and/or qemu.  It does not preserve descriptors, and guest ram must be
backed by persistent shared memory.  Guest pause time depends on host reboot
time, which can be seconds to tens of seconds.

The second case allows you to update qemu in place, but not update the host.
Guest ram can be in shared or anonymous memory.  We call madvise(MADV_DOEXEC)
to tell the kernel to preserve anon memory across the exec.  Open descriptors
are preserved.  Addresses and lengths of saved memory segments are saved in
the environment, and the values of descriptors are saved.  When new qemu
restarts, it finds those values in the environment and uses them when the
various objects are created.  Memory is not realloc'd, it is already present,
and the address and lengths are saved in the ram objects.  Guest pause time
is in the 100 to 200 msec range.  It is less resource intensive than live
migration, and is appropriate if your only goal is to update qemu, as opposed
to evacuating a host.
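
To make the descriptor part concrete, here is a simplified sketch of
passing an fd value through the environment across exec.  The helper
names save_fd_in_env() and load_fd_from_env() are made up for this
example and are not the helpers from the series; the sketch only assumes
that close-on-exec must be cleared for the descriptor to survive.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: clear close-on-exec on fd so it stays open
 * across exec, and record its number in the environment under name.
 * Returns 0 on success, -1 on failure.
 */
int save_fd_in_env(const char *name, int fd)
{
    assert(name != NULL);

    char value[16];
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1 || fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) == -1) {
        return -1;
    }
    snprintf(value, sizeof(value), "%d", fd);
    return setenv(name, value, 1);
}

/*
 * Hypothetical helper: return the fd number recorded under name, or -1
 * if the variable is missing or does not hold a valid number.
 */
int load_fd_from_env(const char *name)
{
    assert(name != NULL);

    const char *value = getenv(name);
    char *end;
    long fd;

    if (value == NULL) {
        return -1;
    }
    fd = strtol(value, &end, 10);
    if (end == value || *end != '\0' || fd < 0) {
        return -1;
    }
    return (int)fd;
}
```

Old qemu would call the save helper before exec, and new qemu would call
the load helper when it creates the corresponding object, reusing the fd
instead of opening a new one.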

>>>> cprsave and cprload support guests with vfio devices if the caller first
>>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
>>>> The guest drivers' suspend methods flush outstanding requests and
>>>> re-initialize the devices, and thus there is no device state to save and
>>>> restore.
>>> This probably should be allowed even for regular migration.  Can you
>>> generalize the code as a separate series?
>>
>> Maybe.  I think that would be a distinct patch that ignores the vfio migration blocker 
>> if the state is suspended.  Plus a qemu agent call to do the suspend.  Needs more
>> thought.
> 
> The agent already supports suspend, so that should be relatively easy.
> Only the code to add/remove the VFIO migration blocker from a VM state
> change notifier, or something like that, would be needed.

Yes, I have experimented with the guest's suspend method.

- Steve



* Re: [PATCH V1 00/32] Live Update
  2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
                   ` (34 preceding siblings ...)
  2020-07-30 17:49 ` Dr. David Alan Gilbert
@ 2020-08-04 18:18 ` Steven Sistare
  35 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-08-04 18:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé

Hi folks, any questions or comments on the vfio and pci changes in 
patch 30?  Or on the means of preserving anonymous memory and re-exec'ing 
in patches 14 - 21?

- Steve

On 7/30/2020 11:14 AM, Steve Sistare wrote:
> Improve and extend the qemu functions that save and restore VM state so a
> guest may be suspended and resumed with minimal pause time.  qemu may be
> updated to a new version in between.
> 
> The first set of patches adds the cprsave and cprload commands to save and
> restore VM state, and allow the host kernel to be updated and rebooted in
> between.  The VM must create guest RAM in a persistent shared memory file,
> such as /dev/dax0.0 or persistent /dev/shm PKRAM as proposed in
> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> 
> cprsave stops the VCPUs and saves VM device state in a simple file, and
> thus supports any type of guest image and block device.  The caller must
> not modify the VM's block devices between cprsave and cprload.
> 
> cprsave and cprload support guests with vfio devices if the caller first
> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> The guest drivers' suspend methods flush outstanding requests and
> re-initialize the devices, and thus there is no device state to save and
> restore.
> 
>    1 savevm: add vmstate handler iterators
>    2 savevm: VM handlers mode mask
>    3 savevm: QMP command for cprsave
>    4 savevm: HMP Command for cprsave
>    5 savevm: QMP command for cprload
>    6 savevm: HMP Command for cprload
>    7 savevm: QMP command for cprinfo
>    8 savevm: HMP command for cprinfo
>    9 savevm: prevent cprsave if memory is volatile
>   10 kvmclock: restore paused KVM clock
>   11 cpu: disable ticks when suspended
>   12 vl: pause option
>   13 gdbstub: gdb support for suspended state
> 
> The next patches add a restart method that eliminates the persistent memory
> constraint, and allows qemu to be updated across the restart, but does not
> allow host reboot.  Anonymous memory segments used by the guest are
> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> 
>   14 savevm: VMS_RESTART and cprsave restart
>   15 vl: QEMU_START_FREEZE env var
>   16 oslib: add qemu_clr_cloexec
>   17 util: env var helpers
>   18 osdep: import MADV_DOEXEC
>   19 memory: ram_block_add cosmetic changes
>   20 vl: add helper to request re-exec
>   21 exec, memory: exec(3) to restart
>   22 char: qio_channel_socket_accept reuse fd
>   23 char: save/restore chardev socket fds
>   24 ui: save/restore vnc socket fds
>   25 char: save/restore chardev pty fds
>   26 monitor: save/restore QMP negotiation status
>   27 vhost: reset vhost devices upon cprsave
>   28 char: restore terminal on restart
> 
> The next patches extend the restart method to save and restore vfio-pci
> state, eliminating the requirement for a guest agent.  The vfio container,
> group, and device descriptors are preserved across the qemu re-exec.
> 
>   29 pci: export pci_update_mappings
>   30 vfio-pci: save and restore
>   31 vfio-pci: trace pci config
>   32 vfio-pci: improved tracing
> 
> Here is an example of updating qemu from v4.2.0 to v4.2.1 using
> "cprsave restart".  The software update is performed while the guest is
> running to minimize downtime.
> 
> window 1				| window 2
> 					|
> # qemu-system-x86_64 ... 		|
> QEMU 4.2.0 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: running			|
> 					| # yum update qemu
> (qemu) cprsave /tmp/qemu.sav restart	|
> QEMU 4.2.1 monitor - type 'help' ...	|
> (qemu) info status			|
> VM status: paused (prelaunch)		|
> (qemu) cprload /tmp/qemu.sav		|
> (qemu) info status			|
> VM status: running			|
> 
> 
> Here is an example of updating the host kernel using "cprsave reboot".
> 
> window 1					| window 2
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: running				|
> 						| # yum update kernel-uek
> (qemu) cprsave /tmp/qemu.sav reboot		|
> 						|
> # systemctl kexec				|
> kexec_core: Starting new kernel			|
> ...						|
> 						|
> # qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
> QEMU 4.2.1 monitor - type 'help' ...		|
> (qemu) info status				|
> VM status: paused (prelaunch)			|
> (qemu) cprload /tmp/qemu.sav			|
> (qemu) info status				|
> VM status: running				|
> 
> 
> Mark Kanda (5):
>   char: qio_channel_socket_accept reuse fd
>   char: save/restore chardev socket fds
>   ui: save/restore vnc socket fds
>   monitor: save/restore QMP negotiation status
>   vhost: reset vhost devices upon cprsave
> 
> Steve Sistare (27):
>   savevm: add vmstate handler iterators
>   savevm: VM handlers mode mask
>   savevm: QMP command for cprsave
>   savevm: HMP Command for cprsave
>   savevm: QMP command for cprload
>   savevm: HMP Command for cprload
>   savevm: QMP command for cprinfo
>   savevm: HMP command for cprinfo
>   savevm: prevent cprsave if memory is volatile
>   kvmclock: restore paused KVM clock
>   cpu: disable ticks when suspended
>   vl: pause option
>   gdbstub: gdb support for suspended state
>   savevm: VMS_RESTART and cprsave restart
>   vl: QEMU_START_FREEZE env var
>   oslib: add qemu_clr_cloexec
>   util: env var helpers
>   osdep: import MADV_DOEXEC
>   memory: ram_block_add cosmetic changes
>   vl: add helper to request re-exec
>   exec, memory: exec(3) to restart
>   char: save/restore chardev pty fds
>   char: restore terminal on restart
>   pci: export pci_update_mappings
>   vfio-pci: save and restore
>   vfio-pci: trace pci config
>   vfio-pci: improved tracing
> 
>  MAINTAINERS                    |   7 ++
>  accel/kvm/kvm-all.c            |   8 +-
>  accel/kvm/trace-events         |   3 +-
>  chardev/char-pty.c             |  38 +++++--
>  chardev/char-socket.c          |  35 ++++++
>  chardev/char-stdio.c           |   7 ++
>  chardev/char.c                 |  16 +++
>  exec.c                         |  88 +++++++++++++--
>  gdbstub.c                      |  11 +-
>  hmp-commands.hx                |  46 ++++++++
>  hw/i386/kvm/clock.c            |   6 +-
>  hw/pci/msix.c                  |   1 +
>  hw/pci/pci.c                   |  17 +--
>  hw/pci/trace-events            |   5 +-
>  hw/vfio/common.c               | 115 ++++++++++++++++----
>  hw/vfio/pci.c                  | 179 ++++++++++++++++++++++++++++++-
>  hw/vfio/platform.c             |   2 +-
>  hw/vfio/trace-events           |  11 +-
>  hw/virtio/vhost.c              |  12 +++
>  include/chardev/char.h         |   8 ++
>  include/exec/memory.h          |   4 +
>  include/hw/pci/pci.h           |   2 +
>  include/hw/vfio/vfio-common.h  |   4 +-
>  include/io/channel-socket.h    |   3 +-
>  include/migration/register.h   |   3 +
>  include/migration/vmstate.h    |  11 ++
>  include/monitor/hmp.h          |   3 +
>  include/qemu/cutils.h          |   1 +
>  include/qemu/env.h             |  31 ++++++
>  include/qemu/osdep.h           |   8 ++
>  include/sysemu/sysemu.h        |  10 ++
>  io/channel-socket.c            |  12 ++-
>  io/net-listener.c              |   4 +-
>  migration/block.c              |   1 +
>  migration/migration.c          |   4 +-
>  migration/ram.c                |   1 +
>  migration/savevm.c             | 237 ++++++++++++++++++++++++++++++++++++-----
>  migration/savevm.h             |   4 +-
>  monitor/hmp-cmds.c             |  28 +++++
>  monitor/qmp-cmds.c             |  16 +++
>  monitor/qmp.c                  |  42 ++++++++
>  qapi/migration.json            |  35 ++++++
>  qapi/pragma.json               |   1 +
>  qemu-options.hx                |   9 ++
>  scsi/qemu-pr-helper.c          |   2 +-
>  softmmu/vl.c                   |  65 ++++++++++-
>  tests/qtest/tpm-emu.c          |   2 +-
>  tests/test-char.c              |   2 +-
>  tests/test-io-channel-socket.c |   4 +-
>  trace-events                   |   2 +
>  ui/vnc.c                       | 153 +++++++++++++++++++++-----
>  util/Makefile.objs             |   2 +-
>  util/env.c                     | 132 +++++++++++++++++++++++
>  util/oslib-posix.c             |   9 ++
>  util/oslib-win32.c             |   4 +
>  55 files changed, 1331 insertions(+), 135 deletions(-)
>  create mode 100644 include/qemu/env.h
>  create mode 100644 util/env.c
> 



* Re: [PATCH V1 30/32] vfio-pci: save and restore
  2020-07-30 15:14 ` [PATCH V1 30/32] vfio-pci: save and restore Steve Sistare
@ 2020-08-06 10:22   ` Jason Zeng
  2020-08-07 20:38     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Jason Zeng @ 2020-08-06 10:22 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, Dr. David Alan Gilbert,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Philippe Mathieu-Daudé,
	Alex Bennée

Hi Steve,

On Thu, Jul 30, 2020 at 08:14:34AM -0700, Steve Sistare wrote:
> @@ -3182,6 +3207,51 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +static int vfio_pci_post_load(void *opaque, int version_id)
> +{
> +    int vector;
> +    MSIMessage msg;
> +    Error *err = 0;
> +    VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    if (msix_enabled(pdev)) {
> +        vfio_msix_enable(vdev);
> +        pdev->msix_function_masked = false;
> +
> +        for (vector = 0; vector < vdev->pdev.msix_entries_nr; vector++) {
> +            if (!msix_is_masked(pdev, vector)) {
> +                msg = msix_get_message(pdev, vector);
> +                vfio_msix_vector_use(pdev, vector, msg);
> +            }
> +        }

It looks to me like the MSIX re-init here may lose device IRQs and affect
device hardware state?

The re-init will cause the kernel vfio driver to connect the device
MSIX vectors to new eventfds and KVM instance. But before that, device
IRQs will be routed to the previous eventfd. It looks like these IRQs will
be lost.

And the re-init will make the device go through the procedure of
disabling MSIX, enabling INTX, and re-enabling MSIX and vectors.
So if the device is active, will its hardware state be affected?


Thanks,
Jason

> +
> +    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
> +        vfio_intx_enable(vdev, &err);
> +        if (err) {
> +            error_report_err(err);
> +        }
> +    }
> +
> +    vdev->vbasedev.group->container->reused = false;
> +    vdev->pdev.reused = false;
> +
> +    return 0;
> +}
> +
> +static const VMStateDescription vfio_pci_vmstate = {
> +    .name = "vfio-pci",
> +    .unmigratable = 1,
> +    .mode_mask = VMS_RESTART,
> +    .version_id = 0,
> +    .minimum_version_id = 0,
> +    .post_load = vfio_pci_post_load,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_MSIX(pdev, VFIOPCIDevice),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -3189,6 +3259,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>  
>      dc->reset = vfio_pci_reset;
>      device_class_set_props(dc, vfio_pci_dev_properties);
> +    dc->vmsd = &vfio_pci_vmstate;
>      dc->desc = "VFIO-based PCI device assignment";
>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>      pdc->realize = vfio_realize;
> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
> index ac2cefc..e6e1a5d 100644
> --- a/hw/vfio/platform.c
> +++ b/hw/vfio/platform.c
> @@ -592,7 +592,7 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
>              return -EBUSY;
>          }
>      }
> -    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
> +    ret = vfio_get_device(group, vbasedev->name, vbasedev, 0, errp);
>      if (ret) {
>          vfio_put_group(group);
>          return ret;
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index bd07c86..c926a24 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -358,6 +358,7 @@ struct PCIDevice {
>  
>      /* ID of standby device in net_failover pair */
>      char *failover_pair_id;
> +    bool reused;
>  };
>  
>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index c78f3ff..4e2a332 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>      unsigned iommu_type;
>      Error *error;
>      bool initialized;
> +    bool reused;
> +    int cid;
>      unsigned long pgsizes;
>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
> @@ -177,7 +179,7 @@ void vfio_reset_handler(void *opaque);
>  VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>  void vfio_put_group(VFIOGroup *group);
>  int vfio_get_device(VFIOGroup *group, const char *name,
> -                    VFIODevice *vbasedev, Error **errp);
> +                    VFIODevice *vbasedev, bool *reused, Error **errp);
>  
>  extern const MemoryRegionOps vfio_region_ops;
>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 881dc13..2606cf0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1568,7 +1568,7 @@ static int qemu_savevm_state(QEMUFile *f, VMStateMode mode, Error **errp)
>          return -EINVAL;
>      }
>  
> -    if (migrate_use_block()) {
> +    if ((mode & (VMS_SNAPSHOT | VMS_MIGRATE)) && migrate_use_block()) {
>          error_setg(errp, "Block migration and snapshots are incompatible");
>          return -EINVAL;
>      }
> -- 
> 1.8.3.1
> 
> 



* Re: [PATCH V1 30/32] vfio-pci: save and restore
  2020-08-06 10:22   ` Jason Zeng
@ 2020-08-07 20:38     ` Steven Sistare
  2020-08-10  3:50       ` Jason Zeng
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-08-07 20:38 UTC (permalink / raw)
  To: Jason Zeng
  Cc: Daniel P. Berrange, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, Dr. David Alan Gilbert,
	Alex Williamson, Stefan Hajnoczi, Paolo Bonzini,
	Marc-André Lureau, Philippe Mathieu-Daudé,
	Alex Bennée

On 8/6/2020 6:22 AM, Jason Zeng wrote:
> Hi Steve,
> 
> On Thu, Jul 30, 2020 at 08:14:34AM -0700, Steve Sistare wrote:
>> @@ -3182,6 +3207,51 @@ static Property vfio_pci_dev_properties[] = {
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> +static int vfio_pci_post_load(void *opaque, int version_id)
>> +{
>> +    int vector;
>> +    MSIMessage msg;
>> +    Error *err = 0;
>> +    VFIOPCIDevice *vdev = opaque;
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    if (msix_enabled(pdev)) {
>> +        vfio_msix_enable(vdev);
>> +        pdev->msix_function_masked = false;
>> +
>> +        for (vector = 0; vector < vdev->pdev.msix_entries_nr; vector++) {
>> +            if (!msix_is_masked(pdev, vector)) {
>> +                msg = msix_get_message(pdev, vector);
>> +                vfio_msix_vector_use(pdev, vector, msg);
>> +            }
>> +        }
> 
> It looks to me MSIX re-init here may lose device IRQs and impact
> device hardware state?
> 
> The re-init will cause the kernel vfio driver to connect the device
> MSIX vectors to new eventfds and KVM instance. But before that, device
> IRQs will be routed to previous eventfd. Looks these IRQs will be lost.

Thanks Jason, that sounds like a problem.  I could try reading and saving an 
event from eventfd before shutdown, and injecting it into the eventfd after
restart, but that would be racy unless I disable interrupts.  Or, unconditionally
inject a spurious interrupt after restart to kick it, in case an interrupt 
was lost.

Do you have any other ideas?

> And the re-init will make the device go through the procedure of
> disabling MSIX, enabling INTX, and re-enabling MSIX and vectors.
> So if the device is active, its hardware state will be impacted?

Again thanks.  vfio_msix_enable() does indeed call vfio_disable_interrupts().
For a quick experiment, I deleted that call for the post_load code path, and
it seems to work fine, but I need to study it more.

- Steve
 



* Re: [PATCH V1 30/32] vfio-pci: save and restore
  2020-08-07 20:38     ` Steven Sistare
@ 2020-08-10  3:50       ` Jason Zeng
  2020-08-19 21:15         ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Jason Zeng @ 2020-08-10  3:50 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, Dr. David Alan Gilbert,
	Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Jason Zeng, Philippe Mathieu-Daudé,
	Alex Bennée

On Fri, Aug 07, 2020 at 04:38:12PM -0400, Steven Sistare wrote:
> On 8/6/2020 6:22 AM, Jason Zeng wrote:
> > Hi Steve,
> > 
> > On Thu, Jul 30, 2020 at 08:14:34AM -0700, Steve Sistare wrote:
> >> @@ -3182,6 +3207,51 @@ static Property vfio_pci_dev_properties[] = {
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>  
> >> +static int vfio_pci_post_load(void *opaque, int version_id)
> >> +{
> >> +    int vector;
> >> +    MSIMessage msg;
> >> +    Error *err = 0;
> >> +    VFIOPCIDevice *vdev = opaque;
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +
> >> +    if (msix_enabled(pdev)) {
> >> +        vfio_msix_enable(vdev);
> >> +        pdev->msix_function_masked = false;
> >> +
> >> +        for (vector = 0; vector < vdev->pdev.msix_entries_nr; vector++) {
> >> +            if (!msix_is_masked(pdev, vector)) {
> >> +                msg = msix_get_message(pdev, vector);
> >> +                vfio_msix_vector_use(pdev, vector, msg);
> >> +            }
> >> +        }
> > 
> > It looks to me MSIX re-init here may lose device IRQs and impact
> > device hardware state?
> > 
> > The re-init will cause the kernel vfio driver to connect the device
> > MSIX vectors to new eventfds and KVM instance. But before that, device
> > IRQs will be routed to previous eventfd. Looks these IRQs will be lost.
> 
> Thanks Jason, that sounds like a problem.  I could try reading and saving an 
> event from eventfd before shutdown, and injecting it into the eventfd after
> restart, but that would be racy unless I disable interrupts.  Or, unconditionally
> inject a spurious interrupt after restart to kick it, in case an interrupt 
> was lost.
> 
> Do you have any other ideas?

Maybe we can also consider handing over the eventfd file descriptor, or
even the KVM fds, to the new Qemu?

If the KVM fds can be preserved, we will just need to restore the Qemu KVM
side states. But I am not sure how complicated the implementation would be.

If we only preserve the eventfd fd, we can attach the old eventfd to the
vfio devices. But it looks like we may end up always injecting an interrupt
unconditionally, because kernel KVM irqfd eventfd handling is a bit
different from normal userland eventfd read/write. It doesn't decrease
the counter in the eventfd context. So if we read the eventfd from the new
Qemu, it looks like it will always have a non-zero counter, which requires
an interrupt injection.
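
For reference, the ordinary userland eventfd(2) semantics mentioned above can
be sketched as below. This shows only plain read/write behaviour on Linux: a
write adds to the counter, and a read returns the accumulated count and resets
it to zero. It does not model the KVM irqfd path, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: signal the eventfd once, adding 1 to its counter. */
int example_signal(int efd)
{
    uint64_t one = 1;

    assert(efd >= 0);
    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Hypothetical helper: return the number of pending signals and reset the
 * counter to zero.  efd must have been created with EFD_NONBLOCK, so that
 * an empty counter yields a failed read (EAGAIN) instead of blocking.
 */
uint64_t example_drain(int efd)
{
    uint64_t count = 0;

    assert(efd >= 0);
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return 0;   /* nothing pending */
    }
    return count;
}
```

With these semantics, a counter that no consumer ever reads stays non-zero
once it has been signalled, which is why a reader in the new process could
not tell a stale event from a fresh one.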

> 
> > And the re-init will make the device go through the procedure of
> > disabling MSIX, enabling INTX, and re-enabling MSIX and vectors.
> > So if the device is active, its hardware state will be impacted?
> 
> Again thanks.  vfio_msix_enable() does indeed call vfio_disable_interrupts().
> For a quick experiment, I deleted that call in for the post_load code path, and 
> it seems to work fine, but I need to study it more.

vfio_msix_vector_use() will also trigger this procedure in the kernel.

It looks like we shouldn't trigger any kernel vfio actions here. Because we
preserve the vfio fds, their kernel state shouldn't be touched. Here we
may only need to restore Qemu states. Re-connecting to the KVM instance should
happen automatically when we set up the KVM irqfds with the same eventfd.

BTW, if I remember correctly, it is not enough to only save MSIX state
in the snapshot. We should also save the Qemu side pci config space
cache to the snapshot, because Qemu's copy is not exactly the same as
the kernel's copy. I encountered this before, but I don't remember which
field it was.

And another question: why don't we support MSI? I see the code only
handles MSIX.

Thanks,
Jason





* Re: [PATCH V1 00/32] Live Update
  2020-07-31 15:52         ` Daniel P. Berrangé
  2020-07-31 17:20           ` Steven Sistare
@ 2020-08-11 19:08           ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-08-11 19:08 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Michael S. Tsirkin, Alex Bennée, Juan Quintela, qemu-devel,
	Markus Armbruster, Alex Williamson, Steven Sistare,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Fri, Jul 31, 2020 at 11:27:45AM -0400, Steven Sistare wrote:
> > On 7/31/2020 4:53 AM, Daniel P. Berrangé wrote:
> > > On Thu, Jul 30, 2020 at 02:48:44PM -0400, Steven Sistare wrote:
> > >> On 7/30/2020 12:52 PM, Daniel P. Berrangé wrote:
> > >>> On Thu, Jul 30, 2020 at 08:14:04AM -0700, Steve Sistare wrote:
> > >>>> Improve and extend the qemu functions that save and restore VM state so a
> > >>>> guest may be suspended and resumed with minimal pause time.  qemu may be
> > >>>> updated to a new version in between.
> > >>>>
> > >>>> The first set of patches adds the cprsave and cprload commands to save and
> > >>>> restore VM state, and allow the host kernel to be updated and rebooted in
> > >>>> between.  The VM must create guest RAM in a persistent shared memory file,
> > >>>> such as /dev/dax0.0 or persistent /dev/shm PKRAM as proposed in
> > >>>> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
> > >>>>
> > >>>> cprsave stops the VCPUs and saves VM device state in a simple file, and
> > >>>> thus supports any type of guest image and block device.  The caller must
> > >>>> not modify the VM's block devices between cprsave and cprload.
> > >>>>
> > >>>> cprsave and cprload support guests with vfio devices if the caller first
> > >>>> suspends the guest by issuing guest-suspend-ram to the qemu guest agent.
> > >>>> The guest drivers' suspend methods flush outstanding requests and
> > >>>> re-initialize the devices, and thus there is no device state to save and
> > >>>> restore.
> > >>>>
> > >>>>    1 savevm: add vmstate handler iterators
> > >>>>    2 savevm: VM handlers mode mask
> > >>>>    3 savevm: QMP command for cprsave
> > >>>>    4 savevm: HMP Command for cprsave
> > >>>>    5 savevm: QMP command for cprload
> > >>>>    6 savevm: HMP Command for cprload
> > >>>>    7 savevm: QMP command for cprinfo
> > >>>>    8 savevm: HMP command for cprinfo
> > >>>>    9 savevm: prevent cprsave if memory is volatile
> > >>>>   10 kvmclock: restore paused KVM clock
> > >>>>   11 cpu: disable ticks when suspended
> > >>>>   12 vl: pause option
> > >>>>   13 gdbstub: gdb support for suspended state
> > >>>>
> > >>>> The next patches add a restart method that eliminates the persistent memory
> > >>>> constraint, and allows qemu to be updated across the restart, but does not
> > >>>> allow host reboot.  Anonymous memory segments used by the guest are
> > >>>> preserved across a re-exec of qemu, mapped at the same VA, via a proposed
> > >>>> madvise(MADV_DOEXEC) option in the Linux kernel.  See
> > >>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> > >>>>
> > >>>>   14 savevm: VMS_RESTART and cprsave restart
> > >>>>   15 vl: QEMU_START_FREEZE env var
> > >>>>   16 oslib: add qemu_clr_cloexec
> > >>>>   17 util: env var helpers
> > >>>>   18 osdep: import MADV_DOEXEC
> > >>>>   19 memory: ram_block_add cosmetic changes
> > >>>>   20 vl: add helper to request re-exec
> > >>>>   21 exec, memory: exec(3) to restart
> > >>>>   22 char: qio_channel_socket_accept reuse fd
> > >>>>   23 char: save/restore chardev socket fds
> > >>>>   24 ui: save/restore vnc socket fds
> > >>>>   25 char: save/restore chardev pty fds
> > >>>
> > >>> Keeping FDs open across re-exec is a nice trick, but how are you dealing
> > >>> with the state associated with them, most especially the TLS encryption
> > >>> state ? AFAIK, there's no way to serialize/deserialize the TLS state that
> > >>> GNUTLS maintains, and the patches don't show any sign of dealing with
> > >>> this. IOW it looks like while the FD will be preserved, any TLS session
> > >>> running on it will fail.
> > >>
> > >> I had not considered TLS.  If a non-qemu library maintains connection state, then
> > >> we won't be able to support it for live update until the library provides interfaces
> > >> to serialize the state.
> > >>
> > >> For qemu objects, so far vmstate has been adequate to represent the devices with
> > >> descriptors that we preserve.
> > > 
> > > My main concern about this series is that there is an implicit assumption
> > > that QEMU is *not* configured with certain features that are not handled.
> > > If QEMU is using one of the unsupported features, I don't see anything in
> > > the series which attempts to prevent the actions.
> > > 
> > > IOW, users can have an arbitrary QEMU config, attempt to use these new features,
> > > the commands may well succeed, but the user is silently left with a broken QEMU.
> > > Such silent failure modes are really undesirable as they'll lead to a never
> > > ending stream of hard to diagnose bug reports for QEMU maintainers.
> > > 
> > > TLS is one example of this: the live upgrade will "succeed", but the TLS
> > > connections will be totally non-functional.
> > 
> > I agree with all your points and would like to do better in this area.  Other than hunting for 
> > every use of a descriptor and either supporting it or blocking cpr, do you have any suggestions?
> > Thinking out loud, maybe we can gather all the fds that we support, then look for all fds in the
> > process, and block the cpr if we find an unrecognized fd.
> 
> There's no magic easy answer to this problem. Conceptually it is similar to
> the problem of reliably migrating guest device state, but in this case we're
> primarily concerned about the backends instead.
> 
> For migration we've got standardized interfaces that devices must implement
> in order to correctly support migration serialization. There is also support
> for devices to register migration "blockers" which prevent any use of the
> migration feature when the device is present.
> 
> We lack this kind of concept for the backend, and that's what I think needs
> to be tackled in a more thorough way.  There are quite a lot of backends,
> but they're grouped into a reasonably small number of sets (UIs, chardevs,
> blockdevs, net devs, etc). We need some standard interface that we can
> plumb into all the backends, along with providing backends the ability to
> block the re-exec. We should plumb the generic infrastructure into each of
> the different types of backend, and make the default behaviour be to reject
> the re-exec. Then we need to carefully consider specific backend impls
> and allow the re-exec only in the very precise cases we can demonstrate
> to be safe.
> 
> IOW, have a presumption that re-exec will *not* be permitted. Over time
> we can make it work for an ever expanding set of use cases. 

Yes, it does feel like an interface that needs to be implemented on the
chardev; then you don't need to worry about handling them all
individually.
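
A minimal sketch of the default-deny idea discussed above, assuming a
hypothetical per-backend hook.  None of these names (Backend, can_reexec,
reexec_permitted, SocketChardev) exist in QEMU; they only illustrate that a
backend with no hook, or one whose hook reports state that cannot be
preserved (such as a TLS session), blocks the re-exec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hook: returns true only if the backend can show that its
 * state survives a re-exec.  On refusal it may set *reason. */
typedef bool (*reexec_check_fn)(const void *opaque, const char **reason);

typedef struct Backend {
    const char *name;
    reexec_check_fn can_reexec;   /* NULL means "not audited": rejected */
    const void *opaque;
} Backend;

/* Hypothetical stand-in for a socket chardev's configuration. */
typedef struct SocketChardev {
    bool tls;
} SocketChardev;

static bool socket_chardev_can_reexec(const void *opaque, const char **reason)
{
    const SocketChardev *s = opaque;

    if (s->tls) {
        *reason = "TLS session state cannot be serialized";
        return false;
    }
    return true;
}

/* Presume re-exec is NOT permitted: every backend must opt in explicitly.
 * On failure, *blocker and *reason describe the first backend that refused. */
static bool reexec_permitted(const Backend *backends, size_t n,
                             const char **blocker, const char **reason)
{
    assert(backends != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        const Backend *b = &backends[i];
        const char *why = "backend has not opted in to re-exec";

        if (!b->can_reexec || !b->can_reexec(b->opaque, &why)) {
            *blocker = b->name;
            *reason = why;
            return false;
        }
    }
    return true;
}
```

With this shape, supporting a new case means adding or relaxing one hook,
and anything nobody has looked at yet fails loudly instead of silently.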

Dave

> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-07-30 15:14 ` [PATCH V1 18/32] osdep: import MADV_DOEXEC Steve Sistare
@ 2020-08-17 18:30   ` Steven Sistare
  2020-08-17 20:48     ` Alex Williamson
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-08-17 18:30 UTC (permalink / raw)
  To: qemu-devel, Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, Dr. David Alan Gilbert, Markus Armbruster,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 7/30/2020 11:14 AM, Steve Sistare wrote:
> Anonymous memory segments used by the guest are preserved across a re-exec
> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> in the Linux kernel. For the madvise patches, see:
> 
> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/qemu/osdep.h | 7 +++++++
>  1 file changed, 7 insertions(+)

Hi Alex,
  The MADV_DOEXEC functionality, which is a prerequisite for the entire qemu
live update series, is getting a chilly reception on lkml.  We could instead
create guest memory using memfd_create and preserve the fd across exec.  However,
the subsequent mmap(fd) will return a different VA than was used previously,
which is a problem for memory that was registered with vfio, as the original VA
is remembered in the kernel struct vfio_dma and used in various kernel functions,
such as vfio_iommu_replay.

To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
vaddr with new_vaddr.  Flags cannot be changed.

memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
vfio on any form of shared memory (shm, dax, etc) could also be preserved across
exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
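
As a rough illustration of the memfd_create alternative, the sketch below
creates fd-backed memory without MFD_CLOEXEC (so the descriptor would
survive an exec) and maps it twice.  The second mmap in the same process is
only a stand-in for the mapping the new qemu would create after exec; the
helper names and the "guest-ram" label are made up for the example.  The
contents are preserved, but the kernel picks the address, which is why the
vaddr recorded by vfio would go stale.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create fd-backed anonymous memory.  Flags are 0, so FD_CLOEXEC is not
 * set and the descriptor stays open across exec. */
static int guest_ram_create(size_t size)
{
    int fd = memfd_create("guest-ram", 0);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map the region wherever the kernel chooses; returns NULL on failure. */
static void *guest_ram_map(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Returns 1 if data written through one mapping is visible through a
 * second, independent mapping of the same fd, 0 if not, -1 on error. */
static int guest_ram_survives_remap(size_t size)
{
    int fd = guest_ram_create(size);
    char *old_va, *new_va;
    int same;

    if (fd < 0) {
        return -1;
    }
    old_va = guest_ram_map(fd, size);
    if (!old_va) {
        close(fd);
        return -1;
    }
    strcpy(old_va, "guest data");

    /* Stand-in for exec: map the inherited fd again. */
    new_va = guest_ram_map(fd, size);
    if (!new_va) {
        munmap(old_va, size);
        close(fd);
        return -1;
    }

    /* Both mappings are live, so the addresses must differ here.  After
     * a real exec the kernel is equally free to pick a new address. */
    assert(new_va != old_va);
    same = strcmp(new_va, "guest data") == 0;

    munmap(new_va, size);
    munmap(old_va, size);
    close(fd);
    return same;
}
```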

What do you think?

- Steve



* Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-08-17 18:30   ` Steven Sistare
@ 2020-08-17 20:48     ` Alex Williamson
  2020-08-17 21:20       ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Alex Williamson @ 2020-08-17 20:48 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On Mon, 17 Aug 2020 14:30:51 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 7/30/2020 11:14 AM, Steve Sistare wrote:
> > Anonymous memory segments used by the guest are preserved across a re-exec
> > of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> > in the Linux kernel. For the madvise patches, see:
> > 
> > https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> > 
> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > ---
> >  include/qemu/osdep.h | 7 +++++++
> >  1 file changed, 7 insertions(+)  
> 
> Hi Alex,
>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
> live update series, is getting a chilly reception on lkml.  We could instead 
> create guest memory using memfd_create and preserve the fd across exec.  However, 
> the subsequent mmap(fd) will return a different VA than was used previously, 
> which  is a problem for memory that was registered with vfio, as the original VA 
> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
> such as vfio_iommu_replay.
> 
> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
> vaddr with new_vaddr.  Flags cannot be changed.
> 
> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> 
> What do you think?

Your new REMAP ioctl would have parameters identical to the MAP_DMA
ioctl, so I think we should just use one of the flag bits on the
existing MAP_DMA ioctl for this variant.
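
A toy userspace model of that suggestion, to make the semantics concrete.
It is not the vfio uAPI: DmaMapping, DmaTable, dma_remap and DMA_FLAG_REMAP
are hypothetical names, and the rule that the permission bits in the request
must equal the stored ones is one assumed reading of "flags cannot be
changed".  The lookup requires an exact (iova, size) match and replaces only
the vaddr.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_FLAG_READ   (1u << 0)
#define DMA_FLAG_WRITE  (1u << 1)
#define DMA_FLAG_REMAP  (1u << 2)   /* hypothetical "replace vaddr" bit */

typedef struct DmaMapping {
    uint64_t iova;
    uint64_t size;
    uint64_t vaddr;
    uint32_t flags;                 /* permission bits only */
} DmaMapping;

typedef struct DmaTable {
    DmaMapping *maps;
    size_t count;
} DmaTable;

/* Replace the vaddr of the mapping that exactly matches (iova, size).
 * Returns 0 on success, -ENOENT if there is no exact match, and -EINVAL
 * if the remap bit is missing or the permission bits differ. */
static int dma_remap(DmaTable *t, uint64_t iova, uint64_t size,
                     uint64_t new_vaddr, uint32_t flags)
{
    uint32_t perms = flags & ~DMA_FLAG_REMAP;

    assert(t != NULL);

    if (!(flags & DMA_FLAG_REMAP)) {
        return -EINVAL;
    }
    for (size_t i = 0; i < t->count; i++) {
        DmaMapping *m = &t->maps[i];

        if (m->iova == iova && m->size == size) {
            if (perms != m->flags) {
                return -EINVAL;     /* flags cannot be changed */
            }
            m->vaddr = new_vaddr;   /* iova, size and pinning untouched */
            return 0;
        }
    }
    return -ENOENT;
}
```

Note that the model says nothing about the window between the old vaddr
becoming invalid and the remap; that is a separate problem.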

Reading through the discussion on the kernel side there seems to be
some confusion around why vfio needs the vaddr beyond the user call to
MAP_DMA though.  Originally this was used to test for virtually
contiguous mappings for merging and splitting purposes.  This is
defunct in the v2 interface; however, the vaddr is now used largely for
mdev devices.  If an mdev device is not backed by an IOMMU device and
does not share a container with an IOMMU device, then a user MAP_DMA
ioctl essentially just registers the translation within the vfio
container.  The mdev vendor driver can then later either request pages
to be pinned for device DMA or can perform copy_to/from_user() to
simulate DMA via the CPU.

Therefore I don't see that there's a simple re-architecture of the vfio
IOMMU backend that could drop vaddr use.  I'm also concerned that this new
remap proposal raises the question of how we prevent userspace remapping
of vaddrs from racing with asynchronous kernel use of the previous
vaddrs.  Are we expecting guest drivers/agents to quiesce the device,
or maybe relying on clearing bus-master, for PCI devices, to halt DMA?
The vfio migration interface we've developed does have a mechanism to
stop a device; would we need to use this here?  If we do have a
mechanism to quiesce the device, is the only reason we're not unmapping
everything and remapping it into the new address space the latency in
performing that operation?  Thanks,

Alex




* Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-08-17 20:48     ` Alex Williamson
@ 2020-08-17 21:20       ` Steven Sistare
  2020-08-17 21:44         ` Alex Williamson
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-08-17 21:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On 8/17/2020 4:48 PM, Alex Williamson wrote:
> On Mon, 17 Aug 2020 14:30:51 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 7/30/2020 11:14 AM, Steve Sistare wrote:
>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>> in the Linux kernel. For the madvise patches, see:
>>>
>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>  include/qemu/osdep.h | 7 +++++++
>>>  1 file changed, 7 insertions(+)  
>>
>> Hi Alex,
>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
>> live update series, is getting a chilly reception on lkml.  We could instead 
>> create guest memory using memfd_create and preserve the fd across exec.  However, 
>> the subsequent mmap(fd) will return a different VA than was used previously, 
>> which  is a problem for memory that was registered with vfio, as the original VA 
>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
>> such as vfio_iommu_replay.
>>
>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
>> vaddr with new_vaddr.  Flags cannot be changed.
>>
>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>
>> What do you think
> 
> Your new REMAP ioctl would have parameters identical to the MAP_DMA
> ioctl, so I think we should just use one of the flag bits on the
> existing MAP_DMA ioctl for this variant.

Sounds good.

> Reading through the discussion on the kernel side there seems to be
> some confusion around why vfio needs the vaddr beyond the user call to
> MAP_DMA though.  Originally this was used to test for virtually
> contiguous mappings for merging and splitting purposes.  This is
> defunct in the v2 interface, however the vaddr is now used largely for
> mdev devices.  If an mdev device is not backed by an IOMMU device and
> does not share a container with an IOMMU device, then a user MAP_DMA
> ioctl essentially just registers the translation within the vfio
> container.  The mdev vendor driver can then later either request pages
> to be pinned for device DMA or can perform copy_to/from_user() to
> simulate DMA via the CPU.
> 
> Therefore I don't see that there's a simple re-architecture of the vfio
> IOMMU backend that could drop vaddr use.  

Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
same conclusion.

> I'm a bit concerned this new
> remap proposal also raises the question of how do we prevent userspace
> remapping vaddrs racing with asynchronous kernel use of the previous
> vaddrs.  

Agreed.  After a quick glance at the code, holding iommu->lock during 
remap might be sufficient, but it needs more study.

> Are we expecting guest drivers/agents to quiesce the device,
> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?

No.  We want to support any guest, and the guest is not aware that qemu
live update is occurring.

> The vfio migration interface we've developed does have a mechanism to
> stop a device, would we need to use this here?  If we do have a
> mechanism to quiesce the device, is the only reason we're not unmapping
> everything and remapping it into the new address space the latency in
> performing that operation?  Thanks,

Same answer - we don't require that the guest has vfio migration support.

- Steve



* Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-08-17 21:20       ` Steven Sistare
@ 2020-08-17 21:44         ` Alex Williamson
  2020-08-18  2:42           ` Alex Williamson
  0 siblings, 1 reply; 118+ messages in thread
From: Alex Williamson @ 2020-08-17 21:44 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On Mon, 17 Aug 2020 17:20:57 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 8/17/2020 4:48 PM, Alex Williamson wrote:
> > On Mon, 17 Aug 2020 14:30:51 -0400
> > Steven Sistare <steven.sistare@oracle.com> wrote:
> >   
> >> On 7/30/2020 11:14 AM, Steve Sistare wrote:  
> >>> Anonymous memory segments used by the guest are preserved across a re-exec
> >>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> >>> in the Linux kernel. For the madvise patches, see:
> >>>
> >>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> >>>
> >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>> ---
> >>>  include/qemu/osdep.h | 7 +++++++
> >>>  1 file changed, 7 insertions(+)    
> >>
> >> Hi Alex,
> >>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
> >> live update series, is getting a chilly reception on lkml.  We could instead 
> >> create guest memory using memfd_create and preserve the fd across exec.  However, 
> >> the subsequent mmap(fd) will return a different VA than was used previously, 
> >> which  is a problem for memory that was registered with vfio, as the original VA 
> >> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
> >> such as vfio_iommu_replay.
> >>
> >> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> >> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
> >> vaddr with new_vaddr.  Flags cannot be changed.
> >>
> >> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> >> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
> >> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> >>
> >> What do you think  
> > 
> > Your new REMAP ioctl would have parameters identical to the MAP_DMA
> > ioctl, so I think we should just use one of the flag bits on the
> > existing MAP_DMA ioctl for this variant.  
> 
> Sounds good.
> 
> > Reading through the discussion on the kernel side there seems to be
> > some confusion around why vfio needs the vaddr beyond the user call to
> > MAP_DMA though.  Originally this was used to test for virtually
> > contiguous mappings for merging and splitting purposes.  This is
> > defunct in the v2 interface, however the vaddr is now used largely for
> > mdev devices.  If an mdev device is not backed by an IOMMU device and
> > does not share a container with an IOMMU device, then a user MAP_DMA
> > ioctl essentially just registers the translation within the vfio
> > container.  The mdev vendor driver can then later either request pages
> > to be pinned for device DMA or can perform copy_to/from_user() to
> > simulate DMA via the CPU.
> > 
> > Therefore I don't see that there's a simple re-architecture of the vfio
> > IOMMU backend that could drop vaddr use.    
> 
> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
> same conclusion.
> 
> > I'm a bit concerned this new
> > remap proposal also raises the question of how do we prevent userspace
> > remapping vaddrs racing with asynchronous kernel use of the previous
> > vaddrs.    
> 
> Agreed.  After a quick glance at the code, holding iommu->lock during 
> remap might be sufficient, but it needs more study.

Unless you're suggesting an extended hold of the lock across the entire
re-exec of QEMU, that's only going to prevent a race between a remap
and a vendor driver pin or access; the time between the previous vaddr
becoming invalid and the remap is unprotected.

> > Are we expecting guest drivers/agents to quiesce the device,
> > or maybe relying on clearing bus-master, for PCI devices, to halt DMA?  
> 
> No.  We want to support any guest, and the guest is not aware that qemu
> live update is occurring.
> 
> > The vfio migration interface we've developed does have a mechanism to
> > stop a device, would we need to use this here?  If we do have a
> > mechanism to quiesce the device, is the only reason we're not unmapping
> > everything and remapping it into the new address space the latency in
> > performing that operation?  Thanks,  
> 
> Same answer - we don't require that the guest has vfio migration support.

QEMU toggling the runstate of the device via the vfio migration
interface could be done transparently to the guest, but if your
intention is to support any device (where none currently support the
migration interface) perhaps it's a moot point.  It seems like this
scheme only works with IOMMU-backed devices where the device can
continue to operate against pinned pages; anything that might need to
dynamically pin pages against the process vaddr as it's running async
to the QEMU re-exec needs to be blocked or stopped.  Thanks,

Alex




* Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-08-17 21:44         ` Alex Williamson
@ 2020-08-18  2:42           ` Alex Williamson
  2020-08-19 21:52             ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Alex Williamson @ 2020-08-18  2:42 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Markus Armbruster

On Mon, 17 Aug 2020 15:44:03 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 17 Aug 2020 17:20:57 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
> > On 8/17/2020 4:48 PM, Alex Williamson wrote:  
> > > On Mon, 17 Aug 2020 14:30:51 -0400
> > > Steven Sistare <steven.sistare@oracle.com> wrote:
> > >     
> > >> On 7/30/2020 11:14 AM, Steve Sistare wrote:    
> > >>> Anonymous memory segments used by the guest are preserved across a re-exec
> > >>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> > >>> in the Linux kernel. For the madvise patches, see:
> > >>>
> > >>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> > >>>
> > >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > >>> ---
> > >>>  include/qemu/osdep.h | 7 +++++++
> > >>>  1 file changed, 7 insertions(+)      
> > >>
> > >> Hi Alex,
> > >>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
> > >> live update series, is getting a chilly reception on lkml.  We could instead 
> > >> create guest memory using memfd_create and preserve the fd across exec.  However, 
> > >> the subsequent mmap(fd) will return a different VA than was used previously, 
> > >> which  is a problem for memory that was registered with vfio, as the original VA 
> > >> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
> > >> such as vfio_iommu_replay.
> > >>
> > >> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> > >> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
> > >> vaddr with new_vaddr.  Flags cannot be changed.
> > >>
> > >> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> > >> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
> > >> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> > >>
> > >> What do you think    
> > > 
> > > Your new REMAP ioctl would have parameters identical to the MAP_DMA
> > > ioctl, so I think we should just use one of the flag bits on the
> > > existing MAP_DMA ioctl for this variant.    
> > 
> > Sounds good.
> >   
> > > Reading through the discussion on the kernel side there seems to be
> > > some confusion around why vfio needs the vaddr beyond the user call to
> > > MAP_DMA though.  Originally this was used to test for virtually
> > > contiguous mappings for merging and splitting purposes.  This is
> > > defunct in the v2 interface, however the vaddr is now used largely for
> > > mdev devices.  If an mdev device is not backed by an IOMMU device and
> > > does not share a container with an IOMMU device, then a user MAP_DMA
> > > ioctl essentially just registers the translation within the vfio
> > > container.  The mdev vendor driver can then later either request pages
> > > to be pinned for device DMA or can perform copy_to/from_user() to
> > > simulate DMA via the CPU.
> > > 
> > > Therefore I don't see that there's a simple re-architecture of the vfio
> > > IOMMU backend that could drop vaddr use.      
> > 
> > Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
> > same conclusion.
> >   
> > > I'm a bit concerned this new
> > > remap proposal also raises the question of how do we prevent userspace
> > > remapping vaddrs racing with asynchronous kernel use of the previous
> > > vaddrs.      
> > 
> > Agreed.  After a quick glance at the code, holding iommu->lock during 
> > remap might be sufficient, but it needs more study.  
> 
> Unless you're suggesting an extended hold of the lock across the entire
> re-exec of QEMU, that's only going to prevent a race between a remap
> and a vendor driver pin or access, the time between the previous vaddr
> becoming invalid and the remap is unprotected.
> 
> > > Are we expecting guest drivers/agents to quiesce the device,
> > > or maybe relying on clearing bus-master, for PCI devices, to halt DMA?    
> > 
> > No.  We want to support any guest, and the guest is not aware that qemu
> > live update is occurring.
> >   
> > > The vfio migration interface we've developed does have a mechanism to
> > > stop a device, would we need to use this here?  If we do have a
> > > mechanism to quiesce the device, is the only reason we're not unmapping
> > > everything and remapping it into the new address space the latency in
> > > performing that operation?  Thanks,    
> > 
> > Same answer - we don't require that the guest has vfio migration support.  
> 
> QEMU toggling the runstate of the device via the vfio migration
> interface could be done transparently to the guest, but if your
> intention is to support any device (where none currently support the
> migration interface) perhaps it's a moot point.  It seems like this
> scheme only works with IOMMU backed devices where the device can
> continue to operate against pinned pages, anything that might need to
> dynamically pin pages against the process vaddr as it's running async
> to the QEMU re-exec needs to be blocked or stopped.  Thanks,

Hmm, even if a device is running against pinned memory, how do we
handle device interrupts that occur during QEMU's downtime?  I see that
we reconfigure interrupts, but does QEMU need to drain the eventfd and
manually inject those missed interrupts or will setting up the irqfds
trigger a poll?  Thanks,
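
For reference, "drain and manually inject" could look roughly like the
userspace sketch below.  It uses plain eventfd read/write only and does not
model KVM's irqfd handling; the helper names are hypothetical.  Any number
of pending events on the old eventfd collapses into a single kick on the
new one, and an interrupt that fires after the drain is still lost unless
the device is stopped first.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Read and reset the counter of a non-blocking eventfd.  Returns the
 * number of accumulated events, 0 if none are pending, -1 on error. */
static int64_t eventfd_drain(int fd)
{
    uint64_t count;
    ssize_t n;

    assert(fd >= 0);

    n = read(fd, &count, sizeof(count));
    if (n == (ssize_t)sizeof(count)) {
        return (int64_t)count;
    }
    if (n < 0 && errno == EAGAIN) {
        return 0;
    }
    return -1;
}

/* Signal the eventfd once, as a device interrupt would. */
static int eventfd_kick(int fd)
{
    uint64_t one = 1;

    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Drain old_fd and, if anything was pending, re-inject one event on
 * new_fd.  Returns the number of events drained, or -1 on error. */
static int64_t eventfd_transfer_pending(int old_fd, int new_fd)
{
    int64_t pending = eventfd_drain(old_fd);

    if (pending > 0 && eventfd_kick(new_fd) < 0) {
        return -1;
    }
    return pending;
}
```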

Alex




* Re: [PATCH V1 30/32] vfio-pci: save and restore
  2020-08-10  3:50       ` Jason Zeng
@ 2020-08-19 21:15         ` Steven Sistare
  2020-08-20 10:33           ` Jason Zeng
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-08-19 21:15 UTC (permalink / raw)
  To: Jason Zeng
  Cc: Daniel P. Berrange, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, Dr. David Alan Gilbert,
	Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Jason Zeng, Philippe Mathieu-Daudé,
	Alex Bennée

On 8/9/2020 11:50 PM, Jason Zeng wrote:
> On Fri, Aug 07, 2020 at 04:38:12PM -0400, Steven Sistare wrote:
>> On 8/6/2020 6:22 AM, Jason Zeng wrote:
>>> Hi Steve,
>>>
>>> On Thu, Jul 30, 2020 at 08:14:34AM -0700, Steve Sistare wrote:
>>>> @@ -3182,6 +3207,51 @@ static Property vfio_pci_dev_properties[] = {
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>>  
>>>> +static int vfio_pci_post_load(void *opaque, int version_id)
>>>> +{
>>>> +    int vector;
>>>> +    MSIMessage msg;
>>>> +    Error *err = 0;
>>>> +    VFIOPCIDevice *vdev = opaque;
>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>> +
>>>> +    if (msix_enabled(pdev)) {
>>>> +        vfio_msix_enable(vdev);
>>>> +        pdev->msix_function_masked = false;
>>>> +
>>>> +        for (vector = 0; vector < vdev->pdev.msix_entries_nr; vector++) {
>>>> +            if (!msix_is_masked(pdev, vector)) {
>>>> +                msg = msix_get_message(pdev, vector);
>>>> +                vfio_msix_vector_use(pdev, vector, msg);
>>>> +            }
>>>> +        }
>>>
>>> It looks to me like the MSIX re-init here may lose device IRQs and impact
>>> device hardware state?
>>>
>>> The re-init will cause the kernel vfio driver to connect the device
>>> MSIX vectors to new eventfds and a new KVM instance. But before that, device
>>> IRQs will be routed to the previous eventfd. It looks like these IRQs will be lost.
>>
>> Thanks Jason, that sounds like a problem.  I could try reading and saving an 
>> event from eventfd before shutdown, and injecting it into the eventfd after
>> restart, but that would be racy unless I disable interrupts.  Or, unconditionally
>> inject a spurious interrupt after restart to kick it, in case an interrupt 
>> was lost.
>>
>> Do you have any other ideas?
> 
> Maybe we can consider also handing over the eventfd file descriptor, or

I believe preserving this descriptor in isolation is not sufficient.  We would
also need to preserve the KVM instance which it is linked to.

> even the KVM fds to the new Qemu?
> 
> If the KVM fds can be preserved, we will just need to restore Qemu KVM
> side states. But not sure how complicated the implementation would be.

That should work, but I fear it would require many code changes in QEMU
to re-use descriptors at object creation time and suppress the initial 
configuration ioctl's, so it's not my first choice for a solution.

> If we only preserve the eventfd fd, we can attach the old eventfd to
> vfio devices. But it looks like we may end up always injecting an interrupt
> unconditionally, because kernel KVM irqfd eventfd handling is a bit
> different from normal userland eventfd read/write: it doesn't decrease
> the counter in the eventfd context. So if we read the eventfd from the new
> Qemu, it looks like it will always have a non-zero counter, which requires
> an interrupt injection.

Good to know, thanks.

I will try creating a new eventfd and injecting an interrupt unconditionally.
I need a test case to demonstrate losing an interrupt, and fixing it with
injection.  Any advice?  My stress tests with a virtual function nic and a
directly assigned nvme block device have never failed across live update.

>>> And the re-init will make the device go through the procedure of
>>> disabling MSIX, enabling INTX, and re-enabling MSIX and vectors.
>>> So if the device is active, its hardware state will be impacted?
>>
>> Again thanks.  vfio_msix_enable() does indeed call vfio_disable_interrupts().
>> For a quick experiment, I deleted that call in the post_load code path, and
>> it seems to work fine, but I need to study it more.
> 
> vfio_msix_vector_use() will also trigger this procedure in the kernel.

Because that code path calls VFIO_DEVICE_SET_IRQS? Or something else?
Can you point to what it triggers in the kernel?

> It looks like we shouldn't trigger any kernel vfio actions here? Because we
> preserve vfio fds, its kernel state shouldn't be touched. Here we
> may only need to restore Qemu states. Re-connecting to the KVM instance should
> be done automatically when we set up the KVM irqfds with the same eventfd.
> 
> BTW, if I remember correctly, it is not enough to only save MSIX state
> in the snapshot. We should also save the Qemu side pci config space
> cache to the snapshot, because Qemu's copy is not exactly the same as
> the kernel's copy. I encountered this before, but I don't remember which
> field it was.

FYI all, Jason told me offline that qemu may emulate some pci capabilities and
hence keeps state in the shadow config that is never written to the kernel.
I need to study that.

> And another question, why don't we support MSI? I see the code only
> handles MSIX?

Yes, needs more code for MSI.

- Steve
  
>>>> +
>>>> +    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
>>>> +        vfio_intx_enable(vdev, &err);
>>>> +        if (err) {
>>>> +            error_report_err(err);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    vdev->vbasedev.group->container->reused = false;
>>>> +    vdev->pdev.reused = false;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static const VMStateDescription vfio_pci_vmstate = {
>>>> +    .name = "vfio-pci",
>>>> +    .unmigratable = 1,
>>>> +    .mode_mask = VMS_RESTART,
>>>> +    .version_id = 0,
>>>> +    .minimum_version_id = 0,
>>>> +    .post_load = vfio_pci_post_load,
>>>> +    .fields = (VMStateField[]) {
>>>> +        VMSTATE_MSIX(pdev, VFIOPCIDevice),
>>>> +        VMSTATE_END_OF_LIST()
>>>> +    }
>>>> +};
>>>> +
>>>>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>>>  {
>>>>      DeviceClass *dc = DEVICE_CLASS(klass);
>>>> @@ -3189,6 +3259,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
>>>>  
>>>>      dc->reset = vfio_pci_reset;
>>>>      device_class_set_props(dc, vfio_pci_dev_properties);
>>>> +    dc->vmsd = &vfio_pci_vmstate;
>>>>      dc->desc = "VFIO-based PCI device assignment";
>>>>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
>>>>      pdc->realize = vfio_realize;
>>>> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
>>>> index ac2cefc..e6e1a5d 100644
>>>> --- a/hw/vfio/platform.c
>>>> +++ b/hw/vfio/platform.c
>>>> @@ -592,7 +592,7 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
>>>>              return -EBUSY;
>>>>          }
>>>>      }
>>>> -    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
>>>> +    ret = vfio_get_device(group, vbasedev->name, vbasedev, 0, errp);
>>>>      if (ret) {
>>>>          vfio_put_group(group);
>>>>          return ret;
>>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>>> index bd07c86..c926a24 100644
>>>> --- a/include/hw/pci/pci.h
>>>> +++ b/include/hw/pci/pci.h
>>>> @@ -358,6 +358,7 @@ struct PCIDevice {
>>>>  
>>>>      /* ID of standby device in net_failover pair */
>>>>      char *failover_pair_id;
>>>> +    bool reused;
>>>>  };
>>>>  
>>>>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index c78f3ff..4e2a332 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
>>>>      unsigned iommu_type;
>>>>      Error *error;
>>>>      bool initialized;
>>>> +    bool reused;
>>>> +    int cid;
>>>>      unsigned long pgsizes;
>>>>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>>>>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
>>>> @@ -177,7 +179,7 @@ void vfio_reset_handler(void *opaque);
>>>>  VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
>>>>  void vfio_put_group(VFIOGroup *group);
>>>>  int vfio_get_device(VFIOGroup *group, const char *name,
>>>> -                    VFIODevice *vbasedev, Error **errp);
>>>> +                    VFIODevice *vbasedev, bool *reused, Error **errp);
>>>>  
>>>>  extern const MemoryRegionOps vfio_region_ops;
>>>>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>> index 881dc13..2606cf0 100644
>>>> --- a/migration/savevm.c
>>>> +++ b/migration/savevm.c
>>>> @@ -1568,7 +1568,7 @@ static int qemu_savevm_state(QEMUFile *f, VMStateMode mode, Error **errp)
>>>>          return -EINVAL;
>>>>      }
>>>>  
>>>> -    if (migrate_use_block()) {
>>>> +    if ((mode & (VMS_SNAPSHOT | VMS_MIGRATE)) && migrate_use_block()) {
>>>>          error_setg(errp, "Block migration and snapshots are incompatible");
>>>>          return -EINVAL;
>>>>      }
>>>> -- 
>>>> 1.8.3.1
>>>>
>>>>


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-08-18  2:42           ` Alex Williamson
@ 2020-08-19 21:52             ` Steven Sistare
  2020-08-24 22:30               ` Alex Williamson
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-08-19 21:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Anthony Yznaga, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 8/17/2020 10:42 PM, Alex Williamson wrote:
> On Mon, 17 Aug 2020 15:44:03 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Mon, 17 Aug 2020 17:20:57 -0400
>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>
>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:  
>>>> On Mon, 17 Aug 2020 14:30:51 -0400
>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>     
>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:    
>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>>>>> in the Linux kernel. For the madvise patches, see:
>>>>>>
>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>
>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>> ---
>>>>>>  include/qemu/osdep.h | 7 +++++++
>>>>>>  1 file changed, 7 insertions(+)      
>>>>>
>>>>> Hi Alex,
>>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
>>>>> live update series, is getting a chilly reception on lkml.  We could instead 
>>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
>>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
>>>>> which  is a problem for memory that was registered with vfio, as the original VA 
>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
>>>>> such as vfio_iommu_replay.
>>>>>
>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
>>>>> vaddr with new_vaddr.  Flags cannot be changed.
>>>>>
>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>>>>
>>>>> What do you think?
>>>>
>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
>>>> ioctl, so I think we should just use one of the flag bits on the
>>>> existing MAP_DMA ioctl for this variant.    
>>>
>>> Sounds good.
>>>   
>>>> Reading through the discussion on the kernel side there seems to be
>>>> some confusion around why vfio needs the vaddr beyond the user call to
>>>> MAP_DMA though.  Originally this was used to test for virtually
>>>> contiguous mappings for merging and splitting purposes.  This is
>>>> defunct in the v2 interface, however the vaddr is now used largely for
>>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
>>>> does not share a container with an IOMMU device, then a user MAP_DMA
>>>> ioctl essentially just registers the translation within the vfio
>>>> container.  The mdev vendor driver can then later either request pages
>>>> to be pinned for device DMA or can perform copy_to/from_user() to
>>>> simulate DMA via the CPU.
>>>>
>>>> Therefore I don't see that there's a simple re-architecture of the vfio
>>>> IOMMU backend that could drop vaddr use.      
>>>
>>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
>>> same conclusion.
>>>   
>>>> I'm a bit concerned this new
>>>> remap proposal also raises the question of how do we prevent userspace
>>>> remapping vaddrs racing with asynchronous kernel use of the previous
>>>> vaddrs.      
>>>
>>> Agreed.  After a quick glance at the code, holding iommu->lock during 
>>> remap might be sufficient, but it needs more study.  
>>
>> Unless you're suggesting an extended hold of the lock across the entire
>> re-exec of QEMU, that's only going to prevent a race between a remap
>> and a vendor driver pin or access, the time between the previous vaddr
>> becoming invalid and the remap is unprotected.

OK.  What if we exclude mediated devices?  It appears they are the only
ones where the kernel may async'ly use the vaddr, via call chains ending in 
vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.

The other functions that use dma->vaddr are
    vfio_dma_do_map 
    vfio_pin_map_dma 
    vfio_iommu_replay 
    vfio_pin_pages_remote
and they are all initiated via userland ioctl (if I have traced all the code 
paths correctly).  Thus iommu->lock would protect them.

We would block live update in qemu if the config includes a mediated device.

VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.
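
A minimal user-space sketch of the remap semantics described above, for
discussion only.  It is not kernel code: struct dma_entry, struct
container_model and remap_dma() are hypothetical stand-ins, the -ENOENT
return for "no exact match" is an assumption, and the iommu->lock that the
real implementation would hold is omitted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct vfio_dma. */
struct dma_entry {
    uint64_t iova;
    uint64_t size;
    uint64_t vaddr;
};

/* Hypothetical container model; has_mdev models "a mediated device exists". */
struct container_model {
    struct dma_entry *entries;
    size_t nr_entries;
    int has_mdev;
};

/*
 * Replace vaddr for the mapping that exactly matches (iova, size).
 * Returns 0 on success, -EINVAL if the container has a mediated device,
 * and -ENOENT (an assumed error code) if no exact match exists.
 * Flags are never touched, matching "Flags cannot be changed".
 */
static int remap_dma(struct container_model *c, uint64_t iova, uint64_t size,
                     uint64_t new_vaddr)
{
    size_t i;

    if (c->has_mdev) {
        return -EINVAL;
    }
    for (i = 0; i < c->nr_entries; i++) {
        struct dma_entry *d = &c->entries[i];

        if (d->iova == iova && d->size == size) {
            d->vaddr = new_vaddr;
            return 0;
        }
    }
    return -ENOENT;
}
```

A partial overlap, such as the right iova with a smaller size, is rejected
rather than split, which keeps the operation a pure vaddr substitution.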

>>>> Are we expecting guest drivers/agents to quiesce the device,
>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?    
>>>
>>> No.  We want to support any guest, and the guest is not aware that qemu
>>> live update is occurring.
>>>   
>>>> The vfio migration interface we've developed does have a mechanism to
>>>> stop a device, would we need to use this here?  If we do have a
>>>> mechanism to quiesce the device, is the only reason we're not unmapping
>>>> everything and remapping it into the new address space the latency in
>>>> performing that operation?  Thanks,    
>>>
>>> Same answer - we don't require that the guest has vfio migration support.  
>>
>> QEMU toggling the runstate of the device via the vfio migration
>> interface could be done transparently to the guest, but if your
>> intention is to support any device (where none currently support the
>> migration interface) perhaps it's a moot point.  

That sounds useful when devices support it.  Can you give me some function names
or references so I can study this qemu-based "vfio migration interface"?

>> It seems like this
>> scheme only works with IOMMU backed devices where the device can
>> continue to operate against pinned pages, anything that might need to
>> dynamically pin pages against the process vaddr as it's running async
>> to the QEMU re-exec needs to be blocked or stopped.  Thanks,

Yes, true of this remap proposal.

I wanted to unconditionally support all devices, which is why I think that
MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
lkml discussion.

> Hmm, even if a device is running against pinned memory, how do we
> handle device interrupts that occur during QEMU's downtime?  I see that
> we reconfigure interrupts, but does QEMU need to drain the eventfd and
> manually inject those missed interrupts or will setting up the irqfds
> trigger a poll?  Thanks,

My existing code is apparently deficient in this area; I close the pre-exec eventfd,
and post exec create a new eventfd and attach it to the vfio device.  Jason and I
are discussing alternatives in this thread of the series:
  https://lore.kernel.org/qemu-devel/0da862c8-74bc-bf06-a436-4ebfcb9dd8d4@oracle.com/

I am hoping that unconditionally injecting a (potentially spurious) interrupt on a 
new eventfd after exec will solve the problem.
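
A rough sketch of the "new eventfd plus unconditional kick" idea, assuming
Linux eventfd(2).  The helper names are hypothetical, and the step that
attaches the descriptor to the vfio device and to KVM is left out; the
write of 1 only models the potentially spurious injection.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a fresh eventfd and immediately signal it once, so that a
 * consumer polling the descriptor sees one (possibly spurious) event.
 * Returns the descriptor, or -1 on failure.
 */
static int new_eventfd_with_kick(void)
{
    uint64_t one = 1;
    int fd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);

    if (fd < 0) {
        return -1;
    }
    if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Read and reset the eventfd counter, as a user-space consumer would.
 * Returns the number of pending signals, or 0 if none are pending.
 */
static uint64_t drain_eventfd(int fd)
{
    uint64_t count = 0;

    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return 0;
    }
    return count;
}
```

Several signals written before a read collapse into a single counter value,
so the consumer cannot tell a spurious kick apart from a real interrupt
that arrived at about the same time.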

BTW, thanks for discussing these issues.  I appreciate it.

- Steve


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH V1 30/32] vfio-pci: save and restore
  2020-08-19 21:15         ` Steven Sistare
@ 2020-08-20 10:33           ` Jason Zeng
  2020-10-07 21:25             ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Jason Zeng @ 2020-08-20 10:33 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, Dr. David Alan Gilbert,
	Alex Williamson, Paolo Bonzini, Stefan Hajnoczi,
	Marc-André Lureau, Jason Zeng, Philippe Mathieu-Daudé,
	Alex Bennée

On Wed, Aug 19, 2020 at 05:15:11PM -0400, Steven Sistare wrote:
> On 8/9/2020 11:50 PM, Jason Zeng wrote:
> > On Fri, Aug 07, 2020 at 04:38:12PM -0400, Steven Sistare wrote:
> >> On 8/6/2020 6:22 AM, Jason Zeng wrote:
> >>> Hi Steve,
> >>>
> >>> On Thu, Jul 30, 2020 at 08:14:34AM -0700, Steve Sistare wrote:
> >>>> @@ -3182,6 +3207,51 @@ static Property vfio_pci_dev_properties[] = {
> >>>>      DEFINE_PROP_END_OF_LIST(),
> >>>>  };
> >>>>  
> >>>> +static int vfio_pci_post_load(void *opaque, int version_id)
> >>>> +{
> >>>> +    int vector;
> >>>> +    MSIMessage msg;
> >>>> +    Error *err = 0;
> >>>> +    VFIOPCIDevice *vdev = opaque;
> >>>> +    PCIDevice *pdev = &vdev->pdev;
> >>>> +
> >>>> +    if (msix_enabled(pdev)) {
> >>>> +        vfio_msix_enable(vdev);
> >>>> +        pdev->msix_function_masked = false;
> >>>> +
> >>>> +        for (vector = 0; vector < vdev->pdev.msix_entries_nr; vector++) {
> >>>> +            if (!msix_is_masked(pdev, vector)) {
> >>>> +                msg = msix_get_message(pdev, vector);
> >>>> +                vfio_msix_vector_use(pdev, vector, msg);
> >>>> +            }
> >>>> +        }
> >>>
> >>> It looks to me MSIX re-init here may lose device IRQs and impact
> >>> device hardware state?
> >>>
> >>> The re-init will cause the kernel vfio driver to connect the device
> >>> MSIX vectors to new eventfds and KVM instance. But before that, device
> >>> IRQs will be routed to previous eventfd. Looks these IRQs will be lost.
> >>
> >> Thanks Jason, that sounds like a problem.  I could try reading and saving an 
> >> event from eventfd before shutdown, and injecting it into the eventfd after
> >> restart, but that would be racy unless I disable interrupts.  Or, unconditionally
> >> inject a spurious interrupt after restart to kick it, in case an interrupt 
> >> was lost.
> >>
> >> Do you have any other ideas?
> > 
> > Maybe we can consider to also hand over the eventfd file descriptor, or
> 
> I believe preserving this descriptor in isolation is not sufficient.  We would
> also need to preserve the KVM instance which it is linked to.
> 
> > or even the KVM fds to the new Qemu?
> > 
> > If the KVM fds can be preserved, we will just need to restore Qemu KVM
> > side states. But not sure how complicated the implementation would be.
> 
> That should work, but I fear it would require many code changes in QEMU
> to re-use descriptors at object creation time and suppress the initial 
> configuration ioctl's, so it's not my first choice for a solution.
> 
> > If we only preserve the eventfd fd, we can attach the old eventfd to
> > vfio devices. But looks it may turn out we always inject an interrupt
> > unconditionally, because kernel KVM irqfd eventfd handling is a bit
> > different than normal user land eventfd read/write. It doesn't decrease
> > the counter in the eventfd context. So if we read the eventfd from new
> > Qemu, it looks will always have a non-zero counter, which requires an
> > interrupt injection.
> 
> Good to know, thanks.
> 
> I will try creating a new eventfd and injecting an interrupt unconditionally.
> I need a test case to demonstrate losing an interrupt, and fixing it with
> injection.  Any advice?  My stress tests with a virtual function nic and a
> directly assigned nvme block device have never failed across live update.
> 

I am not familiar with nvme devices.  For a nic, to my understanding, a
stress test will not generate many IRQs, because the nic driver usually
enables NAPI, which takes only the first interrupt, then disables interrupts
and starts polling.  It re-enables interrupts only after some packet
quota is reached or the traffic quiesces for a while.  Even so, if the
test runs long enough, the number of IRQs should still be large, so I am
not sure why it doesn't trigger any issue.  Maybe we can study the IRQ
pattern of the test and see how to design a test case, or check whether
our assumption is wrong?


> >>> And the re-init will make the device go through the procedure of
> >>> disabling MSIX, enabling INTX, and re-enabling MSIX and vectors.
> >>> So if the device is active, its hardware state will be impacted?
> >>
> >> Again thanks.  vfio_msix_enable() does indeed call vfio_disable_interrupts().
> >> For a quick experiment, I deleted that call for the post_load code path, and
> >> it seems to work fine, but I need to study it more.
> > 
> > vfio_msix_vector_use() will also trigger this procedure in the kernel.
> 
> Because that code path calls VFIO_DEVICE_SET_IRQS? Or something else?
> Can you point to what it triggers in the kernel?


In vfio_msix_vector_use(), I see that vfio_disable_irqindex() is invoked
if "vdev->nr_vectors < nr + 1" is true.  Since the 'vdev' is re-initialized,
this condition should be true, and vfio_disable_irqindex() will
trigger VFIO_DEVICE_SET_IRQS with VFIO_IRQ_SET_DATA_NONE, which will
cause the kernel to disable MSIX.
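
A toy model of that condition, only to show how often the disable path
would run when a freshly initialized vdev re-enables its vectors in
ascending order.  struct vdev_model and vector_use() are hypothetical; the
counter stands in for the VFIO_DEVICE_SET_IRQS call, and growing nr_vectors
to nr + 1 afterwards is an assumption about the real code.

```c
#include <stdbool.h>

/* Hypothetical model of the state relevant to the check quoted above. */
struct vdev_model {
    int nr_vectors;     /* vectors currently enabled; 0 after re-init */
    int disable_calls;  /* how many times the disable path was taken */
};

/*
 * Model of enabling MSIX vector 'nr'.  Returns true if the disable path
 * (vfio_disable_irqindex() in the real code) would have been taken.
 */
static bool vector_use(struct vdev_model *vdev, int nr)
{
    if (vdev->nr_vectors < nr + 1) {
        vdev->disable_calls++;      /* models disabling MSIX in the kernel */
        vdev->nr_vectors = nr + 1;  /* assumed: re-enable with more vectors */
        return true;
    }
    return false;
}
```

In this model, restoring vectors 0, 1 and 2 on a re-initialized vdev takes
the disable path three times, while a vector below nr_vectors does not.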

> 
> > Looks we shouldn't trigger any kernel vfio actions here? Because we
> > preserve vfio fds, so its kernel state shouldn't be touched. Here we
> > may only need to restore Qemu states. Re-connect to KVM instance should
> > be done automatically when we setup the KVM irqfds with the same eventfd.
> > 
> > BTW, if I remember correctly, it is not enough to only save MSIX state
> > in the snapshot. We should also save the Qemu side pci config space
> > cache to the snapshot, because Qemu's copy is not exactly the same as
> > the kernel's copy. I encountered this before, but I don't remember which
> > field it was.
> 
> FYI all, Jason told me offline that qemu may emulate some pci capabilities and
> hence keeps state in the shadow config that is never written to the kernel.
> I need to study that.
> 

Sorry, I read the code again and see that Qemu does write every config
space write to the kernel in vfio_pci_write_config().  Now I am also
confused about what I was seeing previously :(.  But it seems we still need
to look at the kernel code to see whether a mismatch is possible between
the config space caches of Qemu and the kernel.

FYI. Some discussion about the VFIO PCI config space saving/restoring in
live migration scenario:
https://lists.gnu.org/archive/html/qemu-devel/2020-06/msg06964.html

thanks,
Jason


> > And another question, why don't we support MSI? I see the code only
> > handles MSIX?
> 
> Yes, needs more code for MSI.
> 
> - Steve
>   
> >>>> +
> >>>> +    } else if (vfio_pci_read_config(pdev, PCI_INTERRUPT_PIN, 1)) {
> >>>> +        vfio_intx_enable(vdev, &err);
> >>>> +        if (err) {
> >>>> +            error_report_err(err);
> >>>> +        }
> >>>> +    }
> >>>> +
> >>>> +    vdev->vbasedev.group->container->reused = false;
> >>>> +    vdev->pdev.reused = false;
> >>>> +
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +static const VMStateDescription vfio_pci_vmstate = {
> >>>> +    .name = "vfio-pci",
> >>>> +    .unmigratable = 1,
> >>>> +    .mode_mask = VMS_RESTART,
> >>>> +    .version_id = 0,
> >>>> +    .minimum_version_id = 0,
> >>>> +    .post_load = vfio_pci_post_load,
> >>>> +    .fields = (VMStateField[]) {
> >>>> +        VMSTATE_MSIX(pdev, VFIOPCIDevice),
> >>>> +        VMSTATE_END_OF_LIST()
> >>>> +    }
> >>>> +};
> >>>> +
> >>>>  static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
> >>>>  {
> >>>>      DeviceClass *dc = DEVICE_CLASS(klass);
> >>>> @@ -3189,6 +3259,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
> >>>>  
> >>>>      dc->reset = vfio_pci_reset;
> >>>>      device_class_set_props(dc, vfio_pci_dev_properties);
> >>>> +    dc->vmsd = &vfio_pci_vmstate;
> >>>>      dc->desc = "VFIO-based PCI device assignment";
> >>>>      set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> >>>>      pdc->realize = vfio_realize;
> >>>> diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
> >>>> index ac2cefc..e6e1a5d 100644
> >>>> --- a/hw/vfio/platform.c
> >>>> +++ b/hw/vfio/platform.c
> >>>> @@ -592,7 +592,7 @@ static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
> >>>>              return -EBUSY;
> >>>>          }
> >>>>      }
> >>>> -    ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
> >>>> +    ret = vfio_get_device(group, vbasedev->name, vbasedev, 0, errp);
> >>>>      if (ret) {
> >>>>          vfio_put_group(group);
> >>>>          return ret;
> >>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> >>>> index bd07c86..c926a24 100644
> >>>> --- a/include/hw/pci/pci.h
> >>>> +++ b/include/hw/pci/pci.h
> >>>> @@ -358,6 +358,7 @@ struct PCIDevice {
> >>>>  
> >>>>      /* ID of standby device in net_failover pair */
> >>>>      char *failover_pair_id;
> >>>> +    bool reused;
> >>>>  };
> >>>>  
> >>>>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
> >>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >>>> index c78f3ff..4e2a332 100644
> >>>> --- a/include/hw/vfio/vfio-common.h
> >>>> +++ b/include/hw/vfio/vfio-common.h
> >>>> @@ -73,6 +73,8 @@ typedef struct VFIOContainer {
> >>>>      unsigned iommu_type;
> >>>>      Error *error;
> >>>>      bool initialized;
> >>>> +    bool reused;
> >>>> +    int cid;
> >>>>      unsigned long pgsizes;
> >>>>      QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> >>>>      QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
> >>>> @@ -177,7 +179,7 @@ void vfio_reset_handler(void *opaque);
> >>>>  VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
> >>>>  void vfio_put_group(VFIOGroup *group);
> >>>>  int vfio_get_device(VFIOGroup *group, const char *name,
> >>>> -                    VFIODevice *vbasedev, Error **errp);
> >>>> +                    VFIODevice *vbasedev, bool *reused, Error **errp);
> >>>>  
> >>>>  extern const MemoryRegionOps vfio_region_ops;
> >>>>  typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
> >>>> diff --git a/migration/savevm.c b/migration/savevm.c
> >>>> index 881dc13..2606cf0 100644
> >>>> --- a/migration/savevm.c
> >>>> +++ b/migration/savevm.c
> >>>> @@ -1568,7 +1568,7 @@ static int qemu_savevm_state(QEMUFile *f, VMStateMode mode, Error **errp)
> >>>>          return -EINVAL;
> >>>>      }
> >>>>  
> >>>> -    if (migrate_use_block()) {
> >>>> +    if ((mode & (VMS_SNAPSHOT | VMS_MIGRATE)) && migrate_use_block()) {
> >>>>          error_setg(errp, "Block migration and snapshots are incompatible");
> >>>>          return -EINVAL;
> >>>>      }
> >>>> -- 
> >>>> 1.8.3.1
> >>>>
> >>>>


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-08-19 21:52             ` Steven Sistare
@ 2020-08-24 22:30               ` Alex Williamson
  2020-10-08 16:32                 ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Alex Williamson @ 2020-08-24 22:30 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Anthony Yznaga, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Wed, 19 Aug 2020 17:52:26 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 8/17/2020 10:42 PM, Alex Williamson wrote:
> > On Mon, 17 Aug 2020 15:44:03 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> >> On Mon, 17 Aug 2020 17:20:57 -0400
> >> Steven Sistare <steven.sistare@oracle.com> wrote:
> >>  
> >>> On 8/17/2020 4:48 PM, Alex Williamson wrote:    
> >>>> On Mon, 17 Aug 2020 14:30:51 -0400
> >>>> Steven Sistare <steven.sistare@oracle.com> wrote:
> >>>>       
> >>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:      
> >>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
> >>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> >>>>>> in the Linux kernel. For the madvise patches, see:
> >>>>>>
> >>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> >>>>>>
> >>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>>>> ---
> >>>>>>  include/qemu/osdep.h | 7 +++++++
> >>>>>>  1 file changed, 7 insertions(+)        
> >>>>>
> >>>>> Hi Alex,
> >>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
> >>>>> live update series, is getting a chilly reception on lkml.  We could instead 
> >>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
> >>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
> >>>>> which  is a problem for memory that was registered with vfio, as the original VA 
> >>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
> >>>>> such as vfio_iommu_replay.
> >>>>>
> >>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> >>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
> >>>>> vaddr with new_vaddr.  Flags cannot be changed.
> >>>>>
> >>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> >>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
> >>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> >>>>>
> >>>>> What do you think?
> >>>>
> >>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
> >>>> ioctl, so I think we should just use one of the flag bits on the
> >>>> existing MAP_DMA ioctl for this variant.      
> >>>
> >>> Sounds good.
> >>>     
> >>>> Reading through the discussion on the kernel side there seems to be
> >>>> some confusion around why vfio needs the vaddr beyond the user call to
> >>>> MAP_DMA though.  Originally this was used to test for virtually
> >>>> contiguous mappings for merging and splitting purposes.  This is
> >>>> defunct in the v2 interface, however the vaddr is now used largely for
> >>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
> >>>> does not share a container with an IOMMU device, then a user MAP_DMA
> >>>> ioctl essentially just registers the translation within the vfio
> >>>> container.  The mdev vendor driver can then later either request pages
> >>>> to be pinned for device DMA or can perform copy_to/from_user() to
> >>>> simulate DMA via the CPU.
> >>>>
> >>>> Therefore I don't see that there's a simple re-architecture of the vfio
> >>>> IOMMU backend that could drop vaddr use.        
> >>>
> >>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
> >>> same conclusion.
> >>>     
> >>>> I'm a bit concerned this new
> >>>> remap proposal also raises the question of how do we prevent userspace
> >>>> remapping vaddrs racing with asynchronous kernel use of the previous
> >>>> vaddrs.        
> >>>
> >>> Agreed.  After a quick glance at the code, holding iommu->lock during 
> >>> remap might be sufficient, but it needs more study.    
> >>
> >> Unless you're suggesting an extended hold of the lock across the entire
> >> re-exec of QEMU, that's only going to prevent a race between a remap
> >> and a vendor driver pin or access, the time between the previous vaddr
> >> becoming invalid and the remap is unprotected.  
> 
> OK.  What if we exclude mediated devices?  It appears they are the only
> ones where the kernel may async'ly use the vaddr, via call chains ending in 
> vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.
> 
> The other functions that use dma->vaddr are
>     vfio_dma_do_map 
>     vfio_pin_map_dma 
>     vfio_iommu_replay 
>     vfio_pin_pages_remote
> and they are all initiated via userland ioctl (if I have traced all the code 
> paths correctly).  Thus iommu->lock would protect them.
> 
> We would block live update in qemu if the config includes a mediated device.
> 
> VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.

That's not a solution I'd really be in favor of.  We're eliminating an
entire class of devices because they _might_ make use of these
interfaces, but anyone can add a vfio bus driver, even exposing the
same device API, and maybe make use of some of these interfaces in that
driver.  Maybe we'd even have reason to do it in vfio-pci if we had
reason to virtualize some aspect of a device.  I think we're setting
ourselves up for a very complicated support scenario if we just
arbitrarily decide to deny drivers using certain interfaces.


> >>>> Are we expecting guest drivers/agents to quiesce the device,
> >>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?      
> >>>
> >>> No.  We want to support any guest, and the guest is not aware that qemu
> >>> live update is occurring.
> >>>     
> >>>> The vfio migration interface we've developed does have a mechanism to
> >>>> stop a device, would we need to use this here?  If we do have a
> >>>> mechanism to quiesce the device, is the only reason we're not unmapping
> >>>> everything and remapping it into the new address space the latency in
> >>>> performing that operation?  Thanks,      
> >>>
> >>> Same answer - we don't require that the guest has vfio migration support.    
> >>
> >> QEMU toggling the runstate of the device via the vfio migration
> >> interface could be done transparently to the guest, but if your
> >> intention is to support any device (where none currently support the
> >> migration interface) perhaps it's a moot point.    
> 
> That sounds useful when devices support it.  Can you give me some function names
> or references so I can study this qemu-based "vfio migration interface"?

The uAPI is documented in commit a8a24f3f6e38.  We're still waiting on
the QEMU support or implementation in an mdev vendor driver.
Essentially migration exposes a new region of the device which would be
implemented by the vendor driver.  A register within that region
manipulates the device state, so a device could be stopped by clearing
the 'run' bit in that register.
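
As an illustration only, stopping a device this way is a read-modify-write
of one register through the device fd.  In the sketch below the bit name
DEV_STATE_RUNNING, its position, and the state_off parameter are
assumptions; the authoritative layout is the one documented in the commit
above.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed position of the 'run' bit; check the uAPI header for the real one. */
#define DEV_STATE_RUNNING (1u << 0)

/*
 * Clear the 'run' bit in a 32-bit device state register located at
 * state_off within the file referred to by fd, leaving other bits intact.
 * Returns 0 on success, -1 on a short read or write.
 */
static int clear_run_bit(int fd, off_t state_off)
{
    uint32_t state;

    if (pread(fd, &state, sizeof(state), state_off) != (ssize_t)sizeof(state)) {
        return -1;
    }
    state &= ~DEV_STATE_RUNNING;
    if (pwrite(fd, &state, sizeof(state), state_off) != (ssize_t)sizeof(state)) {
        return -1;
    }
    return 0;
}
```

Whether a read-modify-write is appropriate, as opposed to writing a
complete new state value, depends on the vendor driver and is not settled
by this sketch.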


> >> It seems like this
> >> scheme only works with IOMMU backed devices where the device can
> >> continue to operate against pinned pages, anything that might need to
> >> dynamically pin pages against the process vaddr as it's running async
> >> to the QEMU re-exec needs to be blocked or stopped.  Thanks,  
> 
> Yes, true of this remap proposal.
> 
> I wanted to unconditionally support all devices, which is why I think that
> MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
> lkml discussion.
> 
> > Hmm, even if a device is running against pinned memory, how do we
> > handle device interrupts that occur during QEMU's downtime?  I see that
> > we reconfigure interrupts, but does QEMU need to drain the eventfd and
> > manually inject those missed interrupts or will setting up the irqfds
> > trigger a poll?  Thanks,  
> 
> My existing code is apparently deficient in this area; I close the pre-exec eventfd,
> and post exec create a new eventfd and attach it to the vfio device.  Jason and I
> are discussing alternatives in this thread of the series:
>   https://lore.kernel.org/qemu-devel/0da862c8-74bc-bf06-a436-4ebfcb9dd8d4@oracle.com/
> 
> I am hoping that unconditionally injecting a (potentially spurious) interrupt on a 
> new eventfd after exec will solve the problem.

That's sloppy, but maybe sufficient, but I agree with Jason's concern
about treading carefully around anything that would cause the interrupt
state of the device to be modified, which certainly might not be
transparent to the device.  Then there's the issue of what would happen
if a fatal AER event occurred while the eventfd is disconnected.  We
wouldn't want to generate a spurious event on that channel.  The device
request eventfd will retry from the kernel side, so simply reconnecting
it should work.  Each type of virtual interrupt will need to have a
plan for what to do around this disconnected period, and like the error
reporting one, it might not be safe to lose the interrupt nor inject a
spurious interrupt.

Regarding emulated state in QEMU, yes QEMU does write all config to
the kernel, where some things might be emulated in the kernel, but
there are also things emulated in QEMU.  See for example
vdev->emulated_config_bits.  Writing everything to the kernel is just a
simplification because we know the kernel will drop writes that it
doesn't allow.  It's essentially a catch-all.  Thanks,

Alex
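
To make the "drain the eventfd" question above concrete, here is a minimal
sketch of what draining looks like with the plain Linux eventfd API.  It
assumes the eventfd stays open across the downtime, which is not what the
posted code does (it closes the fd before exec).  The helper names
demo_drain_eventfd and demo_missed_after_signals are hypothetical and are
not QEMU or vfio functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Read and reset the eventfd counter.  Returns the number of signals
 * accumulated since the last read, 0 if none, or -1 on error.
 */
static int64_t demo_drain_eventfd(int fd)
{
    uint64_t count;
    ssize_t n;

    assert(fd >= 0);
    n = read(fd, &count, sizeof(count));
    if (n == (ssize_t)sizeof(count)) {
        return (int64_t)count;      /* counter is now reset to zero */
    }
    if (n < 0 && errno == EAGAIN) {
        return 0;                   /* nothing fired while nobody polled */
    }
    return -1;
}

/* Simulate 'signals' interrupts arriving while no one is listening. */
static int64_t demo_missed_after_signals(unsigned signals)
{
    int fd = eventfd(0, EFD_NONBLOCK);
    int64_t missed;
    unsigned i;

    if (fd < 0) {
        return -1;
    }
    for (i = 0; i < signals; i++) {
        uint64_t one = 1;
        if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
            close(fd);
            return -1;
        }
    }
    missed = demo_drain_eventfd(fd);
    close(fd);
    return missed;
}
```

A non-zero result would tell the new process that at least one interrupt
was signalled during the gap, so it could inject exactly when needed rather
than unconditionally.  Whether that is safe still depends on the interrupt
type, as discussed above for the AER and device request channels.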




* Re: [PATCH V1 01/32] savevm: add vmstate handler iterators
  2020-07-30 15:14 ` [PATCH V1 01/32] savevm: add vmstate handler iterators Steve Sistare
@ 2020-09-11 16:24   ` Dr. David Alan Gilbert
  2020-09-24 21:43     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 16:24 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

Apologies for taking a while to get around to this.

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Provide the SAVEVM_FOREACH and SAVEVM_FORALL macros to loop over all save
> VM state handlers.  The former will filter handlers based on the operation
> in the later patch "savevm: VM handlers mode mask".  The latter loops over
> all handlers.
> 
> No functional change.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  migration/savevm.c | 57 ++++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 38 insertions(+), 19 deletions(-)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 45c9dd9..a07fcad 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -266,6 +266,25 @@ static SaveState savevm_state = {
>      .global_section_id = 0,
>  };
>  
> +/*
> + * The FOREACH macros will filter handlers based on the current operation when
> + * additional conditions are added in a subsequent patch.
> + */
> +
> +#define SAVEVM_FOREACH(se, entry)                                    \
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
> +
> +#define SAVEVM_FOREACH_SAFE(se, entry, new_se)                       \
> +    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)   \
> +
> +/* The FORALL macros unconditionally loop over all handlers. */
> +
> +#define SAVEVM_FORALL(se, entry)                                     \
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
> +
> +#define SAVEVM_FORALL_SAFE(se, entry, new_se)                        \
> +    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)
> +

OK, can I ask you to merge this with the next patch but to spin it the
other way, so that we have:

  SAVEVM_FOR(se, entry, mask)

and the places you use SAVEVM_FORALL_SAFE would become

  SAVEVM_FOR(se, entry, VMS_MODE_ALL)

I'm thinking at some point in the future we could merge a bunch of the
other flag checks in there.

Dave


>  static bool should_validate_capability(int capability)
>  {
>      assert(capability >= 0 && capability < MIGRATION_CAPABILITY__MAX);
> @@ -673,7 +692,7 @@ static uint32_t calculate_new_instance_id(const char *idstr)
>      SaveStateEntry *se;
>      uint32_t instance_id = 0;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FORALL(se, entry) {
>          if (strcmp(idstr, se->idstr) == 0
>              && instance_id <= se->instance_id) {
>              instance_id = se->instance_id + 1;
> @@ -689,7 +708,7 @@ static int calculate_compat_instance_id(const char *idstr)
>      SaveStateEntry *se;
>      int instance_id = 0;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FORALL(se, entry) {
>          if (!se->compat) {
>              continue;
>          }
> @@ -803,7 +822,7 @@ void unregister_savevm(VMStateIf *obj, const char *idstr, void *opaque)
>      }
>      pstrcat(id, sizeof(id), idstr);
>  
> -    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
> +    SAVEVM_FORALL_SAFE(se, entry, new_se) {
>          if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
>              savevm_state_handler_remove(se);
>              g_free(se->compat);
> @@ -867,7 +886,7 @@ void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
>  {
>      SaveStateEntry *se, *new_se;
>  
> -    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
> +    SAVEVM_FORALL_SAFE(se, entry, new_se) {
>          if (se->vmsd == vmsd && se->opaque == opaque) {
>              savevm_state_handler_remove(se);
>              g_free(se->compat);
> @@ -1119,7 +1138,7 @@ bool qemu_savevm_state_blocked(Error **errp)
>  {
>      SaveStateEntry *se;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FORALL(se, entry) {
>          if (se->vmsd && se->vmsd->unmigratable) {
>              error_setg(errp, "State blocked by non-migratable device '%s'",
>                         se->idstr);
> @@ -1145,7 +1164,7 @@ bool qemu_savevm_state_guest_unplug_pending(void)
>  {
>      SaveStateEntry *se;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->vmsd && se->vmsd->dev_unplug_pending &&
>              se->vmsd->dev_unplug_pending(se->opaque)) {
>              return true;
> @@ -1162,7 +1181,7 @@ void qemu_savevm_state_setup(QEMUFile *f)
>      int ret;
>  
>      trace_savevm_state_setup();
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->save_setup) {
>              continue;
>          }
> @@ -1193,7 +1212,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
>  
>      trace_savevm_state_resume_prepare();
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->resume_prepare) {
>              continue;
>          }
> @@ -1223,7 +1242,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
>      int ret = 1;
>  
>      trace_savevm_state_iterate();
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->save_live_iterate) {
>              continue;
>          }
> @@ -1291,7 +1310,7 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f)
>      SaveStateEntry *se;
>      int ret;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->save_live_complete_postcopy) {
>              continue;
>          }
> @@ -1324,7 +1343,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
>      SaveStateEntry *se;
>      int ret;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops ||
>              (in_postcopy && se->ops->has_postcopy &&
>               se->ops->has_postcopy(se->opaque)) ||
> @@ -1366,7 +1385,7 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>      vmdesc = qjson_new();
>      json_prop_int(vmdesc, "page_size", qemu_target_page_size());
>      json_start_array(vmdesc, "devices");
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>  
>          if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
>              continue;
> @@ -1476,7 +1495,7 @@ void qemu_savevm_state_pending(QEMUFile *f, uint64_t threshold_size,
>      *res_postcopy_only = 0;
>  
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->save_live_pending) {
>              continue;
>          }
> @@ -1501,7 +1520,7 @@ void qemu_savevm_state_cleanup(void)
>      }
>  
>      trace_savevm_state_cleanup();
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->ops && se->ops->save_cleanup) {
>              se->ops->save_cleanup(se->opaque);
>          }
> @@ -1580,7 +1599,7 @@ int qemu_save_device_state(QEMUFile *f)
>      }
>      cpu_synchronize_all_states();
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          int ret;
>  
>          if (se->is_ram) {
> @@ -1612,7 +1631,7 @@ static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id)
>  {
>      SaveStateEntry *se;
>  
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FORALL(se, entry) {
>          if (!strcmp(se->idstr, idstr) &&
>              (instance_id == se->instance_id ||
>               instance_id == se->alias_id))
> @@ -2334,7 +2353,7 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
>      }
>  
>      trace_qemu_loadvm_state_section_partend(section_id);
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->load_section_id == section_id) {
>              break;
>          }
> @@ -2400,7 +2419,7 @@ static int qemu_loadvm_state_setup(QEMUFile *f)
>      int ret;
>  
>      trace_loadvm_state_setup();
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (!se->ops || !se->ops->load_setup) {
>              continue;
>          }
> @@ -2425,7 +2444,7 @@ void qemu_loadvm_state_cleanup(void)
>      SaveStateEntry *se;
>  
>      trace_loadvm_state_cleanup();
> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +    SAVEVM_FOREACH(se, entry) {
>          if (se->ops && se->ops->load_cleanup) {
>              se->ops->load_cleanup(se->opaque);
>          }
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
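
For illustration, the SAVEVM_FOR(se, entry, mask) shape requested above
could look roughly like the sketch below.  It is self-contained and uses a
plain singly linked list as a stand-in for QTAILQ, so DemoEntry,
DEMO_SAVEVM_FOR and the helper functions are hypothetical names; the macro
also takes the list head instead of the entry field.  The mode bits mirror
the values in the quoted series, written as unsigned macros, and the
assumption that an unrestricted handler carries VMS_MODE_ALL is mine.

```c
#include <assert.h>
#include <stddef.h>

#define VMS_MIGRATE  (1U << 1)
#define VMS_SNAPSHOT (1U << 2)
#define VMS_REBOOT   (1U << 3)
#define VMS_MODE_ALL (~0U)

/* Stand-in for SaveStateEntry; only the fields the loop needs. */
typedef struct DemoEntry {
    const char *idstr;
    unsigned mode_mask;          /* modes this handler participates in */
    struct DemoEntry *next;
} DemoEntry;

/*
 * Visit every entry whose mode_mask intersects 'mask'.  The
 * "if (...) { } else" tail lets the caller attach a normal loop body,
 * and 'continue' inside that body still advances the for loop.
 */
#define DEMO_SAVEVM_FOR(se, head, mask)                       \
    for ((se) = (head); (se); (se) = (se)->next)              \
        if (!((se)->mode_mask & (mask))) { } else

static int demo_count_handlers(DemoEntry *head, unsigned mask)
{
    DemoEntry *se;
    int n = 0;

    assert(mask != 0);           /* an empty mask would match nothing */
    DEMO_SAVEVM_FOR(se, head, mask) {
        n++;
    }
    return n;
}

/* Three handlers: ram and block opt out of VMS_REBOOT, kvmclock does not. */
static DemoEntry *demo_sample_list(void)
{
    static DemoEntry kvmclock = { "kvmclock", VMS_MODE_ALL, NULL };
    static DemoEntry block = { "block", VMS_MIGRATE | VMS_SNAPSHOT, &kvmclock };
    static DemoEntry ram = { "ram", VMS_MIGRATE | VMS_SNAPSHOT, &block };

    return &ram;
}
```

With this shape, the callers that use SAVEVM_FORALL today would pass
VMS_MODE_ALL, and the rest would pass the mode of the current operation,
so a single macro covers both cases.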




* Re: [PATCH V1 03/32] savevm: QMP command for cprsave
  2020-07-30 15:14 ` [PATCH V1 03/32] savevm: QMP command for cprsave Steve Sistare
  2020-07-30 16:12   ` Eric Blake
@ 2020-09-11 16:43   ` Dr. David Alan Gilbert
  2020-09-25 18:43     ` Steven Sistare
  1 sibling, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 16:43 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> To enable live reboot, provide the cprsave QMP command and the VMS_REBOOT
> vmstate-saving operation, which saves the state of the virtual machine in a
> simple file.
> 
> Syntax:
>   {'command':'cprsave', 'data':{'file':'str', 'mode':'str'}}
> 
>   The mode argument must be 'reboot'.  Additional modes will be defined in
>   the future.
> 
> Unlike the savevm command, cprsave supports any type of guest image and
> block device.  cprsave stops the VM so that guest ram and block devices are
> not modified after state is saved.  Guest ram must be mapped to a persistent
> memory file such as /dev/dax0.0.  The ram object vmstate handler and block
> device handler do not apply to VMS_REBOOT, so restrict them to VMS_MIGRATE
> or VMS_SNAPSHOT.  After cprsave completes successfully, qemu exits.
> 
> After issuing cprsave, the caller may update qemu, update the host kernel,
> reboot, start qemu using the same arguments as the original process, and
> issue the cprload command to restore the guest.  cprload is added by
> subsequent patches.
> 
> If the caller suspends the guest instead of stopping the VM, such as by
> issuing guest-suspend-ram to the qemu guest agent, then cprsave and cprload
> support guests with vfio devices.  The guest drivers suspend methods flush
> outstanding requests and re-initialize the devices, and thus there is no
> device state to save and restore.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Signed-off-by: Maran Wilson <maran.wilson@oracle.com>

Going back a step; could you.....

> ---
>  include/migration/vmstate.h |  1 +
>  include/sysemu/sysemu.h     |  2 ++
>  migration/block.c           |  1 +
>  migration/ram.c             |  1 +
>  migration/savevm.c          | 59 +++++++++++++++++++++++++++++++++++++++++++++
>  monitor/qmp-cmds.c          |  6 +++++
>  qapi/migration.json         | 14 +++++++++++
>  7 files changed, 84 insertions(+)
> 
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index fa575f9..c58551a 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -161,6 +161,7 @@ typedef enum {
>  typedef enum {
>      VMS_MIGRATE  = (1U << 1),
>      VMS_SNAPSHOT = (1U << 2),
> +    VMS_REBOOT   = (1U << 3),
>      VMS_MODE_ALL = ~0U
>  } VMStateMode;
>  
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 4b6a5c4..6fe86e6 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -24,6 +24,8 @@ extern bool machine_init_done;
>  void qemu_add_machine_init_done_notifier(Notifier *notify);
>  void qemu_remove_machine_init_done_notifier(Notifier *notify);
>  
> +void save_cpr_snapshot(const char *file, const char *mode, Error **errp);
> +
>  extern int autostart;
>  
>  typedef enum {
> diff --git a/migration/block.c b/migration/block.c
> index 737b649..a69accb 100644
> --- a/migration/block.c
> +++ b/migration/block.c
> @@ -1023,6 +1023,7 @@ static SaveVMHandlers savevm_block_handlers = {
>      .load_state = block_load,
>      .save_cleanup = block_migration_cleanup,
>      .is_active = block_is_active,
> +    .mode_mask = VMS_MIGRATE | VMS_SNAPSHOT,
>  };
>  
>  void blk_mig_init(void)
> diff --git a/migration/ram.c b/migration/ram.c
> index 76d4fee..f0d5d9f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -3795,6 +3795,7 @@ static SaveVMHandlers savevm_ram_handlers = {
>      .load_setup = ram_load_setup,
>      .load_cleanup = ram_load_cleanup,
>      .resume_prepare = ram_resume_prepare,
> +    .mode_mask = VMS_MIGRATE | VMS_SNAPSHOT,
>  };
>  
>  void ram_mig_init(void)
> diff --git a/migration/savevm.c b/migration/savevm.c
> index ce02b6b..ff1a46e 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2680,6 +2680,65 @@ int qemu_load_device_state(QEMUFile *f)
>      return 0;
>  }
>  
> +static QEMUFile *qf_file_open(const char *filename, int flags, int mode,
> +                              Error **errp)
> +{
> +    QIOChannel *ioc;
> +    int fd = qemu_open(filename, flags, mode);
> +
> +    if (fd < 0) {
> +        error_setg_errno(errp, errno, "%s(%s)", __func__, filename);
> +        return NULL;
> +    }
> +
> +    ioc = QIO_CHANNEL(qio_channel_file_new_fd(fd));
> +
> +    if (flags & O_WRONLY) {
> +        return qemu_fopen_channel_output(ioc);
> +    }
> +
> +    return qemu_fopen_channel_input(ioc);
> +}
> +
> +void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
> +{
> +    int ret = 0;
> +    QEMUFile *f;
> +    VMStateMode op;
> +
> +    if (!strcmp(mode, "reboot")) {
> +        op = VMS_REBOOT;
> +    } else {
> +        error_setg(errp, "cprsave: bad mode %s", mode);
> +        return;
> +    }
> +
> +    f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, errp);
> +    if (!f) {
> +        return;
> +    }
> +
> +    ret = global_state_store();
> +    if (ret) {
> +        error_setg(errp, "Error saving global state");
> +        qemu_fclose(f);
> +        return;
> +    }
> +
> +    vm_stop(RUN_STATE_SAVE_VM);
> +
> +    ret = qemu_savevm_state(f, op, errp);
> +    if ((ret < 0) && !*errp) {
> +        error_setg(errp, "qemu_savevm_state failed");
> +    }

just call qemu_save_device_state(f) there rather than introducing the
modes?
What you're doing is VERY similar to qmp_xen_save_devices_state and also
COLO's device state saving.

(and also very similar to migration with the x-ignore-shared flag set).

Dave

> +    qemu_fclose(f);
> +
> +    if (op == VMS_REBOOT) {
> +        no_shutdown = 0;
> +        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
> +    }
> +}
> +
>  int save_snapshot(const char *name, Error **errp)
>  {
>      BlockDriverState *bs, *bs1;
> diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
> index 864cbfa..9ec7b88 100644
> --- a/monitor/qmp-cmds.c
> +++ b/monitor/qmp-cmds.c
> @@ -35,6 +35,7 @@
>  #include "qapi/qapi-commands-machine.h"
>  #include "qapi/qapi-commands-misc.h"
>  #include "qapi/qapi-commands-ui.h"
> +#include "qapi/qapi-commands-migration.h"
>  #include "qapi/qmp/qerror.h"
>  #include "hw/mem/memory-device.h"
>  #include "hw/acpi/acpi_dev_interface.h"
> @@ -161,6 +162,11 @@ void qmp_cont(Error **errp)
>      }
>  }
>  
> +void qmp_cprsave(const char *file, const char *mode, Error **errp)
> +{
> +    save_cpr_snapshot(file, mode, errp);
> +}
> +
>  void qmp_system_wakeup(Error **errp)
>  {
>      if (!qemu_wakeup_suspend_enabled()) {
> diff --git a/qapi/migration.json b/qapi/migration.json
> index d500055..b61df1d 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -1621,3 +1621,17 @@
>  ##
>  { 'event': 'UNPLUG_PRIMARY',
>    'data': { 'device-id': 'str' } }
> +
> +##
> +# @cprsave:
> +#
> +# Create a checkpoint of the virtual machine device state in @file.
> +# Guest RAM and guest block device blocks are not saved.
> +#
> +# @file: name of checkpoint file
> +# @mode: 'reboot' : checkpoint can be cprload'ed after a host kexec reboot.
> +#
> +# Since 5.0
> +##
> +{ 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'str' } }
> +
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 04/32] savevm: HMP Command for cprsave
  2020-07-30 15:14 ` [PATCH V1 04/32] savevm: HMP Command " Steve Sistare
@ 2020-09-11 16:57   ` Dr. David Alan Gilbert
  2020-09-24 21:44     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 16:57 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Enable HMP access to the cprsave QMP command.
> 
> Usage: cprsave <filename> <mode>
> 
> Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

I realise that the only mode is currently 'reboot'; can you please
explain why there is a mode argument when it accepts only one value?

Dave

> ---
>  hmp-commands.hx       | 18 ++++++++++++++++++
>  include/monitor/hmp.h |  1 +
>  monitor/hmp-cmds.c    | 10 ++++++++++
>  3 files changed, 29 insertions(+)
> 
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index 60f395c..c8defd9 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -354,6 +354,24 @@ SRST
>  ERST
>  
>      {
> +        .name       = "cprsave",
> +        .args_type  = "file:s,mode:s",
> +        .params     = "file 'reboot'",
> +        .help       = "create a checkpoint of the VM in file",
> +        .cmd        = hmp_cprsave,
> +    },
> +
> +SRST
> +``cprsave`` *tag*
> +  Stop VCPUs, create a checkpoint of the whole virtual machine and save it
> +  in *file*.
> +  If *mode* is 'reboot', the checkpoint can be cprload'ed after a host kexec
> +  reboot.
> +  exec() /usr/bin/qemu-exec if it exists, else exec /usr/bin/qemu-system-x86_64,
> +  passing all the original command line arguments.  The VCPUs remain paused.
> +ERST
> +
> +    {
>          .name       = "delvm",
>          .args_type  = "name:s",
>          .params     = "tag",
> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
> index c986cfd..af8ee23 100644
> --- a/include/monitor/hmp.h
> +++ b/include/monitor/hmp.h
> @@ -59,6 +59,7 @@ void hmp_balloon(Monitor *mon, const QDict *qdict);
>  void hmp_loadvm(Monitor *mon, const QDict *qdict);
>  void hmp_savevm(Monitor *mon, const QDict *qdict);
>  void hmp_delvm(Monitor *mon, const QDict *qdict);
> +void hmp_cprsave(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_incoming(Monitor *mon, const QDict *qdict);
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index ae4b6a4..59196ed 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -1139,6 +1139,16 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
>      qapi_free_AnnounceParameters(params);
>  }
>  
> +void hmp_cprsave(Monitor *mon, const QDict *qdict)
> +{
> +    Error *err = NULL;
> +
> +    qmp_cprsave(qdict_get_try_str(qdict, "file"),
> +                qdict_get_try_str(qdict, "mode"),
> +                &err);
> +    hmp_handle_error(mon, err);
> +}
> +
>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict)
>  {
>      qmp_migrate_cancel(NULL);
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 05/32] savevm: QMP command for cprload
  2020-07-30 18:00     ` Steven Sistare
@ 2020-09-11 17:18       ` Dr. David Alan Gilbert
  2020-09-24 21:49         ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 17:18 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Paolo Bonzini, Marc-André Lureau,
	Philippe Mathieu-Daudé,
	Alex Bennée

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 7/30/2020 12:14 PM, Eric Blake wrote:
> > On 7/30/20 10:14 AM, Steve Sistare wrote:
> >> Provide the cprload QMP command.  The VM is created from the file produced
> >> by the cprsave command.  Guest RAM is restored in-place from the shared
> >> memory backend file, and guest block devices are used as is.  The contents
> >> of such devices must not be modified between the cprsave and cprload
> >> operations.  If the VM was running at cprsave time, then VM execution
> >> resumes.
> > 
> > Is it always wise to unconditionally resume, or might this command need an additional optional knob that says what state (paused or running) to move into?
> 
> This can already be done.  Issue a stop command before cprsave, then cprload will finish in a
> paused state.
> 
> Also, cprsave re-execs and leaves the guest in a paused state.  One can
> send device add commands, then send cprload which continues.

You're suffering here because you're reinventing stuff rather than
reusing existing migration paths.
With the existing migration code we require the qemu
to be started with -incoming ... so we know it's in a clean
state ready for being loaded, and we've already got the -S
mechanism that dictates whether or not the VM autostarts
(regardless of the saved state in the image).  The management
layers find this pretty useful if they need to wire some networking
or storage up at the point they know they've got a VM image that's
loaded OK.

Dave

> 
> >> Syntax:
> >>    {'command':'cprload', 'data':{'file':'str'}}
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
> >> ---
> > 
> >> +++ b/qapi/migration.json
> >> @@ -1635,3 +1635,14 @@
> >>   ##
> >>   { 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'str' } }
> >>   +##
> >> +# @cprload:
> >> +#
> >> +# Start virtual machine from checkpoint file that was created earlier using
> >> +# the cprsave command.
> >> +#
> >> +# @file: name of checkpoint file
> >> +#
> >> +# Since 5.0
> > 
> > another 5.2 instance. I'll quit pointing it out for the rest of the series.
> 
> Will find and fix all, thanks.
> 
> - Steve
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 08/32] savevm: HMP command for cprinfo
  2020-07-30 15:14 ` [PATCH V1 08/32] savevm: HMP " Steve Sistare
@ 2020-09-11 17:27   ` Dr. David Alan Gilbert
  2020-09-24 21:50     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 17:27 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Enable HMP access to the cprinfo QMP command.
> 
> Usage: cprinfo
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

As with Eric's comment on the QMP command, I don't think you need it;
for HMP all you really need is something that lists it in the help.

(Also I'd expect an 'info cpr' to be a possibility that could give
some information about it, e.g. whether you've just saved, can save, or
have loaded a CPR image)

Dave

> ---
>  hmp-commands.hx       | 13 +++++++++++++
>  include/monitor/hmp.h |  1 +
>  monitor/hmp-cmds.c    | 10 ++++++++++
>  3 files changed, 24 insertions(+)
> 
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index cb67150..7517876 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -354,6 +354,19 @@ SRST
>  ERST
>  
>      {
> +        .name       = "cprinfo",
> +        .args_type  = "",
> +        .params     = "",
> +        .help       = "return list of modes supported by cprsave",
> +        .cmd        = hmp_cprinfo,
> +    },
> +
> +SRST
> +``cprinfo`` *tag*
> +  Return a space-delimited list of modes supported by cprsave.
> +ERST
> +
> +    {
>          .name       = "cprsave",
>          .args_type  = "file:s,mode:s",
>          .params     = "file 'reboot'",
> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
> index 7b8cdfd..919b9a9 100644
> --- a/include/monitor/hmp.h
> +++ b/include/monitor/hmp.h
> @@ -59,6 +59,7 @@ void hmp_balloon(Monitor *mon, const QDict *qdict);
>  void hmp_loadvm(Monitor *mon, const QDict *qdict);
>  void hmp_savevm(Monitor *mon, const QDict *qdict);
>  void hmp_delvm(Monitor *mon, const QDict *qdict);
> +void hmp_cprinfo(Monitor *mon, const QDict *qdict);
>  void hmp_cprsave(Monitor *mon, const QDict *qdict);
>  void hmp_cprload(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> index ba95737..2f6af07 100644
> --- a/monitor/hmp-cmds.c
> +++ b/monitor/hmp-cmds.c
> @@ -1139,6 +1139,16 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
>      qapi_free_AnnounceParameters(params);
>  }
>  
> +void hmp_cprinfo(Monitor *mon, const QDict *qdict)
> +{
> +    Error *err = NULL;
> +    char *res = qmp_cprinfo(&err);
> +
> +    monitor_printf(mon, "%s\n", res);
> +    g_free(res);
> +    hmp_handle_error(mon, err);
> +}
> +
>  void hmp_cprsave(Monitor *mon, const QDict *qdict)
>  {
>      Error *err = NULL;
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 09/32] savevm: prevent cprsave if memory is volatile
  2020-07-30 15:14 ` [PATCH V1 09/32] savevm: prevent cprsave if memory is volatile Steve Sistare
@ 2020-09-11 17:35   ` Dr. David Alan Gilbert
  2020-09-24 21:51     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 17:35 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> cprsave and cprload require that guest ram be backed by an externally
> visible shared file.  Check that in cprsave.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  exec.c                | 32 ++++++++++++++++++++++++++++++++
>  include/exec/memory.h |  2 ++
>  migration/savevm.c    |  4 ++++
>  3 files changed, 38 insertions(+)
> 
> diff --git a/exec.c b/exec.c
> index 6f381f9..02160e0 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2726,6 +2726,38 @@ ram_addr_t qemu_ram_addr_from_host(void *ptr)
>      return block->offset + offset;
>  }
>  
> +/*
> + * Return true if any memory regions are writable and not backed by shared
> + * memory.  Exclude x86 option rom shadow "pc.rom" by name, even though it is
> + * writable.

Tell me about 'pc.rom' - this is a very odd hack.
Again, note the trick used by the existing migration capability
x-ignore-shared: it doesn't special-case anything, it just doesn't migrate
the 'shared' blocks.

Dave


> + */
> +bool qemu_ram_volatile(Error **errp)
> +{
> +    RAMBlock *block;
> +    MemoryRegion *mr;
> +    bool ret = false;
> +
> +    rcu_read_lock();
> +    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> +        mr = block->mr;
> +        if (mr &&
> +            memory_region_is_ram(mr) &&
> +            !memory_region_is_ram_device(mr) &&
> +            !memory_region_is_rom(mr) &&
> +            (!mr->name || strcmp(mr->name, "pc.rom")) &&
> +            (block->fd == -1 || !qemu_ram_is_shared(block))) {
> +
> +            error_setg(errp, "Memory region %s is volatile",
> +                       memory_region_name(mr));
> +            ret = true;
> +            break;
> +        }
> +    }
> +
> +    rcu_read_unlock();
> +    return ret;
> +}
> +
>  /* Generate a debug exception if a watchpoint has been hit.  */
>  void cpu_check_watchpoint(CPUState *cpu, vaddr addr, vaddr len,
>                            MemTxAttrs attrs, int flags, uintptr_t ra)
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 307e527..6aafbb0 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -2519,6 +2519,8 @@ bool ram_block_discard_is_disabled(void);
>   */
>  bool ram_block_discard_is_required(void);
>  
> +bool qemu_ram_volatile(Error **errp);
> +
>  #endif
>  
>  #endif
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 1509173..f101039 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2713,6 +2713,10 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
>          return;
>      }
>  
> +    if (op == VMS_REBOOT && qemu_ram_volatile(errp)) {
> +        return;
> +    }
> +
>      f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, errp);
>      if (!f) {
>          return;
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 10/32] kvmclock: restore paused KVM clock
  2020-07-30 15:14 ` [PATCH V1 10/32] kvmclock: restore paused KVM clock Steve Sistare
@ 2020-09-11 17:50   ` Dr. David Alan Gilbert
  2020-09-25 18:07     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 17:50 UTC (permalink / raw)
  To: Steve Sistare, Paolo Bonzini
  Cc: Daniel P. Berrange, Michael S. Tsirkin,
	Philippe Mathieu-Daudé,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Alex Bennée

* Steve Sistare (steven.sistare@oracle.com) wrote:
> If the VM is paused when the KVM clock is serialized to a file, record
> that the clock is valid, so the value will be reused rather than
> overwritten after cprload with a new call to KVM_GET_CLOCK here:
> 
> kvmclock_vm_state_change()
>     if (running)
>         ...
>     else
>         if (s->clock_valid)
>             return;         <-- instead, return here
> 
>         kvm_update_clock()
>            kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data)  <-- overwritten
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  hw/i386/kvm/clock.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
> index 6428335..161991a 100644
> --- a/hw/i386/kvm/clock.c
> +++ b/hw/i386/kvm/clock.c
> @@ -285,18 +285,22 @@ static int kvmclock_pre_save(void *opaque)
>      if (!s->runstate_paused) {
>          kvm_update_clock(s);
>      }
> +    if (!runstate_is_running()) {
> +        s->clock_valid = true;
> +    }
>  
>      return 0;
>  }
>  
>  static const VMStateDescription kvmclock_vmsd = {
>      .name = "kvmclock",
> -    .version_id = 1,
> +    .version_id = 2,
>      .minimum_version_id = 1,
>      .pre_load = kvmclock_pre_load,
>      .pre_save = kvmclock_pre_save,
>      .fields = (VMStateField[]) {
>          VMSTATE_UINT64(clock, KVMClockState),
> +        VMSTATE_BOOL_V(clock_valid, KVMClockState, 2),
>          VMSTATE_END_OF_LIST()

We always try to avoid bumping version_id unless we're
desperate, because it breaks backwards migration.

Didn't you already know from the stored migration state
(in the globalstate) if the loaded VM was running?

It's also not clear to me why you're avoiding reloading the state;
have you preserved that some other way?

Dave

>      },
>      .subsections = (const VMStateDescription * []) {
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 11/32] cpu: disable ticks when suspended
  2020-07-30 15:14 ` [PATCH V1 11/32] cpu: disable ticks when suspended Steve Sistare
@ 2020-09-11 17:53   ` Dr. David Alan Gilbert
  2020-09-24 20:42     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 17:53 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> After cprload, the guest console misbehaves.  You must type 8 characters
> before any are echoed to the terminal.  Qemu was not sending interrupts
> to the guest because the QEMU_CLOCK_VIRTUAL timers_state.cpu_clock_offset
> was bad.  The offset is usually updated at cprsave time by the path
> 
>   save_cpr_snapshot()
>     vm_stop()
>       do_vm_stop()
>         if (runstate_is_running())
>           cpu_disable_ticks();
>             timers_state.cpu_clock_offset = cpu_get_clock_locked();
> 
> However, if the guest is in RUN_STATE_SUSPENDED, then cpu_disable_ticks is
> not called.  Further, the earlier transition to suspended in
> qemu_system_suspend did not disable ticks.  To fix, call cpu_disable_ticks
> from save_cpr_snapshot.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Are you saying this is really a more generic bug with migrating when
suspended and we should fix this anyway?

Dave

> ---
>  migration/savevm.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index f101039..00f493b 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2729,6 +2729,11 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
>          return;
>      }
>  
> +    /* Update timers_state before saving.  Suspend did not so do. */
> +    if (runstate_check(RUN_STATE_SUSPENDED)) {
> +        cpu_disable_ticks();
> +    }
> +
>      vm_stop(RUN_STATE_SAVE_VM);
>  
>      ret = qemu_savevm_state(f, op, errp);
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 12/32] vl: pause option
  2020-07-30 18:14     ` Steven Sistare
  2020-07-31  9:44       ` Alex Bennée
@ 2020-09-11 17:59       ` Dr. David Alan Gilbert
  2020-09-24 21:51         ` Steven Sistare
  1 sibling, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 17:59 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 7/30/2020 1:03 PM, Alex Bennée wrote:
> > 
> > Steve Sistare <steven.sistare@oracle.com> writes:
> > 
> >> Provide the -pause command-line parameter and the QEMU_PAUSE environment
> >> variable to briefly pause QEMU in main and allow a developer to attach gdb.
> >> Useful when the developer does not invoke QEMU directly, such as when using
> >> libvirt.
> > 
> > How does this differ from -S?
> 
> The -S flag runs qemu to the main loop but does not start the guest.  Lots of code
> that you may need to debug runs before you get there.

You might try the '--preconfig' option - that's pretty early on.
The other one is adding a chardev and telling it to wait for a server;
that'll wait until you telnet to the port.

(Either way, this patch shouldn't really be part of this series, it's a
separate discussion)

Dave
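
As a side note on the quoted patch: if an environment-variable delay were
kept for local debugging, atoi() accepts garbage silently, and a negative
result passed to sleep() converts to a very large unsigned value.  A minimal
sketch of stricter parsing follows; parse_pause_seconds and its max_seconds
limit are hypothetical names, not part of the patch or of QEMU.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a delay in whole seconds.
 * Returns the value on success, or -1 if the string is empty, is not a
 * plain decimal number, is negative, or exceeds max_seconds.
 */
static int parse_pause_seconds(const char *s, int max_seconds)
{
    char *end;
    long v;

    assert(max_seconds >= 0);
    if (s == NULL || *s == '\0') {
        return -1;
    }
    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v < 0 || v > max_seconds) {
        return -1;
    }
    return (int)v;
}
```

A caller would sleep only when the result is non-negative and report an
error otherwise.  This does not remove the race with the debugger that was
raised earlier in the thread; it only bounds the delay.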

> - Steve
> >> Usage:
> >>   qemu -pause <seconds>
> >>   or
> >>   export QEMU_PAUSE=<seconds>
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  qemu-options.hx |  9 +++++++++
> >>  softmmu/vl.c    | 15 ++++++++++++++-
> >>  2 files changed, 23 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/qemu-options.hx b/qemu-options.hx
> >> index 708583b..8505cf2 100644
> >> --- a/qemu-options.hx
> >> +++ b/qemu-options.hx
> >> @@ -3668,6 +3668,15 @@ SRST
> >>      option is experimental.
> >>  ERST
> >>  
> >> +DEF("pause", HAS_ARG, QEMU_OPTION_pause, \
> >> +    "-pause secs    Pause for secs seconds on entry to main.\n", QEMU_ARCH_ALL)
> >> +
> >> +SRST
> >> +``--pause secs``
> >> +    Pause for a number of seconds on entry to main.  Useful for attaching
> >> +    a debugger after QEMU has been launched by some other entity.
> >> +ERST
> >> +
> > 
> > It seems like having an option to race with the debugger is just asking
> > for trouble.
> > 
> >>  DEF("S", 0, QEMU_OPTION_S, \
> >>      "-S              freeze CPU at startup (use 'c' to start execution)\n",
> >>      QEMU_ARCH_ALL)
> >> diff --git a/softmmu/vl.c b/softmmu/vl.c
> >> index 8478778..951994f 100644
> >> --- a/softmmu/vl.c
> >> +++ b/softmmu/vl.c
> >> @@ -2844,7 +2844,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
> >>  
> >>  void qemu_init(int argc, char **argv, char **envp)
> >>  {
> >> -    int i;
> >> +    int i, seconds;
> >>      int snapshot, linux_boot;
> >>      const char *initrd_filename;
> >>      const char *kernel_filename, *kernel_cmdline;
> >> @@ -2882,6 +2882,13 @@ void qemu_init(int argc, char **argv, char **envp)
> >>      QemuPluginList plugin_list = QTAILQ_HEAD_INITIALIZER(plugin_list);
> >>      int mem_prealloc = 0; /* force preallocation of physical target memory */
> >>  
> >> +    if (getenv("QEMU_PAUSE")) {
> >> +        seconds = atoi(getenv("QEMU_PAUSE"));
> >> +        printf("Pausing %d seconds for debugger. QEMU PID is %d\n",
> >> +               seconds, getpid());
> >> +        sleep(seconds);
> >> +    }
> >> +
> >>      os_set_line_buffering();
> >>  
> >>      error_init(argv[0]);
> >> @@ -3204,6 +3211,12 @@ void qemu_init(int argc, char **argv, char **envp)
> >>              case QEMU_OPTION_gdb:
> >>                  add_device_config(DEV_GDB, optarg);
> >>                  break;
> >> +            case QEMU_OPTION_pause:
> >> +                seconds = atoi(optarg);
> >> +                printf("Pausing %d seconds for debugger. QEMU PID is %d\n",
> >> +                            seconds, getpid());
> >> +                sleep(seconds);
> >> +                break;
> >>              case QEMU_OPTION_L:
> >>                  if (is_help_option(optarg)) {
> >>                      list_data_dirs = true;
> > 
> > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 13/32] gdbstub: gdb support for suspended state
  2020-07-30 15:14 ` [PATCH V1 13/32] gdbstub: gdb support for suspended state Steve Sistare
@ 2020-09-11 18:41   ` Dr. David Alan Gilbert
  2020-09-24 21:51     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 18:41 UTC (permalink / raw)
  To: Steve Sistare, alex.bennee, philmd
  Cc: Daniel P. Berrange, Stefan Hajnoczi, Michael S. Tsirkin,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Marc-André Lureau, Paolo Bonzini

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Modify the gdb server so a continue command appears to resume execution
> when in RUN_STATE_SUSPENDED.  Do not print the next gdb prompt, but do not
> actually resume instruction fetch.  While in this "fake" running mode, a
> ctrl-C returns the user to the gdb prompt.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

This patch doesn't feel like it lives here; it seems to be a separate
gdbstub patch and it'll get noticed/merged quicker if sent on its
own.

Dave

> ---
>  gdbstub.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/gdbstub.c b/gdbstub.c
> index f3a318c..2f0d9ff 100644
> --- a/gdbstub.c
> +++ b/gdbstub.c
> @@ -461,7 +461,9 @@ static inline void gdb_continue(void)
>  #else
>      if (!runstate_needs_reset()) {
>          trace_gdbstub_op_continue();
> -        vm_start();
> +        if (!runstate_check(RUN_STATE_SUSPENDED)) {
> +            vm_start();
> +        }
>      }
>  #endif
>  }
> @@ -490,7 +492,7 @@ static int gdb_continue_partial(char *newstates)
>      int flag = 0;
>  
>      if (!runstate_needs_reset()) {
> -        if (vm_prepare_start()) {
> +        if (!runstate_check(RUN_STATE_SUSPENDED) && vm_prepare_start()) {
>              return 0;
>          }
>  
> @@ -2835,6 +2837,9 @@ static void gdb_read_byte(uint8_t ch)
>          /* when the CPU is running, we cannot do anything except stop
>             it when receiving a char */
>          vm_stop(RUN_STATE_PAUSED);
> +    } else if (runstate_check(RUN_STATE_SUSPENDED) && ch == 3) {
> +        /* Received ctrl-c from gdb */
> +        gdb_vm_state_change(0, 0, RUN_STATE_PAUSED);
>      } else
>  #endif
>      {
> @@ -3282,6 +3287,8 @@ static void gdb_sigterm_handler(int signal)
>  {
>      if (runstate_is_running()) {
>          vm_stop(RUN_STATE_PAUSED);
> +    } else if (runstate_check(RUN_STATE_SUSPENDED)) {
> +        gdb_vm_state_change(0, 0, RUN_STATE_PAUSED);
>      }
>  }
>  #endif
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart
  2020-07-30 15:14 ` [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart Steve Sistare
  2020-07-30 16:22   ` Eric Blake
@ 2020-09-11 18:44   ` Dr. David Alan Gilbert
  2020-09-24 21:44     ` Steven Sistare
  1 sibling, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 18:44 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Add the VMS_RESTART variant of vmstate, for use when upgrading qemu in place
> on the same host without a reboot.  Invoke it using:
>   cprsave <filename> restart
> 
> VMS_RESTART supports guest ram mapped by private anonymous memory, versus
> VMS_REBOOT which requires that guest ram be mapped by persistent shared
> memory.  Subsequent patches complete its implementation.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

You should find with the enum like Eric suggests this mostly disappears;
but also you might want to put it after the patches that implement it.

Dave

> ---
>  hmp-commands.hx             | 4 +++-
>  include/migration/vmstate.h | 1 +
>  migration/savevm.c          | 4 +++-
>  monitor/qmp-cmds.c          | 2 +-
>  qapi/migration.json         | 1 +
>  5 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index 7517876..11a2089 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -369,7 +369,7 @@ ERST
>      {
>          .name       = "cprsave",
>          .args_type  = "file:s,mode:s",
> -        .params     = "file 'reboot'",
> +        .params     = "file 'restart'|'reboot'",
>          .help       = "create a checkpoint of the VM in file",
>          .cmd        = hmp_cprsave,
>      },
> @@ -380,6 +380,8 @@ SRST
>    in *file*.
>    If *mode* is 'reboot', the checkpoint can be cprload'ed after a host kexec
>    reboot.
> +  If *mode* is 'restart', the checkpoint can be cprload'ed after restarting
> +  qemu.
>    exec() /usr/bin/qemu-exec if it exists, else exec /usr/bin/qemu-system-x86_64,
>    passing all the original command line arguments.  The VCPUs remain paused.
>  ERST
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index c58551a..8239b84 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -162,6 +162,7 @@ typedef enum {
>      VMS_MIGRATE  = (1U << 1),
>      VMS_SNAPSHOT = (1U << 2),
>      VMS_REBOOT   = (1U << 3),
> +    VMS_RESTART  = (1U << 4),
>      VMS_MODE_ALL = ~0U
>  } VMStateMode;
>  
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 00f493b..38cc63a 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2708,6 +2708,8 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
>  
>      if (!strcmp(mode, "reboot")) {
>          op = VMS_REBOOT;
> +    } else if (!strcmp(mode, "restart")) {
> +        op = VMS_RESTART;
>      } else {
>          error_setg(errp, "cprsave: bad mode %s", mode);
>          return;
> @@ -2973,7 +2975,7 @@ void load_cpr_snapshot(const char *file, Error **errp)
>          return;
>      }
>  
> -    ret = qemu_loadvm_state(f, VMS_REBOOT);
> +    ret = qemu_loadvm_state(f, VMS_REBOOT | VMS_RESTART);
>      qemu_fclose(f);
>      if (ret < 0) {
>          error_setg(errp, "Error %d while loading VM state", ret);
> diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
> index 8c400e6..8a74c6e 100644
> --- a/monitor/qmp-cmds.c
> +++ b/monitor/qmp-cmds.c
> @@ -164,7 +164,7 @@ void qmp_cont(Error **errp)
>  
>  char *qmp_cprinfo(Error **errp)
>  {
> -    return g_strdup("reboot");
> +    return g_strdup("reboot restart");
>  }
>  
>  void qmp_cprsave(const char *file, const char *mode, Error **errp)
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 8190b16..d22992b 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -1639,6 +1639,7 @@
>  #
>  # @file: name of checkpoint file
>  # @mode: 'reboot' : checkpoint can be cprload'ed after a host kexec reboot.
> +#        'restart': checkpoint can be cprload'ed after restarting qemu.
>  #
>  # Since 5.0
>  ##
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 15/32] vl: QEMU_START_FREEZE env var
  2020-07-30 15:14 ` [PATCH V1 15/32] vl: QEMU_START_FREEZE env var Steve Sistare
@ 2020-09-11 18:49   ` Dr. David Alan Gilbert
  2020-09-24 21:47     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 18:49 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> For qemu upgrade and restart, we will re-exec() qemu with the same argv.
> However, qemu must start in a paused state and wait for the cprload command,
> and the original argv might not contain the -S option.  To avoid modifying
> argv, provide the QEMU_START_FREEZE environment variable.  If
> QEMU_START_FREEZE is set, then set autostart=0, like the -S option.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

What's wrong with modifying the argv?

Note also the trick that -incoming defer uses: we start qemu with
-incoming defer, and can then issue commands to modify the QEMU
configuration before we actually reload state.

Note, even without CPR there might be reasons that you need to modify
the argv; for example, imagine that since it was originally booted
someone had hotplugged an extra CPU or RAM or a disk; the new QEMU
must be started in a state that reflects the state in which the VM was
at the point when it was saved, not the point at which it was started
long ago.

Dave
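
To illustrate the argv question, here is a rough sketch of building a new
argument vector that ends with -S before calling exec, rather than signalling
through the environment.  argv_with_flag is a hypothetical helper, not
something in QEMU, and the plain strcmp scan is an assumption that could
mistake an option's argument for the flag itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a NULL-terminated copy of argv with 'flag'
 * appended unless an identical argument is already present.
 * The strings are shared with the original argv; the caller frees only
 * the returned vector.  Returns NULL on allocation failure.
 */
static char **argv_with_flag(int argc, char **argv, const char *flag)
{
    char **out;
    int i, n = argc;
    bool present = false;

    assert(argc >= 1 && argv && flag);
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], flag) == 0) {
            present = true;
        }
    }

    /* Room for the original entries, one extra flag, and the terminator. */
    out = calloc((size_t)argc + 2, sizeof(*out));
    if (!out) {
        return NULL;
    }
    memcpy(out, argv, (size_t)argc * sizeof(*out));
    if (!present) {
        out[n++] = (char *)flag;
    }
    out[n] = NULL;
    return out;
}
```

The result could be handed to execv() in place of the original argv.  The
same approach extends to arguments describing hotplugged devices, which an
environment flag alone cannot express.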

> ---
>  softmmu/vl.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 951994f..7016e39 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -4501,6 +4501,11 @@ void qemu_init(int argc, char **argv, char **envp)
>          exit(0);
>      }
>  
> +    if (getenv("QEMU_START_FREEZE")) {
> +        unsetenv("QEMU_START_FREEZE");
> +        autostart = 0;
> +    }
> +
>      if (incoming) {
>          Error *local_err = NULL;
>          qemu_start_incoming_migration(incoming, &local_err);
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 16/32] oslib: add qemu_clr_cloexec
  2020-07-30 15:14 ` [PATCH V1 16/32] oslib: add qemu_clr_cloexec Steve Sistare
@ 2020-09-11 18:52   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 18:52 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Seems the same as qemu_set_cloexec, so:

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
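
For readers unfamiliar with the flag, here is a small standalone sketch of
setting and clearing FD_CLOEXEC with fcntl().  The helper names are
hypothetical; unlike the quoted patch, which asserts on failure, this
version reports errors to the caller.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper: set (on == true) or clear (on == false) the
 * close-on-exec flag of fd.  Returns 0 on success, -1 on error.
 */
static int fd_set_cloexec(int fd, bool on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1) {
        return -1;
    }
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags);
}

/* Returns 1 if FD_CLOEXEC is set on fd, 0 if it is clear, -1 on error. */
static int fd_get_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1) {
        return -1;
    }
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

A descriptor with the flag cleared stays open across exec(), which is what
lets a re-exec'ed process keep using it.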

> ---
>  include/qemu/osdep.h | 1 +
>  util/oslib-posix.c   | 9 +++++++++
>  util/oslib-win32.c   | 4 ++++
>  3 files changed, 14 insertions(+)
> 
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 45c217a..bb28df1 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -551,6 +551,7 @@ static inline void qemu_timersub(const struct timeval *val1,
>  #endif
>  
>  void qemu_set_cloexec(int fd);
> +void qemu_clr_cloexec(int fd);
>  
>  /* Starting on QEMU 2.5, qemu_hw_version() returns "2.5+" by default
>   * instead of QEMU_VERSION, so setting hw_version on MachineClass
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index d923674..28fee45 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -314,6 +314,15 @@ void qemu_set_cloexec(int fd)
>      assert(f != -1);
>  }
>  
> +void qemu_clr_cloexec(int fd)
> +{
> +    int f;
> +    f = fcntl(fd, F_GETFD);
> +    assert(f != -1);
> +    f = fcntl(fd, F_SETFD, f & ~FD_CLOEXEC);
> +    assert(f != -1);
> +}
> +
>  /*
>   * Creates a pipe with FD_CLOEXEC set on both file descriptors
>   */
> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index 7eedbe5..e5d0c7c 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -254,6 +254,10 @@ void qemu_set_cloexec(int fd)
>  {
>  }
>  
> +void qemu_clr_cloexec(int fd)
> +{
> +}
> +
>  /* Offset between 1/1/1601 and 1/1/1970 in 100 nanosec units */
>  #define _W32_FT_OFFSET (116444736000000000ULL)
>  
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 17/32] util: env var helpers
  2020-07-30 15:14 ` [PATCH V1 17/32] util: env var helpers Steve Sistare
@ 2020-09-11 19:00   ` Dr. David Alan Gilbert
  2020-09-24 21:52     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-11 19:00 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Add functions for saving fd's and ram extents in the environment via
> setenv, and for reading them back via getenv.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>

This is an awful lot of env stuff - how about dumping
all this stuff into a file and reloading it?

Dave
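
A rough sketch of the file-based alternative, under the assumption that the
state is a flat list of name/value pairs: write one "name=value" line per
entry and read the lines back after the restart.  StateEntry, state_save and
state_load are hypothetical names.  Values are parsed as full 64-bit
unsigned integers, which strtol in the quoted getenv_ulong cannot represent.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record: one named value to carry across the restart. */
typedef struct {
    char name[64];      /* must be non-empty and must not contain '=' */
    uint64_t val;
} StateEntry;

/* Write n entries to path, one "name=value" line each. 0 on success. */
static int state_save(const char *path, const StateEntry *e, size_t n)
{
    FILE *f;
    size_t i;

    assert(path && e);
    f = fopen(path, "w");
    if (!f) {
        return -1;
    }
    for (i = 0; i < n; i++) {
        if (fprintf(f, "%s=%" PRIu64 "\n", e[i].name, e[i].val) < 0) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Read up to max entries from path. Returns the count read, -1 on error. */
static int state_load(const char *path, StateEntry *e, size_t max)
{
    FILE *f;
    size_t n = 0;

    assert(path && e);
    f = fopen(path, "r");
    if (!f) {
        return -1;
    }
    /* The leading space skips the newline left by the previous line. */
    while (n < max &&
           fscanf(f, " %63[^=]=%" SCNu64, e[n].name, &e[n].val) == 2) {
        n++;
    }
    fclose(f);
    return (int)n;
}
```

File descriptors stored this way would still need FD_CLOEXEC cleared so
that they survive the exec.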

> ---
>  MAINTAINERS           |   7 +++
>  include/qemu/cutils.h |   1 +
>  include/qemu/env.h    |  31 ++++++++++++
>  util/Makefile.objs    |   2 +-
>  util/env.c            | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 172 insertions(+), 1 deletion(-)
>  create mode 100644 include/qemu/env.h
>  create mode 100644 util/env.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3395abd..8d377a7 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3115,3 +3115,10 @@ Performance Tools and Tests
>  M: Ahmed Karaman <ahmedkhaledkaraman@gmail.com>
>  S: Maintained
>  F: scripts/performance/
> +
> +Environment variable helpers
> +M: Steve Sistare <steven.sistare@oracle.com>
> +M: Mark Kanda <mark.kanda@oracle.com>
> +S: Maintained
> +F: include/qemu/env.h
> +F: util/env.c
> diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
> index eb59852..d4c7d70 100644
> --- a/include/qemu/cutils.h
> +++ b/include/qemu/cutils.h
> @@ -1,6 +1,7 @@
>  #ifndef QEMU_CUTILS_H
>  #define QEMU_CUTILS_H
>  
> +#include "qemu/env.h"
>  /**
>   * pstrcpy:
>   * @buf: buffer to copy string into
> diff --git a/include/qemu/env.h b/include/qemu/env.h
> new file mode 100644
> index 0000000..53cc121
> --- /dev/null
> +++ b/include/qemu/env.h
> @@ -0,0 +1,31 @@
> +/*
> + * Copyright (c) 2020 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef QEMU_ENV_H
> +#define QEMU_ENV_H
> +
> +#define FD_PREFIX "QEMU_FD_"
> +#define ADDR_PREFIX "QEMU_ADDR_"
> +#define LEN_PREFIX "QEMU_LEN_"
> +#define BOOL_PREFIX "QEMU_BOOL_"
> +
> +typedef int (*walkenv_cb)(const char *name, const char *val, void *handle);
> +
> +bool getenv_ram(const char *name, void **addrp, size_t *lenp);
> +void setenv_ram(const char *name, void *addr, size_t len);
> +void unsetenv_ram(const char *name);
> +int getenv_fd(const char *name);
> +void setenv_fd(const char *name, int fd);
> +void unsetenv_fd(const char *name);
> +bool getenv_bool(const char *name);
> +void setenv_bool(const char *name, bool val);
> +void unsetenv_bool(const char *name);
> +int walkenv(const char *prefix, walkenv_cb cb, void *handle);
> +void printenv(void);
> +
> +#endif
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index cc5e371..d357932 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -1,4 +1,4 @@
> -util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
> +util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o env.o
>  util-obj-$(call lnot,$(CONFIG_ATOMIC64)) += atomic64.o
>  util-obj-$(CONFIG_POSIX) += aio-posix.o
>  util-obj-$(CONFIG_POSIX) += fdmon-poll.o
> diff --git a/util/env.c b/util/env.c
> new file mode 100644
> index 0000000..0cc4a9f
> --- /dev/null
> +++ b/util/env.c
> @@ -0,0 +1,132 @@
> +/*
> + * Copyright (c) 2020 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/env.h"
> +
> +static uint64_t getenv_ulong(const char *prefix, const char *name, bool *found)
> +{
> +    char var[80], *val;
> +    uint64_t res;
> +
> +    snprintf(var, sizeof(var), "%s%s", prefix, name);
> +    val = getenv(var);
> +    if (val) {
> +        *found = true;
> +        res = strtol(val, 0, 10);
> +    } else {
> +        *found = false;
> +        res = 0;
> +    }
> +    return res;
> +}
> +
> +static void setenv_ulong(const char *prefix, const char *name, uint64_t val)
> +{
> +    char var[80], val_str[80];
> +    snprintf(var, sizeof(var), "%s%s", prefix, name);
> +    snprintf(val_str, sizeof(val_str), "%"PRIu64, val);
> +    setenv(var, val_str, 1);
> +}
> +
> +static void unsetenv_ulong(const char *prefix, const char *name)
> +{
> +    char var[80];
> +    snprintf(var, sizeof(var), "%s%s", prefix, name);
> +    unsetenv(var);
> +}
> +
> +bool getenv_ram(const char *name, void **addrp, size_t *lenp)
> +{
> +    bool found1, found2;
> +    *addrp = (void *) getenv_ulong(ADDR_PREFIX, name, &found1);
> +    *lenp = getenv_ulong(LEN_PREFIX, name, &found2);
> +    assert(found1 == found2);
> +    return found1;
> +}
> +
> +void setenv_ram(const char *name, void *addr, size_t len)
> +{
> +    setenv_ulong(ADDR_PREFIX, name, (uint64_t)addr);
> +    setenv_ulong(LEN_PREFIX, name, len);
> +}
> +
> +void unsetenv_ram(const char *name)
> +{
> +    unsetenv_ulong(ADDR_PREFIX, name);
> +    unsetenv_ulong(LEN_PREFIX, name);
> +}
> +
> +int getenv_fd(const char *name)
> +{
> +    bool found;
> +    int fd = getenv_ulong(FD_PREFIX, name, &found);
> +    if (!found) {
> +        fd = -1;
> +    }
> +    return fd;
> +}
> +
> +void setenv_fd(const char *name, int fd)
> +{
> +    setenv_ulong(FD_PREFIX, name, fd);
> +}
> +
> +void unsetenv_fd(const char *name)
> +{
> +    unsetenv_ulong(FD_PREFIX, name);
> +}
> +
> +bool getenv_bool(const char *name)
> +{
> +    bool found;
> +    bool val = getenv_ulong(BOOL_PREFIX, name, &found);
> +    if (!found) {
> +        val = -1;
> +    }
> +    return val;
> +}
> +
> +void setenv_bool(const char *name, bool val)
> +{
> +    setenv_ulong(BOOL_PREFIX, name, val);
> +}
> +
> +void unsetenv_bool(const char *name)
> +{
> +    unsetenv_ulong(BOOL_PREFIX, name);
> +}
> +
> +int walkenv(const char *prefix, walkenv_cb cb, void *handle)
> +{
> +    char *str, name[128];
> +    char **envp = environ;
> +    size_t prefix_len = strlen(prefix);
> +
> +    while (*envp) {
> +        str = *envp++;
> +        if (!strncmp(str, prefix, prefix_len)) {
> +            char *val = strchr(str, '=');
> +            str += prefix_len;
> +            strncpy(name, str, val - str);
> +            name[val - str] = 0;
> +            if (cb(name, val + 1, handle)) {
> +                return 1;
> +            }
> +        }
> +    }
> +    return 0;
> +}
> +
> +void printenv(void)
> +{
> +    char **ptr = environ;
> +    while (*ptr) {
> +        puts(*ptr++);
> +    }
> +}
> -- 
> 1.8.3.1
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 22/32] char: qio_channel_socket_accept reuse fd
  2020-07-30 15:14 ` [PATCH V1 22/32] char: qio_channel_socket_accept reuse fd Steve Sistare
@ 2020-09-15 17:33   ` Dr. David Alan Gilbert
  2020-09-15 17:53     ` Daniel P. Berrangé
  2020-09-24 21:54     ` Steven Sistare
  0 siblings, 2 replies; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-15 17:33 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> From: Mark Kanda <mark.kanda@oracle.com>
> 
> Add an fd argument to qio_channel_socket_accept.  If not -1, the channel
> uses that fd instead of accepting a new socket connection.  All callers
> pass -1 in this patch, so no functional change.

Doesn't some of this just come from the fact you're insisting on reusing
the command line?  We should be able to open a chardev on an fd,
shouldn't we?

Dave

> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  include/io/channel-socket.h    |  3 ++-
>  io/channel-socket.c            | 12 +++++++++---
>  io/net-listener.c              |  4 ++--
>  scsi/qemu-pr-helper.c          |  2 +-
>  tests/qtest/tpm-emu.c          |  2 +-
>  tests/test-char.c              |  2 +-
>  tests/test-io-channel-socket.c |  4 ++--
>  7 files changed, 18 insertions(+), 11 deletions(-)
> 
> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> index 777ff59..0ffc560 100644
> --- a/include/io/channel-socket.h
> +++ b/include/io/channel-socket.h
> @@ -248,6 +248,7 @@ qio_channel_socket_get_remote_address(QIOChannelSocket *ioc,
>  /**
>   * qio_channel_socket_accept:
>   * @ioc: the socket channel object
> + * @reuse_fd: fd to reuse; -1 otherwise
>   * @errp: pointer to a NULL-initialized error object
>   *
>   * If the socket represents a server, then this accepts
> @@ -258,7 +259,7 @@ qio_channel_socket_get_remote_address(QIOChannelSocket *ioc,
>   */
>  QIOChannelSocket *
>  qio_channel_socket_accept(QIOChannelSocket *ioc,
> -                          Error **errp);
> +                          int reuse_fd, Error **errp);
>  
>  
>  #endif /* QIO_CHANNEL_SOCKET_H */
> diff --git a/io/channel-socket.c b/io/channel-socket.c
> index e1b4667..dde12bf 100644
> --- a/io/channel-socket.c
> +++ b/io/channel-socket.c
> @@ -352,7 +352,7 @@ void qio_channel_socket_dgram_async(QIOChannelSocket *ioc,
>  
>  QIOChannelSocket *
>  qio_channel_socket_accept(QIOChannelSocket *ioc,
> -                          Error **errp)
> +                          int reuse_fd, Error **errp)
>  {
>      QIOChannelSocket *cioc;
>  
> @@ -362,8 +362,14 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
>  
>   retry:
>      trace_qio_channel_socket_accept(ioc);
> -    cioc->fd = qemu_accept(ioc->fd, (struct sockaddr *)&cioc->remoteAddr,
> -                           &cioc->remoteAddrLen);
> +
> +    if (reuse_fd != -1) {
> +        cioc->fd = reuse_fd;
> +    } else {
> +        cioc->fd = qemu_accept(ioc->fd, (struct sockaddr *)&cioc->remoteAddr,
> +                               &cioc->remoteAddrLen);
> +    }
> +
>      if (cioc->fd < 0) {
>          if (errno == EINTR) {
>              goto retry;
> diff --git a/io/net-listener.c b/io/net-listener.c
> index 5d8a226..bbdea1e 100644
> --- a/io/net-listener.c
> +++ b/io/net-listener.c
> @@ -45,7 +45,7 @@ static gboolean qio_net_listener_channel_func(QIOChannel *ioc,
>      QIOChannelSocket *sioc;
>  
>      sioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
> -                                     NULL);
> +                                     -1, NULL);
>      if (!sioc) {
>          return TRUE;
>      }
> @@ -194,7 +194,7 @@ static gboolean qio_net_listener_wait_client_func(QIOChannel *ioc,
>      QIOChannelSocket *sioc;
>  
>      sioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
> -                                     NULL);
> +                                     -1, NULL);
>      if (!sioc) {
>          return TRUE;
>      }
> diff --git a/scsi/qemu-pr-helper.c b/scsi/qemu-pr-helper.c
> index 57ad830..0e6d683 100644
> --- a/scsi/qemu-pr-helper.c
> +++ b/scsi/qemu-pr-helper.c
> @@ -800,7 +800,7 @@ static gboolean accept_client(QIOChannel *ioc, GIOCondition cond, gpointer opaqu
>      PRHelperClient *prh;
>  
>      cioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
> -                                     NULL);
> +                                     -1, NULL);
>      if (!cioc) {
>          return TRUE;
>      }
> diff --git a/tests/qtest/tpm-emu.c b/tests/qtest/tpm-emu.c
> index 2e8eb7b..19e5dab 100644
> --- a/tests/qtest/tpm-emu.c
> +++ b/tests/qtest/tpm-emu.c
> @@ -83,7 +83,7 @@ void *tpm_emu_ctrl_thread(void *data)
>      g_cond_signal(&s->data_cond);
>  
>      qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
> -    ioc = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
> +    ioc = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
>      g_assert(ioc);
>  
>      {
> diff --git a/tests/test-char.c b/tests/test-char.c
> index 614bdac..1bb6ae0 100644
> --- a/tests/test-char.c
> +++ b/tests/test-char.c
> @@ -884,7 +884,7 @@ char_socket_client_server_thread(gpointer data)
>      QIOChannelSocket *cioc;
>  
>  retry:
> -    cioc = qio_channel_socket_accept(ioc, &error_abort);
> +    cioc = qio_channel_socket_accept(ioc, -1, &error_abort);
>      g_assert_nonnull(cioc);
>  
>      if (char_socket_ping_pong(QIO_CHANNEL(cioc), NULL) != 0) {
> diff --git a/tests/test-io-channel-socket.c b/tests/test-io-channel-socket.c
> index d43083a..0d410cf 100644
> --- a/tests/test-io-channel-socket.c
> +++ b/tests/test-io-channel-socket.c
> @@ -75,7 +75,7 @@ static void test_io_channel_setup_sync(SocketAddress *listen_addr,
>      qio_channel_set_delay(*src, false);
>  
>      qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
> -    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
> +    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
>      g_assert(*dst);
>  
>      test_io_channel_set_socket_bufs(*src, *dst);
> @@ -143,7 +143,7 @@ static void test_io_channel_setup_async(SocketAddress *listen_addr,
>      g_assert(!data.err);
>  
>      qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
> -    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
> +    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
>      g_assert(*dst);
>  
>      qio_channel_set_delay(*src, false);
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 22/32] char: qio_channel_socket_accept reuse fd
  2020-09-15 17:33   ` Dr. David Alan Gilbert
@ 2020-09-15 17:53     ` Daniel P. Berrangé
  2020-09-24 21:54     ` Steven Sistare
  1 sibling, 0 replies; 118+ messages in thread
From: Daniel P. Berrangé @ 2020-09-15 17:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Michael S. Tsirkin, Philippe Mathieu-Daudé,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Steve Sistare, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Alex Bennée

On Tue, Sep 15, 2020 at 06:33:34PM +0100, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
> > From: Mark Kanda <mark.kanda@oracle.com>
> > 
> > Add an fd argument to qio_channel_socket_accept.  If not -1, the channel
> > uses that fd instead of accepting a new socket connection.  All callers
> > pass -1 in this patch, so no functional change.
> 
> Doesn't some of this just come from the fact you're insisting on reusing
> the command line?   We should be able to open a chardev on an fd
> shouldn't we?

Even ignoring that question, this patch looks pointless to me. The callers
have to be modified to pass in the FD to use instead of accepting a new
connection. Given that, you might as well just modify the callers to use
the FD immediately if valid and never call qio_channel_socket_accept at all.

ie

   if (reuse_fd)
      fd = reuse_fd;
   else
      fd = qio_channel_socket_accept(ioc...)
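
The caller-side choice above can be sketched in plain C.  This is only an
illustration, not code from the series: choose_client_fd() and fake_accept()
are hypothetical names, and the accept step is a stand-in callback.  In QEMU
the accept call returns a channel object rather than an fd, so a real caller
would presumably wrap the preserved fd with something like
qio_channel_socket_new_fd() instead.

```c
#include <stddef.h>

/* Hypothetical stand-in for "accept a new connection"; returns an fd or -1. */
typedef int (*accept_fn)(int listen_fd);

/*
 * Pick the fd the caller should use: the preserved one when it is valid,
 * otherwise whatever the accept callback produces.  The accept routine
 * itself needs no extra argument.
 */
static int choose_client_fd(int reuse_fd, int listen_fd, accept_fn accept_cb)
{
    if (reuse_fd >= 0) {
        return reuse_fd;
    }
    return accept_cb(listen_fd);
}

/* Illustrative accept stub so the sketch can be exercised without sockets. */
static int fake_accept(int listen_fd)
{
    return listen_fd + 100;
}
```

With this shape, choose_client_fd(7, 3, fake_accept) hands back the preserved
fd 7 and never calls the accept stub, while passing -1 falls through to it.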

> 
> Dave
> 
> > Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > ---
> >  include/io/channel-socket.h    |  3 ++-
> >  io/channel-socket.c            | 12 +++++++++---
> >  io/net-listener.c              |  4 ++--
> >  scsi/qemu-pr-helper.c          |  2 +-
> >  tests/qtest/tpm-emu.c          |  2 +-
> >  tests/test-char.c              |  2 +-
> >  tests/test-io-channel-socket.c |  4 ++--
> >  7 files changed, 18 insertions(+), 11 deletions(-)
> > 
> > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> > index 777ff59..0ffc560 100644
> > --- a/include/io/channel-socket.h
> > +++ b/include/io/channel-socket.h
> > @@ -248,6 +248,7 @@ qio_channel_socket_get_remote_address(QIOChannelSocket *ioc,
> >  /**
> >   * qio_channel_socket_accept:
> >   * @ioc: the socket channel object
> > + * @reuse_fd: fd to reuse; -1 otherwise
> >   * @errp: pointer to a NULL-initialized error object
> >   *
> >   * If the socket represents a server, then this accepts
> > @@ -258,7 +259,7 @@ qio_channel_socket_get_remote_address(QIOChannelSocket *ioc,
> >   */
> >  QIOChannelSocket *
> >  qio_channel_socket_accept(QIOChannelSocket *ioc,
> > -                          Error **errp);
> > +                          int reuse_fd, Error **errp);
> >  
> >  
> >  #endif /* QIO_CHANNEL_SOCKET_H */
> > diff --git a/io/channel-socket.c b/io/channel-socket.c
> > index e1b4667..dde12bf 100644
> > --- a/io/channel-socket.c
> > +++ b/io/channel-socket.c
> > @@ -352,7 +352,7 @@ void qio_channel_socket_dgram_async(QIOChannelSocket *ioc,
> >  
> >  QIOChannelSocket *
> >  qio_channel_socket_accept(QIOChannelSocket *ioc,
> > -                          Error **errp)
> > +                          int reuse_fd, Error **errp)
> >  {
> >      QIOChannelSocket *cioc;
> >  
> > @@ -362,8 +362,14 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
> >  
> >   retry:
> >      trace_qio_channel_socket_accept(ioc);
> > -    cioc->fd = qemu_accept(ioc->fd, (struct sockaddr *)&cioc->remoteAddr,
> > -                           &cioc->remoteAddrLen);
> > +
> > +    if (reuse_fd != -1) {
> > +        cioc->fd = reuse_fd;
> > +    } else {
> > +        cioc->fd = qemu_accept(ioc->fd, (struct sockaddr *)&cioc->remoteAddr,
> > +                               &cioc->remoteAddrLen);
> > +    }
> > +
> >      if (cioc->fd < 0) {
> >          if (errno == EINTR) {
> >              goto retry;
> > diff --git a/io/net-listener.c b/io/net-listener.c
> > index 5d8a226..bbdea1e 100644
> > --- a/io/net-listener.c
> > +++ b/io/net-listener.c
> > @@ -45,7 +45,7 @@ static gboolean qio_net_listener_channel_func(QIOChannel *ioc,
> >      QIOChannelSocket *sioc;
> >  
> >      sioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
> > -                                     NULL);
> > +                                     -1, NULL);
> >      if (!sioc) {
> >          return TRUE;
> >      }
> > @@ -194,7 +194,7 @@ static gboolean qio_net_listener_wait_client_func(QIOChannel *ioc,
> >      QIOChannelSocket *sioc;
> >  
> >      sioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
> > -                                     NULL);
> > +                                     -1, NULL);
> >      if (!sioc) {
> >          return TRUE;
> >      }
> > diff --git a/scsi/qemu-pr-helper.c b/scsi/qemu-pr-helper.c
> > index 57ad830..0e6d683 100644
> > --- a/scsi/qemu-pr-helper.c
> > +++ b/scsi/qemu-pr-helper.c
> > @@ -800,7 +800,7 @@ static gboolean accept_client(QIOChannel *ioc, GIOCondition cond, gpointer opaqu
> >      PRHelperClient *prh;
> >  
> >      cioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
> > -                                     NULL);
> > +                                     -1, NULL);
> >      if (!cioc) {
> >          return TRUE;
> >      }
> > diff --git a/tests/qtest/tpm-emu.c b/tests/qtest/tpm-emu.c
> > index 2e8eb7b..19e5dab 100644
> > --- a/tests/qtest/tpm-emu.c
> > +++ b/tests/qtest/tpm-emu.c
> > @@ -83,7 +83,7 @@ void *tpm_emu_ctrl_thread(void *data)
> >      g_cond_signal(&s->data_cond);
> >  
> >      qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
> > -    ioc = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
> > +    ioc = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
> >      g_assert(ioc);
> >  
> >      {
> > diff --git a/tests/test-char.c b/tests/test-char.c
> > index 614bdac..1bb6ae0 100644
> > --- a/tests/test-char.c
> > +++ b/tests/test-char.c
> > @@ -884,7 +884,7 @@ char_socket_client_server_thread(gpointer data)
> >      QIOChannelSocket *cioc;
> >  
> >  retry:
> > -    cioc = qio_channel_socket_accept(ioc, &error_abort);
> > +    cioc = qio_channel_socket_accept(ioc, -1, &error_abort);
> >      g_assert_nonnull(cioc);
> >  
> >      if (char_socket_ping_pong(QIO_CHANNEL(cioc), NULL) != 0) {
> > diff --git a/tests/test-io-channel-socket.c b/tests/test-io-channel-socket.c
> > index d43083a..0d410cf 100644
> > --- a/tests/test-io-channel-socket.c
> > +++ b/tests/test-io-channel-socket.c
> > @@ -75,7 +75,7 @@ static void test_io_channel_setup_sync(SocketAddress *listen_addr,
> >      qio_channel_set_delay(*src, false);
> >  
> >      qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
> > -    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
> > +    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
> >      g_assert(*dst);
> >  
> >      test_io_channel_set_socket_bufs(*src, *dst);
> > @@ -143,7 +143,7 @@ static void test_io_channel_setup_async(SocketAddress *listen_addr,
> >      g_assert(!data.err);
> >  
> >      qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
> > -    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
> > +    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
> >      g_assert(*dst);
> >  
> >      qio_channel_set_delay(*src, false);
> > -- 
> > 1.8.3.1
> > 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH V1 32/32] vfio-pci: improved tracing
  2020-07-30 15:14 ` [PATCH V1 32/32] vfio-pci: improved tracing Steve Sistare
@ 2020-09-15 18:49   ` Dr. David Alan Gilbert
  2020-09-24 21:52     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-15 18:49 UTC (permalink / raw)
  To: Steve Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steve Sistare (steven.sistare@oracle.com) wrote:
> Print more info for existing trace points:
>   trace_kvm_irqchip_add_msi_route.
>   trace_pci_update_mappings_del
>   trace_pci_update_mappings_add
> 
> Add new trace points:
>   trace_kvm_irqchip_assign_irqfd
>   trace_msix_table_mmio_write
>   trace_vfio_dma_unmap
>   trace_vfio_dma_map
>   trace_vfio_region
>   trace_vfio_descriptors
>   trace_ram_block_add
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>

Why don't you split this out into a separate patch by itself?  If it is
generally useful extra tracing, it can just go in.

Note that you've also added a new warning in vfio_dma_unmap.

Dave

> ---
>  accel/kvm/kvm-all.c    |  8 ++++++--
>  accel/kvm/trace-events |  3 ++-
>  exec.c                 |  3 +++
>  hw/pci/msix.c          |  1 +
>  hw/pci/pci.c           | 10 ++++++----
>  hw/pci/trace-events    |  5 +++--
>  hw/vfio/common.c       | 16 +++++++++++++++-
>  hw/vfio/pci.c          |  1 +
>  hw/vfio/trace-events   |  9 ++++++---
>  trace-events           |  2 ++
>  10 files changed, 45 insertions(+), 13 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 63ef6af..5511ea7 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -46,6 +46,7 @@
>  #include "sysemu/reset.h"
>  
>  #include "hw/boards.h"
> +#include "trace-root.h"
>  
>  /* This check must be after config-host.h is included */
>  #ifdef CONFIG_EVENTFD
> @@ -1670,7 +1671,7 @@ int kvm_irqchip_add_msi_route(KVMState *s, int vector, PCIDevice *dev)
>      }
>  
>      trace_kvm_irqchip_add_msi_route(dev ? dev->name : (char *)"N/A",
> -                                    vector, virq);
> +                                    vector, virq, msg.address, msg.data);
>  
>      kvm_add_routing_entry(s, &kroute);
>      kvm_arch_add_msi_route_post(&kroute, vector, dev);
> @@ -1717,6 +1718,7 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
>  {
>      int fd = event_notifier_get_fd(event);
>      int rfd = resample ? event_notifier_get_fd(resample) : -1;
> +    int ret;
>  
>      struct kvm_irqfd irqfd = {
>          .fd = fd,
> @@ -1758,7 +1760,9 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
>          return -ENOSYS;
>      }
>  
> -    return kvm_vm_ioctl(s, KVM_IRQFD, &irqfd);
> +    ret = kvm_vm_ioctl(s, KVM_IRQFD, &irqfd);
> +    trace_kvm_irqchip_assign_irqfd(fd, virq, rfd, ret);
> +    return ret;
>  }
>  
>  int kvm_irqchip_add_adapter_route(KVMState *s, AdapterInfo *adapter)
> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
> index a68eb66..67a01e6 100644
> --- a/accel/kvm/trace-events
> +++ b/accel/kvm/trace-events
> @@ -9,7 +9,8 @@ kvm_device_ioctl(int fd, int type, void *arg) "dev fd %d, type 0x%x, arg %p"
>  kvm_failed_reg_get(uint64_t id, const char *msg) "Warning: Unable to retrieve ONEREG %" PRIu64 " from KVM: %s"
>  kvm_failed_reg_set(uint64_t id, const char *msg) "Warning: Unable to set ONEREG %" PRIu64 " to KVM: %s"
>  kvm_irqchip_commit_routes(void) ""
> -kvm_irqchip_add_msi_route(char *name, int vector, int virq) "dev %s vector %d virq %d"
> +kvm_irqchip_add_msi_route(char *name, int vector, int virq, uint64_t addr, uint32_t data) "%s, vector %d, virq %d, msg {addr 0x%"PRIx64", data 0x%x}"
> +kvm_irqchip_assign_irqfd(int fd, int virq, int rfd, int status) "(fd=%d, virq=%d, rfd=%d) KVM_IRQFD returns %d"
>  kvm_irqchip_update_msi_route(int virq) "Updating MSI route virq=%d"
>  kvm_irqchip_release_virq(int virq) "virq %d"
>  kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%" PRIx64 " val=0x%x assign: %d size: %d match: %d"
> diff --git a/exec.c b/exec.c
> index 5473c09..dd99ee0 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2319,6 +2319,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
>          }
>          ram_block_notify_add(new_block->host, new_block->max_length);
>      }
> +    trace_ram_block_add(new_block->host, new_block->max_length,
> +                        memory_region_name(new_block->mr),
> +                        new_block->mr->readonly ? "ro" : "rw");
>  }
>  
>  #ifdef CONFIG_POSIX
> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
> index 67e34f3..65a2882 100644
> --- a/hw/pci/msix.c
> +++ b/hw/pci/msix.c
> @@ -189,6 +189,7 @@ static void msix_table_mmio_write(void *opaque, hwaddr addr,
>      int vector = addr / PCI_MSIX_ENTRY_SIZE;
>      bool was_masked;
>  
> +    trace_msix_table_mmio_write(dev->name, addr, val, size);
>      was_masked = msix_is_masked(dev, vector);
>      pci_set_long(dev->msix_table + addr, val);
>      msix_handle_mask_update(dev, vector, was_masked);
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index c2e1509..6142411 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -1324,9 +1324,11 @@ void pci_update_mappings(PCIDevice *d)
>      PCIIORegion *r;
>      int i;
>      pcibus_t new_addr;
> +    const char *name;
>  
>      for(i = 0; i < PCI_NUM_REGIONS; i++) {
>          r = &d->io_regions[i];
> +        name = r->memory ? r->memory->name : "";
>  
>          /* this region isn't registered */
>          if (!r->size)
> @@ -1340,18 +1342,18 @@ void pci_update_mappings(PCIDevice *d)
>  
>          /* now do the real mapping */
>          if (r->addr != PCI_BAR_UNMAPPED) {
> -            trace_pci_update_mappings_del(d, pci_dev_bus_num(d),
> +            trace_pci_update_mappings_del(d->name, pci_dev_bus_num(d),
>                                            PCI_SLOT(d->devfn),
>                                            PCI_FUNC(d->devfn),
> -                                          i, r->addr, r->size);
> +                                          i, r->addr, r->size, name);
>              memory_region_del_subregion(r->address_space, r->memory);
>          }
>          r->addr = new_addr;
>          if (r->addr != PCI_BAR_UNMAPPED) {
> -            trace_pci_update_mappings_add(d, pci_dev_bus_num(d),
> +            trace_pci_update_mappings_add(d->name, pci_dev_bus_num(d),
>                                            PCI_SLOT(d->devfn),
>                                            PCI_FUNC(d->devfn),
> -                                          i, r->addr, r->size);
> +                                          i, r->addr, r->size, name);
>              memory_region_add_subregion_overlap(r->address_space,
>                                                  r->addr, r->memory, 1);
>          }
> diff --git a/hw/pci/trace-events b/hw/pci/trace-events
> index def4b39..6dd7015 100644
> --- a/hw/pci/trace-events
> +++ b/hw/pci/trace-events
> @@ -1,8 +1,8 @@
>  # See docs/devel/tracing.txt for syntax documentation.
>  
>  # pci.c
> -pci_update_mappings_del(void *d, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
> -pci_update_mappings_add(void *d, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
> +pci_update_mappings_del(const char *dname, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size, const char *name) "%s %02x:%02x.%x [%d] 0x%"PRIx64", 0x%"PRIx64"B \"%s\""
> +pci_update_mappings_add(const char *dname, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size, const char *name) "%s %02x:%02x.%x [%d] 0x%"PRIx64", 0x%"PRIx64"B \"%s\""
>  
>  # pci_host.c
>  pci_cfg_read(const char *dev, unsigned devid, unsigned fnid, unsigned offs, unsigned val) "%s %02u:%u @0x%x -> 0x%x"
> @@ -10,3 +10,4 @@ pci_cfg_write(const char *dev, unsigned devid, unsigned fnid, unsigned offs, uns
>  
>  # msix.c
>  msix_write_config(char *name, bool enabled, bool masked) "dev %s enabled %d masked %d"
> +msix_table_mmio_write(char *name, uint64_t addr, uint64_t val, unsigned size)  "(%s, @%"PRId64", 0x%"PRIx64", %dB)"
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index a51a093..23c8bf3 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -304,6 +304,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          return 0;
>      }
>  
> +    trace_vfio_dma_unmap(container->fd, iova, size);
> +
>      while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>          /*
>           * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
> @@ -327,6 +329,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          return -errno;
>      }
>  
> +    if (unmap.size != size) {
> +        error_printf("warn: VFIO_UNMAP_DMA(0x%lx, 0x%lx) only unmaps 0x%llx",
> +                     iova, size, unmap.size);
> +    }
> +
>      return 0;
>  }
>  
> @@ -345,6 +352,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>          return 0;
>      }
>  
> +    trace_vfio_dma_map(container->fd, iova, size, vaddr,
> +                       (readonly ? "r" : "rw"));
> +
>      if (!readonly) {
>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>      }
> @@ -985,7 +995,8 @@ int vfio_region_mmap(VFIORegion *region)
>          trace_vfio_region_mmap(memory_region_name(&region->mmaps[i].mem),
>                                 region->mmaps[i].offset,
>                                 region->mmaps[i].offset +
> -                               region->mmaps[i].size - 1);
> +                               region->mmaps[i].size - 1,
> +                               region->mmaps[i].mmap);
>      }
>  
>      return 0;
> @@ -1696,6 +1707,9 @@ retry:
>          goto retry;
>      }
>  
> +    trace_vfio_region(vbasedev->name, index, (*info)->offset, (*info)->size,
> +                      (*info)->cap_offset, (*info)->flags);
> +
>      return 0;
>  }
>  
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index f72e277..d74e078 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -41,6 +41,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
> +#include "trace-root.h"
>  
>  #define TYPE_VFIO_PCI "vfio-pci"
>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 10d899c..83cd0a6 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -25,7 +25,7 @@ vfio_pci_size_rom(const char *name, int size) "%s ROM size 0x%x"
>  vfio_vga_write(uint64_t addr, uint64_t data, int size) " (0x%"PRIx64", 0x%"PRIx64", %d)"
>  vfio_vga_read(uint64_t addr, int size, uint64_t data) " (0x%"PRIx64", %d) = 0x%"PRIx64
>  vfio_pci_read_config(const char *name, int addr, int len, int val) " (%s, @0x%x, len=0x%x) 0x%x"
> -vfio_pci_write_config(const char *name, int addr, int val, int len) " (%s, @0x%x, 0x%x, len=0x%x)"
> +vfio_pci_write_config(const char *name, int addr, int val, int len) "(%s, @0x%x, 0x%x, 0x%xB)"
>  vfio_msi_setup(const char *name, int pos) "%s PCI MSI CAP @0x%x"
>  vfio_msix_early_setup(const char *name, int pos, int table_bar, int offset, int entries) "%s PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, entries %d"
>  vfio_check_pcie_flr(const char *name) "%s Supports FLR via PCIe cap"
> @@ -37,7 +37,7 @@ vfio_pci_hot_reset_dep_devices(int domain, int bus, int slot, int function, int
>  vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot reset: %s"
>  vfio_populate_device_config(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
>  vfio_populate_device_get_irq_info_failure(const char *errstr) "VFIO_DEVICE_GET_IRQ_INFO failure: %s"
> -vfio_realize(const char *name, int group_id) " (%s) group %d"
> +vfio_realize(const char *name, int group_id) "(%s) group %d"
>  vfio_mdev(const char *name, bool is_mdev) " (%s) is_mdev %d"
>  vfio_add_ext_cap_dropped(const char *name, uint16_t cap, uint16_t offset) "%s 0x%x@0x%x"
>  vfio_pci_reset(const char *name) " (%s)"
> @@ -109,7 +109,7 @@ vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions,
>  vfio_put_base_device(int fd) "close vdev->fd=%d"
>  vfio_region_setup(const char *dev, int index, const char *name, unsigned long flags, unsigned long offset, unsigned long size) "Device %s, region %d \"%s\", flags: 0x%lx, offset: 0x%lx, size: 0x%lx"
>  vfio_region_mmap_fault(const char *name, int index, unsigned long offset, unsigned long size, int fault) "Region %s mmaps[%d], [0x%lx - 0x%lx], fault: %d"
> -vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Region %s [0x%lx - 0x%lx]"
> +vfio_region_mmap(const char *name, unsigned long offset, unsigned long end, void *addr) "%s [0x%lx - 0x%lx] maps to %p"
>  vfio_region_exit(const char *name, int index) "Device %s, region %d"
>  vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
> @@ -117,6 +117,9 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>  vfio_dma_unmap_overflow_workaround(void) ""
> +vfio_dma_unmap(int fd, uint64_t iova, uint64_t size) "fd %d, iova 0x%"PRIx64", len 0x%"PRIx64
> +vfio_dma_map(int fd, uint64_t iova, uint64_t size, void *addr, const char *access) "fd %d, iova 0x%"PRIx64", len 0x%"PRIx64", va %p, %s"
> +vfio_region(const char *name, int index, uint64_t offset, uint64_t size, int cap_offset, int flags) "%s [%d]: +0x%"PRIx64", 0x%"PRIx64"B, cap +0x%x, flags 0x%x"
>  
>  # platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> diff --git a/trace-events b/trace-events
> index 42107eb..98589a4 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -107,6 +107,8 @@ qmp_job_complete(void *job) "job %p"
>  qmp_job_finalize(void *job) "job %p"
>  qmp_job_dismiss(void *job) "job %p"
>  
> +# exec.c
> +ram_block_add(void *host, uint64_t maxlen, const char *name, const char *mode) "host=%p, maxlen=0x%"PRIx64", mr = {name=%s, %s}"
>  
>  ### Guest events, keep at bottom
>  
> -- 
> 1.8.3.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 11/32] cpu: disable ticks when suspended
  2020-09-11 17:53   ` Dr. David Alan Gilbert
@ 2020-09-24 20:42     ` Steven Sistare
  2020-09-25  9:03       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 20:42 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/11/2020 1:53 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> After cprload, the guest console misbehaves.  You must type 8 characters
>> before any are echoed to the terminal.  Qemu was not sending interrupts
>> to the guest because the QEMU_CLOCK_VIRTUAL timers_state.cpu_clock_offset
>> was bad.  The offset is usually updated at cprsave time by the path
>>
>>   save_cpr_snapshot()
>>     vm_stop()
>>       do_vm_stop()
>>         if (runstate_is_running())
>>           cpu_disable_ticks();
>>             timers_state.cpu_clock_offset = cpu_get_clock_locked();
>>
>> However, if the guest is in RUN_STATE_SUSPENDED, then cpu_disable_ticks is
>> not called.  Further, the earlier transition to suspended in
>> qemu_system_suspend did not disable ticks.  To fix, call cpu_disable_ticks
>> from save_cpr_snapshot.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Are you saying this is really a more generic bug with migrating when
> suspended and we should fix this anyway?

Yes.  The same bug occurs when the guest is suspended and save_vmstate() or
qmp_xen_save_devices_state() is called.  Each of those functions needs the
same fix unless someone identifies a more centralized way in the state
transition logic to disable ticks.
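
To illustrate why the offset must be captured before saving, here is a
simplified model of an offset-based virtual clock.  It is a sketch under
stated assumptions, not the QEMU implementation: struct vclock and the
vclock_* helpers are hypothetical names, and the host time is passed in
explicitly so the behavior is easy to follow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: while ticks are enabled, value = offset + host time. */
struct vclock {
    int64_t offset;
    bool ticks_enabled;
};

static int64_t vclock_get(const struct vclock *c, int64_t host_now)
{
    return c->ticks_enabled ? c->offset + host_now : c->offset;
}

/* Freeze the clock: fold the current value into offset so it can be saved. */
static void vclock_disable_ticks(struct vclock *c, int64_t host_now)
{
    if (c->ticks_enabled) {
        c->offset = vclock_get(c, host_now);
        c->ticks_enabled = false;
    }
}

/* Resume the clock relative to whatever the host time is now. */
static void vclock_enable_ticks(struct vclock *c, int64_t host_now)
{
    if (!c->ticks_enabled) {
        c->offset -= host_now;
        c->ticks_enabled = true;
    }
}
```

In this model, an offset saved while ticks are still enabled is only
meaningful relative to the old host clock.  Disabling ticks first turns it
into the frozen guest-visible value, which stays valid after a restart.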

- Steve

>> ---
>>  migration/savevm.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index f101039..00f493b 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2729,6 +2729,11 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
>>          return;
>>      }
>>  
>> +    /* Update timers_state before saving.  Suspend did not so do. */
>> +    if (runstate_check(RUN_STATE_SUSPENDED)) {
>> +        cpu_disable_ticks();
>> +    }
>> +
>>      vm_stop(RUN_STATE_SAVE_VM);
>>  
>>      ret = qemu_savevm_state(f, op, errp);
>> -- 
>> 1.8.3.1
>>



* Re: [PATCH V1 01/32] savevm: add vmstate handler iterators
  2020-09-11 16:24   ` Dr. David Alan Gilbert
@ 2020-09-24 21:43     ` Steven Sistare
  2020-09-25  9:07       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/11/2020 12:24 PM, Dr. David Alan Gilbert wrote:
> Apologies for taking a while to get around to this, 
> 
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Provide the SAVEVM_FOREACH and SAVEVM_FORALL macros to loop over all save
>> VM state handlers.  The former will filter handlers based on the operation
>> in the later patch "savevm: VM handlers mode mask".  The latter loops over
>> all handlers.
>>
>> No functional change.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  migration/savevm.c | 57 ++++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 38 insertions(+), 19 deletions(-)
>>
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 45c9dd9..a07fcad 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -266,6 +266,25 @@ static SaveState savevm_state = {
>>      .global_section_id = 0,
>>  };
>>  
>> +/*
>> + * The FOREACH macros will filter handlers based on the current operation when
>> + * additional conditions are added in a subsequent patch.
>> + */
>> +
>> +#define SAVEVM_FOREACH(se, entry)                                    \
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
>> +
>> +#define SAVEVM_FOREACH_SAFE(se, entry, new_se)                       \
>> +    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)   \
>> +
>> +/* The FORALL macros unconditionally loop over all handlers. */
>> +
>> +#define SAVEVM_FORALL(se, entry)                                     \
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
>> +
>> +#define SAVEVM_FORALL_SAFE(se, entry, new_se)                        \
>> +    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)
>> +
> 
> OK, can I ask you to merge this with the next patch but to spin it the
> other way, so that we have:
> 
>   SAVEVM_FOR(se, entry, mask)
> 
> and the places you use SAVEVM_FORALL_SAFE would become
> 
>   SAVEVM_FOR(se, entry, VMS_MODE_ALL)
> 
> I'm thinking at some point in the future we could merge a bunch of the
> other flag checks in there.

Sure.  Is this what you have in mind?

#define SAVEVM_FOR(se, entry, mask)                    \
    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)  \
        if (savevm_state.mode & (mask))

#define SAVEVM_FOR_SAFE(se, entry, new_se, mask)                    \
    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)  \
        if (savevm_state.mode & (mask))

Callers:
  SAVEVM_FOR(se, entry, mode_mask(se))
  SAVEVM_FOR(se, entry, VMS_MODE_ALL)
  SAVEVM_FOR_SAFE(se, entry, new_se, mode_mask(se))
  SAVEVM_FOR_SAFE(se, entry, new_se, VMS_MODE_ALL)
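
For reference, the loop-plus-filter macro pattern can be tried outside QEMU
with a small self-contained sketch.  Everything here is hypothetical and only
mirrors the shape of the proposal: struct handler, current_mode, the MODE_*
values and HANDLER_FOR are illustrative names, and a hand-rolled list stands
in for QTAILQ.

```c
#include <stddef.h>

/* Hypothetical mode bits; MODE_ALL matches every mode. */
enum { MODE_SNAPSHOT = 1, MODE_CPR = 2, MODE_ALL = 3 };

struct handler {
    const char *name;
    unsigned modes;            /* modes in which this handler takes part */
    struct handler *next;
};

/* Stand-in for savevm_state.mode: the operation currently in progress. */
static unsigned current_mode = MODE_CPR;

/* Visit each handler, running the body only when the mask matches. */
#define HANDLER_FOR(h, head, mask)                    \
    for ((h) = (head); (h) != NULL; (h) = (h)->next)  \
        if (current_mode & (mask))

/* Count handlers: all of them, or only those matching the current mode. */
static int count_handlers(struct handler *head, int all)
{
    struct handler *h;
    int n = 0;

    if (all) {
        HANDLER_FOR(h, head, MODE_ALL) {
            n++;
        }
    } else {
        HANDLER_FOR(h, head, h->modes) {
            n++;
        }
    }
    return n;
}
```

One caveat with this shape: because the macro ends in a bare "if", an "else"
written directly after the loop body would bind to the macro's hidden "if",
so callers should always use braces and avoid a trailing "else".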

- Steve

>>  static bool should_validate_capability(int capability)
>>  {
>>      assert(capability >= 0 && capability < MIGRATION_CAPABILITY__MAX);
>> @@ -673,7 +692,7 @@ static uint32_t calculate_new_instance_id(const char *idstr)
>>      SaveStateEntry *se;
>>      uint32_t instance_id = 0;
>>  
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FORALL(se, entry) {
>>          if (strcmp(idstr, se->idstr) == 0
>>              && instance_id <= se->instance_id) {
>>              instance_id = se->instance_id + 1;
>> @@ -689,7 +708,7 @@ static int calculate_compat_instance_id(const char *idstr)
>>      SaveStateEntry *se;
>>      int instance_id = 0;
>>  
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FORALL(se, entry) {
>>          if (!se->compat) {
>>              continue;
>>          }
>> @@ -803,7 +822,7 @@ void unregister_savevm(VMStateIf *obj, const char *idstr, void *opaque)
>>      }
>>      pstrcat(id, sizeof(id), idstr);
>>  
>> -    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
>> +    SAVEVM_FORALL_SAFE(se, entry, new_se) {
>>          if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
>>              savevm_state_handler_remove(se);
>>              g_free(se->compat);
>> @@ -867,7 +886,7 @@ void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
>>  {
>>      SaveStateEntry *se, *new_se;
>>  
>> -    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
>> +    SAVEVM_FORALL_SAFE(se, entry, new_se) {
>>          if (se->vmsd == vmsd && se->opaque == opaque) {
>>              savevm_state_handler_remove(se);
>>              g_free(se->compat);
>> @@ -1119,7 +1138,7 @@ bool qemu_savevm_state_blocked(Error **errp)
>>  {
>>      SaveStateEntry *se;
>>  
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FORALL(se, entry) {
>>          if (se->vmsd && se->vmsd->unmigratable) {
>>              error_setg(errp, "State blocked by non-migratable device '%s'",
>>                         se->idstr);
>> @@ -1145,7 +1164,7 @@ bool qemu_savevm_state_guest_unplug_pending(void)
>>  {
>>      SaveStateEntry *se;
>>  
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          if (se->vmsd && se->vmsd->dev_unplug_pending &&
>>              se->vmsd->dev_unplug_pending(se->opaque)) {
>>              return true;
>> @@ -1162,7 +1181,7 @@ void qemu_savevm_state_setup(QEMUFile *f)
>>      int ret;
>>  
>>      trace_savevm_state_setup();
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          if (!se->ops || !se->ops->save_setup) {
>>              continue;
>>          }
>> @@ -1193,7 +1212,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
>>  
>>      trace_savevm_state_resume_prepare();
>>  
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          if (!se->ops || !se->ops->resume_prepare) {
>>              continue;
>>          }
>> @@ -1223,7 +1242,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
>>      int ret = 1;
>>  
>>      trace_savevm_state_iterate();
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          if (!se->ops || !se->ops->save_live_iterate) {
>>              continue;
>>          }
>> @@ -1291,7 +1310,7 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f)
>>      SaveStateEntry *se;
>>      int ret;
>>  
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          if (!se->ops || !se->ops->save_live_complete_postcopy) {
>>              continue;
>>          }
>> @@ -1324,7 +1343,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
>>      SaveStateEntry *se;
>>      int ret;
>>  
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          if (!se->ops ||
>>              (in_postcopy && se->ops->has_postcopy &&
>>               se->ops->has_postcopy(se->opaque)) ||
>> @@ -1366,7 +1385,7 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>>      vmdesc = qjson_new();
>>      json_prop_int(vmdesc, "page_size", qemu_target_page_size());
>>      json_start_array(vmdesc, "devices");
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>  
>>          if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
>>              continue;
>> @@ -1476,7 +1495,7 @@ void qemu_savevm_state_pending(QEMUFile *f, uint64_t threshold_size,
>>      *res_postcopy_only = 0;
>>  
>>  
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          if (!se->ops || !se->ops->save_live_pending) {
>>              continue;
>>          }
>> @@ -1501,7 +1520,7 @@ void qemu_savevm_state_cleanup(void)
>>      }
>>  
>>      trace_savevm_state_cleanup();
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          if (se->ops && se->ops->save_cleanup) {
>>              se->ops->save_cleanup(se->opaque);
>>          }
>> @@ -1580,7 +1599,7 @@ int qemu_save_device_state(QEMUFile *f)
>>      }
>>      cpu_synchronize_all_states();
>>  
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          int ret;
>>  
>>          if (se->is_ram) {
>> @@ -1612,7 +1631,7 @@ static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id)
>>  {
>>      SaveStateEntry *se;
>>  
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FORALL(se, entry) {
>>          if (!strcmp(se->idstr, idstr) &&
>>              (instance_id == se->instance_id ||
>>               instance_id == se->alias_id))
>> @@ -2334,7 +2353,7 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
>>      }
>>  
>>      trace_qemu_loadvm_state_section_partend(section_id);
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          if (se->load_section_id == section_id) {
>>              break;
>>          }
>> @@ -2400,7 +2419,7 @@ static int qemu_loadvm_state_setup(QEMUFile *f)
>>      int ret;
>>  
>>      trace_loadvm_state_setup();
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          if (!se->ops || !se->ops->load_setup) {
>>              continue;
>>          }
>> @@ -2425,7 +2444,7 @@ void qemu_loadvm_state_cleanup(void)
>>      SaveStateEntry *se;
>>  
>>      trace_loadvm_state_cleanup();
>> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +    SAVEVM_FOREACH(se, entry) {
>>          if (se->ops && se->ops->load_cleanup) {
>>              se->ops->load_cleanup(se->opaque);
>>          }
>> -- 
>> 1.8.3.1
>>



* Re: [PATCH V1 04/32] savevm: HMP Command for cprsave
  2020-09-11 16:57   ` Dr. David Alan Gilbert
@ 2020-09-24 21:44     ` Steven Sistare
  2020-09-25  9:26       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/11/2020 12:57 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Enable HMP access to the cprsave QMP command.
>>
>> Usage: cprsave <filename> <mode>
>>
>> Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> I realise that the current mode is currently only 'reboot' - can you
> please give us a clue as to why you've got a mode argument that's
> currently only got one mode?

Patch 14 adds the restart mode.
I factored the patches by capability to make the review easier, first
presenting the reboot patches, then the restart patches.

- Steve

>> ---
>>  hmp-commands.hx       | 18 ++++++++++++++++++
>>  include/monitor/hmp.h |  1 +
>>  monitor/hmp-cmds.c    | 10 ++++++++++
>>  3 files changed, 29 insertions(+)
>>
>> diff --git a/hmp-commands.hx b/hmp-commands.hx
>> index 60f395c..c8defd9 100644
>> --- a/hmp-commands.hx
>> +++ b/hmp-commands.hx
>> @@ -354,6 +354,24 @@ SRST
>>  ERST
>>  
>>      {
>> +        .name       = "cprsave",
>> +        .args_type  = "file:s,mode:s",
>> +        .params     = "file 'reboot'",
>> +        .help       = "create a checkpoint of the VM in file",
>> +        .cmd        = hmp_cprsave,
>> +    },
>> +
>> +SRST
>> +``cprsave`` *tag*
>> +  Stop VCPUs, create a checkpoint of the whole virtual machine and save it
>> +  in *file*.
>> +  If *mode* is 'reboot', the checkpoint can be cprload'ed after a host kexec
>> +  reboot.
>> +  exec() /usr/bin/qemu-exec if it exists, else exec /usr/bin/qemu-system-x86_64,
>> +  passing all the original command line arguments.  The VCPUs remain paused.
>> +ERST
>> +
>> +    {
>>          .name       = "delvm",
>>          .args_type  = "name:s",
>>          .params     = "tag",
>> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
>> index c986cfd..af8ee23 100644
>> --- a/include/monitor/hmp.h
>> +++ b/include/monitor/hmp.h
>> @@ -59,6 +59,7 @@ void hmp_balloon(Monitor *mon, const QDict *qdict);
>>  void hmp_loadvm(Monitor *mon, const QDict *qdict);
>>  void hmp_savevm(Monitor *mon, const QDict *qdict);
>>  void hmp_delvm(Monitor *mon, const QDict *qdict);
>> +void hmp_cprsave(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_incoming(Monitor *mon, const QDict *qdict);
>> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
>> index ae4b6a4..59196ed 100644
>> --- a/monitor/hmp-cmds.c
>> +++ b/monitor/hmp-cmds.c
>> @@ -1139,6 +1139,16 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
>>      qapi_free_AnnounceParameters(params);
>>  }
>>  
>> +void hmp_cprsave(Monitor *mon, const QDict *qdict)
>> +{
>> +    Error *err = NULL;
>> +
>> +    qmp_cprsave(qdict_get_try_str(qdict, "file"),
>> +                qdict_get_try_str(qdict, "mode"),
>> +                &err);
>> +    hmp_handle_error(mon, err);
>> +}
>> +
>>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict)
>>  {
>>      qmp_migrate_cancel(NULL);
>> -- 
>> 1.8.3.1
>>



* Re: [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart
  2020-09-11 18:44   ` Dr. David Alan Gilbert
@ 2020-09-24 21:44     ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/11/2020 2:44 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Add the VMS_RESTART variant of vmstate, for use when upgrading qemu in place
>> on the same host without a reboot.  Invoke it using:
>>   cprsave <filename> restart
>>
>> VMS_RESTART supports guest ram mapped by private anonymous memory, versus
>> VMS_REBOOT which requires that guest ram be mapped by persistent shared
>> memory.  Subsequent patches complete its implementation.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> You should find with the enum like Eric suggests this mostly disappears;
> but also you might want to put it after the patches that implement it.

Sure.  If this gets too small I will add it to the implementation patch.
I cannot move this after the impl, because the impl has additional uses of VMS_RESTART.
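
To illustrate the mode handling this patch touches, here is a minimal sketch.
The two mode bits are copied from the VMStateMode hunk below; the helpers
cpr_parse_mode() and cpr_mode_matches() are hypothetical names used only for
illustration, and the "any shared bit" matching rule is an assumption about
how a handler's mode mask is compared against the operation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mode bits as they appear in the VMStateMode enum of the quoted hunk. */
enum {
    VMS_REBOOT  = 1U << 3,
    VMS_RESTART = 1U << 4,
};

/* Hypothetical helper: map a cprsave mode string to its bit, 0 if unknown. */
unsigned cpr_parse_mode(const char *mode)
{
    static const struct {
        const char *name;
        unsigned bit;
    } modes[] = {
        { "reboot",  VMS_REBOOT  },
        { "restart", VMS_RESTART },
    };
    size_t i;

    assert(mode != NULL);
    for (i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
        if (strcmp(mode, modes[i].name) == 0) {
            return modes[i].bit;
        }
    }
    return 0;
}

/* Assumed rule: a handler applies when its mask shares a bit with op. */
int cpr_mode_matches(unsigned mode_mask, unsigned op)
{
    return (mode_mask & op) != 0;
}
```

With a table like this, adding a mode means adding one row, and a load that
passes VMS_REBOOT | VMS_RESTART accepts a checkpoint saved in either mode.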

- Steve

>> ---
>>  hmp-commands.hx             | 4 +++-
>>  include/migration/vmstate.h | 1 +
>>  migration/savevm.c          | 4 +++-
>>  monitor/qmp-cmds.c          | 2 +-
>>  qapi/migration.json         | 1 +
>>  5 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/hmp-commands.hx b/hmp-commands.hx
>> index 7517876..11a2089 100644
>> --- a/hmp-commands.hx
>> +++ b/hmp-commands.hx
>> @@ -369,7 +369,7 @@ ERST
>>      {
>>          .name       = "cprsave",
>>          .args_type  = "file:s,mode:s",
>> -        .params     = "file 'reboot'",
>> +        .params     = "file 'restart'|'reboot'",
>>          .help       = "create a checkpoint of the VM in file",
>>          .cmd        = hmp_cprsave,
>>      },
>> @@ -380,6 +380,8 @@ SRST
>>    in *file*.
>>    If *mode* is 'reboot', the checkpoint can be cprload'ed after a host kexec
>>    reboot.
>> +  If *mode* is 'restart', the checkpoint can be cprload'ed after restarting
>> +  qemu.
>>    exec() /usr/bin/qemu-exec if it exists, else exec /usr/bin/qemu-system-x86_64,
>>    passing all the original command line arguments.  The VCPUs remain paused.
>>  ERST
>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>> index c58551a..8239b84 100644
>> --- a/include/migration/vmstate.h
>> +++ b/include/migration/vmstate.h
>> @@ -162,6 +162,7 @@ typedef enum {
>>      VMS_MIGRATE  = (1U << 1),
>>      VMS_SNAPSHOT = (1U << 2),
>>      VMS_REBOOT   = (1U << 3),
>> +    VMS_RESTART  = (1U << 4),
>>      VMS_MODE_ALL = ~0U
>>  } VMStateMode;
>>  
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 00f493b..38cc63a 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2708,6 +2708,8 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
>>  
>>      if (!strcmp(mode, "reboot")) {
>>          op = VMS_REBOOT;
>> +    } else if (!strcmp(mode, "restart")) {
>> +        op = VMS_RESTART;
>>      } else {
>>          error_setg(errp, "cprsave: bad mode %s", mode);
>>          return;
>> @@ -2973,7 +2975,7 @@ void load_cpr_snapshot(const char *file, Error **errp)
>>          return;
>>      }
>>  
>> -    ret = qemu_loadvm_state(f, VMS_REBOOT);
>> +    ret = qemu_loadvm_state(f, VMS_REBOOT | VMS_RESTART);
>>      qemu_fclose(f);
>>      if (ret < 0) {
>>          error_setg(errp, "Error %d while loading VM state", ret);
>> diff --git a/monitor/qmp-cmds.c b/monitor/qmp-cmds.c
>> index 8c400e6..8a74c6e 100644
>> --- a/monitor/qmp-cmds.c
>> +++ b/monitor/qmp-cmds.c
>> @@ -164,7 +164,7 @@ void qmp_cont(Error **errp)
>>  
>>  char *qmp_cprinfo(Error **errp)
>>  {
>> -    return g_strdup("reboot");
>> +    return g_strdup("reboot restart");
>>  }
>>  
>>  void qmp_cprsave(const char *file, const char *mode, Error **errp)
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 8190b16..d22992b 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -1639,6 +1639,7 @@
>>  #
>>  # @file: name of checkpoint file
>>  # @mode: 'reboot' : checkpoint can be cprload'ed after a host kexec reboot.
>> +#        'restart': checkpoint can be cprload'ed after restarting qemu.
>>  #
>>  # Since 5.0
>>  ##
>> -- 
>> 1.8.3.1
>>



* Re: [PATCH V1 15/32] vl: QEMU_START_FREEZE env var
  2020-09-11 18:49   ` Dr. David Alan Gilbert
@ 2020-09-24 21:47     ` Steven Sistare
  2020-09-25 15:52       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:47 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/11/2020 2:49 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> For qemu upgrade and restart, we will re-exec() qemu with the same argv.
>> However, qemu must start in a paused state and wait for the cprload command,
>> and the original argv might not contain the -S option.  To avoid modifying
>> argv, provide the QEMU_START_FREEZE environment variable.  If
>> QEMU_START_FREEZE is set, then set autostart=0, like the -S option.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> What's wrong with modifying the argv?
> 
> Note, also the trick -incoming defer uses;  the whole point here is that
> we start qemu with   -incoming defer     and then we can issue commands
> to modify the QEMU configuration before we actually reload state.
> 
> Note, even without CPR there might be reasons that you need to modify
> the argv; for example, imagine that since it was originally booted
> someone had hotplug added an extra CPU or RAM or a disk; the new QEMU
> must be started in a state that reflects the state in which the VM was
> at the point when it was saved, not the point at which it was started
> long ago.

The code is simpler if we do not need to parse and massage the argv, and that is 
sufficient for many use cases.  QEMU_START_FREEZE adds only a few lines of code, and 
it's nice to have that choice.

For hot plug, we rely on the management layer to know what devices were plugged
after the initial startup, and re-plug them after restart.  cprsave restarts qemu,
which creates command-line devices.  At this point the manager would send the hotplug 
commands (just like -incoming defer), then send cprload. 

Having said that, if the management layer sometimes performs live migration, and sometimes
performs cpr restart, then we need to strip out any -incoming args from argv before restart.
This can be done in the vendor-specific qemu-exec helper (patch 20).
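
As a rough sketch of what such a helper could do before it calls exec(): the
function below drops every "-incoming <arg>" pair from argv in place.  The name
strip_incoming() is hypothetical, and the sketch assumes the single-dash
spelling only and an argv with room for a terminating NULL, as main() receives.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: remove each "-incoming <arg>" pair from argv in place
 * and return the new argc.  argv[new argc] is set to NULL so the vector can
 * be handed directly to an exec*() call.
 */
int strip_incoming(int argc, char **argv)
{
    int in, out = 0;

    assert(argc >= 0 && argv != NULL);
    for (in = 0; in < argc; in++) {
        if (strcmp(argv[in], "-incoming") == 0) {
            if (in + 1 < argc) {
                in++;               /* also skip the option's argument */
            }
            continue;
        }
        argv[out++] = argv[in];     /* keep everything else, in order */
    }
    argv[out] = NULL;
    return out;
}
```

The remaining arguments keep their original order, so the restarted qemu sees
the same command-line devices as the original one.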

- Steve

>> ---
>>  softmmu/vl.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/softmmu/vl.c b/softmmu/vl.c
>> index 951994f..7016e39 100644
>> --- a/softmmu/vl.c
>> +++ b/softmmu/vl.c
>> @@ -4501,6 +4501,11 @@ void qemu_init(int argc, char **argv, char **envp)
>>          exit(0);
>>      }
>>  
>> +    if (getenv("QEMU_START_FREEZE")) {
>> +        unsetenv("QEMU_START_FREEZE");
>> +        autostart = 0;
>> +    }
>> +
>>      if (incoming) {
>>          Error *local_err = NULL;
>>          qemu_start_incoming_migration(incoming, &local_err);
>> -- 
>> 1.8.3.1
>>



* Re: [PATCH V1 05/32] savevm: QMP command for cprload
  2020-09-11 17:18       ` Dr. David Alan Gilbert
@ 2020-09-24 21:49         ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Juan Quintela, Michael S. Tsirkin,
	qemu-devel, Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Paolo Bonzini, Marc-André Lureau,
	Philippe Mathieu-Daudé,
	Alex Bennée

On 9/11/2020 1:18 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 7/30/2020 12:14 PM, Eric Blake wrote:
>>> On 7/30/20 10:14 AM, Steve Sistare wrote:
>>>> Provide the cprload QMP command.  The VM is created from the file produced
>>>> by the cprsave command.  Guest RAM is restored in-place from the shared
>>>> memory backend file, and guest block devices are used as is.  The contents
>>>> of such devices must not be modified between the cprsave and cprload
>>>> operations.  If the VM was running at cprsave time, then VM execution
>>>> resumes.
>>>
>>> Is it always wise to unconditionally resume, or might this command need an additional optional knob that says what state (paused or running) to move into?
>>
>> This can already be done.  Issue a stop command before cprsave, then cprload will finish in a
>> paused state.
>>
>> Also, cprsave re-execs and leaves the guest in a paused state.  One can
>> send device add commands, then send cprload which continues.
> 
> You're suffering here because you're reinventing stuff rather than
> reusing existing migration paths.
> With the existing migration code we require the qemu
> to be started with -incoming ... so we know it's in a clean
> state ready for being loaded, and we've already got the -S
> mechanism that dictates whether or not the VM autostarts
> (regardless of the saved state in the image).  The management
> layers find this pretty useful if they need to wire some networking
> or storage up at the point they know they've got a VM image that's
> loaded OK.

I am not seeing the issue here.  The manager can hot plug with cprsave as
easily as with migration, at the same transition points.  I don't see what
migration paths should be reused here.

CPR                                     Migration

- cprsave restarts qemu with the env    - qemu -S -incoming defer
var QEMU_START_FREEZE set, which
clears autostart just like -S.
(see patch 15)

- command-line devices created          - command-line devices created

- vmstate is prelaunch                  - vmstate is inmigrate

- manager sends hotplug commands        - manager sends hotplug commands

- manager sends cprload                 - manager sends migrate_incoming

It would perhaps be more correct for cprsave to leave the vm in the preconfig
state.

I don't feel like I'm suffering.  At least not yet :)

- Steve

>>>> Syntax:
>>>>    {'command':'cprload', 'data':{'file':'str'}}
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
>>>> ---
>>>
>>>> +++ b/qapi/migration.json
>>>> @@ -1635,3 +1635,14 @@
>>>>   ##
>>>>   { 'command': 'cprsave', 'data': { 'file': 'str', 'mode': 'str' } }
>>>>   +##
>>>> +# @cprload:
>>>> +#
>>>> +# Start virtual machine from checkpoint file that was created earlier using
>>>> +# the cprsave command.
>>>> +#
>>>> +# @file: name of checkpoint file
>>>> +#
>>>> +# Since 5.0
>>>
>>> another 5.2 instance. I'll quit pointing it out for the rest of the series.
>>
>> Will find and fix all, thanks.
>>
>> - Steve
>>



* Re: [PATCH V1 08/32] savevm: HMP command for cprinfo
  2020-09-11 17:27   ` Dr. David Alan Gilbert
@ 2020-09-24 21:50     ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:50 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/11/2020 1:27 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Enable HMP access to the cprinfo QMP command.
>>
>> Usage: cprinfo
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> As with Eric's comment on the QMP I don't think you need it;
> for HMP all you really need is something that lists it in the help.

We need an architected stable interface to know that a qemu instance supports
cpr.  I don't think parsing help is good enough.  The hmp interface is great
for use in bash; easy to use and efficient.

> (Also I'd expect an info  cpr   to be a possibility that could give
> some information about it - e.g. if you've just saved/can save/loaded a
> CPR image)

Yes, that occurred to me.  We could add some flags in the future and remain
backwards compatible.  I should start now with a sub-command schema to make future 
expansion cleaner:
  "cprinfo modes" - return supported modes, eg "reboot restart"
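
A caller that receives the space-delimited list could test for a mode along
these lines.  This is only a sketch: cpr_mode_supported() is a hypothetical
name, and the list format is assumed to be exactly what cprinfo returns today.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if mode appears as a whole word in a
 * space-delimited list such as "reboot restart".
 */
bool cpr_mode_supported(const char *modes, const char *mode)
{
    size_t len;

    assert(modes != NULL && mode != NULL);
    len = strlen(mode);
    if (len == 0) {
        return false;
    }
    while (*modes) {
        size_t word = strcspn(modes, " ");  /* length of the next word */

        if (word == len && strncmp(modes, mode, len) == 0) {
            return true;
        }
        modes += word;
        modes += strspn(modes, " ");        /* skip the delimiters */
    }
    return false;
}
```

Matching whole words rather than substrings keeps a query such as "boot" from
matching "reboot".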

- Steve

>> ---
>>  hmp-commands.hx       | 13 +++++++++++++
>>  include/monitor/hmp.h |  1 +
>>  monitor/hmp-cmds.c    | 10 ++++++++++
>>  3 files changed, 24 insertions(+)
>>
>> diff --git a/hmp-commands.hx b/hmp-commands.hx
>> index cb67150..7517876 100644
>> --- a/hmp-commands.hx
>> +++ b/hmp-commands.hx
>> @@ -354,6 +354,19 @@ SRST
>>  ERST
>>  
>>      {
>> +        .name       = "cprinfo",
>> +        .args_type  = "",
>> +        .params     = "",
>> +        .help       = "return list of modes supported by cprsave",
>> +        .cmd        = hmp_cprinfo,
>> +    },
>> +
>> +SRST
>> +``cprinfo`` *tag*
>> +  Return a space-delimited list of modes supported by cprsave.
>> +ERST
>> +
>> +    {
>>          .name       = "cprsave",
>>          .args_type  = "file:s,mode:s",
>>          .params     = "file 'reboot'",
>> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
>> index 7b8cdfd..919b9a9 100644
>> --- a/include/monitor/hmp.h
>> +++ b/include/monitor/hmp.h
>> @@ -59,6 +59,7 @@ void hmp_balloon(Monitor *mon, const QDict *qdict);
>>  void hmp_loadvm(Monitor *mon, const QDict *qdict);
>>  void hmp_savevm(Monitor *mon, const QDict *qdict);
>>  void hmp_delvm(Monitor *mon, const QDict *qdict);
>> +void hmp_cprinfo(Monitor *mon, const QDict *qdict);
>>  void hmp_cprsave(Monitor *mon, const QDict *qdict);
>>  void hmp_cprload(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
>> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
>> index ba95737..2f6af07 100644
>> --- a/monitor/hmp-cmds.c
>> +++ b/monitor/hmp-cmds.c
>> @@ -1139,6 +1139,16 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
>>      qapi_free_AnnounceParameters(params);
>>  }
>>  
>> +void hmp_cprinfo(Monitor *mon, const QDict *qdict)
>> +{
>> +    Error *err = NULL;
>> +    char *res = qmp_cprinfo(&err);
>> +
>> +    monitor_printf(mon, "%s\n", res);
>> +    g_free(res);
>> +    hmp_handle_error(mon, err);
>> +}
>> +
>>  void hmp_cprsave(Monitor *mon, const QDict *qdict)
>>  {
>>      Error *err = NULL;
>> -- 
>> 1.8.3.1
>>



* Re: [PATCH V1 09/32] savevm: prevent cprsave if memory is volatile
  2020-09-11 17:35   ` Dr. David Alan Gilbert
@ 2020-09-24 21:51     ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/11/2020 1:35 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> cprsave and cprload require that guest ram be backed by an externally
>> visible shared file.  Check that in cprsave.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  exec.c                | 32 ++++++++++++++++++++++++++++++++
>>  include/exec/memory.h |  2 ++
>>  migration/savevm.c    |  4 ++++
>>  3 files changed, 38 insertions(+)
>>
>> diff --git a/exec.c b/exec.c
>> index 6f381f9..02160e0 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -2726,6 +2726,38 @@ ram_addr_t qemu_ram_addr_from_host(void *ptr)
>>      return block->offset + offset;
>>  }
>>  
>> +/*
>> + * Return true if any memory regions are writable and not backed by shared
>> + * memory.  Exclude x86 option rom shadow "pc.rom" by name, even though it is
>> + * writable.
> 
> Tell me about 'pc.rom' - this is a very odd hack.
> Again note the trick done by the existing migration capability
> x-ignore-shared ; it doesn't special case, it just doesn't migrate
> the 'shared' blocks.

pc.rom is indeed a rom.  Its contents do not change, and it can be recreated
from scratch after a restart.  However, the x86 arch code does not mark it
readonly, so there is no proper way to tell it is not volatile, and its
presence blocks cprsave reboot.  Checking for the name "pc.rom" was the only
way to recognize this anomaly, short of modifying the x86 code.

However, I initially developed the above using an old version of qemu, and
a more recent version has fixed it:

pc_memory_init()
    memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
                           &error_fatal);
    if (pcmc->pci_enabled) {
        memory_region_set_readonly(option_rom_mr, true);
    }

Now memory_region_is_rom() will correctly classify this segment, and I will
happily delete the hack.
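
For illustration, the check without the name test reduces to something like
the sketch below.  struct ram_desc and first_volatile() are hypothetical
stand-ins that model only the fields the quoted qemu_ram_volatile() reads;
they are not qemu types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the RAMBlock/MemoryRegion state the check reads. */
struct ram_desc {
    const char *name;
    bool is_ram;
    bool is_ram_device;
    bool is_rom;        /* what memory_region_is_rom() would report */
    bool is_shared;
    int fd;             /* -1 when the block is not file backed */
};

/*
 * Return the first volatile block, or NULL if every writable RAM block is
 * backed by a shared file.  ROM is skipped by its flag, not by its name.
 */
const struct ram_desc *first_volatile(const struct ram_desc *blocks, size_t n)
{
    size_t i;

    assert(blocks != NULL || n == 0);
    for (i = 0; i < n; i++) {
        const struct ram_desc *b = &blocks[i];

        if (b->is_ram && !b->is_ram_device && !b->is_rom &&
            (b->fd == -1 || !b->is_shared)) {
            return b;
        }
    }
    return NULL;
}
```

Once pc.rom is marked read-only it is excluded by the is_rom test alone, which
is why the strcmp() on the region name becomes unnecessary.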

- Steve

>> + */
>> +bool qemu_ram_volatile(Error **errp)
>> +{
>> +    RAMBlock *block;
>> +    MemoryRegion *mr;
>> +    bool ret = false;
>> +
>> +    rcu_read_lock();
>> +    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
>> +        mr = block->mr;
>> +        if (mr &&
>> +            memory_region_is_ram(mr) &&
>> +            !memory_region_is_ram_device(mr) &&
>> +            !memory_region_is_rom(mr) &&
>> +            (!mr->name || strcmp(mr->name, "pc.rom")) &&
>> +            (block->fd == -1 || !qemu_ram_is_shared(block))) {
>> +
>> +            error_setg(errp, "Memory region %s is volatile",
>> +                       memory_region_name(mr));
>> +            ret = true;
>> +            break;
>> +        }
>> +    }
>> +
>> +    rcu_read_unlock();
>> +    return ret;
>> +}
>> +
>>  /* Generate a debug exception if a watchpoint has been hit.  */
>>  void cpu_check_watchpoint(CPUState *cpu, vaddr addr, vaddr len,
>>                            MemTxAttrs attrs, int flags, uintptr_t ra)
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 307e527..6aafbb0 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -2519,6 +2519,8 @@ bool ram_block_discard_is_disabled(void);
>>   */
>>  bool ram_block_discard_is_required(void);
>>  
>> +bool qemu_ram_volatile(Error **errp);
>> +
>>  #endif
>>  
>>  #endif
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 1509173..f101039 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2713,6 +2713,10 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
>>          return;
>>      }
>>  
>> +    if (op == VMS_REBOOT && qemu_ram_volatile(errp)) {
>> +        return;
>> +    }
>> +
>>      f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, errp);
>>      if (!f) {
>>          return;
>> -- 
>> 1.8.3.1
>>



* Re: [PATCH V1 12/32] vl: pause option
  2020-09-11 17:59       ` Dr. David Alan Gilbert
@ 2020-09-24 21:51         ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/11/2020 1:59 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 7/30/2020 1:03 PM, Alex Bennée wrote:
>>>
>>> Steve Sistare <steven.sistare@oracle.com> writes:
>>>
>>>> Provide the -pause command-line parameter and the QEMU_PAUSE environment
>>>> variable to briefly pause QEMU in main and allow a developer to attach gdb.
>>>> Useful when the developer does not invoke QEMU directly, such as when using
>>>> libvirt.
>>>
>>> How does this differ from -S?
>>
>> The -S flag runs qemu to the main loop but does not start the guest.  Lots of code
>> that you may need to debug runs before you get there.
> 
> You might try the '--preconfig' option - that's pretty early on.
> The other one is adding a chardev and telling it to wait for a server;
> that'll wait until you telnet to the port.
> 
> (Either way, this patch shouldn't really be part of this series, it's a
> separate discussion)

Sure, I will pull it from the series.

- Steve

>> - Steve
>>>> Usage:
>>>>   qemu -pause <seconds>
>>>>   or
>>>>   export QEMU_PAUSE=<seconds>
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>  qemu-options.hx |  9 +++++++++
>>>>  softmmu/vl.c    | 15 ++++++++++++++-
>>>>  2 files changed, 23 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>>> index 708583b..8505cf2 100644
>>>> --- a/qemu-options.hx
>>>> +++ b/qemu-options.hx
>>>> @@ -3668,6 +3668,15 @@ SRST
>>>>      option is experimental.
>>>>  ERST
>>>>  
>>>> +DEF("pause", HAS_ARG, QEMU_OPTION_pause, \
>>>> +    "-pause secs    Pause for secs seconds on entry to main.\n", QEMU_ARCH_ALL)
>>>> +
>>>> +SRST
>>>> +``--pause secs``
>>>> +    Pause for a number of seconds on entry to main.  Useful for attaching
>>>> +    a debugger after QEMU has been launched by some other entity.
>>>> +ERST
>>>> +
>>>
>>> It seems like having an option to race with the debugger is just asking
>>> for trouble.
>>>
>>>>  DEF("S", 0, QEMU_OPTION_S, \
>>>>      "-S              freeze CPU at startup (use 'c' to start execution)\n",
>>>>      QEMU_ARCH_ALL)
>>>> diff --git a/softmmu/vl.c b/softmmu/vl.c
>>>> index 8478778..951994f 100644
>>>> --- a/softmmu/vl.c
>>>> +++ b/softmmu/vl.c
>>>> @@ -2844,7 +2844,7 @@ static void create_default_memdev(MachineState *ms, const char *path)
>>>>  
>>>>  void qemu_init(int argc, char **argv, char **envp)
>>>>  {
>>>> -    int i;
>>>> +    int i, seconds;
>>>>      int snapshot, linux_boot;
>>>>      const char *initrd_filename;
>>>>      const char *kernel_filename, *kernel_cmdline;
>>>> @@ -2882,6 +2882,13 @@ void qemu_init(int argc, char **argv, char **envp)
>>>>      QemuPluginList plugin_list = QTAILQ_HEAD_INITIALIZER(plugin_list);
>>>>      int mem_prealloc = 0; /* force preallocation of physical target memory */
>>>>  
>>>> +    if (getenv("QEMU_PAUSE")) {
>>>> +        seconds = atoi(getenv("QEMU_PAUSE"));
>>>> +        printf("Pausing %d seconds for debugger. QEMU PID is %d\n",
>>>> +               seconds, getpid());
>>>> +        sleep(seconds);
>>>> +    }
>>>> +
>>>>      os_set_line_buffering();
>>>>  
>>>>      error_init(argv[0]);
>>>> @@ -3204,6 +3211,12 @@ void qemu_init(int argc, char **argv, char **envp)
>>>>              case QEMU_OPTION_gdb:
>>>>                  add_device_config(DEV_GDB, optarg);
>>>>                  break;
>>>> +            case QEMU_OPTION_pause:
>>>> +                seconds = atoi(optarg);
>>>> +                printf("Pausing %d seconds for debugger. QEMU PID is %d\n",
>>>> +                            seconds, getpid());
>>>> +                sleep(seconds);
>>>> +                break;
>>>>              case QEMU_OPTION_L:
>>>>                  if (is_help_option(optarg)) {
>>>>                      list_data_dirs = true;
>>>
>>>
>>



* Re: [PATCH V1 13/32] gdbstub: gdb support for suspended state
  2020-09-11 18:41   ` Dr. David Alan Gilbert
@ 2020-09-24 21:51     ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:51 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, alex.bennee, philmd
  Cc: Daniel P. Berrange, Stefan Hajnoczi, Michael S. Tsirkin,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Marc-André Lureau, Paolo Bonzini

On 9/11/2020 2:41 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Modify the gdb server so a continue command appears to resume execution
>> when in RUN_STATE_SUSPENDED.  Do not print the next gdb prompt, but do not
>> actually resume instruction fetch.  While in this "fake" running mode, a
>> ctrl-C returns the user to the gdb prompt.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> This patch doesn't feel like it lives here; it seems to be a separate
> gdbstub patch and it'll get noticed/merged quicker just sent on its
> own.

OK, I will submit it separately.
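
For anyone skimming the quoted diff, the intended behavior can be modeled
as a small state machine.  This is only a simplified sketch: the demo_*
names and the struct are hypothetical and are not gdbstub.c code.  It
assumes "continue" always hides the gdb prompt but starts the CPUs only
when the VM is not suspended, and that ctrl-C brings the prompt back
without changing a suspended run state.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for QEMU run states. */
enum demo_run_state { DEMO_RUNNING, DEMO_PAUSED, DEMO_SUSPENDED };

struct demo_vm {
    enum demo_run_state state;
    bool gdb_prompt_shown;      /* true while gdb waits at its prompt */
    bool cpus_executing;        /* true while guest instructions run */
};

/* gdb "continue": hide the prompt; start CPUs only if not suspended. */
static void demo_gdb_continue(struct demo_vm *vm)
{
    vm->gdb_prompt_shown = false;
    if (vm->state != DEMO_SUSPENDED) {
        vm->state = DEMO_RUNNING;
        vm->cpus_executing = true;
    }
}

/* ctrl-C from gdb: stop CPUs if they run, then show the prompt again. */
static void demo_gdb_interrupt(struct demo_vm *vm)
{
    if (vm->state == DEMO_RUNNING) {
        vm->state = DEMO_PAUSED;
        vm->cpus_executing = false;
    }
    vm->gdb_prompt_shown = true;
}
```

In this model a suspended VM goes through continue and ctrl-C without
ever executing a guest instruction, which is the "fake" running mode the
commit message describes.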

- Steve

>>  gdbstub.c | 11 +++++++++--
>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/gdbstub.c b/gdbstub.c
>> index f3a318c..2f0d9ff 100644
>> --- a/gdbstub.c
>> +++ b/gdbstub.c
>> @@ -461,7 +461,9 @@ static inline void gdb_continue(void)
>>  #else
>>      if (!runstate_needs_reset()) {
>>          trace_gdbstub_op_continue();
>> -        vm_start();
>> +        if (!runstate_check(RUN_STATE_SUSPENDED)) {
>> +            vm_start();
>> +        }
>>      }
>>  #endif
>>  }
>> @@ -490,7 +492,7 @@ static int gdb_continue_partial(char *newstates)
>>      int flag = 0;
>>  
>>      if (!runstate_needs_reset()) {
>> -        if (vm_prepare_start()) {
>> +        if (!runstate_check(RUN_STATE_SUSPENDED) && vm_prepare_start()) {
>>              return 0;
>>          }
>>  
>> @@ -2835,6 +2837,9 @@ static void gdb_read_byte(uint8_t ch)
>>          /* when the CPU is running, we cannot do anything except stop
>>             it when receiving a char */
>>          vm_stop(RUN_STATE_PAUSED);
>> +    } else if (runstate_check(RUN_STATE_SUSPENDED) && ch == 3) {
>> +        /* Received ctrl-c from gdb */
>> +        gdb_vm_state_change(0, 0, RUN_STATE_PAUSED);
>>      } else
>>  #endif
>>      {
>> @@ -3282,6 +3287,8 @@ static void gdb_sigterm_handler(int signal)
>>  {
>>      if (runstate_is_running()) {
>>          vm_stop(RUN_STATE_PAUSED);
>> +    } else if (runstate_check(RUN_STATE_SUSPENDED)) {
>> +        gdb_vm_state_change(0, 0, RUN_STATE_PAUSED);
>>      }
>>  }
>>  #endif
>> -- 
>> 1.8.3.1
>>



* Re: [PATCH V1 17/32] util: env var helpers
  2020-09-11 19:00   ` Dr. David Alan Gilbert
@ 2020-09-24 21:52     ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:52 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/11/2020 3:00 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Add functions for saving fd's and ram extents in the environment via
>> setenv, and for reading them back via getenv.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
> 
> This is an awful lot of env stuff - how about dumping
> all this stuff into a file and reloading it?

I don't think there will be significantly fewer lines if this is
re-written to use a file, and the existing code is very simple -- a few
easy-to-understand lines for each accessor.  Please skim env.c; you will
grok it in less time than it took to write our emails.  A file-based
version may be even longer because the code would need to check for
malformed input, since the file contents could be changed outside qemu,
and we would need to provide an option for where the file is stored.
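
To make the "check for malformed input" point concrete, a strict parser
for a single stored value might look like the sketch below.  The name
demo_parse_u64 is hypothetical and is not part of env.c; the sketch
assumes values are plain unsigned decimal strings and rejects anything
else.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse an unsigned decimal string into *out.
 * Returns false, leaving *out untouched, on any malformed input.
 */
static bool demo_parse_u64(const char *s, uint64_t *out)
{
    char *end;
    unsigned long long v;

    if (s == NULL || *s < '0' || *s > '9') {
        return false;           /* rejects "", signs, leading spaces */
    }
    errno = 0;
    v = strtoull(s, &end, 10);
    if (errno == ERANGE || *end != '\0' || v > UINT64_MAX) {
        return false;           /* overflow or trailing junk */
    }
    *out = v;
    return true;
}
```

A file-based store would need a check of this kind for every field it
reads back, plus handling for missing or truncated files.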

The env is nice because it eliminates a failure point -- the variables always 
get carried through to the post-exec process.  No lost or stale files.
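
As a small illustration of that property, the sketch below stores an fd
number under the QEMU_FD_ prefix from the patch, then forks and execs a
shell that checks whether it inherited the variable.  The demo_* helpers
are hypothetical, and using /bin/sh as the exec'd program is only an
assumption made to keep the example short.  The name must consist of
characters that are valid in a shell variable name.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: store an fd number under QEMU_FD_<name>. */
static int demo_setenv_fd(const char *name, int fd)
{
    char var[80], val[16];

    if (snprintf(var, sizeof(var), "QEMU_FD_%s", name) >= (int)sizeof(var)) {
        return -1;              /* name too long for the buffer */
    }
    snprintf(val, sizeof(val), "%d", fd);
    return setenv(var, val, 1);
}

/*
 * Hypothetical check: exec a shell in a child and let it compare the
 * inherited variable with the expected text.  Returns 1 if the exec'd
 * program saw the value, 0 if it did not, -1 on error.
 */
static int demo_env_survives_exec(const char *name, int fd)
{
    char script[160];
    int status;
    pid_t pid;

    if (demo_setenv_fd(name, fd) < 0) {
        return -1;
    }
    snprintf(script, sizeof(script), "[ \"$QEMU_FD_%s\" = \"%d\" ]", name, fd);

    pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* execl keeps the caller's environment for the new program. */
        execl("/bin/sh", "sh", "-c", script, (char *)NULL);
        _exit(127);             /* reached only if exec failed */
    }
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

For example, demo_env_survives_exec("chardev0", 7) reports whether the
shell saw QEMU_FD_chardev0=7.  Only the number travels in the
environment; the descriptor itself stays open across exec only if
FD_CLOEXEC is not set on it.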

- Steve

>> ---
>>  MAINTAINERS           |   7 +++
>>  include/qemu/cutils.h |   1 +
>>  include/qemu/env.h    |  31 ++++++++++++
>>  util/Makefile.objs    |   2 +-
>>  util/env.c            | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>  5 files changed, 172 insertions(+), 1 deletion(-)
>>  create mode 100644 include/qemu/env.h
>>  create mode 100644 util/env.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 3395abd..8d377a7 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -3115,3 +3115,10 @@ Performance Tools and Tests
>>  M: Ahmed Karaman <ahmedkhaledkaraman@gmail.com>
>>  S: Maintained
>>  F: scripts/performance/
>> +
>> +Environment variable helpers
>> +M: Steve Sistare <steven.sistare@oracle.com>
>> +M: Mark Kanda <mark.kanda@oracle.com>
>> +S: Maintained
>> +F: include/qemu/env.h
>> +F: util/env.c
>> diff --git a/include/qemu/cutils.h b/include/qemu/cutils.h
>> index eb59852..d4c7d70 100644
>> --- a/include/qemu/cutils.h
>> +++ b/include/qemu/cutils.h
>> @@ -1,6 +1,7 @@
>>  #ifndef QEMU_CUTILS_H
>>  #define QEMU_CUTILS_H
>>  
>> +#include "qemu/env.h"
>>  /**
>>   * pstrcpy:
>>   * @buf: buffer to copy string into
>> diff --git a/include/qemu/env.h b/include/qemu/env.h
>> new file mode 100644
>> index 0000000..53cc121
>> --- /dev/null
>> +++ b/include/qemu/env.h
>> @@ -0,0 +1,31 @@
>> +/*
>> + * Copyright (c) 2020 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + * See the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +#ifndef QEMU_ENV_H
>> +#define QEMU_ENV_H
>> +
>> +#define FD_PREFIX "QEMU_FD_"
>> +#define ADDR_PREFIX "QEMU_ADDR_"
>> +#define LEN_PREFIX "QEMU_LEN_"
>> +#define BOOL_PREFIX "QEMU_BOOL_"
>> +
>> +typedef int (*walkenv_cb)(const char *name, const char *val, void *handle);
>> +
>> +bool getenv_ram(const char *name, void **addrp, size_t *lenp);
>> +void setenv_ram(const char *name, void *addr, size_t len);
>> +void unsetenv_ram(const char *name);
>> +int getenv_fd(const char *name);
>> +void setenv_fd(const char *name, int fd);
>> +void unsetenv_fd(const char *name);
>> +bool getenv_bool(const char *name);
>> +void setenv_bool(const char *name, bool val);
>> +void unsetenv_bool(const char *name);
>> +int walkenv(const char *prefix, walkenv_cb cb, void *handle);
>> +void printenv(void);
>> +
>> +#endif
>> diff --git a/util/Makefile.objs b/util/Makefile.objs
>> index cc5e371..d357932 100644
>> --- a/util/Makefile.objs
>> +++ b/util/Makefile.objs
>> @@ -1,4 +1,4 @@
>> -util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
>> +util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o env.o
>>  util-obj-$(call lnot,$(CONFIG_ATOMIC64)) += atomic64.o
>>  util-obj-$(CONFIG_POSIX) += aio-posix.o
>>  util-obj-$(CONFIG_POSIX) += fdmon-poll.o
>> diff --git a/util/env.c b/util/env.c
>> new file mode 100644
>> index 0000000..0cc4a9f
>> --- /dev/null
>> +++ b/util/env.c
>> @@ -0,0 +1,132 @@
>> +/*
>> + * Copyright (c) 2020 Oracle and/or its affiliates.
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.
>> + * See the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/env.h"
>> +
>> +static uint64_t getenv_ulong(const char *prefix, const char *name, bool *found)
>> +{
>> +    char var[80], *val;
>> +    uint64_t res;
>> +
>> +    snprintf(var, sizeof(var), "%s%s", prefix, name);
>> +    val = getenv(var);
>> +    if (val) {
>> +        *found = true;
>> +        res = strtol(val, 0, 10);
>> +    } else {
>> +        *found = false;
>> +        res = 0;
>> +    }
>> +    return res;
>> +}
>> +
>> +static void setenv_ulong(const char *prefix, const char *name, uint64_t val)
>> +{
>> +    char var[80], val_str[80];
>> +    snprintf(var, sizeof(var), "%s%s", prefix, name);
>> +    snprintf(val_str, sizeof(val_str), "%"PRIu64, val);
>> +    setenv(var, val_str, 1);
>> +}
>> +
>> +static void unsetenv_ulong(const char *prefix, const char *name)
>> +{
>> +    char var[80];
>> +    snprintf(var, sizeof(var), "%s%s", prefix, name);
>> +    unsetenv(var);
>> +}
>> +
>> +bool getenv_ram(const char *name, void **addrp, size_t *lenp)
>> +{
>> +    bool found1, found2;
>> +    *addrp = (void *) getenv_ulong(ADDR_PREFIX, name, &found1);
>> +    *lenp = getenv_ulong(LEN_PREFIX, name, &found2);
>> +    assert(found1 == found2);
>> +    return found1;
>> +}
>> +
>> +void setenv_ram(const char *name, void *addr, size_t len)
>> +{
>> +    setenv_ulong(ADDR_PREFIX, name, (uint64_t)addr);
>> +    setenv_ulong(LEN_PREFIX, name, len);
>> +}
>> +
>> +void unsetenv_ram(const char *name)
>> +{
>> +    unsetenv_ulong(ADDR_PREFIX, name);
>> +    unsetenv_ulong(LEN_PREFIX, name);
>> +}
>> +
>> +int getenv_fd(const char *name)
>> +{
>> +    bool found;
>> +    int fd = getenv_ulong(FD_PREFIX, name, &found);
>> +    if (!found) {
>> +        fd = -1;
>> +    }
>> +    return fd;
>> +}
>> +
>> +void setenv_fd(const char *name, int fd)
>> +{
>> +    setenv_ulong(FD_PREFIX, name, fd);
>> +}
>> +
>> +void unsetenv_fd(const char *name)
>> +{
>> +    unsetenv_ulong(FD_PREFIX, name);
>> +}
>> +
>> +bool getenv_bool(const char *name)
>> +{
>> +    bool found;
>> +    bool val = getenv_ulong(BOOL_PREFIX, name, &found);
>> +    if (!found) {
>> +        val = -1;
>> +    }
>> +    return val;
>> +}
>> +
>> +void setenv_bool(const char *name, bool val)
>> +{
>> +    setenv_ulong(BOOL_PREFIX, name, val);
>> +}
>> +
>> +void unsetenv_bool(const char *name)
>> +{
>> +    unsetenv_ulong(BOOL_PREFIX, name);
>> +}
>> +
>> +int walkenv(const char *prefix, walkenv_cb cb, void *handle)
>> +{
>> +    char *str, name[128];
>> +    char **envp = environ;
>> +    size_t prefix_len = strlen(prefix);
>> +
>> +    while (*envp) {
>> +        str = *envp++;
>> +        if (!strncmp(str, prefix, prefix_len)) {
>> +            char *val = strchr(str, '=');
>> +            str += prefix_len;
>> +            strncpy(name, str, val - str);
>> +            name[val - str] = 0;
>> +            if (cb(name, val + 1, handle)) {
>> +                return 1;
>> +            }
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +void printenv(void)
>> +{
>> +    char **ptr = environ;
>> +    while (*ptr) {
>> +        puts(*ptr++);
>> +    }
>> +}
>> -- 
>> 1.8.3.1
>>
>>



* Re: [PATCH V1 32/32] vfio-pci: improved tracing
  2020-09-15 18:49   ` Dr. David Alan Gilbert
@ 2020-09-24 21:52     ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:52 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/15/2020 2:49 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> Print more info for existing trace points:
>>   trace_kvm_irqchip_add_msi_route
>>   trace_pci_update_mappings_del
>>   trace_pci_update_mappings_add
>>
>> Add new trace points:
>>   trace_kvm_irqchip_assign_irqfd
>>   trace_msix_table_mmio_write
>>   trace_vfio_dma_unmap
>>   trace_vfio_dma_map
>>   trace_vfio_region
>>   trace_vfio_descriptors
>>   trace_ram_block_add
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> 
> Why don't you split this out into a separate patch by itself; if they're
> general extra useful tracing they can just go in.

OK, will submit separately.

> Note that you've also added a new warning in vfio_dma_unmap

I'll move that elsewhere.

- Steve

>> ---
>>  accel/kvm/kvm-all.c    |  8 ++++++--
>>  accel/kvm/trace-events |  3 ++-
>>  exec.c                 |  3 +++
>>  hw/pci/msix.c          |  1 +
>>  hw/pci/pci.c           | 10 ++++++----
>>  hw/pci/trace-events    |  5 +++--
>>  hw/vfio/common.c       | 16 +++++++++++++++-
>>  hw/vfio/pci.c          |  1 +
>>  hw/vfio/trace-events   |  9 ++++++---
>>  trace-events           |  2 ++
>>  10 files changed, 45 insertions(+), 13 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 63ef6af..5511ea7 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -46,6 +46,7 @@
>>  #include "sysemu/reset.h"
>>  
>>  #include "hw/boards.h"
>> +#include "trace-root.h"
>>  
>>  /* This check must be after config-host.h is included */
>>  #ifdef CONFIG_EVENTFD
>> @@ -1670,7 +1671,7 @@ int kvm_irqchip_add_msi_route(KVMState *s, int vector, PCIDevice *dev)
>>      }
>>  
>>      trace_kvm_irqchip_add_msi_route(dev ? dev->name : (char *)"N/A",
>> -                                    vector, virq);
>> +                                    vector, virq, msg.address, msg.data);
>>  
>>      kvm_add_routing_entry(s, &kroute);
>>      kvm_arch_add_msi_route_post(&kroute, vector, dev);
>> @@ -1717,6 +1718,7 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
>>  {
>>      int fd = event_notifier_get_fd(event);
>>      int rfd = resample ? event_notifier_get_fd(resample) : -1;
>> +    int ret;
>>  
>>      struct kvm_irqfd irqfd = {
>>          .fd = fd,
>> @@ -1758,7 +1760,9 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
>>          return -ENOSYS;
>>      }
>>  
>> -    return kvm_vm_ioctl(s, KVM_IRQFD, &irqfd);
>> +    ret = kvm_vm_ioctl(s, KVM_IRQFD, &irqfd);
>> +    trace_kvm_irqchip_assign_irqfd(fd, virq, rfd, ret);
>> +    return ret;
>>  }
>>  
>>  int kvm_irqchip_add_adapter_route(KVMState *s, AdapterInfo *adapter)
>> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
>> index a68eb66..67a01e6 100644
>> --- a/accel/kvm/trace-events
>> +++ b/accel/kvm/trace-events
>> @@ -9,7 +9,8 @@ kvm_device_ioctl(int fd, int type, void *arg) "dev fd %d, type 0x%x, arg %p"
>>  kvm_failed_reg_get(uint64_t id, const char *msg) "Warning: Unable to retrieve ONEREG %" PRIu64 " from KVM: %s"
>>  kvm_failed_reg_set(uint64_t id, const char *msg) "Warning: Unable to set ONEREG %" PRIu64 " to KVM: %s"
>>  kvm_irqchip_commit_routes(void) ""
>> -kvm_irqchip_add_msi_route(char *name, int vector, int virq) "dev %s vector %d virq %d"
>> +kvm_irqchip_add_msi_route(char *name, int vector, int virq, uint64_t addr, uint32_t data) "%s, vector %d, virq %d, msg {addr 0x%"PRIx64", data 0x%x}"
>> +kvm_irqchip_assign_irqfd(int fd, int virq, int rfd, int status) "(fd=%d, virq=%d, rfd=%d) KVM_IRQFD returns %d"
>>  kvm_irqchip_update_msi_route(int virq) "Updating MSI route virq=%d"
>>  kvm_irqchip_release_virq(int virq) "virq %d"
>>  kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%" PRIx64 " val=0x%x assign: %d size: %d match: %d"
>> diff --git a/exec.c b/exec.c
>> index 5473c09..dd99ee0 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -2319,6 +2319,9 @@ static void ram_block_add(RAMBlock *new_block, Error **errp, bool shared)
>>          }
>>          ram_block_notify_add(new_block->host, new_block->max_length);
>>      }
>> +    trace_ram_block_add(new_block->host, new_block->max_length,
>> +                        memory_region_name(new_block->mr),
>> +                        new_block->mr->readonly ? "ro" : "rw");
>>  }
>>  
>>  #ifdef CONFIG_POSIX
>> diff --git a/hw/pci/msix.c b/hw/pci/msix.c
>> index 67e34f3..65a2882 100644
>> --- a/hw/pci/msix.c
>> +++ b/hw/pci/msix.c
>> @@ -189,6 +189,7 @@ static void msix_table_mmio_write(void *opaque, hwaddr addr,
>>      int vector = addr / PCI_MSIX_ENTRY_SIZE;
>>      bool was_masked;
>>  
>> +    trace_msix_table_mmio_write(dev->name, addr, val, size);
>>      was_masked = msix_is_masked(dev, vector);
>>      pci_set_long(dev->msix_table + addr, val);
>>      msix_handle_mask_update(dev, vector, was_masked);
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index c2e1509..6142411 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -1324,9 +1324,11 @@ void pci_update_mappings(PCIDevice *d)
>>      PCIIORegion *r;
>>      int i;
>>      pcibus_t new_addr;
>> +    const char *name;
>>  
>>      for(i = 0; i < PCI_NUM_REGIONS; i++) {
>>          r = &d->io_regions[i];
>> +        name = r->memory ? r->memory->name : "";
>>  
>>          /* this region isn't registered */
>>          if (!r->size)
>> @@ -1340,18 +1342,18 @@ void pci_update_mappings(PCIDevice *d)
>>  
>>          /* now do the real mapping */
>>          if (r->addr != PCI_BAR_UNMAPPED) {
>> -            trace_pci_update_mappings_del(d, pci_dev_bus_num(d),
>> +            trace_pci_update_mappings_del(d->name, pci_dev_bus_num(d),
>>                                            PCI_SLOT(d->devfn),
>>                                            PCI_FUNC(d->devfn),
>> -                                          i, r->addr, r->size);
>> +                                          i, r->addr, r->size, name);
>>              memory_region_del_subregion(r->address_space, r->memory);
>>          }
>>          r->addr = new_addr;
>>          if (r->addr != PCI_BAR_UNMAPPED) {
>> -            trace_pci_update_mappings_add(d, pci_dev_bus_num(d),
>> +            trace_pci_update_mappings_add(d->name, pci_dev_bus_num(d),
>>                                            PCI_SLOT(d->devfn),
>>                                            PCI_FUNC(d->devfn),
>> -                                          i, r->addr, r->size);
>> +                                          i, r->addr, r->size, name);
>>              memory_region_add_subregion_overlap(r->address_space,
>>                                                  r->addr, r->memory, 1);
>>          }
>> diff --git a/hw/pci/trace-events b/hw/pci/trace-events
>> index def4b39..6dd7015 100644
>> --- a/hw/pci/trace-events
>> +++ b/hw/pci/trace-events
>> @@ -1,8 +1,8 @@
>>  # See docs/devel/tracing.txt for syntax documentation.
>>  
>>  # pci.c
>> -pci_update_mappings_del(void *d, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
>> -pci_update_mappings_add(void *d, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
>> +pci_update_mappings_del(const char *dname, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size, const char *name) "%s %02x:%02x.%x [%d] 0x%"PRIx64", 0x%"PRIx64"B \"%s\""
>> +pci_update_mappings_add(const char *dname, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size, const char *name) "%s %02x:%02x.%x [%d] 0x%"PRIx64", 0x%"PRIx64"B \"%s\""
>>  
>>  # pci_host.c
>>  pci_cfg_read(const char *dev, unsigned devid, unsigned fnid, unsigned offs, unsigned val) "%s %02u:%u @0x%x -> 0x%x"
>> @@ -10,3 +10,4 @@ pci_cfg_write(const char *dev, unsigned devid, unsigned fnid, unsigned offs, uns
>>  
>>  # msix.c
>>  msix_write_config(char *name, bool enabled, bool masked) "dev %s enabled %d masked %d"
>> +msix_table_mmio_write(char *name, uint64_t addr, uint64_t val, unsigned size)  "(%s, @%"PRId64", 0x%"PRIx64", %dB)"
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index a51a093..23c8bf3 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -304,6 +304,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>          return 0;
>>      }
>>  
>> +    trace_vfio_dma_unmap(container->fd, iova, size);
>> +
>>      while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
>>          /*
>>           * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
>> @@ -327,6 +329,11 @@ static int vfio_dma_unmap(VFIOContainer *container,
>>          return -errno;
>>      }
>>  
>> +    if (unmap.size != size) {
>> +        error_printf("warn: VFIO_UNMAP_DMA(0x%lx, 0x%lx) only unmaps 0x%llx",
>> +                     iova, size, unmap.size);
>> +    }
>> +
>>      return 0;
>>  }
>>  
>> @@ -345,6 +352,9 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>          return 0;
>>      }
>>  
>> +    trace_vfio_dma_map(container->fd, iova, size, vaddr,
>> +                       (readonly ? "r" : "rw"));
>> +
>>      if (!readonly) {
>>          map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
>>      }
>> @@ -985,7 +995,8 @@ int vfio_region_mmap(VFIORegion *region)
>>          trace_vfio_region_mmap(memory_region_name(&region->mmaps[i].mem),
>>                                 region->mmaps[i].offset,
>>                                 region->mmaps[i].offset +
>> -                               region->mmaps[i].size - 1);
>> +                               region->mmaps[i].size - 1,
>> +                               region->mmaps[i].mmap);
>>      }
>>  
>>      return 0;
>> @@ -1696,6 +1707,9 @@ retry:
>>          goto retry;
>>      }
>>  
>> +    trace_vfio_region(vbasedev->name, index, (*info)->offset, (*info)->size,
>> +                      (*info)->cap_offset, (*info)->flags);
>> +
>>      return 0;
>>  }
>>  
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index f72e277..d74e078 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -41,6 +41,7 @@
>>  #include "trace.h"
>>  #include "qapi/error.h"
>>  #include "migration/blocker.h"
>> +#include "trace-root.h"
>>  
>>  #define TYPE_VFIO_PCI "vfio-pci"
>>  #define PCI_VFIO(obj)    OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 10d899c..83cd0a6 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -25,7 +25,7 @@ vfio_pci_size_rom(const char *name, int size) "%s ROM size 0x%x"
>>  vfio_vga_write(uint64_t addr, uint64_t data, int size) " (0x%"PRIx64", 0x%"PRIx64", %d)"
>>  vfio_vga_read(uint64_t addr, int size, uint64_t data) " (0x%"PRIx64", %d) = 0x%"PRIx64
>>  vfio_pci_read_config(const char *name, int addr, int len, int val) " (%s, @0x%x, len=0x%x) 0x%x"
>> -vfio_pci_write_config(const char *name, int addr, int val, int len) " (%s, @0x%x, 0x%x, len=0x%x)"
>> +vfio_pci_write_config(const char *name, int addr, int val, int len) "(%s, @0x%x, 0x%x, 0x%xB)"
>>  vfio_msi_setup(const char *name, int pos) "%s PCI MSI CAP @0x%x"
>>  vfio_msix_early_setup(const char *name, int pos, int table_bar, int offset, int entries) "%s PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, entries %d"
>>  vfio_check_pcie_flr(const char *name) "%s Supports FLR via PCIe cap"
>> @@ -37,7 +37,7 @@ vfio_pci_hot_reset_dep_devices(int domain, int bus, int slot, int function, int
>>  vfio_pci_hot_reset_result(const char *name, const char *result) "%s hot reset: %s"
>>  vfio_populate_device_config(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s config:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
>>  vfio_populate_device_get_irq_info_failure(const char *errstr) "VFIO_DEVICE_GET_IRQ_INFO failure: %s"
>> -vfio_realize(const char *name, int group_id) " (%s) group %d"
>> +vfio_realize(const char *name, int group_id) "(%s) group %d"
>>  vfio_mdev(const char *name, bool is_mdev) " (%s) is_mdev %d"
>>  vfio_add_ext_cap_dropped(const char *name, uint16_t cap, uint16_t offset) "%s 0x%x@0x%x"
>>  vfio_pci_reset(const char *name) " (%s)"
>> @@ -109,7 +109,7 @@ vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions,
>>  vfio_put_base_device(int fd) "close vdev->fd=%d"
>>  vfio_region_setup(const char *dev, int index, const char *name, unsigned long flags, unsigned long offset, unsigned long size) "Device %s, region %d \"%s\", flags: 0x%lx, offset: 0x%lx, size: 0x%lx"
>>  vfio_region_mmap_fault(const char *name, int index, unsigned long offset, unsigned long size, int fault) "Region %s mmaps[%d], [0x%lx - 0x%lx], fault: %d"
>> -vfio_region_mmap(const char *name, unsigned long offset, unsigned long end) "Region %s [0x%lx - 0x%lx]"
>> +vfio_region_mmap(const char *name, unsigned long offset, unsigned long end, void *addr) "%s [0x%lx - 0x%lx] maps to %p"
>>  vfio_region_exit(const char *name, int index) "Device %s, region %d"
>>  vfio_region_finalize(const char *name, int index) "Device %s, region %d"
>>  vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
>> @@ -117,6 +117,9 @@ vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Devic
>>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
>>  vfio_dma_unmap_overflow_workaround(void) ""
>> +vfio_dma_unmap(int fd, uint64_t iova, uint64_t size) "fd %d, iova 0x%"PRIx64", len 0x%"PRIx64
>> +vfio_dma_map(int fd, uint64_t iova, uint64_t size, void *addr, const char *access) "fd %d, iova 0x%"PRIx64", len 0x%"PRIx64", va %p, %s"
>> +vfio_region(const char *name, int index, uint64_t offset, uint64_t size, int cap_offset, int flags) "%s [%d]: +0x%"PRIx64", 0x%"PRIx64"B, cap +0x%x, flags 0x%x"
>>  
>>  # platform.c
>>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
>> diff --git a/trace-events b/trace-events
>> index 42107eb..98589a4 100644
>> --- a/trace-events
>> +++ b/trace-events
>> @@ -107,6 +107,8 @@ qmp_job_complete(void *job) "job %p"
>>  qmp_job_finalize(void *job) "job %p"
>>  qmp_job_dismiss(void *job) "job %p"
>>  
>> +# exec.c
>> +ram_block_add(void *host, uint64_t maxlen, const char *name, const char *mode) "host=%p, maxlen=0x%"PRIx64", mr = {name=%s, %s}"
>>  
>>  ### Guest events, keep at bottom
>>  
>> -- 
>> 1.8.3.1
>>



* Re: [PATCH V1 22/32] char: qio_channel_socket_accept reuse fd
  2020-09-15 17:33   ` Dr. David Alan Gilbert
  2020-09-15 17:53     ` Daniel P. Berrangé
@ 2020-09-24 21:54     ` Steven Sistare
  1 sibling, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-09-24 21:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/15/2020 1:33 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> From: Mark Kanda <mark.kanda@oracle.com>
>>
>> Add an fd argument to qio_channel_socket_accept.  If not -1, the channel
>> uses that fd instead of accepting a new socket connection.  All callers
>> pass -1 in this patch, so no functional change.
> 
> Doesn't some of this just come from the fact you're insisting on reusing
> the command line?   We should be able to open a chardev on an fd
> shouldn't we?

If the management layer originally added the char device via hot plug, then
we expect it to do so again after restart, following the typical practice for
live migration.  The device has no presence on the command line.
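
For reference, the accept-or-reuse behavior described in the commit
message can be sketched in plain POSIX terms as below.  The function name
demo_accept_or_reuse is hypothetical and this is not the QIOChannel code.
The fcntl check is an addition for the sketch that the patch does not
have.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: return reuse_fd if the caller supplied one,
 * otherwise accept a new connection on listen_fd.  Returns -1 on error.
 */
static int demo_accept_or_reuse(int listen_fd, int reuse_fd)
{
    int fd;

    if (reuse_fd != -1) {
        /* Added check, not in the patch: reject a closed descriptor. */
        return fcntl(reuse_fd, F_GETFD) == -1 ? -1 : reuse_fd;
    }
    do {
        fd = accept(listen_fd, NULL, NULL);
    } while (fd < 0 && errno == EINTR);    /* retry if interrupted */
    return fd;
}
```

Passing -1 for reuse_fd gives the ordinary accept path, which matches how
all callers in this patch use the new argument.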

- Steve

>> Signed-off-by: Mark Kanda <mark.kanda@oracle.com>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  include/io/channel-socket.h    |  3 ++-
>>  io/channel-socket.c            | 12 +++++++++---
>>  io/net-listener.c              |  4 ++--
>>  scsi/qemu-pr-helper.c          |  2 +-
>>  tests/qtest/tpm-emu.c          |  2 +-
>>  tests/test-char.c              |  2 +-
>>  tests/test-io-channel-socket.c |  4 ++--
>>  7 files changed, 18 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
>> index 777ff59..0ffc560 100644
>> --- a/include/io/channel-socket.h
>> +++ b/include/io/channel-socket.h
>> @@ -248,6 +248,7 @@ qio_channel_socket_get_remote_address(QIOChannelSocket *ioc,
>>  /**
>>   * qio_channel_socket_accept:
>>   * @ioc: the socket channel object
>> + * @reuse_fd: fd to reuse; -1 otherwise
>>   * @errp: pointer to a NULL-initialized error object
>>   *
>>   * If the socket represents a server, then this accepts
>> @@ -258,7 +259,7 @@ qio_channel_socket_get_remote_address(QIOChannelSocket *ioc,
>>   */
>>  QIOChannelSocket *
>>  qio_channel_socket_accept(QIOChannelSocket *ioc,
>> -                          Error **errp);
>> +                          int reuse_fd, Error **errp);
>>  
>>  
>>  #endif /* QIO_CHANNEL_SOCKET_H */
>> diff --git a/io/channel-socket.c b/io/channel-socket.c
>> index e1b4667..dde12bf 100644
>> --- a/io/channel-socket.c
>> +++ b/io/channel-socket.c
>> @@ -352,7 +352,7 @@ void qio_channel_socket_dgram_async(QIOChannelSocket *ioc,
>>  
>>  QIOChannelSocket *
>>  qio_channel_socket_accept(QIOChannelSocket *ioc,
>> -                          Error **errp)
>> +                          int reuse_fd, Error **errp)
>>  {
>>      QIOChannelSocket *cioc;
>>  
>> @@ -362,8 +362,14 @@ qio_channel_socket_accept(QIOChannelSocket *ioc,
>>  
>>   retry:
>>      trace_qio_channel_socket_accept(ioc);
>> -    cioc->fd = qemu_accept(ioc->fd, (struct sockaddr *)&cioc->remoteAddr,
>> -                           &cioc->remoteAddrLen);
>> +
>> +    if (reuse_fd != -1) {
>> +        cioc->fd = reuse_fd;
>> +    } else {
>> +        cioc->fd = qemu_accept(ioc->fd, (struct sockaddr *)&cioc->remoteAddr,
>> +                               &cioc->remoteAddrLen);
>> +    }
>> +
>>      if (cioc->fd < 0) {
>>          if (errno == EINTR) {
>>              goto retry;
>> diff --git a/io/net-listener.c b/io/net-listener.c
>> index 5d8a226..bbdea1e 100644
>> --- a/io/net-listener.c
>> +++ b/io/net-listener.c
>> @@ -45,7 +45,7 @@ static gboolean qio_net_listener_channel_func(QIOChannel *ioc,
>>      QIOChannelSocket *sioc;
>>  
>>      sioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
>> -                                     NULL);
>> +                                     -1, NULL);
>>      if (!sioc) {
>>          return TRUE;
>>      }
>> @@ -194,7 +194,7 @@ static gboolean qio_net_listener_wait_client_func(QIOChannel *ioc,
>>      QIOChannelSocket *sioc;
>>  
>>      sioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
>> -                                     NULL);
>> +                                     -1, NULL);
>>      if (!sioc) {
>>          return TRUE;
>>      }
>> diff --git a/scsi/qemu-pr-helper.c b/scsi/qemu-pr-helper.c
>> index 57ad830..0e6d683 100644
>> --- a/scsi/qemu-pr-helper.c
>> +++ b/scsi/qemu-pr-helper.c
>> @@ -800,7 +800,7 @@ static gboolean accept_client(QIOChannel *ioc, GIOCondition cond, gpointer opaqu
>>      PRHelperClient *prh;
>>  
>>      cioc = qio_channel_socket_accept(QIO_CHANNEL_SOCKET(ioc),
>> -                                     NULL);
>> +                                     -1, NULL);
>>      if (!cioc) {
>>          return TRUE;
>>      }
>> diff --git a/tests/qtest/tpm-emu.c b/tests/qtest/tpm-emu.c
>> index 2e8eb7b..19e5dab 100644
>> --- a/tests/qtest/tpm-emu.c
>> +++ b/tests/qtest/tpm-emu.c
>> @@ -83,7 +83,7 @@ void *tpm_emu_ctrl_thread(void *data)
>>      g_cond_signal(&s->data_cond);
>>  
>>      qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
>> -    ioc = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
>> +    ioc = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
>>      g_assert(ioc);
>>  
>>      {
>> diff --git a/tests/test-char.c b/tests/test-char.c
>> index 614bdac..1bb6ae0 100644
>> --- a/tests/test-char.c
>> +++ b/tests/test-char.c
>> @@ -884,7 +884,7 @@ char_socket_client_server_thread(gpointer data)
>>      QIOChannelSocket *cioc;
>>  
>>  retry:
>> -    cioc = qio_channel_socket_accept(ioc, &error_abort);
>> +    cioc = qio_channel_socket_accept(ioc, -1, &error_abort);
>>      g_assert_nonnull(cioc);
>>  
>>      if (char_socket_ping_pong(QIO_CHANNEL(cioc), NULL) != 0) {
>> diff --git a/tests/test-io-channel-socket.c b/tests/test-io-channel-socket.c
>> index d43083a..0d410cf 100644
>> --- a/tests/test-io-channel-socket.c
>> +++ b/tests/test-io-channel-socket.c
>> @@ -75,7 +75,7 @@ static void test_io_channel_setup_sync(SocketAddress *listen_addr,
>>      qio_channel_set_delay(*src, false);
>>  
>>      qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
>> -    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
>> +    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
>>      g_assert(*dst);
>>  
>>      test_io_channel_set_socket_bufs(*src, *dst);
>> @@ -143,7 +143,7 @@ static void test_io_channel_setup_async(SocketAddress *listen_addr,
>>      g_assert(!data.err);
>>  
>>      qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN);
>> -    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, &error_abort));
>> +    *dst = QIO_CHANNEL(qio_channel_socket_accept(lioc, -1, &error_abort));
>>      g_assert(*dst);
>>  
>>      qio_channel_set_delay(*src, false);
>> -- 
>> 1.8.3.1
>>



* Re: [PATCH V1 11/32] cpu: disable ticks when suspended
  2020-09-24 20:42     ` Steven Sistare
@ 2020-09-25  9:03       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-25  9:03 UTC (permalink / raw)
  To: Steven Sistare, mtosatti
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 9/11/2020 1:53 PM, Dr. David Alan Gilbert wrote:
> > * Steve Sistare (steven.sistare@oracle.com) wrote:
> >> After cprload, the guest console misbehaves.  You must type 8 characters
> >> before any are echoed to the terminal.  Qemu was not sending interrupts
> >> to the guest because the QEMU_CLOCK_VIRTUAL timers_state.cpu_clock_offset
> >> was bad.  The offset is usually updated at cprsave time by the path
> >>
> >>   save_cpr_snapshot()
> >>     vm_stop()
> >>       do_vm_stop()
> >>         if (runstate_is_running())
> >>           cpu_disable_ticks();
> >>             timers_state.cpu_clock_offset = cpu_get_clock_locked();
> >>
> >> However, if the guest is in RUN_STATE_SUSPENDED, then cpu_disable_ticks is
> >> not called.  Further, the earlier transition to suspended in
> >> qemu_system_suspend did not disable ticks.  To fix, call cpu_disable_ticks
> >> from save_cpr_snapshot.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > 
> > Are you saying this is really a more generic bug with migrating when
> > suspended and we should fix this anyway?
> 
> Yes.  Or when suspended and calling save_vmstate(), or qmp_xen_save_devices_state().
> Each of those functions needs the same fix unless someone identifies a more
> centralized way in the state transition logic to disable ticks.

OK, in that case please split this out of the series and we can take a
fix as normal; please cc mtosatti@redhat.com.
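
To make the failure mode concrete, here is a toy model of the offset
bookkeeping described in the commit message.  It is only a sketch: the
ToyTimersState type, the toy_* helpers and the host_clock_ns variable are
invented for illustration and are not QEMU's implementation.  The assumption
is that the saved value is the raw offset field, which only holds the guest
clock value once ticks have been disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the host monotonic clock (hypothetical, set by the caller). */
static int64_t host_clock_ns;

/* Toy version of the two timers_state fields that matter here. */
typedef struct {
    int64_t cpu_clock_offset;
    bool ticks_enabled;
} ToyTimersState;

/* Guest clock: offset plus host clock while ticking, frozen offset otherwise. */
static int64_t toy_get_clock(const ToyTimersState *ts)
{
    return ts->ticks_enabled ? ts->cpu_clock_offset + host_clock_ns
                             : ts->cpu_clock_offset;
}

static void toy_enable_ticks(ToyTimersState *ts)
{
    if (!ts->ticks_enabled) {
        ts->cpu_clock_offset -= host_clock_ns;
        ts->ticks_enabled = true;
    }
}

/* Freeze the guest clock: fold the current value into the offset. */
static void toy_disable_ticks(ToyTimersState *ts)
{
    if (ts->ticks_enabled) {
        ts->cpu_clock_offset = toy_get_clock(ts);
        ts->ticks_enabled = false;
    }
}

/*
 * Value a save path would serialize.  With apply_fix set, a suspended
 * guest gets its ticks disabled first, as the patch proposes.
 */
static int64_t toy_saved_offset(ToyTimersState *ts, bool suspended,
                                bool apply_fix)
{
    if (suspended && apply_fix) {
        toy_disable_ticks(ts);
    }
    return ts->cpu_clock_offset;
}
```

In this model, a guest that enabled ticks at host time 1000 and is saved at
host time 5000 while suspended serializes -1000 without the fix, and 4000
(the actual guest clock) with it.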

Dave

> 
> - Steve
> 
> >> ---
> >>  migration/savevm.c | 5 +++++
> >>  1 file changed, 5 insertions(+)
> >>
> >> diff --git a/migration/savevm.c b/migration/savevm.c
> >> index f101039..00f493b 100644
> >> --- a/migration/savevm.c
> >> +++ b/migration/savevm.c
> >> @@ -2729,6 +2729,11 @@ void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
> >>          return;
> >>      }
> >>  
> >> +    /* Update timers_state before saving.  Suspend did not so do. */
> >> +    if (runstate_check(RUN_STATE_SUSPENDED)) {
> >> +        cpu_disable_ticks();
> >> +    }
> >> +
> >>      vm_stop(RUN_STATE_SAVE_VM);
> >>  
> >>      ret = qemu_savevm_state(f, op, errp);
> >> -- 
> >> 1.8.3.1
> >>
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 01/32] savevm: add vmstate handler iterators
  2020-09-24 21:43     ` Steven Sistare
@ 2020-09-25  9:07       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-25  9:07 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 9/11/2020 12:24 PM, Dr. David Alan Gilbert wrote:
> > Apologies for taking a while to get around to this, 
> > 
> > * Steve Sistare (steven.sistare@oracle.com) wrote:
> >> Provide the SAVEVM_FOREACH and SAVEVM_FORALL macros to loop over all save
> >> VM state handlers.  The former will filter handlers based on the operation
> >> in the later patch "savevm: VM handlers mode mask".  The latter loops over
> >> all handlers.
> >>
> >> No functional change.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  migration/savevm.c | 57 ++++++++++++++++++++++++++++++++++++------------------
> >>  1 file changed, 38 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/migration/savevm.c b/migration/savevm.c
> >> index 45c9dd9..a07fcad 100644
> >> --- a/migration/savevm.c
> >> +++ b/migration/savevm.c
> >> @@ -266,6 +266,25 @@ static SaveState savevm_state = {
> >>      .global_section_id = 0,
> >>  };
> >>  
> >> +/*
> >> + * The FOREACH macros will filter handlers based on the current operation when
> >> + * additional conditions are added in a subsequent patch.
> >> + */
> >> +
> >> +#define SAVEVM_FOREACH(se, entry)                                    \
> >> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)                \
> >> +
> >> +#define SAVEVM_FOREACH_SAFE(se, entry, new_se)                       \
> >> +    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)   \
> >> +
> >> +/* The FORALL macros unconditionally loop over all handlers. */
> >> +
> >> +#define SAVEVM_FORALL(se, entry)                                     \
> >> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry)
> >> +
> >> +#define SAVEVM_FORALL_SAFE(se, entry, new_se)                        \
> >> +    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)
> >> +
> > 
> > OK, can I ask you to merge this with the next patch but to spin it the
> > other way, so that we have:
> > 
> >   SAVEVM_FOR(se, entry, mask)
> > 
> > and the places you use SAVEVM_FORALL_SAFE would become
> > 
> >   SAVEVM_FOR(se, entry, VMS_MODE_ALL)
> > 
> > I'm thinking at some point in the future we could merge a bunch of the
> > other flag checks in there.
> 
> Sure.  Is this what you have in mind?
> 
> #define SAVEVM_FOR(se, entry, mask)                    \
>     QTAILQ_FOREACH(se, &savevm_state.handlers, entry)  \
>         if (savevm_state.mode & mask)
> 
> #define SAVEVM_FOR_SAFE(se, entry, new_se, mask)                    \
>     QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se)  \
>         if (savevm_state.mode & mask)
> 
> Callers:
>   SAVEVM_FOR(se, entry, mode_mask(se))
>   SAVEVM_FOR(se, entry, VMS_MODE_ALL)
>   SAVEVM_FOR_SAFE(se, entry, new_se, mode_mask(se))
>   SAVEVM_FOR_SAFE(se, entry, new_se, VMS_MODE_ALL)

Yeh that looks about OK.
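
Two details may be worth settling when the macro is respun.  A macro that
ends in a bare "if" will capture an "else" written after the loop, and an
unparenthesized mask argument such as A | B would bind as (mode & A) | B.
The sketch below shows one common way around both.  It uses a plain linked
list instead of QTAILQ, and the ToyHandler type, toy_* names and handler
table are made up for illustration; only the VMS_* values are taken from
the quoted patches.

```c
#include <assert.h>
#include <stddef.h>

#define VMS_MIGRATE   (1U << 1)
#define VMS_SNAPSHOT  (1U << 2)
#define VMS_REBOOT    (1U << 3)
#define VMS_MODE_ALL  (~0U)

/* Hypothetical stand-in for SaveStateEntry. */
typedef struct ToyHandler {
    const char *idstr;
    unsigned mode_mask;          /* modes this handler participates in */
    struct ToyHandler *next;
} ToyHandler;

/* Example table: ram and block opt out of reboot, as in patch 3. */
static ToyHandler toy_table[3] = {
    { "ram",    VMS_MIGRATE | VMS_SNAPSHOT, &toy_table[1] },
    { "block",  VMS_MIGRATE | VMS_SNAPSHOT, &toy_table[2] },
    { "device", VMS_MODE_ALL,               NULL },
};

static ToyHandler *toy_handlers = toy_table;
static unsigned toy_current_mode;

/*
 * Filtered iteration.  The "if (!cond) {} else" form keeps a following
 * "else" from attaching to the macro, and (mask) is parenthesized so an
 * expression argument is evaluated as a unit.
 */
#define TOY_SAVEVM_FOR(se, mask)                               \
    for ((se) = toy_handlers; (se); (se) = (se)->next)         \
        if (!(toy_current_mode & (mask))) { } else

/* Count handlers visited for a mode, filtered or unconditional. */
static int toy_count(unsigned mode, int filtered)
{
    ToyHandler *se;
    int n = 0;

    toy_current_mode = mode;
    if (filtered) {
        TOY_SAVEVM_FOR(se, se->mode_mask) {
            n++;
        }
    } else {
        TOY_SAVEVM_FOR(se, VMS_MODE_ALL) {
            n++;
        }
    }
    return n;
}
```

With this table, a filtered walk in reboot mode visits only the "device"
entry, while the VMS_MODE_ALL walk visits all three.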

Dave

> - Steve
> 
> >>  static bool should_validate_capability(int capability)
> >>  {
> >>      assert(capability >= 0 && capability < MIGRATION_CAPABILITY__MAX);
> >> @@ -673,7 +692,7 @@ static uint32_t calculate_new_instance_id(const char *idstr)
> >>      SaveStateEntry *se;
> >>      uint32_t instance_id = 0;
> >>  
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FORALL(se, entry) {
> >>          if (strcmp(idstr, se->idstr) == 0
> >>              && instance_id <= se->instance_id) {
> >>              instance_id = se->instance_id + 1;
> >> @@ -689,7 +708,7 @@ static int calculate_compat_instance_id(const char *idstr)
> >>      SaveStateEntry *se;
> >>      int instance_id = 0;
> >>  
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FORALL(se, entry) {
> >>          if (!se->compat) {
> >>              continue;
> >>          }
> >> @@ -803,7 +822,7 @@ void unregister_savevm(VMStateIf *obj, const char *idstr, void *opaque)
> >>      }
> >>      pstrcat(id, sizeof(id), idstr);
> >>  
> >> -    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
> >> +    SAVEVM_FORALL_SAFE(se, entry, new_se) {
> >>          if (strcmp(se->idstr, id) == 0 && se->opaque == opaque) {
> >>              savevm_state_handler_remove(se);
> >>              g_free(se->compat);
> >> @@ -867,7 +886,7 @@ void vmstate_unregister(VMStateIf *obj, const VMStateDescription *vmsd,
> >>  {
> >>      SaveStateEntry *se, *new_se;
> >>  
> >> -    QTAILQ_FOREACH_SAFE(se, &savevm_state.handlers, entry, new_se) {
> >> +    SAVEVM_FORALL_SAFE(se, entry, new_se) {
> >>          if (se->vmsd == vmsd && se->opaque == opaque) {
> >>              savevm_state_handler_remove(se);
> >>              g_free(se->compat);
> >> @@ -1119,7 +1138,7 @@ bool qemu_savevm_state_blocked(Error **errp)
> >>  {
> >>      SaveStateEntry *se;
> >>  
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FORALL(se, entry) {
> >>          if (se->vmsd && se->vmsd->unmigratable) {
> >>              error_setg(errp, "State blocked by non-migratable device '%s'",
> >>                         se->idstr);
> >> @@ -1145,7 +1164,7 @@ bool qemu_savevm_state_guest_unplug_pending(void)
> >>  {
> >>      SaveStateEntry *se;
> >>  
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          if (se->vmsd && se->vmsd->dev_unplug_pending &&
> >>              se->vmsd->dev_unplug_pending(se->opaque)) {
> >>              return true;
> >> @@ -1162,7 +1181,7 @@ void qemu_savevm_state_setup(QEMUFile *f)
> >>      int ret;
> >>  
> >>      trace_savevm_state_setup();
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          if (!se->ops || !se->ops->save_setup) {
> >>              continue;
> >>          }
> >> @@ -1193,7 +1212,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
> >>  
> >>      trace_savevm_state_resume_prepare();
> >>  
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          if (!se->ops || !se->ops->resume_prepare) {
> >>              continue;
> >>          }
> >> @@ -1223,7 +1242,7 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
> >>      int ret = 1;
> >>  
> >>      trace_savevm_state_iterate();
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          if (!se->ops || !se->ops->save_live_iterate) {
> >>              continue;
> >>          }
> >> @@ -1291,7 +1310,7 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f)
> >>      SaveStateEntry *se;
> >>      int ret;
> >>  
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          if (!se->ops || !se->ops->save_live_complete_postcopy) {
> >>              continue;
> >>          }
> >> @@ -1324,7 +1343,7 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
> >>      SaveStateEntry *se;
> >>      int ret;
> >>  
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          if (!se->ops ||
> >>              (in_postcopy && se->ops->has_postcopy &&
> >>               se->ops->has_postcopy(se->opaque)) ||
> >> @@ -1366,7 +1385,7 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> >>      vmdesc = qjson_new();
> >>      json_prop_int(vmdesc, "page_size", qemu_target_page_size());
> >>      json_start_array(vmdesc, "devices");
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>  
> >>          if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
> >>              continue;
> >> @@ -1476,7 +1495,7 @@ void qemu_savevm_state_pending(QEMUFile *f, uint64_t threshold_size,
> >>      *res_postcopy_only = 0;
> >>  
> >>  
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          if (!se->ops || !se->ops->save_live_pending) {
> >>              continue;
> >>          }
> >> @@ -1501,7 +1520,7 @@ void qemu_savevm_state_cleanup(void)
> >>      }
> >>  
> >>      trace_savevm_state_cleanup();
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          if (se->ops && se->ops->save_cleanup) {
> >>              se->ops->save_cleanup(se->opaque);
> >>          }
> >> @@ -1580,7 +1599,7 @@ int qemu_save_device_state(QEMUFile *f)
> >>      }
> >>      cpu_synchronize_all_states();
> >>  
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          int ret;
> >>  
> >>          if (se->is_ram) {
> >> @@ -1612,7 +1631,7 @@ static SaveStateEntry *find_se(const char *idstr, uint32_t instance_id)
> >>  {
> >>      SaveStateEntry *se;
> >>  
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FORALL(se, entry) {
> >>          if (!strcmp(se->idstr, idstr) &&
> >>              (instance_id == se->instance_id ||
> >>               instance_id == se->alias_id))
> >> @@ -2334,7 +2353,7 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
> >>      }
> >>  
> >>      trace_qemu_loadvm_state_section_partend(section_id);
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          if (se->load_section_id == section_id) {
> >>              break;
> >>          }
> >> @@ -2400,7 +2419,7 @@ static int qemu_loadvm_state_setup(QEMUFile *f)
> >>      int ret;
> >>  
> >>      trace_loadvm_state_setup();
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          if (!se->ops || !se->ops->load_setup) {
> >>              continue;
> >>          }
> >> @@ -2425,7 +2444,7 @@ void qemu_loadvm_state_cleanup(void)
> >>      SaveStateEntry *se;
> >>  
> >>      trace_loadvm_state_cleanup();
> >> -    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> >> +    SAVEVM_FOREACH(se, entry) {
> >>          if (se->ops && se->ops->load_cleanup) {
> >>              se->ops->load_cleanup(se->opaque);
> >>          }
> >> -- 
> >> 1.8.3.1
> >>
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 04/32] savevm: HMP Command for cprsave
  2020-09-24 21:44     ` Steven Sistare
@ 2020-09-25  9:26       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-25  9:26 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 9/11/2020 12:57 PM, Dr. David Alan Gilbert wrote:
> > * Steve Sistare (steven.sistare@oracle.com) wrote:
> >> Enable HMP access to the cprsave QMP command.
> >>
> >> Usage: cprsave <filename> <mode>
> >>
> >> Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > 
> > I realise that the current mode is currently only 'reboot' - can you
> > please give us a clue as to why you've got a mode argument that's
> > currently only got one mode?
> 
> Patch 14 adds the restart mode.
> I factored the patches by capability to make the review easier, first
> presenting the reboot patches, then the restart patches.

OK, but just add a comment here saying that you'll add another mode
later; otherwise it looks a bit weird.
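
One way to make the planned second mode visible in the code itself is a
small lookup table in place of the strcmp chain, so adding a mode is a
one-line change.  This is only a sketch: VMS_REBOOT is the value from
patch 3, while TOY_VMS_RESTART, the table and the toy_cpr_mode_lookup
helper are hypothetical names, not what patch 14 actually defines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VMS_REBOOT       (1U << 3)   /* from patch 3 */
#define TOY_VMS_RESTART  (1U << 4)   /* hypothetical value for the later mode */

/* Mode names accepted by cprsave, one row per mode. */
static const struct {
    const char *name;
    unsigned op;
} toy_cpr_modes[] = {
    { "reboot",  VMS_REBOOT },
    { "restart", TOY_VMS_RESTART },  /* assumed; added by a later patch */
};

/* Return the mode bit for a name, or 0 if the name is NULL or unknown. */
static unsigned toy_cpr_mode_lookup(const char *mode)
{
    size_t i;

    if (!mode) {
        return 0;
    }
    for (i = 0; i < sizeof(toy_cpr_modes) / sizeof(toy_cpr_modes[0]); i++) {
        if (!strcmp(mode, toy_cpr_modes[i].name)) {
            return toy_cpr_modes[i].op;
        }
    }
    return 0;
}
```

A caller would report "bad mode" when the lookup returns 0, which also
covers a missing argument.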

Dave

> - Steve
> 
> >> ---
> >>  hmp-commands.hx       | 18 ++++++++++++++++++
> >>  include/monitor/hmp.h |  1 +
> >>  monitor/hmp-cmds.c    | 10 ++++++++++
> >>  3 files changed, 29 insertions(+)
> >>
> >> diff --git a/hmp-commands.hx b/hmp-commands.hx
> >> index 60f395c..c8defd9 100644
> >> --- a/hmp-commands.hx
> >> +++ b/hmp-commands.hx
> >> @@ -354,6 +354,24 @@ SRST
> >>  ERST
> >>  
> >>      {
> >> +        .name       = "cprsave",
> >> +        .args_type  = "file:s,mode:s",
> >> +        .params     = "file 'reboot'",
> >> +        .help       = "create a checkpoint of the VM in file",
> >> +        .cmd        = hmp_cprsave,
> >> +    },
> >> +
> >> +SRST
> >> +``cprsave`` *tag*
> >> +  Stop VCPUs, create a checkpoint of the whole virtual machine and save it
> >> +  in *file*.
> >> +  If *mode* is 'reboot', the checkpoint can be cprload'ed after a host kexec
> >> +  reboot.
> >> +  exec() /usr/bin/qemu-exec if it exists, else exec /usr/bin/qemu-system-x86_64,
> >> +  passing all the original command line arguments.  The VCPUs remain paused.
> >> +ERST
> >> +
> >> +    {
> >>          .name       = "delvm",
> >>          .args_type  = "name:s",
> >>          .params     = "tag",
> >> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
> >> index c986cfd..af8ee23 100644
> >> --- a/include/monitor/hmp.h
> >> +++ b/include/monitor/hmp.h
> >> @@ -59,6 +59,7 @@ void hmp_balloon(Monitor *mon, const QDict *qdict);
> >>  void hmp_loadvm(Monitor *mon, const QDict *qdict);
> >>  void hmp_savevm(Monitor *mon, const QDict *qdict);
> >>  void hmp_delvm(Monitor *mon, const QDict *qdict);
> >> +void hmp_cprsave(Monitor *mon, const QDict *qdict);
> >>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
> >>  void hmp_migrate_continue(Monitor *mon, const QDict *qdict);
> >>  void hmp_migrate_incoming(Monitor *mon, const QDict *qdict);
> >> diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c
> >> index ae4b6a4..59196ed 100644
> >> --- a/monitor/hmp-cmds.c
> >> +++ b/monitor/hmp-cmds.c
> >> @@ -1139,6 +1139,16 @@ void hmp_announce_self(Monitor *mon, const QDict *qdict)
> >>      qapi_free_AnnounceParameters(params);
> >>  }
> >>  
> >> +void hmp_cprsave(Monitor *mon, const QDict *qdict)
> >> +{
> >> +    Error *err = NULL;
> >> +
> >> +    qmp_cprsave(qdict_get_try_str(qdict, "file"),
> >> +                qdict_get_try_str(qdict, "mode"),
> >> +                &err);
> >> +    hmp_handle_error(mon, err);
> >> +}
> >> +
> >>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict)
> >>  {
> >>      qmp_migrate_cancel(NULL);
> >> -- 
> >> 1.8.3.1
> >>
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 15/32] vl: QEMU_START_FREEZE env var
  2020-09-24 21:47     ` Steven Sistare
@ 2020-09-25 15:52       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 118+ messages in thread
From: Dr. David Alan Gilbert @ 2020-09-25 15:52 UTC (permalink / raw)
  To: Steven Sistare, Daniel P. Berrange
  Cc: Michael S. Tsirkin, Alex Bennée, Juan Quintela, qemu-devel,
	Markus Armbruster, Alex Williamson, Stefan Hajnoczi,
	Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 9/11/2020 2:49 PM, Dr. David Alan Gilbert wrote:
> > * Steve Sistare (steven.sistare@oracle.com) wrote:
> >> For qemu upgrade and restart, we will re-exec() qemu with the same argv.
> >> However, qemu must start in a paused state and wait for the cprload command,
> >> and the original argv might not contain the -S option.  To avoid modifying
> >> argv, provide the QEMU_START_FREEZE environment variable.  If
> >> QEMU_START_FREEZE is set, then set autostart=0, like the -S option.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > 
> > What's wrong with modifying the argv?
> > 
> > Note, also the trick -incoming defer uses;  the whole point here is that
> > we start qemu with   -incoming defer     and then we can issue commands
> > to modify the QEMU configuration before we actually reload state.
> > 
> > Note, even without CPR there might be reasons that you need to modify
> > the argv; for example, imagine that since it was originally booted
> > someone had hotplug added an extra CPU or RAM or a disk; the new QEMU
> > must be started in a state that reflects the state in which the VM was
> > at the point when it was saved, not the point at which it was started
> > long ago.
> 
> The code is simpler if we do not need to parse and massage the argv, and that is 
> sufficient for many use cases.  QEMU_START_FREEZE adds only a few lines of code, and 
> it's nice to have that choice.
> 
> For hot plug, we rely on the management layer to know what devices were plugged
> after the initial startup, and re-plug them after restart.  cprsave restarts qemu,
> which creates command-line devices.  At this point the manager would send the hotplug 
> commands (just like -incoming defer), then send cprload. 
> 
> Having said that, if the management layer sometimes performs live migration, and sometimes
> performs cpr restart, then we need to strip out any -incoming args from argv before restart.
> This can be done in the vendor-specific qemu-exec helper (patch 20).

My problem is I can see a whole bunch of places that reusing the
original argv breaks, so I don't think this is a useful general
solution:

   a) The -incoming example
   b) The management app has to replay the hotplug sequence
   c) ...even if it did there's no guarantee that the original
pre-hotplug commandline works:
      i) e.g. an original block device file was deleted
     ii) One of the endpoints for a network device is gone.

  Any part of (c) could cause the exec'd qemu to fail before
it gets as far as allowing you to issue the hotplug commands.
It's also plain dangerous, since the exec'd qemu shouldn't be accessing
a file or device that has been hot-unplugged and might now be part of
a different VM.

So I think you really should pass another command line option here
rather than setting an environment variable; and then I think you should
consider two separate things:

  a) You could easily strip out options of the form --cpr-freeze
  b) Consider something more general; e.g. allow the management layer to
specify a new set of argv to be used by the exec.
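
As a rough illustration of the argv route, the helper below builds the
vector for the exec by dropping one option together with its value and
appending a flag if it is not already present.  It is a sketch under
assumptions: toy_filter_argv is a made-up name, the option strings are
whatever the caller passes in, and a real version would have to handle the
"-opt=value" spelling and options that take no value.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a NULL-terminated copy of argv with every occurrence of
 * opt_with_arg (and the value following it) removed, and flag_to_add
 * appended unless it is already there.  The strings themselves are
 * borrowed from argv; only the vector is allocated.  Returns NULL if
 * the allocation fails.
 */
static char **toy_filter_argv(int argc, char **argv,
                              const char *opt_with_arg,
                              const char *flag_to_add)
{
    char **out = calloc((size_t)argc + 2, sizeof(*out));
    int have_flag = 0;
    int i, n = 0;

    if (!out) {
        return NULL;
    }
    for (i = 0; i < argc; i++) {
        if (!strcmp(argv[i], opt_with_arg)) {
            if (i + 1 < argc) {
                i++;                 /* also skip the option's value */
            }
            continue;
        }
        if (!strcmp(argv[i], flag_to_add)) {
            have_flag = 1;
        }
        out[n++] = argv[i];
    }
    if (!have_flag) {
        out[n++] = (char *)flag_to_add;
    }
    out[n] = NULL;
    return out;
}
```

For example, filtering "qemu -m 4G -incoming defer -nographic" with
"-incoming" and "-S" yields "qemu -m 4G -nographic -S".  This covers only
case (a) above; it does nothing for devices that were hot-plugged or whose
backing files have since gone away.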

Dave

> - Steve
> 
> >> ---
> >>  softmmu/vl.c | 5 +++++
> >>  1 file changed, 5 insertions(+)
> >>
> >> diff --git a/softmmu/vl.c b/softmmu/vl.c
> >> index 951994f..7016e39 100644
> >> --- a/softmmu/vl.c
> >> +++ b/softmmu/vl.c
> >> @@ -4501,6 +4501,11 @@ void qemu_init(int argc, char **argv, char **envp)
> >>          exit(0);
> >>      }
> >>  
> >> +    if (getenv("QEMU_START_FREEZE")) {
> >> +        unsetenv("QEMU_START_FREEZE");
> >> +        autostart = 0;
> >> +    }
> >> +
> >>      if (incoming) {
> >>          Error *local_err = NULL;
> >>          qemu_start_incoming_migration(incoming, &local_err);
> >> -- 
> >> 1.8.3.1
> >>
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH V1 10/32] kvmclock: restore paused KVM clock
  2020-09-11 17:50   ` Dr. David Alan Gilbert
@ 2020-09-25 18:07     ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-09-25 18:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/11/2020 1:50 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> If the VM is paused when the KVM clock is serialized to a file, record
>> that the clock is valid, so the value will be reused rather than
>> overwritten after cprload with a new call to KVM_GET_CLOCK here:
>>
>> kvmclock_vm_state_change()
>>     if (running)
>>         ...
>>     else
>>         if (s->clock_valid)
>>             return;         <-- instead, return here
>>
>>         kvm_update_clock()
>>            kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data)  <-- overwritten
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  hw/i386/kvm/clock.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
>> index 6428335..161991a 100644
>> --- a/hw/i386/kvm/clock.c
>> +++ b/hw/i386/kvm/clock.c
>> @@ -285,18 +285,22 @@ static int kvmclock_pre_save(void *opaque)
>>      if (!s->runstate_paused) {
>>          kvm_update_clock(s);
>>      }
>> +    if (!runstate_is_running()) {
>> +        s->clock_valid = true;
>> +    }
>>  
>>      return 0;
>>  }
>>  
>>  static const VMStateDescription kvmclock_vmsd = {
>>      .name = "kvmclock",
>> -    .version_id = 1,
>> +    .version_id = 2,
>>      .minimum_version_id = 1,
>>      .pre_load = kvmclock_pre_load,
>>      .pre_save = kvmclock_pre_save,
>>      .fields = (VMStateField[]) {
>>          VMSTATE_UINT64(clock, KVMClockState),
>> +        VMSTATE_BOOL_V(clock_valid, KVMClockState, 2),
>>          VMSTATE_END_OF_LIST()
> 
> We always try and avoid bumping version_id unless we're
> desperate because it breaks backwards migration.
> 
> Didn't you already know from the stored migration state
> (in the globalstate) if the loaded VM was running?
> 
> It's also not clear to me why you're avoiding reloading the state;
> have you preserved that some other way?

This patch was needed only for an early version of cprload which had some gratuitous
vmstate transitions.  I will happily drop this patch.

- Steve

>>      },
>>      .subsections = (const VMStateDescription * []) {
>> -- 
>> 1.8.3.1
>>



* Re: [PATCH V1 03/32] savevm: QMP command for cprsave
  2020-09-11 16:43   ` Dr. David Alan Gilbert
@ 2020-09-25 18:43     ` Steven Sistare
  2020-09-25 22:22       ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-09-25 18:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/11/2020 12:43 PM, Dr. David Alan Gilbert wrote:
> * Steve Sistare (steven.sistare@oracle.com) wrote:
>> To enable live reboot, provide the cprsave QMP command and the VMS_REBOOT
>> vmstate-saving operation, which saves the state of the virtual machine in a
>> simple file.
>>
>> Syntax:
>>   {'command':'cprsave', 'data':{'file':'str', 'mode':'str'}}
>>
>>   The mode argument must be 'reboot'.  Additional modes will be defined in
>>   the future.
>>
>> Unlike the savevm command, cprsave supports any type of guest image and
>> block device.  cprsave stops the VM so that guest ram and block devices are
>> not modified after state is saved.  Guest ram must be mapped to a persistent
>> memory file such as /dev/dax0.0.  The ram object vmstate handler and block
>> device handler do not apply to VMS_REBOOT, so restrict them to VMS_MIGRATE
>> or VMS_SNAPSHOT.  After cprsave completes successfully, qemu exits.
>>
>> After issuing cprsave, the caller may update qemu, update the host kernel,
>> reboot, start qemu using the same arguments as the original process, and
>> issue the cprload command to restore the guest.  cprload is added by
>> subsequent patches.
>>
>> If the caller suspends the guest instead of stopping the VM, such as by
>> issuing guest-suspend-ram to the qemu guest agent, then cprsave and cprload
>> support guests with vfio devices.  The guest drivers suspend methods flush
>> outstanding requests and re-initialize the devices, and thus there is no
>> device state to save and restore.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
> 
> Going back a step; could you.....
> 
>> ---
>>  include/migration/vmstate.h |  1 +
>>  include/sysemu/sysemu.h     |  2 ++
>>  migration/block.c           |  1 +
>>  migration/ram.c             |  1 +
>>  migration/savevm.c          | 59 +++++++++++++++++++++++++++++++++++++++++++++
>>  monitor/qmp-cmds.c          |  6 +++++
>>  qapi/migration.json         | 14 +++++++++++
>>  7 files changed, 84 insertions(+)
>>
>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>> index fa575f9..c58551a 100644
>> --- a/include/migration/vmstate.h
>> +++ b/include/migration/vmstate.h
>> @@ -161,6 +161,7 @@ typedef enum {
>>  typedef enum {
>>      VMS_MIGRATE  = (1U << 1),
>>      VMS_SNAPSHOT = (1U << 2),
>> +    VMS_REBOOT   = (1U << 3),
>>      VMS_MODE_ALL = ~0U
>>  } VMStateMode;
>>  
>> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
>> index 4b6a5c4..6fe86e6 100644
>> --- a/include/sysemu/sysemu.h
>> +++ b/include/sysemu/sysemu.h
>> @@ -24,6 +24,8 @@ extern bool machine_init_done;
>>  void qemu_add_machine_init_done_notifier(Notifier *notify);
>>  void qemu_remove_machine_init_done_notifier(Notifier *notify);
>>  
>> +void save_cpr_snapshot(const char *file, const char *mode, Error **errp);
>> +
>>  extern int autostart;
>>  
>>  typedef enum {
>> diff --git a/migration/block.c b/migration/block.c
>> index 737b649..a69accb 100644
>> --- a/migration/block.c
>> +++ b/migration/block.c
>> @@ -1023,6 +1023,7 @@ static SaveVMHandlers savevm_block_handlers = {
>>      .load_state = block_load,
>>      .save_cleanup = block_migration_cleanup,
>>      .is_active = block_is_active,
>> +    .mode_mask = VMS_MIGRATE | VMS_SNAPSHOT,
>>  };
>>  
>>  void blk_mig_init(void)
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 76d4fee..f0d5d9f 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -3795,6 +3795,7 @@ static SaveVMHandlers savevm_ram_handlers = {
>>      .load_setup = ram_load_setup,
>>      .load_cleanup = ram_load_cleanup,
>>      .resume_prepare = ram_resume_prepare,
>> +    .mode_mask = VMS_MIGRATE | VMS_SNAPSHOT,
>>  };
>>  
>>  void ram_mig_init(void)
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index ce02b6b..ff1a46e 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2680,6 +2680,65 @@ int qemu_load_device_state(QEMUFile *f)
>>      return 0;
>>  }
>>  
>> +static QEMUFile *qf_file_open(const char *filename, int flags, int mode,
>> +                              Error **errp)
>> +{
>> +    QIOChannel *ioc;
>> +    int fd = qemu_open(filename, flags, mode);
>> +
>> +    if (fd < 0) {
>> +        error_setg_errno(errp, errno, "%s(%s)", __func__, filename);
>> +        return NULL;
>> +    }
>> +
>> +    ioc = QIO_CHANNEL(qio_channel_file_new_fd(fd));
>> +
>> +    if (flags & O_WRONLY) {
>> +        return qemu_fopen_channel_output(ioc);
>> +    }
>> +
>> +    return qemu_fopen_channel_input(ioc);
>> +}
>> +
>> +void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
>> +{
>> +    int ret = 0;
>> +    QEMUFile *f;
>> +    VMStateMode op;
>> +
>> +    if (!strcmp(mode, "reboot")) {
>> +        op = VMS_REBOOT;
>> +    } else {
>> +        error_setg(errp, "cprsave: bad mode %s", mode);
>> +        return;
>> +    }
>> +
>> +    f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, errp);
>> +    if (!f) {
>> +        return;
>> +    }
>> +
>> +    ret = global_state_store();
>> +    if (ret) {
>> +        error_setg(errp, "Error saving global state");
>> +        qemu_fclose(f);
>> +        return;
>> +    }
>> +
>> +    vm_stop(RUN_STATE_SAVE_VM);
>> +
>> +    ret = qemu_savevm_state(f, op, errp);
>> +    if ((ret < 0) && !*errp) {
>> +        error_setg(errp, "qemu_savevm_state failed");
>> +    }
> 
> just call qemu_save_device_state(f) there rather than introducing the
> modes?
> What you're doing is VERY similar to qmp_xen_save_devices_state and also
> COLO's device state saving.
> 
> (and also very similar to migration with the x-ignore-shared flag set).

Good idea, calling qemu_save_device_state instead of qemu_savevm_state will factor
out the steps that are specific to migration.  I'll still need the mode, though,
to exclude savevm_block_handlers, and maybe for other reasons.  I'll try it.

- Steve

>> +    qemu_fclose(f);
>> +
>> +    if (op == VMS_REBOOT) {
>> +        no_shutdown = 0;
>> +        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
>> +    }
>> +}
>> +
[...]




* Re: [PATCH V1 03/32] savevm: QMP command for cprsave
  2020-09-25 18:43     ` Steven Sistare
@ 2020-09-25 22:22       ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-09-25 22:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Markus Armbruster, Alex Williamson,
	Stefan Hajnoczi, Marc-André Lureau, Paolo Bonzini,
	Philippe Mathieu-Daudé

On 9/25/2020 2:43 PM, Steven Sistare wrote:
> On 9/11/2020 12:43 PM, Dr. David Alan Gilbert wrote:
>> * Steve Sistare (steven.sistare@oracle.com) wrote:
>>> To enable live reboot, provide the cprsave QMP command and the VMS_REBOOT
>>> vmstate-saving operation, which saves the state of the virtual machine in a
>>> simple file.
>>>
>>> Syntax:
>>>   {'command':'cprsave', 'data':{'file':'str', 'mode':'str'}}
>>>
>>>   The mode argument must be 'reboot'.  Additional modes will be defined in
>>>   the future.
>>>
>>> Unlike the savevm command, cprsave supports any type of guest image and
>>> block device.  cprsave stops the VM so that guest ram and block devices are
>>> not modified after state is saved.  Guest ram must be mapped to a persistent
>>> memory file such as /dev/dax0.0.  The ram object vmstate handler and block
>>> device handler do not apply to VMS_REBOOT, so restrict them to VMS_MIGRATE
>>> or VMS_SNAPSHOT.  After cprsave completes successfully, qemu exits.
>>>
>>> After issuing cprsave, the caller may update qemu, update the host kernel,
>>> reboot, start qemu using the same arguments as the original process, and
>>> issue the cprload command to restore the guest.  cprload is added by
>>> subsequent patches.
>>>
>>> If the caller suspends the guest instead of stopping the VM, such as by
>>> issuing guest-suspend-ram to the qemu guest agent, then cprsave and cprload
>>> support guests with vfio devices.  The guest drivers' suspend methods flush
>>> outstanding requests and re-initialize the devices, and thus there is no
>>> device state to save and restore.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> Signed-off-by: Maran Wilson <maran.wilson@oracle.com>
>>
>> Going back a step; could you.....
>>
>>> ---
>>>  include/migration/vmstate.h |  1 +
>>>  include/sysemu/sysemu.h     |  2 ++
>>>  migration/block.c           |  1 +
>>>  migration/ram.c             |  1 +
>>>  migration/savevm.c          | 59 +++++++++++++++++++++++++++++++++++++++++++++
>>>  monitor/qmp-cmds.c          |  6 +++++
>>>  qapi/migration.json         | 14 +++++++++++
>>>  7 files changed, 84 insertions(+)
>>>
>>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>>> index fa575f9..c58551a 100644
>>> --- a/include/migration/vmstate.h
>>> +++ b/include/migration/vmstate.h
>>> @@ -161,6 +161,7 @@ typedef enum {
>>>  typedef enum {
>>>      VMS_MIGRATE  = (1U << 1),
>>>      VMS_SNAPSHOT = (1U << 2),
>>> +    VMS_REBOOT   = (1U << 3),
>>>      VMS_MODE_ALL = ~0U
>>>  } VMStateMode;
>>>  
>>> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
>>> index 4b6a5c4..6fe86e6 100644
>>> --- a/include/sysemu/sysemu.h
>>> +++ b/include/sysemu/sysemu.h
>>> @@ -24,6 +24,8 @@ extern bool machine_init_done;
>>>  void qemu_add_machine_init_done_notifier(Notifier *notify);
>>>  void qemu_remove_machine_init_done_notifier(Notifier *notify);
>>>  
>>> +void save_cpr_snapshot(const char *file, const char *mode, Error **errp);
>>> +
>>>  extern int autostart;
>>>  
>>>  typedef enum {
>>> diff --git a/migration/block.c b/migration/block.c
>>> index 737b649..a69accb 100644
>>> --- a/migration/block.c
>>> +++ b/migration/block.c
>>> @@ -1023,6 +1023,7 @@ static SaveVMHandlers savevm_block_handlers = {
>>>      .load_state = block_load,
>>>      .save_cleanup = block_migration_cleanup,
>>>      .is_active = block_is_active,
>>> +    .mode_mask = VMS_MIGRATE | VMS_SNAPSHOT,
>>>  };
>>>  
>>>  void blk_mig_init(void)
>>> diff --git a/migration/ram.c b/migration/ram.c
>>> index 76d4fee..f0d5d9f 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -3795,6 +3795,7 @@ static SaveVMHandlers savevm_ram_handlers = {
>>>      .load_setup = ram_load_setup,
>>>      .load_cleanup = ram_load_cleanup,
>>>      .resume_prepare = ram_resume_prepare,
>>> +    .mode_mask = VMS_MIGRATE | VMS_SNAPSHOT,
>>>  };
>>>  
>>>  void ram_mig_init(void)
>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>> index ce02b6b..ff1a46e 100644
>>> --- a/migration/savevm.c
>>> +++ b/migration/savevm.c
>>> @@ -2680,6 +2680,65 @@ int qemu_load_device_state(QEMUFile *f)
>>>      return 0;
>>>  }
>>>  
>>> +static QEMUFile *qf_file_open(const char *filename, int flags, int mode,
>>> +                              Error **errp)
>>> +{
>>> +    QIOChannel *ioc;
>>> +    int fd = qemu_open(filename, flags, mode);
>>> +
>>> +    if (fd < 0) {
>>> +        error_setg_errno(errp, errno, "%s(%s)", __func__, filename);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    ioc = QIO_CHANNEL(qio_channel_file_new_fd(fd));
>>> +
>>> +    if (flags & O_WRONLY) {
>>> +        return qemu_fopen_channel_output(ioc);
>>> +    }
>>> +
>>> +    return qemu_fopen_channel_input(ioc);
>>> +}
>>> +
>>> +void save_cpr_snapshot(const char *file, const char *mode, Error **errp)
>>> +{
>>> +    int ret = 0;
>>> +    QEMUFile *f;
>>> +    VMStateMode op;
>>> +
>>> +    if (!strcmp(mode, "reboot")) {
>>> +        op = VMS_REBOOT;
>>> +    } else {
>>> +        error_setg(errp, "cprsave: bad mode %s", mode);
>>> +        return;
>>> +    }
>>> +
>>> +    f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, errp);
>>> +    if (!f) {
>>> +        return;
>>> +    }
>>> +
>>> +    ret = global_state_store();
>>> +    if (ret) {
>>> +        error_setg(errp, "Error saving global state");
>>> +        qemu_fclose(f);
>>> +        return;
>>> +    }
>>> +
>>> +    vm_stop(RUN_STATE_SAVE_VM);
>>> +
>>> +    ret = qemu_savevm_state(f, op, errp);
>>> +    if ((ret < 0) && !*errp) {
>>> +        error_setg(errp, "qemu_savevm_state failed");
>>> +    }
>>
>> just call qemu_save_device_state(f) there rather than introducing the
>> modes?
>> What you're doing is VERY similar to qmp_xen_save_devices_state and also
>> COLO's device state saving.
>>
>> (and also very similar to migration with the x-ignore-shared flag set).
> 
> Good idea, calling qemu_save_device_state instead of qemu_savevm_state will factor
> out the steps that are specific to migration.  I'll still need the mode, though,
> to exclude savevm_block_handlers, and maybe for other reasons.  I'll try it.

This works and is a keeper.  I do not need the mode to exclude savevm_block_handlers.
However, I still need mode and mode_mask so that my vfio vmstate handler is applied only
in VMS_RESTART mode.  I could instead iterate through the vfio devices and do something
special on save and load, but mode_mask is cleaner and could have more uses in the
future.  Do you agree?
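
To illustrate the mode_mask argument, here is a minimal standalone sketch of
mask-based handler filtering.  It is not code from the series: the Handler
struct, handlers_for_mode(), the handler names, and the bit chosen for
VMS_RESTART are assumptions made for illustration.  Only the values of
VMS_MIGRATE, VMS_SNAPSHOT, VMS_REBOOT and VMS_MODE_ALL come from the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Mode bits.  The first three values mirror the quoted patch. */
enum {
    VMS_MIGRATE  = (1U << 1),
    VMS_SNAPSHOT = (1U << 2),
    VMS_REBOOT   = (1U << 3),
    VMS_RESTART  = (1U << 4)   /* assumption: bit value not given in this thread */
};
#define VMS_MODE_ALL (~0U)     /* a macro here, to stay within strict ISO C enum rules */

/* Hypothetical stand-in for a vmstate handler registration. */
typedef struct {
    const char *name;
    unsigned int mode_mask;    /* modes in which this handler applies */
} Handler;

/*
 * Count the handlers that apply to `mode`.  If `out` is non-NULL, it must
 * have room for `n` pointers and receives the matching handlers in order.
 */
static size_t handlers_for_mode(const Handler *h, size_t n, unsigned int mode,
                                const Handler **out)
{
    size_t count = 0;

    assert(h != NULL);
    for (size_t i = 0; i < n; i++) {
        if (h[i].mode_mask & mode) {
            if (out) {
                out[count] = &h[i];
            }
            count++;
        }
    }
    return count;
}

/* Illustrative table: names and masks are examples, not QEMU's real set. */
static const Handler example_handlers[] = {
    { "ram",   VMS_MIGRATE | VMS_SNAPSHOT },
    { "block", VMS_MIGRATE | VMS_SNAPSHOT },
    { "vfio",  VMS_RESTART },
    { "timer", VMS_MODE_ALL },
};
```

With this table, a save in VMS_RESTART mode would visit only "vfio" and
"timer", and a save in VMS_REBOOT mode only "timer", without any per-device
special case in the save path.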

- Steve 

>>> +    qemu_fclose(f);
>>> +
>>> +    if (op == VMS_REBOOT) {
>>> +        no_shutdown = 0;
>>> +        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
>>> +    }
>>> +}
>>> +
> [...]
> 



* Re: [PATCH V1 30/32] vfio-pci: save and restore
  2020-08-20 10:33           ` Jason Zeng
@ 2020-10-07 21:25             ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-10-07 21:25 UTC (permalink / raw)
  To: Jason Zeng, Alex Williamson
  Cc: Daniel P. Berrange, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, Dr. David Alan Gilbert,
	Paolo Bonzini, Stefan Hajnoczi, Marc-André Lureau,
	Jason Zeng, Philippe Mathieu-Daudé,
	Alex Bennée



On 8/20/2020 6:33 AM, Jason Zeng wrote:
> On Wed, Aug 19, 2020 at 05:15:11PM -0400, Steven Sistare wrote:
>> On 8/9/2020 11:50 PM, Jason Zeng wrote:
>>> On Fri, Aug 07, 2020 at 04:38:12PM -0400, Steven Sistare wrote:
>>>> On 8/6/2020 6:22 AM, Jason Zeng wrote:
>>>>> Hi Steve,
>>>>>
>>>>> On Thu, Jul 30, 2020 at 08:14:34AM -0700, Steve Sistare wrote:
>>>>>> @@ -3182,6 +3207,51 @@ static Property vfio_pci_dev_properties[] = {
>>>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>>>  };
>>>>>>  
>>>>>> +static int vfio_pci_post_load(void *opaque, int version_id)
>>>>>> +{
>>>>>> +    int vector;
>>>>>> +    MSIMessage msg;
>>>>>> +    Error *err = 0;
>>>>>> +    VFIOPCIDevice *vdev = opaque;
>>>>>> +    PCIDevice *pdev = &vdev->pdev;
>>>>>> +
>>>>>> +    if (msix_enabled(pdev)) {
>>>>>> +        vfio_msix_enable(vdev);
>>>>>> +        pdev->msix_function_masked = false;
>>>>>> +
>>>>>> +        for (vector = 0; vector < vdev->pdev.msix_entries_nr; vector++) {
>>>>>> +            if (!msix_is_masked(pdev, vector)) {
>>>>>> +                msg = msix_get_message(pdev, vector);
>>>>>> +                vfio_msix_vector_use(pdev, vector, msg);
>>>>>> +            }
>>>>>> +        }
>>>>>
>>>>> It looks to me MSIX re-init here may lose device IRQs and impact
>>>>> device hardware state?
>>>>>
>>>>> The re-init will cause the kernel vfio driver to connect the device
>>>>> MSIX vectors to new eventfds and KVM instance. But before that, device
>>>>> IRQs will be routed to previous eventfd. Looks these IRQs will be lost.
>>>>
>>>> Thanks Jason, that sounds like a problem.  I could try reading and saving an 
>>>> event from eventfd before shutdown, and injecting it into the eventfd after
>>>> restart, but that would be racy unless I disable interrupts.  Or, unconditionally
>>>> inject a spurious interrupt after restart to kick it, in case an interrupt 
>>>> was lost.
>>>>
>>>> Do you have any other ideas?
>>>
>>> Maybe we can consider to also hand over the eventfd file descriptor, or
>>
>> I believe preserving this descriptor in isolation is not sufficient.  We would
>> also need to preserve the KVM instance which it is linked to.
>>
>>> or even the KVM fds to the new Qemu?
>>>
>>> If the KVM fds can be preserved, we will just need to restore Qemu KVM
>>> side states. But not sure how complicated the implementation would be.
>>
>> That should work, but I fear it would require many code changes in QEMU
>> to re-use descriptors at object creation time and suppress the initial 
>> configuration ioctl's, so it's not my first choice for a solution.
>>
>>> If we only preserve the eventfd fd, we can attach the old eventfd to
>>> vfio devices. But looks it may turn out we always inject an interrupt
>>> unconditionally, because kernel KVM irqfd eventfd handling is a bit
>>> different than normal user land eventfd read/write. It doesn't decrease
>>> the counter in the eventfd context. So if we read the eventfd from new
>>> Qemu, it looks will always have a non-zero counter, which requires an
>>> interrupt injection.
>>
>> Good to know, thanks.
>>
>> I will try creating a new eventfd and injecting an interrupt unconditionally.
>> I need a test case to demonstrate losing an interrupt, and fixing it with
>> injection.  Any advice?  My stress tests with a virtual function nic and a
>> directly assigned nvme block device have never failed across live update.
>>
> 
> I am not familiar with nvme devices. For nic device, to my understanding,
> stress nic testing will not have many IRQs, because nic driver usually
> enables NAPI, which only take the first interrupt, then disable interrupt
> and start polling. It will only re-enable interrupt after some packet
> quota reached or the traffic quiesces for a while. But anyway, if the
> test goes enough long time, the number of IRQs should also be big, not
> sure why it doesn't trigger any issue. Maybe we can have some study on
> the IRQ pattern for the testing and see how we can design a test case?
> or see if our assumption is wrong?
> 
> 
>>>>> And the re-init will make the device go through the procedure of
>>>>> disabling MSIX, enabling INTX, and re-enabling MSIX and vectors.
>>>>> So if the device is active, its hardware state will be impacted?
>>>>
>>>> Again thanks.  vfio_msix_enable() does indeed call vfio_disable_interrupts().
>>>> For a quick experiment, I deleted that call for the post_load code path, and 
>>>> it seems to work fine, but I need to study it more.
>>>
>>> vfio_msix_vector_use() will also trigger this procedure in the kernel.
>>
>> Because that code path calls VFIO_DEVICE_SET_IRQS? Or something else?
>> Can you point to what it triggers in the kernel?
> 
> 
> In vfio_msix_vector_use(), I see vfio_disable_irqindex() will be invoked
> if "vdev->nr_vectors < nr + 1" is true. Since the 'vdev' is re-inited,
> so this condition should be true, and vfio_disable_irqindex() will
> trigger VFIO_DEVICE_SET_IRQS with VFIO_IRQ_SET_DATA_NONE, which will
> cause kernel to disable MSIX.
> 
>>
>>> Looks we shouldn't trigger any kernel vfio actions here? Because we
>>> preserve vfio fds, so its kernel state shouldn't be touched. Here we
>>> may only need to restore Qemu states. Re-connect to KVM instance should
>>> be done automatically when we setup the KVM irqfds with the same eventfd.
>>>
>>> BTW, if I remember correctly, it is not enough to only save MSIX state
>>> in the snapshot. We should also save the Qemu side pci config space
>>> cache to the snapshot, because Qemu's copy is not exactly the same as
>>> the kernel's copy. I encountered this before, but I don't remember which
>>> field it was.
>>
>> FYI all, Jason told me offline that qemu may emulate some pci capabilities and
>> hence keeps state in the shadow config that is never written to the kernel.
>> I need to study that.
>>
> 
> Sorry, I read the code again, see Qemu does write all config-space-write
> to kernel in vfio_pci_write_config(). Now I am also confused about what
> I was seeing previously :(. But it seems we still need to look at kernel
> code to see if mismatch is possible for config space cache between Qemu
> and kernel.
> 
> FYI. Some discussion about the VFIO PCI config space saving/restoring in
> live migration scenario:
> https://lists.gnu.org/archive/html/qemu-devel/2020-06/msg06964.html
> 

I have coded a solution for much of the "lost interrupts" issue.
cprsave preserves the vfio err, req, and msi irq eventfds across exec:
  vdev->err_notifier
  vdev->req_notifier
  vdev->msi_vectors[i].interrupt
  vdev->msi_vectors[i].kvm_interrupt

The KVM instance is destroyed and recreated as before.
The eventfd descriptors are found and reused during vfio_realize using
event_notifier_init_fd.  No calls to VFIO_DEVICE_SET_IRQS are made before or
after the exec.  The descriptors are attached to the new KVM instance via the
usual ioctls on the existing code paths.

It works.  I issue cprsave, send an interrupt, wait a few seconds, then issue cprload.
The interrupt fires immediately after cprload.  I tested interrupt delivery to the 
kvm_irqchip and to qemu.
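
The two kernel properties this relies on can be sketched in plain C.  This is
an illustrative sketch for Linux, not the V2 code: keep_fd_across_exec(),
signal_event() and drain_events() are hypothetical helper names, and how QEMU
locates the preserved descriptors after exec is not shown.  The sketch shows
that an eventfd stays open across exec once FD_CLOEXEC is cleared, and that an
event signalled while nobody is listening stays pending in the counter until
it is read.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Clear FD_CLOEXEC so the descriptor remains open in the exec'd image. */
static int keep_fd_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}

/* Add one to the eventfd counter, as an interrupt source would. */
static int signal_event(int fd)
{
    uint64_t one = 1;

    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Read and reset the counter.  Returns the number of pending events, or 0
 * if none are pending (the fd is assumed to be non-blocking).
 */
static uint64_t drain_events(int fd)
{
    uint64_t count = 0;

    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return 0;
    }
    return count;
}
```

An event written before the exec is therefore still readable afterwards, which
is consistent with the interrupt firing immediately after cprload.  Note that
this userspace read path resets the counter, whereas, according to the earlier
discussion in this thread, KVM irqfd handling does not decrease it.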

It does not support Posted Interrupts, as that involves state attached to the
VMCS, which is destroyed with the KVM instance.  That needs more study and
a likely kernel enhancement.

I will post the full code as part of the V2 patch series.

- Steve




* Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-08-24 22:30               ` Alex Williamson
@ 2020-10-08 16:32                 ` Steven Sistare
  2020-10-15 20:36                   ` Alex Williamson
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-10-08 16:32 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Anthony Yznaga, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 8/24/2020 6:30 PM, Alex Williamson wrote:
> On Wed, 19 Aug 2020 17:52:26 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
>> On 8/17/2020 10:42 PM, Alex Williamson wrote:
>>> On Mon, 17 Aug 2020 15:44:03 -0600
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>> On Mon, 17 Aug 2020 17:20:57 -0400
>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:    
>>>>>> On Mon, 17 Aug 2020 14:30:51 -0400
>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>>       
>>>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:      
>>>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>>>>>>> in the Linux kernel. For the madvise patches, see:
>>>>>>>>
>>>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>>>
>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>> ---
>>>>>>>>  include/qemu/osdep.h | 7 +++++++
>>>>>>>>  1 file changed, 7 insertions(+)        
>>>>>>>
>>>>>>> Hi Alex,
>>>>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
>>>>>>> live update series, is getting a chilly reception on lkml.  We could instead 
>>>>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
>>>>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
>>>>>>> which  is a problem for memory that was registered with vfio, as the original VA 
>>>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
>>>>>>> such as vfio_iommu_replay.
>>>>>>>
>>>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>>>>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
>>>>>>> vaddr with new_vaddr.  Flags cannot be changed.
>>>>>>>
>>>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>>>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>>>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>>>>>>
>>>>>>> What do you think      
>>>>>>
>>>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
>>>>>> ioctl, so I think we should just use one of the flag bits on the
>>>>>> existing MAP_DMA ioctl for this variant.      
>>>>>
>>>>> Sounds good.
>>>>>     
>>>>>> Reading through the discussion on the kernel side there seems to be
>>>>>> some confusion around why vfio needs the vaddr beyond the user call to
>>>>>> MAP_DMA though.  Originally this was used to test for virtually
>>>>>> contiguous mappings for merging and splitting purposes.  This is
>>>>>> defunct in the v2 interface, however the vaddr is now used largely for
>>>>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
>>>>>> does not share a container with an IOMMU device, then a user MAP_DMA
>>>>>> ioctl essentially just registers the translation within the vfio
>>>>>> container.  The mdev vendor driver can then later either request pages
>>>>>> to be pinned for device DMA or can perform copy_to/from_user() to
>>>>>> simulate DMA via the CPU.
>>>>>>
>>>>>> Therefore I don't see that there's a simple re-architecture of the vfio
>>>>>> IOMMU backend that could drop vaddr use.        
>>>>>
>>>>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
>>>>> same conclusion.
>>>>>     
>>>>>> I'm a bit concerned this new
>>>>>> remap proposal also raises the question of how do we prevent userspace
>>>>>> remapping vaddrs racing with asynchronous kernel use of the previous
>>>>>> vaddrs.        
>>>>>
>>>>> Agreed.  After a quick glance at the code, holding iommu->lock during 
>>>>> remap might be sufficient, but it needs more study.    
>>>>
>>>> Unless you're suggesting an extended hold of the lock across the entire
>>>> re-exec of QEMU, that's only going to prevent a race between a remap
>>>> and a vendor driver pin or access, the time between the previous vaddr
>>>> becoming invalid and the remap is unprotected.  
>>
>> OK.  What if we exclude mediated devices?  It appears they are the only
>> ones where the kernel may async'ly use the vaddr, via call chains ending in 
>> vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.
>>
>> The other functions that use dma->vaddr are
>>     vfio_dma_do_map 
>>     vfio_pin_map_dma 
>>     vfio_iommu_replay 
>>     vfio_pin_pages_remote
>> and they are all initiated via userland ioctl (if I have traced all the code 
>> paths correctly).  Thus iommu->lock would protect them.
>>
>> We would block live update in qemu if the config includes a mediated device.
>>
>> VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.
> 
> That's not a solution I'd really be in favor of.  We're eliminating an
> entire class of devices because they _might_ make use of these
> interfaces, but anyone can add a vfio bus driver, even exposing the
> same device API, and maybe make use of some of these interfaces in that
> driver.  Maybe we'd even have reason to do it in vfio-pci if we had
> reason to virtualize some aspect of a device.  I think we're setting
> ourselves up for a very complicated support scenario if we just
> arbitrarily decide to deny drivers using certain interfaces.
> 
>>>>>> Are we expecting guest drivers/agents to quiesce the device,
>>>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?      
>>>>>
>>>>> No.  We want to support any guest, and the guest is not aware that qemu
>>>>> live update is occurring.
>>>>>     
>>>>>> The vfio migration interface we've developed does have a mechanism to
>>>>>> stop a device, would we need to use this here?  If we do have a
>>>>>> mechanism to quiesce the device, is the only reason we're not unmapping
>>>>>> everything and remapping it into the new address space the latency in
>>>>>> performing that operation?  Thanks,      
>>>>>
>>>>> Same answer - we don't require that the guest has vfio migration support.    
>>>>
>>>> QEMU toggling the runstate of the device via the vfio migration
>>>> interface could be done transparently to the guest, but if your
>>>> intention is to support any device (where none currently support the
>>>> migration interface) perhaps it's a moot point.    
>>
>> That sounds useful when devices support.  Can you give me some function names
>> or references so I can study this qemu-based "vfio migration interface".
> 
> The uAPI is documented in commit a8a24f3f6e38.  We're still waiting on
> the QEMU support or implementation in an mdev vendor driver.
> Essentially migration exposes a new region of the device which would be
> implemented by the vendor driver.  A register within that region
> manipulates the device state, so a device could be stopped by clearing
> the 'run' bit in that register.
> 
>>>> It seems like this
>>>> scheme only works with IOMMU backed devices where the device can
>>>> continue to operate against pinned pages, anything that might need to
>>>> dynamically pin pages against the process vaddr as it's running async
>>>> to the QEMU re-exec needs to be blocked or stopped.  Thanks,  
>>
>> Yes, true of this remap proposal.
>>
>> I wanted to unconditionally support all devices, which is why I think that
>> MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
>> lkml discussion.

Hi Alex, here is a modified proposal to remap vaddr in the face of async requests
from mediated device drivers.

Define a new flag VFIO_DMA_MAP_FLAG_REMAP for use with VFIO_IOMMU_UNMAP_DMA and
VFIO_IOMMU_MAP_DMA.

VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
  Discard vaddr on the existing dma region defined by (iova, size), but keep the
  struct vfio_dma.  Subsequent translation requests are blocked.
  The implementation sets a flag in struct vfio_dma.  vfio_pin_pages() and
  vfio_dma_rw() acquire iommu->lock, check the flag, and retry.
  Called before exec.

VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
  Remap the region (iova, size) to vaddr, and resume translation requests.
  Called after exec.
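
The intended state machine can be sketched as a userspace toy model.  This is
not kernel code and nothing here exists in vfio: struct toy_dma, the
vaddr_valid flag and the toy_* functions are hypothetical names.  Locking is
omitted, and instead of retrying under iommu->lock the translation simply
returns -EAGAIN so that the caller can retry.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the kernel's struct vfio_dma; fields are illustrative. */
struct toy_dma {
    uint64_t iova;
    uint64_t size;
    uint64_t vaddr;
    bool vaddr_valid;   /* hypothetical flag toggled by the REMAP variants */
};

/* UNMAP_DMA with the REMAP flag: keep the region, discard only its vaddr. */
static int toy_unmap_remap(struct toy_dma *dma, uint64_t iova, uint64_t size)
{
    if (dma->iova != iova || dma->size != size) {
        return -EINVAL;             /* region must match exactly */
    }
    dma->vaddr_valid = false;       /* translations are blocked from now on */
    return 0;
}

/* MAP_DMA with the REMAP flag: install the new vaddr and resume translations. */
static int toy_map_remap(struct toy_dma *dma, uint64_t iova, uint64_t size,
                         uint64_t new_vaddr)
{
    if (dma->iova != iova || dma->size != size) {
        return -EINVAL;
    }
    dma->vaddr = new_vaddr;
    dma->vaddr_valid = true;
    return 0;
}

/*
 * Translate an iova to a process virtual address.  Returns 0 on success,
 * -EFAULT if the iova is outside the region, and -EAGAIN while the vaddr
 * has been discarded and not yet replaced.
 */
static int toy_translate(const struct toy_dma *dma, uint64_t iova,
                         uint64_t *vaddr)
{
    assert(vaddr != NULL);
    if (iova < dma->iova || iova - dma->iova >= dma->size) {
        return -EFAULT;
    }
    if (!dma->vaddr_valid) {
        return -EAGAIN;
    }
    *vaddr = dma->vaddr + (iova - dma->iova);
    return 0;
}
```

Between the two calls every translation attempt fails with -EAGAIN, which is
the window during which a real implementation would have to make the mdev
driver wait.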

Unfortunately, remap as defined above has an undesirable side effect.  The mdev
driver may use kernel worker threads that serve requests from multiple clients
(e.g. i915/gvt workload_thread).  A process that fails to call MAP_DMA with REMAP,
or is tardy doing so, will delay other processes that are stuck waiting in
vfio_pin_pages or vfio_dma_rw.  This is unacceptable; I mention this scheme only in
case I am misinterpreting the code (maybe they do not share a single struct vfio_iommu
instance?), or in case you see a way to salvage it.

Here is a more robust implementation.  It only works for dma regions backed by
a file, such as memfd or shm.

VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
  Find the file and offset for iova, and save the struct file pointer in
  struct vfio_dma.  In vfio_pin_pages and vfio_dma_rw and their descendants,
  if file* is set, then call pagecache_get_page() to get the pfn, instead of
  get_user_pages.

VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
  Remap the region (iova, size) to vaddr and drop the file reference.
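
Why a (file, offset) pair is a stable key where a vaddr is not can be shown
from userspace.  This is only an analogue of the idea, not the proposed kernel
change: same_page_via_two_mappings() is a hypothetical helper, and it assumes
Linux with a glibc that provides memfd_create (2.27 or later).  It maps one
memfd twice, as the old and new QEMU images would, and checks that a byte
written at an offset through one mapping is visible at the same offset through
the other, although the two virtual addresses differ.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for memfd_create() */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if two mappings of the same memfd have different addresses yet
 * expose the same page contents at `offset`, 0 if not, -1 on setup failure.
 */
static int same_page_via_two_mappings(size_t len, size_t offset)
{
    int result = -1;
    int fd;

    assert(offset < len);

    fd = memfd_create("guest-ram-sketch", 0);
    if (fd < 0) {
        return -1;
    }

    if (ftruncate(fd, (off_t)len) == 0) {
        /* Two independent mappings, standing in for "before" and "after" exec. */
        unsigned char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
        unsigned char *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);

        if (a != MAP_FAILED && b != MAP_FAILED) {
            a[offset] = 0x5a;
            result = (a != b && b[offset] == 0x5a);
        }
        if (a != MAP_FAILED) {
            munmap(a, len);
        }
        if (b != MAP_FAILED) {
            munmap(b, len);
        }
    }

    close(fd);
    return result;
}
```

The page is reachable through the file at the same offset no matter where, or
whether, the process currently has it mapped, which is the property the
file-based lookup above depends on.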

This raises the question of whether we can always use pagecache_get_page and
eliminate the dependency on vaddr.  The translation performance could be
different, though.

I have not implemented this yet.  Any thoughts before I do?

- Steve



* Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-10-08 16:32                 ` Steven Sistare
@ 2020-10-15 20:36                   ` Alex Williamson
  2020-10-19 16:33                     ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Alex Williamson @ 2020-10-15 20:36 UTC (permalink / raw)
  To: Steven Sistare
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Anthony Yznaga, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On Thu, 8 Oct 2020 12:32:35 -0400
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 8/24/2020 6:30 PM, Alex Williamson wrote:
> > On Wed, 19 Aug 2020 17:52:26 -0400
> > Steven Sistare <steven.sistare@oracle.com> wrote:  
> >> On 8/17/2020 10:42 PM, Alex Williamson wrote:  
> >>> On Mon, 17 Aug 2020 15:44:03 -0600
> >>> Alex Williamson <alex.williamson@redhat.com> wrote:  
> >>>> On Mon, 17 Aug 2020 17:20:57 -0400
> >>>> Steven Sistare <steven.sistare@oracle.com> wrote:  
> >>>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:      
> >>>>>> On Mon, 17 Aug 2020 14:30:51 -0400
> >>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
> >>>>>>         
> >>>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:        
> >>>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
> >>>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> >>>>>>>> in the Linux kernel. For the madvise patches, see:
> >>>>>>>>
> >>>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
> >>>>>>>>
> >>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>>>>>> ---
> >>>>>>>>  include/qemu/osdep.h | 7 +++++++
> >>>>>>>>  1 file changed, 7 insertions(+)          
> >>>>>>>
> >>>>>>> Hi Alex,
> >>>>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
> >>>>>>> live update series, is getting a chilly reception on lkml.  We could instead 
> >>>>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
> >>>>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
> >>>>>>> which  is a problem for memory that was registered with vfio, as the original VA 
> >>>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
> >>>>>>> such as vfio_iommu_replay.
> >>>>>>>
> >>>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> >>>>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
> >>>>>>> vaddr with new_vaddr.  Flags cannot be changed.
> >>>>>>>
> >>>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> >>>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
> >>>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> >>>>>>>
> >>>>>>> What do you think        
> >>>>>>
> >>>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
> >>>>>> ioctl, so I think we should just use one of the flag bits on the
> >>>>>> existing MAP_DMA ioctl for this variant.        
> >>>>>
> >>>>> Sounds good.
> >>>>>       
> >>>>>> Reading through the discussion on the kernel side there seems to be
> >>>>>> some confusion around why vfio needs the vaddr beyond the user call to
> >>>>>> MAP_DMA though.  Originally this was used to test for virtually
> >>>>>> contiguous mappings for merging and splitting purposes.  This is
> >>>>>> defunct in the v2 interface, however the vaddr is now used largely for
> >>>>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
> >>>>>> does not share a container with an IOMMU device, then a user MAP_DMA
> >>>>>> ioctl essentially just registers the translation within the vfio
> >>>>>> container.  The mdev vendor driver can then later either request pages
> >>>>>> to be pinned for device DMA or can perform copy_to/from_user() to
> >>>>>> simulate DMA via the CPU.
> >>>>>>
> >>>>>> Therefore I don't see that there's a simple re-architecture of the vfio
> >>>>>> IOMMU backend that could drop vaddr use.          
> >>>>>
> >>>>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
> >>>>> same conclusion.
> >>>>>       
> >>>>>> I'm a bit concerned this new
> >>>>>> remap proposal also raises the question of how do we prevent userspace
> >>>>>> remapping vaddrs racing with asynchronous kernel use of the previous
> >>>>>> vaddrs.          
> >>>>>
> >>>>> Agreed.  After a quick glance at the code, holding iommu->lock during 
> >>>>> remap might be sufficient, but it needs more study.      
> >>>>
> >>>> Unless you're suggesting an extended hold of the lock across the entire
> >>>> re-exec of QEMU, that's only going to prevent a race between a remap
> >>>> and a vendor driver pin or access, the time between the previous vaddr
> >>>> becoming invalid and the remap is unprotected.    
> >>
> >> OK.  What if we exclude mediated devices?  It appears they are the only
> >> ones where the kernel may async'ly use the vaddr, via call chains ending in 
> >> vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.
> >>
> >> The other functions that use dma->vaddr are
> >>     vfio_dma_do_map 
> >>     vfio_pin_map_dma 
> >>     vfio_iommu_replay 
> >>     vfio_pin_pages_remote
> >> and they are all initiated via userland ioctl (if I have traced all the code 
> >> paths correctly).  Thus iommu->lock would protect them.
> >>
> >> We would block live update in qemu if the config includes a mediated device.
> >>
> >> VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.  
> > 
> > That's not a solution I'd really be in favor of.  We're eliminating an
> > entire class of devices because they _might_ make use of these
> > interfaces, but anyone can add a vfio bus driver, even exposing the
> > same device API, and maybe make use of some of these interfaces in that
> > driver.  Maybe we'd even have reason to do it in vfio-pci if we had
> > reason to virtualize some aspect of a device.  I think we're setting
> > ourselves up for a very complicated support scenario if we just
> > arbitrarily decide to deny drivers using certain interfaces.
> >   
> >>>>>> Are we expecting guest drivers/agents to quiesce the device,
> >>>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?        
> >>>>>
> >>>>> No.  We want to support any guest, and the guest is not aware that qemu
> >>>>> live update is occurring.
> >>>>>       
> >>>>>> The vfio migration interface we've developed does have a mechanism to
> >>>>>> stop a device, would we need to use this here?  If we do have a
> >>>>>> mechanism to quiesce the device, is the only reason we're not unmapping
> >>>>>> everything and remapping it into the new address space the latency in
> >>>>>> performing that operation?  Thanks,        
> >>>>>
> >>>>> Same answer - we don't require that the guest has vfio migration support.      
> >>>>
> >>>> QEMU toggling the runstate of the device via the vfio migration
> >>>> interface could be done transparently to the guest, but if your
> >>>> intention is to support any device (where none currently support the
> >>>> migration interface) perhaps it's a moot point.      
> >>
> >> That sounds useful when devices support it.  Can you give me some function names
> >> or references so I can study this qemu-based "vfio migration interface"?
> > 
> > The uAPI is documented in commit a8a24f3f6e38.  We're still waiting on
> > the QEMU support or implementation in an mdev vendor driver.
> > Essentially migration exposes a new region of the device which would be
> > implemented by the vendor driver.  A register within that region
> > manipulates the device state, so a device could be stopped by clearing
> > the 'run' bit in that register.
> >   
> >>>> It seems like this
> >>>> scheme only works with IOMMU backed devices where the device can
> >>>> continue to operate against pinned pages, anything that might need to
> >>>> dynamically pin pages against the process vaddr as it's running async
> >>>> to the QEMU re-exec needs to be blocked or stopped.  Thanks,    
> >>
> >> Yes, true of this remap proposal.
> >>
> >> I wanted to unconditionally support all devices, which is why I think that
> >> MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
> >> lkml discussion.
> 
> Hi Alex, here is a modified proposal to remap vaddr in the face of async requests
> from mediated device drivers.
> 
> Define a new flag VFIO_DMA_MAP_FLAG_REMAP for use with VFIO_IOMMU_UNMAP_DMA and
> VFIO_IOMMU_MAP_DMA.
> 
> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>   Discard vaddr on the existing dma region defined by (iova, size), but keep the
>   struct vfio_dma.  Subsequent translation requests are blocked.
>   The implementation sets a flag in struct vfio_dma.  vfio_pin_pages() and
>   vfio_dma_rw() acquire iommu->lock, check the flag, and retry.
>   Called before exec.
> 
> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>   Remap the region (iova, size) to vaddr, and resume translation requests.
>   Called after exec.
> 
> Unfortunately, remap as defined above has an undesirable side effect.  The mdev
> driver may use kernel worker threads which serve requests from multiple clients
> (eg i915/gvt workload_thread).  A process that fails to call MAP_DMA with REMAP,
> or is tardy doing so, will delay other processes who are stuck waiting in
> vfio_pin_pages or vfio_dma_rw.  This is unacceptable, and I mention this scheme in
> case I am misinterpreting the code (maybe they do not share a single struct vfio_iommu
> instance?), or in case you see a way to salvage it.

Right, that's my first thought when I hear that the pin and dma_rw paths
are blocked as well: we cannot rely on userspace to unblock anything.
A malicious user may hold out just to see how long it takes until the
host becomes unusable.  Userspace determines how many groups share a
vfio_iommu.

> Here is a more robust implementation.  It only works for dma regions backed by
> a file, such as memfd or shm.
> 
> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>   Find the file and offset for iova, and save the struct file pointer in
>   struct vfio_dma.  In vfio_pin_pages and vfio_dma_rw and their descendants,
>   if file* is set, then call pagecache_get_page() to get the pfn, instead of
>   get_user_pages.
> 
> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>   Remap the region (iova, size) to vaddr and drop the file reference.
> 
> This raises the question of whether we can always use pagecache_get_page, and
> eliminate the dependency on vaddr.  The translation performance could be
> different, though.
> 
> I have not implemented this yet.  Any thoughts before I do?

That's a pretty hefty usage restriction, but what I take from it is
that these mechanisms provide a fallback lookup path that can service
callers during the gap while the range is being remapped.  The callers
always provide an IOVA and wish to do something to the memory
referenced by that IOVA; we just need a translation mechanism.  The
IOMMU itself is also such an alternative lookup, via
iommu_iova_to_phys(), but of course requiring an IOMMU-backed device is
just another usage restriction, potentially one that's not even
apparent to the user.

Is a more general solution to make sure there's always an IOVA-to-phys
lookup mechanism available, implementing one if not provided by the
IOMMU or memory backing interface?  We'd need to adapt the dma_rw
interface to work on either a VA or PA.  That, plus pinning pages on
UNMAP+REMAP, stashing them in a translation structure, and dynamically
adapting to changes (e.g. the IOMMU-backed device being removed, leaving
a non-IOMMU-backed device in the vfio_iommu), all sounds pretty
complicated, especially as the vfio-iommu-type1 backend is stretched
further and becomes more fragile.  Possibly it's still feasible
though.  Thanks,

Alex




* Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-10-15 20:36                   ` Alex Williamson
@ 2020-10-19 16:33                     ` Steven Sistare
  2020-10-26 18:28                       ` Steven Sistare
  0 siblings, 1 reply; 118+ messages in thread
From: Steven Sistare @ 2020-10-19 16:33 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Anthony Yznaga, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 10/15/2020 4:36 PM, Alex Williamson wrote:
> On Thu, 8 Oct 2020 12:32:35 -0400
> Steven Sistare <steven.sistare@oracle.com> wrote:
>> On 8/24/2020 6:30 PM, Alex Williamson wrote:
>>> On Wed, 19 Aug 2020 17:52:26 -0400
>>> Steven Sistare <steven.sistare@oracle.com> wrote:  
>>>> On 8/17/2020 10:42 PM, Alex Williamson wrote:  
>>>>> On Mon, 17 Aug 2020 15:44:03 -0600
>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:  
>>>>>> On Mon, 17 Aug 2020 17:20:57 -0400
>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:  
>>>>>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:      
>>>>>>>> On Mon, 17 Aug 2020 14:30:51 -0400
>>>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>>>>         
>>>>>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:        
>>>>>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>>>>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>>>>>>>>> in the Linux kernel. For the madvise patches, see:
>>>>>>>>>>
>>>>>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>> ---
>>>>>>>>>>  include/qemu/osdep.h | 7 +++++++
>>>>>>>>>>  1 file changed, 7 insertions(+)          
>>>>>>>>>
>>>>>>>>> Hi Alex,
>>>>>>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
>>>>>>>>> live update series, is getting a chilly reception on lkml.  We could instead 
>>>>>>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
>>>>>>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
>>>>>>>>> which  is a problem for memory that was registered with vfio, as the original VA 
>>>>>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
>>>>>>>>> such as vfio_iommu_replay.
>>>>>>>>>
>>>>>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>>>>>>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
>>>>>>>>> vaddr with new_vaddr.  Flags cannot be changed.
>>>>>>>>>
>>>>>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>>>>>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>>>>>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>>>>>>>>
>>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
>>>>>>>> ioctl, so I think we should just use one of the flag bits on the
>>>>>>>> existing MAP_DMA ioctl for this variant.        
>>>>>>>
>>>>>>> Sounds good.
>>>>>>>       
>>>>>>>> Reading through the discussion on the kernel side there seems to be
>>>>>>>> some confusion around why vfio needs the vaddr beyond the user call to
>>>>>>>> MAP_DMA though.  Originally this was used to test for virtually
>>>>>>>> contiguous mappings for merging and splitting purposes.  This is
>>>>>>>> defunct in the v2 interface, however the vaddr is now used largely for
>>>>>>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
>>>>>>>> does not share a container with an IOMMU device, then a user MAP_DMA
>>>>>>>> ioctl essentially just registers the translation within the vfio
>>>>>>>> container.  The mdev vendor driver can then later either request pages
>>>>>>>> to be pinned for device DMA or can perform copy_to/from_user() to
>>>>>>>> simulate DMA via the CPU.
>>>>>>>>
>>>>>>>> Therefore I don't see that there's a simple re-architecture of the vfio
>>>>>>>> IOMMU backend that could drop vaddr use.          
>>>>>>>
>>>>>>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
>>>>>>> same conclusion.
>>>>>>>       
>>>>>>>> I'm a bit concerned this new
>>>>>>>> remap proposal also raises the question of how do we prevent userspace
>>>>>>>> remapping vaddrs racing with asynchronous kernel use of the previous
>>>>>>>> vaddrs.          
>>>>>>>
>>>>>>> Agreed.  After a quick glance at the code, holding iommu->lock during 
>>>>>>> remap might be sufficient, but it needs more study.      
>>>>>>
>>>>>> Unless you're suggesting an extended hold of the lock across the entire
>>>>>> re-exec of QEMU, that's only going to prevent a race between a remap
>>>>>> and a vendor driver pin or access, the time between the previous vaddr
>>>>>> becoming invalid and the remap is unprotected.    
>>>>
>>>> OK.  What if we exclude mediated devices?  It appears they are the only
>>>> ones where the kernel may async'ly use the vaddr, via call chains ending in 
>>>> vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.
>>>>
>>>> The other functions that use dma->vaddr are
>>>>     vfio_dma_do_map 
>>>>     vfio_pin_map_dma 
>>>>     vfio_iommu_replay 
>>>>     vfio_pin_pages_remote
>>>> and they are all initiated via userland ioctl (if I have traced all the code 
>>>> paths correctly).  Thus iommu->lock would protect them.
>>>>
>>>> We would block live update in qemu if the config includes a mediated device.
>>>>
>>>> VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.  
>>>
>>> That's not a solution I'd really be in favor of.  We're eliminating an
>>> entire class of devices because they _might_ make use of these
>>> interfaces, but anyone can add a vfio bus driver, even exposing the
>>> same device API, and maybe make use of some of these interfaces in that
>>> driver.  Maybe we'd even have reason to do it in vfio-pci if we had
>>> reason to virtualize some aspect of a device.  I think we're setting
>>> ourselves up for a very complicated support scenario if we just
>>> arbitrarily decide to deny drivers using certain interfaces.
>>>   
>>>>>>>> Are we expecting guest drivers/agents to quiesce the device,
>>>>>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?        
>>>>>>>
>>>>>>> No.  We want to support any guest, and the guest is not aware that qemu
>>>>>>> live update is occurring.
>>>>>>>       
>>>>>>>> The vfio migration interface we've developed does have a mechanism to
>>>>>>>> stop a device, would we need to use this here?  If we do have a
>>>>>>>> mechanism to quiesce the device, is the only reason we're not unmapping
>>>>>>>> everything and remapping it into the new address space the latency in
>>>>>>>> performing that operation?  Thanks,        
>>>>>>>
>>>>>>> Same answer - we don't require that the guest has vfio migration support.      
>>>>>>
>>>>>> QEMU toggling the runstate of the device via the vfio migration
>>>>>> interface could be done transparently to the guest, but if your
>>>>>> intention is to support any device (where none currently support the
>>>>>> migration interface) perhaps it's a moot point.      
>>>>
>>>> That sounds useful when devices support it.  Can you give me some function names
>>>> or references so I can study this qemu-based "vfio migration interface"?
>>>
>>> The uAPI is documented in commit a8a24f3f6e38.  We're still waiting on
>>> the QEMU support or implementation in an mdev vendor driver.
>>> Essentially migration exposes a new region of the device which would be
>>> implemented by the vendor driver.  A register within that region
>>> manipulates the device state, so a device could be stopped by clearing
>>> the 'run' bit in that register.
>>>   
>>>>>> It seems like this
>>>>>> scheme only works with IOMMU backed devices where the device can
>>>>>> continue to operate against pinned pages, anything that might need to
>>>>>> dynamically pin pages against the process vaddr as it's running async
>>>>>> to the QEMU re-exec needs to be blocked or stopped.  Thanks,    
>>>>
>>>> Yes, true of this remap proposal.
>>>>
>>>> I wanted to unconditionally support all devices, which is why I think that
>>>> MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
>>>> lkml discussion.
>>
>> Hi Alex, here is a modified proposal to remap vaddr in the face of async requests
>> from mediated device drivers.
>>
>> Define a new flag VFIO_DMA_MAP_FLAG_REMAP for use with VFIO_IOMMU_UNMAP_DMA and
>> VFIO_IOMMU_MAP_DMA.
>>
>> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>   Discard vaddr on the existing dma region defined by (iova, size), but keep the
>>   struct vfio_dma.  Subsequent translation requests are blocked.
>>   The implementation sets a flag in struct vfio_dma.  vfio_pin_pages() and
>>   vfio_dma_rw() acquire iommu->lock, check the flag, and retry.
>>   Called before exec.
>>
>> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>   Remap the region (iova, size) to vaddr, and resume translation requests.
>>   Called after exec.
>>
>> Unfortunately, remap as defined above has an undesirable side effect.  The mdev
>> driver may use kernel worker threads which serve requests from multiple clients
>> (eg i915/gvt workload_thread).  A process that fails to call MAP_DMA with REMAP,
>> or is tardy doing so, will delay other processes who are stuck waiting in
>> vfio_pin_pages or vfio_dma_rw.  This is unacceptable, and I mention this scheme in
>> case I am misinterpreting the code (maybe they do not share a single struct vfio_iommu
>> instance?), or in case you see a way to salvage it.
> 
> Right, that's my first thought when I hear that the pin and dma_rw paths
> are blocked as well, we cannot rely on userspace to unblock anything.
> A malicious user may hold out just to see how long until the host
> becomes unusable.  Userspace determines how many groups share a
> vfio_iommu.
> 
>> Here is a more robust implementation.  It only works for dma regions backed by
>> a file, such as memfd or shm.
>>
>> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>   Find the file and offset for iova, and save the struct file pointer in
>>   struct vfio_dma.  In vfio_pin_pages and vfio_dma_rw and their descendants,
>>   if file* is set, then call pagecache_get_page() to get the pfn, instead of
>>   get_user_pages.
>>
>> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>   Remap the region (iova, size) to vaddr and drop the file reference.
>>
>> This raises the question of whether we can always use pagecache_get_page, and
>> eliminate the dependency on vaddr.  The translation performance could be
>> different, though.
>>
>> I have not implemented this yet.  Any thoughts before I do?
> 
> That's a pretty hefty usage restriction, but what I take from it is
> that these are mechanisms which provide a fallback lookup path that can
> service callers in the interim during the gap of the range being
> remapped.  The callers are always providing an IOVA and wishing to do
> something to the memory referenced by that IOVA, we just need a
> translation mechanism.  The IOMMU itself is also such an alternative
> lookup, via iommu_iova_to_phys(), but of course requiring an
> IOMMU-backed device is just another usage restriction, potentially one
> that's not even apparent to the user.
> 
> Is a more general solution to make sure there's always an IOVA-to-phys
> lookup mechanism available, implementing one if not provided by the
> IOMMU or memory backing interface?  We'd need to adapt the dma_rw
> interface to work on either a VA or PA, and pinning pages on
> UNMAP+REMAP, plus stashing them in a translation structure, plus
> dynamically adapting to changes (ex. the IOMMU backed device being
> removed, leaving a non-IOMMU backed device in the vfio_iommu) all
> sounds pretty complicated, especially as the vfio-iommu-type1 backend
> becomes stretched to be more and more fragile.  Possibly it's still
> feasible though.  Thanks,

Requiring file-backed memory is not a restriction in practice, because there is
no way to preserve MAP_ANON memory across exec and map it into the new qemu process.
That is what MADV_DOEXEC would have provided.  Without it, one cannot do live
update with MAP_ANON memory.

For qemu, when allocating anonymous memory for guest memory regions, we would modify the
allocation functions to call memfd_create + mmap(fd) instead of mmap(MAP_ANON).  The
implementation of memfd_create creates an unlinked shmem (tmpfs) file, much like creating a
/dev/shm file and unlinking it.  Thus the memory is backed by a file, and the VFIO
UNMAP/REMAP proposal works.
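
A minimal userspace sketch of the allocation change described above, assuming
Linux with a glibc that provides memfd_create() (2.27 or later).  The helper
names guest_ram_create, guest_ram_map and guest_ram_keep_across_exec are
hypothetical and are not qemu functions.  The sketch only illustrates that a
second mmap of the same fd reaches the same pages, possibly at a different VA,
which is why the vaddr recorded by vfio would need to be updated after exec.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create file-backed guest RAM instead of MAP_ANON. */
static int guest_ram_create(const char *name, size_t size)
{
    int fd = memfd_create(name, MFD_CLOEXEC);

    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: map the segment.  MAP_SHARED makes every mapping of
 * the fd refer to the same pages. */
static void *guest_ram_map(int fd, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: clear close-on-exec so the fd survives exec. */
static int guest_ram_keep_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}
```

In this sketch the old process would call guest_ram_keep_across_exec() just
before exec and pass the fd number to the new process (for example in an
environment variable, which is also an assumption here); the new process then
calls guest_ram_map() on that fd.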

- Steve



* Re: [PATCH V1 18/32] osdep: import MADV_DOEXEC
  2020-10-19 16:33                     ` Steven Sistare
@ 2020-10-26 18:28                       ` Steven Sistare
  0 siblings, 0 replies; 118+ messages in thread
From: Steven Sistare @ 2020-10-26 18:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrange, Michael S. Tsirkin, Alex Bennée,
	Juan Quintela, qemu-devel, Dr. David Alan Gilbert,
	Anthony Yznaga, Stefan Hajnoczi, Marc-André Lureau,
	Paolo Bonzini, Philippe Mathieu-Daudé,
	Markus Armbruster

On 10/19/2020 12:33 PM, Steven Sistare wrote:
> On 10/15/2020 4:36 PM, Alex Williamson wrote:
>> On Thu, 8 Oct 2020 12:32:35 -0400
>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>> On 8/24/2020 6:30 PM, Alex Williamson wrote:
>>>> On Wed, 19 Aug 2020 17:52:26 -0400
>>>> Steven Sistare <steven.sistare@oracle.com> wrote:  
>>>>> On 8/17/2020 10:42 PM, Alex Williamson wrote:  
>>>>>> On Mon, 17 Aug 2020 15:44:03 -0600
>>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:  
>>>>>>> On Mon, 17 Aug 2020 17:20:57 -0400
>>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:  
>>>>>>>> On 8/17/2020 4:48 PM, Alex Williamson wrote:      
>>>>>>>>> On Mon, 17 Aug 2020 14:30:51 -0400
>>>>>>>>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>>>>>>>>         
>>>>>>>>>> On 7/30/2020 11:14 AM, Steve Sistare wrote:        
>>>>>>>>>>> Anonymous memory segments used by the guest are preserved across a re-exec
>>>>>>>>>>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
>>>>>>>>>>> in the Linux kernel. For the madvise patches, see:
>>>>>>>>>>>
>>>>>>>>>>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com/
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>>>>> ---
>>>>>>>>>>>  include/qemu/osdep.h | 7 +++++++
>>>>>>>>>>>  1 file changed, 7 insertions(+)          
>>>>>>>>>>
>>>>>>>>>> Hi Alex,
>>>>>>>>>>   The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu 
>>>>>>>>>> live update series, is getting a chilly reception on lkml.  We could instead 
>>>>>>>>>> create guest memory using memfd_create and preserve the fd across exec.  However, 
>>>>>>>>>> the subsequent mmap(fd) will return a different VA than was used previously, 
>>>>>>>>>> which  is a problem for memory that was registered with vfio, as the original VA 
>>>>>>>>>> is remembered in the kernel struct vfio_dma and used in various kernel functions, 
>>>>>>>>>> such as vfio_iommu_replay.
>>>>>>>>>>
>>>>>>>>>> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
>>>>>>>>>> new_vaddr.  The implementation finds an exact match for (iova, size) and replaces 
>>>>>>>>>> vaddr with new_vaddr.  Flags cannot be changed.
>>>>>>>>>>
>>>>>>>>>> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
>>>>>>>>>> vfio on any form of shared memory (shm, dax, etc) could also be preserved across
>>>>>>>>>> exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
>>>>>>>>>>
>>>>>>>>>> What do you think?
>>>>>>>>>
>>>>>>>>> Your new REMAP ioctl would have parameters identical to the MAP_DMA
>>>>>>>>> ioctl, so I think we should just use one of the flag bits on the
>>>>>>>>> existing MAP_DMA ioctl for this variant.        
>>>>>>>>
>>>>>>>> Sounds good.
>>>>>>>>       
>>>>>>>>> Reading through the discussion on the kernel side there seems to be
>>>>>>>>> some confusion around why vfio needs the vaddr beyond the user call to
>>>>>>>>> MAP_DMA though.  Originally this was used to test for virtually
>>>>>>>>> contiguous mappings for merging and splitting purposes.  This is
>>>>>>>>> defunct in the v2 interface, however the vaddr is now used largely for
>>>>>>>>> mdev devices.  If an mdev device is not backed by an IOMMU device and
>>>>>>>>> does not share a container with an IOMMU device, then a user MAP_DMA
>>>>>>>>> ioctl essentially just registers the translation within the vfio
>>>>>>>>> container.  The mdev vendor driver can then later either request pages
>>>>>>>>> to be pinned for device DMA or can perform copy_to/from_user() to
>>>>>>>>> simulate DMA via the CPU.
>>>>>>>>>
>>>>>>>>> Therefore I don't see that there's a simple re-architecture of the vfio
>>>>>>>>> IOMMU backend that could drop vaddr use.          
>>>>>>>>
>>>>>>>> Yes.  I did not explain on lkml as you do here (thanks), but I reached the 
>>>>>>>> same conclusion.
>>>>>>>>       
>>>>>>>>> I'm a bit concerned this new
>>>>>>>>> remap proposal also raises the question of how do we prevent userspace
>>>>>>>>> remapping vaddrs racing with asynchronous kernel use of the previous
>>>>>>>>> vaddrs.          
>>>>>>>>
>>>>>>>> Agreed.  After a quick glance at the code, holding iommu->lock during 
>>>>>>>> remap might be sufficient, but it needs more study.      
>>>>>>>
>>>>>>> Unless you're suggesting an extended hold of the lock across the entire
>>>>>>> re-exec of QEMU, that's only going to prevent a race between a remap
>>>>>>> and a vendor driver pin or access, the time between the previous vaddr
>>>>>>> becoming invalid and the remap is unprotected.    
>>>>>
>>>>> OK.  What if we exclude mediated devices?  It appears they are the only
>>>>> ones where the kernel may async'ly use the vaddr, via call chains ending in 
>>>>> vfio_iommu_type1_pin_pages or vfio_iommu_type1_dma_rw_chunk.
>>>>>
>>>>> The other functions that use dma->vaddr are
>>>>>     vfio_dma_do_map 
>>>>>     vfio_pin_map_dma 
>>>>>     vfio_iommu_replay 
>>>>>     vfio_pin_pages_remote
>>>>> and they are all initiated via userland ioctl (if I have traced all the code 
>>>>> paths correctly).  Thus iommu->lock would protect them.
>>>>>
>>>>> We would block live update in qemu if the config includes a mediated device.
>>>>>
>>>>> VFIO_IOMMU_REMAP_DMA would return EINVAL if the container has a mediated device.  
>>>>
>>>> That's not a solution I'd really be in favor of.  We're eliminating an
>>>> entire class of devices because they _might_ make use of these
>>>> interfaces, but anyone can add a vfio bus driver, even exposing the
>>>> same device API, and maybe make use of some of these interfaces in that
>>>> driver.  Maybe we'd even have reason to do it in vfio-pci if we had
>>>> reason to virtualize some aspect of a device.  I think we're setting
>>>> ourselves up for a very complicated support scenario if we just
>>>> arbitrarily decide to deny drivers using certain interfaces.
>>>>   
>>>>>>>>> Are we expecting guest drivers/agents to quiesce the device,
>>>>>>>>> or maybe relying on clearing bus-master, for PCI devices, to halt DMA?        
>>>>>>>>
>>>>>>>> No.  We want to support any guest, and the guest is not aware that qemu
>>>>>>>> live update is occurring.
>>>>>>>>       
>>>>>>>>> The vfio migration interface we've developed does have a mechanism to
>>>>>>>>> stop a device, would we need to use this here?  If we do have a
>>>>>>>>> mechanism to quiesce the device, is the only reason we're not unmapping
>>>>>>>>> everything and remapping it into the new address space the latency in
>>>>>>>>> performing that operation?  Thanks,        
>>>>>>>>
>>>>>>>> Same answer - we don't require that the guest has vfio migration support.      
>>>>>>>
>>>>>>> QEMU toggling the runstate of the device via the vfio migration
>>>>>>> interface could be done transparently to the guest, but if your
>>>>>>> intention is to support any device (where none currently support the
>>>>>>> migration interface) perhaps it's a moot point.      
>>>>>
>>>>> That sounds useful when devices support it.  Can you give me some function names
>>>>> or references so I can study this qemu-based "vfio migration interface"?
>>>>
>>>> The uAPI is documented in commit a8a24f3f6e38.  We're still waiting on
>>>> the QEMU support or implementation in an mdev vendor driver.
>>>> Essentially migration exposes a new region of the device which would be
>>>> implemented by the vendor driver.  A register within that region
>>>> manipulates the device state, so a device could be stopped by clearing
>>>> the 'run' bit in that register.
>>>>   
>>>>>>> It seems like this
>>>>>>> scheme only works with IOMMU backed devices where the device can
>>>>>>> continue to operate against pinned pages, anything that might need to
>>>>>>> dynamically pin pages against the process vaddr as it's running async
>>>>>>> to the QEMU re-exec needs to be blocked or stopped.  Thanks,    
>>>>>
>>>>> Yes, true of this remap proposal.
>>>>>
>>>>> I wanted to unconditionally support all devices, which is why I think that
>>>>> MADV_DOEXEC is a nifty solution.  If you agree, please add your voice to the
>>>>> lkml discussion.
>>>
>>> Hi Alex, here is a modified proposal to remap vaddr in the face of async requests
>>> from mediated device drivers.
>>>
>>> Define a new flag VFIO_DMA_MAP_FLAG_REMAP for use with VFIO_IOMMU_UNMAP_DMA and
>>> VFIO_IOMMU_MAP_DMA.
>>>
>>> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>>   Discard vaddr on the existing dma region defined by (iova, size), but keep the
>>>   struct vfio_dma.  Subsequent translation requests are blocked.
>>>   The implementation sets a flag in struct vfio_dma.  vfio_pin_pages() and
>>>   vfio_dma_rw() acquire iommu->lock, check the flag, and retry.
>>>   Called before exec.
>>>
>>> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>>   Remap the region (iova, size) to vaddr, and resume translation requests.
>>>   Called after exec.
>>>
>>> Unfortunately, remap as defined above has an undesirable side effect.  The mdev
>>> driver may use kernel worker threads which serve requests from multiple clients
>>> (eg i915/gvt workload_thread).  A process that fails to call MAP_DMA with REMAP,
>>> or is tardy doing so, will delay other processes who are stuck waiting in
>>> vfio_pin_pages or vfio_dma_rw.  This is unacceptable, and I mention this scheme in
>>> case I am misinterpreting the code (maybe they do not share a single struct vfio_iommu
>>> instance?), or in case you see a way to salvage it.
>>
>> Right, that's my first thought when I hear that the pin and dma_rw paths
>> are blocked as well, we cannot rely on userspace to unblock anything.
>> A malicious user may hold out just to see how long until the host
>> becomes unusable.  Userspace determines how many groups share a
>> vfio_iommu.

I want to reconsider the above solution.  The pagecache_get_page solution below has problems 
which I will elaborate on shortly.

I was confused about the granularity of vfio_iommu sharing, but now I see that each vGPU 
in the gvt example would have its own vfio_iommu, so one misbehaving process would not 
block another.  Multiple iommu groups within one container share a vfio_iommu, and a
delay in finishing the remap will block all those devices from completing pin or 
rw operations, but that only blocks the kernel thread servicing that application.  
Please say more on how that could cause the system to become unusable.
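
To make the flag-and-retry scheme under discussion concrete, here is a toy
userspace model.  It is only a sketch under stated assumptions: struct toy_dma
and the toy_* functions are hypothetical stand-ins, not the kernel's struct
vfio_dma or any vfio code, and a pthread mutex stands in for iommu->lock.  The
translate path returns -EAGAIN while the vaddr is discarded, leaving the retry
to the caller.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a dma range; not the kernel structure. */
struct toy_dma {
    pthread_mutex_t lock;   /* models iommu->lock */
    uint64_t iova;
    uint64_t size;
    uint64_t vaddr;
    bool vaddr_valid;       /* cleared by unmap-remap, set by map-remap */
};

static void toy_dma_init(struct toy_dma *d, uint64_t iova, uint64_t size,
                         uint64_t vaddr)
{
    pthread_mutex_init(&d->lock, NULL);
    d->iova = iova;
    d->size = size;
    d->vaddr = vaddr;
    d->vaddr_valid = true;
}

/* Models UNMAP_DMA with the proposed REMAP flag: discard vaddr, keep range. */
static void toy_unmap_remap(struct toy_dma *d)
{
    pthread_mutex_lock(&d->lock);
    d->vaddr_valid = false;
    pthread_mutex_unlock(&d->lock);
}

/* Models MAP_DMA with the proposed REMAP flag: exact (iova, size) match. */
static int toy_map_remap(struct toy_dma *d, uint64_t iova, uint64_t size,
                         uint64_t new_vaddr)
{
    int ret = -EINVAL;

    pthread_mutex_lock(&d->lock);
    if (iova == d->iova && size == d->size) {
        d->vaddr = new_vaddr;
        d->vaddr_valid = true;
        ret = 0;
    }
    pthread_mutex_unlock(&d->lock);
    return ret;
}

/* Models the pin/rw path: take the lock, check the flag, translate.
 * Returns -EAGAIN while the vaddr is discarded so the caller can retry. */
static int toy_translate(struct toy_dma *d, uint64_t iova, uint64_t *vaddr_out)
{
    int ret = 0;

    pthread_mutex_lock(&d->lock);
    if (iova < d->iova || iova - d->iova >= d->size) {
        ret = -EINVAL;
    } else if (!d->vaddr_valid) {
        ret = -EAGAIN;
    } else {
        *vaddr_out = d->vaddr + (iova - d->iova);
    }
    pthread_mutex_unlock(&d->lock);
    return ret;
}
```

The window between toy_unmap_remap() and toy_map_remap() is exactly the period
in which callers keep seeing -EAGAIN, so how long it lasts is controlled by
the process performing the remap.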

>>> Here is a more robust implementation.  It only works for dma regions backed by
>>> a file, such as memfd or shm.
>>>
>>> VFIO_IOMMU_UNMAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>>   Find the file and offset for iova, and save the struct file pointer in
>>>   struct vfio_dma.  In vfio_pin_pages and vfio_dma_rw and their descendants,
>>>   if file* is set, then call pagecache_get_page() to get the pfn, instead of
>>>   get_user_pages.
>>>
>>> VFIO_IOMMU_MAP_DMA  flags=VFIO_DMA_MAP_FLAG_REMAP
>>>   Remap the region (iova, size) to vaddr and drop the file reference.
>>>
>>> This raises the question of whether we can always use pagecache_get_page, and
>>> eliminate the dependency on vaddr.  The translation performance could be
>>> different, though.
>>>
>>> I have not implemented this yet.  Any thoughts before I do?
>>
>> That's a pretty hefty usage restriction, but what I take from it is
>> that these are mechanisms which provide a fallback lookup path that can
>> service callers in the interim during the gap of the range being
>> remapped.  The callers are always providing an IOVA and wishing to do
>> something to the memory referenced by that IOVA, we just need a
>> translation mechanism.  The IOMMU itself is also such an alternative
>> lookup, via iommu_iova_to_phys(), but of course requiring an
>> IOMMU-backed device is just another usage restriction, potentially one
>> that's not even apparent to the user.
>>
>> Is a more general solution to make sure there's always an IOVA-to-phys
>> lookup mechanism available, implementing one if not provided by the
>> IOMMU or memory backing interface?  We'd need to adapt the dma_rw
>> interface to work on either a VA or PA, and pinning pages on
>> UNMAP+REMAP, plus stashing them in a translation structure, plus
>> dynamically adapting to changes (ex. the IOMMU backed device being
>> removed, leaving a non-IOMMU backed device in the vfio_iommu) all
>> sounds pretty complicated, especially as the vfio-iommu-type1 backend
>> becomes stretched to be more and more fragile.  Possibly it's still
>> feasible though.  Thanks,
> 
> Requiring file backed memory is not a restriction in practice, because there is
> no way to preserve MAP_ANON memory across exec and map it into the new qemu process.
> That is what MADV_DOEXEC would have provided.  Without it, one cannot do live 
> update with MAP_ANON memory.
> 
> For qemu, when allocating anonymous memory for guest memory regions, we would modify the 
> allocation  functions to call  memfd_create + mmap(fd) instead of mmap(MAP_ANON).  The
> implementation of memfd_create creates a /dev/shm file and unlinks it. Thus the memory is 
> backed by a file, and the VFIO UNMAP/REMAP proposal works.
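
As a rough illustration of the memfd_create + mmap(fd) allocation described
above, here is a minimal user-space sketch.  The helper name alloc_file_backed
and the "guest-ram" label are hypothetical and are not taken from the qemu
sources; the sketch assumes Linux with a glibc that exposes memfd_create().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a qemu function): allocate 'size' bytes of
 * memfd-backed shared memory in place of mmap(MAP_ANON).
 * Returns the mapping, or NULL on failure.  On success, *fd_out holds
 * the descriptor that backs the memory.
 */
static void *alloc_file_backed(size_t size, int *fd_out)
{
    assert(size > 0 && fd_out != NULL);

    /* Flags are 0 (no MFD_CLOEXEC) so the descriptor is not closed on exec. */
    int fd = memfd_create("guest-ram", 0);
    if (fd < 0) {
        return NULL;
    }

    /* A new memfd has length 0; size it before mapping. */
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }

    /* MAP_SHARED so that stores reach the file rather than a private copy. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    *fd_out = fd;
    return p;
}
```

Because the mapping is shared and file backed, any later mmap() of the same
descriptor sees the same pages, which is the property the UNMAP/REMAP proposal
relies on.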

I prototyped this using pagecache_get_page() to fault in and translate any page in the segment,
and it works.  However, faulting at this level misses many checks and side effects that are
normally applied at the handle_mm_fault() level and below.  As one example, shmem_fault()
has accounting and limits for its swap-backed ramfs.  There is no good way to refactor the
higher-level functions to apply all side effects while using only a segment offset rather than
a VA.  Those functions use the VA, vma, and page table throughout.

To salvage this approach, we could pre-fault and pin the entire dma range on the unmap-remap
call, and unpin in map-remap.  If we do so, pagecache_get_page() will simply return the
translation.  However, the pinning is very expensive and could cause a noticeable
pause for guest operations if it waits on any pin or rw operations during the pre-fault phase.
We could pin unconditionally in the initial dma_map, before the guest is started, but
that defeats the purpose of the vfio_pin_pages interface.

- Steve



end of thread, other threads:[~2020-10-26 18:33 UTC | newest]

Thread overview: 118+ messages
2020-07-30 15:14 [PATCH V1 00/32] Live Update Steve Sistare
2020-07-30 15:14 ` [PATCH V1 01/32] savevm: add vmstate handler iterators Steve Sistare
2020-09-11 16:24   ` Dr. David Alan Gilbert
2020-09-24 21:43     ` Steven Sistare
2020-09-25  9:07       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 02/32] savevm: VM handlers mode mask Steve Sistare
2020-07-30 15:14 ` [PATCH V1 03/32] savevm: QMP command for cprsave Steve Sistare
2020-07-30 16:12   ` Eric Blake
2020-07-30 17:52     ` Steven Sistare
2020-09-11 16:43   ` Dr. David Alan Gilbert
2020-09-25 18:43     ` Steven Sistare
2020-09-25 22:22       ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 04/32] savevm: HMP Command " Steve Sistare
2020-09-11 16:57   ` Dr. David Alan Gilbert
2020-09-24 21:44     ` Steven Sistare
2020-09-25  9:26       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 05/32] savevm: QMP command for cprload Steve Sistare
2020-07-30 16:14   ` Eric Blake
2020-07-30 18:00     ` Steven Sistare
2020-09-11 17:18       ` Dr. David Alan Gilbert
2020-09-24 21:49         ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 06/32] savevm: HMP Command " Steve Sistare
2020-07-30 15:14 ` [PATCH V1 07/32] savevm: QMP command for cprinfo Steve Sistare
2020-07-30 16:17   ` Eric Blake
2020-07-30 18:02     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 08/32] savevm: HMP " Steve Sistare
2020-09-11 17:27   ` Dr. David Alan Gilbert
2020-09-24 21:50     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 09/32] savevm: prevent cprsave if memory is volatile Steve Sistare
2020-09-11 17:35   ` Dr. David Alan Gilbert
2020-09-24 21:51     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 10/32] kvmclock: restore paused KVM clock Steve Sistare
2020-09-11 17:50   ` Dr. David Alan Gilbert
2020-09-25 18:07     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 11/32] cpu: disable ticks when suspended Steve Sistare
2020-09-11 17:53   ` Dr. David Alan Gilbert
2020-09-24 20:42     ` Steven Sistare
2020-09-25  9:03       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 12/32] vl: pause option Steve Sistare
2020-07-30 16:20   ` Eric Blake
2020-07-30 18:11     ` Steven Sistare
2020-07-31 10:07       ` Daniel P. Berrangé
2020-07-31 15:18         ` Steven Sistare
2020-07-30 17:03   ` Alex Bennée
2020-07-30 18:14     ` Steven Sistare
2020-07-31  9:44       ` Alex Bennée
2020-09-11 17:59       ` Dr. David Alan Gilbert
2020-09-24 21:51         ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 13/32] gdbstub: gdb support for suspended state Steve Sistare
2020-09-11 18:41   ` Dr. David Alan Gilbert
2020-09-24 21:51     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 14/32] savevm: VMS_RESTART and cprsave restart Steve Sistare
2020-07-30 16:22   ` Eric Blake
2020-07-30 18:14     ` Steven Sistare
2020-09-11 18:44   ` Dr. David Alan Gilbert
2020-09-24 21:44     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 15/32] vl: QEMU_START_FREEZE env var Steve Sistare
2020-09-11 18:49   ` Dr. David Alan Gilbert
2020-09-24 21:47     ` Steven Sistare
2020-09-25 15:52       ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 16/32] oslib: add qemu_clr_cloexec Steve Sistare
2020-09-11 18:52   ` Dr. David Alan Gilbert
2020-07-30 15:14 ` [PATCH V1 17/32] util: env var helpers Steve Sistare
2020-09-11 19:00   ` Dr. David Alan Gilbert
2020-09-24 21:52     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 18/32] osdep: import MADV_DOEXEC Steve Sistare
2020-08-17 18:30   ` Steven Sistare
2020-08-17 20:48     ` Alex Williamson
2020-08-17 21:20       ` Steven Sistare
2020-08-17 21:44         ` Alex Williamson
2020-08-18  2:42           ` Alex Williamson
2020-08-19 21:52             ` Steven Sistare
2020-08-24 22:30               ` Alex Williamson
2020-10-08 16:32                 ` Steven Sistare
2020-10-15 20:36                   ` Alex Williamson
2020-10-19 16:33                     ` Steven Sistare
2020-10-26 18:28                       ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 19/32] memory: ram_block_add cosmetic changes Steve Sistare
2020-07-30 15:14 ` [PATCH V1 20/32] vl: add helper to request re-exec Steve Sistare
2020-07-30 15:14 ` [PATCH V1 21/32] exec, memory: exec(3) to restart Steve Sistare
2020-07-30 15:14 ` [PATCH V1 22/32] char: qio_channel_socket_accept reuse fd Steve Sistare
2020-09-15 17:33   ` Dr. David Alan Gilbert
2020-09-15 17:53     ` Daniel P. Berrangé
2020-09-24 21:54     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 23/32] char: save/restore chardev socket fds Steve Sistare
2020-07-30 15:14 ` [PATCH V1 24/32] ui: save/restore vnc " Steve Sistare
2020-07-31  9:06   ` Daniel P. Berrangé
2020-07-31 16:51     ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 25/32] char: save/restore chardev pty fds Steve Sistare
2020-07-30 15:14 ` [PATCH V1 26/32] monitor: save/restore QMP negotiation status Steve Sistare
2020-07-30 15:14 ` [PATCH V1 27/32] vhost: reset vhost devices upon cprsave Steve Sistare
2020-07-30 15:14 ` [PATCH V1 28/32] char: restore terminal on restart Steve Sistare
2020-07-30 15:14 ` [PATCH V1 29/32] pci: export pci_update_mappings Steve Sistare
2020-07-30 15:14 ` [PATCH V1 30/32] vfio-pci: save and restore Steve Sistare
2020-08-06 10:22   ` Jason Zeng
2020-08-07 20:38     ` Steven Sistare
2020-08-10  3:50       ` Jason Zeng
2020-08-19 21:15         ` Steven Sistare
2020-08-20 10:33           ` Jason Zeng
2020-10-07 21:25             ` Steven Sistare
2020-07-30 15:14 ` [PATCH V1 31/32] vfio-pci: trace pci config Steve Sistare
2020-07-30 15:14 ` [PATCH V1 32/32] vfio-pci: improved tracing Steve Sistare
2020-09-15 18:49   ` Dr. David Alan Gilbert
2020-09-24 21:52     ` Steven Sistare
2020-07-30 16:52 ` [PATCH V1 00/32] Live Update Daniel P. Berrangé
2020-07-30 18:48   ` Steven Sistare
2020-07-31  8:53     ` Daniel P. Berrangé
2020-07-31 15:27       ` Steven Sistare
2020-07-31 15:52         ` Daniel P. Berrangé
2020-07-31 17:20           ` Steven Sistare
2020-08-11 19:08           ` Dr. David Alan Gilbert
2020-07-30 17:15 ` Paolo Bonzini
2020-07-30 19:09   ` Steven Sistare
2020-07-30 21:39     ` Paolo Bonzini
2020-07-31 19:22       ` Steven Sistare
2020-07-30 17:49 ` Dr. David Alan Gilbert
2020-07-30 19:31   ` Steven Sistare
2020-08-04 18:18 ` Steven Sistare
