qemu-devel.nongnu.org archive mirror
* [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration
@ 2023-07-16  8:15 Avihai Horon
  2023-07-16  8:15 ` [PATCH for-8.2 1/6] vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup() Avihai Horon
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Avihai Horon @ 2023-07-16  8:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Paolo Bonzini,
	Joao Martins, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta, Avihai Horon

Hi all,

The first patch in this series adds a small optimization to VFIO
migration by moving the STOP_COPY->STOP transition to
vfio_save_cleanup(). Testing with a ConnectX-7 VFIO device showed that
this can reduce downtime by up to 6%.

The rest of the series adds P2P support for VFIO migration.

VFIO migration uAPI defines an optional intermediate P2P quiescent
state. While in the P2P quiescent state, P2P DMA transactions cannot be
initiated by the device, but the device can respond to incoming ones.
Additionally, all outstanding P2P transactions are guaranteed to have
been completed by the time the device enters this state.

The purpose of this state is to support migration of multiple devices
that are doing P2P transactions between themselves.

To implement P2P migration support, all the devices must be transitioned
to the P2P quiescent state before being stopped or started.

This is achieved by adding an optional pre VM state change callback to
VMChangeStateEntry. The pre-change callbacks are invoked before the main
VM state change callbacks: first they transition all the VFIO devices to
the P2P quiescent state, and only then are the main callbacks invoked to
stop or start the devices.

This will allow migration of multiple VFIO devices if all of them
support P2P migration.

Thanks.

Avihai Horon (5):
  vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup()
  sysemu: Add pre VM state change callback
  qdev: Add qdev_add_vm_change_state_handler_full()
  vfio/migration: Add P2P support for VFIO migration
  vfio/migration: Allow migration of multiple P2P supporting devices

Joao Martins (1):
  vfio/migration: Refactor PRE_COPY and RUNNING state checks

 docs/devel/vfio-migration.rst     | 93 +++++++++++++++++++------------
 include/hw/vfio/vfio-common.h     |  2 +
 include/sysemu/runstate.h         |  7 +++
 hw/core/vm-change-state-handler.c | 14 ++++-
 hw/vfio/common.c                  | 50 +++++++++++++----
 hw/vfio/migration.c               | 83 +++++++++++++++++++--------
 softmmu/runstate.c                | 39 +++++++++++++
 hw/vfio/trace-events              |  2 +-
 8 files changed, 217 insertions(+), 73 deletions(-)

-- 
2.26.3



^ permalink raw reply	[flat|nested] 20+ messages in thread


* [PATCH for-8.2 1/6] vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup()
  2023-07-16  8:15 [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
@ 2023-07-16  8:15 ` Avihai Horon
  2023-07-30 16:25   ` Cédric Le Goater
  2023-07-16  8:15 ` [PATCH for-8.2 2/6] sysemu: Add pre VM state change callback Avihai Horon
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-07-16  8:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Paolo Bonzini,
	Joao Martins, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta, Avihai Horon

Changing the device state from STOP_COPY to STOP can take time as the
device may need to free resources and do other operations as part of the
transition. Currently, this is done in vfio_save_complete_precopy() and
therefore it is counted in the migration downtime.

To avoid this, change the device state from STOP_COPY to STOP in
vfio_save_cleanup(), which is called after migration has completed and
thus is not part of migration downtime.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 hw/vfio/migration.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 2674f4bc47..8acd182a8b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -383,6 +383,19 @@ static void vfio_save_cleanup(void *opaque)
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
 
+    /*
+     * Changing device state from STOP_COPY to STOP can take time. Do it here,
+     * after migration has completed, so it won't increase downtime.
+     */
+    if (migration->device_state == VFIO_DEVICE_STATE_STOP_COPY) {
+        /*
+         * If setting the device in STOP state fails, the device should be
+         * reset. To do so, use ERROR state as a recover state.
+         */
+        vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
+                                 VFIO_DEVICE_STATE_ERROR);
+    }
+
     g_free(migration->data_buffer);
     migration->data_buffer = NULL;
     migration->precopy_init_size = 0;
@@ -508,12 +521,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
         return ret;
     }
 
-    /*
-     * If setting the device in STOP state fails, the device should be reset.
-     * To do so, use ERROR state as a recover state.
-     */
-    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
-                                   VFIO_DEVICE_STATE_ERROR);
     trace_vfio_save_complete_precopy(vbasedev->name, ret);
 
     return ret;
-- 
2.26.3




* [PATCH for-8.2 2/6] sysemu: Add pre VM state change callback
  2023-07-16  8:15 [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
  2023-07-16  8:15 ` [PATCH for-8.2 1/6] vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup() Avihai Horon
@ 2023-07-16  8:15 ` Avihai Horon
  2023-07-27 16:23   ` Cédric Le Goater
  2023-07-16  8:15 ` [PATCH for-8.2 3/6] qdev: Add qdev_add_vm_change_state_handler_full() Avihai Horon
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-07-16  8:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Paolo Bonzini,
	Joao Martins, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta, Avihai Horon

Add pre VM state change callback to struct VMChangeStateEntry.

The pre VM state change callback is optional and can be set by the new
function qemu_add_vm_change_state_handler_prio_full() that allows
setting this callback in addition to the main callback.

The pre VM state change callbacks and main callbacks are called in two
separate phases: First all pre VM state change callbacks are called and
only then all main callbacks are called.

The purpose of the new pre VM state change callback is to allow all
devices to run a preliminary task before calling the devices' main
callbacks.

This will facilitate adding P2P support for VFIO migration where all
VFIO devices need to be put in an intermediate P2P quiescent state
before being stopped or started by the main VM state change callback.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/sysemu/runstate.h |  4 ++++
 softmmu/runstate.c        | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index 7beb29c2e2..bb38a4b4bd 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -16,6 +16,10 @@ VMChangeStateEntry *qemu_add_vm_change_state_handler(VMChangeStateHandler *cb,
                                                      void *opaque);
 VMChangeStateEntry *qemu_add_vm_change_state_handler_prio(
         VMChangeStateHandler *cb, void *opaque, int priority);
+VMChangeStateEntry *
+qemu_add_vm_change_state_handler_prio_full(VMChangeStateHandler *cb,
+                                           VMChangeStateHandler *pre_change_cb,
+                                           void *opaque, int priority);
 VMChangeStateEntry *qdev_add_vm_change_state_handler(DeviceState *dev,
                                                      VMChangeStateHandler *cb,
                                                      void *opaque);
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index f3bd862818..a1f0653899 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -271,6 +271,7 @@ void qemu_system_vmstop_request(RunState state)
 }
 struct VMChangeStateEntry {
     VMChangeStateHandler *cb;
+    VMChangeStateHandler *pre_change_cb;
     void *opaque;
     QTAILQ_ENTRY(VMChangeStateEntry) entries;
     int priority;
@@ -293,12 +294,38 @@ static QTAILQ_HEAD(, VMChangeStateEntry) vm_change_state_head =
  */
 VMChangeStateEntry *qemu_add_vm_change_state_handler_prio(
         VMChangeStateHandler *cb, void *opaque, int priority)
+{
+    return qemu_add_vm_change_state_handler_prio_full(cb, NULL, opaque,
+                                                      priority);
+}
+
+/**
+ * qemu_add_vm_change_state_handler_prio_full:
+ * @cb: the main callback to invoke
+ * @pre_change_cb: a callback to invoke before the main callback
+ * @opaque: user data passed to the callbacks
+ * @priority: low priorities execute first when the vm runs and the reverse is
+ *            true when the vm stops
+ *
+ * Register a main callback function and an optional pre VM state change
+ * callback function that are invoked when the vm starts or stops running. The
+ * main callback and the pre VM state change callback are called in two
+ * separate phases: First all pre VM state change callbacks are called and only
+ * then all main callbacks are called.
+ *
+ * Returns: an entry to be freed using qemu_del_vm_change_state_handler()
+ */
+VMChangeStateEntry *
+qemu_add_vm_change_state_handler_prio_full(VMChangeStateHandler *cb,
+                                           VMChangeStateHandler *pre_change_cb,
+                                           void *opaque, int priority)
 {
     VMChangeStateEntry *e;
     VMChangeStateEntry *other;
 
     e = g_malloc0(sizeof(*e));
     e->cb = cb;
+    e->pre_change_cb = pre_change_cb;
     e->opaque = opaque;
     e->priority = priority;
 
@@ -333,10 +360,22 @@ void vm_state_notify(bool running, RunState state)
     trace_vm_state_notify(running, state, RunState_str(state));
 
     if (running) {
+        QTAILQ_FOREACH_SAFE(e, &vm_change_state_head, entries, next) {
+            if (e->pre_change_cb) {
+                e->pre_change_cb(e->opaque, running, state);
+            }
+        }
+
         QTAILQ_FOREACH_SAFE(e, &vm_change_state_head, entries, next) {
             e->cb(e->opaque, running, state);
         }
     } else {
+        QTAILQ_FOREACH_REVERSE_SAFE(e, &vm_change_state_head, entries, next) {
+            if (e->pre_change_cb) {
+                e->pre_change_cb(e->opaque, running, state);
+            }
+        }
+
         QTAILQ_FOREACH_REVERSE_SAFE(e, &vm_change_state_head, entries, next) {
             e->cb(e->opaque, running, state);
         }
-- 
2.26.3




* [PATCH for-8.2 3/6] qdev: Add qdev_add_vm_change_state_handler_full()
  2023-07-16  8:15 [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
  2023-07-16  8:15 ` [PATCH for-8.2 1/6] vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup() Avihai Horon
  2023-07-16  8:15 ` [PATCH for-8.2 2/6] sysemu: Add pre VM state change callback Avihai Horon
@ 2023-07-16  8:15 ` Avihai Horon
  2023-07-16  8:15 ` [PATCH for-8.2 4/6] vfio/migration: Refactor PRE_COPY and RUNNING state checks Avihai Horon
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Avihai Horon @ 2023-07-16  8:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Paolo Bonzini,
	Joao Martins, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta, Avihai Horon

Add qdev_add_vm_change_state_handler_full() variant that allows setting
a pre VM state change callback in addition to the main callback.

This will facilitate adding P2P support for VFIO migration in the
following patches.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/sysemu/runstate.h         |  3 +++
 hw/core/vm-change-state-handler.c | 14 +++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
index bb38a4b4bd..ca97b9dfa7 100644
--- a/include/sysemu/runstate.h
+++ b/include/sysemu/runstate.h
@@ -23,6 +23,9 @@ qemu_add_vm_change_state_handler_prio_full(VMChangeStateHandler *cb,
 VMChangeStateEntry *qdev_add_vm_change_state_handler(DeviceState *dev,
                                                      VMChangeStateHandler *cb,
                                                      void *opaque);
+VMChangeStateEntry *qdev_add_vm_change_state_handler_full(
+    DeviceState *dev, VMChangeStateHandler *cb,
+    VMChangeStateHandler *pre_change_cb, void *opaque);
 void qemu_del_vm_change_state_handler(VMChangeStateEntry *e);
 /**
  * vm_state_notify: Notify the state of the VM
diff --git a/hw/core/vm-change-state-handler.c b/hw/core/vm-change-state-handler.c
index 1f3630986d..24f155fb62 100644
--- a/hw/core/vm-change-state-handler.c
+++ b/hw/core/vm-change-state-handler.c
@@ -55,8 +55,20 @@ static int qdev_get_dev_tree_depth(DeviceState *dev)
 VMChangeStateEntry *qdev_add_vm_change_state_handler(DeviceState *dev,
                                                      VMChangeStateHandler *cb,
                                                      void *opaque)
+{
+    return qdev_add_vm_change_state_handler_full(dev, cb, NULL, opaque);
+}
+
+/*
+ * Exactly like qdev_add_vm_change_state_handler() but passes a pre_change_cb
+ * argument too.
+ */
+VMChangeStateEntry *qdev_add_vm_change_state_handler_full(
+    DeviceState *dev, VMChangeStateHandler *cb,
+    VMChangeStateHandler *pre_change_cb, void *opaque)
 {
     int depth = qdev_get_dev_tree_depth(dev);
 
-    return qemu_add_vm_change_state_handler_prio(cb, opaque, depth);
+    return qemu_add_vm_change_state_handler_prio_full(cb, pre_change_cb, opaque,
+                                                      depth);
 }
-- 
2.26.3




* [PATCH for-8.2 4/6] vfio/migration: Refactor PRE_COPY and RUNNING state checks
  2023-07-16  8:15 [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
                   ` (2 preceding siblings ...)
  2023-07-16  8:15 ` [PATCH for-8.2 3/6] qdev: Add qdev_add_vm_change_state_handler_full() Avihai Horon
@ 2023-07-16  8:15 ` Avihai Horon
  2023-07-21 11:33   ` Cédric Le Goater
  2023-07-16  8:15 ` [PATCH for-8.2 5/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-07-16  8:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Paolo Bonzini,
	Joao Martins, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta, Avihai Horon

From: Joao Martins <joao.m.martins@oracle.com>

Move the PRE_COPY and RUNNING state checks to helper functions.

This is in preparation for adding P2P VFIO migration support, where
these helpers will also test for PRE_COPY_P2P and RUNNING_P2P states.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/hw/vfio/vfio-common.h |  2 ++
 hw/vfio/common.c              | 22 ++++++++++++++++++----
 hw/vfio/migration.c           | 10 ++++------
 3 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index da43d27352..e9b8954595 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -230,6 +230,8 @@ void vfio_unblock_multiple_devices_migration(void);
 bool vfio_viommu_preset(VFIODevice *vbasedev);
 int64_t vfio_mig_bytes_transferred(void);
 void vfio_reset_bytes_transferred(void);
+bool vfio_device_state_is_running(VFIODevice *vbasedev);
+bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
 
 #ifdef CONFIG_LINUX
 int vfio_get_region_info(VFIODevice *vbasedev, int index,
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9aac21abb7..16cf79a76c 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -437,6 +437,20 @@ static void vfio_set_migration_error(int err)
     }
 }
 
+bool vfio_device_state_is_running(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->device_state == VFIO_DEVICE_STATE_RUNNING;
+}
+
+bool vfio_device_state_is_precopy(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
+}
+
 static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
 {
     VFIOGroup *group;
@@ -457,8 +471,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
             }
 
             if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
-                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
+                (vfio_device_state_is_running(vbasedev) ||
+                 vfio_device_state_is_precopy(vbasedev))) {
                 return false;
             }
         }
@@ -503,8 +517,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
                 return false;
             }
 
-            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
-                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+            if (vfio_device_state_is_running(vbasedev) ||
+                vfio_device_state_is_precopy(vbasedev)) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 8acd182a8b..48f9c23cbe 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -411,7 +411,7 @@ static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
     VFIODevice *vbasedev = opaque;
     VFIOMigration *migration = vbasedev->migration;
 
-    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
+    if (!vfio_device_state_is_precopy(vbasedev)) {
         return;
     }
 
@@ -444,7 +444,7 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
     vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
     *must_precopy += stop_copy_size;
 
-    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+    if (vfio_device_state_is_precopy(vbasedev)) {
         vfio_query_precopy_size(migration);
 
         *must_precopy +=
@@ -459,9 +459,8 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
 static bool vfio_is_active_iterate(void *opaque)
 {
     VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
 
-    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
+    return vfio_device_state_is_precopy(vbasedev);
 }
 
 static int vfio_save_iterate(QEMUFile *f, void *opaque)
@@ -656,7 +655,6 @@ static const SaveVMHandlers savevm_vfio_handlers = {
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
-    VFIOMigration *migration = vbasedev->migration;
     enum vfio_device_mig_state new_state;
     int ret;
 
@@ -664,7 +662,7 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
         new_state = VFIO_DEVICE_STATE_RUNNING;
     } else {
         new_state =
-            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
+            (vfio_device_state_is_precopy(vbasedev) &&
              (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
                 VFIO_DEVICE_STATE_STOP_COPY :
                 VFIO_DEVICE_STATE_STOP;
-- 
2.26.3




* [PATCH for-8.2 5/6] vfio/migration: Add P2P support for VFIO migration
  2023-07-16  8:15 [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
                   ` (3 preceding siblings ...)
  2023-07-16  8:15 ` [PATCH for-8.2 4/6] vfio/migration: Refactor PRE_COPY and RUNNING state checks Avihai Horon
@ 2023-07-16  8:15 ` Avihai Horon
  2023-07-21 11:48   ` Cédric Le Goater
  2023-07-16  8:15 ` [PATCH for-8.2 6/6] vfio/migration: Allow migration of multiple P2P supporting devices Avihai Horon
  2023-07-18 15:46 ` [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Jason Gunthorpe
  6 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-07-16  8:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Paolo Bonzini,
	Joao Martins, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta, Avihai Horon

VFIO migration uAPI defines an optional intermediate P2P quiescent
state. While in the P2P quiescent state, P2P DMA transactions cannot be
initiated by the device, but the device can respond to incoming ones.
Additionally, all outstanding P2P transactions are guaranteed to have
been completed by the time the device enters this state.

The purpose of this state is to support migration of multiple devices
that are doing P2P transactions between themselves.

Add support for P2P migration by transitioning all the devices to the
P2P quiescent state before stopping or starting the devices. Use the new
VMChangeStateHandler pre_change_cb to achieve that behavior.

This will allow migration of multiple VFIO devices if all of them
support P2P migration.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst | 93 +++++++++++++++++++++--------------
 hw/vfio/common.c              |  6 ++-
 hw/vfio/migration.c           | 58 +++++++++++++++++-----
 hw/vfio/trace-events          |  2 +-
 4 files changed, 107 insertions(+), 52 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index b433cb5bb2..b9c57ba651 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -23,9 +23,21 @@ and recommends that the initial bytes are sent and loaded in the destination
 before stopping the source VM. Enabling this migration capability will
 guarantee that and thus, can potentially reduce downtime even further.
 
-Note that currently VFIO migration is supported only for a single device. This
-is due to VFIO migration's lack of P2P support. However, P2P support is planned
-to be added later on.
+To support migration of multiple devices that are doing P2P transactions
+between themselves, VFIO migration uAPI defines an intermediate P2P quiescent
+state. While in the P2P quiescent state, P2P DMA transactions cannot be
+initiated by the device, but the device can respond to incoming ones.
+Additionally, all outstanding P2P transactions are guaranteed to have been
+completed by the time the device enters this state.
+
+All the devices that support P2P migration are first transitioned to the P2P
+quiescent state and only then are they stopped or started. This makes migration
+safe P2P-wise, since starting and stopping the devices is not done atomically
+for all the devices together.
+
+Thus, multiple VFIO devices migration is allowed only if all the devices
+support P2P migration. Single VFIO device migration is allowed regardless of
+P2P migration support.
 
 A detailed description of the UAPI for VFIO device migration can be found in
 the comment for the ``vfio_device_mig_state`` structure in the header file
@@ -132,54 +144,63 @@ will be blocked.
 Flow of state changes during Live migration
 ===========================================
 
-Below is the flow of state change during live migration.
+Below is the state change flow during live migration for a VFIO device that
+supports both precopy and P2P migration. The flow for devices that don't
+support it is similar, except that the relevant states for precopy and P2P are
+skipped.
 The values in the parentheses represent the VM state, the migration state, and
 the VFIO device state, respectively.
-The text in the square brackets represents the flow if the VFIO device supports
-pre-copy.
 
 Live migration save path
 ------------------------
 
 ::
 
-                        QEMU normal running state
-                        (RUNNING, _NONE, _RUNNING)
-                                  |
+                           QEMU normal running state
+                           (RUNNING, _NONE, _RUNNING)
+                                      |
                      migrate_init spawns migration_thread
-                Migration thread then calls each device's .save_setup()
-                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
-                                  |
-                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
-      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
-          If total pending_bytes >= threshold_size, call .save_live_iterate()
-                  [Data of VFIO device for pre-copy phase is copied]
-        Iterate till total pending bytes converge and are less than threshold
-                                  |
-  On migration completion, vCPU stops and calls .save_live_complete_precopy for
-  each active device. The VFIO device is then transitioned into _STOP_COPY state
-                  (FINISH_MIGRATE, _DEVICE, _STOP_COPY)
-                                  |
-     For the VFIO device, iterate in .save_live_complete_precopy until
-                         pending data is 0
-                   (FINISH_MIGRATE, _DEVICE, _STOP)
-                                  |
-                 (FINISH_MIGRATE, _COMPLETED, _STOP)
-             Migraton thread schedules cleanup bottom half and exits
+            Migration thread then calls each device's .save_setup()
+                          (RUNNING, _SETUP, _PRE_COPY)
+                                      |
+                         (RUNNING, _ACTIVE, _PRE_COPY)
+  If device is active, get pending_bytes by .state_pending_{estimate,exact}()
+       If total pending_bytes >= threshold_size, call .save_live_iterate()
+                Data of VFIO device for pre-copy phase is copied
+      Iterate till total pending bytes converge and are less than threshold
+                                      |
+       On migration completion, the vCPUs and the VFIO device are stopped
+              The VFIO device is first put in P2P quiescent state
+                    (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
+                                      |
+                Then the VFIO device is put in _STOP_COPY state
+                     (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
+         .save_live_complete_precopy() is called for each active device
+      For the VFIO device, iterate in .save_live_complete_precopy() until
+                               pending data is 0
+                                      |
+                     (POSTMIGRATE, _COMPLETED, _STOP_COPY)
+            Migration thread schedules cleanup bottom half and exits
+                                      |
+                           .save_cleanup() is called
+                        (POSTMIGRATE, _COMPLETED, _STOP)
 
 Live migration resume path
 --------------------------
 
 ::
 
-              Incoming migration calls .load_setup for each device
-                       (RESTORE_VM, _ACTIVE, _STOP)
-                                 |
-       For each device, .load_state is called for that device section data
-                       (RESTORE_VM, _ACTIVE, _RESUMING)
-                                 |
-    At the end, .load_cleanup is called for each device and vCPUs are started
-                       (RUNNING, _NONE, _RUNNING)
+             Incoming migration calls .load_setup() for each device
+                          (RESTORE_VM, _ACTIVE, _STOP)
+                                      |
+     For each device, .load_state() is called for that device section data
+                        (RESTORE_VM, _ACTIVE, _RESUMING)
+                                      |
+  At the end, .load_cleanup() is called for each device and vCPUs are started
+              The VFIO device is first put in P2P quiescent state
+                        (RUNNING, _ACTIVE, _RUNNING_P2P)
+                                      |
+                           (RUNNING, _NONE, _RUNNING)
 
 Postcopy
 ========
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 16cf79a76c..7c3d636025 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -441,14 +441,16 @@ bool vfio_device_state_is_running(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
 
-    return migration->device_state == VFIO_DEVICE_STATE_RUNNING;
+    return migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+           migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P;
 }
 
 bool vfio_device_state_is_precopy(VFIODevice *vbasedev)
 {
     VFIOMigration *migration = vbasedev->migration;
 
-    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
+    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
+           migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P;
 }
 
 static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 48f9c23cbe..02ee99c938 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -71,8 +71,13 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
         return "STOP_COPY";
     case VFIO_DEVICE_STATE_RESUMING:
         return "RESUMING";
+    case VFIO_DEVICE_STATE_RUNNING_P2P:
+        return "RUNNING_P2P";
     case VFIO_DEVICE_STATE_PRE_COPY:
         return "PRE_COPY";
+    case VFIO_DEVICE_STATE_PRE_COPY_P2P:
+        return "PRE_COPY_P2P";
+
     default:
         return "UNKNOWN STATE";
     }
@@ -652,20 +657,31 @@ static const SaveVMHandlers savevm_vfio_handlers = {
 
 /* ---------------------------------------------------------------------- */
 
-static void vfio_vmstate_change(void *opaque, bool running, RunState state)
+static void vfio_vmstate_change(VFIODevice *vbasedev, bool running,
+                                RunState state, bool pre_state_change)
 {
-    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     enum vfio_device_mig_state new_state;
     int ret;
 
-    if (running) {
-        new_state = VFIO_DEVICE_STATE_RUNNING;
+    if (pre_state_change && !(migration->mig_flags & VFIO_MIGRATION_P2P)) {
+        return;
+    }
+
+    if (pre_state_change) {
+        new_state = migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ?
+                        VFIO_DEVICE_STATE_PRE_COPY_P2P :
+                        VFIO_DEVICE_STATE_RUNNING_P2P;
     } else {
-        new_state =
-            (vfio_device_state_is_precopy(vbasedev) &&
-             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
-                VFIO_DEVICE_STATE_STOP_COPY :
-                VFIO_DEVICE_STATE_STOP;
+        if (running) {
+            new_state = VFIO_DEVICE_STATE_RUNNING;
+        } else {
+            new_state = (vfio_device_state_is_precopy(vbasedev) &&
+                         (state == RUN_STATE_FINISH_MIGRATE ||
+                          state == RUN_STATE_PAUSED)) ?
+                            VFIO_DEVICE_STATE_STOP_COPY :
+                            VFIO_DEVICE_STATE_STOP;
+        }
     }
 
     /*
@@ -685,7 +701,23 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
     }
 
     trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
-                              mig_state_to_str(new_state));
+                              pre_state_change, mig_state_to_str(new_state));
+}
+
+static void vfio_vmstate_pre_change_handler(void *opaque, bool running,
+                                            RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    vfio_vmstate_change(vbasedev, running, state, true);
+}
+
+static void vfio_vmstate_change_handler(void *opaque, bool running,
+                                        RunState state)
+{
+    VFIODevice *vbasedev = opaque;
+
+    vfio_vmstate_change(vbasedev, running, state, false);
 }
 
 static void vfio_migration_state_notifier(Notifier *notifier, void *data)
@@ -798,9 +830,9 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
                          vbasedev);
 
-    migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
-                                                           vfio_vmstate_change,
-                                                           vbasedev);
+    migration->vm_state = qdev_add_vm_change_state_handler_full(
+        vbasedev->dev, vfio_vmstate_change_handler,
+        vfio_vmstate_pre_change_handler, vbasedev);
     migration->migration_state.notify = vfio_migration_state_notifier;
     add_migration_state_change_notifier(&migration->migration_state);
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index ee7509e68e..efafe613f2 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -166,4 +166,4 @@ vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
 vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
-vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
+vfio_vmstate_change(const char *name, int running, const char *reason, bool pre_state_change, const char *dev_state) " (%s) running %d reason %s pre state change %d device state %s"
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH for-8.2 6/6] vfio/migration: Allow migration of multiple P2P supporting devices
  2023-07-16  8:15 [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
                   ` (4 preceding siblings ...)
  2023-07-16  8:15 ` [PATCH for-8.2 5/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
@ 2023-07-16  8:15 ` Avihai Horon
  2023-07-21 12:09   ` Cédric Le Goater
  2023-07-18 15:46 ` [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Jason Gunthorpe
  6 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-07-16  8:15 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Paolo Bonzini,
	Joao Martins, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta, Avihai Horon

Now that P2P support has been added to VFIO migration, allow migration
of multiple devices if all of them support P2P migration.

Single device migration is allowed regardless of P2P migration support.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/vfio/common.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7c3d636025..753b320739 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -363,21 +363,31 @@ bool vfio_mig_active(void)
 
 static Error *multiple_devices_migration_blocker;
 
-static unsigned int vfio_migratable_device_num(void)
+/*
+ * Multiple devices migration is allowed only if all devices support P2P
+ * migration. Single device migration is allowed regardless of P2P migration
+ * support.
+ */
+static bool vfio_should_block_multiple_devices_migration(void)
 {
     VFIOGroup *group;
     VFIODevice *vbasedev;
     unsigned int device_num = 0;
+    bool all_support_p2p = true;
 
     QLIST_FOREACH(group, &vfio_group_list, next) {
         QLIST_FOREACH(vbasedev, &group->device_list, next) {
             if (vbasedev->migration) {
                 device_num++;
+
+                if (!(vbasedev->migration->mig_flags & VFIO_MIGRATION_P2P)) {
+                    all_support_p2p = false;
+                }
             }
         }
     }
 
-    return device_num;
+    return !all_support_p2p && device_num > 1;
 }
 
 int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
@@ -385,19 +395,19 @@ int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
     int ret;
 
     if (multiple_devices_migration_blocker ||
-        vfio_migratable_device_num() <= 1) {
+        !vfio_should_block_multiple_devices_migration()) {
         return 0;
     }
 
     if (vbasedev->enable_migration == ON_OFF_AUTO_ON) {
-        error_setg(errp, "Migration is currently not supported with multiple "
-                         "VFIO devices");
+        error_setg(errp, "Multiple VFIO devices migration is supported only if "
+                         "all of them support P2P migration");
         return -EINVAL;
     }
 
     error_setg(&multiple_devices_migration_blocker,
-               "Migration is currently not supported with multiple "
-               "VFIO devices");
+               "Multiple VFIO devices migration is supported only if all of "
+               "them support P2P migration");
     ret = migrate_add_blocker(multiple_devices_migration_blocker, errp);
     if (ret < 0) {
         error_free(multiple_devices_migration_blocker);
@@ -410,7 +420,7 @@ int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
 void vfio_unblock_multiple_devices_migration(void)
 {
     if (!multiple_devices_migration_blocker ||
-        vfio_migratable_device_num() > 1) {
+        vfio_should_block_multiple_devices_migration()) {
         return;
     }
 
-- 
2.26.3




* Re: [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration
  2023-07-16  8:15 [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
                   ` (5 preceding siblings ...)
  2023-07-16  8:15 ` [PATCH for-8.2 6/6] vfio/migration: Allow migration of multiple P2P supporting devices Avihai Horon
@ 2023-07-18 15:46 ` Jason Gunthorpe
  2023-07-30  6:50   ` Avihai Horon
  6 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2023-07-18 15:46 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Joao Martins, Yishai Hadas, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta

On Sun, Jul 16, 2023 at 11:15:35AM +0300, Avihai Horon wrote:
> Hi all,
> 
> The first patch in this series adds a small optimization to VFIO
> migration by moving the STOP_COPY->STOP transition to
> vfio_save_cleanup(). Testing with a ConnectX-7 VFIO device showed that
> this can reduce downtime by up to 6%.
> 
> The rest of the series adds P2P support for VFIO migration.
> 
> VFIO migration uAPI defines an optional intermediate P2P quiescent
> state. While in the P2P quiescent state, P2P DMA transactions cannot be
> initiated by the device, but the device can respond to incoming ones.
> Additionally, all outstanding P2P transactions are guaranteed to have
> been completed by the time the device enters this state.

For clarity it is "all outstanding DMA transactions are guaranteed to
have been completed" - the device has no idea if something is P2P or
not.

> The purpose of this state is to support migration of multiple devices
> that are doing P2P transactions between themselves.

s/are/might/

Jason



* Re: [PATCH for-8.2 4/6] vfio/migration: Refactor PRE_COPY and RUNNING state checks
  2023-07-16  8:15 ` [PATCH for-8.2 4/6] vfio/migration: Refactor PRE_COPY and RUNNING state checks Avihai Horon
@ 2023-07-21 11:33   ` Cédric Le Goater
  0 siblings, 0 replies; 20+ messages in thread
From: Cédric Le Goater @ 2023-07-21 11:33 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Joao Martins, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On 7/16/23 10:15, Avihai Horon wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Move the PRE_COPY and RUNNING state checks to helper functions.
> 
> This is in preparation for adding P2P VFIO migration support, where
> these helpers will also test for PRE_COPY_P2P and RUNNING_P2P states.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   include/hw/vfio/vfio-common.h |  2 ++
>   hw/vfio/common.c              | 22 ++++++++++++++++++----
>   hw/vfio/migration.c           | 10 ++++------
>   3 files changed, 24 insertions(+), 10 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index da43d27352..e9b8954595 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -230,6 +230,8 @@ void vfio_unblock_multiple_devices_migration(void);
>   bool vfio_viommu_preset(VFIODevice *vbasedev);
>   int64_t vfio_mig_bytes_transferred(void);
>   void vfio_reset_bytes_transferred(void);
> +bool vfio_device_state_is_running(VFIODevice *vbasedev);
> +bool vfio_device_state_is_precopy(VFIODevice *vbasedev);
>   
>   #ifdef CONFIG_LINUX
>   int vfio_get_region_info(VFIODevice *vbasedev, int index,
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 9aac21abb7..16cf79a76c 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -437,6 +437,20 @@ static void vfio_set_migration_error(int err)
>       }
>   }
>   
> +bool vfio_device_state_is_running(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    return migration->device_state == VFIO_DEVICE_STATE_RUNNING;
> +}
> +
> +bool vfio_device_state_is_precopy(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
> +}
> +
>   static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>   {
>       VFIOGroup *group;
> @@ -457,8 +471,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>               }
>   
>               if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
> -                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> -                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
> +                (vfio_device_state_is_running(vbasedev) ||
> +                 vfio_device_state_is_precopy(vbasedev))) {
>                   return false;
>               }
>           }
> @@ -503,8 +517,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>                   return false;
>               }
>   
> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> -                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> +            if (vfio_device_state_is_running(vbasedev) ||
> +                vfio_device_state_is_precopy(vbasedev)) {
>                   continue;
>               } else {
>                   return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 8acd182a8b..48f9c23cbe 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -411,7 +411,7 @@ static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
>       VFIODevice *vbasedev = opaque;
>       VFIOMigration *migration = vbasedev->migration;
>   
> -    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
> +    if (!vfio_device_state_is_precopy(vbasedev)) {
>           return;
>       }
>   
> @@ -444,7 +444,7 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>       *must_precopy += stop_copy_size;
>   
> -    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> +    if (vfio_device_state_is_precopy(vbasedev)) {
>           vfio_query_precopy_size(migration);
>   
>           *must_precopy +=
> @@ -459,9 +459,8 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>   static bool vfio_is_active_iterate(void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
> -    VFIOMigration *migration = vbasedev->migration;
>   
> -    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
> +    return vfio_device_state_is_precopy(vbasedev);
>   }
>   
>   static int vfio_save_iterate(QEMUFile *f, void *opaque)
> @@ -656,7 +655,6 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>   static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>   {
>       VFIODevice *vbasedev = opaque;
> -    VFIOMigration *migration = vbasedev->migration;
>       enum vfio_device_mig_state new_state;
>       int ret;
>   
> @@ -664,7 +662,7 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>           new_state = VFIO_DEVICE_STATE_RUNNING;
>       } else {
>           new_state =
> -            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
> +            (vfio_device_state_is_precopy(vbasedev) &&
>                (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
>                   VFIO_DEVICE_STATE_STOP_COPY :
>                   VFIO_DEVICE_STATE_STOP;




* Re: [PATCH for-8.2 5/6] vfio/migration: Add P2P support for VFIO migration
  2023-07-16  8:15 ` [PATCH for-8.2 5/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
@ 2023-07-21 11:48   ` Cédric Le Goater
  2023-07-23 10:37     ` Avihai Horon
  0 siblings, 1 reply; 20+ messages in thread
From: Cédric Le Goater @ 2023-07-21 11:48 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Joao Martins, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On 7/16/23 10:15, Avihai Horon wrote:
> VFIO migration uAPI defines an optional intermediate P2P quiescent
> state. While in the P2P quiescent state, P2P DMA transactions cannot be
> initiated by the device, but the device can respond to incoming ones.
> Additionally, all outstanding P2P transactions are guaranteed to have
> been completed by the time the device enters this state.
> 
> The purpose of this state is to support migration of multiple devices
> that are doing P2P transactions between themselves.
> 
> Add support for P2P migration by transitioning all the devices to the
> P2P quiescent state before stopping or starting the devices. Use the new
> VMChangeStateHandler pre_change_cb to achieve that behavior.
> 
> This will allow migration of multiple VFIO devices if all of them
> support P2P migration.

LGTM, one small comment below
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>   docs/devel/vfio-migration.rst | 93 +++++++++++++++++++++--------------
>   hw/vfio/common.c              |  6 ++-
>   hw/vfio/migration.c           | 58 +++++++++++++++++-----
>   hw/vfio/trace-events          |  2 +-
>   4 files changed, 107 insertions(+), 52 deletions(-)
> 
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index b433cb5bb2..b9c57ba651 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -23,9 +23,21 @@ and recommends that the initial bytes are sent and loaded in the destination
>   before stopping the source VM. Enabling this migration capability will
>   guarantee that and thus, can potentially reduce downtime even further.
>   
> -Note that currently VFIO migration is supported only for a single device. This
> -is due to VFIO migration's lack of P2P support. However, P2P support is planned
> -to be added later on.
> +To support migration of multiple devices that are doing P2P transactions
> +between themselves, VFIO migration uAPI defines an intermediate P2P quiescent
> +state. While in the P2P quiescent state, P2P DMA transactions cannot be
> +initiated by the device, but the device can respond to incoming ones.
> +Additionally, all outstanding P2P transactions are guaranteed to have been
> +completed by the time the device enters this state.
> +
> +All the devices that support P2P migration are first transitioned to the P2P
> +quiescent state and only then are they stopped or started. This makes migration
> +safe P2P-wise, since starting and stopping the devices is not done atomically
> +for all the devices together.
> +
> +Thus, multiple VFIO devices migration is allowed only if all the devices
> +support P2P migration. Single VFIO device migration is allowed regardless of
> +P2P migration support.
>   
>   A detailed description of the UAPI for VFIO device migration can be found in
>   the comment for the ``vfio_device_mig_state`` structure in the header file
> @@ -132,54 +144,63 @@ will be blocked.
>   Flow of state changes during Live migration
>   ===========================================
>   
> -Below is the flow of state change during live migration.
> +Below is the state change flow during live migration for a VFIO device that
> +supports both precopy and P2P migration. The flow for devices that don't
> +support it is similar, except that the relevant states for precopy and P2P are
> +skipped.
>   The values in the parentheses represent the VM state, the migration state, and
>   the VFIO device state, respectively.
> -The text in the square brackets represents the flow if the VFIO device supports
> -pre-copy.
>   
>   Live migration save path
>   ------------------------
>   
>   ::
>   
> -                        QEMU normal running state
> -                        (RUNNING, _NONE, _RUNNING)
> -                                  |
> +                           QEMU normal running state
> +                           (RUNNING, _NONE, _RUNNING)
> +                                      |
>                        migrate_init spawns migration_thread
> -                Migration thread then calls each device's .save_setup()
> -                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
> -                                  |
> -                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
> -      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
> -          If total pending_bytes >= threshold_size, call .save_live_iterate()
> -                  [Data of VFIO device for pre-copy phase is copied]
> -        Iterate till total pending bytes converge and are less than threshold
> -                                  |
> -  On migration completion, vCPU stops and calls .save_live_complete_precopy for
> -  each active device. The VFIO device is then transitioned into _STOP_COPY state
> -                  (FINISH_MIGRATE, _DEVICE, _STOP_COPY)
> -                                  |
> -     For the VFIO device, iterate in .save_live_complete_precopy until
> -                         pending data is 0
> -                   (FINISH_MIGRATE, _DEVICE, _STOP)
> -                                  |
> -                 (FINISH_MIGRATE, _COMPLETED, _STOP)
> -             Migraton thread schedules cleanup bottom half and exits
> +            Migration thread then calls each device's .save_setup()
> +                          (RUNNING, _SETUP, _PRE_COPY)
> +                                      |
> +                         (RUNNING, _ACTIVE, _PRE_COPY)
> +  If device is active, get pending_bytes by .state_pending_{estimate,exact}()
> +       If total pending_bytes >= threshold_size, call .save_live_iterate()
> +                Data of VFIO device for pre-copy phase is copied
> +      Iterate till total pending bytes converge and are less than threshold
> +                                      |
> +       On migration completion, the vCPUs and the VFIO device are stopped
> +              The VFIO device is first put in P2P quiescent state
> +                    (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
> +                                      |
> +                Then the VFIO device is put in _STOP_COPY state
> +                     (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
> +         .save_live_complete_precopy() is called for each active device
> +      For the VFIO device, iterate in .save_live_complete_precopy() until
> +                               pending data is 0
> +                                      |
> +                     (POSTMIGRATE, _COMPLETED, _STOP_COPY)
> +            Migration thread schedules cleanup bottom half and exits
> +                                      |
> +                           .save_cleanup() is called
> +                        (POSTMIGRATE, _COMPLETED, _STOP)
>   
>   Live migration resume path
>   --------------------------
>   
>   ::
>   
> -              Incoming migration calls .load_setup for each device
> -                       (RESTORE_VM, _ACTIVE, _STOP)
> -                                 |
> -       For each device, .load_state is called for that device section data
> -                       (RESTORE_VM, _ACTIVE, _RESUMING)
> -                                 |
> -    At the end, .load_cleanup is called for each device and vCPUs are started
> -                       (RUNNING, _NONE, _RUNNING)
> +             Incoming migration calls .load_setup() for each device
> +                          (RESTORE_VM, _ACTIVE, _STOP)
> +                                      |
> +     For each device, .load_state() is called for that device section data
> +                        (RESTORE_VM, _ACTIVE, _RESUMING)
> +                                      |
> +  At the end, .load_cleanup() is called for each device and vCPUs are started
> +              The VFIO device is first put in P2P quiescent state
> +                        (RUNNING, _ACTIVE, _RUNNING_P2P)
> +                                      |
> +                           (RUNNING, _NONE, _RUNNING)
>   
>   Postcopy
>   ========
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 16cf79a76c..7c3d636025 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -441,14 +441,16 @@ bool vfio_device_state_is_running(VFIODevice *vbasedev)
>   {
>       VFIOMigration *migration = vbasedev->migration;
>   
> -    return migration->device_state == VFIO_DEVICE_STATE_RUNNING;
> +    return migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +           migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P;
>   }
>   
>   bool vfio_device_state_is_precopy(VFIODevice *vbasedev)
>   {
>       VFIOMigration *migration = vbasedev->migration;
>   
> -    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
> +           migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P;
>   }
>   
>   static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 48f9c23cbe..02ee99c938 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -71,8 +71,13 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>           return "STOP_COPY";
>       case VFIO_DEVICE_STATE_RESUMING:
>           return "RESUMING";
> +    case VFIO_DEVICE_STATE_RUNNING_P2P:
> +        return "RUNNING_P2P";
>       case VFIO_DEVICE_STATE_PRE_COPY:
>           return "PRE_COPY";
> +    case VFIO_DEVICE_STATE_PRE_COPY_P2P:
> +        return "PRE_COPY_P2P";
> +
>       default:
>           return "UNKNOWN STATE";
>       }
> @@ -652,20 +657,31 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>   
>   /* ---------------------------------------------------------------------- */
>   
> -static void vfio_vmstate_change(void *opaque, bool running, RunState state)
> +static void vfio_vmstate_change(VFIODevice *vbasedev, bool running,
> +                                RunState state, bool pre_state_change)

Instead of mixing both vmstate change handlers in one routine, I would prefer
a new pre_state handler to be introduced.

Thanks,

C.


>   {
> -    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>       enum vfio_device_mig_state new_state;
>       int ret;
>   
> -    if (running) {
> -        new_state = VFIO_DEVICE_STATE_RUNNING;
> +    if (pre_state_change && !(migration->mig_flags & VFIO_MIGRATION_P2P)) {
> +        return;
> +    }
> +
> +    if (pre_state_change) {
> +        new_state = migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ?
> +                        VFIO_DEVICE_STATE_PRE_COPY_P2P :
> +                        VFIO_DEVICE_STATE_RUNNING_P2P;
>       } else {
> -        new_state =
> -            (vfio_device_state_is_precopy(vbasedev) &&
> -             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
> -                VFIO_DEVICE_STATE_STOP_COPY :
> -                VFIO_DEVICE_STATE_STOP;
> +        if (running) {
> +            new_state = VFIO_DEVICE_STATE_RUNNING;
> +        } else {
> +            new_state = (vfio_device_state_is_precopy(vbasedev) &&
> +                         (state == RUN_STATE_FINISH_MIGRATE ||
> +                          state == RUN_STATE_PAUSED)) ?
> +                            VFIO_DEVICE_STATE_STOP_COPY :
> +                            VFIO_DEVICE_STATE_STOP;
> +        }
>       }
>   
>       /*
> @@ -685,7 +701,23 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>       }
>   
>       trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
> -                              mig_state_to_str(new_state));
> +                              pre_state_change, mig_state_to_str(new_state));
> +}
> +
> +static void vfio_vmstate_pre_change_handler(void *opaque, bool running,
> +                                            RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_vmstate_change(vbasedev, running, state, true);
> +}
> +
> +static void vfio_vmstate_change_handler(void *opaque, bool running,
> +                                        RunState state)
> +{
> +    VFIODevice *vbasedev = opaque;
> +
> +    vfio_vmstate_change(vbasedev, running, state, false);
>   }
>   
>   static void vfio_migration_state_notifier(Notifier *notifier, void *data)
> @@ -798,9 +830,9 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>       register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
>                            vbasedev);
>   
> -    migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
> -                                                           vfio_vmstate_change,
> -                                                           vbasedev);
> +    migration->vm_state = qdev_add_vm_change_state_handler_full(
> +        vbasedev->dev, vfio_vmstate_change_handler,
> +        vfio_vmstate_pre_change_handler, vbasedev);
>       migration->migration_state.notify = vfio_migration_state_notifier;
>       add_migration_state_change_notifier(&migration->migration_state);
>   
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index ee7509e68e..efafe613f2 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -166,4 +166,4 @@ vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy
>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
>   vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>   vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> -vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
> +vfio_vmstate_change(const char *name, int running, const char *reason, bool pre_state_change, const char *dev_state) " (%s) running %d reason %s pre state change %d device state %s"




* Re: [PATCH for-8.2 6/6] vfio/migration: Allow migration of multiple P2P supporting devices
  2023-07-16  8:15 ` [PATCH for-8.2 6/6] vfio/migration: Allow migration of multiple P2P supporting devices Avihai Horon
@ 2023-07-21 12:09   ` Cédric Le Goater
  2023-07-23 10:42     ` Avihai Horon
  0 siblings, 1 reply; 20+ messages in thread
From: Cédric Le Goater @ 2023-07-21 12:09 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Joao Martins, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On 7/16/23 10:15, Avihai Horon wrote:
> Now that P2P support has been added to VFIO migration, allow migration
> of multiple devices if all of them support P2P migration.
> 
> Single device migration is allowed regardless of P2P migration support.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   hw/vfio/common.c | 26 ++++++++++++++++++--------
>   1 file changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 7c3d636025..753b320739 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -363,21 +363,31 @@ bool vfio_mig_active(void)
>   
>   static Error *multiple_devices_migration_blocker;
>   
> -static unsigned int vfio_migratable_device_num(void)
> +/*
> + * Multiple devices migration is allowed only if all devices support P2P
> + * migration. Single device migration is allowed regardless of P2P migration
> + * support.
> + */
> +static bool vfio_should_block_multiple_devices_migration(void)

Could we invert the logic and call the routine:

   vfio_multiple_devices_migration_is_supported()

I think it would be clearer in the callers.  This is minor.

Thanks,

C.

>   {
>       VFIOGroup *group;
>       VFIODevice *vbasedev;
>       unsigned int device_num = 0;
> +    bool all_support_p2p = true;
>   
>       QLIST_FOREACH(group, &vfio_group_list, next) {
>           QLIST_FOREACH(vbasedev, &group->device_list, next) {
>               if (vbasedev->migration) {
>                   device_num++;
> +
> +                if (!(vbasedev->migration->mig_flags & VFIO_MIGRATION_P2P)) {
> +                    all_support_p2p = false;
> +                }
>               }
>           }
>       }
>   
> -    return device_num;
> +    return !all_support_p2p && device_num > 1;
>   }
>   
>   int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
> @@ -385,19 +395,19 @@ int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
>       int ret;
>   
>       if (multiple_devices_migration_blocker ||
> -        vfio_migratable_device_num() <= 1) {
> +        !vfio_should_block_multiple_devices_migration()) {
>           return 0;
>       }
>   
>       if (vbasedev->enable_migration == ON_OFF_AUTO_ON) {
> -        error_setg(errp, "Migration is currently not supported with multiple "
> -                         "VFIO devices");
> +        error_setg(errp, "Multiple VFIO devices migration is supported only if "
> +                         "all of them support P2P migration");
>           return -EINVAL;
>       }
>   
>       error_setg(&multiple_devices_migration_blocker,
> -               "Migration is currently not supported with multiple "
> -               "VFIO devices");
> +               "Multiple VFIO devices migration is supported only if all of "
> +               "them support P2P migration");
>       ret = migrate_add_blocker(multiple_devices_migration_blocker, errp);
>       if (ret < 0) {
>           error_free(multiple_devices_migration_blocker);
> @@ -410,7 +420,7 @@ int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
>   void vfio_unblock_multiple_devices_migration(void)
>   {
>       if (!multiple_devices_migration_blocker ||
> -        vfio_migratable_device_num() > 1) {
> +        vfio_should_block_multiple_devices_migration()) {
>           return;
>       }
>   



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH for-8.2 5/6] vfio/migration: Add P2P support for VFIO migration
  2023-07-21 11:48   ` Cédric Le Goater
@ 2023-07-23 10:37     ` Avihai Horon
  0 siblings, 0 replies; 20+ messages in thread
From: Avihai Horon @ 2023-07-23 10:37 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Joao Martins, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta


On 21/07/2023 14:48, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> On 7/16/23 10:15, Avihai Horon wrote:
>> VFIO migration uAPI defines an optional intermediate P2P quiescent
>> state. While in the P2P quiescent state, P2P DMA transactions cannot be
>> initiated by the device, but the device can respond to incoming ones.
>> Additionally, all outstanding P2P transactions are guaranteed to have
>> been completed by the time the device enters this state.
>>
>> The purpose of this state is to support migration of multiple devices
>> that are doing P2P transactions between themselves.
>>
>> Add support for P2P migration by transitioning all the devices to the
>> P2P quiescent state before stopping or starting the devices. Use the new
>> VMChangeStateHandler pre_change_cb to achieve that behavior.
>>
>> This will allow migration of multiple VFIO devices if all of them
>> support P2P migration.
>
> LGTM, one small comment below
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   docs/devel/vfio-migration.rst | 93 +++++++++++++++++++++--------------
>>   hw/vfio/common.c              |  6 ++-
>>   hw/vfio/migration.c           | 58 +++++++++++++++++-----
>>   hw/vfio/trace-events          |  2 +-
>>   4 files changed, 107 insertions(+), 52 deletions(-)
>>
>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>> index b433cb5bb2..b9c57ba651 100644
>> --- a/docs/devel/vfio-migration.rst
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -23,9 +23,21 @@ and recommends that the initial bytes are sent and loaded in the destination
>>   before stopping the source VM. Enabling this migration capability will
>>   guarantee that and thus, can potentially reduce downtime even further.
>>
>> -Note that currently VFIO migration is supported only for a single device. This
>> -is due to VFIO migration's lack of P2P support. However, P2P support is planned
>> -to be added later on.
>> +To support migration of multiple devices that are doing P2P transactions
>> +between themselves, VFIO migration uAPI defines an intermediate P2P quiescent
>> +state. While in the P2P quiescent state, P2P DMA transactions cannot be
>> +initiated by the device, but the device can respond to incoming ones.
>> +Additionally, all outstanding P2P transactions are guaranteed to have been
>> +completed by the time the device enters this state.
>> +
>> +All the devices that support P2P migration are first transitioned to the P2P
>> +quiescent state and only then are they stopped or started. This makes migration
>> +safe P2P-wise, since starting and stopping the devices is not done atomically
>> +for all the devices together.
>> +
>> +Thus, multiple VFIO devices migration is allowed only if all the devices
>> +support P2P migration. Single VFIO device migration is allowed regardless of
>> +P2P migration support.
>>
>>   A detailed description of the UAPI for VFIO device migration can be found in
>>   the comment for the ``vfio_device_mig_state`` structure in the header file
>> @@ -132,54 +144,63 @@ will be blocked.
>>   Flow of state changes during Live migration
>>   ===========================================
>>
>> -Below is the flow of state change during live migration.
>> +Below is the state change flow during live migration for a VFIO device that
>> +supports both precopy and P2P migration. The flow for devices that don't
>> +support it is similar, except that the relevant states for precopy and P2P are
>> +skipped.
>>   The values in the parentheses represent the VM state, the migration state, and
>>   the VFIO device state, respectively.
>> -The text in the square brackets represents the flow if the VFIO device supports
>> -pre-copy.
>>
>>   Live migration save path
>>   ------------------------
>>
>>   ::
>>
>> -                        QEMU normal running state
>> -                        (RUNNING, _NONE, _RUNNING)
>> -                                  |
>> +                           QEMU normal running state
>> +                           (RUNNING, _NONE, _RUNNING)
>> +                                      |
>>                        migrate_init spawns migration_thread
>> -                Migration thread then calls each device's .save_setup()
>> -                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>> -                                  |
>> -                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>> -      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>> -          If total pending_bytes >= threshold_size, call .save_live_iterate()
>> -                  [Data of VFIO device for pre-copy phase is copied]
>> -        Iterate till total pending bytes converge and are less than threshold
>> -                                  |
>> -  On migration completion, vCPU stops and calls .save_live_complete_precopy for
>> -  each active device. The VFIO device is then transitioned into _STOP_COPY state
>> -                  (FINISH_MIGRATE, _DEVICE, _STOP_COPY)
>> -                                  |
>> -     For the VFIO device, iterate in .save_live_complete_precopy until
>> -                         pending data is 0
>> -                   (FINISH_MIGRATE, _DEVICE, _STOP)
>> -                                  |
>> -                 (FINISH_MIGRATE, _COMPLETED, _STOP)
>> -             Migration thread schedules cleanup bottom half and exits
>> +            Migration thread then calls each device's .save_setup()
>> +                          (RUNNING, _SETUP, _PRE_COPY)
>> +                                      |
>> +                         (RUNNING, _ACTIVE, _PRE_COPY)
>> +  If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>> +       If total pending_bytes >= threshold_size, call .save_live_iterate()
>> +                Data of VFIO device for pre-copy phase is copied
>> +      Iterate till total pending bytes converge and are less than threshold
>> +                                      |
>> +       On migration completion, the vCPUs and the VFIO device are stopped
>> +              The VFIO device is first put in P2P quiescent state
>> +                    (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
>> +                                      |
>> +                Then the VFIO device is put in _STOP_COPY state
>> +                     (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
>> +         .save_live_complete_precopy() is called for each active device
>> +      For the VFIO device, iterate in .save_live_complete_precopy() until
>> +                               pending data is 0
>> +                                      |
>> +                     (POSTMIGRATE, _COMPLETED, _STOP_COPY)
>> +            Migration thread schedules cleanup bottom half and exits
>> +                                      |
>> +                           .save_cleanup() is called
>> +                        (POSTMIGRATE, _COMPLETED, _STOP)
>>
>>   Live migration resume path
>>   --------------------------
>>
>>   ::
>>
>> -              Incoming migration calls .load_setup for each device
>> -                       (RESTORE_VM, _ACTIVE, _STOP)
>> -                                 |
>> -       For each device, .load_state is called for that device section data
>> -                       (RESTORE_VM, _ACTIVE, _RESUMING)
>> -                                 |
>> -    At the end, .load_cleanup is called for each device and vCPUs are started
>> -                       (RUNNING, _NONE, _RUNNING)
>> +             Incoming migration calls .load_setup() for each device
>> +                          (RESTORE_VM, _ACTIVE, _STOP)
>> +                                      |
>> +     For each device, .load_state() is called for that device section data
>> +                        (RESTORE_VM, _ACTIVE, _RESUMING)
>> +                                      |
>> +  At the end, .load_cleanup() is called for each device and vCPUs are started
>> +              The VFIO device is first put in P2P quiescent state
>> +                        (RUNNING, _ACTIVE, _RUNNING_P2P)
>> +                                      |
>> +                           (RUNNING, _NONE, _RUNNING)
>>
>>   Postcopy
>>   ========
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 16cf79a76c..7c3d636025 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -441,14 +441,16 @@ bool vfio_device_state_is_running(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>>
>> -    return migration->device_state == VFIO_DEVICE_STATE_RUNNING;
>> +    return migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +           migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P;
>>   }
>>
>>   bool vfio_device_state_is_precopy(VFIODevice *vbasedev)
>>   {
>>       VFIOMigration *migration = vbasedev->migration;
>>
>> -    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
>> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ||
>> +           migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P;
>>   }
>>
>>   static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 48f9c23cbe..02ee99c938 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -71,8 +71,13 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>>           return "STOP_COPY";
>>       case VFIO_DEVICE_STATE_RESUMING:
>>           return "RESUMING";
>> +    case VFIO_DEVICE_STATE_RUNNING_P2P:
>> +        return "RUNNING_P2P";
>>       case VFIO_DEVICE_STATE_PRE_COPY:
>>           return "PRE_COPY";
>> +    case VFIO_DEVICE_STATE_PRE_COPY_P2P:
>> +        return "PRE_COPY_P2P";
>> +
>>       default:
>>           return "UNKNOWN STATE";
>>       }
>> @@ -652,20 +657,31 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>>
>>   /* ---------------------------------------------------------------------- */
>>
>> -static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>> +static void vfio_vmstate_change(VFIODevice *vbasedev, bool running,
>> +                                RunState state, bool pre_state_change)
>
> Instead of mixing both vmstate change handlers in one routine, I would 
> prefer
> a new pre_state handler to be introduced.

Sure, I will change that.

Thanks!

>
>>   {
>> -    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>>       enum vfio_device_mig_state new_state;
>>       int ret;
>>
>> -    if (running) {
>> -        new_state = VFIO_DEVICE_STATE_RUNNING;
>> +    if (pre_state_change && !(migration->mig_flags & VFIO_MIGRATION_P2P)) {
>> +        return;
>> +    }
>> +
>> +    if (pre_state_change) {
>> +        new_state = migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ?
>> +                        VFIO_DEVICE_STATE_PRE_COPY_P2P :
>> +                        VFIO_DEVICE_STATE_RUNNING_P2P;
>>       } else {
>> -        new_state =
>> -            (vfio_device_state_is_precopy(vbasedev) &&
>> -             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
>> -                VFIO_DEVICE_STATE_STOP_COPY :
>> -                VFIO_DEVICE_STATE_STOP;
>> +        if (running) {
>> +            new_state = VFIO_DEVICE_STATE_RUNNING;
>> +        } else {
>> +            new_state = (vfio_device_state_is_precopy(vbasedev) &&
>> +                         (state == RUN_STATE_FINISH_MIGRATE ||
>> +                          state == RUN_STATE_PAUSED)) ?
>> +                            VFIO_DEVICE_STATE_STOP_COPY :
>> +                            VFIO_DEVICE_STATE_STOP;
>> +        }
>>       }
>>
>>       /*
>> @@ -685,7 +701,23 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>>       }
>>
>>       trace_vfio_vmstate_change(vbasedev->name, running, RunState_str(state),
>> -                              mig_state_to_str(new_state));
>> +                              pre_state_change, mig_state_to_str(new_state));
>> +}
>> +
>> +static void vfio_vmstate_pre_change_handler(void *opaque, bool running,
>> +                                            RunState state)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    vfio_vmstate_change(vbasedev, running, state, true);
>> +}
>> +
>> +static void vfio_vmstate_change_handler(void *opaque, bool running,
>> +                                        RunState state)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +
>> +    vfio_vmstate_change(vbasedev, running, state, false);
>>   }
>>
>>   static void vfio_migration_state_notifier(Notifier *notifier, void *data)
>> @@ -798,9 +830,9 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>       register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
>>                            vbasedev);
>>
>> -    migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
>> -                                                           vfio_vmstate_change,
>> -                                                           vbasedev);
>> +    migration->vm_state = qdev_add_vm_change_state_handler_full(
>> +        vbasedev->dev, vfio_vmstate_change_handler,
>> +        vfio_vmstate_pre_change_handler, vbasedev);
>>       migration->migration_state.notify = vfio_migration_state_notifier;
>>       add_migration_state_change_notifier(&migration->migration_state);
>>
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index ee7509e68e..efafe613f2 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -166,4 +166,4 @@ vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy
>>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
>>   vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>   vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>> -vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>> +vfio_vmstate_change(const char *name, int running, const char *reason, bool pre_state_change, const char *dev_state) " (%s) running %d reason %s pre state change %d device state %s"
>
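Stepping back from the diff, the ordering this patch establishes can be modeled in a few lines (a standalone sketch with hypothetical names and simplified states, not QEMU code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified device states mirroring the uAPI names. */
typedef enum { DEV_RUNNING, DEV_RUNNING_P2P, DEV_STOP } DevState;

/*
 * Model of the stop side: phase one (the pre-change callbacks) quiesces
 * P2P on every device before phase two (the main callbacks) stops any of
 * them, so a still-running device can never initiate a P2P DMA toward a
 * device that has already been stopped.
 */
static void stop_all_devices(DevState *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {   /* phase 1: quiesce P2P */
        devs[i] = DEV_RUNNING_P2P;
    }
    for (size_t i = 0; i < n; i++) {   /* phase 2: stop */
        assert(devs[i] == DEV_RUNNING_P2P); /* invariant: already quiesced */
        devs[i] = DEV_STOP;
    }
}
```

The inner assert captures the property the series depends on: by the time any main callback stops a device, every device has already passed through the P2P quiescent state.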



* Re: [PATCH for-8.2 6/6] vfio/migration: Allow migration of multiple P2P supporting devices
  2023-07-21 12:09   ` Cédric Le Goater
@ 2023-07-23 10:42     ` Avihai Horon
  0 siblings, 0 replies; 20+ messages in thread
From: Avihai Horon @ 2023-07-23 10:42 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Joao Martins, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta


On 21/07/2023 15:09, Cédric Le Goater wrote:
>
> On 7/16/23 10:15, Avihai Horon wrote:
>> Now that P2P support has been added to VFIO migration, allow migration
>> of multiple devices if all of them support P2P migration.
>>
>> Single device migration is allowed regardless of P2P migration support.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   hw/vfio/common.c | 26 ++++++++++++++++++--------
>>   1 file changed, 18 insertions(+), 8 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 7c3d636025..753b320739 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -363,21 +363,31 @@ bool vfio_mig_active(void)
>>
>>   static Error *multiple_devices_migration_blocker;
>>
>> -static unsigned int vfio_migratable_device_num(void)
>> +/*
>> + * Multiple devices migration is allowed only if all devices support P2P
>> + * migration. Single device migration is allowed regardless of P2P migration
>> + * support.
>> + */
>> +static bool vfio_should_block_multiple_devices_migration(void)
>
> Could we revert the logic and call the routine :
>
>   vfio_multiple_devices_migration_is_supported()
>
> I think it would be clearer in the callers.  This is minor.
>
Yes, of course, I will change that.

Thanks!

>
>>   {
>>       VFIOGroup *group;
>>       VFIODevice *vbasedev;
>>       unsigned int device_num = 0;
>> +    bool all_support_p2p = true;
>>
>>       QLIST_FOREACH(group, &vfio_group_list, next) {
>>           QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>               if (vbasedev->migration) {
>>                   device_num++;
>> +
>> +                if (!(vbasedev->migration->mig_flags & VFIO_MIGRATION_P2P)) {
>> +                    all_support_p2p = false;
>> +                }
>>               }
>>           }
>>       }
>>
>> -    return device_num;
>> +    return !all_support_p2p && device_num > 1;
>>   }
>>
>>   int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
>> @@ -385,19 +395,19 @@ int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
>>       int ret;
>>
>>       if (multiple_devices_migration_blocker ||
>> -        vfio_migratable_device_num() <= 1) {
>> +        !vfio_should_block_multiple_devices_migration()) {
>>           return 0;
>>       }
>>
>>       if (vbasedev->enable_migration == ON_OFF_AUTO_ON) {
>> -        error_setg(errp, "Migration is currently not supported with multiple "
>> -                         "VFIO devices");
>> +        error_setg(errp, "Multiple VFIO devices migration is supported only if "
>> +                         "all of them support P2P migration");
>>           return -EINVAL;
>>       }
>>
>>       error_setg(&multiple_devices_migration_blocker,
>> -               "Migration is currently not supported with multiple "
>> -               "VFIO devices");
>> +               "Multiple VFIO devices migration is supported only if all of "
>> +               "them support P2P migration");
>>       ret = migrate_add_blocker(multiple_devices_migration_blocker, errp);
>>       if (ret < 0) {
>>           error_free(multiple_devices_migration_blocker);
>> @@ -410,7 +420,7 @@ int vfio_block_multiple_devices_migration(VFIODevice *vbasedev, Error **errp)
>>   void vfio_unblock_multiple_devices_migration(void)
>>   {
>>       if (!multiple_devices_migration_blocker ||
>> -        vfio_migratable_device_num() > 1) {
>> +        vfio_should_block_multiple_devices_migration()) {
>>           return;
>>       }
>>
>



* Re: [PATCH for-8.2 2/6] sysemu: Add pre VM state change callback
  2023-07-16  8:15 ` [PATCH for-8.2 2/6] sysemu: Add pre VM state change callback Avihai Horon
@ 2023-07-27 16:23   ` Cédric Le Goater
  2023-07-30  6:31     ` Avihai Horon
  0 siblings, 1 reply; 20+ messages in thread
From: Cédric Le Goater @ 2023-07-27 16:23 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Joao Martins, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On 7/16/23 10:15, Avihai Horon wrote:
> Add pre VM state change callback to struct VMChangeStateEntry.

This sentence could be the subject.

> 
> The pre VM state change callback is optional and can be set by the new
> function qemu_add_vm_change_state_handler_prio_full() that allows
> setting this callback in addition to the main callback.
> 
> The pre VM state change callbacks and main callbacks are called in two
> separate phases: First all pre VM state change callbacks are called and
> only then all main callbacks are called.
> 
> The purpose of the new pre VM state change callback is to allow all
> devices to run a preliminary task before calling the devices' main
> callbacks.
> 
> This will facilitate adding P2P support for VFIO migration where all
> VFIO devices need to be put in an intermediate P2P quiescent state
> before being stopped or started by the main VM state change callback.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>   include/sysemu/runstate.h |  4 ++++
>   softmmu/runstate.c        | 39 +++++++++++++++++++++++++++++++++++++++
>   2 files changed, 43 insertions(+)
> 
> diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
> index 7beb29c2e2..bb38a4b4bd 100644
> --- a/include/sysemu/runstate.h
> +++ b/include/sysemu/runstate.h
> @@ -16,6 +16,10 @@ VMChangeStateEntry *qemu_add_vm_change_state_handler(VMChangeStateHandler *cb,
>                                                        void *opaque);
>   VMChangeStateEntry *qemu_add_vm_change_state_handler_prio(
>           VMChangeStateHandler *cb, void *opaque, int priority);
> +VMChangeStateEntry *
> +qemu_add_vm_change_state_handler_prio_full(VMChangeStateHandler *cb,
> +                                           VMChangeStateHandler *pre_change_cb,
> +                                           void *opaque, int priority);
>   VMChangeStateEntry *qdev_add_vm_change_state_handler(DeviceState *dev,
>                                                        VMChangeStateHandler *cb,
>                                                        void *opaque);
> diff --git a/softmmu/runstate.c b/softmmu/runstate.c
> index f3bd862818..a1f0653899 100644
> --- a/softmmu/runstate.c
> +++ b/softmmu/runstate.c
> @@ -271,6 +271,7 @@ void qemu_system_vmstop_request(RunState state)
>   }
>   struct VMChangeStateEntry {
>       VMChangeStateHandler *cb;
> +    VMChangeStateHandler *pre_change_cb;

I propose to use 'prepare' instead of 'pre_change'


>       void *opaque;
>       QTAILQ_ENTRY(VMChangeStateEntry) entries;
>       int priority;
> @@ -293,12 +294,38 @@ static QTAILQ_HEAD(, VMChangeStateEntry) vm_change_state_head =
>    */
>   VMChangeStateEntry *qemu_add_vm_change_state_handler_prio(
>           VMChangeStateHandler *cb, void *opaque, int priority)
> +{
> +    return qemu_add_vm_change_state_handler_prio_full(cb, NULL, opaque,
> +                                                      priority);
> +}
> +
> +/**
> + * qemu_add_vm_change_state_handler_prio_full:

qemu_add_vm_change_state_handler_prio_all(), maybe. I don't have much better,
but 'full' doesn't sound right. This is minor.

The rest looks fine to me.

Thanks,

C.

> + * @cb: the main callback to invoke
> + * @pre_change_cb: a callback to invoke before the main callback
> + * @opaque: user data passed to the callbacks
> + * @priority: low priorities execute first when the vm runs and the reverse is
> + *            true when the vm stops
> + *
> + * Register a main callback function and an optional pre VM state change
> + * callback function that are invoked when the vm starts or stops running. The
> + * main callback and the pre VM state change callback are called in two
> + * separate phases: First all pre VM state change callbacks are called and only
> + * then all main callbacks are called.
> + *
> + * Returns: an entry to be freed using qemu_del_vm_change_state_handler()
> + */
> +VMChangeStateEntry *
> +qemu_add_vm_change_state_handler_prio_full(VMChangeStateHandler *cb,
> +                                           VMChangeStateHandler *pre_change_cb,
> +                                           void *opaque, int priority)
>   {
>       VMChangeStateEntry *e;
>       VMChangeStateEntry *other;
>   
>       e = g_malloc0(sizeof(*e));
>       e->cb = cb;
> +    e->pre_change_cb = pre_change_cb;
>       e->opaque = opaque;
>       e->priority = priority;
>   
> @@ -333,10 +360,22 @@ void vm_state_notify(bool running, RunState state)
>       trace_vm_state_notify(running, state, RunState_str(state));
>   
>       if (running) {
> +        QTAILQ_FOREACH_SAFE(e, &vm_change_state_head, entries, next) {
> +            if (e->pre_change_cb) {
> +                e->pre_change_cb(e->opaque, running, state);
> +            }
> +        }
> +
>           QTAILQ_FOREACH_SAFE(e, &vm_change_state_head, entries, next) {
>               e->cb(e->opaque, running, state);
>           }
>       } else {
> +        QTAILQ_FOREACH_REVERSE_SAFE(e, &vm_change_state_head, entries, next) {
> +            if (e->pre_change_cb) {
> +                e->pre_change_cb(e->opaque, running, state);
> +            }
> +        }
> +
>           QTAILQ_FOREACH_REVERSE_SAFE(e, &vm_change_state_head, entries, next) {
>               e->cb(e->opaque, running, state);
>           }
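To make the two-phase dispatch concrete, here is a minimal standalone model of the vm_state_notify() logic quoted above (plain arrays instead of the QTAILQ list; the handler names are made up for the example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*ChangeHandler)(void *opaque, bool running);

typedef struct {
    ChangeHandler cb;         /* main callback, always set */
    ChangeHandler prepare_cb; /* optional pre-change callback */
    void *opaque;
} Entry;

/*
 * Model of vm_state_notify(): all prepare callbacks run in a first pass,
 * then all main callbacks run in a second pass.  Entries are walked
 * front-to-back when the VM starts and back-to-front when it stops,
 * mirroring the priority ordering in softmmu/runstate.c.
 */
static void notify(Entry *entries, size_t n, bool running)
{
    for (int phase = 0; phase < 2; phase++) {
        for (size_t i = 0; i < n; i++) {
            Entry *e = running ? &entries[i] : &entries[n - 1 - i];
            ChangeHandler h = (phase == 0) ? e->prepare_cb : e->cb;
            if (h) {
                h(e->opaque, running);
            }
        }
    }
}

/* Demo handlers that record the call order for inspection. */
static char call_log[8];
static size_t log_len;

static void prepare_cb(void *opaque, bool running)
{
    (void)running;
    call_log[log_len++] = (char)(*(const char *)opaque + 32); /* lowercase */
}

static void main_cb(void *opaque, bool running)
{
    (void)running;
    call_log[log_len++] = *(const char *)opaque;
}
```

With two entries where only the first registers a prepare callback, stopping the VM logs the prepare pass first and then the main callbacks in reverse order, which is exactly the property the VFIO P2P series relies on.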




* Re: [PATCH for-8.2 2/6] sysemu: Add pre VM state change callback
  2023-07-27 16:23   ` Cédric Le Goater
@ 2023-07-30  6:31     ` Avihai Horon
  2023-07-30 16:22       ` Cédric Le Goater
  0 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-07-30  6:31 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Joao Martins, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta


On 27/07/2023 19:23, Cédric Le Goater wrote:
>
> On 7/16/23 10:15, Avihai Horon wrote:
>> Add pre VM state change callback to struct VMChangeStateEntry.
>
> This sentence could be the subject.

Sure.

>
>>
>> The pre VM state change callback is optional and can be set by the new
>> function qemu_add_vm_change_state_handler_prio_full() that allows
>> setting this callback in addition to the main callback.
>>
>> The pre VM state change callbacks and main callbacks are called in two
>> separate phases: First all pre VM state change callbacks are called and
>> only then all main callbacks are called.
>>
>> The purpose of the new pre VM state change callback is to allow all
>> devices to run a preliminary task before calling the devices' main
>> callbacks.
>>
>> This will facilitate adding P2P support for VFIO migration where all
>> VFIO devices need to be put in an intermediate P2P quiescent state
>> before being stopped or started by the main VM state change callback.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   include/sysemu/runstate.h |  4 ++++
>>   softmmu/runstate.c        | 39 +++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 43 insertions(+)
>>
>> diff --git a/include/sysemu/runstate.h b/include/sysemu/runstate.h
>> index 7beb29c2e2..bb38a4b4bd 100644
>> --- a/include/sysemu/runstate.h
>> +++ b/include/sysemu/runstate.h
>> @@ -16,6 +16,10 @@ VMChangeStateEntry *qemu_add_vm_change_state_handler(VMChangeStateHandler *cb,
>>                                                        void *opaque);
>>   VMChangeStateEntry *qemu_add_vm_change_state_handler_prio(
>>           VMChangeStateHandler *cb, void *opaque, int priority);
>> +VMChangeStateEntry *
>> +qemu_add_vm_change_state_handler_prio_full(VMChangeStateHandler *cb,
>> +                                           VMChangeStateHandler *pre_change_cb,
>> +                                           void *opaque, int priority);
>>   VMChangeStateEntry *qdev_add_vm_change_state_handler(DeviceState *dev,
>>                                                        VMChangeStateHandler *cb,
>>                                                        void *opaque);
>> diff --git a/softmmu/runstate.c b/softmmu/runstate.c
>> index f3bd862818..a1f0653899 100644
>> --- a/softmmu/runstate.c
>> +++ b/softmmu/runstate.c
>> @@ -271,6 +271,7 @@ void qemu_system_vmstop_request(RunState state)
>>   }
>>   struct VMChangeStateEntry {
>>       VMChangeStateHandler *cb;
>> +    VMChangeStateHandler *pre_change_cb;
>
> I propose to use 'prepare' instead of 'pre_change'

Sure, I will change it.

>
>
>>       void *opaque;
>>       QTAILQ_ENTRY(VMChangeStateEntry) entries;
>>       int priority;
>> @@ -293,12 +294,38 @@ static QTAILQ_HEAD(, VMChangeStateEntry) vm_change_state_head =
>>    */
>>   VMChangeStateEntry *qemu_add_vm_change_state_handler_prio(
>>           VMChangeStateHandler *cb, void *opaque, int priority)
>> +{
>> +    return qemu_add_vm_change_state_handler_prio_full(cb, NULL, opaque,
>> +                                                      priority);
>> +}
>> +
>> +/**
>> + * qemu_add_vm_change_state_handler_prio_full:
>
> qemu_add_vm_change_state_handler_prio_all(), maybe. I don't have much
> better, but 'full' doesn't sound right. This is minor.

I followed the GLib naming convention.
For example, g_tree_new and g_tree_new_full, or g_hash_table_new and 
g_hash_table_new_full.
I tend to go with GLib naming as it might be more familiar to people.
What do you think?

Thanks!
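For reference, the GLib convention mentioned here is that the short constructor is a thin wrapper delegating to the `_full` variant with default arguments (e.g. g_hash_table_new() vs g_hash_table_new_full()). A generic sketch of that pattern, with made-up names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*DestroyFunc)(void *data);

typedef struct {
    size_t capacity;
    DestroyFunc destroy; /* optional, may be NULL */
} Table;

/* The "_full" variant exposes every parameter... */
static Table *table_new_full(size_t capacity, DestroyFunc destroy)
{
    Table *t = malloc(sizeof(*t));
    if (t) {
        t->capacity = capacity;
        t->destroy = destroy;
    }
    return t;
}

/* ...and the short name delegates to it with defaults, just as
 * qemu_add_vm_change_state_handler_prio() delegates to the new
 * qemu_add_vm_change_state_handler_prio_full() in this patch. */
static Table *table_new(void)
{
    return table_new_full(16, NULL);
}
```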

>
> The rest looks fine to me.
>
> Thanks,
>
> C.
>
>> + * @cb: the main callback to invoke
>> + * @pre_change_cb: a callback to invoke before the main callback
>> + * @opaque: user data passed to the callbacks
>> + * @priority: low priorities execute first when the vm runs and the reverse is
>> + *            true when the vm stops
>> + *
>> + * Register a main callback function and an optional pre VM state change
>> + * callback function that are invoked when the vm starts or stops running. The
>> + * main callback and the pre VM state change callback are called in two
>> + * separate phases: First all pre VM state change callbacks are called and only
>> + * then all main callbacks are called.
>> + *
>> + * Returns: an entry to be freed using qemu_del_vm_change_state_handler()
>> + */
>> +VMChangeStateEntry *
>> +qemu_add_vm_change_state_handler_prio_full(VMChangeStateHandler *cb,
>> +                                           VMChangeStateHandler *pre_change_cb,
>> +                                           void *opaque, int priority)
>>   {
>>       VMChangeStateEntry *e;
>>       VMChangeStateEntry *other;
>>
>>       e = g_malloc0(sizeof(*e));
>>       e->cb = cb;
>> +    e->pre_change_cb = pre_change_cb;
>>       e->opaque = opaque;
>>       e->priority = priority;
>>
>> @@ -333,10 +360,22 @@ void vm_state_notify(bool running, RunState state)
>>       trace_vm_state_notify(running, state, RunState_str(state));
>>
>>       if (running) {
>> +        QTAILQ_FOREACH_SAFE(e, &vm_change_state_head, entries, next) {
>> +            if (e->pre_change_cb) {
>> +                e->pre_change_cb(e->opaque, running, state);
>> +            }
>> +        }
>> +
>>           QTAILQ_FOREACH_SAFE(e, &vm_change_state_head, entries, next) {
>>               e->cb(e->opaque, running, state);
>>           }
>>       } else {
>> +        QTAILQ_FOREACH_REVERSE_SAFE(e, &vm_change_state_head, entries, next) {
>> +            if (e->pre_change_cb) {
>> +                e->pre_change_cb(e->opaque, running, state);
>> +            }
>> +        }
>> +
>>           QTAILQ_FOREACH_REVERSE_SAFE(e, &vm_change_state_head, entries, next) {
>>               e->cb(e->opaque, running, state);
>>           }
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration
  2023-07-18 15:46 ` [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Jason Gunthorpe
@ 2023-07-30  6:50   ` Avihai Horon
  0 siblings, 0 replies; 20+ messages in thread
From: Avihai Horon @ 2023-07-30  6:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Paolo Bonzini, Joao Martins, Yishai Hadas, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta


On 18/07/2023 18:46, Jason Gunthorpe wrote:
> On Sun, Jul 16, 2023 at 11:15:35AM +0300, Avihai Horon wrote:
>> Hi all,
>>
>> The first patch in this series adds a small optimization to VFIO
>> migration by moving the STOP_COPY->STOP transition to
>> vfio_save_cleanup(). Testing with a ConnectX-7 VFIO device showed that
>> this can reduce downtime by up to 6%.
>>
>> The rest of the series adds P2P support for VFIO migration.
>>
>> VFIO migration uAPI defines an optional intermediate P2P quiescent
>> state. While in the P2P quiescent state, P2P DMA transactions cannot be
>> initiated by the device, but the device can respond to incoming ones.
>> Additionally, all outstanding P2P transactions are guaranteed to have
>> been completed by the time the device enters this state.
> For clarity it is "all outstanding DMA transactions are guaranteed to
> have been completed" - the device has no idea if something is P2P or
> not.

The uAPI states that [1]: "If the device can identify P2P transactions
then it can stop only P2P DMA, otherwise it must stop all DMA.".

So I will stick to the uAPI definition.

[1] 
https://elixir.bootlin.com/linux/v6.5-rc1/source/include/uapi/linux/vfio.h#L1032

>
>> The purpose of this state is to support migration of multiple devices
>> that are doing P2P transactions between themselves.
> s/are/might/

Sure, will change it.

Thanks!




* Re: [PATCH for-8.2 2/6] sysemu: Add pre VM state change callback
  2023-07-30  6:31     ` Avihai Horon
@ 2023-07-30 16:22       ` Cédric Le Goater
  0 siblings, 0 replies; 20+ messages in thread
From: Cédric Le Goater @ 2023-07-30 16:22 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Joao Martins, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

[ ... ]

>>> + * qemu_add_vm_change_state_handler_prio_full:
>>
>> qemu_add_vm_change_state_handler_prio_all() may be. I don't have much better
>> but 'full' doesn't sound right. minor.
> 
> I followed the GLib naming convention.
> For example, g_tree_new and g_tree_new_full, or g_hash_table_new and g_hash_table_new_full.
> I tend to go with GLib naming as it might be more familiar to people.
> What do you think?

Let's keep it then.

Thanks,

C.




* Re: [PATCH for-8.2 1/6] vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup()
  2023-07-16  8:15 ` [PATCH for-8.2 1/6] vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup() Avihai Horon
@ 2023-07-30 16:25   ` Cédric Le Goater
  2023-07-31  6:32     ` Avihai Horon
  0 siblings, 1 reply; 20+ messages in thread
From: Cédric Le Goater @ 2023-07-30 16:25 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Joao Martins, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On 7/16/23 10:15, Avihai Horon wrote:
> Changing the device state from STOP_COPY to STOP can take time as the
> device may need to free resources and do other operations as part of the
> transition. Currently, this is done in vfio_save_complete_precopy() and
> therefore it is counted in the migration downtime.
> 
> To avoid this, change the device state from STOP_COPY to STOP in
> vfio_save_cleanup(), which is called after migration has completed and
> thus is not part of migration downtime.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Have you tried this series with vGPUs? If so, is there any improvement
to report?

Thanks,

C.

> ---
>   hw/vfio/migration.c | 19 +++++++++++++------
>   1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 2674f4bc47..8acd182a8b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -383,6 +383,19 @@ static void vfio_save_cleanup(void *opaque)
>       VFIODevice *vbasedev = opaque;
>       VFIOMigration *migration = vbasedev->migration;
>   
> +    /*
> +     * Changing device state from STOP_COPY to STOP can take time. Do it here,
> +     * after migration has completed, so it won't increase downtime.
> +     */
> +    if (migration->device_state == VFIO_DEVICE_STATE_STOP_COPY) {
> +        /*
> +         * If setting the device in STOP state fails, the device should be
> +         * reset. To do so, use ERROR state as a recover state.
> +         */
> +        vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
> +                                 VFIO_DEVICE_STATE_ERROR);
> +    }
> +
>       g_free(migration->data_buffer);
>       migration->data_buffer = NULL;
>       migration->precopy_init_size = 0;
> @@ -508,12 +521,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>           return ret;
>       }
>   
> -    /*
> -     * If setting the device in STOP state fails, the device should be reset.
> -     * To do so, use ERROR state as a recover state.
> -     */
> -    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
> -                                   VFIO_DEVICE_STATE_ERROR);
>       trace_vfio_save_complete_precopy(vbasedev->name, ret);
>   
>       return ret;




* Re: [PATCH for-8.2 1/6] vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup()
  2023-07-30 16:25   ` Cédric Le Goater
@ 2023-07-31  6:32     ` Avihai Horon
  2023-07-31  7:05       ` Cédric Le Goater
  0 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-07-31  6:32 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Joao Martins, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta


On 30/07/2023 19:25, Cédric Le Goater wrote:
>
> On 7/16/23 10:15, Avihai Horon wrote:
>> Changing the device state from STOP_COPY to STOP can take time as the
>> device may need to free resources and do other operations as part of the
>> transition. Currently, this is done in vfio_save_complete_precopy() and
>> therefore it is counted in the migration downtime.
>>
>> To avoid this, change the device state from STOP_COPY to STOP in
>> vfio_save_cleanup(), which is called after migration has completed and
>> thus is not part of migration downtime.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>
> Have you tried this series with vGPUs? If so, is there any improvement
> to report?
>
Not with a vGPU.
But I tried it with a ConnectX-7 VF and I could see up to 6% downtime 
improvement.
I mentioned this in the cover letter.
Do you want to mention it also in the commit message?

Thanks.

>
>> ---
>>   hw/vfio/migration.c | 19 +++++++++++++------
>>   1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 2674f4bc47..8acd182a8b 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -383,6 +383,19 @@ static void vfio_save_cleanup(void *opaque)
>>       VFIODevice *vbasedev = opaque;
>>       VFIOMigration *migration = vbasedev->migration;
>>
>> +    /*
>> +     * Changing device state from STOP_COPY to STOP can take time. Do it here,
>> +     * after migration has completed, so it won't increase downtime.
>> +     */
>> +    if (migration->device_state == VFIO_DEVICE_STATE_STOP_COPY) {
>> +        /*
>> +         * If setting the device in STOP state fails, the device should be
>> +         * reset. To do so, use ERROR state as a recover state.
>> +         */
>> +        vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
>> +                                 VFIO_DEVICE_STATE_ERROR);
>> +    }
>> +
>>       g_free(migration->data_buffer);
>>       migration->data_buffer = NULL;
>>       migration->precopy_init_size = 0;
>> @@ -508,12 +521,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>           return ret;
>>       }
>>
>> -    /*
>> -     * If setting the device in STOP state fails, the device should be reset.
>> -     * To do so, use ERROR state as a recover state.
>> -     */
>> -    ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
>> -                                   VFIO_DEVICE_STATE_ERROR);
>>       trace_vfio_save_complete_precopy(vbasedev->name, ret);
>>
>>       return ret;
>



* Re: [PATCH for-8.2 1/6] vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup()
  2023-07-31  6:32     ` Avihai Horon
@ 2023-07-31  7:05       ` Cédric Le Goater
  0 siblings, 0 replies; 20+ messages in thread
From: Cédric Le Goater @ 2023-07-31  7:05 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Paolo Bonzini, Joao Martins, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta

On 7/31/23 08:32, Avihai Horon wrote:
> 
> On 30/07/2023 19:25, Cédric Le Goater wrote:
>>
>> On 7/16/23 10:15, Avihai Horon wrote:
>>> Changing the device state from STOP_COPY to STOP can take time as the
>>> device may need to free resources and do other operations as part of the
>>> transition. Currently, this is done in vfio_save_complete_precopy() and
>>> therefore it is counted in the migration downtime.
>>>
>>> To avoid this, change the device state from STOP_COPY to STOP in
>>> vfio_save_cleanup(), which is called after migration has completed and
>>> thus is not part of migration downtime.
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>
>> Have you tried this series with vGPUs? If so, is there any improvement
>> to report?
>>
> Not with a vGPU.
> But I tried it with a ConnectX-7 VF and I could see up to 6% downtime improvement.
> I mentioned this in the cover letter.
> Do you want to mention it also in the commit message?

This is not necessary.

Thanks,

C.






end of thread, other threads:[~2023-07-31  7:05 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-16  8:15 [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
2023-07-16  8:15 ` [PATCH for-8.2 1/6] vfio/migration: Move from STOP_COPY to STOP in vfio_save_cleanup() Avihai Horon
2023-07-30 16:25   ` Cédric Le Goater
2023-07-31  6:32     ` Avihai Horon
2023-07-31  7:05       ` Cédric Le Goater
2023-07-16  8:15 ` [PATCH for-8.2 2/6] sysemu: Add pre VM state change callback Avihai Horon
2023-07-27 16:23   ` Cédric Le Goater
2023-07-30  6:31     ` Avihai Horon
2023-07-30 16:22       ` Cédric Le Goater
2023-07-16  8:15 ` [PATCH for-8.2 3/6] qdev: Add qdev_add_vm_change_state_handler_full() Avihai Horon
2023-07-16  8:15 ` [PATCH for-8.2 4/6] vfio/migration: Refactor PRE_COPY and RUNNING state checks Avihai Horon
2023-07-21 11:33   ` Cédric Le Goater
2023-07-16  8:15 ` [PATCH for-8.2 5/6] vfio/migration: Add P2P support for VFIO migration Avihai Horon
2023-07-21 11:48   ` Cédric Le Goater
2023-07-23 10:37     ` Avihai Horon
2023-07-16  8:15 ` [PATCH for-8.2 6/6] vfio/migration: Allow migration of multiple P2P supporting devices Avihai Horon
2023-07-21 12:09   ` Cédric Le Goater
2023-07-23 10:42     ` Avihai Horon
2023-07-18 15:46 ` [PATCH for-8.2 0/6] vfio/migration: Add P2P support for VFIO migration Jason Gunthorpe
2023-07-30  6:50   ` Avihai Horon
