* [PATCH v3 0/7] migration: Add switchover ack capability and VFIO precopy support
@ 2023-05-21 15:18 Avihai Horon
  2023-05-21 15:18 ` [PATCH v3 1/7] migration: Add switchover ack capability Avihai Horon
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Avihai Horon @ 2023-05-21 15:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Hello everyone,

This is v3 of the switchover ack series (previously called precopy
initial data).

Changes from v2 [4]:
* Rebased on latest master branch.
* Changed the capability name to "switchover-ack" and the related
  code/docs accordingly. (Peter)
* Added a counter for the number of switchover ack users in the source
  and used it to skip switchover ack if there are no users (instead of
  setting the switchover acked flag to true). (Peter)
* Added R-bs.

Changes from v1 [3]:
* Rebased on latest master branch.
* Updated to latest QAPI doc comment conventions and refined
  QAPI docs and capability error message. (Markus)
* Followed Peter/Juan suggestion and removed the handshake between
  source and destination.
  Now the capability must be set on both source and destination.
  Compatibility of this feature between different QEMU versions or
  different host capabilities (i.e., kernel) is achieved in the regular
  way of device properties and hw_compat_x_y.
* Replaced is_initial_data_active() and initial_data_loaded()
  SaveVMHandlers handlers with a notification mechanism. (Peter)
* Set the capability also in the destination in the migration test.
* Added VFIO device property x-allow-pre-copy to be able to preserve
  compatibility between different QEMU versions or different host
  capabilities (i.e., kernel).
* Changed VFIO precopy initial data implementation according to the
  above changes.
* Documented VFIO precopy initial data support in VFIO migration
  documentation.
* Added R-bs.

===

This series adds a new migration capability called "switchover ack". The
purpose of this capability is to reduce migration downtime in cases
where loading of migration data in the destination can take a lot of
time, such as with VFIO migration data.

The series then adds precopy support and switchover ack support for
VFIO migration.

Switchover ack is used by VFIO migration, but other migrated devices can
add support for it and use it as well.

=== Background ===

Migration downtime estimation is calculated based on bandwidth and
remaining migration data. This assumes that loading of migration data in
the destination takes a negligible amount of time and that downtime
depends only on network speed.
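
Roughly, the source computes how many bytes it can send within the
downtime limit at the current bandwidth, and switches over once the
remaining data fits in that budget (a simplified sketch of the logic,
not the exact QEMU code):

    /* bytes that can be sent within downtime_limit at current bandwidth */
    threshold_size = bandwidth * downtime_limit;

    if (pending_size < threshold_size) {
        /* stop the VM and switch over -- assumes loading takes ~0 time */
    }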

While this may be true for RAM, it's not necessarily true for other
migrated devices. For example, loading the data of a VFIO device in the
destination might require the device to allocate resources and prepare
internal data structures, which can take a significant amount of time.

This poses a problem: the source may think that the remaining migration
data is small enough to meet the downtime limit and therefore stop the
VM and complete the migration, while in fact sending and loading the
data in the destination takes longer than the downtime limit.

To solve this, VFIO migration uAPI defines "initial bytes" as part of
its precopy stream [1]. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, it can be used to
transfer device metadata to pre-allocate resources in the destination.
However, for this to work we need to make sure that all initial bytes
are sent and loaded in the destination before the source VM is stopped.
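
For reference, the uAPI exposes these initial bytes via the precopy
info ioctl on the migration data fd (struct as in the kernel header
linked in [1]; the comments here are paraphrased):

    struct vfio_precopy_info {
        __u32 argsz;
        __u32 flags;
        __aligned_u64 initial_bytes; /* remaining initial precopy data */
        __aligned_u64 dirty_bytes;   /* estimate of device state changes */
    };

    ioctl(data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy);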

The new switchover ack migration capability helps us achieve this.
It prevents the source from stopping the VM and completing the migration
until an ACK is received from the destination that it's OK to do so.
Thus, a VFIO device can make sure that its initial bytes were sent
and loaded in the destination before the source VM is stopped.

Note that this relies on the return path capability to communicate from
the destination back to the source.

=== Flow of operation ===

To use switchover ack, the capability must be enabled in both the source
and the destination.
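
For example, via QMP on both the source and destination QEMU instances
(switchover ack depends on the return-path capability, so both are
enabled here):

    { "execute": "migrate-set-capabilities",
      "arguments": { "capabilities": [
          { "capability": "return-path", "state": true },
          { "capability": "switchover-ack", "state": true } ] } }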

During migration setup, migration code calls the switchover_ack_needed()
SaveVMHandlers handler of the migrated devices, both in the source and
the destination, to check whether switchover ack is used by each of
them. In the destination, a "switchover_ack_pending_num" counter is
increased for each migrated device that supports this feature. It will
be used later to determine when an ACK should be sent to the source.

Migration is active and the source starts to send precopy data as usual.
In the destination, when a migrated device decides it's OK to do
switchover, it notifies the migration code and the
"switchover_ack_pending_num" counter is decreased. For a VFIO device,
for example, this happens when it has received and loaded its initial
bytes.

When the "switchover_ack_pending_num" counter reaches zero, it means
that all devices agree to do switchover and an ACK is sent to the
source, which will now be able to complete the migration when
appropriate.
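
For a device backend, opting in is a matter of implementing the new
handler and, in the destination, notifying migration code once its data
is ready (a minimal sketch for a hypothetical device, using the
interfaces added by this series):

    static bool mydev_switchover_ack_needed(void *opaque)
    {
        /* Opt in only if this device sends initial precopy data */
        return true;
    }

    static const SaveVMHandlers savevm_mydev_handlers = {
        /* ... other handlers ... */
        .switchover_ack_needed = mydev_switchover_ack_needed,
    };

    /*
     * Destination side, once the device has loaded its initial data:
     * decrements switchover_ack_pending_num and, when it reaches zero,
     * sends the ACK to the source.
     */
    ret = qemu_loadvm_approve_switchover();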

=== Test results ===

The table below shows the downtime of two identical migrations. In the
first migration switchover ack is disabled and in the second it is
enabled. The migrated VM is assigned a mlx5 VFIO device which has
300MB of device data to be migrated.

+----------------------+-----------------------+----------+
|    Switchover ack    | VFIO device data size | Downtime |
+----------------------+-----------------------+----------+
|       Disabled       |         300MB         |  1900ms  |
|       Enabled        |         300MB         |  420ms   |
+----------------------+-----------------------+----------+

Switchover ack gives a roughly 4.5 times improvement in downtime. The
1480ms difference is the time spent allocating resources for the VFIO
device in the destination. Without switchover ack, this time is spent
while the source VM is stopped, so the downtime is much higher. With
switchover ack, it is spent while the source VM is still running.

=== Patch breakdown ===

- Patches 1-4 add the switchover ack capability.
- Patches 5-6 add VFIO migration precopy support. A similar version of
  them was previously sent here [2].
- Patch 7 adds switchover ack support for VFIO migration.

Thanks for reviewing!

[1]
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/vfio.h#L1048

[2]
https://lore.kernel.org/qemu-devel/20230222174915.5647-3-avihaih@nvidia.com/

[3]
https://lore.kernel.org/qemu-devel/20230501140141.11743-1-avihaih@nvidia.com/

[4]
https://lore.kernel.org/qemu-devel/20230517155219.10691-1-avihaih@nvidia.com/

Avihai Horon (7):
  migration: Add switchover ack capability
  migration: Implement switchover ack logic
  migration: Enable switchover ack capability
  tests: Add migration switchover ack capability test
  vfio/migration: Refactor vfio_save_block() to return saved data size
  vfio/migration: Add VFIO migration pre-copy support
  vfio/migration: Add support for switchover ack capability

 docs/devel/vfio-migration.rst |  45 +++++--
 qapi/migration.json           |  12 +-
 include/hw/vfio/vfio-common.h |   6 +
 include/migration/register.h  |   3 +
 migration/migration.h         |  16 +++
 migration/options.h           |   1 +
 migration/savevm.h            |   2 +
 hw/core/machine.c             |   1 +
 hw/vfio/common.c              |   6 +-
 hw/vfio/migration.c           | 220 +++++++++++++++++++++++++++++++---
 hw/vfio/pci.c                 |   2 +
 migration/migration.c         |  42 ++++++-
 migration/options.c           |  17 +++
 migration/savevm.c            |  56 +++++++++
 tests/qtest/migration-test.c  |  26 ++++
 hw/vfio/trace-events          |   4 +-
 migration/trace-events        |   4 +
 17 files changed, 430 insertions(+), 33 deletions(-)

-- 
2.26.3




* [PATCH v3 1/7] migration: Add switchover ack capability
  2023-05-21 15:18 [PATCH v3 0/7] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
@ 2023-05-21 15:18 ` Avihai Horon
  2023-05-24 19:24   ` Peter Xu
  2023-05-25  9:33   ` Markus Armbruster
  2023-05-21 15:18 ` [PATCH v3 2/7] migration: Implement switchover ack logic Avihai Horon
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 20+ messages in thread
From: Avihai Horon @ 2023-05-21 15:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Migration downtime estimation is calculated based on bandwidth and
remaining migration data. This assumes that loading of migration data in
the destination takes a negligible amount of time and that downtime
depends only on network speed.

While this may be true for RAM, it's not necessarily true for other
migration users. For example, loading the data of a VFIO device in the
destination might require the device to allocate resources, prepare
internal data structures and so on. These operations can take a
significant amount of time which can increase migration downtime.

This patch adds a new capability "switchover ack" that prevents the
source from stopping the VM and completing the migration until an ACK
is received from the destination that it's OK to do so.

This can be used by migrated devices in various ways to reduce downtime.
For example, a device can send initial precopy metadata to pre-allocate
resources in the destination and use this capability to make sure that
the pre-allocation is completed before the source VM is stopped, so it
will have full effect.

This new capability relies on the return path capability to communicate
from the destination back to the source.

The actual implementation of the capability will be added in the
following patches.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 qapi/migration.json | 12 +++++++++++-
 migration/options.h |  1 +
 migration/options.c | 21 +++++++++++++++++++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 179af0c4d8..061ea512e0 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -487,6 +487,16 @@
 #     and should not affect the correctness of postcopy migration.
 #     (since 7.1)
 #
+# @switchover-ack: If enabled, migration will not stop the source VM
+#     and complete the migration until an ACK is received from the
+#     destination that it's OK to do so.  Exactly when this ACK is
+#     sent depends on the migrated devices that use this feature.
+#     For example, a device can use it to make sure some of its data
+#     is sent and loaded in the destination before doing switchover.
+#     This can reduce downtime if devices that support this capability
+#     are present.  'return-path' capability must be enabled to use
+#     it.  (since 8.1)
+#
 # Features:
 #
 # @unstable: Members @x-colo and @x-ignore-shared are experimental.
@@ -502,7 +512,7 @@
            'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
            { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
            'validate-uuid', 'background-snapshot',
-           'zero-copy-send', 'postcopy-preempt'] }
+           'zero-copy-send', 'postcopy-preempt', 'switchover-ack'] }
 
 ##
 # @MigrationCapabilityStatus:
diff --git a/migration/options.h b/migration/options.h
index 45991af3c2..9aaf363322 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -40,6 +40,7 @@ bool migrate_postcopy_ram(void);
 bool migrate_rdma_pin_all(void);
 bool migrate_release_ram(void);
 bool migrate_return_path(void);
+bool migrate_switchover_ack(void);
 bool migrate_validate_uuid(void);
 bool migrate_xbzrle(void);
 bool migrate_zero_blocks(void);
diff --git a/migration/options.c b/migration/options.c
index b62ab30cd5..16007afca6 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -185,6 +185,8 @@ Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-zero-copy-send",
             MIGRATION_CAPABILITY_ZERO_COPY_SEND),
 #endif
+    DEFINE_PROP_MIG_CAP("x-switchover-ack",
+                        MIGRATION_CAPABILITY_SWITCHOVER_ACK),
 
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -308,6 +310,13 @@ bool migrate_return_path(void)
     return s->capabilities[MIGRATION_CAPABILITY_RETURN_PATH];
 }
 
+bool migrate_switchover_ack(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->capabilities[MIGRATION_CAPABILITY_SWITCHOVER_ACK];
+}
+
 bool migrate_validate_uuid(void)
 {
     MigrationState *s = migrate_get_current();
@@ -547,6 +556,18 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
         }
     }
 
+    if (new_caps[MIGRATION_CAPABILITY_SWITCHOVER_ACK]) {
+        if (!new_caps[MIGRATION_CAPABILITY_RETURN_PATH]) {
+            error_setg(errp, "Capability 'switchover-ack' requires capability "
+                             "'return-path'");
+            return false;
+        }
+
+        /* Disable this capability until it's implemented */
+        error_setg(errp, "'switchover-ack' is not implemented yet");
+        return false;
+    }
+
     return true;
 }
 
-- 
2.26.3




* [PATCH v3 2/7] migration: Implement switchover ack logic
  2023-05-21 15:18 [PATCH v3 0/7] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
  2023-05-21 15:18 ` [PATCH v3 1/7] migration: Add switchover ack capability Avihai Horon
@ 2023-05-21 15:18 ` Avihai Horon
  2023-05-24 19:32   ` Peter Xu
  2023-05-21 15:18 ` [PATCH v3 3/7] migration: Enable switchover ack capability Avihai Horon
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-05-21 15:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Implement switchover ack logic. This prevents the source from stopping
the VM and completing the migration until an ACK is received from the
destination that it's OK to do so.

To achieve this, a new SaveVMHandlers handler switchover_ack_needed()
and a new return path message MIG_RP_MSG_SWITCHOVER_ACK are added.

The switchover_ack_needed() handler is called during migration setup
both in the source and the destination to check if switchover ack is
used by the migrated device.

When switchover is approved by all migrated devices in the destination
that support this capability, the MIG_RP_MSG_SWITCHOVER_ACK return path
message is sent to the source to notify it that it's OK to do
switchover.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/migration/register.h |  3 ++
 migration/migration.h        | 16 +++++++++++
 migration/savevm.h           |  2 ++
 migration/migration.c        | 42 +++++++++++++++++++++++++--
 migration/savevm.c           | 56 ++++++++++++++++++++++++++++++++++++
 migration/trace-events       |  4 +++
 6 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/include/migration/register.h b/include/migration/register.h
index a8dfd8fefd..cda36d377b 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -71,6 +71,9 @@ typedef struct SaveVMHandlers {
     int (*load_cleanup)(void *opaque);
     /* Called when postcopy migration wants to resume from failure */
     int (*resume_prepare)(MigrationState *s, void *opaque);
+
+    /* Checks if switchover ack should be used. Called both in src and dest */
+    bool (*switchover_ack_needed)(void *opaque);
 } SaveVMHandlers;
 
 int register_savevm_live(const char *idstr,
diff --git a/migration/migration.h b/migration/migration.h
index 48a46123a0..e209860cce 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -209,6 +209,13 @@ struct MigrationIncomingState {
      * contains valid information.
      */
     QemuMutex page_request_mutex;
+
+    /*
+     * Number of devices that have yet to approve switchover. When this reaches
+     * zero an ACK that it's OK to do switchover is sent to the source. No lock
+     * is needed as this field is updated serially.
+     */
+    unsigned int switchover_ack_pending_num;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
@@ -437,6 +444,14 @@ struct MigrationState {
 
     /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
     JSONWriter *vmdesc;
+
+    /* Number of devices that use switchover ack capability */
+    unsigned int switchover_ack_user_num;
+    /*
+     * Indicates whether an ACK from the destination that it's OK to do
+     * switchover has been received.
+     */
+    bool switchover_acked;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
@@ -477,6 +492,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
 void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
                                  char *block_name);
 void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
+int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
 
 void dirty_bitmap_mig_before_vm_start(void);
 void dirty_bitmap_mig_cancel_outgoing(void);
diff --git a/migration/savevm.h b/migration/savevm.h
index fb636735f0..5c3e1a026b 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -32,6 +32,7 @@
 bool qemu_savevm_state_blocked(Error **errp);
 void qemu_savevm_non_migratable_list(strList **reasons);
 void qemu_savevm_state_setup(QEMUFile *f);
+void qemu_savevm_state_switchover_ack_needed(MigrationState *ms);
 bool qemu_savevm_state_guest_unplug_pending(void);
 int qemu_savevm_state_resume_prepare(MigrationState *s);
 void qemu_savevm_state_header(QEMUFile *f);
@@ -65,6 +66,7 @@ int qemu_loadvm_state(QEMUFile *f);
 void qemu_loadvm_state_cleanup(void);
 int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
 int qemu_load_device_state(QEMUFile *f);
+int qemu_loadvm_approve_switchover(void);
 int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
         bool in_postcopy, bool inactivate_disks);
 
diff --git a/migration/migration.c b/migration/migration.c
index 5de7f734b9..87423ec30c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -78,6 +78,7 @@ enum mig_rp_message_type {
     MIG_RP_MSG_REQ_PAGES,    /* data (start: be64, len: be32) */
     MIG_RP_MSG_RECV_BITMAP,  /* send recved_bitmap back to source */
     MIG_RP_MSG_RESUME_ACK,   /* tell source that we are ready to resume */
+    MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
 
     MIG_RP_MSG_MAX
 };
@@ -760,6 +761,11 @@ bool migration_has_all_channels(void)
     return true;
 }
 
+int migrate_send_rp_switchover_ack(MigrationIncomingState *mis)
+{
+    return migrate_send_rp_message(mis, MIG_RP_MSG_SWITCHOVER_ACK, 0, NULL);
+}
+
 /*
  * Send a 'SHUT' message on the return channel with the given value
  * to indicate that we've finished with the RP.  Non-0 value indicates
@@ -1405,6 +1411,8 @@ void migrate_init(MigrationState *s)
     s->vm_was_running = false;
     s->iteration_initial_bytes = 0;
     s->threshold_size = 0;
+    s->switchover_ack_user_num = 0;
+    s->switchover_acked = false;
 }
 
 int migrate_add_blocker_internal(Error *reason, Error **errp)
@@ -1721,6 +1729,7 @@ static struct rp_cmd_args {
     [MIG_RP_MSG_REQ_PAGES_ID]   = { .len = -1, .name = "REQ_PAGES_ID" },
     [MIG_RP_MSG_RECV_BITMAP]    = { .len = -1, .name = "RECV_BITMAP" },
     [MIG_RP_MSG_RESUME_ACK]     = { .len =  4, .name = "RESUME_ACK" },
+    [MIG_RP_MSG_SWITCHOVER_ACK] = { .len =  0, .name = "SWITCHOVER_ACK" },
     [MIG_RP_MSG_MAX]            = { .len = -1, .name = "MAX" },
 };
 
@@ -1959,6 +1968,11 @@ retry:
             }
             break;
 
+        case MIG_RP_MSG_SWITCHOVER_ACK:
+            ms->switchover_acked = true;
+            trace_source_return_path_thread_switchover_acked();
+            break;
+
         default:
             break;
         }
@@ -2700,6 +2714,25 @@ static void migration_update_counters(MigrationState *s,
                               bandwidth, s->threshold_size);
 }
 
+static bool migration_can_switchover(MigrationState *s)
+{
+    if (!migrate_switchover_ack()) {
+        return true;
+    }
+
+    /* Switchover ack was enabled but no device uses it */
+    if (!s->switchover_ack_user_num) {
+        return true;
+    }
+
+    /* No reason to wait for switchover ACK if VM is stopped */
+    if (!runstate_is_running()) {
+        return true;
+    }
+
+    return s->switchover_acked;
+}
+
 /* Migration thread iteration status */
 typedef enum {
     MIG_ITERATE_RESUME,         /* Resume current iteration */
@@ -2715,6 +2748,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
 {
     uint64_t must_precopy, can_postcopy;
     bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
+    bool can_switchover = migration_can_switchover(s);
 
     qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
     uint64_t pending_size = must_precopy + can_postcopy;
@@ -2727,14 +2761,14 @@ static MigIterateState migration_iteration_run(MigrationState *s)
         trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
     }
 
-    if (!pending_size || pending_size < s->threshold_size) {
+    if ((!pending_size || pending_size < s->threshold_size) && can_switchover) {
         trace_migration_thread_low_pending(pending_size);
         migration_completion(s);
         return MIG_ITERATE_BREAK;
     }
 
     /* Still a significant amount to transfer */
-    if (!in_postcopy && must_precopy <= s->threshold_size &&
+    if (!in_postcopy && must_precopy <= s->threshold_size && can_switchover &&
         qatomic_read(&s->start_postcopy)) {
         if (postcopy_start(s)) {
             error_report("%s: postcopy failed to start", __func__);
@@ -2959,6 +2993,10 @@ static void *migration_thread(void *opaque)
 
     qemu_savevm_state_setup(s->to_dst_file);
 
+    if (migrate_switchover_ack()) {
+        qemu_savevm_state_switchover_ack_needed(s);
+    }
+
     qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
                                MIGRATION_STATUS_ACTIVE);
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 03795ce8dc..9482b1ff27 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1233,6 +1233,23 @@ bool qemu_savevm_state_guest_unplug_pending(void)
     return false;
 }
 
+void qemu_savevm_state_switchover_ack_needed(MigrationState *ms)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->switchover_ack_needed) {
+            continue;
+        }
+
+        if (se->ops->switchover_ack_needed(se->opaque)) {
+            ms->switchover_ack_user_num++;
+        }
+    }
+
+    trace_savevm_state_switchover_ack_needed(ms->switchover_ack_user_num);
+}
+
 void qemu_savevm_state_setup(QEMUFile *f)
 {
     MigrationState *ms = migrate_get_current();
@@ -2586,6 +2603,23 @@ static int qemu_loadvm_state_header(QEMUFile *f)
     return 0;
 }
 
+static void qemu_loadvm_state_switchover_ack_needed(MigrationIncomingState *mis)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->switchover_ack_needed) {
+            continue;
+        }
+
+        if (se->ops->switchover_ack_needed(se->opaque)) {
+            mis->switchover_ack_pending_num++;
+        }
+    }
+
+    trace_loadvm_state_switchover_ack_needed(mis->switchover_ack_pending_num);
+}
+
 static int qemu_loadvm_state_setup(QEMUFile *f)
 {
     SaveStateEntry *se;
@@ -2789,6 +2823,10 @@ int qemu_loadvm_state(QEMUFile *f)
         return -EINVAL;
     }
 
+    if (migrate_switchover_ack()) {
+        qemu_loadvm_state_switchover_ack_needed(mis);
+    }
+
     cpu_synchronize_all_pre_loadvm();
 
     ret = qemu_loadvm_state_main(f, mis);
@@ -2862,6 +2900,24 @@ int qemu_load_device_state(QEMUFile *f)
     return 0;
 }
 
+int qemu_loadvm_approve_switchover(void)
+{
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    if (!mis->switchover_ack_pending_num) {
+        return -EINVAL;
+    }
+
+    mis->switchover_ack_pending_num--;
+    trace_loadvm_approve_switchover(mis->switchover_ack_pending_num);
+
+    if (mis->switchover_ack_pending_num) {
+        return 0;
+    }
+
+    return migrate_send_rp_switchover_ack(mis);
+}
+
 bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
                   bool has_devices, strList *devices, Error **errp)
 {
diff --git a/migration/trace-events b/migration/trace-events
index cdaef7a1ea..c52b429b28 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
 qemu_loadvm_state_post_main(int ret) "%d"
 qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
 qemu_savevm_send_packaged(void) ""
+loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
 loadvm_state_setup(void) ""
 loadvm_state_cleanup(void) ""
 loadvm_handle_cmd_packaged(unsigned int length) "%u"
@@ -23,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
 loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
 loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
 loadvm_process_command_ping(uint32_t val) "0x%x"
+loadvm_approve_switchover(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
 postcopy_ram_listen_thread_exit(void) ""
 postcopy_ram_listen_thread_start(void) ""
 qemu_savevm_send_postcopy_advise(void) ""
@@ -39,6 +41,7 @@ savevm_send_postcopy_resume(void) ""
 savevm_send_colo_enable(void) ""
 savevm_send_recv_bitmap(char *name) "%s"
 savevm_state_setup(void) ""
+savevm_state_switchover_ack_needed(unsigned int switchover_ack_user_num) "Switchover ack user num=%u"
 savevm_state_resume_prepare(void) ""
 savevm_state_header(void) ""
 savevm_state_iterate(void) ""
@@ -180,6 +183,7 @@ source_return_path_thread_loop_top(void) ""
 source_return_path_thread_pong(uint32_t val) "0x%x"
 source_return_path_thread_shut(uint32_t val) "0x%x"
 source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
+source_return_path_thread_switchover_acked(void) ""
 migration_thread_low_pending(uint64_t pending) "%" PRIu64
 migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
 process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
-- 
2.26.3




* [PATCH v3 3/7] migration: Enable switchover ack capability
  2023-05-21 15:18 [PATCH v3 0/7] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
  2023-05-21 15:18 ` [PATCH v3 1/7] migration: Add switchover ack capability Avihai Horon
  2023-05-21 15:18 ` [PATCH v3 2/7] migration: Implement switchover ack logic Avihai Horon
@ 2023-05-21 15:18 ` Avihai Horon
  2023-05-21 15:18 ` [PATCH v3 4/7] tests: Add migration switchover ack capability test Avihai Horon
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Avihai Horon @ 2023-05-21 15:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Now that switchover ack logic has been implemented, enable the
capability.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 migration/options.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/migration/options.c b/migration/options.c
index 16007afca6..5a9505adf7 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -562,10 +562,6 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
                              "'return-path'");
             return false;
         }
-
-        /* Disable this capability until it's implemented */
-        error_setg(errp, "'switchover-ack' is not implemented yet");
-        return false;
     }
 
     return true;
-- 
2.26.3




* [PATCH v3 4/7] tests: Add migration switchover ack capability test
  2023-05-21 15:18 [PATCH v3 0/7] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
                   ` (2 preceding siblings ...)
  2023-05-21 15:18 ` [PATCH v3 3/7] migration: Enable switchover ack capability Avihai Horon
@ 2023-05-21 15:18 ` Avihai Horon
  2023-05-21 15:18 ` [PATCH v3 5/7] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Avihai Horon @ 2023-05-21 15:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Add a migration switchover ack capability test. The test runs without
devices that support this capability, but it is still useful to make
sure that enabling it doesn't break anything.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
 tests/qtest/migration-test.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index b99b49a314..d246a5bbc5 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1648,6 +1648,28 @@ static void test_precopy_tcp_plain(void)
     test_precopy_common(&args);
 }
 
+static void *test_migrate_switchover_ack_start(QTestState *from, QTestState *to)
+{
+
+    migrate_set_capability(from, "return-path", true);
+    migrate_set_capability(to, "return-path", true);
+
+    migrate_set_capability(from, "switchover-ack", true);
+    migrate_set_capability(to, "switchover-ack", true);
+
+    return NULL;
+}
+
+static void test_precopy_tcp_switchover_ack(void)
+{
+    MigrateCommon args = {
+        .listen_uri = "tcp:127.0.0.1:0",
+        .start_hook = test_migrate_switchover_ack_start,
+    };
+
+    test_precopy_common(&args);
+}
+
 #ifdef CONFIG_GNUTLS
 static void test_precopy_tcp_tls_psk_match(void)
 {
@@ -2695,6 +2717,10 @@ int main(int argc, char **argv)
 #endif /* CONFIG_GNUTLS */
 
     qtest_add_func("/migration/precopy/tcp/plain", test_precopy_tcp_plain);
+
+    qtest_add_func("/migration/precopy/tcp/plain/switchover-ack",
+                   test_precopy_tcp_switchover_ack);
+
 #ifdef CONFIG_GNUTLS
     qtest_add_func("/migration/precopy/tcp/tls/psk/match",
                    test_precopy_tcp_tls_psk_match);
-- 
2.26.3




* [PATCH v3 5/7] vfio/migration: Refactor vfio_save_block() to return saved data size
  2023-05-21 15:18 [PATCH v3 0/7] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
                   ` (3 preceding siblings ...)
  2023-05-21 15:18 ` [PATCH v3 4/7] tests: Add migration switchover ack capability test Avihai Horon
@ 2023-05-21 15:18 ` Avihai Horon
  2023-05-21 15:18 ` [PATCH v3 6/7] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
  2023-05-21 15:18 ` [PATCH v3 7/7] vfio/migration: Add support for switchover ack capability Avihai Horon
  6 siblings, 0 replies; 20+ messages in thread
From: Avihai Horon @ 2023-05-21 15:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Refactor vfio_save_block() to return the size of saved data on success
and -errno on error.

This will be used in the next patch to implement VFIO migration
pre-copy support.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
 hw/vfio/migration.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 6b58dddb88..235978fd68 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -241,8 +241,8 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
     return 0;
 }
 
-/* Returns 1 if end-of-stream is reached, 0 if more data and -errno if error */
-static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
+/* Returns the size of saved data on success and -errno on error */
+static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 {
     ssize_t data_size;
 
@@ -252,7 +252,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
         return -errno;
     }
     if (data_size == 0) {
-        return 1;
+        return 0;
     }
 
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
@@ -262,7 +262,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 
     trace_vfio_save_block(migration->vbasedev->name, data_size);
 
-    return qemu_file_get_error(f);
+    return qemu_file_get_error(f) ?: data_size;
 }
 
 /* ---------------------------------------------------------------------- */
@@ -335,6 +335,7 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
+    ssize_t data_size;
     int ret;
 
     /* We reach here with device state STOP only */
@@ -345,11 +346,11 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     }
 
     do {
-        ret = vfio_save_block(f, vbasedev->migration);
-        if (ret < 0) {
-            return ret;
+        data_size = vfio_save_block(f, vbasedev->migration);
+        if (data_size < 0) {
+            return data_size;
         }
-    } while (!ret);
+    } while (data_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
     ret = qemu_file_get_error(f);
-- 
2.26.3




* [PATCH v3 6/7] vfio/migration: Add VFIO migration pre-copy support
  2023-05-21 15:18 [PATCH v3 0/7] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
                   ` (4 preceding siblings ...)
  2023-05-21 15:18 ` [PATCH v3 5/7] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
@ 2023-05-21 15:18 ` Avihai Horon
  2023-05-23 14:56   ` Cédric Le Goater
  2023-05-21 15:18 ` [PATCH v3 7/7] vfio/migration: Add support for switchover ack capability Avihai Horon
  6 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-05-21 15:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Pre-copy support allows the VFIO device data to be transferred while the
VM is running. This helps to accommodate VFIO devices that have a large
amount of data that needs to be transferred, and it can reduce migration
downtime.

Pre-copy support is optional in VFIO migration protocol v2.
Implement pre-copy of VFIO migration protocol v2 and use it for devices
that support it. A full description of it can be found here [1].

In addition, add a new VFIO device property x-allow-pre-copy to keep
migration compatibility to/from older QEMU versions that don't have VFIO
pre-copy support.

[1]
https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst |  35 +++++---
 include/hw/vfio/vfio-common.h |   4 +
 hw/core/machine.c             |   1 +
 hw/vfio/common.c              |   6 +-
 hw/vfio/migration.c           | 163 ++++++++++++++++++++++++++++++++--
 hw/vfio/pci.c                 |   2 +
 hw/vfio/trace-events          |   4 +-
 7 files changed, 193 insertions(+), 22 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 1b68ccf115..e896b2a673 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices currently consists of a single stop-and-copy phase.
-During the stop-and-copy phase the guest is stopped and the entire VFIO device
-data is transferred to the destination.
-
-The pre-copy phase of migration is currently not supported for VFIO devices.
-Support for VFIO pre-copy will be added later on.
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and allows to
+accommodate VFIO devices that have a large amount of data that needs to be
+transferred. The iterative pre-copy phase of migration allows for the guest to
+continue whilst the VFIO device state is transferred to the destination, this
+helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
+support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
+VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
 Note that currently VFIO migration is supported only for a single device. This
 is due to VFIO migration's lack of P2P support. However, P2P support is planned
@@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
 * A ``load_setup`` function that sets the VFIO device on the destination in
   _RESUMING state.
 
+* A ``state_pending_estimate`` function that reports an estimate of the
+  remaining pre-copy data that the vendor driver has yet to save for the VFIO
+  device.
+
 * A ``state_pending_exact`` function that reads pending_bytes from the vendor
   driver, which indicates the amount of data that the vendor driver has yet to
   save for the VFIO device.
 
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+  active only when the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver during iterative pre-copy phase.
+
 * A ``save_state`` function to save the device config space if it is present.
 
 * A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -111,8 +123,10 @@ Flow of state changes during Live migration
 ===========================================
 
 Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
+The values in the parentheses represent the VM state, the migration state, and
 the VFIO device state, respectively.
+The text in the square brackets represents the flow if the VFIO device supports
+pre-copy.
 
 Live migration save path
 ------------------------
@@ -124,11 +138,12 @@ Live migration save path
                                   |
                      migrate_init spawns migration_thread
                 Migration thread then calls each device's .save_setup()
-                       (RUNNING, _SETUP, _RUNNING)
+                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
                                   |
-                      (RUNNING, _ACTIVE, _RUNNING)
-             If device is active, get pending_bytes by .state_pending_exact()
+                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
+      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
           If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  [Data of VFIO device for pre-copy phase is copied]
         Iterate till total pending bytes converge and are less than threshold
                                   |
   On migration completion, vCPU stops and calls .save_live_complete_precopy for
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index eed244f25f..5ce7a01d56 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,9 @@ typedef struct VFIOMigration {
     int data_fd;
     void *data_buffer;
     size_t data_buffer_size;
+    uint64_t precopy_init_size;
+    uint64_t precopy_dirty_size;
+    uint64_t mig_flags;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
@@ -143,6 +146,7 @@ typedef struct VFIODevice {
     VFIOMigration *migration;
     Error *migration_blocker;
     OnOffAuto pre_copy_dirty_page_tracking;
+    bool allow_pre_copy;
     bool dirty_pages_supported;
     bool dirty_tracking;
 } VFIODevice;
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 07f763eb2e..50439e5cbb 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -41,6 +41,7 @@
 
 GlobalProperty hw_compat_8_0[] = {
     { "migration", "multifd-flush-after-each-section", "on"},
+    { "vfio-pci", "x-allow-pre-copy", "false" },
 };
 const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
 
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 78358ede27..b73086e17a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
             }
 
             if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
-                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
                 return false;
             }
         }
@@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
                 return false;
             }
 
-            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 235978fd68..418efed019 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
         return "STOP_COPY";
     case VFIO_DEVICE_STATE_RESUMING:
         return "RESUMING";
+    case VFIO_DEVICE_STATE_PRE_COPY:
+        return "PRE_COPY";
     default:
         return "UNKNOWN STATE";
     }
@@ -241,6 +243,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
     return 0;
 }
 
+static int vfio_query_precopy_size(VFIOMigration *migration)
+{
+    struct vfio_precopy_info precopy = {
+        .argsz = sizeof(precopy),
+    };
+
+    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
+        return -errno;
+    }
+
+    migration->precopy_init_size = precopy.initial_bytes;
+    migration->precopy_dirty_size = precopy.dirty_bytes;
+
+    return 0;
+}
+
 /* Returns the size of saved data on success and -errno on error */
 static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 {
@@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     data_size = read(migration->data_fd, migration->data_buffer,
                      migration->data_buffer_size);
     if (data_size < 0) {
+        /* Pre-copy emptied all the device state for now */
+        if (errno == ENOMSG) {
+            return 0;
+        }
+
         return -errno;
     }
     if (data_size == 0) {
@@ -265,6 +288,39 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     return qemu_file_get_error(f) ?: data_size;
 }
 
+static void vfio_update_estimated_pending_data(VFIOMigration *migration,
+                                               uint64_t data_size)
+{
+    if (!data_size) {
+        /*
+         * Pre-copy emptied all the device state for now, update estimated sizes
+         * accordingly.
+         */
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+
+        return;
+    }
+
+    if (migration->precopy_init_size) {
+        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
+
+        migration->precopy_init_size -= init_size;
+        data_size -= init_size;
+    }
+
+    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
+                                         data_size);
+}
+
+static bool vfio_precopy_supported(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    return vbasedev->allow_pre_copy &&
+           migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -285,6 +341,31 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return -ENOMEM;
     }
 
+    if (vfio_precopy_supported(vbasedev)) {
+        int ret;
+
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+
+        switch (migration->device_state) {
+        case VFIO_DEVICE_STATE_RUNNING:
+            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
+                                           VFIO_DEVICE_STATE_RUNNING);
+            if (ret) {
+                return ret;
+            }
+
+            vfio_query_precopy_size(migration);
+
+            break;
+        case VFIO_DEVICE_STATE_STOP:
+            /* vfio_save_complete_precopy() will go to STOP_COPY */
+            break;
+        default:
+            return -EINVAL;
+        }
+    }
+
     trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
@@ -303,22 +384,36 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
+                                        uint64_t *can_postcopy)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
+        return;
+    }
+
+    *must_precopy +=
+        migration->precopy_init_size + migration->precopy_dirty_size;
+
+    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
+                                      *can_postcopy,
+                                      migration->precopy_init_size,
+                                      migration->precopy_dirty_size);
+}
+
 /*
  * Migration size of VFIO devices can be as little as a few KBs or as big as
  * many GBs. This value should be big enough to cover the worst case.
  */
 #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
 
-/*
- * Only exact function is implemented and not estimate function. The reason is
- * that during pre-copy phase of migration the estimate function is called
- * repeatedly while pending RAM size is over the threshold, thus migration
- * can't converge and querying the VFIO device pending data size is useless.
- */
 static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
                                      uint64_t *can_postcopy)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
 
     /*
@@ -328,8 +423,49 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
     vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
     *must_precopy += stop_copy_size;
 
+    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+        vfio_query_precopy_size(migration);
+
+        *must_precopy +=
+            migration->precopy_init_size + migration->precopy_dirty_size;
+    }
+
     trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
-                                   stop_copy_size);
+                                   stop_copy_size, migration->precopy_init_size,
+                                   migration->precopy_dirty_size);
+}
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    ssize_t data_size;
+
+    data_size = vfio_save_block(f, migration);
+    if (data_size < 0) {
+        return data_size;
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    vfio_update_estimated_pending_data(migration, data_size);
+
+    trace_vfio_save_iterate(vbasedev->name);
+
+    /*
+     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
+     * Return 1 so following handlers will not be potentially blocked.
+     */
+    return 1;
 }
 
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
@@ -338,7 +474,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     ssize_t data_size;
     int ret;
 
-    /* We reach here with device state STOP only */
+    /* We reach here with device state STOP or STOP_COPY only */
     ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
                                    VFIO_DEVICE_STATE_STOP);
     if (ret) {
@@ -457,7 +593,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
 static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .state_pending_estimate = vfio_state_pending_estimate,
     .state_pending_exact = vfio_state_pending_exact,
+    .is_active_iterate = vfio_is_active_iterate,
+    .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
@@ -470,13 +609,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     enum vfio_device_mig_state new_state;
     int ret;
 
     if (running) {
         new_state = VFIO_DEVICE_STATE_RUNNING;
     } else {
-        new_state = VFIO_DEVICE_STATE_STOP;
+        new_state =
+            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
+             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
+                VFIO_DEVICE_STATE_STOP_COPY :
+                VFIO_DEVICE_STATE_STOP;
     }
 
     /*
@@ -603,6 +747,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     migration->vbasedev = vbasedev;
     migration->device_state = VFIO_DEVICE_STATE_RUNNING;
     migration->data_fd = -1;
+    migration->mig_flags = mig_flags;
 
     vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bf27a39905..72f30ce09f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
                             vbasedev.pre_copy_dirty_page_tracking,
                             ON_OFF_AUTO_ON),
+    DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
+                     vbasedev.allow_pre_copy, true),
     DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
                             display, ON_OFF_AUTO_OFF),
     DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 646e42fd27..fd6893cb43 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_iterate(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
-vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
+vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
-- 
2.26.3




* [PATCH v3 7/7] vfio/migration: Add support for switchover ack capability
  2023-05-21 15:18 [PATCH v3 0/7] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
                   ` (5 preceding siblings ...)
  2023-05-21 15:18 ` [PATCH v3 6/7] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
@ 2023-05-21 15:18 ` Avihai Horon
  2023-05-23 15:09   ` Cédric Le Goater
  6 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-05-21 15:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Loading of a VFIO device's data can take a substantial amount of time as
the device may need to allocate resources, prepare internal data
structures, etc. This can increase migration downtime, especially for
VFIO devices with a lot of resources.

To solve this, VFIO migration uAPI defines "initial bytes" as part of
its precopy data stream. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, it can be used to
transfer device metadata to pre-allocate resources in the destination.
However, for this to work we need to make sure that all initial bytes
are sent and loaded in the destination before the source VM is stopped.

Use migration switchover ack capability to make sure a VFIO device's
initial bytes are sent and loaded in the destination before the source
stops the VM and attempts to complete the migration.
This can significantly reduce migration downtime for some devices.

As precopy support and precopy initial bytes support come together in
VFIO migration, use the x-allow-pre-copy device property to control usage of
this feature as well.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst | 10 +++++++++
 include/hw/vfio/vfio-common.h |  2 ++
 hw/vfio/migration.c           | 42 ++++++++++++++++++++++++++++++++++-
 3 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index e896b2a673..e75793b76a 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -16,6 +16,13 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
 support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
 VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
+When pre-copy is supported, it's possible to further reduce downtime by
+enabling "switchover-ack" migration capability.
+VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
+and recommends that the initial bytes are sent and loaded in the destination
+before stopping the source VM. Enabling this migration capability will
+guarantee that, and thus can potentially reduce downtime even further.
+
 Note that currently VFIO migration is supported only for a single device. This
 is due to VFIO migration's lack of P2P support. However, P2P support is planned
 to be added later on.
@@ -45,6 +52,9 @@ VFIO implements the device hooks for the iterative approach as follows:
 * A ``save_live_iterate`` function that reads the VFIO device's data from the
   vendor driver during iterative pre-copy phase.
 
+* A ``switchover_ack_needed`` function that reports whether the VFIO device
+  will use the "switchover-ack" migration capability when it is enabled.
+
 * A ``save_state`` function to save the device config space if it is present.
 
 * A ``save_live_complete_precopy`` function that sets the VFIO device in
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 5ce7a01d56..b4a7eb0f9a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -69,6 +69,8 @@ typedef struct VFIOMigration {
     uint64_t precopy_init_size;
     uint64_t precopy_dirty_size;
     uint64_t mig_flags;
+    bool switchover_ack_needed;
+    bool initial_data_sent;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 418efed019..09c669c1d8 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -18,6 +18,7 @@
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "migration/migration.h"
+#include "migration/savevm.h"
 #include "migration/vmstate.h"
 #include "migration/qemu-file.h"
 #include "migration/register.h"
@@ -45,6 +46,7 @@
 #define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
 
 /*
  * This is an arbitrary size based on migration of mlx5 devices, where typically
@@ -380,6 +382,8 @@ static void vfio_save_cleanup(void *opaque)
 
     g_free(migration->data_buffer);
     migration->data_buffer = NULL;
+    migration->switchover_ack_needed = false;
+    migration->initial_data_sent = false;
     vfio_migration_cleanup(vbasedev);
     trace_vfio_save_cleanup(vbasedev->name);
 }
@@ -455,10 +459,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
     if (data_size < 0) {
         return data_size;
     }
-    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
     vfio_update_estimated_pending_data(migration, data_size);
 
+    if (migration->switchover_ack_needed && !migration->precopy_init_size &&
+        !migration->initial_data_sent) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
+        migration->initial_data_sent = true;
+    } else {
+        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+    }
+
     trace_vfio_save_iterate(vbasedev->name);
 
     /*
@@ -576,6 +587,24 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
             }
             break;
         }
+        case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
+        {
+            if (!vbasedev->migration->switchover_ack_needed) {
+                error_report("%s: Received INIT_DATA_SENT but switchover ack "
+                             "is not needed",
+                             vbasedev->name);
+                return -EINVAL;
+            }
+
+            ret = qemu_loadvm_approve_switchover();
+            if (ret) {
+                error_report(
+                    "%s: qemu_loadvm_approve_switchover failed, err=%d (%s)",
+                    vbasedev->name, ret, strerror(-ret));
+            }
+
+            return ret;
+        }
         default:
             error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
             return -EINVAL;
@@ -590,6 +619,16 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
     return ret;
 }
 
+static bool vfio_switchover_ack_needed(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    migration->switchover_ack_needed = vfio_precopy_supported(vbasedev);
+
+    return migration->switchover_ack_needed;
+}
+
 static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
@@ -602,6 +641,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .load_setup = vfio_load_setup,
     .load_cleanup = vfio_load_cleanup,
     .load_state = vfio_load_state,
+    .switchover_ack_needed = vfio_switchover_ack_needed,
 };
 
 /* ---------------------------------------------------------------------- */
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread
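
[A compact illustration of the handler contract patch 7 wires up: a
hypothetical non-VFIO device opting into switchover ack. The foo_* names and
the tag value are invented for this sketch; SaveVMHandlers.switchover_ack_needed,
qemu_loadvm_approve_switchover() and qemu_get_be64() are the real interfaces
the patch uses.]

    #include "qemu/osdep.h"
    #include "migration/savevm.h"    /* qemu_loadvm_approve_switchover() */
    #include "migration/register.h"  /* SaveVMHandlers */
    #include "migration/qemu-file.h"

    /* Hypothetical end-of-initial-data marker for device "foo" */
    #define FOO_MIG_FLAG_INIT_DATA_SENT (0xffffffffef200001ULL)

    typedef struct FooDevice {
        bool has_initial_data; /* true if the device sends initial bytes */
    } FooDevice;

    /* Source side: tell the migration core whether to wait for an ack */
    static bool foo_switchover_ack_needed(void *opaque)
    {
        FooDevice *foo = opaque;

        return foo->has_initial_data;
    }

    /* Destination side: approve switchover once initial data is loaded */
    static int foo_load_state(QEMUFile *f, void *opaque, int version_id)
    {
        uint64_t data = qemu_get_be64(f);

        if (data == FOO_MIG_FLAG_INIT_DATA_SENT) {
            return qemu_loadvm_approve_switchover();
        }
        /* ... handle the device's other migration tags here ... */
        return 0;
    }

    static const SaveVMHandlers savevm_foo_handlers = {
        .load_state = foo_load_state,
        .switchover_ack_needed = foo_switchover_ack_needed,
    };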

* Re: [PATCH v3 6/7] vfio/migration: Add VFIO migration pre-copy support
  2023-05-21 15:18 ` [PATCH v3 6/7] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
@ 2023-05-23 14:56   ` Cédric Le Goater
  2023-05-24 12:49     ` Avihai Horon
  0 siblings, 1 reply; 20+ messages in thread
From: Cédric Le Goater @ 2023-05-23 14:56 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

Hello Avihai,

On 5/21/23 17:18, Avihai Horon wrote:
> Pre-copy support allows the VFIO device data to be transferred while the
> VM is running. This helps to accommodate VFIO devices that have a large
> amount of data that needs to be transferred, and it can reduce migration
> downtime.
> 
> Pre-copy support is optional in VFIO migration protocol v2.
> Implement pre-copy of VFIO migration protocol v2 and use it for devices
> that support it. Full description of it can be found here [1].
> 
> In addition, add a new VFIO device property x-allow-pre-copy to keep
> migration compatibility to/from older QEMU versions that don't have VFIO
> pre-copy support.
> 
> [1]
> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/


Maybe simply reference Linux commit 4db52602a607 ("vfio: Extend the device
migration protocol with PRE_COPY") instead.

some comments below,


> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>   docs/devel/vfio-migration.rst |  35 +++++---
>   include/hw/vfio/vfio-common.h |   4 +
>   hw/core/machine.c             |   1 +
>   hw/vfio/common.c              |   6 +-
>   hw/vfio/migration.c           | 163 ++++++++++++++++++++++++++++++++--
>   hw/vfio/pci.c                 |   2 +
>   hw/vfio/trace-events          |   4 +-
>   7 files changed, 193 insertions(+), 22 deletions(-)
> 
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index 1b68ccf115..e896b2a673 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>   destination host. This document details how saving and restoring of VFIO
>   devices is done in QEMU.
>   
> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
> -data is transferred to the destination.
> -
> -The pre-copy phase of migration is currently not supported for VFIO devices.
> -Support for VFIO pre-copy will be added later on.
> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
> +and the stop-and-copy phase. The pre-copy phase is iterative and helps
> +accommodate VFIO devices that have a large amount of data that needs to be
> +transferred. The iterative pre-copy phase of migration allows for the guest to
> +continue whilst the VFIO device state is transferred to the destination, which
> +helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
> +support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
> +VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>   
>   Note that currently VFIO migration is supported only for a single device. This
>   is due to VFIO migration's lack of P2P support. However, P2P support is planned
> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
>   * A ``load_setup`` function that sets the VFIO device on the destination in
>     _RESUMING state.
>   
> +* A ``state_pending_estimate`` function that reports an estimate of the
> +  remaining pre-copy data that the vendor driver has yet to save for the VFIO
> +  device.
> +
>   * A ``state_pending_exact`` function that reads pending_bytes from the vendor
>     driver, which indicates the amount of data that the vendor driver has yet to
>     save for the VFIO device.
>   
> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> +  active only when the VFIO device is in pre-copy states.
> +
> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> +  vendor driver during iterative pre-copy phase.
> +
>   * A ``save_state`` function to save the device config space if it is present.
>   
>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
> @@ -111,8 +123,10 @@ Flow of state changes during Live migration
>   ===========================================
>   
>   Below is the flow of state change during live migration.
> -The values in the brackets represent the VM state, the migration state, and
> +The values in the parentheses represent the VM state, the migration state, and
>   the VFIO device state, respectively.
> +The text in the square brackets represents the flow if the VFIO device supports
> +pre-copy.
>   
>   Live migration save path
>   ------------------------
> @@ -124,11 +138,12 @@ Live migration save path
>                                     |
>                        migrate_init spawns migration_thread
>                   Migration thread then calls each device's .save_setup()
> -                       (RUNNING, _SETUP, _RUNNING)
> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>                                     |
> -                      (RUNNING, _ACTIVE, _RUNNING)
> -             If device is active, get pending_bytes by .state_pending_exact()
> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
> +      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>             If total pending_bytes >= threshold_size, call .save_live_iterate()
> +                  [Data of VFIO device for pre-copy phase is copied]
>           Iterate till total pending bytes converge and are less than threshold
>                                     |
>     On migration completion, vCPU stops and calls .save_live_complete_precopy for
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index eed244f25f..5ce7a01d56 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>       int data_fd;
>       void *data_buffer;
>       size_t data_buffer_size;
> +    uint64_t precopy_init_size;
> +    uint64_t precopy_dirty_size;
> +    uint64_t mig_flags;

It would have been cleaner to introduce VFIOMigration::mig_flags and its
update in another patch. This is minor.

>   } VFIOMigration;
>   
>   typedef struct VFIOAddressSpace {
> @@ -143,6 +146,7 @@ typedef struct VFIODevice {
>       VFIOMigration *migration;
>       Error *migration_blocker;
>       OnOffAuto pre_copy_dirty_page_tracking;
> +    bool allow_pre_copy;

same comment for this bool and the associated property, because it would
ease backports.

>       bool dirty_pages_supported;
>       bool dirty_tracking;
>   } VFIODevice;
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 07f763eb2e..50439e5cbb 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -41,6 +41,7 @@
>   
>   GlobalProperty hw_compat_8_0[] = {
>       { "migration", "multifd-flush-after-each-section", "on"},
> +    { "vfio-pci", "x-allow-pre-copy", "false" },
>   };
>   const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>   
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 78358ede27..b73086e17a 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>               }
>   
>               if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>                   return false;
>               }
>           }
> @@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>                   return false;
>               }
>   
> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
> +            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>                   continue;
>               } else {
>                   return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 235978fd68..418efed019 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>           return "STOP_COPY";
>       case VFIO_DEVICE_STATE_RESUMING:
>           return "RESUMING";
> +    case VFIO_DEVICE_STATE_PRE_COPY:
> +        return "PRE_COPY";
>       default:
>           return "UNKNOWN STATE";
>       }
> @@ -241,6 +243,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>       return 0;
>   }
>   
> +static int vfio_query_precopy_size(VFIOMigration *migration)
> +{
> +    struct vfio_precopy_info precopy = {
> +        .argsz = sizeof(precopy),
> +    };

Maybe move here:

         migration->precopy_init_size = 0;
         migration->precopy_dirty_size = 0;

since the values are reset always before calling vfio_query_precopy_size()

> +
> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
> +        return -errno;
> +    }
> +
> +    migration->precopy_init_size = precopy.initial_bytes;
> +    migration->precopy_dirty_size = precopy.dirty_bytes;
> +
> +    return 0;
> +}
> +
>   /* Returns the size of saved data on success and -errno on error */
>   static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>   {
> @@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>       data_size = read(migration->data_fd, migration->data_buffer,
>                        migration->data_buffer_size);
>       if (data_size < 0) {
> +        /* Pre-copy emptied all the device state for now */
> +        if (errno == ENOMSG) {

Could you explain this errno a little more, please? It looks like part of the API with
the VFIO PCI variant kernel driver.

> +            return 0;
> +        }
> +
>           return -errno;
>       }
>       if (data_size == 0) {
> @@ -265,6 +288,39 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>       return qemu_file_get_error(f) ?: data_size;
>   }
>   
> +static void vfio_update_estimated_pending_data(VFIOMigration *migration,
> +                                               uint64_t data_size)
> +{
> +    if (!data_size) {
> +        /*
> +         * Pre-copy emptied all the device state for now, update estimated sizes
> +         * accordingly.
> +         */
> +        migration->precopy_init_size = 0;
> +        migration->precopy_dirty_size = 0;
> +
> +        return;
> +    }
> +
> +    if (migration->precopy_init_size) {
> +        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
> +
> +        migration->precopy_init_size -= init_size;
> +        data_size -= init_size;
> +    }
> +
> +    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
> +                                         data_size);

Do we have a trace event for all these data values?

> +}
> +
> +static bool vfio_precopy_supported(VFIODevice *vbasedev)
> +{
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    return vbasedev->allow_pre_copy &&
> +           migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
> +}
> +
>   /* ---------------------------------------------------------------------- */
>   
>   static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -285,6 +341,31 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>           return -ENOMEM;
>       }
>   
> +    if (vfio_precopy_supported(vbasedev)) {
> +        int ret;
> +
> +        migration->precopy_init_size = 0;
> +        migration->precopy_dirty_size = 0;
> +
> +        switch (migration->device_state) {
> +        case VFIO_DEVICE_STATE_RUNNING:
> +            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
> +                                           VFIO_DEVICE_STATE_RUNNING);
> +            if (ret) {
> +                return ret;
> +            }
> +
> +            vfio_query_precopy_size(migration);
> +
> +            break;
> +        case VFIO_DEVICE_STATE_STOP:
> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
> +            break;
> +        default:
> +            return -EINVAL;
> +        }
> +    }
> +
>       trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
>   
>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> @@ -303,22 +384,36 @@ static void vfio_save_cleanup(void *opaque)
>       trace_vfio_save_cleanup(vbasedev->name);
>   }
>   
> +static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
> +                                        uint64_t *can_postcopy)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
> +        return;
> +    }
> +
> +    *must_precopy +=
> +        migration->precopy_init_size + migration->precopy_dirty_size;
> +
> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> +                                      *can_postcopy,
> +                                      migration->precopy_init_size,
> +                                      migration->precopy_dirty_size);


OK, we have one :) I wonder if we should also update trace_vfio_save_iterate()
with some values.

> +}
> +
>   /*
>    * Migration size of VFIO devices can be as little as a few KBs or as big as
>    * many GBs. This value should be big enough to cover the worst case.
>    */
>   #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>   
> -/*
> - * Only exact function is implemented and not estimate function. The reason is
> - * that during pre-copy phase of migration the estimate function is called
> - * repeatedly while pending RAM size is over the threshold, thus migration
> - * can't converge and querying the VFIO device pending data size is useless.
> - */
>   static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>                                        uint64_t *can_postcopy)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>       uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>   
>       /*
> @@ -328,8 +423,49 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>       *must_precopy += stop_copy_size;
>   
> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> +        migration->precopy_init_size = 0;
> +        migration->precopy_dirty_size = 0;
> +        vfio_query_precopy_size(migration);
> +
> +        *must_precopy +=
> +            migration->precopy_init_size + migration->precopy_dirty_size;
> +    }
> +
>       trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> -                                   stop_copy_size);
> +                                   stop_copy_size, migration->precopy_init_size,
> +                                   migration->precopy_dirty_size);
> +}
> +
> +static bool vfio_is_active_iterate(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    ssize_t data_size;
> +
> +    data_size = vfio_save_block(f, migration);
> +    if (data_size < 0) {
> +        return data_size;
> +    }
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    vfio_update_estimated_pending_data(migration, data_size);
> +
> +    trace_vfio_save_iterate(vbasedev->name);
> +
> +    /*
> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
> +     * Return 1 so following handlers will not be potentially blocked.

Can this condition be detected to warn the user?

> +     */
> +    return 1;
>   }
>   
>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> @@ -338,7 +474,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>       ssize_t data_size;
>       int ret;
>   
> -    /* We reach here with device state STOP only */
> +    /* We reach here with device state STOP or STOP_COPY only */
>       ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>                                      VFIO_DEVICE_STATE_STOP);
>       if (ret) {
> @@ -457,7 +593,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>   static const SaveVMHandlers savevm_vfio_handlers = {
>       .save_setup = vfio_save_setup,
>       .save_cleanup = vfio_save_cleanup,
> +    .state_pending_estimate = vfio_state_pending_estimate,
>       .state_pending_exact = vfio_state_pending_exact,
> +    .is_active_iterate = vfio_is_active_iterate,
> +    .save_live_iterate = vfio_save_iterate,
>       .save_live_complete_precopy = vfio_save_complete_precopy,
>       .save_state = vfio_save_state,
>       .load_setup = vfio_load_setup,
> @@ -470,13 +609,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>   static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>   {
>       VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>       enum vfio_device_mig_state new_state;
>       int ret;
>   
>       if (running) {
>           new_state = VFIO_DEVICE_STATE_RUNNING;
>       } else {
> -        new_state = VFIO_DEVICE_STATE_STOP;
> +        new_state =
> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
> +             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
> +                VFIO_DEVICE_STATE_STOP_COPY :
> +                VFIO_DEVICE_STATE_STOP;
>       }
>   
>       /*
> @@ -603,6 +747,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>       migration->vbasedev = vbasedev;
>       migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>       migration->data_fd = -1;
> +    migration->mig_flags = mig_flags;
>   
>       vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
>   
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index bf27a39905..72f30ce09f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>       DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
>                               vbasedev.pre_copy_dirty_page_tracking,
>                               ON_OFF_AUTO_ON),
> +    DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
> +                     vbasedev.allow_pre_copy, true),
>       DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>                               display, ON_OFF_AUTO_OFF),
>       DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 646e42fd27..fd6893cb43 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>   vfio_save_cleanup(const char *name) " (%s)"
>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>   vfio_save_device_config_state(const char *name) " (%s)"
> +vfio_save_iterate(const char *name) " (%s)"
>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
> +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>   vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"



^ permalink raw reply	[flat|nested] 20+ messages in thread
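
[Following up on the trace suggestion above: one possible shape, mirroring the
values the pending handlers already trace, would be to extend the call site as
below, with the matching hw/vfio/trace-events line gaining the two uint64_t
parameters. Illustrative only, not part of the posted series.]

    /* In vfio_save_iterate(), after vfio_update_estimated_pending_data(): */
    trace_vfio_save_iterate(vbasedev->name, migration->precopy_init_size,
                            migration->precopy_dirty_size);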

* Re: [PATCH v3 7/7] vfio/migration: Add support for switchover ack capability
  2023-05-21 15:18 ` [PATCH v3 7/7] vfio/migration: Add support for switchover ack capability Avihai Horon
@ 2023-05-23 15:09   ` Cédric Le Goater
  2023-05-24 12:52     ` Avihai Horon
  0 siblings, 1 reply; 20+ messages in thread
From: Cédric Le Goater @ 2023-05-23 15:09 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On 5/21/23 17:18, Avihai Horon wrote:
> Loading of a VFIO device's data can take a substantial amount of time as
> the device may need to allocate resources, prepare internal data
> structures, etc. This can increase migration downtime, especially for
> VFIO devices with a lot of resources.
> 
> To solve this, VFIO migration uAPI defines "initial bytes" as part of
> its precopy data stream. Initial bytes can be used in various ways to
> improve VFIO migration performance. For example, it can be used to
> transfer device metadata to pre-allocate resources in the destination.
> However, for this to work we need to make sure that all initial bytes
> are sent and loaded in the destination before the source VM is stopped.
> 
> Use migration switchover ack capability to make sure a VFIO device's
> initial bytes are sent and loaded in the destination before the source
> stops the VM and attempts to complete the migration.
> This can significantly reduce migration downtime for some devices.
> 
> As precopy support and precopy initial bytes support come together in



> VFIO migration, use the x-allow-pre-copy device property to control usage of
> this feature as well.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

> ---
>   docs/devel/vfio-migration.rst | 10 +++++++++
>   include/hw/vfio/vfio-common.h |  2 ++
>   hw/vfio/migration.c           | 42 ++++++++++++++++++++++++++++++++++-
>   3 files changed, 53 insertions(+), 1 deletion(-)
> 
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index e896b2a673..e75793b76a 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -16,6 +16,13 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>   support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>   VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>   
> +When pre-copy is supported, it's possible to further reduce downtime by
> +enabling "switchover-ack" migration capability.
> +VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
> +and recommends that the initial bytes are sent and loaded in the destination
> +before stopping the source VM. Enabling this migration capability will
> +guarantee that, and thus can potentially reduce downtime even further.
> +
>   Note that currently VFIO migration is supported only for a single device. This
>   is due to VFIO migration's lack of P2P support. However, P2P support is planned
>   to be added later on.
> @@ -45,6 +52,9 @@ VFIO implements the device hooks for the iterative approach as follows:
>   * A ``save_live_iterate`` function that reads the VFIO device's data from the
>     vendor driver during iterative pre-copy phase.
>   
> +* A ``switchover_ack_needed`` function that reports whether the VFIO device
> +  will use the "switchover-ack" migration capability when it is enabled.
> +
>   * A ``save_state`` function to save the device config space if it is present.
>   
>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 5ce7a01d56..b4a7eb0f9a 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -69,6 +69,8 @@ typedef struct VFIOMigration {
>       uint64_t precopy_init_size;
>       uint64_t precopy_dirty_size;
>       uint64_t mig_flags;
> +    bool switchover_ack_needed;
> +    bool initial_data_sent;
>   } VFIOMigration;
>   
>   typedef struct VFIOAddressSpace {
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 418efed019..09c669c1d8 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -18,6 +18,7 @@
>   #include "sysemu/runstate.h"
>   #include "hw/vfio/vfio-common.h"
>   #include "migration/migration.h"
> +#include "migration/savevm.h"
>   #include "migration/vmstate.h"
>   #include "migration/qemu-file.h"
>   #include "migration/register.h"
> @@ -45,6 +46,7 @@
>   #define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>   #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>   #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
>   
>   /*
>    * This is an arbitrary size based on migration of mlx5 devices, where typically
> @@ -380,6 +382,8 @@ static void vfio_save_cleanup(void *opaque)
>   
>       g_free(migration->data_buffer);
>       migration->data_buffer = NULL;
> +    migration->switchover_ack_needed = false;
> +    migration->initial_data_sent = false;
>       vfio_migration_cleanup(vbasedev);
>       trace_vfio_save_cleanup(vbasedev->name);
>   }
> @@ -455,10 +459,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>       if (data_size < 0) {
>           return data_size;
>       }
> -    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>   
>       vfio_update_estimated_pending_data(migration, data_size);
>   
> +    if (migration->switchover_ack_needed && !migration->precopy_init_size &&
> +        !migration->initial_data_sent) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
> +        migration->initial_data_sent = true;
> +    } else {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +    }
> +
>       trace_vfio_save_iterate(vbasedev->name);
>   
>       /*
> @@ -576,6 +587,24 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>               }
>               break;
>           }
> +        case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
> +        {
> +            if (!vbasedev->migration->switchover_ack_needed) {
> +                error_report("%s: Received INIT_DATA_SENT but switchover ack "
> +                             "is not needed",
> +                             vbasedev->name);

This extra line could be avoided. This is minor.

> +                return -EINVAL;
> +            }
> +
> +            ret = qemu_loadvm_approve_switchover();
> +            if (ret) {
> +                error_report(
> +                    "%s: qemu_loadvm_approve_switchover failed, err=%d (%s)",
> +                    vbasedev->name, ret, strerror(-ret));
> +            }
> +
> +            return ret;
> +        }
>           default:
>               error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
>               return -EINVAL;
> @@ -590,6 +619,16 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>       return ret;
>   }
>   
> +static bool vfio_switchover_ack_needed(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    migration->switchover_ack_needed = vfio_precopy_supported(vbasedev);
> +
> +    return migration->switchover_ack_needed;
> +}
> +
>   static const SaveVMHandlers savevm_vfio_handlers = {
>       .save_setup = vfio_save_setup,
>       .save_cleanup = vfio_save_cleanup,
> @@ -602,6 +641,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .load_setup = vfio_load_setup,
>       .load_cleanup = vfio_load_cleanup,
>       .load_state = vfio_load_state,
> +    .switchover_ack_needed = vfio_switchover_ack_needed,
>   };
>   
>   /* ---------------------------------------------------------------------- */



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 6/7] vfio/migration: Add VFIO migration pre-copy support
  2023-05-23 14:56   ` Cédric Le Goater
@ 2023-05-24 12:49     ` Avihai Horon
  2023-05-24 15:38       ` Cédric Le Goater
  0 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-05-24 12:49 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 23/05/2023 17:56, Cédric Le Goater wrote:
>
> Hello Avihai,
>
> On 5/21/23 17:18, Avihai Horon wrote:
>> Pre-copy support allows the VFIO device data to be transferred while the
>> VM is running. This helps to accommodate VFIO devices that have a large
>> amount of data that needs to be transferred, and it can reduce migration
>> downtime.
>>
>> Pre-copy support is optional in VFIO migration protocol v2.
>> Implement pre-copy of VFIO migration protocol v2 and use it for devices
>> that support it. Full description of it can be found here [1].
>>
>> In addition, add a new VFIO device property x-allow-pre-copy to keep
>> migration compatibility to/from older QEMU versions that don't have VFIO
>> pre-copy support.
>>
>> [1]
>> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
>
>
> Maybe simply reference Linux commit 4db52602a607 ("vfio: Extend the device
> migration protocol with PRE_COPY") instead.

Sure, I will change it.

>
> some comments below,
>
>
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   docs/devel/vfio-migration.rst |  35 +++++---
>>   include/hw/vfio/vfio-common.h |   4 +
>>   hw/core/machine.c             |   1 +
>>   hw/vfio/common.c              |   6 +-
>>   hw/vfio/migration.c           | 163 ++++++++++++++++++++++++++++++++--
>>   hw/vfio/pci.c                 |   2 +
>>   hw/vfio/trace-events          |   4 +-
>>   7 files changed, 193 insertions(+), 22 deletions(-)
>>
>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>> index 1b68ccf115..e896b2a673 100644
>> --- a/docs/devel/vfio-migration.rst
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>>   destination host. This document details how saving and restoring of VFIO
>>   devices is done in QEMU.
>>
>> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
>> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
>> -data is transferred to the destination.
>> -
>> -The pre-copy phase of migration is currently not supported for VFIO devices.
>> -Support for VFIO pre-copy will be added later on.
>> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
>> +and the stop-and-copy phase. The pre-copy phase is iterative and helps
>> +accommodate VFIO devices that have a large amount of data that needs to be
>> +transferred. The iterative pre-copy phase of migration allows for the guest to
>> +continue whilst the VFIO device state is transferred to the destination, which
>> +helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>> +support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>> +VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>
>>   Note that currently VFIO migration is supported only for a single device. This
>>   is due to VFIO migration's lack of P2P support. However, P2P support is planned
>> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
>>   * A ``load_setup`` function that sets the VFIO device on the destination in
>>     _RESUMING state.
>>
>> +* A ``state_pending_estimate`` function that reports an estimate of the
>> +  remaining pre-copy data that the vendor driver has yet to save for the VFIO
>> +  device.
>> +
>>   * A ``state_pending_exact`` function that reads pending_bytes from the vendor
>>     driver, which indicates the amount of data that the vendor driver has yet to
>>     save for the VFIO device.
>>
>> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
>> +  active only when the VFIO device is in pre-copy states.
>> +
>> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
>> +  vendor driver during iterative pre-copy phase.
>> +
>>   * A ``save_state`` function to save the device config space if it is present.
>>
>>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
>> @@ -111,8 +123,10 @@ Flow of state changes during Live migration
>>   ===========================================
>>
>>   Below is the flow of state change during live migration.
>> -The values in the brackets represent the VM state, the migration state, and
>> +The values in the parentheses represent the VM state, the migration state, and
>>   the VFIO device state, respectively.
>> +The text in the square brackets represents the flow if the VFIO device supports
>> +pre-copy.
>>
>>   Live migration save path
>>   ------------------------
>> @@ -124,11 +138,12 @@ Live migration save path
>>                                     |
>>                        migrate_init spawns migration_thread
>>                   Migration thread then calls each device's .save_setup()
>> -                       (RUNNING, _SETUP, _RUNNING)
>> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>>                                     |
>> -                      (RUNNING, _ACTIVE, _RUNNING)
>> -             If device is active, get pending_bytes by .state_pending_exact()
>> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>> +      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>>             If total pending_bytes >= threshold_size, call .save_live_iterate()
>> +                  [Data of VFIO device for pre-copy phase is copied]
>>           Iterate till total pending bytes converge and are less than threshold
>>                                     |
>>     On migration completion, vCPU stops and calls .save_live_complete_precopy for
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index eed244f25f..5ce7a01d56 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>>       int data_fd;
>>       void *data_buffer;
>>       size_t data_buffer_size;
>> +    uint64_t precopy_init_size;
>> +    uint64_t precopy_dirty_size;
>> +    uint64_t mig_flags;
>
> It would have been cleaner to introduce VFIOMigration::mig_flags and its
> update in another patch. This is minor.

Sure, I will split it.

>
>
>>   } VFIOMigration;
>>
>>   typedef struct VFIOAddressSpace {
>> @@ -143,6 +146,7 @@ typedef struct VFIODevice {
>>       VFIOMigration *migration;
>>       Error *migration_blocker;
>>       OnOffAuto pre_copy_dirty_page_tracking;
>> +    bool allow_pre_copy;
>
> same comment for this bool and the associated property, because it would
> ease backports.

Sure.
Just for general knowledge, can you explain how this could ease backports?

>
>
>>       bool dirty_pages_supported;
>>       bool dirty_tracking;
>>   } VFIODevice;
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 07f763eb2e..50439e5cbb 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -41,6 +41,7 @@
>>
>>   GlobalProperty hw_compat_8_0[] = {
>>       { "migration", "multifd-flush-after-each-section", "on"},
>> +    { "vfio-pci", "x-allow-pre-copy", "false" },
>>   };
>>   const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 78358ede27..b73086e17a 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>               }
>>
>>               if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
>> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>>                   return false;
>>               }
>>           }
>> @@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>                   return false;
>>               }
>>
>> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>> +            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>>                   continue;
>>               } else {
>>                   return false;
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 235978fd68..418efed019 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>>           return "STOP_COPY";
>>       case VFIO_DEVICE_STATE_RESUMING:
>>           return "RESUMING";
>> +    case VFIO_DEVICE_STATE_PRE_COPY:
>> +        return "PRE_COPY";
>>       default:
>>           return "UNKNOWN STATE";
>>       }
>> @@ -241,6 +243,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>>       return 0;
>>   }
>>
>> +static int vfio_query_precopy_size(VFIOMigration *migration)
>> +{
>> +    struct vfio_precopy_info precopy = {
>> +        .argsz = sizeof(precopy),
>> +    };
>
> Maybe move here:
>
>         migration->precopy_init_size = 0;
>         migration->precopy_dirty_size = 0;
>
> since the values are reset always before calling vfio_query_precopy_size()

OK.
I will also reset these values in vfio_save_cleanup() so there won't be 
stale values in case migration is cancelled or fails.

>
>
>> +
>> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
>> +        return -errno;
>> +    }
>> +
>> +    migration->precopy_init_size = precopy.initial_bytes;
>> +    migration->precopy_dirty_size = precopy.dirty_bytes;
>> +
>> +    return 0;
>> +}
>> +
>>   /* Returns the size of saved data on success and -errno on error */
>>   static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>   {
>> @@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>       data_size = read(migration->data_fd, migration->data_buffer,
>>                        migration->data_buffer_size);
>>       if (data_size < 0) {
>> +        /* Pre-copy emptied all the device state for now */
>> +        if (errno == ENOMSG) {
>
> Could you explain this errno a little more, please? It looks like part of the
> API with the VFIO PCI variant kernel driver.

Yes, it's explained in the precopy uAPI [1].
Do you want to change the comment to something like the following?
/*
  * ENOMSG indicates that the migration data_fd has reached a temporary
  * "end of stream", i.e. both initial_bytes and dirty_bytes are zero.
  * More data may be available later in future reads.
  */

[1] 
https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vfio.h#L1084
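
[To make that contract concrete, a standalone sketch of the read-side handling
the comment describes; plain C, independent of the patch's vfio_save_block():]

    #include <errno.h>
    #include <unistd.h>

    /*
     * Read one chunk of pre-copy device state from a VFIO migration data_fd.
     * Returns bytes read, 0 when the stream is temporarily drained (ENOMSG,
     * i.e. initial_bytes and dirty_bytes are both zero for now), or -errno
     * on a real error.
     */
    static ssize_t read_precopy_chunk(int data_fd, void *buf, size_t len)
    {
        ssize_t n = read(data_fd, buf, len);

        if (n < 0) {
            return errno == ENOMSG ? 0 : -errno;
        }

        return n; /* n == 0: end of data for the current device state */
    }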

>
>> +            return 0;
>> +        }
>> +
>>           return -errno;
>>       }
>>       if (data_size == 0) {
>> @@ -265,6 +288,39 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>       return qemu_file_get_error(f) ?: data_size;
>>   }
>>
>> +static void vfio_update_estimated_pending_data(VFIOMigration *migration,
>> +                                               uint64_t data_size)
>> +{
>> +    if (!data_size) {
>> +        /*
>> +         * Pre-copy emptied all the device state for now, update estimated sizes
>> +         * accordingly.
>> +         */
>> +        migration->precopy_init_size = 0;
>> +        migration->precopy_dirty_size = 0;
>> +
>> +        return;
>> +    }
>> +
>> +    if (migration->precopy_init_size) {
>> +        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
>> +
>> +        migration->precopy_init_size -= init_size;
>> +        data_size -= init_size;
>> +    }
>> +
>> +    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
>> +                                         data_size);
>
> Do we have a trace event for all these data values?
>
>> +}
>> +
>> +static bool vfio_precopy_supported(VFIODevice *vbasedev)
>> +{
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    return vbasedev->allow_pre_copy &&
>> +           migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -285,6 +341,31 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>           return -ENOMEM;
>>       }
>>
>> +    if (vfio_precopy_supported(vbasedev)) {
>> +        int ret;
>> +
>> +        migration->precopy_init_size = 0;
>> +        migration->precopy_dirty_size = 0;
>> +
>> +        switch (migration->device_state) {
>> +        case VFIO_DEVICE_STATE_RUNNING:
>> +            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
>> +                                           VFIO_DEVICE_STATE_RUNNING);
>> +            if (ret) {
>> +                return ret;
>> +            }
>> +
>> +            vfio_query_precopy_size(migration);
>> +
>> +            break;
>> +        case VFIO_DEVICE_STATE_STOP:
>> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
>> +            break;
>> +        default:
>> +            return -EINVAL;
>> +        }
>> +    }
>> +
>>       trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
>>
>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> @@ -303,22 +384,36 @@ static void vfio_save_cleanup(void *opaque)
>>       trace_vfio_save_cleanup(vbasedev->name);
>>   }
>>
>> +static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
>> +                                        uint64_t *can_postcopy)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
>> +        return;
>> +    }
>> +
>> +    *must_precopy +=
>> +        migration->precopy_init_size + migration->precopy_dirty_size;
>> +
>> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
>> +                                      *can_postcopy,
>> +                                      migration->precopy_init_size,
>> +                                      migration->precopy_dirty_size);
>
>
> OK, we have one :) I wonder if we should also update trace_vfio_save_iterate()
> with some values.

Hmm, yes, I guess it wouldn't hurt.

>
>> +}
>> +
>>   /*
>>    * Migration size of VFIO devices can be as little as a few KBs or as big as
>>    * many GBs. This value should be big enough to cover the worst case.
>>    */
>>   #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>
>> -/*
>> - * Only exact function is implemented and not estimate function. The reason is
>> - * that during pre-copy phase of migration the estimate function is called
>> - * repeatedly while pending RAM size is over the threshold, thus migration
>> - * can't converge and querying the VFIO device pending data size is useless.
>> - */
>>   static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>                                        uint64_t *can_postcopy)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>>       uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>
>>       /*
>> @@ -328,8 +423,49 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>       *must_precopy += stop_copy_size;
>>
>> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>> +        migration->precopy_init_size = 0;
>> +        migration->precopy_dirty_size = 0;
>> +        vfio_query_precopy_size(migration);
>> +
>> +        *must_precopy +=
>> +            migration->precopy_init_size + migration->precopy_dirty_size;
>> +    }
>> +
>>       trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
>> -                                   stop_copy_size);
>> +                                   stop_copy_size, migration->precopy_init_size,
>> +                                   migration->precopy_dirty_size);
>> +}
>> +
>> +static bool vfio_is_active_iterate(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
>> +}
>> +
>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    ssize_t data_size;
>> +
>> +    data_size = vfio_save_block(f, migration);
>> +    if (data_size < 0) {
>> +        return data_size;
>> +    }
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    vfio_update_estimated_pending_data(migration, data_size);
>> +
>> +    trace_vfio_save_iterate(vbasedev->name);
>> +
>> +    /*
>> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
>> +     * Return 1 so following handlers will not be potentially blocked.
>
> Can this condition be detected to warn the user?

I don't think so; it depends on the kernel driver implementation.

Thanks!

>
>
>> +     */
>> +    return 1;
>>   }
>>
>>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> @@ -338,7 +474,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>       ssize_t data_size;
>>       int ret;
>>
>> -    /* We reach here with device state STOP only */
>> +    /* We reach here with device state STOP or STOP_COPY only */
>>       ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>                                      VFIO_DEVICE_STATE_STOP);
>>       if (ret) {
>> @@ -457,7 +593,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>   static const SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>> +    .state_pending_estimate = vfio_state_pending_estimate,
>>       .state_pending_exact = vfio_state_pending_exact,
>> +    .is_active_iterate = vfio_is_active_iterate,
>> +    .save_live_iterate = vfio_save_iterate,
>>       .save_live_complete_precopy = vfio_save_complete_precopy,
>>       .save_state = vfio_save_state,
>>       .load_setup = vfio_load_setup,
>> @@ -470,13 +609,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>>   static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>>       enum vfio_device_mig_state new_state;
>>       int ret;
>>
>>       if (running) {
>>           new_state = VFIO_DEVICE_STATE_RUNNING;
>>       } else {
>> -        new_state = VFIO_DEVICE_STATE_STOP;
>> +        new_state =
>> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
>> +             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
>> +                VFIO_DEVICE_STATE_STOP_COPY :
>> +                VFIO_DEVICE_STATE_STOP;
>>       }
>>
>>       /*
>> @@ -603,6 +747,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>       migration->vbasedev = vbasedev;
>>       migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>       migration->data_fd = -1;
>> +    migration->mig_flags = mig_flags;
>>
>>       vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index bf27a39905..72f30ce09f 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>>       DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
>>                               vbasedev.pre_copy_dirty_page_tracking,
>>                               ON_OFF_AUTO_ON),
>> +    DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
>> +                     vbasedev.allow_pre_copy, true),
>>       DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>>                               display, ON_OFF_AUTO_OFF),
>>       DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 646e42fd27..fd6893cb43 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>>   vfio_save_cleanup(const char *name) " (%s)"
>>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>>   vfio_save_device_config_state(const char *name) " (%s)"
>> +vfio_save_iterate(const char *name) " (%s)"
>>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
>> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
>> +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>> +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>   vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>



* Re: [PATCH v3 7/7] vfio/migration: Add support for switchover ack capability
  2023-05-23 15:09   ` Cédric Le Goater
@ 2023-05-24 12:52     ` Avihai Horon
  0 siblings, 0 replies; 20+ messages in thread
From: Avihai Horon @ 2023-05-24 12:52 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 23/05/2023 18:09, Cédric Le Goater wrote:
> On 5/21/23 17:18, Avihai Horon wrote:
>> Loading of a VFIO device's data can take a substantial amount of time as
>> the device may need to allocate resources, prepare internal data
>> structures, etc. This can increase migration downtime, especially for
>> VFIO devices with a lot of resources.
>>
>> To solve this, VFIO migration uAPI defines "initial bytes" as part of
>> its precopy data stream. Initial bytes can be used in various ways to
>> improve VFIO migration performance. For example, it can be used to
>> transfer device metadata to pre-allocate resources in the destination.
>> However, for this to work we need to make sure that all initial bytes
>> are sent and loaded in the destination before the source VM is stopped.
>>
>> Use migration switchover ack capability to make sure a VFIO device's
>> initial bytes are sent and loaded in the destination before the source
>> stops the VM and attempts to complete the migration.
>> This can significantly reduce migration downtime for some devices.
>>
>> As precopy support and precopy initial bytes support come together in
>
>
>
>> VFIO migration, use x-allow-pre-copy device property to control usage of
>> this feature as well.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>
>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
>
>> ---
>>   docs/devel/vfio-migration.rst | 10 +++++++++
>>   include/hw/vfio/vfio-common.h |  2 ++
>>   hw/vfio/migration.c           | 42 ++++++++++++++++++++++++++++++++++-
>>   3 files changed, 53 insertions(+), 1 deletion(-)
>>
>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>> index e896b2a673..e75793b76a 100644
>> --- a/docs/devel/vfio-migration.rst
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -16,6 +16,13 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>>   support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>   VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>
>> +When pre-copy is supported, it's possible to further reduce downtime by
>> +enabling "switchover-ack" migration capability.
>> +VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
>> +and recommends that the initial bytes are sent and loaded in the destination
>> +before stopping the source VM. Enabling this migration capability will
>> +guarantee that and thus, can potentially reduce downtime even further.
>> +
>>   Note that currently VFIO migration is supported only for a single device. This
>>   is due to VFIO migration's lack of P2P support. However, P2P support is planned
>>   to be added later on.
>> @@ -45,6 +52,9 @@ VFIO implements the device hooks for the iterative approach as follows:
>>   * A ``save_live_iterate`` function that reads the VFIO device's data from the
>>     vendor driver during iterative pre-copy phase.
>>
>> +* A ``switchover_ack_needed`` function that checks if the VFIO device uses
>> +  "switchover-ack" migration capability when this capability is enabled.
>> +
>>   * A ``save_state`` function to save the device config space if it is present.
>>
>>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 5ce7a01d56..b4a7eb0f9a 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -69,6 +69,8 @@ typedef struct VFIOMigration {
>>       uint64_t precopy_init_size;
>>       uint64_t precopy_dirty_size;
>>       uint64_t mig_flags;
>> +    bool switchover_ack_needed;
>> +    bool initial_data_sent;
>>   } VFIOMigration;
>>
>>   typedef struct VFIOAddressSpace {
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 418efed019..09c669c1d8 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -18,6 +18,7 @@
>>   #include "sysemu/runstate.h"
>>   #include "hw/vfio/vfio-common.h"
>>   #include "migration/migration.h"
>> +#include "migration/savevm.h"
>>   #include "migration/vmstate.h"
>>   #include "migration/qemu-file.h"
>>   #include "migration/register.h"
>> @@ -45,6 +46,7 @@
>>   #define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xffffffffef100002ULL)
>>   #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
>>   #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
>> +#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
>>
>>   /*
>>    * This is an arbitrary size based on migration of mlx5 devices, where typically
>> @@ -380,6 +382,8 @@ static void vfio_save_cleanup(void *opaque)
>>
>>       g_free(migration->data_buffer);
>>       migration->data_buffer = NULL;
>> +    migration->switchover_ack_needed = false;
>> +    migration->initial_data_sent = false;
>>       vfio_migration_cleanup(vbasedev);
>>       trace_vfio_save_cleanup(vbasedev->name);
>>   }
>> @@ -455,10 +459,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>       if (data_size < 0) {
>>           return data_size;
>>       }
>> -    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>
>>       vfio_update_estimated_pending_data(migration, data_size);
>>
>> +    if (migration->switchover_ack_needed && !migration->precopy_init_size &&
>> +        !migration->initial_data_sent) {
>> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
>> +        migration->initial_data_sent = true;
>> +    } else {
>> +        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +    }
>> +
>>       trace_vfio_save_iterate(vbasedev->name);
>>
>>       /*
>> @@ -576,6 +587,24 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>               }
>>               break;
>>           }
>> +        case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
>> +        {
>> +            if (!vbasedev->migration->switchover_ack_needed) {
>> +                error_report("%s: Received INIT_DATA_SENT but switchover ack "
>> +                             "is not needed",
>> +                             vbasedev->name);
>
> This extra line could be avoided. This is minor.

Ah, right, I will remove it in the next version.
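
i.e. collapse it to something like:

                error_report("%s: Received INIT_DATA_SENT but switchover ack "
                             "is not needed", vbasedev->name);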

Thanks!

>
>
>> +                return -EINVAL;
>> +            }
>> +
>> +            ret = qemu_loadvm_approve_switchover();
>> +            if (ret) {
>> +                error_report(
>> +                    "%s: qemu_loadvm_approve_switchover failed, err=%d (%s)",
>> +                    vbasedev->name, ret, strerror(-ret));
>> +            }
>> +
>> +            return ret;
>> +        }
>>           default:
>>               error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
>>               return -EINVAL;
>> @@ -590,6 +619,16 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>       return ret;
>>   }
>>
>> +static bool vfio_switchover_ack_needed(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    migration->switchover_ack_needed = vfio_precopy_supported(vbasedev);
>> +
>> +    return migration->switchover_ack_needed;
>> +}
>> +
>>   static const SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>> @@ -602,6 +641,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>>       .load_setup = vfio_load_setup,
>>       .load_cleanup = vfio_load_cleanup,
>>       .load_state = vfio_load_state,
>> +    .switchover_ack_needed = vfio_switchover_ack_needed,
>>   };
>>
>>   /* ---------------------------------------------------------------------- */
>



* Re: [PATCH v3 6/7] vfio/migration: Add VFIO migration pre-copy support
  2023-05-24 12:49     ` Avihai Horon
@ 2023-05-24 15:38       ` Cédric Le Goater
  2023-05-25  9:24         ` Avihai Horon
  0 siblings, 1 reply; 20+ messages in thread
From: Cédric Le Goater @ 2023-05-24 15:38 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

Hello Avihai,

On 5/24/23 14:49, Avihai Horon wrote:
> 
> On 23/05/2023 17:56, Cédric Le Goater wrote:
>> Hello Avihai,
>>
>> On 5/21/23 17:18, Avihai Horon wrote:
>>> Pre-copy support allows the VFIO device data to be transferred while the
>>> VM is running. This helps to accommodate VFIO devices that have a large
>>> amount of data that needs to be transferred, and it can reduce migration
>>> downtime.
>>>
>>> Pre-copy support is optional in VFIO migration protocol v2.
>>> Implement pre-copy of VFIO migration protocol v2 and use it for devices
>>> that support it. Full description of it can be found here [1].
>>>
>>> In addition, add a new VFIO device property x-allow-pre-copy to keep
>>> migration compatibility to/from older QEMU versions that don't have VFIO
>>> pre-copy support.
>>>
>>> [1]
>>> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
>>
>>
>> Maybe simply reference Linux commit 4db52602a607 ("vfio: Extend the device
>> migration protocol with PRE_COPY") instead.
> 
> Sure, I will change it.
> 
>>
>> some comments below,
>>
>>
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>>   docs/devel/vfio-migration.rst |  35 +++++---
>>>   include/hw/vfio/vfio-common.h |   4 +
>>>   hw/core/machine.c             |   1 +
>>>   hw/vfio/common.c              |   6 +-
>>>   hw/vfio/migration.c           | 163 ++++++++++++++++++++++++++++++++--
>>>   hw/vfio/pci.c                 |   2 +
>>>   hw/vfio/trace-events          |   4 +-
>>>   7 files changed, 193 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>>> index 1b68ccf115..e896b2a673 100644
>>> --- a/docs/devel/vfio-migration.rst
>>> +++ b/docs/devel/vfio-migration.rst
>>> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>>>   destination host. This document details how saving and restoring of VFIO
>>>   devices is done in QEMU.
>>>
>>> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
>>> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
>>> -data is transferred to the destination.
>>> -
>>> -The pre-copy phase of migration is currently not supported for VFIO devices.
>>> -Support for VFIO pre-copy will be added later on.
>>> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
>>> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
>>> +accommodate VFIO devices that have a large amount of data that needs to be
>>> +transferred. The iterative pre-copy phase of migration allows for the guest to
>>> +continue whilst the VFIO device state is transferred to the destination, this
>>> +helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>>> +support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>> +VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>>
>>>   Note that currently VFIO migration is supported only for a single device. This
>>>   is due to VFIO migration's lack of P2P support. However, P2P support is planned
>>> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
>>>   * A ``load_setup`` function that sets the VFIO device on the destination in
>>>     _RESUMING state.
>>>
>>> +* A ``state_pending_estimate`` function that reports an estimate of the
>>> +  remaining pre-copy data that the vendor driver has yet to save for the VFIO
>>> +  device.
>>> +
>>>   * A ``state_pending_exact`` function that reads pending_bytes from the vendor
>>>     driver, which indicates the amount of data that the vendor driver has yet to
>>>     save for the VFIO device.
>>>
>>> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
>>> +  active only when the VFIO device is in pre-copy states.
>>> +
>>> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
>>> +  vendor driver during iterative pre-copy phase.
>>> +
>>>   * A ``save_state`` function to save the device config space if it is present.
>>>
>>>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
>>> @@ -111,8 +123,10 @@ Flow of state changes during Live migration
>>>   ===========================================
>>>
>>>   Below is the flow of state change during live migration.
>>> -The values in the brackets represent the VM state, the migration state, and
>>> +The values in the parentheses represent the VM state, the migration state, and
>>>   the VFIO device state, respectively.
>>> +The text in the square brackets represents the flow if the VFIO device supports
>>> +pre-copy.
>>>
>>>   Live migration save path
>>>   ------------------------
>>> @@ -124,11 +138,12 @@ Live migration save path
>>>                                     |
>>>                        migrate_init spawns migration_thread
>>>                   Migration thread then calls each device's .save_setup()
>>> -                       (RUNNING, _SETUP, _RUNNING)
>>> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>>>                                     |
>>> -                      (RUNNING, _ACTIVE, _RUNNING)
>>> -             If device is active, get pending_bytes by .state_pending_exact()
>>> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>>> +      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>>>             If total pending_bytes >= threshold_size, call .save_live_iterate()
>>> +                  [Data of VFIO device for pre-copy phase is copied]
>>>           Iterate till total pending bytes converge and are less than threshold
>>>                                     |
>>>     On migration completion, vCPU stops and calls .save_live_complete_precopy for
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>> index eed244f25f..5ce7a01d56 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>>>       int data_fd;
>>>       void *data_buffer;
>>>       size_t data_buffer_size;
>>> +    uint64_t precopy_init_size;
>>> +    uint64_t precopy_dirty_size;
>>> +    uint64_t mig_flags;
>>
>> It would have been cleaner to introduce VFIOMigration::mig_flags and its
>> update in another patch. This is minor.
> 
> Sure, I will split it.
> 
>>
>>
>>>   } VFIOMigration;
>>>
>>>   typedef struct VFIOAddressSpace {
>>> @@ -143,6 +146,7 @@ typedef struct VFIODevice {
>>>       VFIOMigration *migration;
>>>       Error *migration_blocker;
>>>       OnOffAuto pre_copy_dirty_page_tracking;
>>> +    bool allow_pre_copy;
>>
>> same comment for this bool and the associated property, because it would
>> ease backports.
> 
> Sure.
> Just for general knowledge, can you explain how this could ease backports?

The downstream machine names are different. Each distro has its own
> flavor. So adding a machine option always requires some massaging.

That might change in the future though.

> Anyhow, it is good practice to isolate a change adding a restriction
or a hw compatibility in its own patch.
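
Purely as an illustration (the downstream array name below is made up),
the same compat entry lands in a distro-specific array rather than in
hw_compat_8_0 :

/* upstream */
GlobalProperty hw_compat_8_0[] = {
    { "migration", "multifd-flush-after-each-section", "on"},
    { "vfio-pci", "x-allow-pre-copy", "false" },
};

/* hypothetical downstream flavor carrying the same entry */
GlobalProperty hw_compat_distro_8_6[] = {
    { "vfio-pci", "x-allow-pre-copy", "false" },
};

Having the property and its compat entry in their own patch makes that
massaging a trivial cherry-pick.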

> 
>>
>>
>>>       bool dirty_pages_supported;
>>>       bool dirty_tracking;
>>>   } VFIODevice;
>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>> index 07f763eb2e..50439e5cbb 100644
>>> --- a/hw/core/machine.c
>>> +++ b/hw/core/machine.c
>>> @@ -41,6 +41,7 @@
>>>
>>>   GlobalProperty hw_compat_8_0[] = {
>>>       { "migration", "multifd-flush-after-each-section", "on"},
>>> +    { "vfio-pci", "x-allow-pre-copy", "false" },
>>>   };
>>>   const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>>>
>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>> index 78358ede27..b73086e17a 100644
>>> --- a/hw/vfio/common.c
>>> +++ b/hw/vfio/common.c
>>> @@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>>               }
>>>
>>>               if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
>>> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>>> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>>> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>>>                   return false;
>>>               }
>>>           }
>>> @@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>>                   return false;
>>>               }
>>>
>>> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>>> +            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>>> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>>>                   continue;
>>>               } else {
>>>                   return false;
>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>> index 235978fd68..418efed019 100644
>>> --- a/hw/vfio/migration.c
>>> +++ b/hw/vfio/migration.c
>>> @@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>>>           return "STOP_COPY";
>>>       case VFIO_DEVICE_STATE_RESUMING:
>>>           return "RESUMING";
>>> +    case VFIO_DEVICE_STATE_PRE_COPY:
>>> +        return "PRE_COPY";
>>>       default:
>>>           return "UNKNOWN STATE";
>>>       }
>>> @@ -241,6 +243,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>>>       return 0;
>>>   }
>>>
>>> +static int vfio_query_precopy_size(VFIOMigration *migration)
>>> +{
>>> +    struct vfio_precopy_info precopy = {
>>> +        .argsz = sizeof(precopy),
>>> +    };
>>
>> May be move here  :
>>
>>         migration->precopy_init_size = 0;
>>         migration->precopy_dirty_size = 0;
>>
>> since the values are always reset before calling vfio_query_precopy_size()
> 
> OK.
> I will also reset these values in vfio_save_cleanup() so there won't be stale values in case migration is cancelled or fails.
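
So the helper would end up looking something like this (untested sketch,
with the reset folded in) :

static int vfio_query_precopy_size(VFIOMigration *migration)
{
    struct vfio_precopy_info precopy = {
        .argsz = sizeof(precopy),
    };

    /* start from a clean slate, callers no longer have to reset these */
    migration->precopy_init_size = 0;
    migration->precopy_dirty_size = 0;

    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
        return -errno;
    }

    migration->precopy_init_size = precopy.initial_bytes;
    migration->precopy_dirty_size = precopy.dirty_bytes;

    return 0;
}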
> 
>>
>>
>>> +
>>> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
>>> +        return -errno;
>>> +    }
>>> +
>>> +    migration->precopy_init_size = precopy.initial_bytes;
>>> +    migration->precopy_dirty_size = precopy.dirty_bytes;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>   /* Returns the size of saved data on success and -errno on error */
>>>   static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>>   {
>>> @@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>>       data_size = read(migration->data_fd, migration->data_buffer,
>>>                        migration->data_buffer_size);
>>>       if (data_size < 0) {
>>> +        /* Pre-copy emptied all the device state for now */
>>> +        if (errno == ENOMSG) {
>>
>> Could you explain this errno a little more, please? It looks like an API
>> contract with the VFIO PCI variant kernel driver.
> 
> Yes, it's explained in the precopy uAPI [1].

Oh. I forgot to look at the uAPI file and only checked the driver.
All good then.

> Do you want to change the comment to something like the following?

Yes, please. It would be good for the reader to have a reference to
the kernel uAPI.

> /*
>   * ENOMSG indicates that the migration data_fd has reached a temporary
>   * "end of stream", i.e. both initial_bytes and dirty_bytes are zero.
>   * More data may be available later in future reads.
>   */

Please add something like:

  "For more information, please refer to the Linux kernel VFIO uAPI"

No need for links or file names.


> [1] https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vfio.h#L1084
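
With the comment and the uAPI pointer combined, I'd expect the hunk to
read roughly like this (sketch only) :

    data_size = read(migration->data_fd, migration->data_buffer,
                     migration->data_buffer_size);
    if (data_size < 0) {
        /*
         * ENOMSG indicates that the migration data_fd has reached a
         * temporary "end of stream", i.e. both initial_bytes and
         * dirty_bytes are zero. More data may be available later in
         * future reads. For more information, please refer to the
         * Linux kernel VFIO uAPI.
         */
        if (errno == ENOMSG) {
            return 0;
        }

        return -errno;
    }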
> 
>>
>>> +            return 0;
>>> +        }
>>> +
>>>           return -errno;
>>>       }
>>>       if (data_size == 0) {
>>> @@ -265,6 +288,39 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>>       return qemu_file_get_error(f) ?: data_size;
>>>   }
>>>
>>> +static void vfio_update_estimated_pending_data(VFIOMigration *migration,
>>> +                                               uint64_t data_size)
>>> +{
>>> +    if (!data_size) {
>>> +        /*
>>> +         * Pre-copy emptied all the device state for now, update estimated sizes
>>> +         * accordingly.
>>> +         */
>>> +        migration->precopy_init_size = 0;
>>> +        migration->precopy_dirty_size = 0;
>>> +
>>> +        return;
>>> +    }
>>> +
>>> +    if (migration->precopy_init_size) {
>>> +        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
>>> +
>>> +        migration->precopy_init_size -= init_size;
>>> +        data_size -= init_size;
>>> +    }
>>> +
>>> +    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
>>> +                                         data_size);
>>
>> Do we have a trace event for all this data values ?
>>
>>> +}
>>> +
>>> +static bool vfio_precopy_supported(VFIODevice *vbasedev)
>>> +{
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +
>>> +    return vbasedev->allow_pre_copy &&
>>> +           migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>>> +}
>>> +
>>>   /* ---------------------------------------------------------------------- */
>>>
>>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>>> @@ -285,6 +341,31 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>>           return -ENOMEM;
>>>       }
>>>
>>> +    if (vfio_precopy_supported(vbasedev)) {
>>> +        int ret;
>>> +
>>> +        migration->precopy_init_size = 0;
>>> +        migration->precopy_dirty_size = 0;
>>> +
>>> +        switch (migration->device_state) {
>>> +        case VFIO_DEVICE_STATE_RUNNING:
>>> +            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
>>> +                                           VFIO_DEVICE_STATE_RUNNING);
>>> +            if (ret) {
>>> +                return ret;
>>> +            }
>>> +
>>> +            vfio_query_precopy_size(migration);
>>> +
>>> +            break;
>>> +        case VFIO_DEVICE_STATE_STOP:
>>> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
>>> +            break;
>>> +        default:
>>> +            return -EINVAL;
>>> +        }
>>> +    }
>>> +
>>>       trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
>>>
>>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>> @@ -303,22 +384,36 @@ static void vfio_save_cleanup(void *opaque)
>>>       trace_vfio_save_cleanup(vbasedev->name);
>>>   }
>>>
>>> +static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
>>> +                                        uint64_t *can_postcopy)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +
>>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
>>> +        return;
>>> +    }
>>> +
>>> +    *must_precopy +=
>>> +        migration->precopy_init_size + migration->precopy_dirty_size;
>>> +
>>> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
>>> +                                      *can_postcopy,
>>> +                                      migration->precopy_init_size,
>>> +                                      migration->precopy_dirty_size);
>>
>>
>> ok we have one :) I wonder if we should not update trace_vfio_save_iterate()
>> also with some values.
> 
> Hmm, yes, I guess it wouldn't hurt.
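
Something like this in hw/vfio/trace-events, maybe (just a sketch) :

vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64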
> 
>>
>>> +}
>>> +
>>>   /*
>>>    * Migration size of VFIO devices can be as little as a few KBs or as big as
>>>    * many GBs. This value should be big enough to cover the worst case.
>>>    */
>>>   #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>>
>>> -/*
>>> - * Only exact function is implemented and not estimate function. The reason is
>>> - * that during pre-copy phase of migration the estimate function is called
>>> - * repeatedly while pending RAM size is over the threshold, thus migration
>>> - * can't converge and querying the VFIO device pending data size is useless.
>>> - */
>>>   static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>>                                        uint64_t *can_postcopy)
>>>   {
>>>       VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>>       uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>>
>>>       /*
>>> @@ -328,8 +423,49 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>       *must_precopy += stop_copy_size;
>>>
>>> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>>> +        migration->precopy_init_size = 0;
>>> +        migration->precopy_dirty_size = 0;
>>> +        vfio_query_precopy_size(migration);
>>> +
>>> +        *must_precopy +=
>>> +            migration->precopy_init_size + migration->precopy_dirty_size;
>>> +    }
>>> +
>>>       trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
>>> -                                   stop_copy_size);
>>> +                                   stop_copy_size, migration->precopy_init_size,
>>> +                                   migration->precopy_dirty_size);
>>> +}
>>> +
>>> +static bool vfio_is_active_iterate(void *opaque)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +
>>> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
>>> +}
>>> +
>>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>> +{
>>> +    VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>> +    ssize_t data_size;
>>> +
>>> +    data_size = vfio_save_block(f, migration);
>>> +    if (data_size < 0) {
>>> +        return data_size;
>>> +    }
>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>> +
>>> +    vfio_update_estimated_pending_data(migration, data_size);
>>> +
>>> +    trace_vfio_save_iterate(vbasedev->name);
>>> +
>>> +    /*
>>> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
>>> +     * Return 1 so following handlers will not be potentially blocked.
>>
>> Can this condition be detected to warn the user?
> 
> I don't think so; it depends on the kernel driver implementation.

OK. We will see how it evolves over time. We might need some message
saying pre-copy is not converging.
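
If we ever add one, a simple heuristic could do. 'last_dirty_size' below
is hypothetical state that would have to be kept in VFIOMigration
(sketch only, untested, e.g. in vfio_save_iterate()) :

    /* hypothetical: warn when the dirty set stops shrinking */
    if (migration->precopy_dirty_size &&
        migration->precopy_dirty_size >= migration->last_dirty_size) {
        warn_report_once("%s: VFIO pre-copy is not converging",
                         vbasedev->name);
    }
    migration->last_dirty_size = migration->precopy_dirty_size;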

Thanks,

C.


> 
> Thanks!
> 
>>
>>
>>> +     */
>>> +    return 1;
>>>   }
>>>
>>>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>> @@ -338,7 +474,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>       ssize_t data_size;
>>>       int ret;
>>>
>>> -    /* We reach here with device state STOP only */
>>> +    /* We reach here with device state STOP or STOP_COPY only */
>>>       ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>>                                      VFIO_DEVICE_STATE_STOP);
>>>       if (ret) {
>>> @@ -457,7 +593,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>   static const SaveVMHandlers savevm_vfio_handlers = {
>>>       .save_setup = vfio_save_setup,
>>>       .save_cleanup = vfio_save_cleanup,
>>> +    .state_pending_estimate = vfio_state_pending_estimate,
>>>       .state_pending_exact = vfio_state_pending_exact,
>>> +    .is_active_iterate = vfio_is_active_iterate,
>>> +    .save_live_iterate = vfio_save_iterate,
>>>       .save_live_complete_precopy = vfio_save_complete_precopy,
>>>       .save_state = vfio_save_state,
>>>       .load_setup = vfio_load_setup,
>>> @@ -470,13 +609,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>>>   static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>>>   {
>>>       VFIODevice *vbasedev = opaque;
>>> +    VFIOMigration *migration = vbasedev->migration;
>>>       enum vfio_device_mig_state new_state;
>>>       int ret;
>>>
>>>       if (running) {
>>>           new_state = VFIO_DEVICE_STATE_RUNNING;
>>>       } else {
>>> -        new_state = VFIO_DEVICE_STATE_STOP;
>>> +        new_state =
>>> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
>>> +             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
>>> +                VFIO_DEVICE_STATE_STOP_COPY :
>>> +                VFIO_DEVICE_STATE_STOP;
>>>       }
>>>
>>>       /*
>>> @@ -603,6 +747,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>>       migration->vbasedev = vbasedev;
>>>       migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>>       migration->data_fd = -1;
>>> +    migration->mig_flags = mig_flags;
>>>
>>>       vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index bf27a39905..72f30ce09f 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>>>       DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
>>>                               vbasedev.pre_copy_dirty_page_tracking,
>>>                               ON_OFF_AUTO_ON),
>>> +    DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
>>> +                     vbasedev.allow_pre_copy, true),
>>>       DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>>>                               display, ON_OFF_AUTO_OFF),
>>>       DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>> index 646e42fd27..fd6893cb43 100644
>>> --- a/hw/vfio/trace-events
>>> +++ b/hw/vfio/trace-events
>>> @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>>>   vfio_save_cleanup(const char *name) " (%s)"
>>>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>>>   vfio_save_device_config_state(const char *name) " (%s)"
>>> +vfio_save_iterate(const char *name) " (%s)"
>>>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
>>> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
>>> +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>> +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>>   vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>>
> 




* Re: [PATCH v3 1/7] migration: Add switchover ack capability
  2023-05-21 15:18 ` [PATCH v3 1/7] migration: Add switchover ack capability Avihai Horon
@ 2023-05-24 19:24   ` Peter Xu
  2023-05-25  9:33   ` Markus Armbruster
  1 sibling, 0 replies; 20+ messages in thread
From: Peter Xu @ 2023-05-24 19:24 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Sun, May 21, 2023 at 06:18:02PM +0300, Avihai Horon wrote:
> Migration downtime estimation is calculated based on bandwidth and
> remaining migration data. This assumes that loading of migration data in
> the destination takes a negligible amount of time and that downtime
> depends only on network speed.
> 
> While this may be true for RAM, it's not necessarily true for other
> migration users. For example, loading the data of a VFIO device in the
> destination might require from the device to allocate resources, prepare
> internal data structures and so on. These operations can take a
> significant amount of time which can increase migration downtime.
> 
> This patch adds a new capability "switchover ack" that prevents the
> source from stopping the VM and completing the migration until an ACK
> is received from the destination that it's OK to do so.
> 
> This can be used by migrated devices in various ways to reduce downtime.
> For example, a device can send initial precopy metadata to pre-allocate
> resources in the destination and use this capability to make sure that
> the pre-allocation is completed before the source VM is stopped, so it
> will have full effect.
> 
> This new capability relies on the return path capability to communicate
> from the destination back to the source.
> 
> The actual implementation of the capability will be added in the
> following patches.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v3 2/7] migration: Implement switchover ack logic
  2023-05-21 15:18 ` [PATCH v3 2/7] migration: Implement switchover ack logic Avihai Horon
@ 2023-05-24 19:32   ` Peter Xu
  2023-05-25  9:51     ` Avihai Horon
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2023-05-24 19:32 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Sun, May 21, 2023 at 06:18:03PM +0300, Avihai Horon wrote:
> Implement switchover ack logic. This prevents the source from stopping
> the VM and completing the migration until an ACK is received from the
> destination that it's OK to do so.
> 
> To achieve this, a new SaveVMHandlers handler switchover_ack_needed()
> and a new return path message MIG_RP_MSG_SWITCHOVER_ACK are added.
> 
> The switchover_ack_needed() handler is called during migration setup
> both in the source and the destination to check if switchover ack is
> used by the migrated device.
> 
> When switchover is approved by all migrated devices in the destination
> that support this capability, the MIG_RP_MSG_INITIAL_DATA_LOADED_ACK

s/MIG_RP_MSG_INITIAL_DATA_LOADED_ACK/MIG_RP_MSG_SWITCHOVER_ACK/

> return path message is sent to the source to notify it that it's OK to
> do switchover.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  include/migration/register.h |  3 ++
>  migration/migration.h        | 16 +++++++++++
>  migration/savevm.h           |  2 ++
>  migration/migration.c        | 42 +++++++++++++++++++++++++--
>  migration/savevm.c           | 56 ++++++++++++++++++++++++++++++++++++
>  migration/trace-events       |  4 +++
>  6 files changed, 121 insertions(+), 2 deletions(-)
> 
> diff --git a/include/migration/register.h b/include/migration/register.h
> index a8dfd8fefd..cda36d377b 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -71,6 +71,9 @@ typedef struct SaveVMHandlers {
>      int (*load_cleanup)(void *opaque);
>      /* Called when postcopy migration wants to resume from failure */
>      int (*resume_prepare)(MigrationState *s, void *opaque);
> +
> +    /* Checks if switchover ack should be used. Called both in src and dest */
> +    bool (*switchover_ack_needed)(void *opaque);

Sorry if I'm going to suggest something that overwrites what I
suggested.. :( Luckily, not so much.

When I read the new series I noticed that it may still be better to only
use this on dst, and always send the ACK.

The problem is that, based on your patch 1 description, the RP ACK message
will be sent if the switchover-ack cap is set, but in the current impl it
is actually conditional, just to handle the num==0 case, so either the impl
or the description needs to change.

It turns out it'll be even cleaner to always send it.  If so..

>  } SaveVMHandlers;
>  
>  int register_savevm_live(const char *idstr,
> diff --git a/migration/migration.h b/migration/migration.h
> index 48a46123a0..e209860cce 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -209,6 +209,13 @@ struct MigrationIncomingState {
>       * contains valid information.
>       */
>      QemuMutex page_request_mutex;
> +
> +    /*
> +     * Number of devices that have yet to approve switchover. When this reaches
> +     * zero an ACK that it's OK to do switchover is sent to the source. No lock
> +     * is needed as this field is updated serially.
> +     */
> +    unsigned int switchover_ack_pending_num;
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> @@ -437,6 +444,14 @@ struct MigrationState {
>  
>      /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
>      JSONWriter *vmdesc;
> +
> +    /* Number of devices that use switchover ack capability */
> +    unsigned int switchover_ack_user_num;

... we save this field, then...

> +    /*
> +     * Indicates whether an ACK from the destination that it's OK to do
> +     * switchover has been received.
> +     */
> +    bool switchover_acked;
>  };
>  
>  void migrate_set_state(int *state, int old_state, int new_state);
> @@ -477,6 +492,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>  void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
>                                   char *block_name);
>  void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
> +int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
>  
>  void dirty_bitmap_mig_before_vm_start(void);
>  void dirty_bitmap_mig_cancel_outgoing(void);
> diff --git a/migration/savevm.h b/migration/savevm.h
> index fb636735f0..5c3e1a026b 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -32,6 +32,7 @@
>  bool qemu_savevm_state_blocked(Error **errp);
>  void qemu_savevm_non_migratable_list(strList **reasons);
>  void qemu_savevm_state_setup(QEMUFile *f);
> +void qemu_savevm_state_switchover_ack_needed(MigrationState *ms);
>  bool qemu_savevm_state_guest_unplug_pending(void);
>  int qemu_savevm_state_resume_prepare(MigrationState *s);
>  void qemu_savevm_state_header(QEMUFile *f);
> @@ -65,6 +66,7 @@ int qemu_loadvm_state(QEMUFile *f);
>  void qemu_loadvm_state_cleanup(void);
>  int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
>  int qemu_load_device_state(QEMUFile *f);
> +int qemu_loadvm_approve_switchover(void);
>  int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>          bool in_postcopy, bool inactivate_disks);
>  
> diff --git a/migration/migration.c b/migration/migration.c
> index 5de7f734b9..87423ec30c 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -78,6 +78,7 @@ enum mig_rp_message_type {
>      MIG_RP_MSG_REQ_PAGES,    /* data (start: be64, len: be32) */
>      MIG_RP_MSG_RECV_BITMAP,  /* send recved_bitmap back to source */
>      MIG_RP_MSG_RESUME_ACK,   /* tell source that we are ready to resume */
> +    MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
>  
>      MIG_RP_MSG_MAX
>  };
> @@ -760,6 +761,11 @@ bool migration_has_all_channels(void)
>      return true;
>  }
>  
> +int migrate_send_rp_switchover_ack(MigrationIncomingState *mis)
> +{
> +    return migrate_send_rp_message(mis, MIG_RP_MSG_SWITCHOVER_ACK, 0, NULL);
> +}
> +
>  /*
>   * Send a 'SHUT' message on the return channel with the given value
>   * to indicate that we've finished with the RP.  Non-0 value indicates
> @@ -1405,6 +1411,8 @@ void migrate_init(MigrationState *s)
>      s->vm_was_running = false;
>      s->iteration_initial_bytes = 0;
>      s->threshold_size = 0;
> +    s->switchover_ack_user_num = 0;
> +    s->switchover_acked = false;
>  }
>  
>  int migrate_add_blocker_internal(Error *reason, Error **errp)
> @@ -1721,6 +1729,7 @@ static struct rp_cmd_args {
>      [MIG_RP_MSG_REQ_PAGES_ID]   = { .len = -1, .name = "REQ_PAGES_ID" },
>      [MIG_RP_MSG_RECV_BITMAP]    = { .len = -1, .name = "RECV_BITMAP" },
>      [MIG_RP_MSG_RESUME_ACK]     = { .len =  4, .name = "RESUME_ACK" },
> +    [MIG_RP_MSG_SWITCHOVER_ACK] = { .len =  0, .name = "SWITCHOVER_ACK" },
>      [MIG_RP_MSG_MAX]            = { .len = -1, .name = "MAX" },
>  };
>  
> @@ -1959,6 +1968,11 @@ retry:
>              }
>              break;
>  
> +        case MIG_RP_MSG_SWITCHOVER_ACK:
> +            ms->switchover_acked = true;
> +            trace_source_return_path_thread_switchover_acked();
> +            break;
> +
>          default:
>              break;
>          }
> @@ -2700,6 +2714,25 @@ static void migration_update_counters(MigrationState *s,
>                                bandwidth, s->threshold_size);
>  }
>  
> +static bool migration_can_switchover(MigrationState *s)
> +{
> +    if (!migrate_switchover_ack()) {
> +        return true;
> +    }
> +
> +    /* Switchover ack was enabled but no device uses it */
> +    if (!s->switchover_ack_user_num) {
> +        return true;
> +    }

... we drop this, then...

> +
> +    /* No reason to wait for switchover ACK if VM is stopped */
> +    if (!runstate_is_running()) {
> +        return true;
> +    }
> +
> +    return s->switchover_acked;
> +}
> +
>  /* Migration thread iteration status */
>  typedef enum {
>      MIG_ITERATE_RESUME,         /* Resume current iteration */
> @@ -2715,6 +2748,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>  {
>      uint64_t must_precopy, can_postcopy;
>      bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
> +    bool can_switchover = migration_can_switchover(s);
>  
>      qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
>      uint64_t pending_size = must_precopy + can_postcopy;
> @@ -2727,14 +2761,14 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>          trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
>      }
>  
> -    if (!pending_size || pending_size < s->threshold_size) {
> +    if ((!pending_size || pending_size < s->threshold_size) && can_switchover) {
>          trace_migration_thread_low_pending(pending_size);
>          migration_completion(s);
>          return MIG_ITERATE_BREAK;
>      }
>  
>      /* Still a significant amount to transfer */
> -    if (!in_postcopy && must_precopy <= s->threshold_size &&
> +    if (!in_postcopy && must_precopy <= s->threshold_size && can_switchover &&
>          qatomic_read(&s->start_postcopy)) {
>          if (postcopy_start(s)) {
>              error_report("%s: postcopy failed to start", __func__);
> @@ -2959,6 +2993,10 @@ static void *migration_thread(void *opaque)
>  
>      qemu_savevm_state_setup(s->to_dst_file);
>  
> +    if (migrate_switchover_ack()) {
> +        qemu_savevm_state_switchover_ack_needed(s);
> +    }

... we also drop this, then...

> +
>      qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
>                                 MIGRATION_STATUS_ACTIVE);
>  
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 03795ce8dc..9482b1ff27 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1233,6 +1233,23 @@ bool qemu_savevm_state_guest_unplug_pending(void)
>      return false;
>  }
>  
> +void qemu_savevm_state_switchover_ack_needed(MigrationState *ms)
> +{
> +    SaveStateEntry *se;
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (!se->ops || !se->ops->switchover_ack_needed) {
> +            continue;
> +        }
> +
> +        if (se->ops->switchover_ack_needed(se->opaque)) {
> +            ms->switchover_ack_user_num++;
> +        }
> +    }
> +
> +    trace_savevm_state_switchover_ack_needed(ms->switchover_ack_user_num);
> +}

... we also drop this, then...

> +
>  void qemu_savevm_state_setup(QEMUFile *f)
>  {
>      MigrationState *ms = migrate_get_current();
> @@ -2586,6 +2603,23 @@ static int qemu_loadvm_state_header(QEMUFile *f)
>      return 0;
>  }
>  
> +static void qemu_loadvm_state_switchover_ack_needed(MigrationIncomingState *mis)
> +{
> +    SaveStateEntry *se;
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (!se->ops || !se->ops->switchover_ack_needed) {
> +            continue;
> +        }
> +
> +        if (se->ops->switchover_ack_needed(se->opaque)) {
> +            mis->switchover_ack_pending_num++;
> +        }
> +    }
> +
> +    trace_loadvm_state_switchover_ack_needed(mis->switchover_ack_pending_num);
> +}
> +
>  static int qemu_loadvm_state_setup(QEMUFile *f)
>  {
>      SaveStateEntry *se;
> @@ -2789,6 +2823,10 @@ int qemu_loadvm_state(QEMUFile *f)
>          return -EINVAL;
>      }
>  
> +    if (migrate_switchover_ack()) {
> +        qemu_loadvm_state_switchover_ack_needed(mis);

... here, we send ACK msg if num==0.

So we (1) make the wire protocol clearer (the ACK must be there if the cap
is set), and (2) drop a bunch of code too. It also makes the source side
code clearer.
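
Roughly like below (untested, just to illustrate the idea; error handling
of the rp message is omitted):

/* dst: at the end of qemu_loadvm_state_switchover_ack_needed() */
    if (!mis->switchover_ack_pending_num) {
        /* No device needs an ack, approve switchover right away */
        migrate_send_rp_switchover_ack(mis);
    }

/* src: migration_can_switchover() then loses the user_num check */
static bool migration_can_switchover(MigrationState *s)
{
    if (!migrate_switchover_ack()) {
        return true;
    }

    /* No reason to wait for switchover ACK if VM is stopped */
    if (!runstate_is_running()) {
        return true;
    }

    return s->switchover_acked;
}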

What do you think?

Other than that this looks very good to me.

> +    }
> +
>      cpu_synchronize_all_pre_loadvm();
>  
>      ret = qemu_loadvm_state_main(f, mis);
> @@ -2862,6 +2900,24 @@ int qemu_load_device_state(QEMUFile *f)
>      return 0;
>  }
>  
> +int qemu_loadvm_approve_switchover(void)
> +{
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +
> +    if (!mis->switchover_ack_pending_num) {
> +        return -EINVAL;
> +    }
> +
> +    mis->switchover_ack_pending_num--;
> +    trace_loadvm_approve_switchover(mis->switchover_ack_pending_num);
> +
> +    if (mis->switchover_ack_pending_num) {
> +        return 0;
> +    }
> +
> +    return migrate_send_rp_switchover_ack(mis);
> +}
> +
>  bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
>                    bool has_devices, strList *devices, Error **errp)
>  {
> diff --git a/migration/trace-events b/migration/trace-events
> index cdaef7a1ea..c52b429b28 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
>  qemu_loadvm_state_post_main(int ret) "%d"
>  qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
>  qemu_savevm_send_packaged(void) ""
> +loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
>  loadvm_state_setup(void) ""
>  loadvm_state_cleanup(void) ""
>  loadvm_handle_cmd_packaged(unsigned int length) "%u"
> @@ -23,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
>  loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
>  loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
>  loadvm_process_command_ping(uint32_t val) "0x%x"
> +loadvm_approve_switchover(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
>  postcopy_ram_listen_thread_exit(void) ""
>  postcopy_ram_listen_thread_start(void) ""
>  qemu_savevm_send_postcopy_advise(void) ""
> @@ -39,6 +41,7 @@ savevm_send_postcopy_resume(void) ""
>  savevm_send_colo_enable(void) ""
>  savevm_send_recv_bitmap(char *name) "%s"
>  savevm_state_setup(void) ""
> +savevm_state_switchover_ack_needed(unsigned int switchover_ack_user_num) "Switchover ack user num=%u"
>  savevm_state_resume_prepare(void) ""
>  savevm_state_header(void) ""
>  savevm_state_iterate(void) ""
> @@ -180,6 +183,7 @@ source_return_path_thread_loop_top(void) ""
>  source_return_path_thread_pong(uint32_t val) "0x%x"
>  source_return_path_thread_shut(uint32_t val) "0x%x"
>  source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
> +source_return_path_thread_switchover_acked(void) ""
>  migration_thread_low_pending(uint64_t pending) "%" PRIu64
>  migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
>  process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> -- 
> 2.26.3
> 

-- 
Peter Xu




* Re: [PATCH v3 6/7] vfio/migration: Add VFIO migration pre-copy support
  2023-05-24 15:38       ` Cédric Le Goater
@ 2023-05-25  9:24         ` Avihai Horon
  0 siblings, 0 replies; 20+ messages in thread
From: Avihai Horon @ 2023-05-25  9:24 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Eduardo Habkost, Marcel Apfelbaum,
	Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 24/05/2023 18:38, Cédric Le Goater wrote:
> Hello Avihai,
>
> On 5/24/23 14:49, Avihai Horon wrote:
>>
>> On 23/05/2023 17:56, Cédric Le Goater wrote:
>>> Hello Avihai,
>>>
>>> On 5/21/23 17:18, Avihai Horon wrote:
>>>> Pre-copy support allows the VFIO device data to be transferred 
>>>> while the
>>>> VM is running. This helps to accommodate VFIO devices that have a 
>>>> large
>>>> amount of data that needs to be transferred, and it can reduce 
>>>> migration
>>>> downtime.
>>>>
>>>> Pre-copy support is optional in VFIO migration protocol v2.
>>>> Implement pre-copy of VFIO migration protocol v2 and use it for 
>>>> devices
>>>> that support it. Full description of it can be found here [1].
>>>>
>>>> In addition, add a new VFIO device property x-allow-pre-copy to keep
>>>> migration compatibility to/from older QEMU versions that don't have 
>>>> VFIO
>>>> pre-copy support.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
>>>
>>>
>>> May be simply reference Linux commit 4db52602a607 ("vfio: Extend the 
>>> device
>>> migration protocol with PRE_COPY") instead.
>>
>> Sure, I will change it.
>>
>>>
>>> some comments below,
>>>
>>>
>>>>
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>> ---
>>>>   docs/devel/vfio-migration.rst |  35 +++++---
>>>>   include/hw/vfio/vfio-common.h |   4 +
>>>>   hw/core/machine.c             |   1 +
>>>>   hw/vfio/common.c              |   6 +-
>>>>   hw/vfio/migration.c           | 163 
>>>> ++++++++++++++++++++++++++++++++--
>>>>   hw/vfio/pci.c                 |   2 +
>>>>   hw/vfio/trace-events          |   4 +-
>>>>   7 files changed, 193 insertions(+), 22 deletions(-)
>>>>
>>>> diff --git a/docs/devel/vfio-migration.rst 
>>>> b/docs/devel/vfio-migration.rst
>>>> index 1b68ccf115..e896b2a673 100644
>>>> --- a/docs/devel/vfio-migration.rst
>>>> +++ b/docs/devel/vfio-migration.rst
>>>> @@ -7,12 +7,14 @@ the guest is running on source host and restoring 
>>>> this saved state on the
>>>>   destination host. This document details how saving and restoring 
>>>> of VFIO
>>>>   devices is done in QEMU.
>>>>
>>>> -Migration of VFIO devices currently consists of a single 
>>>> stop-and-copy phase.
>>>> -During the stop-and-copy phase the guest is stopped and the entire 
>>>> VFIO device
>>>> -data is transferred to the destination.
>>>> -
>>>> -The pre-copy phase of migration is currently not supported for 
>>>> VFIO devices.
>>>> -Support for VFIO pre-copy will be added later on.
>>>> +Migration of VFIO devices consists of two phases: the optional 
>>>> pre-copy phase,
>>>> +and the stop-and-copy phase. The pre-copy phase is iterative and 
>>>> allows to
>>>> +accommodate VFIO devices that have a large amount of data that 
>>>> needs to be
>>>> +transferred. The iterative pre-copy phase of migration allows for 
>>>> the guest to
>>>> +continue whilst the VFIO device state is transferred to the 
>>>> destination, this
>>>> +helps to reduce the total downtime of the VM. VFIO devices opt-in 
>>>> to pre-copy
>>>> +support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>>> +VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>>>
>>>>   Note that currently VFIO migration is supported only for a single 
>>>> device. This
>>>>   is due to VFIO migration's lack of P2P support. However, P2P 
>>>> support is planned
>>>> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
>>>>   * A ``load_setup`` function that sets the VFIO device on the destination in
>>>>     _RESUMING state.
>>>>
>>>> +* A ``state_pending_estimate`` function that reports an estimate of the
>>>> +  remaining pre-copy data that the vendor driver has yet to save for the VFIO
>>>> +  device.
>>>> +
>>>>   * A ``state_pending_exact`` function that reads pending_bytes from the vendor
>>>>     driver, which indicates the amount of data that the vendor driver has yet to
>>>>     save for the VFIO device.
>>>>
>>>> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
>>>> +  active only when the VFIO device is in pre-copy states.
>>>> +
>>>> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
>>>> +  vendor driver during iterative pre-copy phase.
>>>> +
>>>>   * A ``save_state`` function to save the device config space if it is present.
>>>>
>>>>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
>>>> @@ -111,8 +123,10 @@ Flow of state changes during Live migration
>>>>   ===========================================
>>>>
>>>>   Below is the flow of state change during live migration.
>>>> -The values in the brackets represent the VM state, the migration state, and
>>>> +The values in the parentheses represent the VM state, the migration state, and
>>>>   the VFIO device state, respectively.
>>>> +The text in the square brackets represents the flow if the VFIO device supports
>>>> +pre-copy.
>>>>
>>>>   Live migration save path
>>>>   ------------------------
>>>> @@ -124,11 +138,12 @@ Live migration save path
>>>>                                     |
>>>>                        migrate_init spawns migration_thread
>>>>                   Migration thread then calls each device's .save_setup()
>>>> -                       (RUNNING, _SETUP, _RUNNING)
>>>> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>>>>                                     |
>>>> -                      (RUNNING, _ACTIVE, _RUNNING)
>>>> -             If device is active, get pending_bytes by .state_pending_exact()
>>>> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>>>> +      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>>>>             If total pending_bytes >= threshold_size, call .save_live_iterate()
>>>> +                  [Data of VFIO device for pre-copy phase is copied]
>>>>           Iterate till total pending bytes converge and are less than threshold
>>>>                                     |
>>>>     On migration completion, vCPU stops and calls .save_live_complete_precopy for
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index eed244f25f..5ce7a01d56 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>>>>       int data_fd;
>>>>       void *data_buffer;
>>>>       size_t data_buffer_size;
>>>> +    uint64_t precopy_init_size;
>>>> +    uint64_t precopy_dirty_size;
>>>> +    uint64_t mig_flags;
>>>
>>> It would have been cleaner to introduce VFIOMigration::mig_flags and 
>>> its
>>> update in another patch. This is minor.
>>
>> Sure, I will split it.
>>
>>>
>>>
>>>>   } VFIOMigration;
>>>>
>>>>   typedef struct VFIOAddressSpace {
>>>> @@ -143,6 +146,7 @@ typedef struct VFIODevice {
>>>>       VFIOMigration *migration;
>>>>       Error *migration_blocker;
>>>>       OnOffAuto pre_copy_dirty_page_tracking;
>>>> +    bool allow_pre_copy;
>>>
>>> same comment for this bool and the associated property, because it 
>>> would
>>> ease backports.
>>
>> Sure.
>> Just for general knowledge, can you explain how this could ease 
>> backports?
>
> The downstream machine names are different. Each distro has its own
> flavor. So adding a machine option always requires some massaging.
>
> That might change in the future though.
>
> Anyhow, it is good practice to isolate a change adding a restriction
> or a hw compatibility in its own patch.

Ah, I see.
Thanks for the explanation.
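As a concrete usage note: with the property split out, a user can pin the
old behaviour on the command line with something like (PCI address
hypothetical):

  -device vfio-pci,host=0000:08:00.0,x-allow-pre-copy=false

while the hw_compat_8_0 entry flips the same default automatically for 8.0
machine types.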

>
>
>>
>>>
>>>
>>>>       bool dirty_pages_supported;
>>>>       bool dirty_tracking;
>>>>   } VFIODevice;
>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>>> index 07f763eb2e..50439e5cbb 100644
>>>> --- a/hw/core/machine.c
>>>> +++ b/hw/core/machine.c
>>>> @@ -41,6 +41,7 @@
>>>>
>>>>   GlobalProperty hw_compat_8_0[] = {
>>>>       { "migration", "multifd-flush-after-each-section", "on"},
>>>> +    { "vfio-pci", "x-allow-pre-copy", "false" },
>>>>   };
>>>>   const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>>>>
>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>> index 78358ede27..b73086e17a 100644
>>>> --- a/hw/vfio/common.c
>>>> +++ b/hw/vfio/common.c
>>>> @@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>>>               }
>>>>
>>>>               if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
>>>> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>>>> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>>>> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>>>>                   return false;
>>>>               }
>>>>           }
>>>> @@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>>>                   return false;
>>>>               }
>>>>
>>>> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>>>> +            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>>>> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>>>>                   continue;
>>>>               } else {
>>>>                   return false;
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index 235978fd68..418efed019 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>>>>           return "STOP_COPY";
>>>>       case VFIO_DEVICE_STATE_RESUMING:
>>>>           return "RESUMING";
>>>> +    case VFIO_DEVICE_STATE_PRE_COPY:
>>>> +        return "PRE_COPY";
>>>>       default:
>>>>           return "UNKNOWN STATE";
>>>>       }
>>>> @@ -241,6 +243,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>>>>       return 0;
>>>>   }
>>>>
>>>> +static int vfio_query_precopy_size(VFIOMigration *migration)
>>>> +{
>>>> +    struct vfio_precopy_info precopy = {
>>>> +        .argsz = sizeof(precopy),
>>>> +    };
>>>
>>> May be move here  :
>>>
>>>         migration->precopy_init_size = 0;
>>>         migration->precopy_dirty_size = 0;
>>>
>>> since the values are always reset before calling vfio_query_precopy_size()
>>
>> OK.
>> I will also reset these values in vfio_save_cleanup() so there won't 
>> be stale values in case migration is cancelled or fails.
>>
>>>
>>>
>>>> +
>>>> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
>>>> +        return -errno;
>>>> +    }
>>>> +
>>>> +    migration->precopy_init_size = precopy.initial_bytes;
>>>> +    migration->precopy_dirty_size = precopy.dirty_bytes;
>>>> +
>>>> +    return 0;
>>>> +}
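
(Folding both of the above in, the helper would end up roughly as follows --
a sketch of the combined result:)

static int vfio_query_precopy_size(VFIOMigration *migration)
{
    struct vfio_precopy_info precopy = {
        .argsz = sizeof(precopy),
    };

    /* Reset here so callers never see stale values on failure */
    migration->precopy_init_size = 0;
    migration->precopy_dirty_size = 0;

    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
        return -errno;
    }

    migration->precopy_init_size = precopy.initial_bytes;
    migration->precopy_dirty_size = precopy.dirty_bytes;

    return 0;
}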
>>>> +
>>>>   /* Returns the size of saved data on success and -errno on error */
>>>>   static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>>>   {
>>>> @@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>>>       data_size = read(migration->data_fd, migration->data_buffer,
>>>>                        migration->data_buffer_size);
>>>>       if (data_size < 0) {
>>>> +        /* Pre-copy emptied all the device state for now */
>>>> +        if (errno == ENOMSG) {
>>>
>>> Could you explain this errno a little more, please? It looks like an
>>> API with the VFIO PCI variant kernel driver.
>>
>> Yes, it's explained in the precopy uAPI [1].
>
> Oh. I forgot to look at the uAPI file and only checked the driver.
> All good then.
>
>> Do you want to change the comment to something like the following?
>
> Yes, please. It would be good for the reader to have a reference on
> the kernel uAPI.
>
>> /*
>>   * ENOMSG indicates that the migration data_fd has reached a temporary
>>   * "end of stream", i.e. both initial_bytes and dirty_bytes are zero.
>>   * More data may be available later in future reads.
>>   */
>
> Please add something like :
>
>  "For more information, please refer to the Linux kernel VFIO uAPI"
>
> No need for links or file names.
>
OK, I will add that.

Thanks!
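
So the error path in vfio_save_block() would then read roughly:

    data_size = read(migration->data_fd, migration->data_buffer,
                     migration->data_buffer_size);
    if (data_size < 0) {
        /*
         * ENOMSG indicates that the migration data_fd has reached a temporary
         * "end of stream", i.e. both initial_bytes and dirty_bytes are zero.
         * More data may be available later in future reads.
         * For more information, please refer to the Linux kernel VFIO uAPI.
         */
        if (errno == ENOMSG) {
            return 0;
        }

        return -errno;
    }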

>
>> [1] https://elixir.bootlin.com/linux/v6.4-rc1/source/include/uapi/linux/vfio.h#L1084
>>
>>>
>>>> +            return 0;
>>>> +        }
>>>> +
>>>>           return -errno;
>>>>       }
>>>>       if (data_size == 0) {
>>>> @@ -265,6 +288,39 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>>>       return qemu_file_get_error(f) ?: data_size;
>>>>   }
>>>>
>>>> +static void vfio_update_estimated_pending_data(VFIOMigration *migration,
>>>> +                                               uint64_t data_size)
>>>> +{
>>>> +    if (!data_size) {
>>>> +        /*
>>>> +         * Pre-copy emptied all the device state for now, update estimated sizes
>>>> +         * accordingly.
>>>> +         */
>>>> +        migration->precopy_init_size = 0;
>>>> +        migration->precopy_dirty_size = 0;
>>>> +
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (migration->precopy_init_size) {
>>>> +        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
>>>> +
>>>> +        migration->precopy_init_size -= init_size;
>>>> +        data_size -= init_size;
>>>> +    }
>>>> +
>>>> +    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
>>>> +                                         data_size);
>>>
>>> Do we have a trace event for all this data values ?
>>>
>>>> +}
>>>> +
>>>> +static bool vfio_precopy_supported(VFIODevice *vbasedev)
>>>> +{
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> +    return vbasedev->allow_pre_copy &&
>>>> +           migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>>>> +}
>>>> +
>>>>   /* ---------------------------------------------------------------------- */
>>>>
>>>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>>>> @@ -285,6 +341,31 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>>>           return -ENOMEM;
>>>>       }
>>>>
>>>> +    if (vfio_precopy_supported(vbasedev)) {
>>>> +        int ret;
>>>> +
>>>> +        migration->precopy_init_size = 0;
>>>> +        migration->precopy_dirty_size = 0;
>>>> +
>>>> +        switch (migration->device_state) {
>>>> +        case VFIO_DEVICE_STATE_RUNNING:
>>>> +            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
>>>> +                                           VFIO_DEVICE_STATE_RUNNING);
>>>> +            if (ret) {
>>>> +                return ret;
>>>> +            }
>>>> +
>>>> +            vfio_query_precopy_size(migration);
>>>> +
>>>> +            break;
>>>> +        case VFIO_DEVICE_STATE_STOP:
>>>> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
>>>> +            break;
>>>> +        default:
>>>> +            return -EINVAL;
>>>> +        }
>>>> +    }
>>>> +
>>>>       trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
>>>>
>>>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> @@ -303,22 +384,36 @@ static void vfio_save_cleanup(void *opaque)
>>>>       trace_vfio_save_cleanup(vbasedev->name);
>>>>   }
>>>>
>>>> +static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
>>>> +                                        uint64_t *can_postcopy)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    *must_precopy +=
>>>> +        migration->precopy_init_size + migration->precopy_dirty_size;
>>>> +
>>>> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
>>>> +                                      *can_postcopy,
>>>> +                                      migration->precopy_init_size,
>>>> +                                      migration->precopy_dirty_size);
>>>
>>>
>>> ok we have one :) I wonder if we should not update 
>>> trace_vfio_save_iterate()
>>> also with some values.
>>
>> Hmm, yes, I guess it wouldn't hurt.
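>>
>> Following the format of the two pending events, it could become something
>> like this (field choice is just a sketch for the next version):
>>
>> vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64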
>>
>>>
>>>> +}
>>>> +
>>>>   /*
>>>>    * Migration size of VFIO devices can be as little as a few KBs or as big as
>>>>    * many GBs. This value should be big enough to cover the worst case.
>>>>    */
>>>>   #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>>>
>>>> -/*
>>>> - * Only exact function is implemented and not estimate function. The reason is
>>>> - * that during pre-copy phase of migration the estimate function is called
>>>> - * repeatedly while pending RAM size is over the threshold, thus migration
>>>> - * can't converge and querying the VFIO device pending data size is useless.
>>>> - */
>>>>   static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>>>                                        uint64_t *can_postcopy)
>>>>   {
>>>>       VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>>       uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>>>
>>>>       /*
>>>> @@ -328,8 +423,49 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
>>>>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>>>       *must_precopy += stop_copy_size;
>>>>
>>>> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>>>> +        migration->precopy_init_size = 0;
>>>> +        migration->precopy_dirty_size = 0;
>>>> +        vfio_query_precopy_size(migration);
>>>> +
>>>> +        *must_precopy +=
>>>> +            migration->precopy_init_size + migration->precopy_dirty_size;
>>>> +    }
>>>>
>>>>       trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
>>>> -                                   stop_copy_size);
>>>> +                                   stop_copy_size, migration->precopy_init_size,
>>>> +                                   migration->precopy_dirty_size);
>>>> +}
>>>> +
>>>> +static bool vfio_is_active_iterate(void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
>>>> +}
>>>> +
>>>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +    ssize_t data_size;
>>>> +
>>>> +    data_size = vfio_save_block(f, migration);
>>>> +    if (data_size < 0) {
>>>> +        return data_size;
>>>> +    }
>>>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>> +
>>>> +    vfio_update_estimated_pending_data(migration, data_size);
>>>> +
>>>> +    trace_vfio_save_iterate(vbasedev->name);
>>>> +
>>>> +    /*
>>>> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
>>>> +     * Return 1 so following handlers will not be potentially blocked.
>>>
>>> Can this condition be detected to warn the user ?
>>
>> I don't think so, it depends on the kernel driver implementation.
>
> OK. We will see how it evolves with time. We might need some message
> saying pre copy is not converging.
>
> Thanks,
>
> C.
>
>
>>
>> Thanks!
>>
>>>
>>>
>>>> +     */
>>>> +    return 1;
>>>>   }
>>>>
>>>>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>> @@ -338,7 +474,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>>>       ssize_t data_size;
>>>>       int ret;
>>>>
>>>> -    /* We reach here with device state STOP only */
>>>> +    /* We reach here with device state STOP or STOP_COPY only */
>>>>       ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>>>                                      VFIO_DEVICE_STATE_STOP);
>>>>       if (ret) {
>>>> @@ -457,7 +593,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>>>   static const SaveVMHandlers savevm_vfio_handlers = {
>>>>       .save_setup = vfio_save_setup,
>>>>       .save_cleanup = vfio_save_cleanup,
>>>> +    .state_pending_estimate = vfio_state_pending_estimate,
>>>>       .state_pending_exact = vfio_state_pending_exact,
>>>> +    .is_active_iterate = vfio_is_active_iterate,
>>>> +    .save_live_iterate = vfio_save_iterate,
>>>>       .save_live_complete_precopy = vfio_save_complete_precopy,
>>>>       .save_state = vfio_save_state,
>>>>       .load_setup = vfio_load_setup,
>>>> @@ -470,13 +609,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>>>>   static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>>>>   {
>>>>       VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>>       enum vfio_device_mig_state new_state;
>>>>       int ret;
>>>>
>>>>       if (running) {
>>>>           new_state = VFIO_DEVICE_STATE_RUNNING;
>>>>       } else {
>>>> -        new_state = VFIO_DEVICE_STATE_STOP;
>>>> +        new_state =
>>>> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
>>>> +             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
>>>> +                VFIO_DEVICE_STATE_STOP_COPY :
>>>> +                VFIO_DEVICE_STATE_STOP;
>>>>       }
>>>>
>>>>       /*
>>>> @@ -603,6 +747,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>>>       migration->vbasedev = vbasedev;
>>>>       migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>>>       migration->data_fd = -1;
>>>> +    migration->mig_flags = mig_flags;
>>>>
>>>>       vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index bf27a39905..72f30ce09f 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>>>>       DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
>>>>                               vbasedev.pre_copy_dirty_page_tracking,
>>>>                               ON_OFF_AUTO_ON),
>>>> +    DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
>>>> +                     vbasedev.allow_pre_copy, true),
>>>>       DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>>>>                               display, ON_OFF_AUTO_OFF),
>>>>       DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>>>> index 646e42fd27..fd6893cb43 100644
>>>> --- a/hw/vfio/trace-events
>>>> +++ b/hw/vfio/trace-events
>>>> @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>>>>   vfio_save_cleanup(const char *name) " (%s)"
>>>>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>>>>   vfio_save_device_config_state(const char *name) " (%s)"
>>>> +vfio_save_iterate(const char *name) " (%s)"
>>>>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
>>>> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
>>>> +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>>> +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>>>   vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
>>>
>>
>



* Re: [PATCH v3 1/7] migration: Add switchover ack capability
  2023-05-21 15:18 ` [PATCH v3 1/7] migration: Add switchover ack capability Avihai Horon
  2023-05-24 19:24   ` Peter Xu
@ 2023-05-25  9:33   ` Markus Armbruster
  1 sibling, 0 replies; 20+ messages in thread
From: Markus Armbruster @ 2023-05-25  9:33 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

Avihai Horon <avihaih@nvidia.com> writes:

> Migration downtime estimation is calculated based on bandwidth and
> remaining migration data. This assumes that loading of migration data in
> the destination takes a negligible amount of time and that downtime
> depends only on network speed.
>
> While this may be true for RAM, it's not necessarily true for other
> migration users. For example, loading the data of a VFIO device in the
> destination might require the device to allocate resources, prepare
> internal data structures and so on. These operations can take a
> significant amount of time which can increase migration downtime.
>
> This patch adds a new capability "switchover ack" that prevents the
> source from stopping the VM and completing the migration until an ACK
> is received from the destination that it's OK to do so.
>
> This can be used by migrated devices in various ways to reduce downtime.
> For example, a device can send initial precopy metadata to pre-allocate
> resources in the destination and use this capability to make sure that
> the pre-allocation is completed before the source VM is stopped, so it
> will have full effect.
>
> This new capability relies on the return path capability to communicate
> from the destination back to the source.
>
> The actual implementation of the capability will be added in the
> following patches.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

QAPI schema
Acked-by: Markus Armbruster <armbru@redhat.com>




* Re: [PATCH v3 2/7] migration: Implement switchover ack logic
  2023-05-24 19:32   ` Peter Xu
@ 2023-05-25  9:51     ` Avihai Horon
  2023-05-25 12:44       ` Peter Xu
  0 siblings, 1 reply; 20+ messages in thread
From: Avihai Horon @ 2023-05-25  9:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 24/05/2023 22:32, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Sun, May 21, 2023 at 06:18:03PM +0300, Avihai Horon wrote:
>> Implement switchover ack logic. This prevents the source from stopping
>> the VM and completing the migration until an ACK is received from the
>> destination that it's OK to do so.
>>
>> To achieve this, a new SaveVMHandlers handler switchover_ack_needed()
>> and a new return path message MIG_RP_MSG_SWITCHOVER_ACK are added.
>>
>> The switchover_ack_needed() handler is called during migration setup
>> both in the source and the destination to check if switchover ack is
>> used by the migrated device.
>>
>> When switchover is approved by all migrated devices in the destination
>> that support this capability, the MIG_RP_MSG_INITIAL_DATA_LOADED_ACK
> s/MIG_RP_MSG_INITIAL_DATA_LOADED_ACK/MIG_RP_MSG_SWITCHOVER_ACK/

Will change.

>
>> return path message is sent to the source to notify it that it's OK to
>> do switchover.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   include/migration/register.h |  3 ++
>>   migration/migration.h        | 16 +++++++++++
>>   migration/savevm.h           |  2 ++
>>   migration/migration.c        | 42 +++++++++++++++++++++++++--
>>   migration/savevm.c           | 56 ++++++++++++++++++++++++++++++++++++
>>   migration/trace-events       |  4 +++
>>   6 files changed, 121 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index a8dfd8fefd..cda36d377b 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -71,6 +71,9 @@ typedef struct SaveVMHandlers {
>>       int (*load_cleanup)(void *opaque);
>>       /* Called when postcopy migration wants to resume from failure */
>>       int (*resume_prepare)(MigrationState *s, void *opaque);
>> +
>> +    /* Checks if switchover ack should be used. Called both in src and dest */
>> +    bool (*switchover_ack_needed)(void *opaque);
> Sorry if I'm going to suggest something that overwrites what I
> suggested.. :( Luckily, not so much.
>
> When I read the new series I just noticed maybe it's still better to only
> use this on dst, and always do the ACK.
>
> The problem is that, based on your patch 1 description, the RP ACK message
> will be sent if the switchover-ack cap is set, but actually it's conditional
> in the current impl, just to handle the num==0 case, so either the impl or
> the desc needs to change.
>
> It turns out it'll be even cleaner to always send it.  If so..
>
>>   } SaveVMHandlers;
>>
>>   int register_savevm_live(const char *idstr,
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 48a46123a0..e209860cce 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -209,6 +209,13 @@ struct MigrationIncomingState {
>>        * contains valid information.
>>        */
>>       QemuMutex page_request_mutex;
>> +
>> +    /*
>> +     * Number of devices that have yet to approve switchover. When this reaches
>> +     * zero an ACK that it's OK to do switchover is sent to the source. No lock
>> +     * is needed as this field is updated serially.
>> +     */
>> +    unsigned int switchover_ack_pending_num;
>>   };
>>
>>   MigrationIncomingState *migration_incoming_get_current(void);
>> @@ -437,6 +444,14 @@ struct MigrationState {
>>
>>       /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
>>       JSONWriter *vmdesc;
>> +
>> +    /* Number of devices that use switchover ack capability */
>> +    unsigned int switchover_ack_user_num;
> ... we save this field, then...
>
>> +    /*
>> +     * Indicates whether an ACK from the destination that it's OK to do
>> +     * switchover has been received.
>> +     */
>> +    bool switchover_acked;
>>   };
>>
>>   void migrate_set_state(int *state, int old_state, int new_state);
>> @@ -477,6 +492,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>>   void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
>>                                    char *block_name);
>>   void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
>> +int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
>>
>>   void dirty_bitmap_mig_before_vm_start(void);
>>   void dirty_bitmap_mig_cancel_outgoing(void);
>> diff --git a/migration/savevm.h b/migration/savevm.h
>> index fb636735f0..5c3e1a026b 100644
>> --- a/migration/savevm.h
>> +++ b/migration/savevm.h
>> @@ -32,6 +32,7 @@
>>   bool qemu_savevm_state_blocked(Error **errp);
>>   void qemu_savevm_non_migratable_list(strList **reasons);
>>   void qemu_savevm_state_setup(QEMUFile *f);
>> +void qemu_savevm_state_switchover_ack_needed(MigrationState *ms);
>>   bool qemu_savevm_state_guest_unplug_pending(void);
>>   int qemu_savevm_state_resume_prepare(MigrationState *s);
>>   void qemu_savevm_state_header(QEMUFile *f);
>> @@ -65,6 +66,7 @@ int qemu_loadvm_state(QEMUFile *f);
>>   void qemu_loadvm_state_cleanup(void);
>>   int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
>>   int qemu_load_device_state(QEMUFile *f);
>> +int qemu_loadvm_approve_switchover(void);
>>   int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>>           bool in_postcopy, bool inactivate_disks);
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 5de7f734b9..87423ec30c 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -78,6 +78,7 @@ enum mig_rp_message_type {
>>       MIG_RP_MSG_REQ_PAGES,    /* data (start: be64, len: be32) */
>>       MIG_RP_MSG_RECV_BITMAP,  /* send recved_bitmap back to source */
>>       MIG_RP_MSG_RESUME_ACK,   /* tell source that we are ready to resume */
>> +    MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
>>
>>       MIG_RP_MSG_MAX
>>   };
>> @@ -760,6 +761,11 @@ bool migration_has_all_channels(void)
>>       return true;
>>   }
>>
>> +int migrate_send_rp_switchover_ack(MigrationIncomingState *mis)
>> +{
>> +    return migrate_send_rp_message(mis, MIG_RP_MSG_SWITCHOVER_ACK, 0, NULL);
>> +}
>> +
>>   /*
>>    * Send a 'SHUT' message on the return channel with the given value
>>    * to indicate that we've finished with the RP.  Non-0 value indicates
>> @@ -1405,6 +1411,8 @@ void migrate_init(MigrationState *s)
>>       s->vm_was_running = false;
>>       s->iteration_initial_bytes = 0;
>>       s->threshold_size = 0;
>> +    s->switchover_ack_user_num = 0;
>> +    s->switchover_acked = false;
>>   }
>>
>>   int migrate_add_blocker_internal(Error *reason, Error **errp)
>> @@ -1721,6 +1729,7 @@ static struct rp_cmd_args {
>>       [MIG_RP_MSG_REQ_PAGES_ID]   = { .len = -1, .name = "REQ_PAGES_ID" },
>>       [MIG_RP_MSG_RECV_BITMAP]    = { .len = -1, .name = "RECV_BITMAP" },
>>       [MIG_RP_MSG_RESUME_ACK]     = { .len =  4, .name = "RESUME_ACK" },
>> +    [MIG_RP_MSG_SWITCHOVER_ACK] = { .len =  0, .name = "SWITCHOVER_ACK" },
>>       [MIG_RP_MSG_MAX]            = { .len = -1, .name = "MAX" },
>>   };
>>
>> @@ -1959,6 +1968,11 @@ retry:
>>               }
>>               break;
>>
>> +        case MIG_RP_MSG_SWITCHOVER_ACK:
>> +            ms->switchover_acked = true;
>> +            trace_source_return_path_thread_switchover_acked();
>> +            break;
>> +
>>           default:
>>               break;
>>           }
>> @@ -2700,6 +2714,25 @@ static void migration_update_counters(MigrationState *s,
>>                                 bandwidth, s->threshold_size);
>>   }
>>
>> +static bool migration_can_switchover(MigrationState *s)
>> +{
>> +    if (!migrate_switchover_ack()) {
>> +        return true;
>> +    }
>> +
>> +    /* Switchover ack was enabled but no device uses it */
>> +    if (!s->switchover_ack_user_num) {
>> +        return true;
>> +    }
> ... we drop this, then...
>
>> +
>> +    /* No reason to wait for switchover ACK if VM is stopped */
>> +    if (!runstate_is_running()) {
>> +        return true;
>> +    }
>> +
>> +    return s->switchover_acked;
>> +}
>> +
>>   /* Migration thread iteration status */
>>   typedef enum {
>>       MIG_ITERATE_RESUME,         /* Resume current iteration */
>> @@ -2715,6 +2748,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>   {
>>       uint64_t must_precopy, can_postcopy;
>>       bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
>> +    bool can_switchover = migration_can_switchover(s);
>>
>>       qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
>>       uint64_t pending_size = must_precopy + can_postcopy;
>> @@ -2727,14 +2761,14 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>           trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
>>       }
>>
>> -    if (!pending_size || pending_size < s->threshold_size) {
>> +    if ((!pending_size || pending_size < s->threshold_size) && can_switchover) {
>>           trace_migration_thread_low_pending(pending_size);
>>           migration_completion(s);
>>           return MIG_ITERATE_BREAK;
>>       }
>>
>>       /* Still a significant amount to transfer */
>> -    if (!in_postcopy && must_precopy <= s->threshold_size &&
>> +    if (!in_postcopy && must_precopy <= s->threshold_size && can_switchover &&
>>           qatomic_read(&s->start_postcopy)) {
>>           if (postcopy_start(s)) {
>>               error_report("%s: postcopy failed to start", __func__);
>> @@ -2959,6 +2993,10 @@ static void *migration_thread(void *opaque)
>>
>>       qemu_savevm_state_setup(s->to_dst_file);
>>
>> +    if (migrate_switchover_ack()) {
>> +        qemu_savevm_state_switchover_ack_needed(s);
>> +    }
> ... we also drop this, then...
>
>> +
>>       qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
>>                                  MIGRATION_STATUS_ACTIVE);
>>
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 03795ce8dc..9482b1ff27 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -1233,6 +1233,23 @@ bool qemu_savevm_state_guest_unplug_pending(void)
>>       return false;
>>   }
>>
>> +void qemu_savevm_state_switchover_ack_needed(MigrationState *ms)
>> +{
>> +    SaveStateEntry *se;
>> +
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +        if (!se->ops || !se->ops->switchover_ack_needed) {
>> +            continue;
>> +        }
>> +
>> +        if (se->ops->switchover_ack_needed(se->opaque)) {
>> +            ms->switchover_ack_user_num++;
>> +        }
>> +    }
>> +
>> +    trace_savevm_state_switchover_ack_needed(ms->switchover_ack_user_num);
>> +}
> ... we also drop this, then...
>
>> +
>>   void qemu_savevm_state_setup(QEMUFile *f)
>>   {
>>       MigrationState *ms = migrate_get_current();
>> @@ -2586,6 +2603,23 @@ static int qemu_loadvm_state_header(QEMUFile *f)
>>       return 0;
>>   }
>>
>> +static void qemu_loadvm_state_switchover_ack_needed(MigrationIncomingState *mis)
>> +{
>> +    SaveStateEntry *se;
>> +
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +        if (!se->ops || !se->ops->switchover_ack_needed) {
>> +            continue;
>> +        }
>> +
>> +        if (se->ops->switchover_ack_needed(se->opaque)) {
>> +            mis->switchover_ack_pending_num++;
>> +        }
>> +    }
>> +
>> +    trace_loadvm_state_switchover_ack_needed(mis->switchover_ack_pending_num);
>> +}
>> +
>>   static int qemu_loadvm_state_setup(QEMUFile *f)
>>   {
>>       SaveStateEntry *se;
>> @@ -2789,6 +2823,10 @@ int qemu_loadvm_state(QEMUFile *f)
>>           return -EINVAL;
>>       }
>>
>> +    if (migrate_switchover_ack()) {
>> +        qemu_loadvm_state_switchover_ack_needed(mis);

[1]

> ... here, we send ACK msg if num==0.
>
> So we (1) make the wire protocol clearer (ACK must be there if cap set),
> then (2) drop a bunch of code too.  Actually we also make the code clearer
> too on src.
>
> What do you think?

I really like this idea! It simplifies things even more.

However, there is one issue -- we can't send the ACK up here [1], as at 
that point the return path has not been created yet.
A possible solution is to check for mis->switchover_ack_pending_num == 0 
when we create the return path and send the ACK there.
It's not as clean as checking the number of users and ACKing here in the 
same place, but it works.

Do you think it's OK? or do you have another idea?

Thanks!
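
For completeness, with switchover_ack_user_num dropped, the source-side
check from this patch would shrink to roughly:

static bool migration_can_switchover(MigrationState *s)
{
    if (!migrate_switchover_ack()) {
        return true;
    }

    /* No reason to wait for switchover ACK if VM is stopped */
    if (!runstate_is_running()) {
        return true;
    }

    return s->switchover_acked;
}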

>
> Other than that this looks very good to me.
>
>> +    }
>> +
>>       cpu_synchronize_all_pre_loadvm();
>>
>>       ret = qemu_loadvm_state_main(f, mis);
>> @@ -2862,6 +2900,24 @@ int qemu_load_device_state(QEMUFile *f)
>>       return 0;
>>   }
>>
>> +int qemu_loadvm_approve_switchover(void)
>> +{
>> +    MigrationIncomingState *mis = migration_incoming_get_current();
>> +
>> +    if (!mis->switchover_ack_pending_num) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    mis->switchover_ack_pending_num--;
>> +    trace_loadvm_approve_switchover(mis->switchover_ack_pending_num);
>> +
>> +    if (mis->switchover_ack_pending_num) {
>> +        return 0;
>> +    }
>> +
>> +    return migrate_send_rp_switchover_ack(mis);
>> +}
>> +
>>   bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
>>                     bool has_devices, strList *devices, Error **errp)
>>   {
>> diff --git a/migration/trace-events b/migration/trace-events
>> index cdaef7a1ea..c52b429b28 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
>>   qemu_loadvm_state_post_main(int ret) "%d"
>>   qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
>>   qemu_savevm_send_packaged(void) ""
>> +loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
>>   loadvm_state_setup(void) ""
>>   loadvm_state_cleanup(void) ""
>>   loadvm_handle_cmd_packaged(unsigned int length) "%u"
>> @@ -23,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
>>   loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
>>   loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
>>   loadvm_process_command_ping(uint32_t val) "0x%x"
>> +loadvm_approve_switchover(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
>>   postcopy_ram_listen_thread_exit(void) ""
>>   postcopy_ram_listen_thread_start(void) ""
>>   qemu_savevm_send_postcopy_advise(void) ""
>> @@ -39,6 +41,7 @@ savevm_send_postcopy_resume(void) ""
>>   savevm_send_colo_enable(void) ""
>>   savevm_send_recv_bitmap(char *name) "%s"
>>   savevm_state_setup(void) ""
>> +savevm_state_switchover_ack_needed(unsigned int switchover_ack_user_num) "Switchover ack user num=%u"
>>   savevm_state_resume_prepare(void) ""
>>   savevm_state_header(void) ""
>>   savevm_state_iterate(void) ""
>> @@ -180,6 +183,7 @@ source_return_path_thread_loop_top(void) ""
>>   source_return_path_thread_pong(uint32_t val) "0x%x"
>>   source_return_path_thread_shut(uint32_t val) "0x%x"
>>   source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
>> +source_return_path_thread_switchover_acked(void) ""
>>   migration_thread_low_pending(uint64_t pending) "%" PRIu64
>>   migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
>>   process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>> --
>> 2.26.3
>>
> --
> Peter Xu
>



* Re: [PATCH v3 2/7] migration: Implement switchover ack logic
  2023-05-25  9:51     ` Avihai Horon
@ 2023-05-25 12:44       ` Peter Xu
  2023-05-28 12:42         ` Avihai Horon
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Xu @ 2023-05-25 12:44 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins

On Thu, May 25, 2023 at 12:51:46PM +0300, Avihai Horon wrote:
> However, there is one issue -- we can't send the ACK up here [1], as at that
> point the return path has not been created yet.
> A possible solution is to check for mis->switchover_ack_pending_num == 0
> when we create the return path and send the ACK there.
> It's not as clean as checking the number of users and ACKing here in the
> same place, but it works.
> 
> Do you think it's OK? or do you have another idea?

Good point. It looks fine to me as we create the return path mostly at the
very early stage right after the migration headers.

Let's just add a comment describing the ordering fact?  IMHO the point is
that switchover_ack_pending_num won't be used before the return path is set
up, due to the current wire protocol (we create the return path before
sending mostly anything else), so setting it up there is fine.

-- 
Peter Xu




* Re: [PATCH v3 2/7] migration: Implement switchover ack logic
  2023-05-25 12:44       ` Peter Xu
@ 2023-05-28 12:42         ` Avihai Horon
  0 siblings, 0 replies; 20+ messages in thread
From: Avihai Horon @ 2023-05-28 12:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	Yanan Wang, Juan Quintela, Leonardo Bras, Eric Blake,
	Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
	Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
	Tarun Gupta, Joao Martins


On 25/05/2023 15:44, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Thu, May 25, 2023 at 12:51:46PM +0300, Avihai Horon wrote:
>> However, there is one issue -- we can't send the ACK up here [1], as at that
>> point the return path has not been created yet.
>> A possible solution is to check for mis->switchover_ack_pending_num == 0
>> when we create the return path and send the ACK there.
>> It's not as clean as checking the number of users and ACKing here in the
>> same place, but it works.
>>
>> Do you think it's OK? or do you have another idea?
> Good point. It looks fine to me as we create the return path mostly at the
> very early stage right after the migration headers.
>
> Let's just add a comment describing the ordering fact?  IMHO it's about
> switchover_ack_pending_num won't be used before setup of the return path
> due to current wire protocol (we create return path before sending mostly
> anything else), so setup it there is fine.

Sure, I will also add a comment.

Thanks.
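
For the archives, the agreed shape would be roughly the following, hooked
into the handler that opens the return path (placement in
loadvm_process_command() and error handling are a sketch only):

    case MIG_CMD_OPEN_RETURN_PATH:
        /* ... existing sanity checks ... */
        mis->to_src_file = qemu_file_get_return_path(f);
        /* ... existing error handling ... */

        /*
         * The switchover_ack_needed() scan in qemu_loadvm_state() has
         * already run by the time the source asks us to open the return
         * path, so switchover_ack_pending_num is stable here. If no
         * device needs an ack, approve switchover immediately.
         */
        if (migrate_switchover_ack() && !mis->switchover_ack_pending_num) {
            int ret = migrate_send_rp_switchover_ack(mis);

            if (ret) {
                return ret;
            }
        }
        break;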



