* [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
@ 2023-05-01 14:01 Avihai Horon
  2023-05-01 14:01 ` [PATCH 1/8] migration: Add precopy initial data capability Avihai Horon
                   ` (8 more replies)
  0 siblings, 9 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-01 14:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Avihai Horon, Kirti Wankhede, Tarun Gupta,
	Joao Martins

Hello everyone,

This series adds a new migration capability called "precopy initial
data". The purpose of this capability is to reduce migration downtime in
cases where loading of migration data in the destination can take a lot
of time, such as with VFIO migration data.

The series then adds precopy support and precopy initial data support
to VFIO migration.

Precopy initial data is used by VFIO migration, but other migration
users can add support for it and use it as well.

=== Background ===

Migration downtime estimation is calculated based on bandwidth and
remaining migration data. This assumes that loading of migration data in
the destination takes a negligible amount of time and that downtime
depends only on network speed.

While this may be true for RAM, it's not necessarily true for other
migration users. For example, loading the data of a VFIO device in the
destination might require the device to allocate resources and prepare
internal data structures, which can take a significant amount of time.

This poses a problem, as the source may think that the remaining
migration data is small enough to meet the downtime limit, so it will
stop the VM and complete the migration, but in fact sending and loading
the data in the destination may take longer than the downtime limit.

To solve this, VFIO migration uAPI defines "initial bytes" as part of
its precopy stream [1]. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, it can be used to
transfer device metadata to pre-allocate resources in the destination.
However, for this to work we need to make sure that all initial bytes
are sent and loaded in the destination before the source VM is stopped.

The new precopy initial data migration capability helps us achieve this.
It allows the source to send initial precopy data and the destination to
ACK that this data has been loaded. Migration will not attempt to stop
the source VM and complete the migration until this ACK is received.

Note that this relies on the return path capability to communicate from
the destination back to the source.

=== Flow of operation ===

To use precopy initial data, the capability must be enabled in the
source.

As this capability must also be supported by the destination, a
handshake is performed during migration setup. The purpose of the
handshake is to notify the destination that precopy initial data is used
and to check whether it's supported.

The handshake is done in two levels. First, a general handshake is done
with the destination migration code to notify that precopy initial data
is used. Then, for each migration user in the source that supports
precopy initial data, a handshake is done with its counterpart in the
destination:
If both support it, precopy initial data will be used for them.
If the source doesn't support it, precopy initial data will not be used
for them.
If the source supports it and the destination doesn't, migration will
fail.

Assuming the handshake succeeds, migration starts sending precopy data,
including the initial precopy data. Initial precopy data is just like
any other precopy data, so migration code is not aware of it. Therefore,
it's the responsibility of the migration users (such as VFIO devices) to
notify their counterparts in the destination that their initial precopy
data has been sent (for example, VFIO migration does so when its initial
bytes reach zero).

In the destination, migration code will query each migration user that
supports precopy initial data and check whether its initial data has
been loaded. Once all of them have loaded their initial data, an ACK is
sent to the source, which can then complete the migration when
appropriate.

=== Test results ===

The table below shows the downtime of two otherwise identical
migrations, the first with precopy initial data disabled and the second
with it enabled. The migrated VM is assigned an mlx5 VFIO device which
has 300MB of device data to be migrated.

+----------------------+-----------------------+----------+
| Precopy initial data | VFIO device data size | Downtime |
+----------------------+-----------------------+----------+
|       Disabled       |         300MB         |  1900ms  |
|       Enabled        |         300MB         |  420ms   |
+----------------------+-----------------------+----------+

Precopy initial data gives a roughly 4.5x improvement in downtime. The
1480ms difference is the time spent on resource allocation for the VFIO
device in the destination. Without precopy initial data, this time is
spent while the source VM is stopped, so the downtime is much higher.
With precopy initial data, it is spent while the source VM is still
running.

=== Patch breakdown ===

- Patches 1-5 add the precopy initial data capability.
- Patches 6-7 add VFIO migration precopy support. A similar version of
  them was previously sent here [2].
- Patch 8 adds precopy initial data support for VFIO migration.

Thanks for reviewing!

[1]
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/vfio.h#L1048

[2]
https://lore.kernel.org/qemu-devel/20230222174915.5647-3-avihaih@nvidia.com/

Avihai Horon (8):
  migration: Add precopy initial data capability
  migration: Add precopy initial data handshake
  migration: Add precopy initial data loaded ACK functionality
  migration: Enable precopy initial data capability
  tests: Add migration precopy initial data capability test
  vfio/migration: Refactor vfio_save_block() to return saved data size
  vfio/migration: Add VFIO migration pre-copy support
  vfio/migration: Add support for precopy initial data capability

 docs/devel/vfio-migration.rst |  35 ++++--
 qapi/migration.json           |   9 +-
 include/hw/vfio/vfio-common.h |   6 +
 include/migration/register.h  |  13 ++
 migration/migration.h         |  15 +++
 migration/options.h           |   1 +
 migration/savevm.h            |   1 +
 hw/vfio/common.c              |   6 +-
 hw/vfio/migration.c           | 218 +++++++++++++++++++++++++++++++---
 migration/migration.c         |  45 ++++++-
 migration/options.c           |  16 +++
 migration/savevm.c            | 141 ++++++++++++++++++++++
 tests/qtest/migration-test.c  |  23 ++++
 hw/vfio/trace-events          |   4 +-
 migration/trace-events        |   4 +
 15 files changed, 504 insertions(+), 33 deletions(-)

-- 
2.26.3




* [PATCH 1/8] migration: Add precopy initial data capability
  2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
@ 2023-05-01 14:01 ` Avihai Horon
  2023-05-10  8:24   ` Juan Quintela
  2023-05-17  9:17   ` Markus Armbruster
  2023-05-01 14:01 ` [PATCH 2/8] migration: Add precopy initial data handshake Avihai Horon
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-01 14:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Avihai Horon, Kirti Wankhede, Tarun Gupta,
	Joao Martins

Migration downtime estimation is calculated based on bandwidth and
remaining migration data. This assumes that loading of migration data in
the destination takes a negligible amount of time and that downtime
depends only on network speed.

While this may be true for RAM, it's not necessarily true for other
migration users. For example, loading the data of a VFIO device in the
destination might require the device to allocate resources, prepare
internal data structures and so on. These operations can take a
significant amount of time, which can increase migration downtime.

This patch adds a new capability "precopy initial data" that allows the
source to send initial precopy data and the destination to ACK that this
data has been loaded. Migration will not attempt to stop the source VM
and complete the migration until this ACK is received.

This will allow migration users to send initial precopy data which can
be used to reduce downtime (e.g., by pre-allocating resources), while
making sure that the source stops the VM and completes the migration
only after this initial precopy data has been sent and loaded in the
destination, so that it takes full effect.

This new capability relies on the return path capability to communicate
from the destination back to the source.

The actual implementation of the capability will be added in the
following patches.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 qapi/migration.json |  9 ++++++++-
 migration/options.h |  1 +
 migration/options.c | 20 ++++++++++++++++++++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 82000adce4..d496148386 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -478,6 +478,13 @@
 #                    should not affect the correctness of postcopy migration.
 #                    (since 7.1)
 #
+# @precopy-initial-data: If enabled, migration will not attempt to stop source
+#                        VM and complete the migration until an ACK is received
+#                        from the destination that initial precopy data has
+#                        been loaded. This can improve downtime if there are
+#                        migration users that support precopy initial data.
+#                        (since 8.1)
+#
 # Features:
 # @unstable: Members @x-colo and @x-ignore-shared are experimental.
 #
@@ -492,7 +499,7 @@
            'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
            { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
            'validate-uuid', 'background-snapshot',
-           'zero-copy-send', 'postcopy-preempt'] }
+           'zero-copy-send', 'postcopy-preempt', 'precopy-initial-data'] }
 
 ##
 # @MigrationCapabilityStatus:
diff --git a/migration/options.h b/migration/options.h
index 3c322867cd..d004b6321e 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -44,6 +44,7 @@ bool migrate_pause_before_switchover(void);
 bool migrate_postcopy_blocktime(void);
 bool migrate_postcopy_preempt(void);
 bool migrate_postcopy_ram(void);
+bool migrate_precopy_initial_data(void);
 bool migrate_rdma_pin_all(void);
 bool migrate_release_ram(void);
 bool migrate_return_path(void);
diff --git a/migration/options.c b/migration/options.c
index 53b7fc5d5d..c4ef0c60c7 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -184,6 +184,8 @@ Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-zero-copy-send",
             MIGRATION_CAPABILITY_ZERO_COPY_SEND),
 #endif
+    DEFINE_PROP_MIG_CAP("x-precopy-initial-data",
+                        MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA),
 
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -286,6 +288,13 @@ bool migrate_postcopy_ram(void)
     return s->capabilities[MIGRATION_CAPABILITY_POSTCOPY_RAM];
 }
 
+bool migrate_precopy_initial_data(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s->capabilities[MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA];
+}
+
 bool migrate_rdma_pin_all(void)
 {
     MigrationState *s = migrate_get_current();
@@ -546,6 +555,17 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
         }
     }
 
+    if (new_caps[MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA]) {
+        if (!new_caps[MIGRATION_CAPABILITY_RETURN_PATH]) {
+            error_setg(errp, "Precopy initial data requires return path");
+            return false;
+        }
+
+        /* Disable this capability until it's implemented */
+        error_setg(errp, "Precopy initial data is not implemented yet");
+        return false;
+    }
+
     return true;
 }
 
-- 
2.26.3




* [PATCH 2/8] migration: Add precopy initial data handshake
  2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
  2023-05-01 14:01 ` [PATCH 1/8] migration: Add precopy initial data capability Avihai Horon
@ 2023-05-01 14:01 ` Avihai Horon
  2023-05-02 22:54   ` Peter Xu
                     ` (2 more replies)
  2023-05-01 14:01 ` [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality Avihai Horon
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-01 14:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Avihai Horon, Kirti Wankhede, Tarun Gupta,
	Joao Martins

Add precopy initial data handshake between source and destination upon
migration setup. The purpose of the handshake is to notify the
destination that precopy initial data is used and which migration users
(i.e., SaveStateEntry) are going to use it.

The handshake is done in two levels. First, a general enable command is
sent to notify the destination migration code that precopy initial data
is used.
Then, for each migration user in the source that supports precopy
initial data, an enable command is sent to its counterpart in the
destination:
If both support it, precopy initial data will be used for them.
If the source doesn't support it, precopy initial data will not be used
for them.
If the source supports it and the destination doesn't, migration will
fail.

To implement it, a new migration command MIG_CMD_INITIAL_DATA_ENABLE is
added, as well as a new SaveVMHandlers handler initial_data_advise.
Calling the handler advises the migration user that precopy initial data
is used and its return value indicates whether precopy initial data is
supported by it.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/migration/register.h |   6 +++
 migration/migration.h        |   3 ++
 migration/savevm.h           |   1 +
 migration/migration.c        |   4 ++
 migration/savevm.c           | 102 +++++++++++++++++++++++++++++++++++
 migration/trace-events       |   2 +
 6 files changed, 118 insertions(+)

diff --git a/include/migration/register.h b/include/migration/register.h
index a8dfd8fefd..0a73f3883e 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -71,6 +71,12 @@ typedef struct SaveVMHandlers {
     int (*load_cleanup)(void *opaque);
     /* Called when postcopy migration wants to resume from failure */
     int (*resume_prepare)(MigrationState *s, void *opaque);
+
+    /*
+     * Advises that precopy initial data was requested to be enabled. Returns
+     * true if it's supported or false otherwise. Called both in src and dest.
+     */
+    bool (*initial_data_advise)(void *opaque);
 } SaveVMHandlers;
 
 int register_savevm_live(const char *idstr,
diff --git a/migration/migration.h b/migration/migration.h
index 3a918514e7..4f615e9dbc 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -204,6 +204,9 @@ struct MigrationIncomingState {
      * contains valid information.
      */
     QemuMutex page_request_mutex;
+
+    /* Indicates whether precopy initial data was enabled and should be used */
+    bool initial_data_enabled;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/migration/savevm.h b/migration/savevm.h
index fb636735f0..d47ab4ad18 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -58,6 +58,7 @@ void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
                                            uint64_t *start_list,
                                            uint64_t *length_list);
 void qemu_savevm_send_colo_enable(QEMUFile *f);
+void qemu_savevm_send_initial_data_enable(MigrationState *ms, QEMUFile *f);
 void qemu_savevm_live_state(QEMUFile *f);
 int qemu_save_device_state(QEMUFile *f);
 
diff --git a/migration/migration.c b/migration/migration.c
index abcadbb619..68cdf5b184 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2964,6 +2964,10 @@ static void *migration_thread(void *opaque)
         qemu_savevm_send_colo_enable(s->to_dst_file);
     }
 
+    if (migrate_precopy_initial_data()) {
+        qemu_savevm_send_initial_data_enable(s, s->to_dst_file);
+    }
+
     qemu_savevm_state_setup(s->to_dst_file);
 
     qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
diff --git a/migration/savevm.c b/migration/savevm.c
index a9181b444b..2740defdf0 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -71,6 +71,13 @@
 
 const unsigned int postcopy_ram_discard_version;
 
+typedef struct {
+    uint8_t general_enable;
+    uint8_t reserved[7];
+    uint8_t idstr[256];
+    uint32_t instance_id;
+} InitialDataInfo;
+
 /* Subcommands for QEMU_VM_COMMAND */
 enum qemu_vm_cmd {
     MIG_CMD_INVALID = 0,   /* Must be 0 */
@@ -90,6 +97,8 @@ enum qemu_vm_cmd {
     MIG_CMD_ENABLE_COLO,       /* Enable COLO */
     MIG_CMD_POSTCOPY_RESUME,   /* resume postcopy on dest */
     MIG_CMD_RECV_BITMAP,       /* Request for recved bitmap on dst */
+
+    MIG_CMD_INITIAL_DATA_ENABLE, /* Enable precopy initial data in dest */
     MIG_CMD_MAX
 };
 
@@ -109,6 +118,8 @@ static struct mig_cmd_args {
     [MIG_CMD_POSTCOPY_RESUME]  = { .len =  0, .name = "POSTCOPY_RESUME" },
     [MIG_CMD_PACKAGED]         = { .len =  4, .name = "PACKAGED" },
     [MIG_CMD_RECV_BITMAP]      = { .len = -1, .name = "RECV_BITMAP" },
+    [MIG_CMD_INITIAL_DATA_ENABLE] = { .len = sizeof(InitialDataInfo),
+                                      .name = "INITIAL_DATA_ENABLE" },
     [MIG_CMD_MAX]              = { .len = -1, .name = "MAX" },
 };
 
@@ -1036,6 +1047,40 @@ static void qemu_savevm_command_send(QEMUFile *f,
     qemu_fflush(f);
 }
 
+void qemu_savevm_send_initial_data_enable(MigrationState *ms, QEMUFile *f)
+{
+    SaveStateEntry *se;
+    InitialDataInfo buf;
+
+    /* Enable precopy initial data generally in the migration */
+    memset(&buf, 0, sizeof(buf));
+    buf.general_enable = 1;
+    qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
+                             (uint8_t *)&buf);
+    trace_savevm_send_initial_data_enable(buf.general_enable, (char *)buf.idstr,
+                                          buf.instance_id);
+
+    /* Enable precopy initial data for each migration user that supports it */
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->initial_data_advise) {
+            continue;
+        }
+
+        if (!se->ops->initial_data_advise(se->opaque)) {
+            continue;
+        }
+
+        memset(&buf, 0, sizeof(buf));
+        memcpy(buf.idstr, se->idstr, sizeof(buf.idstr));
+        buf.instance_id = se->instance_id;
+
+        qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
+                                 (uint8_t *)&buf);
+        trace_savevm_send_initial_data_enable(
+            buf.general_enable, (char *)buf.idstr, buf.instance_id);
+    }
+}
+
 void qemu_savevm_send_colo_enable(QEMUFile *f)
 {
     trace_savevm_send_colo_enable();
@@ -2314,6 +2359,60 @@ static int loadvm_process_enable_colo(MigrationIncomingState *mis)
     return ret;
 }
 
+static int loadvm_handle_initial_data_enable(MigrationIncomingState *mis)
+{
+    InitialDataInfo buf;
+    SaveStateEntry *se;
+    ssize_t read_size;
+
+    read_size = qemu_get_buffer(mis->from_src_file, (void *)&buf, sizeof(buf));
+    if (read_size != sizeof(buf)) {
+        error_report("%s: Could not get data buffer, read_size %zd, len %zu",
+                     __func__, read_size, sizeof(buf));
+
+        return -EIO;
+    }
+
+    /* Enable precopy initial data generally in the migration */
+    if (buf.general_enable) {
+        mis->initial_data_enabled = true;
+        trace_loadvm_handle_initial_data_enable(
+            buf.general_enable, (char *)buf.idstr, buf.instance_id);
+
+        return 0;
+    }
+
+    /* Enable precopy initial data for a specific migration user */
+    se = find_se((char *)buf.idstr, buf.instance_id);
+    if (!se) {
+        error_report("%s: Could not find SaveStateEntry, idstr '%s', "
+                     "instance_id %" PRIu32,
+                     __func__, buf.idstr, buf.instance_id);
+
+        return -ENOENT;
+    }
+
+    if (!se->ops || !se->ops->initial_data_advise) {
+        error_report("%s: '%s' doesn't have required "
+                     "initial_data_advise op",
+                     __func__, buf.idstr);
+
+        return -EOPNOTSUPP;
+    }
+
+    if (!se->ops->initial_data_advise(se->opaque)) {
+        error_report("%s: '%s' doesn't support precopy initial data", __func__,
+                     buf.idstr);
+
+        return -EOPNOTSUPP;
+    }
+
+    trace_loadvm_handle_initial_data_enable(buf.general_enable,
+                                            (char *)buf.idstr, buf.instance_id);
+
+    return 0;
+}
+
 /*
  * Process an incoming 'QEMU_VM_COMMAND'
  * 0           just a normal return
@@ -2397,6 +2496,9 @@ static int loadvm_process_command(QEMUFile *f)
 
     case MIG_CMD_ENABLE_COLO:
         return loadvm_process_enable_colo(mis);
+
+    case MIG_CMD_INITIAL_DATA_ENABLE:
+        return loadvm_handle_initial_data_enable(mis);
     }
 
     return 0;
diff --git a/migration/trace-events b/migration/trace-events
index 92161eeac5..21ae471126 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -23,6 +23,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
 loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
 loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
 loadvm_process_command_ping(uint32_t val) "0x%x"
+loadvm_handle_initial_data_enable(uint8_t general_enable, const char *idstr, int instance_id) "general_enable=%u, idstr=%s, instance_id=%u"
 postcopy_ram_listen_thread_exit(void) ""
 postcopy_ram_listen_thread_start(void) ""
 qemu_savevm_send_postcopy_advise(void) ""
@@ -38,6 +39,7 @@ savevm_send_postcopy_run(void) ""
 savevm_send_postcopy_resume(void) ""
 savevm_send_colo_enable(void) ""
 savevm_send_recv_bitmap(char *name) "%s"
+savevm_send_initial_data_enable(uint8_t general_enable, const char *idstr, int instance_id) "general_enable=%u, idstr=%s, instance_id=%u"
 savevm_state_setup(void) ""
 savevm_state_resume_prepare(void) ""
 savevm_state_header(void) ""
-- 
2.26.3




* [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality
  2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
  2023-05-01 14:01 ` [PATCH 1/8] migration: Add precopy initial data capability Avihai Horon
  2023-05-01 14:01 ` [PATCH 2/8] migration: Add precopy initial data handshake Avihai Horon
@ 2023-05-01 14:01 ` Avihai Horon
  2023-05-02 22:56   ` Peter Xu
  2023-05-10  8:54   ` Juan Quintela
  2023-05-01 14:01 ` [PATCH 4/8] migration: Enable precopy initial data capability Avihai Horon
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-01 14:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Avihai Horon, Kirti Wankhede, Tarun Gupta,
	Joao Martins

Add the core functionality of precopy initial data, which allows the
destination to ACK that initial data has been loaded and the source to
wait for this ACK before completing the migration.

A new return path command MIG_RP_MSG_INITIAL_DATA_LOADED_ACK is added.
The destination sends it to the source once all precopy initial data has
been loaded, to ACK that loading is complete.

In addition, two new SaveVMHandlers handlers are added:
1. is_initial_data_active which indicates whether precopy initial data
   is used for this migration user (i.e., SaveStateEntry).
2. initial_data_loaded which indicates whether precopy initial data has
   been loaded by this migration user.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/migration/register.h |  7 ++++++
 migration/migration.h        | 12 +++++++++++
 migration/migration.c        | 41 ++++++++++++++++++++++++++++++++++--
 migration/savevm.c           | 39 ++++++++++++++++++++++++++++++++++
 migration/trace-events       |  2 ++
 5 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/include/migration/register.h b/include/migration/register.h
index 0a73f3883e..297bbe9f73 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -77,6 +77,13 @@ typedef struct SaveVMHandlers {
      * true if it's supported or false otherwise. Called both in src and dest.
      */
     bool (*initial_data_advise)(void *opaque);
+    /*
+     * Checks if precopy initial data is active. If it's inactive,
+     * initial_data_loaded check is skipped.
+     */
+    bool (*is_initial_data_active)(void *opaque);
+    /* Checks if precopy initial data has been loaded in dest */
+    bool (*initial_data_loaded)(void *opaque);
 } SaveVMHandlers;
 
 int register_savevm_live(const char *idstr,
diff --git a/migration/migration.h b/migration/migration.h
index 4f615e9dbc..d865c23d87 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -207,6 +207,11 @@ struct MigrationIncomingState {
 
     /* Indicates whether precopy initial data was enabled and should be used */
     bool initial_data_enabled;
+    /*
+     * Indicates whether an ACK that precopy initial data was loaded has been
+     * sent to source.
+     */
+    bool initial_data_loaded_ack_sent;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
@@ -435,6 +440,12 @@ struct MigrationState {
 
     /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
     JSONWriter *vmdesc;
+
+    /*
+     * Indicates whether an ACK that precopy initial data was loaded in
+     * destination has been received.
+     */
+    bool initial_data_loaded_acked;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
@@ -475,6 +486,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
 void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
                                  char *block_name);
 void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
+int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis);
 
 void dirty_bitmap_mig_before_vm_start(void);
 void dirty_bitmap_mig_cancel_outgoing(void);
diff --git a/migration/migration.c b/migration/migration.c
index 68cdf5b184..304cab2fa1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -77,6 +77,11 @@ enum mig_rp_message_type {
     MIG_RP_MSG_RECV_BITMAP,  /* send recved_bitmap back to source */
     MIG_RP_MSG_RESUME_ACK,   /* tell source that we are ready to resume */
 
+    MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, /*
+                                         * Tell source precopy initial data is
+                                         * loaded.
+                                         */
+
     MIG_RP_MSG_MAX
 };
 
@@ -756,6 +761,12 @@ bool migration_has_all_channels(void)
     return true;
 }
 
+int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis)
+{
+    return migrate_send_rp_message(mis, MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, 0,
+                                   NULL);
+}
+
 /*
  * Send a 'SHUT' message on the return channel with the given value
  * to indicate that we've finished with the RP.  Non-0 value indicates
@@ -1401,6 +1412,8 @@ void migrate_init(MigrationState *s)
     s->vm_was_running = false;
     s->iteration_initial_bytes = 0;
     s->threshold_size = 0;
+
+    s->initial_data_loaded_acked = false;
 }
 
 int migrate_add_blocker_internal(Error *reason, Error **errp)
@@ -1717,6 +1730,9 @@ static struct rp_cmd_args {
     [MIG_RP_MSG_REQ_PAGES_ID]   = { .len = -1, .name = "REQ_PAGES_ID" },
     [MIG_RP_MSG_RECV_BITMAP]    = { .len = -1, .name = "RECV_BITMAP" },
     [MIG_RP_MSG_RESUME_ACK]     = { .len =  4, .name = "RESUME_ACK" },
+    [MIG_RP_MSG_INITIAL_DATA_LOADED_ACK] = { .len = 0,
+                                             .name =
+                                                 "INITIAL_DATA_LOADED_ACK" },
     [MIG_RP_MSG_MAX]            = { .len = -1, .name = "MAX" },
 };
 
@@ -1955,6 +1971,11 @@ retry:
             }
             break;
 
+        case MIG_RP_MSG_INITIAL_DATA_LOADED_ACK:
+            ms->initial_data_loaded_acked = true;
+            trace_source_return_path_thread_initial_data_loaded_ack();
+            break;
+
         default:
             break;
         }
@@ -2704,6 +2725,20 @@ static void migration_update_counters(MigrationState *s,
                               bandwidth, s->threshold_size);
 }
 
+static bool initial_data_loaded_acked(MigrationState *s)
+{
+    if (!migrate_precopy_initial_data()) {
+        return true;
+    }
+
+    /* No reason to wait for precopy initial data loaded ACK if VM is stopped */
+    if (!runstate_is_running()) {
+        return true;
+    }
+
+    return s->initial_data_loaded_acked;
+}
+
 /* Migration thread iteration status */
 typedef enum {
     MIG_ITERATE_RESUME,         /* Resume current iteration */
@@ -2719,6 +2754,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
 {
     uint64_t must_precopy, can_postcopy;
     bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
+    bool initial_data_loaded = initial_data_loaded_acked(s);
 
     qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
     uint64_t pending_size = must_precopy + can_postcopy;
@@ -2731,7 +2767,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
         trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
     }
 
-    if (!pending_size || pending_size < s->threshold_size) {
+    if ((!pending_size || pending_size < s->threshold_size) &&
+        initial_data_loaded) {
         trace_migration_thread_low_pending(pending_size);
         migration_completion(s);
         return MIG_ITERATE_BREAK;
@@ -2739,7 +2776,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
 
     /* Still a significant amount to transfer */
     if (!in_postcopy && must_precopy <= s->threshold_size &&
-        qatomic_read(&s->start_postcopy)) {
+        initial_data_loaded && qatomic_read(&s->start_postcopy)) {
         if (postcopy_start(s)) {
             error_report("%s: postcopy failed to start", __func__);
         }
diff --git a/migration/savevm.c b/migration/savevm.c
index 2740defdf0..7a94deda3b 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2504,6 +2504,39 @@ static int loadvm_process_command(QEMUFile *f)
     return 0;
 }
 
+static int qemu_loadvm_initial_data_loaded_ack(MigrationIncomingState *mis)
+{
+    SaveStateEntry *se;
+    int ret;
+
+    if (!mis->initial_data_enabled || mis->initial_data_loaded_ack_sent) {
+        return 0;
+    }
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->initial_data_loaded) {
+            continue;
+        }
+
+        if (!se->ops->is_initial_data_active ||
+            !se->ops->is_initial_data_active(se->opaque)) {
+            continue;
+        }
+
+        if (!se->ops->initial_data_loaded(se->opaque)) {
+            return 0;
+        }
+    }
+
+    ret = migrate_send_rp_initial_data_loaded_ack(mis);
+    if (!ret) {
+        mis->initial_data_loaded_ack_sent = true;
+        trace_loadvm_initial_data_loaded_acked();
+    }
+
+    return ret;
+}
+
 /*
  * Read a footer off the wire and check that it matches the expected section
  *
@@ -2826,6 +2859,12 @@ retry:
             if (ret < 0) {
                 goto out;
             }
+
+            ret = qemu_loadvm_initial_data_loaded_ack(mis);
+            if (ret < 0) {
+                goto out;
+            }
+
             break;
         case QEMU_VM_COMMAND:
             ret = loadvm_process_command(f);
diff --git a/migration/trace-events b/migration/trace-events
index 21ae471126..a0e1d3b2fd 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -24,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s:
 loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
 loadvm_process_command_ping(uint32_t val) "0x%x"
 loadvm_handle_initial_data_enable(uint8_t general_enable, const char *idstr, int instance_id) "general_enable=%u, idstr=%s, instance_id=%u"
+loadvm_initial_data_loaded_acked(void) ""
 postcopy_ram_listen_thread_exit(void) ""
 postcopy_ram_listen_thread_start(void) ""
 qemu_savevm_send_postcopy_advise(void) ""
@@ -182,6 +183,7 @@ source_return_path_thread_loop_top(void) ""
 source_return_path_thread_pong(uint32_t val) "0x%x"
 source_return_path_thread_shut(uint32_t val) "0x%x"
 source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
+source_return_path_thread_initial_data_loaded_ack(void) ""
 migration_thread_low_pending(uint64_t pending) "%" PRIu64
 migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
 process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
-- 
2.26.3



^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 4/8] migration: Enable precopy initial data capability
  2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
                   ` (2 preceding siblings ...)
  2023-05-01 14:01 ` [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality Avihai Horon
@ 2023-05-01 14:01 ` Avihai Horon
  2023-05-10  8:55   ` Juan Quintela
  2023-05-01 14:01 ` [PATCH 5/8] tests: Add migration precopy initial data capability test Avihai Horon
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-05-01 14:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Avihai Horon, Kirti Wankhede, Tarun Gupta,
	Joao Martins

Now that the precopy initial data logic has been implemented, enable the
capability.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/options.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/migration/options.c b/migration/options.c
index c4ef0c60c7..77a570f866 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -560,10 +560,6 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
             error_setg(errp, "Precopy initial data requires return path");
             return false;
         }
-
-        /* Disable this capability until it's implemented */
-        error_setg(errp, "Precopy initial data is not implemented yet");
-        return false;
     }
 
     return true;
-- 
2.26.3




* [PATCH 5/8] tests: Add migration precopy initial data capability test
  2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
                   ` (3 preceding siblings ...)
  2023-05-01 14:01 ` [PATCH 4/8] migration: Enable precopy initial data capability Avihai Horon
@ 2023-05-01 14:01 ` Avihai Horon
  2023-05-10  8:55   ` Juan Quintela
  2023-05-01 14:01 ` [PATCH 6/8] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-05-01 14:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Avihai Horon, Kirti Wankhede, Tarun Gupta,
	Joao Martins

Add a migration test for the precopy initial data capability. The test
runs without any migration user that supports this capability, but it is
still useful to make sure that enabling the capability doesn't break
anything.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 tests/qtest/migration-test.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 60dd53d3ec..71d30bd330 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1533,6 +1533,25 @@ static void test_precopy_tcp_plain(void)
     test_precopy_common(&args);
 }
 
+static void *test_migrate_initial_data_start(QTestState *from, QTestState *to)
+{
+
+    migrate_set_capability(from, "return-path", true);
+    migrate_set_capability(from, "precopy-initial-data", true);
+
+    return NULL;
+}
+
+static void test_precopy_tcp_initial_data(void)
+{
+    MigrateCommon args = {
+        .listen_uri = "tcp:127.0.0.1:0",
+        .start_hook = test_migrate_initial_data_start,
+    };
+
+    test_precopy_common(&args);
+}
+
 #ifdef CONFIG_GNUTLS
 static void test_precopy_tcp_tls_psk_match(void)
 {
@@ -2557,6 +2576,10 @@ int main(int argc, char **argv)
 #endif /* CONFIG_GNUTLS */
 
     qtest_add_func("/migration/precopy/tcp/plain", test_precopy_tcp_plain);
+
+    qtest_add_func("/migration/precopy/tcp/plain/precopy-initial-data",
+                   test_precopy_tcp_initial_data);
+
 #ifdef CONFIG_GNUTLS
     qtest_add_func("/migration/precopy/tcp/tls/psk/match",
                    test_precopy_tcp_tls_psk_match);
-- 
2.26.3




* [PATCH 6/8] vfio/migration: Refactor vfio_save_block() to return saved data size
  2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
                   ` (4 preceding siblings ...)
  2023-05-01 14:01 ` [PATCH 5/8] tests: Add migration precopy initial data capability test Avihai Horon
@ 2023-05-01 14:01 ` Avihai Horon
  2023-05-10  9:00   ` Juan Quintela
  2023-05-01 14:01 ` [PATCH 7/8] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-05-01 14:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Avihai Horon, Kirti Wankhede, Tarun Gupta,
	Joao Martins

Refactor vfio_save_block() to return the size of the saved data on
success and -errno on error.

This will be used in the next patch to implement VFIO migration pre-copy
support.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/migration.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 6b58dddb88..235978fd68 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -241,8 +241,8 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
     return 0;
 }
 
-/* Returns 1 if end-of-stream is reached, 0 if more data and -errno if error */
-static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
+/* Returns the size of saved data on success and -errno on error */
+static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 {
     ssize_t data_size;
 
@@ -252,7 +252,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
         return -errno;
     }
     if (data_size == 0) {
-        return 1;
+        return 0;
     }
 
     qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
@@ -262,7 +262,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 
     trace_vfio_save_block(migration->vbasedev->name, data_size);
 
-    return qemu_file_get_error(f);
+    return qemu_file_get_error(f) ?: data_size;
 }
 
 /* ---------------------------------------------------------------------- */
@@ -335,6 +335,7 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
+    ssize_t data_size;
     int ret;
 
     /* We reach here with device state STOP only */
@@ -345,11 +346,11 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     }
 
     do {
-        ret = vfio_save_block(f, vbasedev->migration);
-        if (ret < 0) {
-            return ret;
+        data_size = vfio_save_block(f, vbasedev->migration);
+        if (data_size < 0) {
+            return data_size;
         }
-    } while (!ret);
+    } while (data_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
     ret = qemu_file_get_error(f);
-- 
2.26.3




* [PATCH 7/8] vfio/migration: Add VFIO migration pre-copy support
  2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
                   ` (5 preceding siblings ...)
  2023-05-01 14:01 ` [PATCH 6/8] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
@ 2023-05-01 14:01 ` Avihai Horon
  2023-05-01 14:01 ` [PATCH 8/8] vfio/migration: Add support for precopy initial data capability Avihai Horon
  2023-05-02 22:49 ` [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Peter Xu
  8 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-01 14:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Avihai Horon, Kirti Wankhede, Tarun Gupta,
	Joao Martins

Pre-copy support allows the VFIO device data to be transferred while the
VM is running. This helps to accommodate VFIO devices that have a large
amount of data that needs to be transferred, and it can reduce migration
downtime.

Pre-copy support is optional in VFIO migration protocol v2.
Implement pre-copy for VFIO migration protocol v2 and use it for devices
that support it. A full description of the protocol can be found in [1].

[1]
https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst |  35 +++++---
 include/hw/vfio/vfio-common.h |   3 +
 hw/vfio/common.c              |   6 +-
 hw/vfio/migration.c           | 155 ++++++++++++++++++++++++++++++++--
 hw/vfio/trace-events          |   4 +-
 5 files changed, 181 insertions(+), 22 deletions(-)

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 1b68ccf115..e896b2a673 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices currently consists of a single stop-and-copy phase.
-During the stop-and-copy phase the guest is stopped and the entire VFIO device
-data is transferred to the destination.
-
-The pre-copy phase of migration is currently not supported for VFIO devices.
-Support for VFIO pre-copy will be added later on.
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and allows to
+accommodate VFIO devices that have a large amount of data that needs to be
+transferred. The iterative pre-copy phase of migration allows for the guest to
+continue whilst the VFIO device state is transferred to the destination, this
+helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
+support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
+VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
 Note that currently VFIO migration is supported only for a single device. This
 is due to VFIO migration's lack of P2P support. However, P2P support is planned
@@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
 * A ``load_setup`` function that sets the VFIO device on the destination in
   _RESUMING state.
 
+* A ``state_pending_estimate`` function that reports an estimate of the
+  remaining pre-copy data that the vendor driver has yet to save for the VFIO
+  device.
+
 * A ``state_pending_exact`` function that reads pending_bytes from the vendor
   driver, which indicates the amount of data that the vendor driver has yet to
   save for the VFIO device.
 
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+  active only when the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver during iterative pre-copy phase.
+
 * A ``save_state`` function to save the device config space if it is present.
 
 * A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -111,8 +123,10 @@ Flow of state changes during Live migration
 ===========================================
 
 Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
+The values in the parentheses represent the VM state, the migration state, and
 the VFIO device state, respectively.
+The text in the square brackets represents the flow if the VFIO device supports
+pre-copy.
 
 Live migration save path
 ------------------------
@@ -124,11 +138,12 @@ Live migration save path
                                   |
                      migrate_init spawns migration_thread
                 Migration thread then calls each device's .save_setup()
-                       (RUNNING, _SETUP, _RUNNING)
+                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
                                   |
-                      (RUNNING, _ACTIVE, _RUNNING)
-             If device is active, get pending_bytes by .state_pending_exact()
+                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
+      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
           If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  [Data of VFIO device for pre-copy phase is copied]
         Iterate till total pending bytes converge and are less than threshold
                                   |
   On migration completion, vCPU stops and calls .save_live_complete_precopy for
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index eed244f25f..fa42955d4c 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,9 @@ typedef struct VFIOMigration {
     int data_fd;
     void *data_buffer;
     size_t data_buffer_size;
+    uint64_t precopy_init_size;
+    uint64_t precopy_dirty_size;
+    uint64_t mig_flags;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 4d01ea3515..4f01f0148e 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -491,7 +491,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
             }
 
             if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
-                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
                 return false;
             }
         }
@@ -536,7 +537,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
                 return false;
             }
 
-            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 235978fd68..980be1f614 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
         return "STOP_COPY";
     case VFIO_DEVICE_STATE_RESUMING:
         return "RESUMING";
+    case VFIO_DEVICE_STATE_PRE_COPY:
+        return "PRE_COPY";
     default:
         return "UNKNOWN STATE";
     }
@@ -241,6 +243,22 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
     return 0;
 }
 
+static int vfio_query_precopy_size(VFIOMigration *migration)
+{
+    struct vfio_precopy_info precopy = {
+        .argsz = sizeof(precopy),
+    };
+
+    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
+        return -errno;
+    }
+
+    migration->precopy_init_size = precopy.initial_bytes;
+    migration->precopy_dirty_size = precopy.dirty_bytes;
+
+    return 0;
+}
+
 /* Returns the size of saved data on success and -errno on error */
 static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 {
@@ -249,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     data_size = read(migration->data_fd, migration->data_buffer,
                      migration->data_buffer_size);
     if (data_size < 0) {
+        /* Pre-copy emptied all the device state for now */
+        if (errno == ENOMSG) {
+            return 0;
+        }
+
         return -errno;
     }
     if (data_size == 0) {
@@ -265,6 +288,31 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     return qemu_file_get_error(f) ?: data_size;
 }
 
+static void vfio_update_estimated_pending_data(VFIOMigration *migration,
+                                               uint64_t data_size)
+{
+    if (!data_size) {
+        /*
+         * Pre-copy emptied all the device state for now, update estimated sizes
+         * accordingly.
+         */
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+
+        return;
+    }
+
+    if (migration->precopy_init_size) {
+        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
+
+        migration->precopy_init_size -= init_size;
+        data_size -= init_size;
+    }
+
+    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
+                                         data_size);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -285,6 +333,31 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
         return -ENOMEM;
     }
 
+    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        int ret;
+
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+
+        switch (migration->device_state) {
+        case VFIO_DEVICE_STATE_RUNNING:
+            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
+                                           VFIO_DEVICE_STATE_RUNNING);
+            if (ret) {
+                return ret;
+            }
+
+            vfio_query_precopy_size(migration);
+
+            break;
+        case VFIO_DEVICE_STATE_STOP:
+            /* vfio_save_complete_precopy() will go to STOP_COPY */
+            break;
+        default:
+            return -EINVAL;
+        }
+    }
+
     trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
@@ -303,22 +376,36 @@ static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
+                                        uint64_t *can_postcopy)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
+        return;
+    }
+
+    *must_precopy +=
+        migration->precopy_init_size + migration->precopy_dirty_size;
+
+    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
+                                      *can_postcopy,
+                                      migration->precopy_init_size,
+                                      migration->precopy_dirty_size);
+}
+
 /*
  * Migration size of VFIO devices can be as little as a few KBs or as big as
  * many GBs. This value should be big enough to cover the worst case.
  */
 #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
 
-/*
- * Only exact function is implemented and not estimate function. The reason is
- * that during pre-copy phase of migration the estimate function is called
- * repeatedly while pending RAM size is over the threshold, thus migration
- * can't converge and querying the VFIO device pending data size is useless.
- */
 static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
                                      uint64_t *can_postcopy)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
 
     /*
@@ -328,8 +415,49 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
     vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
     *must_precopy += stop_copy_size;
 
+    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+        vfio_query_precopy_size(migration);
+
+        *must_precopy +=
+            migration->precopy_init_size + migration->precopy_dirty_size;
+    }
+
     trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
-                                   stop_copy_size);
+                                   stop_copy_size, migration->precopy_init_size,
+                                   migration->precopy_dirty_size);
+}
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    ssize_t data_size;
+
+    data_size = vfio_save_block(f, migration);
+    if (data_size < 0) {
+        return data_size;
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    vfio_update_estimated_pending_data(migration, data_size);
+
+    trace_vfio_save_iterate(vbasedev->name);
+
+    /*
+     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
+     * Return 1 so following handlers will not be potentially blocked.
+     */
+    return 1;
 }
 
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
@@ -338,7 +466,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     ssize_t data_size;
     int ret;
 
-    /* We reach here with device state STOP only */
+    /* We reach here with device state STOP or STOP_COPY only */
     ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
                                    VFIO_DEVICE_STATE_STOP);
     if (ret) {
@@ -457,7 +585,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
 static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .state_pending_estimate = vfio_state_pending_estimate,
     .state_pending_exact = vfio_state_pending_exact,
+    .is_active_iterate = vfio_is_active_iterate,
+    .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
@@ -470,13 +601,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     enum vfio_device_mig_state new_state;
     int ret;
 
     if (running) {
         new_state = VFIO_DEVICE_STATE_RUNNING;
     } else {
-        new_state = VFIO_DEVICE_STATE_STOP;
+        new_state =
+            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
+             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
+                VFIO_DEVICE_STATE_STOP_COPY :
+                VFIO_DEVICE_STATE_STOP;
     }
 
     /*
@@ -603,6 +739,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
     migration->vbasedev = vbasedev;
     migration->device_state = VFIO_DEVICE_STATE_RUNNING;
     migration->data_fd = -1;
+    migration->mig_flags = mig_flags;
 
     vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
 
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 646e42fd27..fd6893cb43 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_iterate(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
-vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
+vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
-- 
2.26.3




* [PATCH 8/8] vfio/migration: Add support for precopy initial data capability
  2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
                   ` (6 preceding siblings ...)
  2023-05-01 14:01 ` [PATCH 7/8] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
@ 2023-05-01 14:01 ` Avihai Horon
  2023-05-02 22:49 ` [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Peter Xu
  8 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-01 14:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alex Williamson, Cédric Le Goater, Juan Quintela, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Avihai Horon, Kirti Wankhede, Tarun Gupta,
	Joao Martins

Loading of a VFIO device's data can take a substantial amount of time as
the device may need to allocate resources, prepare internal data
structures, etc. This can increase migration downtime, especially for
VFIO devices with a lot of resources.

To solve this, the VFIO migration uAPI defines "initial bytes" as part
of its precopy data stream. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, they can be used to
transfer device metadata so the destination can pre-allocate resources.
However, for this to work, all initial bytes must be sent and loaded in
the destination before the source VM is stopped.

Use the migration precopy initial data capability to ensure a VFIO
device's initial bytes are sent and loaded in the destination before the
source stops the VM and attempts to complete the migration.
This can significantly reduce migration downtime.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 include/hw/vfio/vfio-common.h |  3 +++
 hw/vfio/migration.c           | 48 ++++++++++++++++++++++++++++++++++-
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index fa42955d4c..dd3b052682 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -69,6 +69,9 @@ typedef struct VFIOMigration {
     uint64_t precopy_init_size;
     uint64_t precopy_dirty_size;
     uint64_t mig_flags;
+    bool initial_data_active;
+    bool initial_data_sent;
+    bool initial_data_loaded;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 980be1f614..23f4f1f8a5 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -45,6 +45,7 @@
 #define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
 
 /*
  * This is an arbitrary size based on migration of mlx5 devices, where typically
@@ -372,6 +373,8 @@ static void vfio_save_cleanup(void *opaque)
 
     g_free(migration->data_buffer);
     migration->data_buffer = NULL;
+    migration->initial_data_sent = false;
+    migration->initial_data_active = false;
     vfio_migration_cleanup(vbasedev);
     trace_vfio_save_cleanup(vbasedev->name);
 }
@@ -447,10 +450,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
     if (data_size < 0) {
         return data_size;
     }
-    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
     vfio_update_estimated_pending_data(migration, data_size);
 
+    if (migration->initial_data_active && !migration->precopy_init_size &&
+        !migration->initial_data_sent) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
+        migration->initial_data_sent = true;
+    } else {
+        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+    }
+
     trace_vfio_save_iterate(vbasedev->name);
 
     /*
@@ -568,6 +578,12 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
             }
             break;
         }
+        case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
+        {
+            vbasedev->migration->initial_data_loaded = true;
+
+            return 0;
+        }
         default:
             error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
             return -EINVAL;
@@ -582,6 +598,33 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
     return ret;
 }
 
+static bool vfio_initial_data_advise(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    migration->initial_data_active =
+        migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
+
+    return migration->initial_data_active;
+}
+
+static bool vfio_is_initial_data_active(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->initial_data_active;
+}
+
+static bool vfio_initial_data_loaded(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->initial_data_loaded;
+}
+
 static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
@@ -594,6 +637,9 @@ static const SaveVMHandlers savevm_vfio_handlers = {
     .load_setup = vfio_load_setup,
     .load_cleanup = vfio_load_cleanup,
     .load_state = vfio_load_state,
+    .initial_data_advise = vfio_initial_data_advise,
+    .is_initial_data_active = vfio_is_initial_data_active,
+    .initial_data_loaded = vfio_initial_data_loaded,
 };
 
 /* ---------------------------------------------------------------------- */
-- 
2.26.3




* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
                   ` (7 preceding siblings ...)
  2023-05-01 14:01 ` [PATCH 8/8] vfio/migration: Add support for precopy initial data capability Avihai Horon
@ 2023-05-02 22:49 ` Peter Xu
  2023-05-03 15:22   ` Avihai Horon
  8 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2023-05-02 22:49 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote:
> Hello everyone,

Hi, Avihai,

> === Flow of operation ===
> 
> To use precopy initial data, the capability must be enabled in the
> source.
> 
> As this capability must be supported also in the destination, a
> handshake is performed during migration setup. The purpose of the
> handshake is to notify the destination that precopy initial data is used
> and to check if it's supported.
> 
> The handshake is done in two levels. First, a general handshake is done
> with the destination migration code to notify that precopy initial data
> is used. Then, for each migration user in the source that supports
> precopy initial data, a handshake is done with its counterpart in the
> destination:
> If both support it, precopy initial data will be used for them.
> If source doesn't support it, precopy initial data will not be used for
> them.
> If source supports it and destination doesn't, migration will fail.
> 
> Assuming the handshake succeeded, migration starts to send precopy data
> and as part of it also the initial precopy data. Initial precopy data is
> just like any other precopy data and as such, migration code is not
> aware of it. Therefore, it's the responsibility of the migration users
> (such as VFIO devices) to notify their counterparts in the destination
> that their initial precopy data has been sent (for example, VFIO
> migration does it when its initial bytes reach zero).
> 
> In the destination, migration code will query each migration user that
> supports precopy initial data and check if its initial data has been
> loaded. If initial data has been loaded by all of them, an ACK will be
> sent to the source which will now be able to complete migration when
> appropriate.

I can understand why this is useful; what I'm not 100% sure about is whether
the complexity is needed.  The idea seems to be that src never switches over
unless it receives a READY notification from dst.

I'm imagining the simplified and more general workflow below; not sure whether
it could work for you:

  - Introduce a new cap "switchover-ready", it means whether there'll be a
    ready event sent from dst -> src for "being ready for switchover"

  - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and
    handled on src showing that dest is ready for switchover. It'll be sent
    only if dest is ready for the switchover

  - Introduce a field SaveVMHandlers.explicit_switchover_needed.  For each
    special device like vfio that would like to participate in the decision
    making, device can set its explicit_switchover_needed=1.  This field is
    ignored if the new cap is not set.

  - Dst qemu: when new cap set, remember how many special devices are there
    requesting explicit switchover (count of SaveVMHandlers that has the
    bit set during load setup) as switch_over_pending=N.

  - Dst qemu: Once a device thinks it's fine to switch over (probably in the
    load_state() callback), it calls migration_notify_switchover_ready().
    That decreases switch_over_pending and when it hits zero, one msg
    MIG_RP_MSG_SWITCHOVER_READY will be sent to src.

Only after the READY msg is received on src can src switch over the precopy to
dst.

Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
more msg (dst->src).

This is based on the fact that right now we always set caps on both qemus
so I suppose it already means either both have or don't have the feature
(even if one has, not setting the cap means disabled on both).

Would it work for this case, and be cleaner?

Thanks,

-- 
Peter Xu




* Re: [PATCH 2/8] migration: Add precopy initial data handshake
  2023-05-01 14:01 ` [PATCH 2/8] migration: Add precopy initial data handshake Avihai Horon
@ 2023-05-02 22:54   ` Peter Xu
  2023-05-03 15:31     ` Avihai Horon
  2023-05-10  8:40   ` Juan Quintela
  2023-05-14 16:42   ` Cédric Le Goater
  2 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2023-05-02 22:54 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

(I've left a high-level comment in the cover letter, but still some quick
comments I noticed while reading)

On Mon, May 01, 2023 at 05:01:35PM +0300, Avihai Horon wrote:
> Add precopy initial data handshake between source and destination upon
> migration setup. The purpose of the handshake is to notify the
> destination that precopy initial data is used and which migration users
> (i.e., SaveStateEntry) are going to use it.
> 
> The handshake is done in two levels. First, a general enable command is
> sent to notify the destination migration code that precopy initial data
> is used.
> Then, for each migration user in the source that supports precopy
> initial data, an enable command is sent to its counterpart in the
> destination:
> If both support it, precopy initial data will be used for them.
> If source doesn't support it, precopy initial data will not be used for
> them.
> If source supports it and destination doesn't, migration will fail.
> 
> To implement it, a new migration command MIG_CMD_INITIAL_DATA_ENABLE is
> added, as well as a new SaveVMHandlers handler initial_data_advise.
> Calling the handler advises the migration user that precopy initial data
> is used and its return value indicates whether precopy initial data is
> supported by it.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  include/migration/register.h |   6 +++
>  migration/migration.h        |   3 ++
>  migration/savevm.h           |   1 +
>  migration/migration.c        |   4 ++
>  migration/savevm.c           | 102 +++++++++++++++++++++++++++++++++++
>  migration/trace-events       |   2 +
>  6 files changed, 118 insertions(+)
> 
> diff --git a/include/migration/register.h b/include/migration/register.h
> index a8dfd8fefd..0a73f3883e 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -71,6 +71,12 @@ typedef struct SaveVMHandlers {
>      int (*load_cleanup)(void *opaque);
>      /* Called when postcopy migration wants to resume from failure */
>      int (*resume_prepare)(MigrationState *s, void *opaque);
> +
> +    /*
> +     * Advises that precopy initial data was requested to be enabled. Returns
> +     * true if it's supported or false otherwise. Called both in src and dest.
> +     */
> +    bool (*initial_data_advise)(void *opaque);
>  } SaveVMHandlers;
>  
>  int register_savevm_live(const char *idstr,
> diff --git a/migration/migration.h b/migration/migration.h
> index 3a918514e7..4f615e9dbc 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -204,6 +204,9 @@ struct MigrationIncomingState {
>       * contains valid information.
>       */
>      QemuMutex page_request_mutex;
> +
> +    /* Indicates whether precopy initial data was enabled and should be used */
> +    bool initial_data_enabled;
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> diff --git a/migration/savevm.h b/migration/savevm.h
> index fb636735f0..d47ab4ad18 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -58,6 +58,7 @@ void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
>                                             uint64_t *start_list,
>                                             uint64_t *length_list);
>  void qemu_savevm_send_colo_enable(QEMUFile *f);
> +void qemu_savevm_send_initial_data_enable(MigrationState *ms, QEMUFile *f);
>  void qemu_savevm_live_state(QEMUFile *f);
>  int qemu_save_device_state(QEMUFile *f);
>  
> diff --git a/migration/migration.c b/migration/migration.c
> index abcadbb619..68cdf5b184 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2964,6 +2964,10 @@ static void *migration_thread(void *opaque)
>          qemu_savevm_send_colo_enable(s->to_dst_file);
>      }
>  
> +    if (migrate_precopy_initial_data()) {
> +        qemu_savevm_send_initial_data_enable(s, s->to_dst_file);
> +    }
> +
>      qemu_savevm_state_setup(s->to_dst_file);
>  
>      qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
> diff --git a/migration/savevm.c b/migration/savevm.c
> index a9181b444b..2740defdf0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -71,6 +71,13 @@
>  
>  const unsigned int postcopy_ram_discard_version;
>  
> +typedef struct {
> +    uint8_t general_enable;
> +    uint8_t reserved[7];
> +    uint8_t idstr[256];
> +    uint32_t instance_id;
> +} InitialDataInfo;
> +
>  /* Subcommands for QEMU_VM_COMMAND */
>  enum qemu_vm_cmd {
>      MIG_CMD_INVALID = 0,   /* Must be 0 */
> @@ -90,6 +97,8 @@ enum qemu_vm_cmd {
>      MIG_CMD_ENABLE_COLO,       /* Enable COLO */
>      MIG_CMD_POSTCOPY_RESUME,   /* resume postcopy on dest */
>      MIG_CMD_RECV_BITMAP,       /* Request for recved bitmap on dst */
> +
> +    MIG_CMD_INITIAL_DATA_ENABLE, /* Enable precopy initial data in dest */
>      MIG_CMD_MAX
>  };
>  
> @@ -109,6 +118,8 @@ static struct mig_cmd_args {
>      [MIG_CMD_POSTCOPY_RESUME]  = { .len =  0, .name = "POSTCOPY_RESUME" },
>      [MIG_CMD_PACKAGED]         = { .len =  4, .name = "PACKAGED" },
>      [MIG_CMD_RECV_BITMAP]      = { .len = -1, .name = "RECV_BITMAP" },
> +    [MIG_CMD_INITIAL_DATA_ENABLE] = { .len = sizeof(InitialDataInfo),
> +                                      .name = "INITIAL_DATA_ENABLE" },
>      [MIG_CMD_MAX]              = { .len = -1, .name = "MAX" },
>  };
>  
> @@ -1036,6 +1047,40 @@ static void qemu_savevm_command_send(QEMUFile *f,
>      qemu_fflush(f);
>  }
>  
> +void qemu_savevm_send_initial_data_enable(MigrationState *ms, QEMUFile *f)
> +{
> +    SaveStateEntry *se;
> +    InitialDataInfo buf;
> +
> +    /* Enable precopy initial data generally in the migration */
> +    memset(&buf, 0, sizeof(buf));
> +    buf.general_enable = 1;
> +    qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
> +                             (uint8_t *)&buf);
> +    trace_savevm_send_initial_data_enable(buf.general_enable, (char *)buf.idstr,
> +                                          buf.instance_id);

Maybe a generalized helper would be nice here to send one
MIG_CMD_INITIAL_DATA_ENABLE packet; then it can be used below too.

Here instance_id is multi-byte, so we may want to consider endianness.  It
seems the common way is to use big endian over the wire for qemu migration
msgs.

> +
> +    /* Enable precopy initial data for each migration user that supports it */
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (!se->ops || !se->ops->initial_data_advise) {
> +            continue;
> +        }
> +
> +        if (!se->ops->initial_data_advise(se->opaque)) {
> +            continue;
> +        }
> +
> +        memset(&buf, 0, sizeof(buf));
> +        memcpy(buf.idstr, se->idstr, sizeof(buf.idstr));
> +        buf.instance_id = se->instance_id;
> +
> +        qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
> +                                 (uint8_t *)&buf);
> +        trace_savevm_send_initial_data_enable(
> +            buf.general_enable, (char *)buf.idstr, buf.instance_id);
> +    }
> +}
> +
>  void qemu_savevm_send_colo_enable(QEMUFile *f)
>  {
>      trace_savevm_send_colo_enable();
> @@ -2314,6 +2359,60 @@ static int loadvm_process_enable_colo(MigrationIncomingState *mis)
>      return ret;
>  }
>  
> +static int loadvm_handle_initial_data_enable(MigrationIncomingState *mis)
> +{
> +    InitialDataInfo buf;
> +    SaveStateEntry *se;
> +    ssize_t read_size;
> +
> +    read_size = qemu_get_buffer(mis->from_src_file, (void *)&buf, sizeof(buf));
> +    if (read_size != sizeof(buf)) {
> +        error_report("%s: Could not get data buffer, read_size %ld, len %lu",
> +                     __func__, read_size, sizeof(buf));
> +
> +        return -EIO;
> +    }
> +
> +    /* Enable precopy initial data generally in the migration */
> +    if (buf.general_enable) {
> +        mis->initial_data_enabled = true;
> +        trace_loadvm_handle_initial_data_enable(
> +            buf.general_enable, (char *)buf.idstr, buf.instance_id);
> +
> +        return 0;
> +    }
> +
> +    /* Enable precopy initial data for a specific migration user */
> +    se = find_se((char *)buf.idstr, buf.instance_id);
> +    if (!se) {
> +        error_report("%s: Could not find SaveStateEntry, idstr '%s', "
> +                     "instance_id %" PRIu32,
> +                     __func__, buf.idstr, buf.instance_id);
> +
> +        return -ENOENT;
> +    }
> +
> +    if (!se->ops || !se->ops->initial_data_advise) {
> +        error_report("%s: '%s' doesn't have required "
> +                     "initial_data_advise op",
> +                     __func__, buf.idstr);
> +
> +        return -EOPNOTSUPP;
> +    }
> +
> +    if (!se->ops->initial_data_advise(se->opaque)) {
> +        error_report("%s: '%s' doesn't support precopy initial data", __func__,
> +                     buf.idstr);
> +
> +        return -EOPNOTSUPP;
> +    }
> +
> +    trace_loadvm_handle_initial_data_enable(buf.general_enable,
> +                                            (char *)buf.idstr, buf.instance_id);
> +
> +    return 0;
> +}
> +
>  /*
>   * Process an incoming 'QEMU_VM_COMMAND'
>   * 0           just a normal return
> @@ -2397,6 +2496,9 @@ static int loadvm_process_command(QEMUFile *f)
>  
>      case MIG_CMD_ENABLE_COLO:
>          return loadvm_process_enable_colo(mis);
> +
> +    case MIG_CMD_INITIAL_DATA_ENABLE:
> +        return loadvm_handle_initial_data_enable(mis);
>      }
>  
>      return 0;
> diff --git a/migration/trace-events b/migration/trace-events
> index 92161eeac5..21ae471126 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -23,6 +23,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
>  loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
>  loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
>  loadvm_process_command_ping(uint32_t val) "0x%x"
> +loadvm_handle_initial_data_enable(uint8_t general_enable, const char *idstr, int instance_id) "general_enable=%u, idstr=%s, instance_id=%u"
>  postcopy_ram_listen_thread_exit(void) ""
>  postcopy_ram_listen_thread_start(void) ""
>  qemu_savevm_send_postcopy_advise(void) ""
> @@ -38,6 +39,7 @@ savevm_send_postcopy_run(void) ""
>  savevm_send_postcopy_resume(void) ""
>  savevm_send_colo_enable(void) ""
>  savevm_send_recv_bitmap(char *name) "%s"
> +savevm_send_initial_data_enable(uint8_t general_enable, const char *idstr, int instance_id) "general_enable=%u, idstr=%s, instance_id=%u"
>  savevm_state_setup(void) ""
>  savevm_state_resume_prepare(void) ""
>  savevm_state_header(void) ""
> -- 
> 2.26.3
> 

-- 
Peter Xu




* Re: [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality
  2023-05-01 14:01 ` [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality Avihai Horon
@ 2023-05-02 22:56   ` Peter Xu
  2023-05-03 15:36     ` Avihai Horon
  2023-05-10  8:54   ` Juan Quintela
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Xu @ 2023-05-02 22:56 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Mon, May 01, 2023 at 05:01:36PM +0300, Avihai Horon wrote:
> Add the core functionality of precopy initial data, which allows the
> destination to ACK that initial data has been loaded and the source to
> wait for this ACK before completing the migration.
> 
> A new return path command MIG_RP_MSG_INITIAL_DATA_LOADED_ACK is added.
> It is sent by the destination after precopy initial data is loaded to
> ACK to the source that precopy initial data has been loaded.
> 
> In addition, two new SaveVMHandlers handlers are added:
> 1. is_initial_data_active which indicates whether precopy initial data
>    is used for this migration user (i.e., SaveStateEntry).
> 2. initial_data_loaded which indicates whether precopy initial data has
>    been loaded by this migration user.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  include/migration/register.h |  7 ++++++
>  migration/migration.h        | 12 +++++++++++
>  migration/migration.c        | 41 ++++++++++++++++++++++++++++++++++--
>  migration/savevm.c           | 39 ++++++++++++++++++++++++++++++++++
>  migration/trace-events       |  2 ++
>  5 files changed, 99 insertions(+), 2 deletions(-)
> 
> diff --git a/include/migration/register.h b/include/migration/register.h
> index 0a73f3883e..297bbe9f73 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -77,6 +77,13 @@ typedef struct SaveVMHandlers {
>       * true if it's supported or false otherwise. Called both in src and dest.
>       */
>      bool (*initial_data_advise)(void *opaque);
> +    /*
> +     * Checks if precopy initial data is active. If it's inactive,
> +     * initial_data_loaded check is skipped.
> +     */
> +    bool (*is_initial_data_active)(void *opaque);
> +    /* Checks if precopy initial data has been loaded in dest */
> +    bool (*initial_data_loaded)(void *opaque);
>  } SaveVMHandlers;
>  
>  int register_savevm_live(const char *idstr,
> diff --git a/migration/migration.h b/migration/migration.h
> index 4f615e9dbc..d865c23d87 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -207,6 +207,11 @@ struct MigrationIncomingState {
>  
>      /* Indicates whether precopy initial data was enabled and should be used */
>      bool initial_data_enabled;
> +    /*
> +     * Indicates whether an ACK that precopy initial data was loaded has been
> +     * sent to source.
> +     */
> +    bool initial_data_loaded_ack_sent;
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> @@ -435,6 +440,12 @@ struct MigrationState {
>  
>      /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
>      JSONWriter *vmdesc;
> +
> +    /*
> +     * Indicates whether an ACK that precopy initial data was loaded in
> +     * destination has been received.
> +     */
> +    bool initial_data_loaded_acked;
>  };
>  
>  void migrate_set_state(int *state, int old_state, int new_state);
> @@ -475,6 +486,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>  void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
>                                   char *block_name);
>  void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
> +int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis);
>  
>  void dirty_bitmap_mig_before_vm_start(void);
>  void dirty_bitmap_mig_cancel_outgoing(void);
> diff --git a/migration/migration.c b/migration/migration.c
> index 68cdf5b184..304cab2fa1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -77,6 +77,11 @@ enum mig_rp_message_type {
>      MIG_RP_MSG_RECV_BITMAP,  /* send recved_bitmap back to source */
>      MIG_RP_MSG_RESUME_ACK,   /* tell source that we are ready to resume */
>  
> +    MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, /*
> +                                         * Tell source precopy initial data is
> +                                         * loaded.
> +                                         */
> +
>      MIG_RP_MSG_MAX
>  };
>  
> @@ -756,6 +761,12 @@ bool migration_has_all_channels(void)
>      return true;
>  }
>  
> +int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis)
> +{
> +    return migrate_send_rp_message(mis, MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, 0,
> +                                   NULL);
> +}
> +
>  /*
>   * Send a 'SHUT' message on the return channel with the given value
>   * to indicate that we've finished with the RP.  Non-0 value indicates
> @@ -1401,6 +1412,8 @@ void migrate_init(MigrationState *s)
>      s->vm_was_running = false;
>      s->iteration_initial_bytes = 0;
>      s->threshold_size = 0;
> +
> +    s->initial_data_loaded_acked = false;
>  }
>  
>  int migrate_add_blocker_internal(Error *reason, Error **errp)
> @@ -1717,6 +1730,9 @@ static struct rp_cmd_args {
>      [MIG_RP_MSG_REQ_PAGES_ID]   = { .len = -1, .name = "REQ_PAGES_ID" },
>      [MIG_RP_MSG_RECV_BITMAP]    = { .len = -1, .name = "RECV_BITMAP" },
>      [MIG_RP_MSG_RESUME_ACK]     = { .len =  4, .name = "RESUME_ACK" },
> +    [MIG_RP_MSG_INITIAL_DATA_LOADED_ACK] = { .len = 0,
> +                                             .name =
> +                                                 "INITIAL_DATA_LOADED_ACK" },
>      [MIG_RP_MSG_MAX]            = { .len = -1, .name = "MAX" },
>  };
>  
> @@ -1955,6 +1971,11 @@ retry:
>              }
>              break;
>  
> +        case MIG_RP_MSG_INITIAL_DATA_LOADED_ACK:
> +            ms->initial_data_loaded_acked = true;
> +            trace_source_return_path_thread_initial_data_loaded_ack();
> +            break;
> +
>          default:
>              break;
>          }
> @@ -2704,6 +2725,20 @@ static void migration_update_counters(MigrationState *s,
>                                bandwidth, s->threshold_size);
>  }
>  
> +static bool initial_data_loaded_acked(MigrationState *s)
> +{
> +    if (!migrate_precopy_initial_data()) {
> +        return true;
> +    }
> +
> +    /* No reason to wait for precopy initial data loaded ACK if VM is stopped */
> +    if (!runstate_is_running()) {
> +        return true;
> +    }
> +
> +    return s->initial_data_loaded_acked;
> +}
> +
>  /* Migration thread iteration status */
>  typedef enum {
>      MIG_ITERATE_RESUME,         /* Resume current iteration */
> @@ -2719,6 +2754,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>  {
>      uint64_t must_precopy, can_postcopy;
>      bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
> +    bool initial_data_loaded = initial_data_loaded_acked(s);
>  
>      qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
>      uint64_t pending_size = must_precopy + can_postcopy;
> @@ -2731,7 +2767,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>          trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
>      }
>  
> -    if (!pending_size || pending_size < s->threshold_size) {
> +    if ((!pending_size || pending_size < s->threshold_size) &&
> +        initial_data_loaded) {
>          trace_migration_thread_low_pending(pending_size);
>          migration_completion(s);
>          return MIG_ITERATE_BREAK;
> @@ -2739,7 +2776,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>  
>      /* Still a significant amount to transfer */
>      if (!in_postcopy && must_precopy <= s->threshold_size &&
> -        qatomic_read(&s->start_postcopy)) {
> +        initial_data_loaded && qatomic_read(&s->start_postcopy)) {
>          if (postcopy_start(s)) {
>              error_report("%s: postcopy failed to start", __func__);
>          }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 2740defdf0..7a94deda3b 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2504,6 +2504,39 @@ static int loadvm_process_command(QEMUFile *f)
>      return 0;
>  }
>  
> +static int qemu_loadvm_initial_data_loaded_ack(MigrationIncomingState *mis)
> +{
> +    SaveStateEntry *se;
> +    int ret;
> +
> +    if (!mis->initial_data_enabled || mis->initial_data_loaded_ack_sent) {
> +        return 0;
> +    }
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (!se->ops || !se->ops->initial_data_loaded) {
> +            continue;
> +        }
> +
> +        if (!se->ops->is_initial_data_active ||
> +            !se->ops->is_initial_data_active(se->opaque)) {
> +            continue;
> +        }
> +
> +        if (!se->ops->initial_data_loaded(se->opaque)) {
> +            return 0;
> +        }
> +    }
> +
> +    ret = migrate_send_rp_initial_data_loaded_ack(mis);
> +    if (!ret) {
> +        mis->initial_data_loaded_ack_sent = true;
> +        trace_loadvm_initial_data_loaded_acked();
> +    }
> +
> +    return ret;
> +}
> +
>  /*
>   * Read a footer off the wire and check that it matches the expected section
>   *
> @@ -2826,6 +2859,12 @@ retry:
>              if (ret < 0) {
>                  goto out;
>              }
> +
> +            ret = qemu_loadvm_initial_data_loaded_ack(mis);
> +            if (ret < 0) {
> +                goto out;
> +            }

This is slightly hacky - it gets called for every END section.

Please consider something cleaner, e.g. the notification mechanism I
mentioned in my reply to the cover letter, so it's called only when the
device is ready for switchover (no matter what interface it'll use).

Thanks,

> +
>              break;
>          case QEMU_VM_COMMAND:
>              ret = loadvm_process_command(f);
> diff --git a/migration/trace-events b/migration/trace-events
> index 21ae471126..a0e1d3b2fd 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -24,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s:
>  loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
>  loadvm_process_command_ping(uint32_t val) "0x%x"
>  loadvm_handle_initial_data_enable(uint8_t general_enable, const char *idstr, int instance_id) "general_enable=%u, idstr=%s, instance_id=%u"
> +loadvm_initial_data_loaded_acked(void) ""
>  postcopy_ram_listen_thread_exit(void) ""
>  postcopy_ram_listen_thread_start(void) ""
>  qemu_savevm_send_postcopy_advise(void) ""
> @@ -182,6 +183,7 @@ source_return_path_thread_loop_top(void) ""
>  source_return_path_thread_pong(uint32_t val) "0x%x"
>  source_return_path_thread_shut(uint32_t val) "0x%x"
>  source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
> +source_return_path_thread_initial_data_loaded_ack(void) ""
>  migration_thread_low_pending(uint64_t pending) "%" PRIu64
>  migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
>  process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> -- 
> 2.26.3
> 
> 

-- 
Peter Xu




* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-02 22:49 ` [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Peter Xu
@ 2023-05-03 15:22   ` Avihai Horon
  2023-05-03 15:49     ` Peter Xu
  2023-05-10  9:12     ` Juan Quintela
  0 siblings, 2 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-03 15:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 03/05/2023 1:49, Peter Xu wrote:
>
> On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote:
>> Hello everyone,
> Hi, Avihai,
>
>> === Flow of operation ===
>>
>> To use precopy initial data, the capability must be enabled in the
>> source.
>>
>> As this capability must be supported also in the destination, a
>> handshake is performed during migration setup. The purpose of the
>> handshake is to notify the destination that precopy initial data is used
>> and to check if it's supported.
>>
>> The handshake is done in two levels. First, a general handshake is done
>> with the destination migration code to notify that precopy initial data
>> is used. Then, for each migration user in the source that supports
>> precopy initial data, a handshake is done with its counterpart in the
>> destination:
>> If both support it, precopy initial data will be used for them.
>> If source doesn't support it, precopy initial data will not be used for
>> them.
>> If source supports it and destination doesn't, migration will fail.
>>
>> Assuming the handshake succeeded, migration starts to send precopy data
>> and as part of it also the initial precopy data. Initial precopy data is
>> just like any other precopy data and as such, migration code is not
>> aware of it. Therefore, it's the responsibility of the migration users
>> (such as VFIO devices) to notify their counterparts in the destination
>> that their initial precopy data has been sent (for example, VFIO
>> migration does it when its initial bytes reach zero).
>>
>> In the destination, migration code will query each migration user that
>> supports precopy initial data and check if its initial data has been
>> loaded. If initial data has been loaded by all of them, an ACK will be
>> sent to the source which will now be able to complete migration when
>> appropriate.
> I can understand why this is useful; what I'm not 100% sure about is whether
> the complexity is needed.  The idea seems to be that src never switches over
> unless it receives a READY notification from dst.
>
> I'm imagining the simplified and more general workflow below; not sure whether
> it could work for you:
>
>    - Introduce a new cap "switchover-ready", it means whether there'll be a
>      ready event sent from dst -> src for "being ready for switchover"
>
>    - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and
>      handled on src showing that dest is ready for switchover. It'll be sent
>      only if dest is ready for the switchover
>
>    - Introduce a field SaveVMHandlers.explicit_switchover_needed.  For each
>      special device like vfio that would like to participate in the decision
>      making, device can set its explicit_switchover_needed=1.  This field is
>      ignored if the new cap is not set.
>
>    - Dst qemu: when new cap set, remember how many special devices are there
>      requesting explicit switchover (count of SaveVMHandlers that has the
>      bit set during load setup) as switch_over_pending=N.
>
>    - Dst qemu: Once a device thinks its fine to switchover (probably in the
>      load_state() callback), it calls migration_notify_switchover_ready().
>      That decreases switch_over_pending and when it hits zero, one msg
>      MIG_RP_MSG_SWITCHOVER_READY will be sent to src.
>
> Only until READY msg received on src could src switchover the precopy to
> dst.
>
> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
> more msg (dst->src).
>
> This is based on the fact that right now we always set caps on both qemus
> so I suppose it already means either both have or don't have the feature
> (even if one has, not setting the cap means disabled on both).
>
> Would it work for this case and cleaner?

Hi Peter, thanks for the response!
Your approach is indeed much simpler; however, I have a few concerns 
regarding compatibility.

You are saying that caps are always set both in src and dest.
But what happens if we set the cap only on one side?
Should we care about such scenarios?
For example, if we set the cap only in src, then src will wait 
indefinitely for dest to notify that switchover is ready.
Would you expect migration to fail instead of just running 
indefinitely?
In the current approach we only need to enable the cap in the source, 
so such a scenario can't happen.

Let's look at another scenario.
Src QEMU supports explicit-switchover for device X but *not* for device 
Y (i.e., src QEMU is some older version of QEMU that supports 
explicit-switchover for device X but not for Y).
Dest QEMU supports explicit-switchover for device X and device Y.
The capability is set in both src and dest.
In the destination we will have switchover_pending=2 because both X and 
Y support explicit-switchover.
We run the migration, but switchover_pending will never reach 0 because 
only X supports it in the source, so the migration will run indefinitely.
The per-device handshake solves this by making device Y not use 
explicit-switchover in this case.
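
To make the failure mode concrete, here is a rough standalone sketch of
the proposed counter scheme (all names are invented for illustration and
do not exist in QEMU), showing how switchover_pending gets stuck above
zero when the dest counts more explicit-switchover devices than the src
actually supports:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device descriptor, loosely mirroring SaveVMHandlers. */
typedef struct {
    const char *idstr;
    bool explicit_switchover_needed;
} FakeDevice;

static int switchover_pending;

/* Dst: count devices requesting explicit switchover during load setup. */
static void fake_load_setup(const FakeDevice *devs, int n)
{
    switchover_pending = 0;
    for (int i = 0; i < n; i++) {
        switchover_pending += devs[i].explicit_switchover_needed;
    }
}

/*
 * Dst: a device calls this once its initial data is loaded.  Returns
 * true when the SWITCHOVER_READY msg should be sent back to the src.
 */
static bool fake_notify_switchover_ready(void)
{
    return --switchover_pending == 0;
}

/*
 * Mixed-version scenario: the dst (newer) counts both X and Y, but the
 * src (older) only ever completes the initial data of X, so only X
 * notifies on the dst and the counter never reaches zero.
 */
static int simulate_mixed_versions(void)
{
    const FakeDevice dst_view[] = {
        { "vfio-X", true },
        { "vfio-Y", true },  /* src doesn't support Y: never notifies */
    };
    fake_load_setup(dst_view, 2);
    fake_notify_switchover_ready(); /* X's initial data loaded */
    return switchover_pending;      /* stuck at 1: READY is never sent */
}
```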

Thanks.




* Re: [PATCH 2/8] migration: Add precopy initial data handshake
  2023-05-02 22:54   ` Peter Xu
@ 2023-05-03 15:31     ` Avihai Horon
  0 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-03 15:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 03/05/2023 1:54, Peter Xu wrote:
> (I've left a high-level comment in the cover letter, but here are some
> quick comments I noticed while reading)
>
> On Mon, May 01, 2023 at 05:01:35PM +0300, Avihai Horon wrote:
>> Add precopy initial data handshake between source and destination upon
>> migration setup. The purpose of the handshake is to notify the
>> destination that precopy initial data is used and which migration users
>> (i.e., SaveStateEntry) are going to use it.
>>
>> The handshake is done in two levels. First, a general enable command is
>> sent to notify the destination migration code that precopy initial data
>> is used.
>> Then, for each migration user in the source that supports precopy
>> initial data, an enable command is sent to its counterpart in the
>> destination:
>> If both support it, precopy initial data will be used for them.
>> If source doesn't support it, precopy initial data will not be used for
>> them.
>> If source supports it and destination doesn't, migration will fail.
>>
>> To implement it, a new migration command MIG_CMD_INITIAL_DATA_ENABLE is
>> added, as well as a new SaveVMHandlers handler initial_data_advise.
>> Calling the handler advises the migration user that precopy initial data
>> is used and its return value indicates whether precopy initial data is
>> supported by it.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   include/migration/register.h |   6 +++
>>   migration/migration.h        |   3 ++
>>   migration/savevm.h           |   1 +
>>   migration/migration.c        |   4 ++
>>   migration/savevm.c           | 102 +++++++++++++++++++++++++++++++++++
>>   migration/trace-events       |   2 +
>>   6 files changed, 118 insertions(+)
>>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index a8dfd8fefd..0a73f3883e 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -71,6 +71,12 @@ typedef struct SaveVMHandlers {
>>       int (*load_cleanup)(void *opaque);
>>       /* Called when postcopy migration wants to resume from failure */
>>       int (*resume_prepare)(MigrationState *s, void *opaque);
>> +
>> +    /*
>> +     * Advises that precopy initial data was requested to be enabled. Returns
>> +     * true if it's supported or false otherwise. Called both in src and dest.
>> +     */
>> +    bool (*initial_data_advise)(void *opaque);
>>   } SaveVMHandlers;
>>
>>   int register_savevm_live(const char *idstr,
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 3a918514e7..4f615e9dbc 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -204,6 +204,9 @@ struct MigrationIncomingState {
>>        * contains valid information.
>>        */
>>       QemuMutex page_request_mutex;
>> +
>> +    /* Indicates whether precopy initial data was enabled and should be used */
>> +    bool initial_data_enabled;
>>   };
>>
>>   MigrationIncomingState *migration_incoming_get_current(void);
>> diff --git a/migration/savevm.h b/migration/savevm.h
>> index fb636735f0..d47ab4ad18 100644
>> --- a/migration/savevm.h
>> +++ b/migration/savevm.h
>> @@ -58,6 +58,7 @@ void qemu_savevm_send_postcopy_ram_discard(QEMUFile *f, const char *name,
>>                                              uint64_t *start_list,
>>                                              uint64_t *length_list);
>>   void qemu_savevm_send_colo_enable(QEMUFile *f);
>> +void qemu_savevm_send_initial_data_enable(MigrationState *ms, QEMUFile *f);
>>   void qemu_savevm_live_state(QEMUFile *f);
>>   int qemu_save_device_state(QEMUFile *f);
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index abcadbb619..68cdf5b184 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2964,6 +2964,10 @@ static void *migration_thread(void *opaque)
>>           qemu_savevm_send_colo_enable(s->to_dst_file);
>>       }
>>
>> +    if (migrate_precopy_initial_data()) {
>> +        qemu_savevm_send_initial_data_enable(s, s->to_dst_file);
>> +    }
>> +
>>       qemu_savevm_state_setup(s->to_dst_file);
>>
>>       qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index a9181b444b..2740defdf0 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -71,6 +71,13 @@
>>
>>   const unsigned int postcopy_ram_discard_version;
>>
>> +typedef struct {
>> +    uint8_t general_enable;
>> +    uint8_t reserved[7];
>> +    uint8_t idstr[256];
>> +    uint32_t instance_id;
>> +} InitialDataInfo;
>> +
>>   /* Subcommands for QEMU_VM_COMMAND */
>>   enum qemu_vm_cmd {
>>       MIG_CMD_INVALID = 0,   /* Must be 0 */
>> @@ -90,6 +97,8 @@ enum qemu_vm_cmd {
>>       MIG_CMD_ENABLE_COLO,       /* Enable COLO */
>>       MIG_CMD_POSTCOPY_RESUME,   /* resume postcopy on dest */
>>       MIG_CMD_RECV_BITMAP,       /* Request for recved bitmap on dst */
>> +
>> +    MIG_CMD_INITIAL_DATA_ENABLE, /* Enable precopy initial data in dest */
>>       MIG_CMD_MAX
>>   };
>>
>> @@ -109,6 +118,8 @@ static struct mig_cmd_args {
>>       [MIG_CMD_POSTCOPY_RESUME]  = { .len =  0, .name = "POSTCOPY_RESUME" },
>>       [MIG_CMD_PACKAGED]         = { .len =  4, .name = "PACKAGED" },
>>       [MIG_CMD_RECV_BITMAP]      = { .len = -1, .name = "RECV_BITMAP" },
>> +    [MIG_CMD_INITIAL_DATA_ENABLE] = { .len = sizeof(InitialDataInfo),
>> +                                      .name = "INITIAL_DATA_ENABLE" },
>>       [MIG_CMD_MAX]              = { .len = -1, .name = "MAX" },
>>   };
>>
>> @@ -1036,6 +1047,40 @@ static void qemu_savevm_command_send(QEMUFile *f,
>>       qemu_fflush(f);
>>   }
>>
>> +void qemu_savevm_send_initial_data_enable(MigrationState *ms, QEMUFile *f)
>> +{
>> +    SaveStateEntry *se;
>> +    InitialDataInfo buf;
>> +
>> +    /* Enable precopy initial data generally in the migration */
>> +    memset(&buf, 0, sizeof(buf));
>> +    buf.general_enable = 1;
>> +    qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
>> +                             (uint8_t *)&buf);
>> +    trace_savevm_send_initial_data_enable(buf.general_enable, (char *)buf.idstr,
>> +                                          buf.instance_id);
> Maybe a generalized helper would be nice here to send one
> MIG_CMD_INITIAL_DATA_ENABLE packet, then it can be used below too.

Sure, I will do that.

>
> Here instance_id is multi-byte, so we may want to consider endianness.  It
> seems the common way is to use big endian over the wire for qemu migration
> msgs.

Oh, right. I will fix it.
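
For reference, multi-byte fields on the migration wire are usually put
in big endian explicitly (QEMU has qemu_put_be32()/qemu_get_be32() for
this).  A minimal standalone illustration of the round trip, with
made-up helper names standing in for the QEMUFile API:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for qemu_put_be32(): host order -> big-endian wire bytes. */
static void put_be32(uint8_t *buf, uint32_t v)
{
    buf[0] = v >> 24;
    buf[1] = v >> 16;
    buf[2] = v >> 8;
    buf[3] = v;
}

/* Stand-in for qemu_get_be32(): big-endian wire bytes -> host order. */
static uint32_t get_be32(const uint8_t *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) | buf[3];
}

/* Round-trip an instance_id as it would cross the wire. */
static uint32_t roundtrip_instance_id(uint32_t instance_id)
{
    uint8_t wire[4];
    put_be32(wire, instance_id);
    return get_be32(wire);
}

/* The most significant byte goes first, whatever the host endianness. */
static uint8_t first_wire_byte(uint32_t instance_id)
{
    uint8_t wire[4];
    put_be32(wire, instance_id);
    return wire[0];
}
```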

Thanks.

>> +
>> +    /* Enable precopy initial data for each migration user that supports it */
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +        if (!se->ops || !se->ops->initial_data_advise) {
>> +            continue;
>> +        }
>> +
>> +        if (!se->ops->initial_data_advise(se->opaque)) {
>> +            continue;
>> +        }
>> +
>> +        memset(&buf, 0, sizeof(buf));
>> +        memcpy(buf.idstr, se->idstr, sizeof(buf.idstr));
>> +        buf.instance_id = se->instance_id;
>> +
>> +        qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
>> +                                 (uint8_t *)&buf);
>> +        trace_savevm_send_initial_data_enable(
>> +            buf.general_enable, (char *)buf.idstr, buf.instance_id);
>> +    }
>> +}
>> +
>>   void qemu_savevm_send_colo_enable(QEMUFile *f)
>>   {
>>       trace_savevm_send_colo_enable();
>> @@ -2314,6 +2359,60 @@ static int loadvm_process_enable_colo(MigrationIncomingState *mis)
>>       return ret;
>>   }
>>
>> +static int loadvm_handle_initial_data_enable(MigrationIncomingState *mis)
>> +{
>> +    InitialDataInfo buf;
>> +    SaveStateEntry *se;
>> +    ssize_t read_size;
>> +
>> +    read_size = qemu_get_buffer(mis->from_src_file, (void *)&buf, sizeof(buf));
>> +    if (read_size != sizeof(buf)) {
>> +        error_report("%s: Could not get data buffer, read_size %ld, len %lu",
>> +                     __func__, read_size, sizeof(buf));
>> +
>> +        return -EIO;
>> +    }
>> +
>> +    /* Enable precopy initial data generally in the migration */
>> +    if (buf.general_enable) {
>> +        mis->initial_data_enabled = true;
>> +        trace_loadvm_handle_initial_data_enable(
>> +            buf.general_enable, (char *)buf.idstr, buf.instance_id);
>> +
>> +        return 0;
>> +    }
>> +
>> +    /* Enable precopy initial data for a specific migration user */
>> +    se = find_se((char *)buf.idstr, buf.instance_id);
>> +    if (!se) {
>> +        error_report("%s: Could not find SaveStateEntry, idstr '%s', "
>> +                     "instance_id %" PRIu32,
>> +                     __func__, buf.idstr, buf.instance_id);
>> +
>> +        return -ENOENT;
>> +    }
>> +
>> +    if (!se->ops || !se->ops->initial_data_advise) {
>> +        error_report("%s: '%s' doesn't have required "
>> +                     "initial_data_advise op",
>> +                     __func__, buf.idstr);
>> +
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    if (!se->ops->initial_data_advise(se->opaque)) {
>> +        error_report("%s: '%s' doesn't support precopy initial data", __func__,
>> +                     buf.idstr);
>> +
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    trace_loadvm_handle_initial_data_enable(buf.general_enable,
>> +                                            (char *)buf.idstr, buf.instance_id);
>> +
>> +    return 0;
>> +}
>> +
>>   /*
>>    * Process an incoming 'QEMU_VM_COMMAND'
>>    * 0           just a normal return
>> @@ -2397,6 +2496,9 @@ static int loadvm_process_command(QEMUFile *f)
>>
>>       case MIG_CMD_ENABLE_COLO:
>>           return loadvm_process_enable_colo(mis);
>> +
>> +    case MIG_CMD_INITIAL_DATA_ENABLE:
>> +        return loadvm_handle_initial_data_enable(mis);
>>       }
>>
>>       return 0;
>> diff --git a/migration/trace-events b/migration/trace-events
>> index 92161eeac5..21ae471126 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -23,6 +23,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
>>   loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
>>   loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
>>   loadvm_process_command_ping(uint32_t val) "0x%x"
>> +loadvm_handle_initial_data_enable(uint8_t general_enable, const char *idstr, int instance_id) "general_enable=%u, idstr=%s, instance_id=%u"
>>   postcopy_ram_listen_thread_exit(void) ""
>>   postcopy_ram_listen_thread_start(void) ""
>>   qemu_savevm_send_postcopy_advise(void) ""
>> @@ -38,6 +39,7 @@ savevm_send_postcopy_run(void) ""
>>   savevm_send_postcopy_resume(void) ""
>>   savevm_send_colo_enable(void) ""
>>   savevm_send_recv_bitmap(char *name) "%s"
>> +savevm_send_initial_data_enable(uint8_t general_enable, const char *idstr, int instance_id) "general_enable=%u, idstr=%s, instance_id=%u"
>>   savevm_state_setup(void) ""
>>   savevm_state_resume_prepare(void) ""
>>   savevm_state_header(void) ""
>> --
>> 2.26.3
>>
> --
> Peter Xu
>



* Re: [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality
  2023-05-02 22:56   ` Peter Xu
@ 2023-05-03 15:36     ` Avihai Horon
  0 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-03 15:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 03/05/2023 1:56, Peter Xu wrote:
> On Mon, May 01, 2023 at 05:01:36PM +0300, Avihai Horon wrote:
>> Add the core functionality of precopy initial data, which allows the
>> destination to ACK that initial data has been loaded and the source to
>> wait for this ACK before completing the migration.
>>
>> A new return path command MIG_RP_MSG_INITIAL_DATA_LOADED_ACK is added.
>> It is sent by the destination after precopy initial data is loaded to
>> ACK to the source that precopy initial data has been loaded.
>>
>> In addition, two new SaveVMHandlers handlers are added:
>> 1. is_initial_data_active which indicates whether precopy initial data
>>     is used for this migration user (i.e., SaveStateEntry).
>> 2. initial_data_loaded which indicates whether precopy initial data has
>>     been loaded by this migration user.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   include/migration/register.h |  7 ++++++
>>   migration/migration.h        | 12 +++++++++++
>>   migration/migration.c        | 41 ++++++++++++++++++++++++++++++++++--
>>   migration/savevm.c           | 39 ++++++++++++++++++++++++++++++++++
>>   migration/trace-events       |  2 ++
>>   5 files changed, 99 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index 0a73f3883e..297bbe9f73 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -77,6 +77,13 @@ typedef struct SaveVMHandlers {
>>        * true if it's supported or false otherwise. Called both in src and dest.
>>        */
>>       bool (*initial_data_advise)(void *opaque);
>> +    /*
>> +     * Checks if precopy initial data is active. If it's inactive,
>> +     * initial_data_loaded check is skipped.
>> +     */
>> +    bool (*is_initial_data_active)(void *opaque);
>> +    /* Checks if precopy initial data has been loaded in dest */
>> +    bool (*initial_data_loaded)(void *opaque);
>>   } SaveVMHandlers;
>>
>>   int register_savevm_live(const char *idstr,
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 4f615e9dbc..d865c23d87 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -207,6 +207,11 @@ struct MigrationIncomingState {
>>
>>       /* Indicates whether precopy initial data was enabled and should be used */
>>       bool initial_data_enabled;
>> +    /*
>> +     * Indicates whether an ACK that precopy initial data was loaded has been
>> +     * sent to source.
>> +     */
>> +    bool initial_data_loaded_ack_sent;
>>   };
>>
>>   MigrationIncomingState *migration_incoming_get_current(void);
>> @@ -435,6 +440,12 @@ struct MigrationState {
>>
>>       /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
>>       JSONWriter *vmdesc;
>> +
>> +    /*
>> +     * Indicates whether an ACK that precopy initial data was loaded in
>> +     * destination has been received.
>> +     */
>> +    bool initial_data_loaded_acked;
>>   };
>>
>>   void migrate_set_state(int *state, int old_state, int new_state);
>> @@ -475,6 +486,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>>   void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
>>                                    char *block_name);
>>   void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
>> +int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis);
>>
>>   void dirty_bitmap_mig_before_vm_start(void);
>>   void dirty_bitmap_mig_cancel_outgoing(void);
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 68cdf5b184..304cab2fa1 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -77,6 +77,11 @@ enum mig_rp_message_type {
>>       MIG_RP_MSG_RECV_BITMAP,  /* send recved_bitmap back to source */
>>       MIG_RP_MSG_RESUME_ACK,   /* tell source that we are ready to resume */
>>
>> +    MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, /*
>> +                                         * Tell source precopy initial data is
>> +                                         * loaded.
>> +                                         */
>> +
>>       MIG_RP_MSG_MAX
>>   };
>>
>> @@ -756,6 +761,12 @@ bool migration_has_all_channels(void)
>>       return true;
>>   }
>>
>> +int migrate_send_rp_initial_data_loaded_ack(MigrationIncomingState *mis)
>> +{
>> +    return migrate_send_rp_message(mis, MIG_RP_MSG_INITIAL_DATA_LOADED_ACK, 0,
>> +                                   NULL);
>> +}
>> +
>>   /*
>>    * Send a 'SHUT' message on the return channel with the given value
>>    * to indicate that we've finished with the RP.  Non-0 value indicates
>> @@ -1401,6 +1412,8 @@ void migrate_init(MigrationState *s)
>>       s->vm_was_running = false;
>>       s->iteration_initial_bytes = 0;
>>       s->threshold_size = 0;
>> +
>> +    s->initial_data_loaded_acked = false;
>>   }
>>
>>   int migrate_add_blocker_internal(Error *reason, Error **errp)
>> @@ -1717,6 +1730,9 @@ static struct rp_cmd_args {
>>       [MIG_RP_MSG_REQ_PAGES_ID]   = { .len = -1, .name = "REQ_PAGES_ID" },
>>       [MIG_RP_MSG_RECV_BITMAP]    = { .len = -1, .name = "RECV_BITMAP" },
>>       [MIG_RP_MSG_RESUME_ACK]     = { .len =  4, .name = "RESUME_ACK" },
>> +    [MIG_RP_MSG_INITIAL_DATA_LOADED_ACK] = { .len = 0,
>> +                                             .name =
>> +                                                 "INITIAL_DATA_LOADED_ACK" },
>>       [MIG_RP_MSG_MAX]            = { .len = -1, .name = "MAX" },
>>   };
>>
>> @@ -1955,6 +1971,11 @@ retry:
>>               }
>>               break;
>>
>> +        case MIG_RP_MSG_INITIAL_DATA_LOADED_ACK:
>> +            ms->initial_data_loaded_acked = true;
>> +            trace_source_return_path_thread_initial_data_loaded_ack();
>> +            break;
>> +
>>           default:
>>               break;
>>           }
>> @@ -2704,6 +2725,20 @@ static void migration_update_counters(MigrationState *s,
>>                                 bandwidth, s->threshold_size);
>>   }
>>
>> +static bool initial_data_loaded_acked(MigrationState *s)
>> +{
>> +    if (!migrate_precopy_initial_data()) {
>> +        return true;
>> +    }
>> +
>> +    /* No reason to wait for precopy initial data loaded ACK if VM is stopped */
>> +    if (!runstate_is_running()) {
>> +        return true;
>> +    }
>> +
>> +    return s->initial_data_loaded_acked;
>> +}
>> +
>>   /* Migration thread iteration status */
>>   typedef enum {
>>       MIG_ITERATE_RESUME,         /* Resume current iteration */
>> @@ -2719,6 +2754,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>   {
>>       uint64_t must_precopy, can_postcopy;
>>       bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
>> +    bool initial_data_loaded = initial_data_loaded_acked(s);
>>
>>       qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
>>       uint64_t pending_size = must_precopy + can_postcopy;
>> @@ -2731,7 +2767,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>           trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
>>       }
>>
>> -    if (!pending_size || pending_size < s->threshold_size) {
>> +    if ((!pending_size || pending_size < s->threshold_size) &&
>> +        initial_data_loaded) {
>>           trace_migration_thread_low_pending(pending_size);
>>           migration_completion(s);
>>           return MIG_ITERATE_BREAK;
>> @@ -2739,7 +2776,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>
>>       /* Still a significant amount to transfer */
>>       if (!in_postcopy && must_precopy <= s->threshold_size &&
>> -        qatomic_read(&s->start_postcopy)) {
>> +        initial_data_loaded && qatomic_read(&s->start_postcopy)) {
>>           if (postcopy_start(s)) {
>>               error_report("%s: postcopy failed to start", __func__);
>>           }
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 2740defdf0..7a94deda3b 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2504,6 +2504,39 @@ static int loadvm_process_command(QEMUFile *f)
>>       return 0;
>>   }
>>
>> +static int qemu_loadvm_initial_data_loaded_ack(MigrationIncomingState *mis)
>> +{
>> +    SaveStateEntry *se;
>> +    int ret;
>> +
>> +    if (!mis->initial_data_enabled || mis->initial_data_loaded_ack_sent) {
>> +        return 0;
>> +    }
>> +
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +        if (!se->ops || !se->ops->initial_data_loaded) {
>> +            continue;
>> +        }
>> +
>> +        if (!se->ops->is_initial_data_active ||
>> +            !se->ops->is_initial_data_active(se->opaque)) {
>> +            continue;
>> +        }
>> +
>> +        if (!se->ops->initial_data_loaded(se->opaque)) {
>> +            return 0;
>> +        }
>> +    }
>> +
>> +    ret = migrate_send_rp_initial_data_loaded_ack(mis);
>> +    if (!ret) {
>> +        mis->initial_data_loaded_ack_sent = true;
>> +        trace_loadvm_initial_data_loaded_acked();
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>   /*
>>    * Read a footer off the wire and check that it matches the expected section
>>    *
>> @@ -2826,6 +2859,12 @@ retry:
>>               if (ret < 0) {
>>                   goto out;
>>               }
>> +
>> +            ret = qemu_loadvm_initial_data_loaded_ack(mis);
>> +            if (ret < 0) {
>> +                goto out;
>> +            }
> This is slightly hacky - it gets called for every END section.
>
> Please consider something cleaner, e.g., feel free to consider a
> notification mechanism I mentioned in my reply to the cover letter, so it's
> called only if the device is ready for switchover (no matter what interface
> it'll use).

Sure. Your notification suggestion sounds much better. I will use it.

Thanks.

>> +
>>               break;
>>           case QEMU_VM_COMMAND:
>>               ret = loadvm_process_command(f);
>> diff --git a/migration/trace-events b/migration/trace-events
>> index 21ae471126..a0e1d3b2fd 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -24,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s:
>>   loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
>>   loadvm_process_command_ping(uint32_t val) "0x%x"
>>   loadvm_handle_initial_data_enable(uint8_t general_enable, const char *idstr, int instance_id) "general_enable=%u, idstr=%s, instance_id=%u"
>> +loadvm_initial_data_loaded_acked(void) ""
>>   postcopy_ram_listen_thread_exit(void) ""
>>   postcopy_ram_listen_thread_start(void) ""
>>   qemu_savevm_send_postcopy_advise(void) ""
>> @@ -182,6 +183,7 @@ source_return_path_thread_loop_top(void) ""
>>   source_return_path_thread_pong(uint32_t val) "0x%x"
>>   source_return_path_thread_shut(uint32_t val) "0x%x"
>>   source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
>> +source_return_path_thread_initial_data_loaded_ack(void) ""
>>   migration_thread_low_pending(uint64_t pending) "%" PRIu64
>>   migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
>>   process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
>> --
>> 2.26.3
>>
>>
> --
> Peter Xu
>



* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-03 15:22   ` Avihai Horon
@ 2023-05-03 15:49     ` Peter Xu
  2023-05-04 10:18       ` Avihai Horon
  2023-05-10  9:12     ` Juan Quintela
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Xu @ 2023-05-03 15:49 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote:
> 
> On 03/05/2023 1:49, Peter Xu wrote:
> > On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote:
> > > Hello everyone,
> > Hi, Avihai,
> > 
> > > === Flow of operation ===
> > > 
> > > To use precopy initial data, the capability must be enabled in the
> > > source.
> > > 
> > > As this capability must be supported also in the destination, a
> > > handshake is performed during migration setup. The purpose of the
> > > handshake is to notify the destination that precopy initial data is used
> > > and to check if it's supported.
> > > 
> > > The handshake is done in two levels. First, a general handshake is done
> > > with the destination migration code to notify that precopy initial data
> > > is used. Then, for each migration user in the source that supports
> > > precopy initial data, a handshake is done with its counterpart in the
> > > destination:
> > > If both support it, precopy initial data will be used for them.
> > > If source doesn't support it, precopy initial data will not be used for
> > > them.
> > > If source supports it and destination doesn't, migration will fail.
> > > 
> > > Assuming the handshake succeeded, migration starts to send precopy data
> > > and as part of it also the initial precopy data. Initial precopy data is
> > > just like any other precopy data and as such, migration code is not
> > > aware of it. Therefore, it's the responsibility of the migration users
> > > (such as VFIO devices) to notify their counterparts in the destination
> > > that their initial precopy data has been sent (for example, VFIO
> > > migration does it when its initial bytes reach zero).
> > > 
> > > In the destination, migration code will query each migration user that
> > > supports precopy initial data and check if its initial data has been
> > > loaded. If initial data has been loaded by all of them, an ACK will be
> > > sent to the source which will now be able to complete migration when
> > > appropriate.
> > I can understand why this is useful, what I'm not 100% sure is whether the
> > complexity is needed.  The idea seems to be that src never switchover
> > unless it receives a READY notification from dst.
> > 
> > I'm imaging below simplified and more general workflow, not sure whether it
> > could work for you:
> > 
> >    - Introduce a new cap "switchover-ready", it means whether there'll be a
> >      ready event sent from dst -> src for "being ready for switchover"
> > 
> >    - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and
> >      handled on src showing that dest is ready for switchover. It'll be sent
> >      only if dest is ready for the switchover
> > 
> >    - Introduce a field SaveVMHandlers.explicit_switchover_needed.  For each
> >      special device like vfio that would like to participate in the decision
> >      making, device can set its explicit_switchover_needed=1.  This field is
> >      ignored if the new cap is not set.
> > 
> >    - Dst qemu: when new cap set, remember how many special devices are there
> >      requesting explicit switchover (count of SaveVMHandlers that has the
> >      bit set during load setup) as switch_over_pending=N.
> > 
> >    - Dst qemu: Once a device thinks its fine to switchover (probably in the
> >      load_state() callback), it calls migration_notify_switchover_ready().
> >      That decreases switch_over_pending and when it hits zero, one msg
> >      MIG_RP_MSG_SWITCHOVER_READY will be sent to src.
> > 
> > Only until READY msg received on src could src switchover the precopy to
> > dst.
> > 
> > Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
> > more msg (dst->src).
> > 
> > This is based on the fact that right now we always set caps on both qemus
> > so I suppose it already means either both have or don't have the feature
> > (even if one has, not setting the cap means disabled on both).
> > 
> > Would it work for this case and cleaner?
> 
> Hi Peter, thanks for the response!
> Your approach is indeed much simpler, however I have a few concerns
> regarding compatibility.
> 
> You are saying that caps are always set both in src and dest.
> But what happens if we set the cap only on one side?
> Should we care about these scenarios?

I think it's not needed for now, but I am aware that this is a problem.
It's just that this looks to me like a more generic problem rather than
one specific to the feature being proposed.  Daniel has expressed concern
a few times about keeping things this way and hoped we could have a
better handshake in general in the migration framework.

I'd be perfectly fine if you want to approach this with a handshake
methodology, but if so I hope we can provide a more generic handshake.
That could make this new feature depend on the handshake work and be
slower to get into shape.  Your call on how to address this; either way
is fine by me.

In my imagination a generic handshake should happen at the very start of
migration and negotiate feature bits between src/dst qemu, so they can
reach a consensus on what to do next.

> For example, if we set the cap only in src, then src will wait indefinitely
> for dest to notify that switchover is ready.
> Would you expect migration to fail instead of just keep running
> indefinitely?
> In current approach we only need to enable the cap in the source, so such
> scenario can't happen.
> 
> Let's look at some other scenario.
> Src QEMU supports explicit-switchover for device X but *not* for device Y
> (i.e., src QEMU is some older version of QEMU that supports
> explicit-switchover for device X but not for Y).
> Dest QEMU supports explicit-switchover for device X and device Y.
> The capability is set in both src and dest.
> In the destination we will have switchover_pending=2 because both X and Y
> support explicit-switchover.
> We do migration, but switchover_pending will never reach 0 because only X
> supports it in the source, so the migration will run indefinitely.
> The per-device handshake solves this by making device Y not use
> explicit-switchover in this case.

Hmm, right.  When I was replying I obviously thought that decision could be
made solely by the dest qemu, so I assumed it was fine.  Because IIUC in
that case, how many devices support switchover_pending on src qemu
doesn't really matter, only dest does.

But I re-read the last patch and I do see that there's a new bit that will
change the device protocol of migration:

  if (migration->initial_data_active && !migration->precopy_init_size &&
      !migration->initial_data_sent) {
      qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
      migration->initial_data_sent = true;
  } else {
      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
  }

With this, I think what you said makes sense because then the src qemu
matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it
also needs to make sure dst qemu will recognize it.

Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have?
Can this decision be made on dest qemu only?

To ask another way, I saw that precopy_init_size is the foundation for
deciding whether to send this flag.  So it's a matter of whether dest qemu
is also aware of precopy_init_size; if so, it can already tell when it's
ready to handle the switchover.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-03 15:49     ` Peter Xu
@ 2023-05-04 10:18       ` Avihai Horon
  2023-05-04 15:50         ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-05-04 10:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 03/05/2023 18:49, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote:
>> On 03/05/2023 1:49, Peter Xu wrote:
>>> On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote:
>>>> Hello everyone,
>>> Hi, Avihai,
>>>
>>>> === Flow of operation ===
>>>>
>>>> To use precopy initial data, the capability must be enabled in the
>>>> source.
>>>>
>>>> As this capability must be supported also in the destination, a
>>>> handshake is performed during migration setup. The purpose of the
>>>> handshake is to notify the destination that precopy initial data is used
>>>> and to check if it's supported.
>>>>
>>>> The handshake is done in two levels. First, a general handshake is done
>>>> with the destination migration code to notify that precopy initial data
>>>> is used. Then, for each migration user in the source that supports
>>>> precopy initial data, a handshake is done with its counterpart in the
>>>> destination:
>>>> If both support it, precopy initial data will be used for them.
>>>> If source doesn't support it, precopy initial data will not be used for
>>>> them.
>>>> If source supports it and destination doesn't, migration will be failed.
>>>>
>>>> Assuming the handshake succeeded, migration starts to send precopy data
>>>> and as part of it also the initial precopy data. Initial precopy data is
>>>> just like any other precopy data and as such, migration code is not
>>>> aware of it. Therefore, it's the responsibility of the migration users
>>>> (such as VFIO devices) to notify their counterparts in the destination
>>>> that their initial precopy data has been sent (for example, VFIO
>>>> migration does it when its initial bytes reach zero).
>>>>
>>>> In the destination, migration code will query each migration user that
>>>> supports precopy initial data and check if its initial data has been
>>>> loaded. If initial data has been loaded by all of them, an ACK will be
>>>> sent to the source which will now be able to complete migration when
>>>> appropriate.
>>> I can understand why this is useful, what I'm not 100% sure is whether the
>>> complexity is needed.  The idea seems to be that src never switchover
>>> unless it receives a READY notification from dst.
>>>
>>> I'm imaging below simplified and more general workflow, not sure whether it
>>> could work for you:
>>>
>>>     - Introduce a new cap "switchover-ready", it means whether there'll be a
>>>       ready event sent from dst -> src for "being ready for switchover"
>>>
>>>     - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and
>>>       handled on src showing that dest is ready for switchover. It'll be sent
>>>       only if dest is ready for the switchover
>>>
>>>     - Introduce a field SaveVMHandlers.explicit_switchover_needed.  For each
>>>       special device like vfio that would like to participate in the decision
>>>       making, device can set its explicit_switchover_needed=1.  This field is
>>>       ignored if the new cap is not set.
>>>
>>>     - Dst qemu: when new cap set, remember how many special devices are there
>>>       requesting explicit switchover (count of SaveVMHandlers that has the
>>>       bit set during load setup) as switch_over_pending=N.
>>>
>>>     - Dst qemu: Once a device thinks it's fine to switchover (probably in the
>>>       load_state() callback), it calls migration_notify_switchover_ready().
>>>       That decreases switch_over_pending and when it hits zero, one msg
>>>       MIG_RP_MSG_SWITCHOVER_READY will be sent to src.
>>>
>>> Only after the READY msg is received on src can src switch over the precopy
>>> to dst.
>>>
>>> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
>>> more msg (dst->src).
>>>
>>> This is based on the fact that right now we always set caps on both qemus
>>> so I suppose it already means either both have or don't have the feature
>>> (even if one has, not setting the cap means disabled on both).
>>>
>>> Would it work for this case and cleaner?
>> Hi Peter, thanks for the response!
>> Your approach is indeed much simpler, however I have a few concerns
>> regarding compatibility.
>>
>> You are saying that caps are always set both in src and dest.
>> But what happens if we set the cap only on one side?
>> Should we care about these scenarios?
> I think it's not needed for now, but I am aware that this is a problem.
> It's just that it is a more generic problem to me rather than one specific
> to the feature being proposed here.  At least a few times Daniel has shown
> concern about keeping things this way and hoped we could have a better
> handshake in general in the migration framework.
>
> I'd be perfectly fine if you want to approach this with a handshake
> methodology, but if so I hope we can provide a more generic handshake.
> That could make this new feature depend on the handshake work, though,
> and be slower to get into shape.  Your call on how to address this; either
> way is fine by me.

I'd really like this feature to get in, and I'm afraid making it 
dependent on first implementing a general migration handshake may take a 
long time, like you said.
What about keeping the current approach but changing it such that the
capability will have to be set in both src and dest, to make it similar 
to other capability usages?
I.e., we will remove the "general" handshake:

     /* Enable precopy initial data generally in the migration */
     memset(&buf, 0, sizeof(buf));
     buf.general_enable = 1;
     qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
                              (uint8_t *)&buf);

but keep the per-device handshake, which is not a handshake for 
migration capabilities, but a part of the protocol when the capability 
is set, like in multifd, postcopy, etc.
This way we can advance with this feature while making the general 
migration handshake an independent effort.
Will that work for you?

BTW, with your suggestion to add a notification mechanism to notify when 
initial data is loaded in dest, I think we can drop these two 
SaveVMHandlers handlers:
     /*
      * Checks if precopy initial data is active. If it's inactive,
      * initial_data_loaded check is skipped.
      */
     bool (*is_initial_data_active)(void *opaque);
     /* Checks if precopy initial data has been loaded in dest */
     bool (*initial_data_loaded)(void *opaque);

> In my imagination a generic handshake should happen at the very start of
> migration and negotiate feature bits between src/dst qemu, so they can
> reach a consensus on what to do next.
>
>> For example, if we set the cap only in src, then src will wait indefinitely
>> for dest to notify that switchover is ready.
>> Would you expect migration to fail instead of just keep running
>> indefinitely?
>> In current approach we only need to enable the cap in the source, so such
>> scenario can't happen.
>>
>> Let's look at some other scenario.
>> Src QEMU supports explicit-switchover for device X but *not* for device Y
>> (i.e., src QEMU is some older version of QEMU that supports
>> explicit-switchover for device X but not for Y).
>> Dest QEMU supports explicit-switchover for device X and device Y.
>> The capability is set in both src and dest.
>> In the destination we will have switchover_pending=2 because both X and Y
>> support explicit-switchover.
>> We do migration, but switchover_pending will never reach 0 because only X
>> supports it in the source, so the migration will run indefinitely.
>> The per-device handshake solves this by making device Y not use
>> explicit-switchover in this case.
> Hmm, right.  When I was replying I obviously thought that decision could be
> made solely by the dest qemu, so I assumed it was fine.  Because IIUC in
> that case, how many devices support switchover_pending on src qemu
> doesn't really matter, only dest does.
>
> But I re-read the last patch and I do see that there's a new bit that will
> change the device protocol of migration:
>
>    if (migration->initial_data_active && !migration->precopy_init_size &&
>        !migration->initial_data_sent) {
>        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
>        migration->initial_data_sent = true;
>    } else {
>        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>    }
>
> With this, I think what you said makes sense because then the src qemu
> matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it
> also needs to make sure dst qemu will recognize it.
>
> Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have?
> Can this decision be made on dest qemu only?
>
> To ask another way, I saw that precopy_init_size is the foundation for
> deciding whether to send this flag.  So it's a matter of whether dest qemu
> is also aware of precopy_init_size; if so, it can already tell when it's
> ready to handle the switchover.

The destination is not aware of precopy_init_size, only the source knows it.
So the source must send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to notify dest 
that the initial data was sent.

Thanks.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-04 10:18       ` Avihai Horon
@ 2023-05-04 15:50         ` Peter Xu
  2023-05-07 12:54           ` Avihai Horon
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2023-05-04 15:50 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Thu, May 04, 2023 at 01:18:04PM +0300, Avihai Horon wrote:
> 
> On 03/05/2023 18:49, Peter Xu wrote:
> > On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote:
> > > On 03/05/2023 1:49, Peter Xu wrote:
> > > > On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote:
> > > > > Hello everyone,
> > > > Hi, Avihai,
> > > > 
> > > > > === Flow of operation ===
> > > > > 
> > > > > To use precopy initial data, the capability must be enabled in the
> > > > > source.
> > > > > 
> > > > > As this capability must be supported also in the destination, a
> > > > > handshake is performed during migration setup. The purpose of the
> > > > > handshake is to notify the destination that precopy initial data is used
> > > > > and to check if it's supported.
> > > > > 
> > > > > The handshake is done in two levels. First, a general handshake is done
> > > > > with the destination migration code to notify that precopy initial data
> > > > > is used. Then, for each migration user in the source that supports
> > > > > precopy initial data, a handshake is done with its counterpart in the
> > > > > destination:
> > > > > If both support it, precopy initial data will be used for them.
> > > > > If source doesn't support it, precopy initial data will not be used for
> > > > > them.
> > > > > If source supports it and destination doesn't, migration will be failed.
> > > > > 
> > > > > Assuming the handshake succeeded, migration starts to send precopy data
> > > > > and as part of it also the initial precopy data. Initial precopy data is
> > > > > just like any other precopy data and as such, migration code is not
> > > > > aware of it. Therefore, it's the responsibility of the migration users
> > > > > (such as VFIO devices) to notify their counterparts in the destination
> > > > > that their initial precopy data has been sent (for example, VFIO
> > > > > migration does it when its initial bytes reach zero).
> > > > > 
> > > > > In the destination, migration code will query each migration user that
> > > > > supports precopy initial data and check if its initial data has been
> > > > > loaded. If initial data has been loaded by all of them, an ACK will be
> > > > > sent to the source which will now be able to complete migration when
> > > > > appropriate.
> > > > I can understand why this is useful, what I'm not 100% sure is whether the
> > > > complexity is needed.  The idea seems to be that src never switchover
> > > > unless it receives a READY notification from dst.
> > > > 
> > > > I'm imaging below simplified and more general workflow, not sure whether it
> > > > could work for you:
> > > > 
> > > >     - Introduce a new cap "switchover-ready", it means whether there'll be a
> > > >       ready event sent from dst -> src for "being ready for switchover"
> > > > 
> > > >     - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and
> > > >       handled on src showing that dest is ready for switchover. It'll be sent
> > > >       only if dest is ready for the switchover
> > > > 
> > > >     - Introduce a field SaveVMHandlers.explicit_switchover_needed.  For each
> > > >       special device like vfio that would like to participate in the decision
> > > >       making, device can set its explicit_switchover_needed=1.  This field is
> > > >       ignored if the new cap is not set.
> > > > 
> > > >     - Dst qemu: when new cap set, remember how many special devices are there
> > > >       requesting explicit switchover (count of SaveVMHandlers that has the
> > > >       bit set during load setup) as switch_over_pending=N.
> > > > 
> > > >     - Dst qemu: Once a device thinks it's fine to switchover (probably in the
> > > >       load_state() callback), it calls migration_notify_switchover_ready().
> > > >       That decreases switch_over_pending and when it hits zero, one msg
> > > >       MIG_RP_MSG_SWITCHOVER_READY will be sent to src.
> > > > 
> > > > Only after the READY msg is received on src can src switch over the precopy
> > > > to dst.
> > > > 
> > > > Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
> > > > more msg (dst->src).
> > > > 
> > > > This is based on the fact that right now we always set caps on both qemus
> > > > so I suppose it already means either both have or don't have the feature
> > > > (even if one has, not setting the cap means disabled on both).
> > > > 
> > > > Would it work for this case and cleaner?
> > > Hi Peter, thanks for the response!
> > > Your approach is indeed much simpler, however I have a few concerns
> > > regarding compatibility.
> > > 
> > > You are saying that caps are always set both in src and dest.
> > > But what happens if we set the cap only on one side?
> > > Should we care about these scenarios?
> > I think it's not needed for now, but I am aware that this is a problem.
> > It's just that it is a more generic problem to me rather than one specific
> > to the feature being proposed here.  At least a few times Daniel has shown
> > concern about keeping things this way and hoped we could have a better
> > handshake in general in the migration framework.
> > 
> > I'd be perfectly fine if you want to approach this with a handshake
> > methodology, but if so I hope we can provide a more generic handshake.
> > That could make this new feature depend on the handshake work, though,
> > and be slower to get into shape.  Your call on how to address this; either
> > way is fine by me.
> 
> I'd really like this feature to get in, and I'm afraid making it dependent
> on first implementing a general migration handshake may take a long time,
> like you said.
> What about keeping the current approach but changing it such that the capability
> will have to be set in both src and dest, to make it similar to other
> capability usages?
> I.e., we will remove the "general" handshake:
> 
>     /* Enable precopy initial data generally in the migration */
>     memset(&buf, 0, sizeof(buf));
>     buf.general_enable = 1;
>     qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
>                              (uint8_t *)&buf);
> 
> but keep the per-device handshake, which is not a handshake for migration
> capabilities, but a part of the protocol when the capability is set, like in
> multifd, postcopy, etc.
> This way we can advance with this feature while making the general migration
> handshake an independent effort.
> Will that work for you?

Yes it's fine by me.

> 
> BTW, with your suggestion to add a notification mechanism to notify when
> initial data is loaded in dest, I think we can drop these two SaveVMHandlers
> handlers:
>     /*
>      * Checks if precopy initial data is active. If it's inactive,
>      * initial_data_loaded check is skipped.
>      */
>     bool (*is_initial_data_active)(void *opaque);
>     /* Checks if precopy initial data has been loaded in dest */
>     bool (*initial_data_loaded)(void *opaque);
> 
> > In my imagination a generic handshake should happen at the very start of
> > migration and negotiate feature bits between src/dst qemu, so they can
> > reach a consensus on what to do next.
> > 
> > > For example, if we set the cap only in src, then src will wait indefinitely
> > > for dest to notify that switchover is ready.
> > > Would you expect migration to fail instead of just keep running
> > > indefinitely?
> > > In current approach we only need to enable the cap in the source, so such
> > > scenario can't happen.
> > > 
> > > Let's look at some other scenario.
> > > Src QEMU supports explicit-switchover for device X but *not* for device Y
> > > (i.e., src QEMU is some older version of QEMU that supports
> > > explicit-switchover for device X but not for Y).
> > > Dest QEMU supports explicit-switchover for device X and device Y.
> > > The capability is set in both src and dest.
> > > In the destination we will have switchover_pending=2 because both X and Y
> > > support explicit-switchover.
> > > We do migration, but switchover_pending will never reach 0 because only X
> > > supports it in the source, so the migration will run indefinitely.
> > > The per-device handshake solves this by making device Y not use
> > > explicit-switchover in this case.
> > Hmm, right.  When I was replying I obviously thought that decision could be
> > made solely by the dest qemu, so I assumed it was fine.  Because IIUC in
> > that case, how many devices support switchover_pending on src qemu
> > doesn't really matter, only dest does.
> > 
> > But I re-read the last patch and I do see that there's a new bit that will
> > change the device protocol of migration:
> > 
> >    if (migration->initial_data_active && !migration->precopy_init_size &&
> >        !migration->initial_data_sent) {
> >        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
> >        migration->initial_data_sent = true;
> >    } else {
> >        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> >    }
> > 
> > With this, I think what you said makes sense because then the src qemu
> > matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it
> > also needs to make sure dst qemu will recognize it.
> > 
> > Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have?
> > Can this decision be made on dest qemu only?
> > 
> > To ask another way, I saw that precopy_init_size is the foundation for
> > deciding whether to send this flag.  So it's a matter of whether dest qemu
> > is also aware of precopy_init_size; if so, it can already tell when it's
> > ready to handle the switchover.
> 
> The destination is not aware of precopy_init_size, only the source knows it.
> So the source must send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to notify dest that
> the initial data was sent.

Then, can the src qemu notify instead?

We can have a similar notification mechanism on src qemu, and if that works
we can further save the other MIG_RP_MSG.  The counter counts how many
special src devices there are, and we don't switch over until all agree.

I know that even if !precopy_init_size on src, it doesn't mean that dest
has already digested all the data in the send buffer.  However, since we'll
make sure the queued data has landed before switchover happens anyway (e.g.,
when we only have 1 migration channel, data is sent sequentially), it
means that by switchover the dst qemu should have it loaded?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-04 15:50         ` Peter Xu
@ 2023-05-07 12:54           ` Avihai Horon
  2023-05-08  0:49             ` Peter Xu
  0 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-05-07 12:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 04/05/2023 18:50, Peter Xu wrote:
> On Thu, May 04, 2023 at 01:18:04PM +0300, Avihai Horon wrote:
>> On 03/05/2023 18:49, Peter Xu wrote:
>>> On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote:
>>>> On 03/05/2023 1:49, Peter Xu wrote:
>>>>> On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote:
>>>>>> Hello everyone,
>>>>> Hi, Avihai,
>>>>>
>>>>>> === Flow of operation ===
>>>>>>
>>>>>> To use precopy initial data, the capability must be enabled in the
>>>>>> source.
>>>>>>
>>>>>> As this capability must be supported also in the destination, a
>>>>>> handshake is performed during migration setup. The purpose of the
>>>>>> handshake is to notify the destination that precopy initial data is used
>>>>>> and to check if it's supported.
>>>>>>
>>>>>> The handshake is done in two levels. First, a general handshake is done
>>>>>> with the destination migration code to notify that precopy initial data
>>>>>> is used. Then, for each migration user in the source that supports
>>>>>> precopy initial data, a handshake is done with its counterpart in the
>>>>>> destination:
>>>>>> If both support it, precopy initial data will be used for them.
>>>>>> If source doesn't support it, precopy initial data will not be used for
>>>>>> them.
>>>>>> If source supports it and destination doesn't, migration will be failed.
>>>>>>
>>>>>> Assuming the handshake succeeded, migration starts to send precopy data
>>>>>> and as part of it also the initial precopy data. Initial precopy data is
>>>>>> just like any other precopy data and as such, migration code is not
>>>>>> aware of it. Therefore, it's the responsibility of the migration users
>>>>>> (such as VFIO devices) to notify their counterparts in the destination
>>>>>> that their initial precopy data has been sent (for example, VFIO
>>>>>> migration does it when its initial bytes reach zero).
>>>>>>
>>>>>> In the destination, migration code will query each migration user that
>>>>>> supports precopy initial data and check if its initial data has been
>>>>>> loaded. If initial data has been loaded by all of them, an ACK will be
>>>>>> sent to the source which will now be able to complete migration when
>>>>>> appropriate.
>>>>> I can understand why this is useful, what I'm not 100% sure is whether the
>>>>> complexity is needed.  The idea seems to be that src never switchover
>>>>> unless it receives a READY notification from dst.
>>>>>
>>>>> I'm imaging below simplified and more general workflow, not sure whether it
>>>>> could work for you:
>>>>>
>>>>>      - Introduce a new cap "switchover-ready", it means whether there'll be a
>>>>>        ready event sent from dst -> src for "being ready for switchover"
>>>>>
>>>>>      - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and
>>>>>        handled on src showing that dest is ready for switchover. It'll be sent
>>>>>        only if dest is ready for the switchover
>>>>>
>>>>>      - Introduce a field SaveVMHandlers.explicit_switchover_needed.  For each
>>>>>        special device like vfio that would like to participate in the decision
>>>>>        making, device can set its explicit_switchover_needed=1.  This field is
>>>>>        ignored if the new cap is not set.
>>>>>
>>>>>      - Dst qemu: when new cap set, remember how many special devices are there
>>>>>        requesting explicit switchover (count of SaveVMHandlers that has the
>>>>>        bit set during load setup) as switch_over_pending=N.
>>>>>
>>>>>      - Dst qemu: Once a device thinks it's fine to switchover (probably in the
>>>>>        load_state() callback), it calls migration_notify_switchover_ready().
>>>>>        That decreases switch_over_pending and when it hits zero, one msg
>>>>>        MIG_RP_MSG_SWITCHOVER_READY will be sent to src.
>>>>>
>>>>> Only after the READY msg is received on src can src switch over the precopy
>>>>> to dst.
>>>>>
>>>>> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
>>>>> more msg (dst->src).
>>>>>
>>>>> This is based on the fact that right now we always set caps on both qemus
>>>>> so I suppose it already means either both have or don't have the feature
>>>>> (even if one has, not setting the cap means disabled on both).
>>>>>
>>>>> Would it work for this case and cleaner?
>>>> Hi Peter, thanks for the response!
>>>> Your approach is indeed much simpler, however I have a few concerns
>>>> regarding compatibility.
>>>>
>>>> You are saying that caps are always set both in src and dest.
>>>> But what happens if we set the cap only on one side?
>>>> Should we care about these scenarios?
>>> I think it's not needed for now, but I am aware that this is a problem.
>>> It's just that it is a more generic problem to me rather than one specific
>>> to the feature being proposed here.  At least a few times Daniel has shown
>>> concern about keeping things this way and hoped we could have a better
>>> handshake in general in the migration framework.
>>>
>>> I'd be perfectly fine if you want to approach this with a handshake
>>> methodology, but if so I hope we can provide a more generic handshake.
>>> That could make this new feature depend on the handshake work, though,
>>> and be slower to get into shape.  Your call on how to address this; either
>>> way is fine by me.
>> I'd really like this feature to get in, and I'm afraid making it dependent
>> on first implementing a general migration handshake may take a long time,
>> like you said.
>> What about keeping the current approach but changing it such that the capability
>> will have to be set in both src and dest, to make it similar to other
>> capability usages?
>> I.e., we will remove the "general" handshake:
>>
>>      /* Enable precopy initial data generally in the migration */
>>      memset(&buf, 0, sizeof(buf));
>>      buf.general_enable = 1;
>>      qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
>>                               (uint8_t *)&buf);
>>
>> but keep the per-device handshake, which is not a handshake for migration
>> capabilities, but a part of the protocol when the capability is set, like in
>> multifd, postcopy, etc.
>> This way we can advance with this feature while making the general migration
>> handshake an independent effort.
>> Will that work for you?
> Yes it's fine by me.
>
>> BTW, with your suggestion to add a notification mechanism to notify when
>> initial data is loaded in dest, I think we can drop these two SaveVMHandlers
>> handlers:
>>      /*
>>       * Checks if precopy initial data is active. If it's inactive,
>>       * initial_data_loaded check is skipped.
>>       */
>>      bool (*is_initial_data_active)(void *opaque);
>>      /* Checks if precopy initial data has been loaded in dest */
>>      bool (*initial_data_loaded)(void *opaque);
>>
>>> In my imagination a generic handshake should happen at the very start of
>>> migration and negotiate feature bits between src/dst qemu, so they can
>>> reach a consensus on what to do next.
>>>
>>>> For example, if we set the cap only in src, then src will wait indefinitely
>>>> for dest to notify that switchover is ready.
>>>> Would you expect migration to fail instead of just keep running
>>>> indefinitely?
>>>> In current approach we only need to enable the cap in the source, so such
>>>> scenario can't happen.
>>>>
>>>> Let's look at some other scenario.
>>>> Src QEMU supports explicit-switchover for device X but *not* for device Y
>>>> (i.e., src QEMU is some older version of QEMU that supports
>>>> explicit-switchover for device X but not for Y).
>>>> Dest QEMU supports explicit-switchover for device X and device Y.
>>>> The capability is set in both src and dest.
>>>> In the destination we will have switchover_pending=2 because both X and Y
>>>> support explicit-switchover.
>>>> We do migration, but switchover_pending will never reach 0 because only X
>>>> supports it in the source, so the migration will run indefinitely.
>>>> The per-device handshake solves this by making device Y not use
>>>> explicit-switchover in this case.
>>> Hmm, right.  When I was replying obviously I thought that decision can be
>>> made solely by the dest qemu, then I assumed it's fine.  Because IIUC in
>>> that case how many devices that supports switchover_pending on src qemu
>>> doesn't really matter but only dest.
>>>
>>> But I re-read the last patch and I do see that there's a new bit that will
>>> change the device protocol of migration:
>>>
>>>     if (migration->initial_data_active && !migration->precopy_init_size &&
>>>         !migration->initial_data_sent) {
>>>         qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
>>>         migration->initial_data_sent = true;
>>>     } else {
>>>         qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>     }
>>>
>>> With this, I think what you said makes sense because then the src qemu
>>> matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it
>>> also needs to make sure dst qemu will recognize it.
>>>
>>> Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have?
>>> Can this decision be made on dest qemu only?
>>>
>>> To ask in another way, I saw that precopy_init_size is the foundation to
>>> decide whether to send this flag.  Then it's a matter of whether dest qemu
>>> is also aware of precopy_init_size, then it can already tell when it's
>>> ready to handle the switchover.
>> The destination is not aware of precopy_init_size, only the source knows it.
>> So the source must send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to notify dest that
>> the initial data was sent.
> Then, can the src qemu notify instead?
>
> We can have similar notification mechanism on src qemu and if that can work
> we can further same the other MIG_RP_MSG.  The counter counts how many
> special src devices are there and we don't switchover unless all agree.

Do you mean the following:
We will have the pending counter and notification mechanism in the 
source instead of the destination.
The MIG_RP_MSG_INITIAL_DATA_LOADED_ACK message will be sent by each 
device in the destination that has loaded its initial data.
Each such RP_MSG will decrease the pending counter in the source.
When the pending counter in the source reaches zero, the source can 
complete the migration.

Or did I misunderstand you?
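
To make the flow concrete, here is a rough standalone sketch of the 
source-side counter (purely illustrative -- none of these names exist 
in QEMU, and MIG_RP_MSG_INITIAL_DATA_LOADED_ACK is only the proposed 
message name):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model of the flow described above, not actual QEMU
 * code. The source counts the devices that use precopy initial data;
 * each ACK arriving on the return path decrements the counter;
 * switchover is allowed once it reaches zero.
 */
typedef struct {
    int switchover_pending; /* devices whose initial data isn't loaded yet */
} MigSrcState;

void mig_src_setup(MigSrcState *s, int num_initial_data_devices)
{
    s->switchover_pending = num_initial_data_devices;
}

/* Called for each MIG_RP_MSG_INITIAL_DATA_LOADED_ACK from the destination */
void mig_src_handle_loaded_ack(MigSrcState *s)
{
    if (s->switchover_pending > 0) {
        s->switchover_pending--;
    }
}

/* The source may stop the VM and complete migration only when this is true */
bool mig_src_can_switchover(const MigSrcState *s)
{
    return s->switchover_pending == 0;
}
```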

> I know that even if !precopy_init_size on src, it doesn't mean that dest
> has already digested all the data in the send buffer.  However since we'll
> anyway make sure queued data landed before switch over happens (e.g., when
> we only have 1 migration channel data are sent in sequential manner), it
> means when switchover the dst qemu should have these loaded?
>
> Thanks,
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-07 12:54           ` Avihai Horon
@ 2023-05-08  0:49             ` Peter Xu
  2023-05-08 11:11               ` Avihai Horon
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Xu @ 2023-05-08  0:49 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins

On Sun, May 07, 2023 at 03:54:00PM +0300, Avihai Horon wrote:
> 
> On 04/05/2023 18:50, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Thu, May 04, 2023 at 01:18:04PM +0300, Avihai Horon wrote:
> > > On 03/05/2023 18:49, Peter Xu wrote:
> > > > External email: Use caution opening links or attachments
> > > > 
> > > > 
> > > > On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote:
> > > > > On 03/05/2023 1:49, Peter Xu wrote:
> > > > > > External email: Use caution opening links or attachments
> > > > > > 
> > > > > > 
> > > > > > On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote:
> > > > > > > Hello everyone,
> > > > > > Hi, Avihai,
> > > > > > 
> > > > > > > === Flow of operation ===
> > > > > > > 
> > > > > > > To use precopy initial data, the capability must be enabled in the
> > > > > > > source.
> > > > > > > 
> > > > > > > As this capability must be supported also in the destination, a
> > > > > > > handshake is performed during migration setup. The purpose of the
> > > > > > > handshake is to notify the destination that precopy initial data is used
> > > > > > > and to check if it's supported.
> > > > > > > 
> > > > > > > The handshake is done in two levels. First, a general handshake is done
> > > > > > > with the destination migration code to notify that precopy initial data
> > > > > > > is used. Then, for each migration user in the source that supports
> > > > > > > precopy initial data, a handshake is done with its counterpart in the
> > > > > > > destination:
> > > > > > > If both support it, precopy initial data will be used for them.
> > > > > > > If source doesn't support it, precopy initial data will not be used for
> > > > > > > them.
> > > > > > > If source supports it and destination doesn't, migration will fail.
> > > > > > > 
> > > > > > > Assuming the handshake succeeded, migration starts to send precopy data
> > > > > > > and as part of it also the initial precopy data. Initial precopy data is
> > > > > > > just like any other precopy data and as such, migration code is not
> > > > > > > aware of it. Therefore, it's the responsibility of the migration users
> > > > > > > (such as VFIO devices) to notify their counterparts in the destination
> > > > > > > that their initial precopy data has been sent (for example, VFIO
> > > > > > > migration does it when its initial bytes reach zero).
> > > > > > > 
> > > > > > > In the destination, migration code will query each migration user that
> > > > > > > supports precopy initial data and check if its initial data has been
> > > > > > > loaded. If initial data has been loaded by all of them, an ACK will be
> > > > > > > sent to the source which will now be able to complete migration when
> > > > > > > appropriate.
> > > > > > I can understand why this is useful, what I'm not 100% sure is whether the
> > > > > > complexity is needed.  The idea seems to be that src never switchover
> > > > > > unless it receives a READY notification from dst.
> > > > > > 
> > > > > > I'm imaging below simplified and more general workflow, not sure whether it
> > > > > > could work for you:
> > > > > > 
> > > > > >      - Introduce a new cap "switchover-ready", it means whether there'll be a
> > > > > >        ready event sent from dst -> src for "being ready for switchover"
> > > > > > 
> > > > > >      - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and
> > > > > >        handled on src showing that dest is ready for switchover. It'll be sent
> > > > > >        only if dest is ready for the switchover
> > > > > > 
> > > > > >      - Introduce a field SaveVMHandlers.explicit_switchover_needed.  For each
> > > > > >        special device like vfio that would like to participate in the decision
> > > > > >        making, device can set its explicit_switchover_needed=1.  This field is
> > > > > >        ignored if the new cap is not set.
> > > > > > 
> > > > > >      - Dst qemu: when new cap set, remember how many special devices are there
> > > > > >        requesting explicit switchover (count of SaveVMHandlers that has the
> > > > > >        bit set during load setup) as switch_over_pending=N.
> > > > > > 
> > > > > >      - Dst qemu: Once a device thinks it's fine to switchover (probably in the
> > > > > >        load_state() callback), it calls migration_notify_switchover_ready().
> > > > > >        That decreases switch_over_pending and when it hits zero, one msg
> > > > > >        MIG_RP_MSG_SWITCHOVER_READY will be sent to src.
> > > > > > 
> > > > > > Only until READY msg received on src could src switchover the precopy to
> > > > > > dst.
> > > > > > 
> > > > > > Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
> > > > > > more msg (dst->src).
> > > > > > 
> > > > > > This is based on the fact that right now we always set caps on both qemus
> > > > > > so I suppose it already means either both have or don't have the feature
> > > > > > (even if one has, not setting the cap means disabled on both).
> > > > > > 
> > > > > > Would it work for this case and cleaner?
> > > > > Hi Peter, thanks for the response!
> > > > > Your approach is indeed much simpler, however I have a few concerns
> > > > > regarding compatibility.
> > > > > 
> > > > > You are saying that caps are always set both in src and dest.
> > > > > But what happens if we set the cap only on one side?
> > > > > Should we care about these scenarios?
> > > > I think it's not needed for now, but I am aware that this is a problem.
> > > > It's just that it is a more generic problem to me rather than very special
> > > > in the current feature being proposed.  At least there're a few times
> > > > Daniel showed concern on keeping this way and hoped we can have a better
> > > > handshake in general with migration framework.
> > > > 
> > > > I'd be perfectly fine if you want to approach this with a handshake
> > > > methodology, but I hope if so we should provide a more generic handshake.
> > > > So potentially that can make this new feature rely on the handshake work,
> > > > and slower to get into shape.  Your call on how to address this, at least
> > > > fine by me either way.
> > > I'd really like this feature to get in, and I'm afraid making it dependent
> > > on first implementing a general migration handshake may take a long time,
> > > like you said.
> > > What about keeping current approach but changing it such that the capability
> > > will have to be set in both src and dest, to make it similar to other
> > > capability usages?
> > > I.e., we will remove the "general" handshake:
> > > 
> > >      /* Enable precopy initial data generally in the migration */
> > >      memset(&buf, 0, sizeof(buf));
> > >      buf.general_enable = 1;
> > >      qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
> > >                               (uint8_t *)&buf);
> > > 
> > > but keep the per-device handshake, which is not a handshake for migration
> > > capabilities, but a part of the protocol when the capability is set, like in
> > > multifd, postcopy, etc.
> > > This way we can advance with this feature while making the general migration
> > > handshake an independent effort.
> > > Will that work for you?
> > Yes it's fine by me.
> > 
> > > BTW, with your suggestion to add a notification mechanism to notify when
> > > initial data is loaded in dest, I think we can drop these two SaveVMHandlers
> > > handlers:
> > >      /*
> > >       * Checks if precopy initial data is active. If it's inactive,
> > >       * initial_data_loaded check is skipped.
> > >       */
> > >      bool (*is_initial_data_active)(void *opaque);
> > >      /* Checks if precopy initial data has been loaded in dest */
> > >      bool (*initial_data_loaded)(void *opaque);
> > > 
> > > > In my imagination a generic handshake should happen at the very start of
> > > > migration and negotiate feature bits between src/dst qemu, so they can
> > > > reach a consensus on what to do next.
> > > > 
> > > > > For example, if we set the cap only in src, then src will wait indefinitely
> > > > > for dest to notify that switchover is ready.
> > > > > Would you expect migration to fail instead of just keep running
> > > > > indefinitely?
> > > > > In current approach we only need to enable the cap in the source, so such
> > > > > scenario can't happen.
> > > > > 
> > > > > Let's look at some other scenario.
> > > > > Src QEMU supports explicit-switchover for device X but *not* for device Y
> > > > > (i.e., src QEMU is some older version of QEMU that supports
> > > > > explicit-switchover for device X but not for Y).
> > > > > Dest QEMU supports explicit-switchover for device X and device Y.
> > > > > The capability is set in both src and dest.
> > > > > In the destination we will have switchover_pending=2 because both X and Y
> > > > > support explicit-switchover.
> > > > > We do migration, but switchover_pending will never reach 0 because only X
> > > > > supports it in the source, so the migration will run indefinitely.
> > > > > The per-device handshake solves this by making device Y not use
> > > > > explicit-switchover in this case.
> > > > Hmm, right.  When I was replying obviously I thought that decision can be
> > > > made solely by the dest qemu, then I assumed it's fine.  Because IIUC in
> > > > that case how many devices that supports switchover_pending on src qemu
> > > > doesn't really matter but only dest.
> > > > 
> > > > But I re-read the last patch and I do see that there's a new bit that will
> > > > change the device protocol of migration:
> > > > 
> > > >     if (migration->initial_data_active && !migration->precopy_init_size &&
> > > >         !migration->initial_data_sent) {
> > > >         qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
> > > >         migration->initial_data_sent = true;
> > > >     } else {
> > > >         qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > >     }
> > > > 
> > > > With this, I think what you said makes sense because then the src qemu
> > > > matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it
> > > > also needs to make sure dst qemu will recognize it.
> > > > 
> > > > Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have?
> > > > Can this decision be made on dest qemu only?
> > > > 
> > > > To ask in another way, I saw that precopy_init_size is the foundation to
> > > > decide whether to send this flag.  Then it's a matter of whether dest qemu
> > > > is also aware of precopy_init_size, then it can already tell when it's
> > > > ready to handle the switchover.
> > > The destination is not aware of precopy_init_size, only the source knows it.
> > > So the source must send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to notify dest that
> > > the initial data was sent.
> > Then, can the src qemu notify instead?
> > 
> > We can have similar notification mechanism on src qemu and if that can work
> > we can further same the other MIG_RP_MSG.  The counter counts how many

I meant s/same/save/ here..

> > special src devices are there and we don't switchover unless all agree.
> 
> Do you mean the following:
> We will have the pending counter and notification mechanism in source
> instead of destination.
> The MIG_RP_MSG_INITIAL_DATA_LOADED_ACK message will be sent by each device
> in the destination that loaded its initial data.
> Each such RP_MSG will decrease the pending counter in source.
> When the pending counter in source reaches zero, source can complete
> migration.
> 
> Or did I misunderstand you?

I meant the notification can come solely from the src qemu, so src qemu can
skip the switchover if any of the relevant src devices hasn't yet
acknowledged the switchover.

Then I think we can avoid introducing a MIG_RP msg?  I'm not sure whether
I missed something, though.  I stated my understanding on the ordering below.

> 
> > I know that even if !precopy_init_size on src, it doesn't mean that dest
> > has already digested all the data in the send buffer.  However since we'll
> > anyway make sure queued data landed before switch over happens (e.g., when
> > we only have 1 migration channel data are sent in sequential manner), it
> > means when switchover the dst qemu should have these loaded?
> > 
> > Thanks,
> > 
> > --
> > Peter Xu
> > 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-08  0:49             ` Peter Xu
@ 2023-05-08 11:11               ` Avihai Horon
  0 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-08 11:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Leonardo Bras, Eric Blake, Markus Armbruster,
	Thomas Huth, Laurent Vivier, Paolo Bonzini, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede, Tarun Gupta,
	Joao Martins


On 08/05/2023 3:49, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Sun, May 07, 2023 at 03:54:00PM +0300, Avihai Horon wrote:
>> On 04/05/2023 18:50, Peter Xu wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Thu, May 04, 2023 at 01:18:04PM +0300, Avihai Horon wrote:
>>>> On 03/05/2023 18:49, Peter Xu wrote:
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> On Wed, May 03, 2023 at 06:22:59PM +0300, Avihai Horon wrote:
>>>>>> On 03/05/2023 1:49, Peter Xu wrote:
>>>>>>> External email: Use caution opening links or attachments
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 01, 2023 at 05:01:33PM +0300, Avihai Horon wrote:
>>>>>>>> Hello everyone,
>>>>>>> Hi, Avihai,
>>>>>>>
>>>>>>>> === Flow of operation ===
>>>>>>>>
>>>>>>>> To use precopy initial data, the capability must be enabled in the
>>>>>>>> source.
>>>>>>>>
>>>>>>>> As this capability must be supported also in the destination, a
>>>>>>>> handshake is performed during migration setup. The purpose of the
>>>>>>>> handshake is to notify the destination that precopy initial data is used
>>>>>>>> and to check if it's supported.
>>>>>>>>
>>>>>>>> The handshake is done in two levels. First, a general handshake is done
>>>>>>>> with the destination migration code to notify that precopy initial data
>>>>>>>> is used. Then, for each migration user in the source that supports
>>>>>>>> precopy initial data, a handshake is done with its counterpart in the
>>>>>>>> destination:
>>>>>>>> If both support it, precopy initial data will be used for them.
>>>>>>>> If source doesn't support it, precopy initial data will not be used for
>>>>>>>> them.
>>>>>>>> If source supports it and destination doesn't, migration will fail.
>>>>>>>>
>>>>>>>> Assuming the handshake succeeded, migration starts to send precopy data
>>>>>>>> and as part of it also the initial precopy data. Initial precopy data is
>>>>>>>> just like any other precopy data and as such, migration code is not
>>>>>>>> aware of it. Therefore, it's the responsibility of the migration users
>>>>>>>> (such as VFIO devices) to notify their counterparts in the destination
>>>>>>>> that their initial precopy data has been sent (for example, VFIO
>>>>>>>> migration does it when its initial bytes reach zero).
>>>>>>>>
>>>>>>>> In the destination, migration code will query each migration user that
>>>>>>>> supports precopy initial data and check if its initial data has been
>>>>>>>> loaded. If initial data has been loaded by all of them, an ACK will be
>>>>>>>> sent to the source which will now be able to complete migration when
>>>>>>>> appropriate.
>>>>>>> I can understand why this is useful, what I'm not 100% sure is whether the
>>>>>>> complexity is needed.  The idea seems to be that src never switchover
>>>>>>> unless it receives a READY notification from dst.
>>>>>>>
>>>>>>> I'm imaging below simplified and more general workflow, not sure whether it
>>>>>>> could work for you:
>>>>>>>
>>>>>>>       - Introduce a new cap "switchover-ready", it means whether there'll be a
>>>>>>>         ready event sent from dst -> src for "being ready for switchover"
>>>>>>>
>>>>>>>       - When cap set, a new msg MIG_RP_MSG_SWITCHOVER_READY is defined and
>>>>>>>         handled on src showing that dest is ready for switchover. It'll be sent
>>>>>>>         only if dest is ready for the switchover
>>>>>>>
>>>>>>>       - Introduce a field SaveVMHandlers.explicit_switchover_needed.  For each
>>>>>>>         special device like vfio that would like to participate in the decision
>>>>>>>         making, device can set its explicit_switchover_needed=1.  This field is
>>>>>>>         ignored if the new cap is not set.
>>>>>>>
>>>>>>>       - Dst qemu: when new cap set, remember how many special devices are there
>>>>>>>         requesting explicit switchover (count of SaveVMHandlers that has the
>>>>>>>         bit set during load setup) as switch_over_pending=N.
>>>>>>>
>>>>>>>       - Dst qemu: Once a device thinks it's fine to switchover (probably in the
>>>>>>>         load_state() callback), it calls migration_notify_switchover_ready().
>>>>>>>         That decreases switch_over_pending and when it hits zero, one msg
>>>>>>>         MIG_RP_MSG_SWITCHOVER_READY will be sent to src.
>>>>>>>
>>>>>>> Only until READY msg received on src could src switchover the precopy to
>>>>>>> dst.
>>>>>>>
>>>>>>> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
>>>>>>> more msg (dst->src).
>>>>>>>
>>>>>>> This is based on the fact that right now we always set caps on both qemus
>>>>>>> so I suppose it already means either both have or don't have the feature
>>>>>>> (even if one has, not setting the cap means disabled on both).
>>>>>>>
>>>>>>> Would it work for this case and cleaner?
>>>>>> Hi Peter, thanks for the response!
>>>>>> Your approach is indeed much simpler, however I have a few concerns
>>>>>> regarding compatibility.
>>>>>>
>>>>>> You are saying that caps are always set both in src and dest.
>>>>>> But what happens if we set the cap only on one side?
>>>>>> Should we care about these scenarios?
>>>>> I think it's not needed for now, but I am aware that this is a problem.
>>>>> It's just that it is a more generic problem to me rather than very special
>>>>> in the current feature being proposed.  At least there're a few times
>>>>> Daniel showed concern on keeping this way and hoped we can have a better
>>>>> handshake in general with migration framework.
>>>>>
>>>>> I'd be perfectly fine if you want to approach this with a handshake
>>>>> methodology, but I hope if so we should provide a more generic handshake.
>>>>> So potentially that can make this new feature rely on the handshake work,
>>>>> and slower to get into shape.  Your call on how to address this, at least
>>>>> fine by me either way.
>>>> I'd really like this feature to get in, and I'm afraid making it dependent
>>>> on first implementing a general migration handshake may take a long time,
>>>> like you said.
>>>> What about keeping current approach but changing it such that the capability
>>>> will have to be set in both src and dest, to make it similar to other
>>>> capability usages?
>>>> I.e., we will remove the "general" handshake:
>>>>
>>>>       /* Enable precopy initial data generally in the migration */
>>>>       memset(&buf, 0, sizeof(buf));
>>>>       buf.general_enable = 1;
>>>>       qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
>>>>                                (uint8_t *)&buf);
>>>>
>>>> but keep the per-device handshake, which is not a handshake for migration
>>>> capabilities, but a part of the protocol when the capability is set, like in
>>>> multifd, postcopy, etc.
>>>> This way we can advance with this feature while making the general migration
>>>> handshake an independent effort.
>>>> Will that work for you?
>>> Yes it's fine by me.
>>>
>>>> BTW, with your suggestion to add a notification mechanism to notify when
>>>> initial data is loaded in dest, I think we can drop these two SaveVMHandlers
>>>> handlers:
>>>>       /*
>>>>        * Checks if precopy initial data is active. If it's inactive,
>>>>        * initial_data_loaded check is skipped.
>>>>        */
>>>>       bool (*is_initial_data_active)(void *opaque);
>>>>       /* Checks if precopy initial data has been loaded in dest */
>>>>       bool (*initial_data_loaded)(void *opaque);
>>>>
>>>>> In my imagination a generic handshake should happen at the very start of
>>>>> migration and negotiate feature bits between src/dst qemu, so they can
>>>>> reach a consensus on what to do next.
>>>>>
>>>>>> For example, if we set the cap only in src, then src will wait indefinitely
>>>>>> for dest to notify that switchover is ready.
>>>>>> Would you expect migration to fail instead of just keep running
>>>>>> indefinitely?
>>>>>> In current approach we only need to enable the cap in the source, so such
>>>>>> scenario can't happen.

[1]

>>>>>> Let's look at some other scenario.
>>>>>> Src QEMU supports explicit-switchover for device X but *not* for device Y
>>>>>> (i.e., src QEMU is some older version of QEMU that supports
>>>>>> explicit-switchover for device X but not for Y).
>>>>>> Dest QEMU supports explicit-switchover for device X and device Y.
>>>>>> The capability is set in both src and dest.
>>>>>> In the destination we will have switchover_pending=2 because both X and Y
>>>>>> support explicit-switchover.
>>>>>> We do migration, but switchover_pending will never reach 0 because only X
>>>>>> supports it in the source, so the migration will run indefinitely.
>>>>>> The per-device handshake solves this by making device Y not use
>>>>>> explicit-switchover in this case.
>>>>> Hmm, right.  When I was replying obviously I thought that decision can be
>>>>> made solely by the dest qemu, then I assumed it's fine.  Because IIUC in
>>>>> that case how many devices that supports switchover_pending on src qemu
>>>>> doesn't really matter but only dest.
>>>>>
>>>>> But I re-read the last patch and I do see that there's a new bit that will
>>>>> change the device protocol of migration:
>>>>>
>>>>>      if (migration->initial_data_active && !migration->precopy_init_size &&
>>>>>          !migration->initial_data_sent) {
>>>>>          qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
>>>>>          migration->initial_data_sent = true;
>>>>>      } else {
>>>>>          qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>>>>      }
>>>>>
>>>>> With this, I think what you said makes sense because then the src qemu
>>>>> matters on deciding whether to send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT, it
>>>>> also needs to make sure dst qemu will recognize it.
>>>>>
>>>>> Do you think this new VFIO_MIG_FLAG_DEV_INIT_DATA_SENT is a must to have?
>>>>> Can this decision be made on dest qemu only?
>>>>>
>>>>> To ask in another way, I saw that precopy_init_size is the foundation to
>>>>> decide whether to send this flag.  Then it's a matter of whether dest qemu
>>>>> is also aware of precopy_init_size, then it can already tell when it's
>>>>> ready to handle the switchover.
>>>> The destination is not aware of precopy_init_size, only the source knows it.
>>>> So the source must send VFIO_MIG_FLAG_DEV_INIT_DATA_SENT to notify dest that
>>>> the initial data was sent.
>>> Then, can the src qemu notify instead?
>>>
>>> We can have similar notification mechanism on src qemu and if that can work
>>> we can further same the other MIG_RP_MSG.  The counter counts how many
> I meant s/same/save/ here..
>
>>> special src devices are there and we don't switchover unless all agree.
>> Do you mean the following:
>> We will have the pending counter and notification mechanism in source
>> instead of destination.
>> The MIG_RP_MSG_INITIAL_DATA_LOADED_ACK message will be sent by each device
>> in the destination that loaded its initial data.
>> Each such RP_MSG will decrease the pending counter in source.
>> When the pending counter in source reaches zero, source can complete
>> migration.
>>
>> Or did I misunderstand you?
> I meant the notification can come solely from the src qemu, so src qemu can
> skip the switchover if any of the relevant src devices hasn't yet
> acknowledged the switchover.
>
> Then I think we can avoid introducing a MIG_RP msg?  I'm not sure whether
> I missed something, though.  I stated my understanding on the ordering below.

Oh, I see what you mean now.
However, I don't think it will work -- knowing that the source has sent 
the initial data is not enough on its own.
We must get an ACK from the destination that it has processed the 
initial data, as processing it can take time.

For example:
- We have one device that supports precopy initial data.
- Loading the initial data takes 2 seconds.
The device in the source has just finished sending the initial data, so 
it notifies that it is ready for switchover.
The source sees that the device has acked the switchover, and the 
general pending precopy data is below the threshold,
so it decides to stop the VM and complete the migration.
However, the destination is still processing the initial precopy data, 
which takes 2 seconds, during which the destination
is blocked and can't process new migration data or complete the migration.
The source VM is already stopped, so these 2 seconds are added to the 
downtime.

What I suggested above, of having the notification mechanism in the 
source and sending a MIG_RP message for each
device in the destination that has loaded its initial data, solves the 
problem in the aforementioned scenario ([1] above).
That's because switchover_pending will be 1 in the source and will reach 
zero when device X in the destination acks that
its initial data is loaded, so migration can be completed.
With that, we only need to add a single new MIG_RP message and the 
initial_data_advise() SaveVMHandlers handler.

However, I'm not sure this will actually work.
Supporting initial data may require special handling/preparation. Device 
Y in the destination may expect to get
initial data, but device Y in the source will not send it since it 
doesn't support it. Depending on the device implementation, this may 
fail the migration.

The per-device advise can solve this by syncing between the devices in 
the source and the destination.
For example, device Y in the destination will not get the advise MIG_CMD 
and thus will know not to expect initial data.
Plus, in case a device in the source supports initial data but its 
counterpart in the destination doesn't, migration will fail during setup
and not only when the destination device gets unexpected initial data. 
This gives clearer and more consistent behavior.
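
As a rough sketch of what the per-device advise would guarantee 
(illustrative only -- hypothetical names, not actual QEMU or VFIO 
code):

```c
#include <stdbool.h>

/*
 * Hypothetical model of the per-device advise handshake -- these
 * names don't exist in QEMU. The source sends an advise MIG_CMD only
 * for devices that support precopy initial data; the destination can
 * then fail cleanly during setup instead of hitting unexpected
 * initial data later.
 */
typedef enum {
    ADVISE_USE_INITIAL_DATA, /* both sides support it */
    ADVISE_SKIP,             /* source doesn't support it: plain precopy */
    ADVISE_FAIL_SETUP,       /* source advised it, destination can't comply */
} AdviseResult;

AdviseResult handle_initial_data_advise(bool src_supports, bool dst_supports)
{
    if (!src_supports) {
        /* No advise MIG_CMD sent: destination must not expect initial data */
        return ADVISE_SKIP;
    }
    if (!dst_supports) {
        /* Fail during migration setup, giving a clear, early error */
        return ADVISE_FAIL_SETUP;
    }
    return ADVISE_USE_INITIAL_DATA;
}
```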

This will add some complexity of sending a new advise MIG_CMD per device 
in the source and handling it in the destination,
but I still tend to include the per-device advise for the reasons above.
What's your opinion about it?

Thanks.

>>> I know that even if !precopy_init_size on src, it doesn't mean that dest
>>> has already digested all the data in the send buffer.  However since we'll
>>> anyway make sure queued data landed before switch over happens (e.g., when
>>> we only have 1 migration channel data are sent in sequential manner), it
>>> means when switchover the dst qemu should have these loaded?
>>>
>>> Thanks,
>>>
>>> --
>>> Peter Xu
>>>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/8] migration: Add precopy initial data capability
  2023-05-01 14:01 ` [PATCH 1/8] migration: Add precopy initial data capability Avihai Horon
@ 2023-05-10  8:24   ` Juan Quintela
  2023-05-17  9:17   ` Markus Armbruster
  1 sibling, 0 replies; 42+ messages in thread
From: Juan Quintela @ 2023-05-10  8:24 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins

Avihai Horon <avihaih@nvidia.com> wrote:
> Migration downtime estimation is calculated based on bandwidth and
> remaining migration data. This assumes that loading of migration data in
> the destination takes a negligible amount of time and that downtime
> depends only on network speed.
>
> While this may be true for RAM, it's not necessarily true for other
> migration users. For example, loading the data of a VFIO device in the
> destination might require the device to allocate resources, prepare
> internal data structures and so on. These operations can take a
> significant amount of time which can increase migration downtime.
>
> This patch adds a new capability "precopy initial data" that allows the
> source to send initial precopy data and the destination to ACK that this
> data has been loaded. Migration will not attempt to stop the source VM
> and complete the migration until this ACK is received.
>
> This will allow migration users to send initial precopy data which can
> be used to reduce downtime (e.g., by pre-allocating resources), while
> making sure that the source will stop the VM and complete the migration
> only after this initial precopy data is sent and loaded in the
> destination so it will have full effect.
>
> This new capability relies on the return path capability to communicate
> from the destination back to the source.
>
> The actual implementation of the capability will be added in the
> following patches.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>


Capability definition is correct.
I am not giving the Reviewed-by until the rest of the series is discussed,
but there is nothing else to do here.




* Re: [PATCH 2/8] migration: Add precopy initial data handshake
  2023-05-01 14:01 ` [PATCH 2/8] migration: Add precopy initial data handshake Avihai Horon
  2023-05-02 22:54   ` Peter Xu
@ 2023-05-10  8:40   ` Juan Quintela
  2023-05-10 15:32     ` Avihai Horon
  2023-05-14 16:42   ` Cédric Le Goater
  2 siblings, 1 reply; 42+ messages in thread
From: Juan Quintela @ 2023-05-10  8:40 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins

Avihai Horon <avihaih@nvidia.com> wrote:
> Add precopy initial data handshake between source and destination upon
> migration setup. The purpose of the handshake is to notify the
> destination that precopy initial data is used and which migration users
> (i.e., SaveStateEntry) are going to use it.
>
> The handshake is done in two levels. First, a general enable command is
> sent to notify the destination migration code that precopy initial data
> is used.
> Then, for each migration user in the source that supports precopy
> initial data, an enable command is sent to its counterpart in the
> destination:
> If both support it, precopy initial data will be used for them.
> If source doesn't support it, precopy initial data will not be used for
> them.
> If source supports it and destination doesn't, migration will fail.
>
> To implement it, a new migration command MIG_CMD_INITIAL_DATA_ENABLE is
> added, as well as a new SaveVMHandlers handler initial_data_advise.
> Calling the handler advises the migration user that precopy initial data
> is used and its return value indicates whether precopy initial data is
> supported by it.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

> diff --git a/migration/savevm.c b/migration/savevm.c
> index a9181b444b..2740defdf0 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -71,6 +71,13 @@
>  
>  const unsigned int postcopy_ram_discard_version;
>  
> +typedef struct {
> +    uint8_t general_enable;

I miss a comment for this field.

I think that you only use the values 0 and 1
And that 1 means something like: we require this feature and do nothing
And that 0 means that this is a device that uses the feature and
<something, something>

> +    uint8_t reserved[7];
> +    uint8_t idstr[256];
> +    uint32_t instance_id;
> +} InitialDataInfo;

We have some "reserved" space here.  Do we want a version field?
It doesn't seem that we need a size field, as the command is fixed length.
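For illustration only, here is a hypothetical layout (not the patch's actual one, and `InitialDataInfoV` is an invented name) that carves a version field out of the reserved bytes while keeping the command's fixed 268-byte size:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical variant of InitialDataInfo that uses one of the
 * reserved bytes as a version field.  Sizes assume the usual
 * 4-byte alignment of uint32_t, giving 268 bytes either way.
 */
typedef struct {
    uint8_t general_enable;
    uint8_t version;        /* carved out of reserved[7] */
    uint8_t reserved[6];
    uint8_t idstr[256];
    uint32_t instance_id;
} InitialDataInfoV;
```

A destination could then reject commands whose version it does not understand instead of misparsing them.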

> @@ -90,6 +97,8 @@ enum qemu_vm_cmd {
>      MIG_CMD_ENABLE_COLO,       /* Enable COLO */
>      MIG_CMD_POSTCOPY_RESUME,   /* resume postcopy on dest */
>      MIG_CMD_RECV_BITMAP,       /* Request for recved bitmap on dst */
> +

Spurious blank line

> +    MIG_CMD_INITIAL_DATA_ENABLE, /* Enable precopy initial data in dest */
>      MIG_CMD_MAX



> +void qemu_savevm_send_initial_data_enable(MigrationState *ms, QEMUFile *f)
> +{
> +    SaveStateEntry *se;
> +    InitialDataInfo buf;

Small nit.

The new way in the block to declare that something needs to be
initialized to zero is:

    InitialDataInfo buf = {};

And no, I have no clue if this makes the compiler generate any better code.
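As a standalone sketch (mirroring the struct from the patch, purely for illustration), both forms below leave every member zeroed; `{}` is the GCC/Clang (and C23) spelling of an all-zero initializer:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mirror of the patch's InitialDataInfo, for illustration only */
typedef struct {
    uint8_t general_enable;
    uint8_t reserved[7];
    uint8_t idstr[256];
    uint32_t instance_id;
} InitialDataInfo;

/* Check every member is zero, without relying on padding contents */
int buf_is_zeroed(const InitialDataInfo *buf)
{
    for (size_t i = 0; i < sizeof(buf->idstr); i++) {
        if (buf->idstr[i]) {
            return 0;
        }
    }
    for (size_t i = 0; i < sizeof(buf->reserved); i++) {
        if (buf->reserved[i]) {
            return 0;
        }
    }
    return buf->general_enable == 0 && buf->instance_id == 0;
}
```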

> +    /* Enable precopy initial data generally in the migration */
> +    memset(&buf, 0, sizeof(buf));
> +    buf.general_enable = 1;
> +    qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
> +                             (uint8_t *)&buf);
> +    trace_savevm_send_initial_data_enable(buf.general_enable, (char *)buf.idstr,
> +                                          buf.instance_id);

No buf.idstr here.

Why do we need a command before the loop, rather than first seeing if we
have any device that requires this?

> +    /* Enable precopy initial data for each migration user that supports it */
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (!se->ops || !se->ops->initial_data_advise) {
> +            continue;
> +        }
> +
> +        if (!se->ops->initial_data_advise(se->opaque)) {
> +            continue;
> +        }

Is this callback going to send anything?

> +
> +        memset(&buf, 0, sizeof(buf));
> +        memcpy(buf.idstr, se->idstr, sizeof(buf.idstr));
> +        buf.instance_id = se->instance_id;
> +
> +        qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
> +                                 (uint8_t *)&buf);
> +        trace_savevm_send_initial_data_enable(
> +            buf.general_enable, (char *)buf.idstr, buf.instance_id);

Is it me, or is general_enable always zero here?

> +    }
> +}
> +
>  void qemu_savevm_send_colo_enable(QEMUFile *f)
>  {
>      trace_savevm_send_colo_enable();
> @@ -2314,6 +2359,60 @@ static int loadvm_process_enable_colo(MigrationIncomingState *mis)
>      return ret;
>  }
>  
> +static int loadvm_handle_initial_data_enable(MigrationIncomingState *mis)
> +{
> +    InitialDataInfo buf;
> +    SaveStateEntry *se;
> +    ssize_t read_size;
> +
> +    read_size = qemu_get_buffer(mis->from_src_file, (void *)&buf, sizeof(buf));
> +    if (read_size != sizeof(buf)) {
> +        error_report("%s: Could not get data buffer, read_size %ld, len %lu",
> +                     __func__, read_size, sizeof(buf));
> +
> +        return -EIO;
> +    }
> +
> +    /* Enable precopy initial data generally in the migration */
> +    if (buf.general_enable) {
> +        mis->initial_data_enabled = true;
> +        trace_loadvm_handle_initial_data_enable(
> +            buf.general_enable, (char *)buf.idstr, buf.instance_id);
> +
> +        return 0;
> +    }
> +
> +    /* Enable precopy initial data for a specific migration user */
> +    se = find_se((char *)buf.idstr, buf.instance_id);
> +    if (!se) {
> +        error_report("%s: Could not find SaveStateEntry, idstr '%s', "
> +                     "instance_id %" PRIu32,
> +                     __func__, buf.idstr, buf.instance_id);
> +
> +        return -ENOENT;
> +    }
> +
> +    if (!se->ops || !se->ops->initial_data_advise) {
> +        error_report("%s: '%s' doesn't have required "
> +                     "initial_data_advise op",
> +                     __func__, buf.idstr);
> +
> +        return -EOPNOTSUPP;
> +    }
> +
> +    if (!se->ops->initial_data_advise(se->opaque)) {
> +        error_report("%s: '%s' doesn't support precopy initial data", __func__,
> +                     buf.idstr);

This is not your fault.  Just venting.

And here we are, again, with a place where we can't return errors.  Sniff.

> +
> +        return -EOPNOTSUPP;
> +    }

I have to wait until I see the usage later in the series, but is it a
good idea to have a single handler for source and destination, without
passing at least a parameter telling where we are?

Really nice patch, very well done and very well integrated with the
surrounding style.  A pleasure.

Later, Juan.




* Re: [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality
  2023-05-01 14:01 ` [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality Avihai Horon
  2023-05-02 22:56   ` Peter Xu
@ 2023-05-10  8:54   ` Juan Quintela
  2023-05-10 15:52     ` Avihai Horon
  1 sibling, 1 reply; 42+ messages in thread
From: Juan Quintela @ 2023-05-10  8:54 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins

Avihai Horon <avihaih@nvidia.com> wrote:
> Add the core functionality of precopy initial data, which allows the
> destination to ACK that initial data has been loaded and the source to
> wait for this ACK before completing the migration.
>
> A new return path command MIG_RP_MSG_INITIAL_DATA_LOADED_ACK is added.
> It is sent by the destination after precopy initial data is loaded to
> ACK to the source that precopy initial data has been loaded.
>
> In addition, two new SaveVMHandlers handlers are added:
> 1. is_initial_data_active which indicates whether precopy initial data
>    is used for this migration user (i.e., SaveStateEntry).
> 2. initial_data_loaded which indicates whether precopy initial data has
>    been loaded by this migration user.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>


> @@ -1401,6 +1412,8 @@ void migrate_init(MigrationState *s)
>      s->vm_was_running = false;
>      s->iteration_initial_bytes = 0;
>      s->threshold_size = 0;
> +
> +    s->initial_data_loaded_acked = false;

In general, you leave a blank line before the stuff you add, when all the
previous fields don't do that.  Can you remove it?

> @@ -2704,6 +2725,20 @@ static void migration_update_counters(MigrationState *s,
>                                bandwidth, s->threshold_size);
>  }
>  
> +static bool initial_data_loaded_acked(MigrationState *s)
> +{
> +    if (!migrate_precopy_initial_data()) {
> +        return true;
> +    }
> +
> +    /* No reason to wait for precopy initial data loaded ACK if VM is stopped */
> +    if (!runstate_is_running()) {
> +        return true;
> +    }

Thinking out loud here.

What happens if we start a migration.  Guest is running.
We enable precopy_initial_data().

And then we stop the guest.

Are we going to receive data that expects this to return false?  Or is it
handled somewhere else?

> @@ -2719,6 +2754,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>  {
>      uint64_t must_precopy, can_postcopy;
>      bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
> +    bool initial_data_loaded = initial_data_loaded_acked(s);
>  
>      qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
>      uint64_t pending_size = must_precopy + can_postcopy;
> @@ -2731,7 +2767,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>          trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
>      }
>  
> -    if (!pending_size || pending_size < s->threshold_size) {
> +    if ((!pending_size || pending_size < s->threshold_size) &&
> +        initial_data_loaded) {
>          trace_migration_thread_low_pending(pending_size);
>          migration_completion(s);
>          return MIG_ITERATE_BREAK;

For this specific variable, I think I am going to add something more
general that this can piggyback on.

For the migration tests I need exactly this functionality.  I want
migration to run until the test decided that it is a good idea to
finish.  I.e. For testing xbzrle I need at least three iterations, to
test auto_converge I need a minimum of 13 iterations.  And I am going to
do exactly what you have done here.

Will come back to you after I think of something.

> @@ -2739,7 +2776,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>  
>      /* Still a significant amount to transfer */
>      if (!in_postcopy && must_precopy <= s->threshold_size &&
> -        qatomic_read(&s->start_postcopy)) {
> +        initial_data_loaded && qatomic_read(&s->start_postcopy)) {
>          if (postcopy_start(s)) {
>              error_report("%s: postcopy failed to start", __func__);
>          }
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 2740defdf0..7a94deda3b 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2504,6 +2504,39 @@ static int loadvm_process_command(QEMUFile *f)
>      return 0;
>  }
>  
> +static int qemu_loadvm_initial_data_loaded_ack(MigrationIncomingState *mis)
> +{
> +    SaveStateEntry *se;
> +    int ret;
> +
> +    if (!mis->initial_data_enabled || mis->initial_data_loaded_ack_sent) {
> +        return 0;
> +    }
> +
> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> +        if (!se->ops || !se->ops->initial_data_loaded) {
> +            continue;
> +        }
> +
> +        if (!se->ops->is_initial_data_active ||
> +            !se->ops->is_initial_data_active(se->opaque)) {
> +            continue;
> +        }

If you don't have any other use for is_initial_data_active(), I think you
can integrate the functionality with initial_data_loaded().

If it is not active, just return true?
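A minimal sketch of the merged semantics (hypothetical names, not the actual SaveVMHandlers callbacks): a single callback that reports "loaded" when the feature is inactive, so the ACK loop needs only one check per device:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device migration state, for illustration only */
typedef struct {
    bool initial_data_active;   /* was is_initial_data_active() */
    bool data_loaded;           /* was initial_data_loaded() */
} DevMigState;

/*
 * Merged callback: a device that doesn't use precopy initial data
 * never blocks the loaded ACK, so report it as already loaded.
 */
bool initial_data_loaded(const DevMigState *dev)
{
    if (!dev->initial_data_active) {
        return true;
    }
    return dev->data_loaded;
}
```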

> +
> +        if (!se->ops->initial_data_loaded(se->opaque)) {
> +            return 0;
> +        }
> +    }
> +
> +    ret = migrate_send_rp_initial_data_loaded_ack(mis);
> +    if (!ret) {
> +        mis->initial_data_loaded_ack_sent = true;
> +        trace_loadvm_initial_data_loaded_acked();
> +    }
> +
> +    return ret;
> +}

Later, Juan




* Re: [PATCH 4/8] migration: Enable precopy initial data capability
  2023-05-01 14:01 ` [PATCH 4/8] migration: Enable precopy initial data capability Avihai Horon
@ 2023-05-10  8:55   ` Juan Quintela
  0 siblings, 0 replies; 42+ messages in thread
From: Juan Quintela @ 2023-05-10  8:55 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins

Avihai Horon <avihaih@nvidia.com> wrote:
> Now that precopy initial data logic has been implemented, enable the
> capability.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

For when the time comes O:-)




* Re: [PATCH 5/8] tests: Add migration precopy initial data capability test
  2023-05-01 14:01 ` [PATCH 5/8] tests: Add migration precopy initial data capability test Avihai Horon
@ 2023-05-10  8:55   ` Juan Quintela
  0 siblings, 0 replies; 42+ messages in thread
From: Juan Quintela @ 2023-05-10  8:55 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins

Avihai Horon <avihaih@nvidia.com> wrote:
> Add migration precopy initial data capability test. The test runs
> without migration users that support this capability, but is still
> useful to make sure it didn't break anything.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>




* Re: [PATCH 6/8] vfio/migration: Refactor vfio_save_block() to return saved data size
  2023-05-01 14:01 ` [PATCH 6/8] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
@ 2023-05-10  9:00   ` Juan Quintela
  0 siblings, 0 replies; 42+ messages in thread
From: Juan Quintela @ 2023-05-10  9:00 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins

Avihai Horon <avihaih@nvidia.com> wrote:
> Refactor vfio_save_block() to return the size of saved data on success
> and -errno on error.
>
> This will be used in next patch to implement VFIO migration pre-copy
> support.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

And this is independent of this series.




* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-03 15:22   ` Avihai Horon
  2023-05-03 15:49     ` Peter Xu
@ 2023-05-10  9:12     ` Juan Quintela
  2023-05-10 16:01       ` Avihai Horon
  1 sibling, 1 reply; 42+ messages in thread
From: Juan Quintela @ 2023-05-10  9:12 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Peter Xu, qemu-devel, Alex Williamson, Cédric Le Goater,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
	Daniel Berrange

Avihai Horon <avihaih@nvidia.com> wrote:
> On 03/05/2023 1:49, Peter Xu wrote:
>> External email: Use caution opening links or attachments

>> Only until READY msg received on src could src switchover the precopy to
>> dst.
>>
>> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
>> more msg (dst->src).
>>
>> This is based on the fact that right now we always set caps on both qemus
>> so I suppose it already means either both have or don't have the feature
>> (even if one has, not setting the cap means disabled on both).
>>
>> Would it work for this case and cleaner?
>
> Hi Peter, thanks for the response!
> Your approach is indeed much simpler, however I have a few concerns
> regarding compatibility.
>
> You are saying that caps are always set both in src and dest.
> But what happens if we set the cap only on one side?
> Should we care about these scenarios?

Not really.
We assume that something like libvirt has set things up and that such
things are OK.  We don't try to detect that kind of thing in the
migration stream (I am not saying we shouldn't, but that we don't).

If you configure qemu with a disk that is on the source but not on the
destination, migration will work fine until the device copy stage, when
that disk is found missing.  I think this is something like that: a
misconfiguration.

> For example, if we set the cap only in src, then src will wait
> indefinitely for dest to notify that switchover is ready.
> Would you expect migration to fail instead of just keep running
> indefinitely?
> In the current approach we only need to enable the cap in the source, so
> such a scenario can't happen.

I see.  I have to think about whether this is a better approach.  But I
would like to know what libvirt thinks about this.

Daniel?


> Let's look at some other scenario.
> Src QEMU supports explicit-switchover for device X but *not* for
> device Y (i.e., src QEMU is some older version of QEMU that supports
> explicit-switchover for device X but not for Y).
> Dest QEMU supports explicit-switchover for device X and device Y.
> The capability is set in both src and dest.
> In the destination we will have switchover_pending=2 because both X
> and Y support explicit-switchover.
> We do migration, but switchover_pending will never reach 0 because
> only X supports it in the source, so the migration will run
> indefinitely.
> The per-device handshake solves this by making device Y not use
> explicit-switchover in this case.

You have a point here.
But I will approach this case in a different way:

Destination QEMU needs to be older, because it doesn't have the feature.
So we need to NOT be able to do the switchover for older machine
types.
And have something like this is qemu/hw/machine.c

GlobalProperty hw_compat_7_2[] = {
    { "our_device", "explicit-switchover", "off" },
};

Or whatever we want to call the device and the property, and disable it
for older machine types to allow migration for them.
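The mechanism can be sketched outside QEMU as a compat table consulted per machine type (hypothetical names and lookup, for illustration; QEMU's real mechanism is the GlobalProperty arrays applied by the machine code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical compat entry: a property override for older machine types */
typedef struct {
    const char *device;
    const char *property;
    bool value;
} CompatProperty;

/* Older machine types get explicit-switchover forced off */
const CompatProperty hw_compat_old[] = {
    { "our_device", "explicit-switchover", false },
};

bool explicit_switchover_enabled(const char *device, bool machine_is_old)
{
    if (!machine_is_old) {
        return true;    /* new machine types default to on */
    }
    for (size_t i = 0; i < sizeof(hw_compat_old) / sizeof(hw_compat_old[0]); i++) {
        if (strcmp(hw_compat_old[i].device, device) == 0 &&
            strcmp(hw_compat_old[i].property, "explicit-switchover") == 0) {
            return hw_compat_old[i].value;
        }
    }
    return true;        /* no override recorded for this device */
}
```

With the property forced off for pre-feature machine types, the feature is simply never negotiated on such machines instead of hanging the migration.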

That said, this is the "ideal" world.  In general we don't enforce
this because we are not good at detecting this kind of failure.

Later, Juan.




* Re: [PATCH 2/8] migration: Add precopy initial data handshake
  2023-05-10  8:40   ` Juan Quintela
@ 2023-05-10 15:32     ` Avihai Horon
  0 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-10 15:32 UTC (permalink / raw)
  To: quintela
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins


On 10/05/2023 11:40, Juan Quintela wrote:
> Avihai Horon <avihaih@nvidia.com> wrote:
>> Add precopy initial data handshake between source and destination upon
>> migration setup. The purpose of the handshake is to notify the
>> destination that precopy initial data is used and which migration users
>> (i.e., SaveStateEntry) are going to use it.
>>
>> The handshake is done in two levels. First, a general enable command is
>> sent to notify the destination migration code that precopy initial data
>> is used.
>> Then, for each migration user in the source that supports precopy
>> initial data, an enable command is sent to its counterpart in the
>> destination:
>> If both support it, precopy initial data will be used for them.
>> If source doesn't support it, precopy initial data will not be used for
>> them.
>> If source supports it and destination doesn't, migration will fail.
>>
>> To implement it, a new migration command MIG_CMD_INITIAL_DATA_ENABLE is
>> added, as well as a new SaveVMHandlers handler initial_data_advise.
>> Calling the handler advises the migration user that precopy initial data
>> is used and its return value indicates whether precopy initial data is
>> supported by it.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index a9181b444b..2740defdf0 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -71,6 +71,13 @@
>>
>>   const unsigned int postcopy_ram_discard_version;
>>
>> +typedef struct {
>> +    uint8_t general_enable;
> I miss a comment for this field.
>
> I think that you only use the values 0 and 1
> And that 1 means something like: we require this feature and do nothing
> And that 0 means that this is a device that uses the feature and
> <something, something>

Correct.
But if we assume both source and destination will enable the capability 
then general_enable can just be dropped.

>> +    uint8_t reserved[7];
>> +    uint8_t idstr[256];
>> +    uint32_t instance_id;
>> +} InitialDataInfo;
> We have some "reserved" space here.  Do we want a version field?

You mean SaveStateEntry->version_id field?
If so, then I don't think it's necessary, it's checked later on when 
device data is sent.

> It doesn't seem that we need a size field, as the command is fixed length.
>
>> @@ -90,6 +97,8 @@ enum qemu_vm_cmd {
>>       MIG_CMD_ENABLE_COLO,       /* Enable COLO */
>>       MIG_CMD_POSTCOPY_RESUME,   /* resume postcopy on dest */
>>       MIG_CMD_RECV_BITMAP,       /* Request for recved bitmap on dst */
>> +
> Spurious blank line

Will remove.

>
>> +    MIG_CMD_INITIAL_DATA_ENABLE, /* Enable precopy initial data in dest */
>>       MIG_CMD_MAX
>
>
>> +void qemu_savevm_send_initial_data_enable(MigrationState *ms, QEMUFile *f)
>> +{
>> +    SaveStateEntry *se;
>> +    InitialDataInfo buf;
> Small nit.
>
> The new way in the block to declare that something needs to be
> initialized to zero is:
>
>      InitialDataInfo buf = {};
>
> And no, I have no clue if this makes the compiler generate any better code.

Will change.

>
>> +    /* Enable precopy initial data generally in the migration */
>> +    memset(&buf, 0, sizeof(buf));
>> +    buf.general_enable = 1;
>> +    qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
>> +                             (uint8_t *)&buf);
>> +    trace_savevm_send_initial_data_enable(buf.general_enable, (char *)buf.idstr,
>> +                                          buf.instance_id);
> No buf.idstr here.
>
> Why do we need a command before the loop, rather than first seeing if we
> have any device that requires this?

If we assume both source and destination will enable the capability then 
this can be dropped as well.

>
>> +    /* Enable precopy initial data for each migration user that supports it */
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +        if (!se->ops || !se->ops->initial_data_advise) {
>> +            continue;
>> +        }
>> +
>> +        if (!se->ops->initial_data_advise(se->opaque)) {
>> +            continue;
>> +        }
> Is this callback going to send anything?

Nope.

>
>> +
>> +        memset(&buf, 0, sizeof(buf));
>> +        memcpy(buf.idstr, se->idstr, sizeof(buf.idstr));
>> +        buf.instance_id = se->instance_id;
>> +
>> +        qemu_savevm_command_send(f, MIG_CMD_INITIAL_DATA_ENABLE, sizeof(buf),
>> +                                 (uint8_t *)&buf);
>> +        trace_savevm_send_initial_data_enable(
>> +            buf.general_enable, (char *)buf.idstr, buf.instance_id);
> Is it me, or is general_enable always zero here?

Yes, it's always 0.
But as I wrote above, if we assume both source and destination will 
enable the capability then general_enable can be dropped.

>
>> +    }
>> +}
>> +
>>   void qemu_savevm_send_colo_enable(QEMUFile *f)
>>   {
>>       trace_savevm_send_colo_enable();
>> @@ -2314,6 +2359,60 @@ static int loadvm_process_enable_colo(MigrationIncomingState *mis)
>>       return ret;
>>   }
>>
>> +static int loadvm_handle_initial_data_enable(MigrationIncomingState *mis)
>> +{
>> +    InitialDataInfo buf;
>> +    SaveStateEntry *se;
>> +    ssize_t read_size;
>> +
>> +    read_size = qemu_get_buffer(mis->from_src_file, (void *)&buf, sizeof(buf));
>> +    if (read_size != sizeof(buf)) {
>> +        error_report("%s: Could not get data buffer, read_size %ld, len %lu",
>> +                     __func__, read_size, sizeof(buf));
>> +
>> +        return -EIO;
>> +    }
>> +
>> +    /* Enable precopy initial data generally in the migration */
>> +    if (buf.general_enable) {
>> +        mis->initial_data_enabled = true;
>> +        trace_loadvm_handle_initial_data_enable(
>> +            buf.general_enable, (char *)buf.idstr, buf.instance_id);
>> +
>> +        return 0;
>> +    }
>> +
>> +    /* Enable precopy initial data for a specific migration user */
>> +    se = find_se((char *)buf.idstr, buf.instance_id);
>> +    if (!se) {
>> +        error_report("%s: Could not find SaveStateEntry, idstr '%s', "
>> +                     "instance_id %" PRIu32,
>> +                     __func__, buf.idstr, buf.instance_id);
>> +
>> +        return -ENOENT;
>> +    }
>> +
>> +    if (!se->ops || !se->ops->initial_data_advise) {
>> +        error_report("%s: '%s' doesn't have required "
>> +                     "initial_data_advise op",
>> +                     __func__, buf.idstr);
>> +
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    if (!se->ops->initial_data_advise(se->opaque)) {
>> +        error_report("%s: '%s' doesn't support precopy initial data", __func__,
>> +                     buf.idstr);
> This is not your fault.  Just venting.
>
> And here we are, again, with a place where we can't return errors.  Sniff.
>
>> +
>> +        return -EOPNOTSUPP;
>> +    }
> I have to wait until I see the usage later in the series, but is it a
> good idea to have a single handler for source and destination, without
> passing at least a parameter telling where we are?

Are you referring here to the initial_data_advise() handler?
If so, then I didn't need to distinguish between source and destination.
But I can add a parameter if you think it could be useful/good practice.

>
> Really nice patch, very well done and very well integrated with the
> surrounding style.  A pleasure.

Thanks!




* Re: [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality
  2023-05-10  8:54   ` Juan Quintela
@ 2023-05-10 15:52     ` Avihai Horon
  2023-05-10 15:59       ` Juan Quintela
  0 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-05-10 15:52 UTC (permalink / raw)
  To: quintela
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins


On 10/05/2023 11:54, Juan Quintela wrote:
> Avihai Horon <avihaih@nvidia.com> wrote:
>> Add the core functionality of precopy initial data, which allows the
>> destination to ACK that initial data has been loaded and the source to
>> wait for this ACK before completing the migration.
>>
>> A new return path command MIG_RP_MSG_INITIAL_DATA_LOADED_ACK is added.
>> It is sent by the destination after precopy initial data is loaded to
>> ACK to the source that precopy initial data has been loaded.
>>
>> In addition, two new SaveVMHandlers handlers are added:
>> 1. is_initial_data_active which indicates whether precopy initial data
>>     is used for this migration user (i.e., SaveStateEntry).
>> 2. initial_data_loaded which indicates whether precopy initial data has
>>     been loaded by this migration user.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>
>> @@ -1401,6 +1412,8 @@ void migrate_init(MigrationState *s)
>>       s->vm_was_running = false;
>>       s->iteration_initial_bytes = 0;
>>       s->threshold_size = 0;
>> +
>> +    s->initial_data_loaded_acked = false;
> In general, you leave a blank line before the stuff you add, when all the
> previous fields don't do that.  Can you remove it?

Sure, I will remove them.

>
>> @@ -2704,6 +2725,20 @@ static void migration_update_counters(MigrationState *s,
>>                                 bandwidth, s->threshold_size);
>>   }
>>
>> +static bool initial_data_loaded_acked(MigrationState *s)
>> +{
>> +    if (!migrate_precopy_initial_data()) {
>> +        return true;
>> +    }
>> +
>> +    /* No reason to wait for precopy initial data loaded ACK if VM is stopped */
>> +    if (!runstate_is_running()) {
>> +        return true;
>> +    }
> Thinking out loud here.
>
> What happens if we start a migration.  Guest is running.
> We enable precopy_initial_data().
>
> And then we stop the guest.
>
> Are we going to receive data that expects this to return false?  Or is it
> handled somewhere else?

Not sure I fully understand what you mean here, could you elaborate?

>> @@ -2719,6 +2754,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>   {
>>       uint64_t must_precopy, can_postcopy;
>>       bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
>> +    bool initial_data_loaded = initial_data_loaded_acked(s);
>>
>>       qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
>>       uint64_t pending_size = must_precopy + can_postcopy;
>> @@ -2731,7 +2767,8 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>           trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
>>       }
>>
>> -    if (!pending_size || pending_size < s->threshold_size) {
>> +    if ((!pending_size || pending_size < s->threshold_size) &&
>> +        initial_data_loaded) {
>>           trace_migration_thread_low_pending(pending_size);
>>           migration_completion(s);
>>           return MIG_ITERATE_BREAK;
> For this specific variable, I think I am going to add something more
> general that this can piggyback on.
>
> For the migration tests I need exactly this functionality.  I want
> migration to run until the test decides that it is a good idea to
> finish.  E.g., for testing xbzrle I need at least three iterations; to
> test auto_converge I need a minimum of 13 iterations.  And I am going to
> do exactly what you have done here.
>
> I will come back to you after I think of something.

Oh, cool. So I will keep this as is in the meantime.

>
>> @@ -2739,7 +2776,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>>
>>       /* Still a significant amount to transfer */
>>       if (!in_postcopy && must_precopy <= s->threshold_size &&
>> -        qatomic_read(&s->start_postcopy)) {
>> +        initial_data_loaded && qatomic_read(&s->start_postcopy)) {
>>           if (postcopy_start(s)) {
>>               error_report("%s: postcopy failed to start", __func__);
>>           }
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 2740defdf0..7a94deda3b 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2504,6 +2504,39 @@ static int loadvm_process_command(QEMUFile *f)
>>       return 0;
>>   }
>>
>> +static int qemu_loadvm_initial_data_loaded_ack(MigrationIncomingState *mis)
>> +{
>> +    SaveStateEntry *se;
>> +    int ret;
>> +
>> +    if (!mis->initial_data_enabled || mis->initial_data_loaded_ack_sent) {
>> +        return 0;
>> +    }
>> +
>> +    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> +        if (!se->ops || !se->ops->initial_data_loaded) {
>> +            continue;
>> +        }
>> +
>> +        if (!se->ops->is_initial_data_active ||
>> +            !se->ops->is_initial_data_active(se->opaque)) {
>> +            continue;
>> +        }
> If you don't have any other use for is_initial_data_active() I think you
> can integrate the functionality with initial_data_loaded().
>
> If it is not active, just return 1?

Yes it's possible.
However, if we change this to a notification mechanism as Peter 
suggested, then I think both is_initial_data_active() and 
initial_data_loaded() can be dropped.

Thanks.

>
>> +
>> +        if (!se->ops->initial_data_loaded(se->opaque)) {
>> +            return 0;
>> +        }
>> +    }
>> +
>> +    ret = migrate_send_rp_initial_data_loaded_ack(mis);
>> +    if (!ret) {
>> +        mis->initial_data_loaded_ack_sent = true;
>> +        trace_loadvm_initial_data_loaded_acked();
>> +    }
>> +
>> +    return ret;
>> +}
> Later, Juan
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality
  2023-05-10 15:52     ` Avihai Horon
@ 2023-05-10 15:59       ` Juan Quintela
  0 siblings, 0 replies; 42+ messages in thread
From: Juan Quintela @ 2023-05-10 15:59 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater, Peter Xu,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins

Avihai Horon <avihaih@nvidia.com> wrote:
> On 10/05/2023 11:54, Juan Quintela wrote:

>>> +static bool initial_data_loaded_acked(MigrationState *s)
>>> +{
>>> +    if (!migrate_precopy_initial_data()) {
>>> +        return true;
>>> +    }
>>> +
>>> +    /* No reason to wait for precopy initial data loaded ACK if VM is stopped */
>>> +    if (!runstate_is_running()) {
>>> +        return true;
>>> +    }
>> Thinking loud here.
>>
>> What happens if we start a migration.  Guest is running.
>> We enable precopy_initial_data().
>>
>> And then we stop the guest.
>>
>> Are we going to receive data that expects this to return false?  Or is it
>> handled somewhere else?
>
> Not sure I fully understand what you mean here, could you elaborate?
>

static bool initial_data_loaded_acked(MigrationState *s)
{
    if (!migrate_precopy_initial_data()) {
        return true;
    }

    /* No reason to wait for precopy initial data loaded ACK if VM is stopped */
    if (!runstate_is_running()) {
        return true;
    }

    return s->initial_data_loaded_acked;

    ****** here we are only returning a value.  I misread it; I thought
           we were calling a function.  Sorry for the noise.
Later, Juan.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-10  9:12     ` Juan Quintela
@ 2023-05-10 16:01       ` Avihai Horon
  2023-05-10 16:41         ` Juan Quintela
  0 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-05-10 16:01 UTC (permalink / raw)
  To: quintela
  Cc: Peter Xu, qemu-devel, Alex Williamson, Cédric Le Goater,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
	Daniel Berrange


On 10/05/2023 12:12, Juan Quintela wrote:
>
> Avihai Horon <avihaih@nvidia.com> wrote:
>> On 03/05/2023 1:49, Peter Xu wrote:
>>> Only until READY msg received on src could src switchover the precopy to
>>> dst.
>>>
>>> Then it only needs 1 more field in SaveVMHandlers rather than 3, and only 1
>>> more msg (dst->src).
>>>
>>> This is based on the fact that right now we always set caps on both qemus
>>> so I suppose it already means either both have or don't have the feature
>>> (even if one has, not setting the cap means disabled on both).
>>>
>>> Would it work for this case and cleaner?
>> Hi Peter, thanks for the response!
>> Your approach is indeed much simpler; however, I have a few concerns
>> regarding compatibility.
>>
>> You are saying that caps are always set both in src and dest.
>> But what happens if we set the cap only on one side?
>> Should we care about these scenarios?
> Not really.
> We assume that something like libvirt has set things up and such
> things are OK.  We don't try to detect that kind of thing in the
> migration stream (I am not saying we shouldn't, only that we don't).
>
> If you configure qemu with a disk that exists on the source but not
> on the destination, migration will work fine until the device copy stage,
> when that disk is found missing.  I think this is something like that: a
> misconfiguration.

Ah, I understand.

>
>> For example, if we set the cap only in src, then src will wait
>> indefinitely for dest to notify that switchover is ready.
>> Would you expect migration to fail instead of just running
>> indefinitely?
>> In current approach we only need to enable the cap in the source, so
>> such scenario can't happen.
> I see.  I have to think about whether this is a better approach.  But I
> would like to know what libvirt thinks about this.
>
> Daniel?
>
>
>> Let's look at some other scenario.
>> Src QEMU supports explicit-switchover for device X but *not* for
>> device Y (i.e., src QEMU is some older version of QEMU that supports
>> explicit-switchover for device X but not for Y).
>> Dest QEMU supports explicit-switchover for device X and device Y.
>> The capability is set in both src and dest.
>> In the destination we will have switchover_pending=2 because both X
>> and Y support explicit-switchover.
>> We do migration, but switchover_pending will never reach 0 because
>> only X supports it in the source, so the migration will run
>> indefinitely.
>> The per-device handshake solves this by making device Y not use
>> explicit-switchover in this case.
> You have a point here.
> But I will approach this case in a different way:
>
> Destination QEMU needs to be the older one, because it doesn't have the
> feature.  So we need to NOT do the switchover for older machine
> types.
> And have something like this in qemu/hw/machine.c:
>
> GlobalProperty hw_compat_7_2[] = {
>      { "our_device", "explicit-switchover", "off" },
> };
>
> Or whatever we want to call the device and the property, and not use it
> for older machine types to allow migration for that.

Let me see if I get this straight (I'm not that familiar with 
hw_compat_x_y):

You mean that device Y, which adds support for explicit-switchover in 
QEMU version Z, should add a property like you wrote above, and use it 
to disable explicit-switchover usage for Y devices when a Y device from 
a QEMU older than Z is migrated?

Thanks.

>
> Once told that, this is the "ideal" world.  In general we don't force
> this because we are not good at detecting this kind of failures.
>
> Later, Juan.
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-10 16:01       ` Avihai Horon
@ 2023-05-10 16:41         ` Juan Quintela
  2023-05-11 11:31           ` Avihai Horon
  0 siblings, 1 reply; 42+ messages in thread
From: Juan Quintela @ 2023-05-10 16:41 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Peter Xu, qemu-devel, Alex Williamson, Cédric Le Goater,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
	Daniel Berrange

Avihai Horon <avihaih@nvidia.com> wrote:

>> You have a point here.
>> But I will approach this case in a different way:
>>
>> Destination QEMU needs to be the older one, because it doesn't have the
>> feature.  So we need to NOT do the switchover for older machine
>> types.
>> And have something like this in qemu/hw/machine.c:
>>
>> GlobalProperty hw_compat_7_2[] = {
>>      { "our_device", "explicit-switchover", "off" },
>> };
>>
>> Or whatever we want to call the device and the property, and not use it
>> for older machine types to allow migration for that.
>
> Let me see if I get this straight (I'm not that familiar with
> hw_compat_x_y):
>
> You mean that device Y, which adds support for explicit-switchover in
> QEMU version Z, should add a property
> like you wrote above, and use it to disable explicit-switchover usage
> for Y devices when a Y device
> from a QEMU older than Z is migrated?

More "to" than "from".

Let me elaborate.  We have two QEMUs:

QEMU version X, has device dev. Let's call it qemu-X.
QEMU version Y (X+1) adds feature foo to device dev.  Let's call it qemu-Y.

We have two machine types (for this exercise we don't care about
architectures)

PC-X.0
PC-Y.0

So, the possible combinations are:

First the easy cases, same qemu on both sides.  Different machine types.

$ qemu-X -M PC-X.0   -> to -> qemu-X -M PC-X.0

  good.  Neither side uses feature foo.

$ qemu-X -M PC-Y.0   -> to -> qemu-X -M PC-Y.0

  impossible. qemu-X doesn't have machine PC-Y.0.  So nothing to see here.

$ qemu-Y -M PC-X.0   -> to -> qemu-Y -M PC-X.0

  good.  We have feature foo on both sides.  Notice that I recommend here
  not to use feature foo.  We will see why in the difficult cases.

$ qemu-Y -M PC-Y.0   -> to -> qemu-Y -M PC-Y.0

  good.  Both sides use feature foo.

Difficult cases, when we mix qemu versions.

$ qemu-X -M PC-X.0  -> to -> qemu-Y -M PC-X.0

  The source doesn't have feature foo.  The destination does.
  But if we disable it for machine PC-X.0, it will work.

$ qemu-Y -M PC-X.0  -> to -> qemu-X -M PC-X.0

  Same as the previous example, but here we have feature foo on the source
  and not on the destination.  Disabling it for machine PC-X.0 fixes the
  problem.

We can't migrate a PC-Y.0 when one of the qemu's is qemu-X, so that case
is impossible.

Does this make more sense?

And now, how hw_compat_X_Y works.

It is an array of records with the format:

- name of device  (we give some rope here, for instance migration is a
  device in this context)

- name of property: self explanatory.  The important bit is that
  we can get the value of the property in the device driver.

- value of the property: self explanatory.

With this mechanism what we do when we add a feature to a device that
matters for migration is:
- for the machine type of the version that we are "developing", the feature
  is enabled by default.  For whatever that "enabled" means.

- for old machine types we disable the feature, so we can migrate
  freely with old qemu, as long as we use the old machine type.

- there is a way to enable the feature on the command line even for old
  machine types on new qemu, but only developers use that, for testing.
  Normal users/admins never do that.

What does hw_compat_7_2 mean?

Ok, we need to know the versions.  New version is 8.0.

hw_compat_7_2 has all the properties representing "features", defaults,
whatever has changed since 7.2.  In other words, the features we
need to disable to get back to the behavior that existed when 7.2 was
released.
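
As a rough sketch of how such a table could be consumed (a plain Python
model, not QEMU code; the lookup function and its exact semantics are my
simplification), each compat array rolls property defaults back for any
machine type at or below its version:

```python
# Simplified model of machine-type compat tables (illustrative only,
# not QEMU's implementation).  Each entry rolls a property default back
# to what that release shipped; a table applies to every machine type
# at or below its version.
HW_COMPAT = [
    ("8.0", {("migration", "multifd-flush-after-each-section"): "on"}),
    ("7.2", {("our_device", "explicit-switchover"): "off"}),
]

def resolve_property(machine_version, device, prop, new_default):
    """Return the effective property value for a given machine type."""
    value = new_default
    for table_version, overrides in HW_COMPAT:
        # String comparison is good enough for these short version tags.
        if machine_version <= table_version:
            value = overrides.get((device, prop), value)
    return value

# An old machine type keeps the old behaviour; a new one gets the new
# default, regardless of which QEMU binary is running.
assert resolve_property("7.2", "migration",
                        "multifd-flush-after-each-section", "off") == "on"
assert resolve_property("8.1", "migration",
                        "multifd-flush-after-each-section", "off") == "off"
assert resolve_property("7.2", "our_device",
                        "explicit-switchover", "on") == "off"
```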

To go to a real example.

In the development tree.  We have:

GlobalProperty hw_compat_8_0[] = {
    { "migration", "multifd-flush-after-each-section", "on"},
};
const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);

Feature is implemented in the following commits:

77c259a4cb1c9799754b48f570301ebf1de5ded8
b05292c237030343516d073b1a1e5f49ffc017a8
294e5a4034e81b3d8db03b4e0f691386f20d6ed3

When we are doing migration with multifd and we pass the end of memory
(i.e. we end one iteration through all the RAM) we need to make sure
that we don't send the same page through two channels, i.e. the contents of
the page at iteration 1 through channel 1 and the contents of the page at
iteration 2 through channel 2.  The problem is that they could arrive
out of order: the page from iteration 1 could arrive later than the one
from iteration 2 and overwrite new data with old data, which is undesirable.
We could use complex algorithms to fix that, but one easy way of doing
it is:

- When we finish a run through all memory (i.e. one iteration), we flush
  all channels and make sure that everything arrives at the destination
  before we start sending data of the next iteration.  I call that
  synchronizing all channels.

And that is what we *should* have done.  But when I implemented the
feature, I did this synchronization every time that we finish a cycle
(around 100 milliseconds), i.e. 10 times per second.  This is called a
section for historical reasons.  And when you are migrating
multi-terabyte RAM machines with very fast networking, we end up waiting
too long on the synchronizations.

Once we detected the problem and found the cause, we changed that.  The
problem is that if we are running an old qemu against a new qemu (or
vice versa) we would not be able to migrate, because the two sides
send/expect synchronizations at different points.

So we have to maintain the old algorithm and the new algorithm.  That is
what we did here.  For machine types older than <current in development>,
i.e. 8.0, we use the old algorithm (multifd-flush-after-each-section is
"on").

But the default for new machine types is the new algorithm, much faster.
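
The ordering hazard, and why the iteration-boundary flush fixes it, can
be shown with a tiny deterministic model (plain Python, not QEMU code):

```python
# Toy model of the multifd ordering hazard (not QEMU code).  The same
# guest page is sent once per iteration, possibly on different channels,
# and channels give no cross-channel ordering guarantee.
sent = [
    (1, "page0", "data@iter1"),  # iteration 1, e.g. via channel 1
    (2, "page0", "data@iter2"),  # iteration 2, e.g. via channel 2
]

def apply_in_order(arrivals):
    """Apply page writes in the order they arrive at the destination."""
    memory = {}
    for _iteration, page, data in arrivals:
        memory[page] = data
    return memory

# Without a sync at the iteration boundary, channel 2 may deliver first,
# so the stale iteration-1 copy lands last and wins:
assert apply_in_order([sent[1], sent[0]])["page0"] == "data@iter1"

# Flushing all channels between iterations forces iteration order, so
# the destination always ends up with the newest copy:
assert apply_in_order(sorted(sent))["page0"] == "data@iter2"
```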

I know that the explanation has been quite long, but inventing an
example would have been even more complex.

Does this make sense?

Later, Juan.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-10 16:41         ` Juan Quintela
@ 2023-05-11 11:31           ` Avihai Horon
  2023-05-11 13:09             ` Juan Quintela
  0 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-05-11 11:31 UTC (permalink / raw)
  To: quintela
  Cc: Peter Xu, qemu-devel, Alex Williamson, Cédric Le Goater,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
	Daniel Berrange


On 10/05/2023 19:41, Juan Quintela wrote:
>
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>>> You have a point here.
>>> But I will approach this case in a different way:
>>>
>>> Destination QEMU needs to be the older one, because it doesn't have the
>>> feature.  So we need to NOT do the switchover for older machine
>>> types.
>>> And have something like this in qemu/hw/machine.c:
>>>
>>> GlobalProperty hw_compat_7_2[] = {
>>>       { "our_device", "explicit-switchover", "off" },
>>> };
>>>
>>> Or whatever we want to call the device and the property, and not use it
>>> for older machine types to allow migration for that.
>> Let me see if I get this straight (I'm not that familiar with
>> hw_compat_x_y):
>>
>> You mean that device Y, which adds support for explicit-switchover in
>> QEMU version Z, should add a property
>> like you wrote above, and use it to disable explicit-switchover usage
>> for Y devices when a Y device
>> from a QEMU older than Z is migrated?
> More "to" than "from".
>
> Let me elaborate.  We have two QEMUs:
>
> QEMU version X, has device dev. Let's call it qemu-X.
> QEMU version Y (X+1) adds feature foo to device dev.  Let's call it qemu-Y.
>
> We have two machine types (for this exercise we don't care about
> architectures)
>
> PC-X.0
> PC-Y.0
>
> So, the possible combinations are:
>
> First the easy cases, same qemu on both sides.  Different machine types.
>
> $ qemu-X -M PC-X.0   -> to -> qemu-X -M PC-X.0
>
>    good.  Neither side uses feature foo.
>
> $ qemu-X -M PC-Y.0   -> to -> qemu-X -M PC-Y.0
>
>    impossible. qemu-X doesn't have machine PC-Y.0.  So nothing to see here.
>
> $ qemu-Y -M PC-X.0   -> to -> qemu-Y -M PC-X.0
>
>    good.  We have feature foo on both sides.  Notice that I recommend here
>    not to use feature foo.  We will see why in the difficult cases.
>
> $ qemu-Y -M PC-Y.0   -> to -> qemu-Y -M PC-Y.0
>
>    good.  Both sides use feature foo.
>
> Difficult cases, when we mix qemu versions.
>
> $ qemu-X -M PC-X.0  -> to -> qemu-Y -M PC-X.0
>
>    The source doesn't have feature foo.  The destination does.
>    But if we disable it for machine PC-X.0, it will work.
>
> $ qemu-Y -M PC-X.0  -> to -> qemu-X -M PC-X.0
>
>    Same as the previous example, but here we have feature foo on the source
>    and not on the destination.  Disabling it for machine PC-X.0 fixes the
>    problem.
>
> We can't migrate a PC-Y.0 when one of the qemu's is qemu-X, so that case
> is impossible.
>
> Does this make more sense?
>
> And now, how hw_compat_X_Y works.
>
> It is an array of records with the format:
>
> - name of device  (we give some rope here, for instance migration is a
>    device in this context)
>
> - name of property: self explanatory.  The important bit is that
>    we can get the value of the property in the device driver.
>
> - value of the property: self explanatory.
>
> With this mechanism what we do when we add a feature to a device that
> matters for migration is:
> - for the machine type of the version that we are "developing", the feature
>    is enabled by default.  For whatever that "enabled" means.
>
> - for old machine types we disable the feature, so we can migrate
>    freely with old qemu, as long as we use the old machine type.
>
> - there is a way to enable the feature on the command line even for old
>    machine types on new qemu, but only developers use that, for testing.
>    Normal users/admins never do that.
>
> What does hw_compat_7_2 mean?
>
> Ok, we need to know the versions.  New version is 8.0.
>
> hw_compat_7_2 has all the properties representing "features", defaults,
> whatever has changed since 7.2.  In other words, the features we
> need to disable to get back to the behavior that existed when 7.2 was
> released.
>
> To go to a real example.
>
> In the development tree.  We have:
>
> GlobalProperty hw_compat_8_0[] = {
>      { "migration", "multifd-flush-after-each-section", "on"},
> };
> const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>
> Feature is implemented in the following commits:
>
> 77c259a4cb1c9799754b48f570301ebf1de5ded8
> b05292c237030343516d073b1a1e5f49ffc017a8
> 294e5a4034e81b3d8db03b4e0f691386f20d6ed3
>
> When we are doing migration with multifd and we pass the end of memory
> (i.e. we end one iteration through all the RAM) we need to make sure
> that we don't send the same page through two channels, i.e. the contents of
> the page at iteration 1 through channel 1 and the contents of the page at
> iteration 2 through channel 2.  The problem is that they could arrive
> out of order: the page from iteration 1 could arrive later than the one
> from iteration 2 and overwrite new data with old data, which is undesirable.
> We could use complex algorithms to fix that, but one easy way of doing
> it is:
>
> - When we finish a run through all memory (i.e. one iteration), we flush
>    all channels and make sure that everything arrives at the destination
>    before we start sending data of the next iteration.  I call that
>    synchronizing all channels.
>
> And that is what we *should* have done.  But when I implemented the
> feature, I did this synchronization every time that we finish a cycle
> (around 100 milliseconds), i.e. 10 times per second.  This is called a
> section for historical reasons.  And when you are migrating
> multi-terabyte RAM machines with very fast networking, we end up waiting
> too long on the synchronizations.
>
> Once we detected the problem and found the cause, we changed that.  The
> problem is that if we are running an old qemu against a new qemu (or
> vice versa) we would not be able to migrate, because the two sides
> send/expect synchronizations at different points.
>
> So we have to maintain the old algorithm and the new algorithm.  That is
> what we did here.  For machine types older than <current in development>,
> i.e. 8.0, we use the old algorithm (multifd-flush-after-each-section is
> "on").
>
> But the default for new machine types is the new algorithm, much faster.
>
> I know that the explanation has been quite long, but inventing an
> example would have been even more complex.
>
> Does this make sense?

Yes, thanks a lot for the full and detailed explanation!
This indeed solves the problem in the scenario I mentioned above.

However, this relies on the fact that a device's support for this 
feature depends only on the QEMU version.
This is not the case for VFIO devices.
To support explicit-switchover, a VFIO device also needs host kernel 
support for VFIO precopy, i.e., it needs to have the 
VFIO_MIGRATION_PRE_COPY flag set.
So, theoretically we could have the following:
- Source and destination QEMU are the same version.
- We migrate two different VFIO devices (i.e., they don't share the same 
kernel driver), device X and device Y.
- Host kernel in source supports VFIO precopy for device X but not for 
device Y.
- Host kernel in destination supports VFIO precopy for both device X and 
device Y.
Without explicit-switchover, migration should work.
But if we enable explicit-switchover and do migration, we would end up 
in the same situation where switchover_pending=2 in the destination and 
it never reaches zero, so migration is stuck.
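
A minimal sketch of that accounting (hypothetical names; this is only a
model of why the counter gets stuck, not the actual patch code):

```python
# Toy model of a destination-side switchover_pending counter
# (illustrative names, not the actual QEMU patches).
class SwitchoverGate:
    def __init__(self):
        self.pending = 0

    def device_advertises(self):
        # A device that supports explicit-switchover bumps the counter.
        self.pending += 1

    def ack_received(self):
        # Each explicit-switchover ACK releases one device.
        self.pending -= 1

    def may_switch_over(self):
        return self.pending == 0

gate = SwitchoverGate()
# Destination: both X and Y advertise explicit-switchover support.
gate.device_advertises()
gate.device_advertises()
# Source kernel only supports it for X, so only one ACK ever arrives.
gate.ack_received()
assert not gate.may_switch_over()  # pending stays at 1: migration stuck
```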

This could be solved by moving the switchover_pending counter to the 
source and sending multiple MIG_RP explicit-switchover ACK messages.
However, I also raised a concern about this in my last mail to Peter 
[1], where this is not guaranteed to work, depending on the device's 
implementation of the explicit-switchover feature.

Not sure though if I'm digging too deep in some improbable future corner 
cases.

Let's go back to the basic question, which is whether we need to send an 
"advise" message for each device that supports explicit-switchover.
I think it gives us more flexibility, and although not needed at the 
moment, it might be useful in the future.

If you want, I can send a v2 that addresses the comments and simplifies 
the code in other areas, and we'll continue discussing the necessity of 
the "advise" message then.

Thanks!

[1] 
https://lore.kernel.org/qemu-devel/688acb4e-a4e6-428d-9124-7596e3666133@nvidia.com/



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-11 11:31           ` Avihai Horon
@ 2023-05-11 13:09             ` Juan Quintela
  2023-05-11 15:08               ` Avihai Horon
  0 siblings, 1 reply; 42+ messages in thread
From: Juan Quintela @ 2023-05-11 13:09 UTC (permalink / raw)
  To: Avihai Horon
  Cc: Peter Xu, qemu-devel, Alex Williamson, Cédric Le Goater,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
	Daniel Berrange

Avihai Horon <avihaih@nvidia.com> wrote:
> On 10/05/2023 19:41, Juan Quintela wrote:

>> Does this makes sense?
>
> Yes, thanks a lot for the full and detailed explanation!

Thank you.

> This indeed solves the problem in the scenario I mentioned above.
>
> However, this relies on the fact that a device support for this
> feature depends only on the QEMU version.
> This is not the case for VFIO devices.

What a surprise :-)

Yes, I couldn't resist.

> To support explicit-switchover, a VFIO device also needs host kernel
> support for VFIO precopy, i.e., it needs to have the
> VFIO_MIGRATION_PRE_COPY flag set.
> So, theoretically we could have the following:
> - Source and destination QEMU are the same version.
> - We migrate two different VFIO devices (i.e., they don't share the
>   same kernel driver), device X and device Y.
>> - Host kernel in source supports VFIO precopy for device X but not for
>   device Y.
> - Host kernel in destination supports VFIO precopy for both device X
>   and device Y.
> Without explicit-switchover, migration should work.
> But if we enable explicit-switchover and do migration, we would end up
> in the same situation where switchover_pending=2 in destination and it
> never reaches zero so migration is stuck.

I think this is too much for qemu.  You need to work at the
libvirt/management level.

> This could be solved by moving the switchover_pending counter to the
> source and sending multiple MIG_RP explicit-switchover ACK messages.
> However, I also raised a concern about this in my last mail to Peter
> [1], where this is not guaranteed to work, depending on the device
> implementation for explicit-switchover feature.

I will not try to be extra clever here.  We have taken the QEMU version
out of the question, as it is the same QEMU on both sides.

So what we have is this configuration:

Host A
------
device X explicit_switchover=on
device Y explicit_switchover=off

Host B
------
device X explicit_switchover=on
device Y explicit_switchover=on

The configuration is different.  That is something the QEMU protocol
doesn't know how to handle; it is up to the stack.

You need to configure it explicitly on the QEMU command line on host B:
device=Y,explicit_switchover=off

Or however that is configured off.

It is exactly the same problem than:

Host A
------

Intel CPU genX

Host B
------

intel CPU genX-1

i.e. there are features that host A has but host B doesn't have.  The only
way to make this work is to configure qemu, when launched on host A, with
a CPU type that host B is able to run (i.e. one that doesn't have any
features that host B is missing).

What is the difference between this and yours?
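
The CPU analogy boils down to launching the guest with the intersection
of both hosts' capabilities (feature names below are made up for
illustration):

```python
# Toy version of the CPU-model rule: a migratable guest config may only
# use features present on every host involved (feature names invented).
host_a = {"sse4", "avx", "avx2", "amx"}
host_b = {"sse4", "avx", "avx2"}

guest_features = host_a & host_b  # what the guest may be launched with
assert "amx" not in guest_features
assert guest_features == {"sse4", "avx", "avx2"}
```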


> Not sure though if I'm digging too deep in some improbable future
> corner cases.

Oh, you are just starting.  Think of the compat layers that CPUs have had
to grow over the years.  At some point even migration between AMD and Intel
CPUs worked.

> Let's go back to the basic question, which is whether we need to send
> an "advise" message for each device that supports explicit-switchover.
> I think it gives us more flexibility and although not needed at the
> moment, might be useful in the future.

I think that is not a good idea, see my previous comment.  We have two
cases:
- both devices have the same features in both places
- they have different features in any of the places

First case, we don't care.  It always works.
Second case, we need to configure it correctly, and that means disabling
features that are not on the other side.

> If you want I can send a v2 that addresses the comments and simplifies
> the code in other areas and we'll continue discussing the necessity of
> the "advise" message then.

Yeap.  I think that is the best course of action.

Thanks, Juan.

> Thanks!
>
> [1]
> https://lore.kernel.org/qemu-devel/688acb4e-a4e6-428d-9124-7596e3666133@nvidia.com/



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
  2023-05-11 13:09             ` Juan Quintela
@ 2023-05-11 15:08               ` Avihai Horon
  0 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-11 15:08 UTC (permalink / raw)
  To: quintela
  Cc: Peter Xu, qemu-devel, Alex Williamson, Cédric Le Goater,
	Leonardo Bras, Eric Blake, Markus Armbruster, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins,
	Daniel Berrange


On 11/05/2023 16:09, Juan Quintela wrote:
>
> Avihai Horon <avihaih@nvidia.com> wrote:
>> On 10/05/2023 19:41, Juan Quintela wrote:
>>> Does this makes sense?
>> Yes, thanks a lot for the full and detailed explanation!
> Thank you.
>
>> This indeed solves the problem in the scenario I mentioned above.
>>
>> However, this relies on the fact that a device support for this
>> feature depends only on the QEMU version.
>> This is not the case for VFIO devices.
> What a surprise :-)
>
> Yes, I couldn't resist.
>
>> To support explicit-switchover, a VFIO device also needs host kernel
>> support for VFIO precopy, i.e., it needs to have the
>> VFIO_MIGRATION_PRE_COPY flag set.
>> So, theoretically we could have the following:
>> - Source and destination QEMU are the same version.
>> - We migrate two different VFIO devices (i.e., they don't share the
>>    same kernel driver), device X and device Y.
>> - Host kernel in source supports VFIO precopy for device X but not for
>>    device Y.
>> - Host kernel in destination supports VFIO precopy for both device X
>>    and device Y.
>> Without explicit-switchover, migration should work.
>> But if we enable explicit-switchover and do migration, we would end up
>> in the same situation where switchover_pending=2 in destination and it
>> never reaches zero so migration is stuck.
> I think this is too much for qemu.  You need to work at the
> libvirt/management level.
>
>> This could be solved by moving the switchover_pending counter to the
>> source and sending multiple MIG_RP explicit-switchover ACK messages.
>> However, I also raised a concern about this in my last mail to Peter
>> [1], where this is not guaranteed to work, depending on the device
>> implementation for explicit-switchover feature.
> I will not try to be extra clever here.  We have taken the QEMU version
> out of the question, as it is the same QEMU on both sides.
>
> So what we have is this configuration:
>
> Host A
> ------
> device X explicit_switchover=on
> device Y explicit_switchover=off
>
> Host B
> ------
> device X explicit_switchover=on
> device Y explicit_switchover=on
>
> The configuration is different.  That is something the QEMU protocol
> doesn't know how to handle; it is up to the stack.
>
> You need to configure it explicitly on the QEMU command line on host B:
> device=Y,explicit_switchover=off
>
> Or however that is configured off.

I understand.

>
> It is exactly the same problem than:
>
> Host A
> ------
>
> Intel CPU genX
>
> Host B
> ------
>
> intel CPU genX-1
>
> i.e. there are features that Host A has but host B doesn't have.  The only
> way to make this work is that you need to configure qemu when launched
> in Host A with a cpu type that host B is able to run (i.e. one that
> doesn't have any features that Host B is missing).
>
> What is the difference between this and yours?

Hmm, yes, I see your point.
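
To make the analogy concrete, the usual fix on the CPU side is to pin the guest to a model the older host can also run; a hypothetical command-line fragment (the model name is purely illustrative) would look like:

```shell
# Launch on Host A with a CPU model both hosts support, i.e. one without
# any feature the genX-1 host is missing.
qemu-system-x86_64 -cpu Skylake-Client -m 4G ...
```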

>
>
>> Not sure though if I'm digging too deep in some improbable future
>> corner cases.
> Oh, you are just starting.  The compat layers that CPUs have had to do
> over the years.  At some point even migration between AMD and Intel
> CPUs worked.
>
>> Let's go back to the basic question, which is whether we need to send
>> an "advise" message for each device that supports explicit-switchover.
>> I think it gives us more flexibility and although not needed at the
>> moment, might be useful in the future.
> I think that is not a good idea, see my previous comment.  We have two
> cases:
> - both devices have the same features in both places
> - they have different features in any of the places
>
> First case, we don't care.  It always works.
> Second case, we need to configure it correctly, and that means disabling
> features that are not on the other side.

Yep, I understand.

>
>> If you want I can send a v2 that addresses the comments and simplifies
>> the code in other areas and we'll continue discussing the necessity of
>> the "advise" message then.
> Yeap.  I think that is the best course of action.

OK, so let me digest all the new info of this discussion and get back 
with v2 / conclusions / questions.

Thanks for all the help!



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/8] migration: Add precopy initial data handshake
  2023-05-01 14:01 ` [PATCH 2/8] migration: Add precopy initial data handshake Avihai Horon
  2023-05-02 22:54   ` Peter Xu
  2023-05-10  8:40   ` Juan Quintela
@ 2023-05-14 16:42   ` Cédric Le Goater
  2023-05-15  7:56     ` Avihai Horon
  2 siblings, 1 reply; 42+ messages in thread
From: Cédric Le Goater @ 2023-05-14 16:42 UTC (permalink / raw)
  To: Avihai Horon, qemu-devel
  Cc: Alex Williamson, Juan Quintela, Peter Xu, Leonardo Bras,
	Eric Blake, Markus Armbruster, Thomas Huth, Laurent Vivier,
	Paolo Bonzini, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta, Joao Martins

Hello Avihai,

> +static int loadvm_handle_initial_data_enable(MigrationIncomingState *mis)
> +{
> +    InitialDataInfo buf;
> +    SaveStateEntry *se;
> +    ssize_t read_size;
> +
> +    read_size = qemu_get_buffer(mis->from_src_file, (void *)&buf, sizeof(buf));
> +    if (read_size != sizeof(buf)) {
> +        error_report("%s: Could not get data buffer, read_size %ld, len %lu",

please use %zd and %zu for ssize_t.

Thanks,

C.




* Re: [PATCH 2/8] migration: Add precopy initial data handshake
  2023-05-14 16:42   ` Cédric Le Goater
@ 2023-05-15  7:56     ` Avihai Horon
  0 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-15  7:56 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel
  Cc: Alex Williamson, Juan Quintela, Peter Xu, Leonardo Bras,
	Eric Blake, Markus Armbruster, Thomas Huth, Laurent Vivier,
	Paolo Bonzini, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Kirti Wankhede, Tarun Gupta, Joao Martins


On 14/05/2023 19:42, Cédric Le Goater wrote:
>
> Hello Avihai,
>
>> +static int loadvm_handle_initial_data_enable(MigrationIncomingState 
>> *mis)
>> +{
>> +    InitialDataInfo buf;
>> +    SaveStateEntry *se;
>> +    ssize_t read_size;
>> +
>> +    read_size = qemu_get_buffer(mis->from_src_file, (void *)&buf, 
>> sizeof(buf));
>> +    if (read_size != sizeof(buf)) {
>> +        error_report("%s: Could not get data buffer, read_size %ld, 
>> len %lu",
>
> please use %zd and %zu for ssize_t.
>
Ah, right. Sure will change.

Thanks.




* Re: [PATCH 1/8] migration: Add precopy initial data capability
  2023-05-01 14:01 ` [PATCH 1/8] migration: Add precopy initial data capability Avihai Horon
  2023-05-10  8:24   ` Juan Quintela
@ 2023-05-17  9:17   ` Markus Armbruster
  2023-05-17 10:16     ` Avihai Horon
  1 sibling, 1 reply; 42+ messages in thread
From: Markus Armbruster @ 2023-05-17  9:17 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins

Avihai Horon <avihaih@nvidia.com> writes:

> Migration downtime estimation is calculated based on bandwidth and
> remaining migration data. This assumes that loading of migration data in
> the destination takes a negligible amount of time and that downtime
> depends only on network speed.
>
> While this may be true for RAM, it's not necessarily true for other
> migration users. For example, loading the data of a VFIO device in the
> destination might require the device to allocate resources, prepare
> internal data structures and so on. These operations can take a
> significant amount of time which can increase migration downtime.
>
> This patch adds a new capability "precopy initial data" that allows the
> source to send initial precopy data and the destination to ACK that this
> data has been loaded. Migration will not attempt to stop the source VM
> and complete the migration until this ACK is received.
>
> This will allow migration users to send initial precopy data which can
> be used to reduce downtime (e.g., by pre-allocating resources), while
> making sure that the source will stop the VM and complete the migration
> only after this initial precopy data is sent and loaded in the
> destination so it will have full effect.
>
> This new capability relies on the return path capability to communicate
> from the destination back to the source.
>
> The actual implementation of the capability will be added in the
> following patches.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  qapi/migration.json |  9 ++++++++-
>  migration/options.h |  1 +
>  migration/options.c | 20 ++++++++++++++++++++
>  3 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 82000adce4..d496148386 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -478,6 +478,13 @@
>  #                    should not affect the correctness of postcopy migration.
>  #                    (since 7.1)
>  #
> +# @precopy-initial-data: If enabled, migration will not attempt to stop source
> +#                        VM and complete the migration until an ACK is received
> +#                        from the destination that initial precopy data has
> +#                        been loaded. This can improve downtime if there are
> +#                        migration users that support precopy initial data.
> +#                        (since 8.1)
> +#

Please format like

   # @precopy-initial-data: If enabled, migration will not attempt to
   #     stop source VM and complete the migration until an ACK is
   #     received from the destination that initial precopy data has been
   #     loaded.  This can improve downtime if there are migration users
   #     that support precopy initial data.  (since 8.1)

to blend in with recent commit a937b6aa739 (qapi: Reformat doc comments
to conform to current conventions).

What do you mean by "if there are migration users that support precopy
initial data"?

Do I have to ensure the ACK comes by configuring the destination VM in a
certain way, and if yes, how exactly?

>  # Features:
>  # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>  #
> @@ -492,7 +499,7 @@
>             'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
>             { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>             'validate-uuid', 'background-snapshot',
> -           'zero-copy-send', 'postcopy-preempt'] }
> +           'zero-copy-send', 'postcopy-preempt', 'precopy-initial-data'] }
>  
>  ##
>  # @MigrationCapabilityStatus:
> diff --git a/migration/options.h b/migration/options.h
> index 3c322867cd..d004b6321e 100644
> --- a/migration/options.h
> +++ b/migration/options.h
> @@ -44,6 +44,7 @@ bool migrate_pause_before_switchover(void);
>  bool migrate_postcopy_blocktime(void);
>  bool migrate_postcopy_preempt(void);
>  bool migrate_postcopy_ram(void);
> +bool migrate_precopy_initial_data(void);
>  bool migrate_rdma_pin_all(void);
>  bool migrate_release_ram(void);
>  bool migrate_return_path(void);
> diff --git a/migration/options.c b/migration/options.c
> index 53b7fc5d5d..c4ef0c60c7 100644
> --- a/migration/options.c
> +++ b/migration/options.c
> @@ -184,6 +184,8 @@ Property migration_properties[] = {
>      DEFINE_PROP_MIG_CAP("x-zero-copy-send",
>              MIGRATION_CAPABILITY_ZERO_COPY_SEND),
>  #endif
> +    DEFINE_PROP_MIG_CAP("x-precopy-initial-data",
> +                        MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA),
>  
>      DEFINE_PROP_END_OF_LIST(),
>  };
> @@ -286,6 +288,13 @@ bool migrate_postcopy_ram(void)
>      return s->capabilities[MIGRATION_CAPABILITY_POSTCOPY_RAM];
>  }
>  
> +bool migrate_precopy_initial_data(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    return s->capabilities[MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA];
> +}
> +
>  bool migrate_rdma_pin_all(void)
>  {
>      MigrationState *s = migrate_get_current();
> @@ -546,6 +555,17 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
>          }
>      }
>  
> +    if (new_caps[MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA]) {
> +        if (!new_caps[MIGRATION_CAPABILITY_RETURN_PATH]) {
> +            error_setg(errp, "Precopy initial data requires return path");

Shouldn't we mention this requirement in the docs?

To make sense of the message, the user needs to make the connection from
"Precopy initial data" to capability @precopy-initial-data, and from
"return path" to capability @return-path.  More helpful, I think:
"capability 'precopy-initial-data' requires capability 'return-path'".

> +            return false;
> +        }
> +
> +        /* Disable this capability until it's implemented */
> +        error_setg(errp, "Precopy initial data is not implemented yet");
> +        return false;
> +    }
> +
>      return true;
>  }




* Re: [PATCH 1/8] migration: Add precopy initial data capability
  2023-05-17  9:17   ` Markus Armbruster
@ 2023-05-17 10:16     ` Avihai Horon
  2023-05-17 12:21       ` Markus Armbruster
  0 siblings, 1 reply; 42+ messages in thread
From: Avihai Horon @ 2023-05-17 10:16 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins


On 17/05/2023 12:17, Markus Armbruster wrote:
>
> Avihai Horon <avihaih@nvidia.com> writes:
>
>> Migration downtime estimation is calculated based on bandwidth and
>> remaining migration data. This assumes that loading of migration data in
>> the destination takes a negligible amount of time and that downtime
>> depends only on network speed.
>>
>> While this may be true for RAM, it's not necessarily true for other
>> migration users. For example, loading the data of a VFIO device in the
>> destination might require the device to allocate resources, prepare
>> internal data structures and so on. These operations can take a
>> significant amount of time which can increase migration downtime.
>>
>> This patch adds a new capability "precopy initial data" that allows the
>> source to send initial precopy data and the destination to ACK that this
>> data has been loaded. Migration will not attempt to stop the source VM
>> and complete the migration until this ACK is received.
>>
>> This will allow migration users to send initial precopy data which can
>> be used to reduce downtime (e.g., by pre-allocating resources), while
>> making sure that the source will stop the VM and complete the migration
>> only after this initial precopy data is sent and loaded in the
>> destination so it will have full effect.
>>
>> This new capability relies on the return path capability to communicate
>> from the destination back to the source.
>>
>> The actual implementation of the capability will be added in the
>> following patches.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   qapi/migration.json |  9 ++++++++-
>>   migration/options.h |  1 +
>>   migration/options.c | 20 ++++++++++++++++++++
>>   3 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 82000adce4..d496148386 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -478,6 +478,13 @@
>>   #                    should not affect the correctness of postcopy migration.
>>   #                    (since 7.1)
>>   #
>> +# @precopy-initial-data: If enabled, migration will not attempt to stop source
>> +#                        VM and complete the migration until an ACK is received
>> +#                        from the destination that initial precopy data has
>> +#                        been loaded. This can improve downtime if there are
>> +#                        migration users that support precopy initial data.
>> +#                        (since 8.1)
>> +#
> Please format like
>
>     # @precopy-initial-data: If enabled, migration will not attempt to
>     #     stop source VM and complete the migration until an ACK is
>     #     received from the destination that initial precopy data has been
>     #     loaded.  This can improve downtime if there are migration users
>     #     that support precopy initial data.  (since 8.1)
>
> to blend in with recent commit a937b6aa739 (qapi: Reformat doc comments
> to conform to current conventions).

Sure.

>
> What do you mean by "if there are migration users that support precopy
> initial data"?

This capability only provides the framework to send precopy initial data 
and ACK that it was loaded in the destination.
To actually benefit from it, migration users (such as VFIO devices, RAM, 
etc.) must implement support for it and use it.

What I wanted to say here is that there is no point in enabling this
capability if there are no migration users that support it.
For example, if you are migrating a VM without VFIO devices, then 
enabling this capability will have no effect.
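
For illustration, a management layer would turn the capability on explicitly (together with its prerequisite, return-path) before starting the migration; a hypothetical QMP invocation, using the capability name proposed in this series, might look like:

```json
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
    { "capability": "return-path",          "state": true },
    { "capability": "precopy-initial-data", "state": true } ] } }
```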

>
> Do I have to ensure the ACK comes by configuring the destination VM in a
> certain way, and if yes, how exactly?

In v2 of the series, which I will send later, you will have to enable this
capability on the destination as well.

>
>>   # Features:
>>   # @unstable: Members @x-colo and @x-ignore-shared are experimental.
>>   #
>> @@ -492,7 +499,7 @@
>>              'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
>>              { 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
>>              'validate-uuid', 'background-snapshot',
>> -           'zero-copy-send', 'postcopy-preempt'] }
>> +           'zero-copy-send', 'postcopy-preempt', 'precopy-initial-data'] }
>>
>>   ##
>>   # @MigrationCapabilityStatus:
>> diff --git a/migration/options.h b/migration/options.h
>> index 3c322867cd..d004b6321e 100644
>> --- a/migration/options.h
>> +++ b/migration/options.h
>> @@ -44,6 +44,7 @@ bool migrate_pause_before_switchover(void);
>>   bool migrate_postcopy_blocktime(void);
>>   bool migrate_postcopy_preempt(void);
>>   bool migrate_postcopy_ram(void);
>> +bool migrate_precopy_initial_data(void);
>>   bool migrate_rdma_pin_all(void);
>>   bool migrate_release_ram(void);
>>   bool migrate_return_path(void);
>> diff --git a/migration/options.c b/migration/options.c
>> index 53b7fc5d5d..c4ef0c60c7 100644
>> --- a/migration/options.c
>> +++ b/migration/options.c
>> @@ -184,6 +184,8 @@ Property migration_properties[] = {
>>       DEFINE_PROP_MIG_CAP("x-zero-copy-send",
>>               MIGRATION_CAPABILITY_ZERO_COPY_SEND),
>>   #endif
>> +    DEFINE_PROP_MIG_CAP("x-precopy-initial-data",
>> +                        MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA),
>>
>>       DEFINE_PROP_END_OF_LIST(),
>>   };
>> @@ -286,6 +288,13 @@ bool migrate_postcopy_ram(void)
>>       return s->capabilities[MIGRATION_CAPABILITY_POSTCOPY_RAM];
>>   }
>>
>> +bool migrate_precopy_initial_data(void)
>> +{
>> +    MigrationState *s = migrate_get_current();
>> +
>> +    return s->capabilities[MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA];
>> +}
>> +
>>   bool migrate_rdma_pin_all(void)
>>   {
>>       MigrationState *s = migrate_get_current();
>> @@ -546,6 +555,17 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
>>           }
>>       }
>>
>> +    if (new_caps[MIGRATION_CAPABILITY_PRECOPY_INITIAL_DATA]) {
>> +        if (!new_caps[MIGRATION_CAPABILITY_RETURN_PATH]) {
>> +            error_setg(errp, "Precopy initial data requires return path");
> Shouldn't we mention this requirement in the docs?

Yes, I will add it.

>
> To make sense of the message, the user needs to make the connection from
> "Precopy initial data" to capability @precopy-initial-data, and from
> "return path" to capability @return-path.  More helpful, I think:
> "capability 'precopy-initial-data' requires capability 'return-path'".

Sure, I will change.

Thanks.

>> +            return false;
>> +        }
>> +
>> +        /* Disable this capability until it's implemented */
>> +        error_setg(errp, "Precopy initial data is not implemented yet");
>> +        return false;
>> +    }
>> +
>>       return true;
>>   }



* Re: [PATCH 1/8] migration: Add precopy initial data capability
  2023-05-17 10:16     ` Avihai Horon
@ 2023-05-17 12:21       ` Markus Armbruster
  2023-05-17 13:23         ` Avihai Horon
  0 siblings, 1 reply; 42+ messages in thread
From: Markus Armbruster @ 2023-05-17 12:21 UTC (permalink / raw)
  To: Avihai Horon
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins

Avihai Horon <avihaih@nvidia.com> writes:

> On 17/05/2023 12:17, Markus Armbruster wrote:
>>
>> Avihai Horon <avihaih@nvidia.com> writes:
>>
>>> Migration downtime estimation is calculated based on bandwidth and
>>> remaining migration data. This assumes that loading of migration data in
>>> the destination takes a negligible amount of time and that downtime
>>> depends only on network speed.
>>>
>>> While this may be true for RAM, it's not necessarily true for other
>>> migration users. For example, loading the data of a VFIO device in the
>>> destination might require the device to allocate resources, prepare
>>> internal data structures and so on. These operations can take a
>>> significant amount of time which can increase migration downtime.
>>>
>>> This patch adds a new capability "precopy initial data" that allows the
>>> source to send initial precopy data and the destination to ACK that this
>>> data has been loaded. Migration will not attempt to stop the source VM
>>> and complete the migration until this ACK is received.
>>>
>>> This will allow migration users to send initial precopy data which can
>>> be used to reduce downtime (e.g., by pre-allocating resources), while
>>> making sure that the source will stop the VM and complete the migration
>>> only after this initial precopy data is sent and loaded in the
>>> destination so it will have full effect.
>>>
>>> This new capability relies on the return path capability to communicate
>>> from the destination back to the source.
>>>
>>> The actual implementation of the capability will be added in the
>>> following patches.
>>>
>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> ---
>>>   qapi/migration.json |  9 ++++++++-
>>>   migration/options.h |  1 +
>>>   migration/options.c | 20 ++++++++++++++++++++
>>>   3 files changed, 29 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>> index 82000adce4..d496148386 100644
>>> --- a/qapi/migration.json
>>> +++ b/qapi/migration.json
>>> @@ -478,6 +478,13 @@
>>>   #                    should not affect the correctness of postcopy migration.
>>>   #                    (since 7.1)
>>>   #
>>> +# @precopy-initial-data: If enabled, migration will not attempt to stop source
>>> +#                        VM and complete the migration until an ACK is received
>>> +#                        from the destination that initial precopy data has
>>> +#                        been loaded. This can improve downtime if there are
>>> +#                        migration users that support precopy initial data.
>>> +#                        (since 8.1)
>>> +#
>> Please format like
>>
>>     # @precopy-initial-data: If enabled, migration will not attempt to
>>     #     stop source VM and complete the migration until an ACK is
>>     #     received from the destination that initial precopy data has been
>>     #     loaded.  This can improve downtime if there are migration users
>>     #     that support precopy initial data.  (since 8.1)
>>
>> to blend in with recent commit a937b6aa739 (qapi: Reformat doc comments
>> to conform to current conventions).
>
> Sure.
>
>>
>> What do you mean by "if there are migration users that support precopy
>> initial data"?
>
> This capability only provides the framework to send precopy initial data and ACK that it was loaded in the destination.
> To actually benefit from it, migration users (such as VFIO devices, RAM, etc.) must implement support for it and use it.
>
> What I wanted to say here is that there is no point in enabling this capability if there are no migration users that support it.
> For example, if you are migrating a VM without VFIO devices, then enabling this capability will have no effect.

I see.

Which "migration users" support it now?

Which could support it in the future?

Is the "initial precopy data" feature described in more detail anywhere?

>> Do I have to ensure the ACK comes by configuring the destination VM in a
>> certain way, and if yes, how exactly?
>
> In v2 of the series that I will send later you will have to enable this capability also in the destination.

What happens when you enable it on the source and not on the
destination?

[...]




* Re: [PATCH 1/8] migration: Add precopy initial data capability
  2023-05-17 12:21       ` Markus Armbruster
@ 2023-05-17 13:23         ` Avihai Horon
  0 siblings, 0 replies; 42+ messages in thread
From: Avihai Horon @ 2023-05-17 13:23 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
	Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake, Thomas Huth,
	Laurent Vivier, Paolo Bonzini, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Kirti Wankhede, Tarun Gupta, Joao Martins


On 17/05/2023 15:21, Markus Armbruster wrote:
>
> Avihai Horon <avihaih@nvidia.com> writes:
>
>> On 17/05/2023 12:17, Markus Armbruster wrote:
>>>
>>> Avihai Horon <avihaih@nvidia.com> writes:
>>>
>>>> Migration downtime estimation is calculated based on bandwidth and
>>>> remaining migration data. This assumes that loading of migration data in
>>>> the destination takes a negligible amount of time and that downtime
>>>> depends only on network speed.
>>>>
>>>> While this may be true for RAM, it's not necessarily true for other
>>>> migration users. For example, loading the data of a VFIO device in the
>>>> destination might require the device to allocate resources, prepare
>>>> internal data structures and so on. These operations can take a
>>>> significant amount of time which can increase migration downtime.
>>>>
>>>> This patch adds a new capability "precopy initial data" that allows the
>>>> source to send initial precopy data and the destination to ACK that this
>>>> data has been loaded. Migration will not attempt to stop the source VM
>>>> and complete the migration until this ACK is received.
>>>>
>>>> This will allow migration users to send initial precopy data which can
>>>> be used to reduce downtime (e.g., by pre-allocating resources), while
>>>> making sure that the source will stop the VM and complete the migration
>>>> only after this initial precopy data is sent and loaded in the
>>>> destination so it will have full effect.
>>>>
>>>> This new capability relies on the return path capability to communicate
>>>> from the destination back to the source.
>>>>
>>>> The actual implementation of the capability will be added in the
>>>> following patches.
>>>>
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>> ---
>>>>    qapi/migration.json |  9 ++++++++-
>>>>    migration/options.h |  1 +
>>>>    migration/options.c | 20 ++++++++++++++++++++
>>>>    3 files changed, 29 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/qapi/migration.json b/qapi/migration.json
>>>> index 82000adce4..d496148386 100644
>>>> --- a/qapi/migration.json
>>>> +++ b/qapi/migration.json
>>>> @@ -478,6 +478,13 @@
>>>>    #                    should not affect the correctness of postcopy migration.
>>>>    #                    (since 7.1)
>>>>    #
>>>> +# @precopy-initial-data: If enabled, migration will not attempt to stop source
>>>> +#                        VM and complete the migration until an ACK is received
>>>> +#                        from the destination that initial precopy data has
>>>> +#                        been loaded. This can improve downtime if there are
>>>> +#                        migration users that support precopy initial data.
>>>> +#                        (since 8.1)
>>>> +#
>>> Please format like
>>>
>>>      # @precopy-initial-data: If enabled, migration will not attempt to
>>>      #     stop source VM and complete the migration until an ACK is
>>>      #     received from the destination that initial precopy data has been
>>>      #     loaded.  This can improve downtime if there are migration users
>>>      #     that support precopy initial data.  (since 8.1)
>>>
>>> to blend in with recent commit a937b6aa739 (qapi: Reformat doc comments
>>> to conform to current conventions).
>> Sure.
>>
>>> What do you mean by "if there are migration users that support precopy
>>> initial data"?
>> This capability only provides the framework to send precopy initial data and ACK that it was loaded in the destination.
>> To actually benefit from it, migration users (such as VFIO devices, RAM, etc.) must implement support for it and use it.
>>
>> What I wanted to say here is that there is no point in enabling this capability if there are no migration users that support it.
>> For example, if you are migrating a VM without VFIO devices, then enabling this capability will have no effect.
> I see.
>
> Which "migration users" support it now?

Currently, only VFIO devices.

>
> Which could support it in the future?

Any device that uses "iterative migration" [1], such as RAM, block, or 
some new type of device in the future.

[1] 
https://www.qemu.org/docs/master/devel/migration.html#iterative-device-migration

>
> Is the "initial precopy data" feature described in more detail anywhere?

The cover letter of this series contains more info on the background and
motivation behind it, as well as the workflow of the feature.

>
>>> Do I have to ensure the ACK comes by configuring the destination VM in a
>>> certain way, and if yes, how exactly?
>> In v2 of the series that I will send later you will have to enable this capability also in the destination.
> What happens when you enable it on the source and not on the
> destination?

Migration may fail.
For example, this is what happens if I migrate a VM with a VFIO device 
and enable this capability only on the source.

Thanks.




end of thread, other threads:[~2023-05-17 13:24 UTC | newest]

Thread overview: 42+ messages
2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
2023-05-01 14:01 ` [PATCH 1/8] migration: Add precopy initial data capability Avihai Horon
2023-05-10  8:24   ` Juan Quintela
2023-05-17  9:17   ` Markus Armbruster
2023-05-17 10:16     ` Avihai Horon
2023-05-17 12:21       ` Markus Armbruster
2023-05-17 13:23         ` Avihai Horon
2023-05-01 14:01 ` [PATCH 2/8] migration: Add precopy initial data handshake Avihai Horon
2023-05-02 22:54   ` Peter Xu
2023-05-03 15:31     ` Avihai Horon
2023-05-10  8:40   ` Juan Quintela
2023-05-10 15:32     ` Avihai Horon
2023-05-14 16:42   ` Cédric Le Goater
2023-05-15  7:56     ` Avihai Horon
2023-05-01 14:01 ` [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality Avihai Horon
2023-05-02 22:56   ` Peter Xu
2023-05-03 15:36     ` Avihai Horon
2023-05-10  8:54   ` Juan Quintela
2023-05-10 15:52     ` Avihai Horon
2023-05-10 15:59       ` Juan Quintela
2023-05-01 14:01 ` [PATCH 4/8] migration: Enable precopy initial data capability Avihai Horon
2023-05-10  8:55   ` Juan Quintela
2023-05-01 14:01 ` [PATCH 5/8] tests: Add migration precopy initial data capability test Avihai Horon
2023-05-10  8:55   ` Juan Quintela
2023-05-01 14:01 ` [PATCH 6/8] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
2023-05-10  9:00   ` Juan Quintela
2023-05-01 14:01 ` [PATCH 7/8] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
2023-05-01 14:01 ` [PATCH 8/8] vfio/migration: Add support for precopy initial data capability Avihai Horon
2023-05-02 22:49 ` [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Peter Xu
2023-05-03 15:22   ` Avihai Horon
2023-05-03 15:49     ` Peter Xu
2023-05-04 10:18       ` Avihai Horon
2023-05-04 15:50         ` Peter Xu
2023-05-07 12:54           ` Avihai Horon
2023-05-08  0:49             ` Peter Xu
2023-05-08 11:11               ` Avihai Horon
2023-05-10  9:12     ` Juan Quintela
2023-05-10 16:01       ` Avihai Horon
2023-05-10 16:41         ` Juan Quintela
2023-05-11 11:31           ` Avihai Horon
2023-05-11 13:09             ` Juan Quintela
2023-05-11 15:08               ` Avihai Horon
