* [PATCH v5 1/9] migration: Add switchover ack capability
2023-05-30 14:48 [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
@ 2023-05-30 14:48 ` Avihai Horon
2023-06-15 12:38 ` YangHang Liu
2023-05-30 14:48 ` [PATCH v5 2/9] migration: Implement switchover ack logic Avihai Horon
` (8 subsequent siblings)
9 siblings, 1 reply; 25+ messages in thread
From: Avihai Horon @ 2023-05-30 14:48 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Migration downtime estimation is calculated based on bandwidth and
remaining migration data. This assumes that loading of migration data in
the destination takes a negligible amount of time and that downtime
depends only on network speed.
While this may be true for RAM, it's not necessarily true for other
migrated devices. For example, loading the data of a VFIO device in the
destination might require the device to allocate resources, prepare
internal data structures and so on. These operations can take a
significant amount of time, which can increase migration downtime.
This patch adds a new capability "switchover ack" that prevents the
source from stopping the VM and completing the migration until an ACK
is received from the destination that it's OK to do so.
This can be used by migrated devices in various ways to reduce downtime.
For example, a device can send initial precopy metadata to pre-allocate
resources in the destination and use this capability to make sure that
the pre-allocation is completed before the source VM is stopped, so it
will have full effect.
This new capability relies on the return path capability to communicate
from the destination back to the source.
The actual implementation of the capability will be added in the
following patches.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
---
qapi/migration.json | 12 +++++++++++-
migration/options.h | 1 +
migration/options.c | 21 +++++++++++++++++++++
3 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/qapi/migration.json b/qapi/migration.json
index 179af0c4d8..061ea512e0 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -487,6 +487,16 @@
# and should not affect the correctness of postcopy migration.
# (since 7.1)
#
+# @switchover-ack: If enabled, migration will not stop the source VM
+# and complete the migration until an ACK is received from the
+# destination that it's OK to do so. Exactly when this ACK is
+# sent depends on the migrated devices that use this feature.
+# For example, a device can use it to make sure some of its data
+# is sent and loaded in the destination before doing switchover.
+# This can reduce downtime if devices that support this capability
+# are present. 'return-path' capability must be enabled to use
+# it. (since 8.1)
+#
# Features:
#
# @unstable: Members @x-colo and @x-ignore-shared are experimental.
@@ -502,7 +512,7 @@
'dirty-bitmaps', 'postcopy-blocktime', 'late-block-activate',
{ 'name': 'x-ignore-shared', 'features': [ 'unstable' ] },
'validate-uuid', 'background-snapshot',
- 'zero-copy-send', 'postcopy-preempt'] }
+ 'zero-copy-send', 'postcopy-preempt', 'switchover-ack'] }
##
# @MigrationCapabilityStatus:
diff --git a/migration/options.h b/migration/options.h
index 45991af3c2..9aaf363322 100644
--- a/migration/options.h
+++ b/migration/options.h
@@ -40,6 +40,7 @@ bool migrate_postcopy_ram(void);
bool migrate_rdma_pin_all(void);
bool migrate_release_ram(void);
bool migrate_return_path(void);
+bool migrate_switchover_ack(void);
bool migrate_validate_uuid(void);
bool migrate_xbzrle(void);
bool migrate_zero_blocks(void);
diff --git a/migration/options.c b/migration/options.c
index b62ab30cd5..16007afca6 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -185,6 +185,8 @@ Property migration_properties[] = {
DEFINE_PROP_MIG_CAP("x-zero-copy-send",
MIGRATION_CAPABILITY_ZERO_COPY_SEND),
#endif
+ DEFINE_PROP_MIG_CAP("x-switchover-ack",
+ MIGRATION_CAPABILITY_SWITCHOVER_ACK),
DEFINE_PROP_END_OF_LIST(),
};
@@ -308,6 +310,13 @@ bool migrate_return_path(void)
return s->capabilities[MIGRATION_CAPABILITY_RETURN_PATH];
}
+bool migrate_switchover_ack(void)
+{
+ MigrationState *s = migrate_get_current();
+
+ return s->capabilities[MIGRATION_CAPABILITY_SWITCHOVER_ACK];
+}
+
bool migrate_validate_uuid(void)
{
MigrationState *s = migrate_get_current();
@@ -547,6 +556,18 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
}
}
+ if (new_caps[MIGRATION_CAPABILITY_SWITCHOVER_ACK]) {
+ if (!new_caps[MIGRATION_CAPABILITY_RETURN_PATH]) {
+ error_setg(errp, "Capability 'switchover-ack' requires capability "
+ "'return-path'");
+ return false;
+ }
+
+ /* Disable this capability until it's implemented */
+ error_setg(errp, "'switchover-ack' is not implemented yet");
+ return false;
+ }
+
return true;
}
--
2.26.3
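For reference, a hedged usage sketch (not part of the patch): with this series applied, both capabilities would be enabled via QMP on the source and the destination before starting the migration. `migrate-set-capabilities` is the pre-existing QMP command; only the `switchover-ack` capability name comes from this patch:

```json
{ "execute": "migrate-set-capabilities",
  "arguments": {
    "capabilities": [
      { "capability": "return-path", "state": true },
      { "capability": "switchover-ack", "state": true }
    ]
  }
}
```

Per the capability check in migrate_caps_check() above, enabling 'switchover-ack' without 'return-path' is rejected.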
* Re: [PATCH v5 1/9] migration: Add switchover ack capability
2023-05-30 14:48 ` [PATCH v5 1/9] migration: Add switchover ack capability Avihai Horon
@ 2023-06-15 12:38 ` YangHang Liu
2023-06-15 13:49 ` Cédric Le Goater
0 siblings, 1 reply; 25+ messages in thread
From: YangHang Liu @ 2023-06-15 12:38 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
Tested in the following two scenarios:
[1] Test scenario: Both the source VM and the target VM (in listening
mode) have the return-path and switchover-ack capabilities enabled.
Test result: The VFIO migration completed successfully.
[2] Test scenario: The source VM has the return-path and switchover-ack
capabilities enabled while the target VM (in listening mode) does not.
Test result: The VFIO migration fails.
The detailed error thrown by qemu-kvm when VFIO migration fails:
Target VM:
0000:17:00.2: Received INIT_DATA_SENT but switchover ack is not used
error while loading state section id 81(0000:00:02.4:00.0/vfio)
load of migration failed: Invalid argument
Source VM:
failed to save SaveStateEntry with id(name): 2(ram): -5
Unable to write to socket: Connection reset by peer
Unable to write to socket: Connection reset by peer
Tested-by: YangHang Liu <yanghliu@redhat.com>
* Re: [PATCH v5 1/9] migration: Add switchover ack capability
2023-06-15 12:38 ` YangHang Liu
@ 2023-06-15 13:49 ` Cédric Le Goater
2023-06-19 9:37 ` Avihai Horon
0 siblings, 1 reply; 25+ messages in thread
From: Cédric Le Goater @ 2023-06-15 13:49 UTC (permalink / raw)
To: YangHang Liu, Avihai Horon
Cc: qemu-devel, Alex Williamson, Eduardo Habkost, Marcel Apfelbaum,
Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On 6/15/23 14:38, YangHang Liu wrote:
> Test in the following two scenarios:
>
> [1] Test scenario: Both source VM and target VM (in listening mode)
> have enabled return-path and switchover-ack capability:
>
> Test result : The VFIO migration completed successfully
>
> [2] Test scenario : The source VM has enabled return-path and
> switchover-ack capability while the target VM (in listening mode) not
>
> Test result : The VFIO migration fails
>
> The detailed error thrown by qemu-kvm when VFIO migration fails:
> Target VM:
> 0000:17:00.2: Received INIT_DATA_SENT but switchover ack is not used
> error while loading state section id 81(0000:00:02.4:00.0/vfio)
> load of migration failed: Invalid argument
> Source VM:
> failed to save SaveStateEntry with id(name): 2(ram): -5
> Unable to write to socket: Connection reset by peer
> Unable to write to socket: Connection reset by peer
>
> Tested-by: YangHang Liu <yanghliu@redhat.com>
Some more info,
Tests were performed with a mainline Linux kernel and a mainline QEMU
including this series - patch 8.
The amount of precopy data for a CX-7 VF is not very large. Any idea how
to generate some more initial state with such devices?
I suppose pre-copy will be more important with vGPUs.
YangHang,
Could you please reply with a Tested-by on the cover letter, so that the
whole series is tagged and not only patch 1.
Thanks,
C.
* Re: [PATCH v5 1/9] migration: Add switchover ack capability
2023-06-15 13:49 ` Cédric Le Goater
@ 2023-06-19 9:37 ` Avihai Horon
0 siblings, 0 replies; 25+ messages in thread
From: Avihai Horon @ 2023-06-19 9:37 UTC (permalink / raw)
To: Cédric Le Goater, YangHang Liu
Cc: qemu-devel, Alex Williamson, Eduardo Habkost, Marcel Apfelbaum,
Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On 15/06/2023 16:49, Cédric Le Goater wrote:
> Some more info,
>
> Tests were performed with a mainline Linux and a mainline QEMU including
> this series - patch8.
>
> The amount of precopy data for a CX-7 VF is not very large. Any idea how
> to generate some more initial state with such devices ?
>
> I suppose pre-copy will be more important with vGPUs.
In CX-7 the precopy data is not expected to be very large, because it's
mainly used to pre-allocate resources in the destination.
However, precopy and switchover-ack are very important for CX-7, because
they allow the resource pre-allocation in the destination to be done
while the source VM is still running, which reduces downtime
significantly (see the example I gave in the cover letter).
Thanks.
* [PATCH v5 2/9] migration: Implement switchover ack logic
2023-05-30 14:48 [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
2023-05-30 14:48 ` [PATCH v5 1/9] migration: Add switchover ack capability Avihai Horon
@ 2023-05-30 14:48 ` Avihai Horon
2023-06-05 22:06 ` Alex Williamson
2023-05-30 14:48 ` [PATCH v5 3/9] migration: Enable switchover ack capability Avihai Horon
` (7 subsequent siblings)
9 siblings, 1 reply; 25+ messages in thread
From: Avihai Horon @ 2023-05-30 14:48 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Implement switchover ack logic. This prevents the source from stopping
the VM and completing the migration until an ACK is received from the
destination that it's OK to do so.
To achieve this, a new SaveVMHandlers handler switchover_ack_needed()
and a new return path message MIG_RP_MSG_SWITCHOVER_ACK are added.
The switchover_ack_needed() handler is called during migration setup in
the destination to check if switchover ack is used by the migrated
device.
When switchover is approved by all migrated devices in the destination
that support this capability, the MIG_RP_MSG_SWITCHOVER_ACK return path
message is sent to the source to notify it that it's OK to do
switchover.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
include/migration/register.h | 2 ++
migration/migration.h | 14 ++++++++++
migration/savevm.h | 1 +
migration/migration.c | 32 +++++++++++++++++++--
migration/savevm.c | 54 ++++++++++++++++++++++++++++++++++++
migration/trace-events | 3 ++
6 files changed, 104 insertions(+), 2 deletions(-)
diff --git a/include/migration/register.h b/include/migration/register.h
index a8dfd8fefd..90914f32f5 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -71,6 +71,8 @@ typedef struct SaveVMHandlers {
int (*load_cleanup)(void *opaque);
/* Called when postcopy migration wants to resume from failure */
int (*resume_prepare)(MigrationState *s, void *opaque);
+ /* Checks if switchover ack should be used. Called only in dest */
+ bool (*switchover_ack_needed)(void *opaque);
} SaveVMHandlers;
int register_savevm_live(const char *idstr,
diff --git a/migration/migration.h b/migration/migration.h
index 48a46123a0..1e92ba7b1d 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -209,6 +209,13 @@ struct MigrationIncomingState {
* contains valid information.
*/
QemuMutex page_request_mutex;
+
+ /*
+ * Number of devices that have yet to approve switchover. When this reaches
+ * zero an ACK that it's OK to do switchover is sent to the source. No lock
+ * is needed as this field is updated serially.
+ */
+ unsigned int switchover_ack_pending_num;
};
MigrationIncomingState *migration_incoming_get_current(void);
@@ -437,6 +444,12 @@ struct MigrationState {
/* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
JSONWriter *vmdesc;
+
+ /*
+ * Indicates whether an ACK from the destination that it's OK to do
+ * switchover has been received.
+ */
+ bool switchover_acked;
};
void migrate_set_state(int *state, int old_state, int new_state);
@@ -477,6 +490,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
char *block_name);
void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
+int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
void dirty_bitmap_mig_before_vm_start(void);
void dirty_bitmap_mig_cancel_outgoing(void);
diff --git a/migration/savevm.h b/migration/savevm.h
index fb636735f0..e894bbc143 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -65,6 +65,7 @@ int qemu_loadvm_state(QEMUFile *f);
void qemu_loadvm_state_cleanup(void);
int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
int qemu_load_device_state(QEMUFile *f);
+int qemu_loadvm_approve_switchover(void);
int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
bool in_postcopy, bool inactivate_disks);
diff --git a/migration/migration.c b/migration/migration.c
index 5de7f734b9..c73261118c 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -78,6 +78,7 @@ enum mig_rp_message_type {
MIG_RP_MSG_REQ_PAGES, /* data (start: be64, len: be32) */
MIG_RP_MSG_RECV_BITMAP, /* send recved_bitmap back to source */
MIG_RP_MSG_RESUME_ACK, /* tell source that we are ready to resume */
+ MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
MIG_RP_MSG_MAX
};
@@ -760,6 +761,11 @@ bool migration_has_all_channels(void)
return true;
}
+int migrate_send_rp_switchover_ack(MigrationIncomingState *mis)
+{
+ return migrate_send_rp_message(mis, MIG_RP_MSG_SWITCHOVER_ACK, 0, NULL);
+}
+
/*
* Send a 'SHUT' message on the return channel with the given value
* to indicate that we've finished with the RP. Non-0 value indicates
@@ -1405,6 +1411,7 @@ void migrate_init(MigrationState *s)
s->vm_was_running = false;
s->iteration_initial_bytes = 0;
s->threshold_size = 0;
+ s->switchover_acked = false;
}
int migrate_add_blocker_internal(Error *reason, Error **errp)
@@ -1721,6 +1728,7 @@ static struct rp_cmd_args {
[MIG_RP_MSG_REQ_PAGES_ID] = { .len = -1, .name = "REQ_PAGES_ID" },
[MIG_RP_MSG_RECV_BITMAP] = { .len = -1, .name = "RECV_BITMAP" },
[MIG_RP_MSG_RESUME_ACK] = { .len = 4, .name = "RESUME_ACK" },
+ [MIG_RP_MSG_SWITCHOVER_ACK] = { .len = 0, .name = "SWITCHOVER_ACK" },
[MIG_RP_MSG_MAX] = { .len = -1, .name = "MAX" },
};
@@ -1959,6 +1967,11 @@ retry:
}
break;
+ case MIG_RP_MSG_SWITCHOVER_ACK:
+ ms->switchover_acked = true;
+ trace_source_return_path_thread_switchover_acked();
+ break;
+
default:
break;
}
@@ -2700,6 +2713,20 @@ static void migration_update_counters(MigrationState *s,
bandwidth, s->threshold_size);
}
+static bool migration_can_switchover(MigrationState *s)
+{
+ if (!migrate_switchover_ack()) {
+ return true;
+ }
+
+ /* No reason to wait for switchover ACK if VM is stopped */
+ if (!runstate_is_running()) {
+ return true;
+ }
+
+ return s->switchover_acked;
+}
+
/* Migration thread iteration status */
typedef enum {
MIG_ITERATE_RESUME, /* Resume current iteration */
@@ -2715,6 +2742,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
{
uint64_t must_precopy, can_postcopy;
bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
+ bool can_switchover = migration_can_switchover(s);
qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
uint64_t pending_size = must_precopy + can_postcopy;
@@ -2727,14 +2755,14 @@ static MigIterateState migration_iteration_run(MigrationState *s)
trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
}
- if (!pending_size || pending_size < s->threshold_size) {
+ if ((!pending_size || pending_size < s->threshold_size) && can_switchover) {
trace_migration_thread_low_pending(pending_size);
migration_completion(s);
return MIG_ITERATE_BREAK;
}
/* Still a significant amount to transfer */
- if (!in_postcopy && must_precopy <= s->threshold_size &&
+ if (!in_postcopy && must_precopy <= s->threshold_size && can_switchover &&
qatomic_read(&s->start_postcopy)) {
if (postcopy_start(s)) {
error_report("%s: postcopy failed to start", __func__);
diff --git a/migration/savevm.c b/migration/savevm.c
index 03795ce8dc..285b814726 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2360,6 +2360,21 @@ static int loadvm_process_command(QEMUFile *f)
error_report("CMD_OPEN_RETURN_PATH failed");
return -1;
}
+
+ /*
+ * Switchover ack is enabled but no device uses it, so send an ACK to
+ * source that it's OK to switchover. Do it here, after return path has
+ * been created.
+ */
+ if (migrate_switchover_ack() && !mis->switchover_ack_pending_num) {
+ int ret = migrate_send_rp_switchover_ack(mis);
+ if (ret) {
+ error_report(
+ "Could not send switchover ack RP MSG, err %d (%s)", ret,
+ strerror(-ret));
+ return ret;
+ }
+ }
break;
case MIG_CMD_PING:
@@ -2586,6 +2601,23 @@ static int qemu_loadvm_state_header(QEMUFile *f)
return 0;
}
+static void qemu_loadvm_state_switchover_ack_needed(MigrationIncomingState *mis)
+{
+ SaveStateEntry *se;
+
+ QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+ if (!se->ops || !se->ops->switchover_ack_needed) {
+ continue;
+ }
+
+ if (se->ops->switchover_ack_needed(se->opaque)) {
+ mis->switchover_ack_pending_num++;
+ }
+ }
+
+ trace_loadvm_state_switchover_ack_needed(mis->switchover_ack_pending_num);
+}
+
static int qemu_loadvm_state_setup(QEMUFile *f)
{
SaveStateEntry *se;
@@ -2789,6 +2821,10 @@ int qemu_loadvm_state(QEMUFile *f)
return -EINVAL;
}
+ if (migrate_switchover_ack()) {
+ qemu_loadvm_state_switchover_ack_needed(mis);
+ }
+
cpu_synchronize_all_pre_loadvm();
ret = qemu_loadvm_state_main(f, mis);
@@ -2862,6 +2898,24 @@ int qemu_load_device_state(QEMUFile *f)
return 0;
}
+int qemu_loadvm_approve_switchover(void)
+{
+ MigrationIncomingState *mis = migration_incoming_get_current();
+
+ if (!mis->switchover_ack_pending_num) {
+ return -EINVAL;
+ }
+
+ mis->switchover_ack_pending_num--;
+ trace_loadvm_approve_switchover(mis->switchover_ack_pending_num);
+
+ if (mis->switchover_ack_pending_num) {
+ return 0;
+ }
+
+ return migrate_send_rp_switchover_ack(mis);
+}
+
bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
bool has_devices, strList *devices, Error **errp)
{
diff --git a/migration/trace-events b/migration/trace-events
index cdaef7a1ea..5259c1044b 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
qemu_loadvm_state_post_main(int ret) "%d"
qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
qemu_savevm_send_packaged(void) ""
+loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
loadvm_state_setup(void) ""
loadvm_state_cleanup(void) ""
loadvm_handle_cmd_packaged(unsigned int length) "%u"
@@ -23,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
loadvm_process_command_ping(uint32_t val) "0x%x"
+loadvm_approve_switchover(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
postcopy_ram_listen_thread_exit(void) ""
postcopy_ram_listen_thread_start(void) ""
qemu_savevm_send_postcopy_advise(void) ""
@@ -180,6 +182,7 @@ source_return_path_thread_loop_top(void) ""
source_return_path_thread_pong(uint32_t val) "0x%x"
source_return_path_thread_shut(uint32_t val) "0x%x"
source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
+source_return_path_thread_switchover_acked(void) ""
migration_thread_low_pending(uint64_t pending) "%" PRIu64
migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
--
2.26.3
* Re: [PATCH v5 2/9] migration: Implement switchover ack logic
2023-05-30 14:48 ` [PATCH v5 2/9] migration: Implement switchover ack logic Avihai Horon
@ 2023-06-05 22:06 ` Alex Williamson
2023-06-06 12:12 ` Avihai Horon
0 siblings, 1 reply; 25+ messages in thread
From: Alex Williamson @ 2023-06-05 22:06 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On Tue, 30 May 2023 17:48:14 +0300
Avihai Horon <avihaih@nvidia.com> wrote:
> Implement switchover ack logic. This prevents the source from stopping
> the VM and completing the migration until an ACK is received from the
> destination that it's OK to do so.
>
> To achieve this, a new SaveVMHandlers handler switchover_ack_needed()
> and a new return path message MIG_RP_MSG_SWITCHOVER_ACK are added.
>
> The switchover_ack_needed() handler is called during migration setup in
> the destination to check if switchover ack is used by the migrated
> device.
>
> When switchover is approved by all migrated devices in the destination
> that support this capability, the MIG_RP_MSG_SWITCHOVER_ACK return path
> message is sent to the source to notify it that it's OK to do
> switchover.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> ---
> include/migration/register.h | 2 ++
> migration/migration.h | 14 ++++++++++
> migration/savevm.h | 1 +
> migration/migration.c | 32 +++++++++++++++++++--
> migration/savevm.c | 54 ++++++++++++++++++++++++++++++++++++
> migration/trace-events | 3 ++
> 6 files changed, 104 insertions(+), 2 deletions(-)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index a8dfd8fefd..90914f32f5 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -71,6 +71,8 @@ typedef struct SaveVMHandlers {
> int (*load_cleanup)(void *opaque);
> /* Called when postcopy migration wants to resume from failure */
> int (*resume_prepare)(MigrationState *s, void *opaque);
> + /* Checks if switchover ack should be used. Called only in dest */
> + bool (*switchover_ack_needed)(void *opaque);
> } SaveVMHandlers;
>
> int register_savevm_live(const char *idstr,
> diff --git a/migration/migration.h b/migration/migration.h
> index 48a46123a0..1e92ba7b1d 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -209,6 +209,13 @@ struct MigrationIncomingState {
> * contains valid information.
> */
> QemuMutex page_request_mutex;
> +
> + /*
> + * Number of devices that have yet to approve switchover. When this reaches
> + * zero an ACK that it's OK to do switchover is sent to the source. No lock
> + * is needed as this field is updated serially.
> + */
> + unsigned int switchover_ack_pending_num;
> };
>
> MigrationIncomingState *migration_incoming_get_current(void);
> @@ -437,6 +444,12 @@ struct MigrationState {
>
> /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
> JSONWriter *vmdesc;
> +
> + /*
> + * Indicates whether an ACK from the destination that it's OK to do
> + * switchover has been received.
> + */
> + bool switchover_acked;
> };
>
> void migrate_set_state(int *state, int old_state, int new_state);
> @@ -477,6 +490,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
> void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
> char *block_name);
> void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
> +int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
>
> void dirty_bitmap_mig_before_vm_start(void);
> void dirty_bitmap_mig_cancel_outgoing(void);
> diff --git a/migration/savevm.h b/migration/savevm.h
> index fb636735f0..e894bbc143 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -65,6 +65,7 @@ int qemu_loadvm_state(QEMUFile *f);
> void qemu_loadvm_state_cleanup(void);
> int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
> int qemu_load_device_state(QEMUFile *f);
> +int qemu_loadvm_approve_switchover(void);
> int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
> bool in_postcopy, bool inactivate_disks);
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 5de7f734b9..c73261118c 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -78,6 +78,7 @@ enum mig_rp_message_type {
> MIG_RP_MSG_REQ_PAGES, /* data (start: be64, len: be32) */
> MIG_RP_MSG_RECV_BITMAP, /* send recved_bitmap back to source */
> MIG_RP_MSG_RESUME_ACK, /* tell source that we are ready to resume */
> + MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
>
> MIG_RP_MSG_MAX
> };
> @@ -760,6 +761,11 @@ bool migration_has_all_channels(void)
> return true;
> }
>
> +int migrate_send_rp_switchover_ack(MigrationIncomingState *mis)
> +{
> + return migrate_send_rp_message(mis, MIG_RP_MSG_SWITCHOVER_ACK, 0, NULL);
> +}
> +
> /*
> * Send a 'SHUT' message on the return channel with the given value
> * to indicate that we've finished with the RP. Non-0 value indicates
> @@ -1405,6 +1411,7 @@ void migrate_init(MigrationState *s)
> s->vm_was_running = false;
> s->iteration_initial_bytes = 0;
> s->threshold_size = 0;
> + s->switchover_acked = false;
> }
>
> int migrate_add_blocker_internal(Error *reason, Error **errp)
> @@ -1721,6 +1728,7 @@ static struct rp_cmd_args {
> [MIG_RP_MSG_REQ_PAGES_ID] = { .len = -1, .name = "REQ_PAGES_ID" },
> [MIG_RP_MSG_RECV_BITMAP] = { .len = -1, .name = "RECV_BITMAP" },
> [MIG_RP_MSG_RESUME_ACK] = { .len = 4, .name = "RESUME_ACK" },
> + [MIG_RP_MSG_SWITCHOVER_ACK] = { .len = 0, .name = "SWITCHOVER_ACK" },
> [MIG_RP_MSG_MAX] = { .len = -1, .name = "MAX" },
> };
>
> @@ -1959,6 +1967,11 @@ retry:
> }
> break;
>
> + case MIG_RP_MSG_SWITCHOVER_ACK:
> + ms->switchover_acked = true;
> + trace_source_return_path_thread_switchover_acked();
> + break;
> +
> default:
> break;
> }
> @@ -2700,6 +2713,20 @@ static void migration_update_counters(MigrationState *s,
> bandwidth, s->threshold_size);
> }
>
> +static bool migration_can_switchover(MigrationState *s)
> +{
> + if (!migrate_switchover_ack()) {
> + return true;
> + }
> +
> + /* No reason to wait for switchover ACK if VM is stopped */
> + if (!runstate_is_running()) {
> + return true;
> + }
Is it possible for QEMU to force the migration to continue regardless
of receiving an ack from the target and is this the check that would
allow that?
It seems that we don't know the downtime allowed for the VM in any of
this, nor do we know how much time the target device will require to
generate an ack, but we could certainly have conditions where the
priority is moving the VM from the source host regardless of the
resulting downtime.
Also does the return path requirement preclude offline migration or
does the above again take care of that if we pause the VM for an
offline migration (ex. save to and restore from file)? Thanks,
Alex
> +
> + return s->switchover_acked;
> +}
> +
> /* Migration thread iteration status */
> typedef enum {
> MIG_ITERATE_RESUME, /* Resume current iteration */
> @@ -2715,6 +2742,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> {
> uint64_t must_precopy, can_postcopy;
> bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
> + bool can_switchover = migration_can_switchover(s);
>
> qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
> uint64_t pending_size = must_precopy + can_postcopy;
> @@ -2727,14 +2755,14 @@ static MigIterateState migration_iteration_run(MigrationState *s)
> trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
> }
>
> - if (!pending_size || pending_size < s->threshold_size) {
> + if ((!pending_size || pending_size < s->threshold_size) && can_switchover) {
> trace_migration_thread_low_pending(pending_size);
> migration_completion(s);
> return MIG_ITERATE_BREAK;
> }
>
> /* Still a significant amount to transfer */
> - if (!in_postcopy && must_precopy <= s->threshold_size &&
> + if (!in_postcopy && must_precopy <= s->threshold_size && can_switchover &&
> qatomic_read(&s->start_postcopy)) {
> if (postcopy_start(s)) {
> error_report("%s: postcopy failed to start", __func__);
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 03795ce8dc..285b814726 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2360,6 +2360,21 @@ static int loadvm_process_command(QEMUFile *f)
> error_report("CMD_OPEN_RETURN_PATH failed");
> return -1;
> }
> +
> + /*
> + * Switchover ack is enabled but no device uses it, so send an ACK to
> + * source that it's OK to switchover. Do it here, after return path has
> + * been created.
> + */
> + if (migrate_switchover_ack() && !mis->switchover_ack_pending_num) {
> + int ret = migrate_send_rp_switchover_ack(mis);
> + if (ret) {
> + error_report(
> + "Could not send switchover ack RP MSG, err %d (%s)", ret,
> + strerror(-ret));
> + return ret;
> + }
> + }
> break;
>
> case MIG_CMD_PING:
> @@ -2586,6 +2601,23 @@ static int qemu_loadvm_state_header(QEMUFile *f)
> return 0;
> }
>
> +static void qemu_loadvm_state_switchover_ack_needed(MigrationIncomingState *mis)
> +{
> + SaveStateEntry *se;
> +
> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
> + if (!se->ops || !se->ops->switchover_ack_needed) {
> + continue;
> + }
> +
> + if (se->ops->switchover_ack_needed(se->opaque)) {
> + mis->switchover_ack_pending_num++;
> + }
> + }
> +
> + trace_loadvm_state_switchover_ack_needed(mis->switchover_ack_pending_num);
> +}
> +
> static int qemu_loadvm_state_setup(QEMUFile *f)
> {
> SaveStateEntry *se;
> @@ -2789,6 +2821,10 @@ int qemu_loadvm_state(QEMUFile *f)
> return -EINVAL;
> }
>
> + if (migrate_switchover_ack()) {
> + qemu_loadvm_state_switchover_ack_needed(mis);
> + }
> +
> cpu_synchronize_all_pre_loadvm();
>
> ret = qemu_loadvm_state_main(f, mis);
> @@ -2862,6 +2898,24 @@ int qemu_load_device_state(QEMUFile *f)
> return 0;
> }
>
> +int qemu_loadvm_approve_switchover(void)
> +{
> + MigrationIncomingState *mis = migration_incoming_get_current();
> +
> + if (!mis->switchover_ack_pending_num) {
> + return -EINVAL;
> + }
> +
> + mis->switchover_ack_pending_num--;
> + trace_loadvm_approve_switchover(mis->switchover_ack_pending_num);
> +
> + if (mis->switchover_ack_pending_num) {
> + return 0;
> + }
> +
> + return migrate_send_rp_switchover_ack(mis);
> +}
> +
> bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
> bool has_devices, strList *devices, Error **errp)
> {
> diff --git a/migration/trace-events b/migration/trace-events
> index cdaef7a1ea..5259c1044b 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
> qemu_loadvm_state_post_main(int ret) "%d"
> qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
> qemu_savevm_send_packaged(void) ""
> +loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> loadvm_state_setup(void) ""
> loadvm_state_cleanup(void) ""
> loadvm_handle_cmd_packaged(unsigned int length) "%u"
> @@ -23,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
> loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
> loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
> loadvm_process_command_ping(uint32_t val) "0x%x"
> +loadvm_approve_switchover(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
> postcopy_ram_listen_thread_exit(void) ""
> postcopy_ram_listen_thread_start(void) ""
> qemu_savevm_send_postcopy_advise(void) ""
> @@ -180,6 +182,7 @@ source_return_path_thread_loop_top(void) ""
> source_return_path_thread_pong(uint32_t val) "0x%x"
> source_return_path_thread_shut(uint32_t val) "0x%x"
> source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
> +source_return_path_thread_switchover_acked(void) ""
> migration_thread_low_pending(uint64_t pending) "%" PRIu64
> migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
> process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
* Re: [PATCH v5 2/9] migration: Implement switchover ack logic
2023-06-05 22:06 ` Alex Williamson
@ 2023-06-06 12:12 ` Avihai Horon
2023-06-08 18:32 ` Alex Williamson
0 siblings, 1 reply; 25+ messages in thread
From: Avihai Horon @ 2023-06-06 12:12 UTC (permalink / raw)
To: Alex Williamson
Cc: qemu-devel, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On 06/06/2023 1:06, Alex Williamson wrote:
>
> On Tue, 30 May 2023 17:48:14 +0300
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> Implement switchover ack logic. This prevents the source from stopping
>> the VM and completing the migration until an ACK is received from the
>> destination that it's OK to do so.
>>
>> To achieve this, a new SaveVMHandlers handler switchover_ack_needed()
>> and a new return path message MIG_RP_MSG_SWITCHOVER_ACK are added.
>>
>> The switchover_ack_needed() handler is called during migration setup in
>> the destination to check if switchover ack is used by the migrated
>> device.
>>
>> When switchover is approved by all migrated devices in the destination
>> that support this capability, the MIG_RP_MSG_SWITCHOVER_ACK return path
>> message is sent to the source to notify it that it's OK to do
>> switchover.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Reviewed-by: Peter Xu <peterx@redhat.com>
>> ---
>> include/migration/register.h | 2 ++
>> migration/migration.h | 14 ++++++++++
>> migration/savevm.h | 1 +
>> migration/migration.c | 32 +++++++++++++++++++--
>> migration/savevm.c | 54 ++++++++++++++++++++++++++++++++++++
>> migration/trace-events | 3 ++
>> 6 files changed, 104 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/migration/register.h b/include/migration/register.h
>> index a8dfd8fefd..90914f32f5 100644
>> --- a/include/migration/register.h
>> +++ b/include/migration/register.h
>> @@ -71,6 +71,8 @@ typedef struct SaveVMHandlers {
>> int (*load_cleanup)(void *opaque);
>> /* Called when postcopy migration wants to resume from failure */
>> int (*resume_prepare)(MigrationState *s, void *opaque);
>> + /* Checks if switchover ack should be used. Called only in dest */
>> + bool (*switchover_ack_needed)(void *opaque);
>> } SaveVMHandlers;
>>
>> int register_savevm_live(const char *idstr,
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 48a46123a0..1e92ba7b1d 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -209,6 +209,13 @@ struct MigrationIncomingState {
>> * contains valid information.
>> */
>> QemuMutex page_request_mutex;
>> +
>> + /*
>> + * Number of devices that have yet to approve switchover. When this reaches
>> + * zero an ACK that it's OK to do switchover is sent to the source. No lock
>> + * is needed as this field is updated serially.
>> + */
>> + unsigned int switchover_ack_pending_num;
>> };
>>
>> MigrationIncomingState *migration_incoming_get_current(void);
>> @@ -437,6 +444,12 @@ struct MigrationState {
>>
>> /* QEMU_VM_VMDESCRIPTION content filled for all non-iterable devices. */
>> JSONWriter *vmdesc;
>> +
>> + /*
>> + * Indicates whether an ACK from the destination that it's OK to do
>> + * switchover has been received.
>> + */
>> + bool switchover_acked;
>> };
>>
>> void migrate_set_state(int *state, int old_state, int new_state);
>> @@ -477,6 +490,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>> void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
>> char *block_name);
>> void migrate_send_rp_resume_ack(MigrationIncomingState *mis, uint32_t value);
>> +int migrate_send_rp_switchover_ack(MigrationIncomingState *mis);
>>
>> void dirty_bitmap_mig_before_vm_start(void);
>> void dirty_bitmap_mig_cancel_outgoing(void);
>> diff --git a/migration/savevm.h b/migration/savevm.h
>> index fb636735f0..e894bbc143 100644
>> --- a/migration/savevm.h
>> +++ b/migration/savevm.h
>> @@ -65,6 +65,7 @@ int qemu_loadvm_state(QEMUFile *f);
>> void qemu_loadvm_state_cleanup(void);
>> int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis);
>> int qemu_load_device_state(QEMUFile *f);
>> +int qemu_loadvm_approve_switchover(void);
>> int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
>> bool in_postcopy, bool inactivate_disks);
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 5de7f734b9..c73261118c 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -78,6 +78,7 @@ enum mig_rp_message_type {
>> MIG_RP_MSG_REQ_PAGES, /* data (start: be64, len: be32) */
>> MIG_RP_MSG_RECV_BITMAP, /* send recved_bitmap back to source */
>> MIG_RP_MSG_RESUME_ACK, /* tell source that we are ready to resume */
>> + MIG_RP_MSG_SWITCHOVER_ACK, /* Tell source it's OK to do switchover */
>>
>> MIG_RP_MSG_MAX
>> };
>> @@ -760,6 +761,11 @@ bool migration_has_all_channels(void)
>> return true;
>> }
>>
>> +int migrate_send_rp_switchover_ack(MigrationIncomingState *mis)
>> +{
>> + return migrate_send_rp_message(mis, MIG_RP_MSG_SWITCHOVER_ACK, 0, NULL);
>> +}
>> +
>> /*
>> * Send a 'SHUT' message on the return channel with the given value
>> * to indicate that we've finished with the RP. Non-0 value indicates
>> @@ -1405,6 +1411,7 @@ void migrate_init(MigrationState *s)
>> s->vm_was_running = false;
>> s->iteration_initial_bytes = 0;
>> s->threshold_size = 0;
>> + s->switchover_acked = false;
>> }
>>
>> int migrate_add_blocker_internal(Error *reason, Error **errp)
>> @@ -1721,6 +1728,7 @@ static struct rp_cmd_args {
>> [MIG_RP_MSG_REQ_PAGES_ID] = { .len = -1, .name = "REQ_PAGES_ID" },
>> [MIG_RP_MSG_RECV_BITMAP] = { .len = -1, .name = "RECV_BITMAP" },
>> [MIG_RP_MSG_RESUME_ACK] = { .len = 4, .name = "RESUME_ACK" },
>> + [MIG_RP_MSG_SWITCHOVER_ACK] = { .len = 0, .name = "SWITCHOVER_ACK" },
>> [MIG_RP_MSG_MAX] = { .len = -1, .name = "MAX" },
>> };
>>
>> @@ -1959,6 +1967,11 @@ retry:
>> }
>> break;
>>
>> + case MIG_RP_MSG_SWITCHOVER_ACK:
>> + ms->switchover_acked = true;
>> + trace_source_return_path_thread_switchover_acked();
>> + break;
>> +
>> default:
>> break;
>> }
>> @@ -2700,6 +2713,20 @@ static void migration_update_counters(MigrationState *s,
>> bandwidth, s->threshold_size);
>> }
>>
>> +static bool migration_can_switchover(MigrationState *s)
>> +{
>> + if (!migrate_switchover_ack()) {
>> + return true;
>> + }
>> +
>> + /* No reason to wait for switchover ACK if VM is stopped */
>> + if (!runstate_is_running()) {
>> + return true;
>> + }
> Is it possible for QEMU to force the migration to continue regardless
> of receiving an ack from the target and is this the check that would
> allow that?
Yes. If you stop the source VM, migration will not wait for an ACK
before doing the switchover.
>
> It seems that we don't know the downtime allowed for the VM in any of
> this, nor do we know how much time the target device will require to
> generate an ack, but we could certainly have conditions where the
> priority is moving the VM from the source host regardless of the
> resulting downtime.
In such cases you can keep the switchover-ack capability off.
>
> Also does the return path requirement preclude offline migration or
> does the above again take care of that if we pause the VM for an
> offline migration (ex. save to and restore from file)?
I suppose that by offline migration you mean migration where the source
VM is stopped first and then migrated?
If so, offline migration should work, and in that case we don't care
about the ACK, as downtime is not a concern.
However, migrating to a file doesn't work with return-path, as there is
no destination side responding to the source over the return path.
For this reason, using return-path when migrating to a file doesn't make
sense.
Thanks.
>> +
>> + return s->switchover_acked;
>> +}
>> +
>> /* Migration thread iteration status */
>> typedef enum {
>> MIG_ITERATE_RESUME, /* Resume current iteration */
>> @@ -2715,6 +2742,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>> {
>> uint64_t must_precopy, can_postcopy;
>> bool in_postcopy = s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE;
>> + bool can_switchover = migration_can_switchover(s);
>>
>> qemu_savevm_state_pending_estimate(&must_precopy, &can_postcopy);
>> uint64_t pending_size = must_precopy + can_postcopy;
>> @@ -2727,14 +2755,14 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>> trace_migrate_pending_exact(pending_size, must_precopy, can_postcopy);
>> }
>>
>> - if (!pending_size || pending_size < s->threshold_size) {
>> + if ((!pending_size || pending_size < s->threshold_size) && can_switchover) {
>> trace_migration_thread_low_pending(pending_size);
>> migration_completion(s);
>> return MIG_ITERATE_BREAK;
>> }
>>
>> /* Still a significant amount to transfer */
>> - if (!in_postcopy && must_precopy <= s->threshold_size &&
>> + if (!in_postcopy && must_precopy <= s->threshold_size && can_switchover &&
>> qatomic_read(&s->start_postcopy)) {
>> if (postcopy_start(s)) {
>> error_report("%s: postcopy failed to start", __func__);
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 03795ce8dc..285b814726 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2360,6 +2360,21 @@ static int loadvm_process_command(QEMUFile *f)
>> error_report("CMD_OPEN_RETURN_PATH failed");
>> return -1;
>> }
>> +
>> + /*
>> + * Switchover ack is enabled but no device uses it, so send an ACK to
>> + * source that it's OK to switchover. Do it here, after return path has
>> + * been created.
>> + */
>> + if (migrate_switchover_ack() && !mis->switchover_ack_pending_num) {
>> + int ret = migrate_send_rp_switchover_ack(mis);
>> + if (ret) {
>> + error_report(
>> + "Could not send switchover ack RP MSG, err %d (%s)", ret,
>> + strerror(-ret));
>> + return ret;
>> + }
>> + }
>> break;
>>
>> case MIG_CMD_PING:
>> @@ -2586,6 +2601,23 @@ static int qemu_loadvm_state_header(QEMUFile *f)
>> return 0;
>> }
>>
>> +static void qemu_loadvm_state_switchover_ack_needed(MigrationIncomingState *mis)
>> +{
>> + SaveStateEntry *se;
>> +
>> + QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
>> + if (!se->ops || !se->ops->switchover_ack_needed) {
>> + continue;
>> + }
>> +
>> + if (se->ops->switchover_ack_needed(se->opaque)) {
>> + mis->switchover_ack_pending_num++;
>> + }
>> + }
>> +
>> + trace_loadvm_state_switchover_ack_needed(mis->switchover_ack_pending_num);
>> +}
>> +
>> static int qemu_loadvm_state_setup(QEMUFile *f)
>> {
>> SaveStateEntry *se;
>> @@ -2789,6 +2821,10 @@ int qemu_loadvm_state(QEMUFile *f)
>> return -EINVAL;
>> }
>>
>> + if (migrate_switchover_ack()) {
>> + qemu_loadvm_state_switchover_ack_needed(mis);
>> + }
>> +
>> cpu_synchronize_all_pre_loadvm();
>>
>> ret = qemu_loadvm_state_main(f, mis);
>> @@ -2862,6 +2898,24 @@ int qemu_load_device_state(QEMUFile *f)
>> return 0;
>> }
>>
>> +int qemu_loadvm_approve_switchover(void)
>> +{
>> + MigrationIncomingState *mis = migration_incoming_get_current();
>> +
>> + if (!mis->switchover_ack_pending_num) {
>> + return -EINVAL;
>> + }
>> +
>> + mis->switchover_ack_pending_num--;
>> + trace_loadvm_approve_switchover(mis->switchover_ack_pending_num);
>> +
>> + if (mis->switchover_ack_pending_num) {
>> + return 0;
>> + }
>> +
>> + return migrate_send_rp_switchover_ack(mis);
>> +}
>> +
>> bool save_snapshot(const char *name, bool overwrite, const char *vmstate,
>> bool has_devices, strList *devices, Error **errp)
>> {
>> diff --git a/migration/trace-events b/migration/trace-events
>> index cdaef7a1ea..5259c1044b 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -7,6 +7,7 @@ qemu_loadvm_state_section_partend(uint32_t section_id) "%u"
>> qemu_loadvm_state_post_main(int ret) "%d"
>> qemu_loadvm_state_section_startfull(uint32_t section_id, const char *idstr, uint32_t instance_id, uint32_t version_id) "%u(%s) %u %u"
>> qemu_savevm_send_packaged(void) ""
>> +loadvm_state_switchover_ack_needed(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
>> loadvm_state_setup(void) ""
>> loadvm_state_cleanup(void) ""
>> loadvm_handle_cmd_packaged(unsigned int length) "%u"
>> @@ -23,6 +24,7 @@ loadvm_postcopy_ram_handle_discard_end(void) ""
>> loadvm_postcopy_ram_handle_discard_header(const char *ramid, uint16_t len) "%s: %ud"
>> loadvm_process_command(const char *s, uint16_t len) "com=%s len=%d"
>> loadvm_process_command_ping(uint32_t val) "0x%x"
>> +loadvm_approve_switchover(unsigned int switchover_ack_pending_num) "Switchover ack pending num=%u"
>> postcopy_ram_listen_thread_exit(void) ""
>> postcopy_ram_listen_thread_start(void) ""
>> qemu_savevm_send_postcopy_advise(void) ""
>> @@ -180,6 +182,7 @@ source_return_path_thread_loop_top(void) ""
>> source_return_path_thread_pong(uint32_t val) "0x%x"
>> source_return_path_thread_shut(uint32_t val) "0x%x"
>> source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
>> +source_return_path_thread_switchover_acked(void) ""
>> migration_thread_low_pending(uint64_t pending) "%" PRIu64
>> migrate_transferred(uint64_t tranferred, uint64_t time_spent, uint64_t bandwidth, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " max_size %" PRId64
>> process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
* Re: [PATCH v5 2/9] migration: Implement switchover ack logic
2023-06-06 12:12 ` Avihai Horon
@ 2023-06-08 18:32 ` Alex Williamson
2023-06-11 7:45 ` Avihai Horon
0 siblings, 1 reply; 25+ messages in thread
From: Alex Williamson @ 2023-06-08 18:32 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On Tue, 6 Jun 2023 15:12:13 +0300
Avihai Horon <avihaih@nvidia.com> wrote:
> On 06/06/2023 1:06, Alex Williamson wrote:
> > On Tue, 30 May 2023 17:48:14 +0300
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >> @@ -2700,6 +2713,20 @@ static void migration_update_counters(MigrationState *s,
> >> bandwidth, s->threshold_size);
> >> }
> >>
> >> +static bool migration_can_switchover(MigrationState *s)
> >> +{
> >> + if (!migrate_switchover_ack()) {
> >> + return true;
> >> + }
> >> +
> >> + /* No reason to wait for switchover ACK if VM is stopped */
> >> + if (!runstate_is_running()) {
> >> + return true;
> >> + }
> > Is it possible for QEMU to force the migration to continue regardless
> > of receiving an ack from the target and is this the check that would
> > allow that?
>
> Yes. If you stop the source VM then migration will not wait for an ACK
> to do the switchover.
>
> >
> > It seems that we don't know the downtime allowed for the VM in any of
> > this, nor do we know how much time the target device will require to
> > generate an ack, but we could certainly have conditions where the
> > priority is moving the VM from the source host regardless of the
> > resulting downtime.
>
> In such cases you can keep the switchover-ack capability off.
How is that accomplished?
> > Also does the return path requirement preclude offline migration or
> > does the above again take care of that if we pause the VM for an
> > offline migration (ex. save to and restore from file)?
>
> I suppose that by offline migration you mean migration where you stop
> the source VM first and then do migration?
Yes.
> If so, offline migration should work and in that case we don't care
> about the ACK as downtime is not a concern.
>
> However, migrating to a file doesn't work with return-path, as you don't
> have the destination side responding to the source via the return path.
> For this reason, using return-path when migrating to a file doesn't make
> sense.
So we require return-path for switchover-ack, but switchover-ack is
only required for pre-copy, therefore why do we require return-path for
an offline migration?
If there's a way to turn off switchover-ack capability, is there also a
way to turn off return-path and therefore enable migration to file?
Sorry if I'm simply not familiar with these migration switches. Thanks,
Alex
* Re: [PATCH v5 2/9] migration: Implement switchover ack logic
2023-06-08 18:32 ` Alex Williamson
@ 2023-06-11 7:45 ` Avihai Horon
0 siblings, 0 replies; 25+ messages in thread
From: Avihai Horon @ 2023-06-11 7:45 UTC (permalink / raw)
To: Alex Williamson
Cc: qemu-devel, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On 08/06/2023 21:32, Alex Williamson wrote:
> On Tue, 6 Jun 2023 15:12:13 +0300
> Avihai Horon <avihaih@nvidia.com> wrote:
>> On 06/06/2023 1:06, Alex Williamson wrote:
>>> On Tue, 30 May 2023 17:48:14 +0300
>>> Avihai Horon <avihaih@nvidia.com> wrote:
>>>> @@ -2700,6 +2713,20 @@ static void migration_update_counters(MigrationState *s,
>>>> bandwidth, s->threshold_size);
>>>> }
>>>>
>>>> +static bool migration_can_switchover(MigrationState *s)
>>>> +{
>>>> + if (!migrate_switchover_ack()) {
>>>> + return true;
>>>> + }
>>>> +
>>>> + /* No reason to wait for switchover ACK if VM is stopped */
>>>> + if (!runstate_is_running()) {
>>>> + return true;
>>>> + }
>>> Is it possible for QEMU to force the migration to continue regardless
>>> of receiving an ack from the target and is this the check that would
>>> allow that?
>> Yes. If you stop the source VM then migration will not wait for an ACK
>> to do the switchover.
>>
>>> It seems that we don't know the downtime allowed for the VM in any of
>>> this, nor do we know how much time the target device will require to
>>> generate an ack, but we could certainly have conditions where the
>>> priority is moving the VM from the source host regardless of the
>>> resulting downtime.
>> In such cases you can keep the switchover-ack capability off.
> How is that accomplished?
You simply don't enable the switchover-ack migration capability (it is
disabled by default).
>
>>> Also does the return path requirement preclude offline migration or
>>> does the above again take care of that if we pause the VM for an
>>> offline migration (ex. save to and restore from file)?
>> I suppose that by offline migration you mean migration where you stop
>> the source VM first and then do migration?
> Yes.
>
>> If so, offline migration should work and in that case we don't care
>> about the ACK as downtime is not a concern.
>>
>> However, migrating to a file doesn't work with return-path, as you don't
>> have the destination side responding to the source via the return path.
>> For this reason, using return-path when migrating to a file doesn't make
>> sense.
> So we require return-path for switchover-ack, but switchover-ack is
> only required for pre-copy, therefore why do we require return-path for
> an offline migration?
We don't. See below.
>
> If there's a way to turn off switchover-ack capability, is there also a
> way to turn off return-path and therefore enable migration to file?
Yes.
By default, return-path and switchover-ack migration capabilities are
disabled.
So for an offline migration nothing needs to be done -- you simply run
migration.
For an online migration, you can choose to use switchover-ack or not.
If you want to use switchover-ack, then you need to enable return-path
and switchover-ack capabilities first and then run migration.
If you don't want to (e.g., you don't have a VFIO device assigned to the
VM, so there is no reason to use it), then you keep the return-path and
switchover-ack capabilities disabled.
I hope that's clear now.
Thanks.
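The gating described in this exchange can be summarized as a stand-alone sketch. The boolean parameters below are illustrative stand-ins for QEMU's migrate_switchover_ack() and runstate_is_running() checks, following the migration_can_switchover() hunk quoted above:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of when the source may switch over. Waiting for the
 * destination's ack only happens when the switchover-ack capability is
 * enabled AND the VM is still running; an offline migration (VM stopped)
 * never waits, and with the capability off the behavior is unchanged.
 */
static bool can_switchover(bool cap_switchover_ack, bool vm_running,
                           bool ack_received)
{
    if (!cap_switchover_ack) {
        return true;        /* capability off: behave as before */
    }
    if (!vm_running) {
        return true;        /* VM stopped: no reason to wait for an ack */
    }
    return ack_received;    /* online + capability on: wait for the ack */
}
```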
* [PATCH v5 3/9] migration: Enable switchover ack capability
2023-05-30 14:48 [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
2023-05-30 14:48 ` [PATCH v5 1/9] migration: Add switchover ack capability Avihai Horon
2023-05-30 14:48 ` [PATCH v5 2/9] migration: Implement switchover ack logic Avihai Horon
@ 2023-05-30 14:48 ` Avihai Horon
2023-05-30 14:48 ` [PATCH v5 4/9] tests: Add migration switchover ack capability test Avihai Horon
` (6 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Avihai Horon @ 2023-05-30 14:48 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Now that switchover ack logic has been implemented, enable the
capability.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
migration/options.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/migration/options.c b/migration/options.c
index 16007afca6..5a9505adf7 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -562,10 +562,6 @@ bool migrate_caps_check(bool *old_caps, bool *new_caps, Error **errp)
"'return-path'");
return false;
}
-
- /* Disable this capability until it's implemented */
- error_setg(errp, "'switchover-ack' is not implemented yet");
- return false;
}
return true;
--
2.26.3
* [PATCH v5 4/9] tests: Add migration switchover ack capability test
2023-05-30 14:48 [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
` (2 preceding siblings ...)
2023-05-30 14:48 ` [PATCH v5 3/9] migration: Enable switchover ack capability Avihai Horon
@ 2023-05-30 14:48 ` Avihai Horon
2023-05-30 14:48 ` [PATCH v5 5/9] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
` (5 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Avihai Horon @ 2023-05-30 14:48 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Add a migration test for the switchover ack capability. The test runs
without devices that support this capability, but it is still useful to
make sure that enabling the capability doesn't break migration.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
---
tests/qtest/migration-test.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index b99b49a314..d246a5bbc5 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1648,6 +1648,28 @@ static void test_precopy_tcp_plain(void)
test_precopy_common(&args);
}
+static void *test_migrate_switchover_ack_start(QTestState *from, QTestState *to)
+{
+
+ migrate_set_capability(from, "return-path", true);
+ migrate_set_capability(to, "return-path", true);
+
+ migrate_set_capability(from, "switchover-ack", true);
+ migrate_set_capability(to, "switchover-ack", true);
+
+ return NULL;
+}
+
+static void test_precopy_tcp_switchover_ack(void)
+{
+ MigrateCommon args = {
+ .listen_uri = "tcp:127.0.0.1:0",
+ .start_hook = test_migrate_switchover_ack_start,
+ };
+
+ test_precopy_common(&args);
+}
+
#ifdef CONFIG_GNUTLS
static void test_precopy_tcp_tls_psk_match(void)
{
@@ -2695,6 +2717,10 @@ int main(int argc, char **argv)
#endif /* CONFIG_GNUTLS */
qtest_add_func("/migration/precopy/tcp/plain", test_precopy_tcp_plain);
+
+ qtest_add_func("/migration/precopy/tcp/plain/switchover-ack",
+ test_precopy_tcp_switchover_ack);
+
#ifdef CONFIG_GNUTLS
qtest_add_func("/migration/precopy/tcp/tls/psk/match",
test_precopy_tcp_tls_psk_match);
--
2.26.3
* [PATCH v5 5/9] vfio/migration: Refactor vfio_save_block() to return saved data size
2023-05-30 14:48 [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
` (3 preceding siblings ...)
2023-05-30 14:48 ` [PATCH v5 4/9] tests: Add migration switchover ack capability test Avihai Horon
@ 2023-05-30 14:48 ` Avihai Horon
2023-05-30 14:48 ` [PATCH v5 6/9] vfio/migration: Store VFIO migration flags in VFIOMigration Avihai Horon
` (4 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Avihai Horon @ 2023-05-30 14:48 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Refactor vfio_save_block() to return the size of the saved data on
success and -errno on error.
This will be used in the next patch to implement VFIO migration pre-copy
support.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
---
hw/vfio/migration.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 6b58dddb88..235978fd68 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -241,8 +241,8 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
return 0;
}
-/* Returns 1 if end-of-stream is reached, 0 if more data and -errno if error */
-static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
+/* Returns the size of saved data on success and -errno on error */
+static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
{
ssize_t data_size;
@@ -252,7 +252,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
return -errno;
}
if (data_size == 0) {
- return 1;
+ return 0;
}
qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
@@ -262,7 +262,7 @@ static int vfio_save_block(QEMUFile *f, VFIOMigration *migration)
trace_vfio_save_block(migration->vbasedev->name, data_size);
- return qemu_file_get_error(f);
+ return qemu_file_get_error(f) ?: data_size;
}
/* ---------------------------------------------------------------------- */
@@ -335,6 +335,7 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
{
VFIODevice *vbasedev = opaque;
+ ssize_t data_size;
int ret;
/* We reach here with device state STOP only */
@@ -345,11 +346,11 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
}
do {
- ret = vfio_save_block(f, vbasedev->migration);
- if (ret < 0) {
- return ret;
+ data_size = vfio_save_block(f, vbasedev->migration);
+ if (data_size < 0) {
+ return data_size;
}
- } while (!ret);
+ } while (data_size);
qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
ret = qemu_file_get_error(f);
--
2.26.3
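The refactored return convention can be shown with a tiny stand-alone model: a non-negative result is the number of bytes saved (0 meaning end-of-stream), a negative result is -errno, and the GNU `?:` extension used in the patch returns data_size whenever the file error is 0. Here file_error merely stands in for qemu_file_get_error(); it is not the real API:

```c
#include <assert.h>
#include <sys/types.h>

/*
 * Model of the new vfio_save_block() return convention:
 *   > 0  -> bytes saved
 *   == 0 -> end-of-stream
 *   < 0  -> -errno
 * `a ?: b` is the GNU "conditional with omitted operand" extension,
 * equivalent to `a ? a : b` without evaluating `a` twice.
 */
static ssize_t save_block_result(int file_error, ssize_t data_size)
{
    return file_error ?: data_size;
}
```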
* [PATCH v5 6/9] vfio/migration: Store VFIO migration flags in VFIOMigration
2023-05-30 14:48 [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
` (4 preceding siblings ...)
2023-05-30 14:48 ` [PATCH v5 5/9] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
@ 2023-05-30 14:48 ` Avihai Horon
2023-05-30 14:48 ` [PATCH v5 7/9] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
` (3 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Avihai Horon @ 2023-05-30 14:48 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
VFIO migration flags are queried once in vfio_migration_init(). Store
them in VFIOMigration so they can be used later to check the device's
migration capabilities without re-querying them.
This will be used in the next patch to check if the device supports
precopy migration.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
include/hw/vfio/vfio-common.h | 1 +
hw/vfio/migration.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index eed244f25f..5f29dab839 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,7 @@ typedef struct VFIOMigration {
int data_fd;
void *data_buffer;
size_t data_buffer_size;
+ uint64_t mig_flags;
} VFIOMigration;
typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 235978fd68..8d33414379 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -603,6 +603,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
migration->vbasedev = vbasedev;
migration->device_state = VFIO_DEVICE_STATE_RUNNING;
migration->data_fd = -1;
+ migration->mig_flags = mig_flags;
vbasedev->dirty_pages_supported = vfio_dma_logging_supported(vbasedev);
--
2.26.3
* [PATCH v5 7/9] vfio/migration: Add VFIO migration pre-copy support
2023-05-30 14:48 [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
` (5 preceding siblings ...)
2023-05-30 14:48 ` [PATCH v5 6/9] vfio/migration: Store VFIO migration flags in VFIOMigration Avihai Horon
@ 2023-05-30 14:48 ` Avihai Horon
2023-05-30 14:48 ` [PATCH v5 8/9] vfio/migration: Add x-allow-pre-copy VFIO device property Avihai Horon
` (2 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Avihai Horon @ 2023-05-30 14:48 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Pre-copy support allows the VFIO device data to be transferred while the
VM is running. This helps to accommodate VFIO devices that have a large
amount of data that needs to be transferred, and it can reduce migration
downtime.
Pre-copy support is optional in VFIO migration protocol v2.
Implement pre-copy for VFIO migration protocol v2 and use it for devices
that support it. A full description can be found in the following Linux
kernel commit: 4db52602a607 ("vfio: Extend the device migration protocol
with PRE_COPY").
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
docs/devel/vfio-migration.rst | 35 +++++---
include/hw/vfio/vfio-common.h | 2 +
hw/vfio/common.c | 6 +-
hw/vfio/migration.c | 165 ++++++++++++++++++++++++++++++++--
hw/vfio/trace-events | 4 +-
5 files changed, 190 insertions(+), 22 deletions(-)
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index 1b68ccf115..e896b2a673 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
destination host. This document details how saving and restoring of VFIO
devices is done in QEMU.
-Migration of VFIO devices currently consists of a single stop-and-copy phase.
-During the stop-and-copy phase the guest is stopped and the entire VFIO device
-data is transferred to the destination.
-
-The pre-copy phase of migration is currently not supported for VFIO devices.
-Support for VFIO pre-copy will be added later on.
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and allows
+accommodating VFIO devices that have a large amount of data that needs to be
+transferred. The iterative pre-copy phase allows the guest to continue running
+while the VFIO device state is transferred to the destination; this helps to
+reduce the total downtime of the VM. VFIO devices opt in to pre-copy support
+by reporting the VFIO_MIGRATION_PRE_COPY flag in the
+VFIO_DEVICE_FEATURE_MIGRATION ioctl.
Note that currently VFIO migration is supported only for a single device. This
is due to VFIO migration's lack of P2P support. However, P2P support is planned
@@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
* A ``load_setup`` function that sets the VFIO device on the destination in
_RESUMING state.
+* A ``state_pending_estimate`` function that reports an estimate of the
+ remaining pre-copy data that the vendor driver has yet to save for the VFIO
+ device.
+
* A ``state_pending_exact`` function that reads pending_bytes from the vendor
driver, which indicates the amount of data that the vendor driver has yet to
save for the VFIO device.
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+ active only when the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+ vendor driver during iterative pre-copy phase.
+
* A ``save_state`` function to save the device config space if it is present.
* A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -111,8 +123,10 @@ Flow of state changes during Live migration
===========================================
Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
+The values in the parentheses represent the VM state, the migration state, and
the VFIO device state, respectively.
+The text in the square brackets represents the flow if the VFIO device supports
+pre-copy.
Live migration save path
------------------------
@@ -124,11 +138,12 @@ Live migration save path
|
migrate_init spawns migration_thread
Migration thread then calls each device's .save_setup()
- (RUNNING, _SETUP, _RUNNING)
+ (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
|
- (RUNNING, _ACTIVE, _RUNNING)
- If device is active, get pending_bytes by .state_pending_exact()
+ (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
+ If device is active, get pending_bytes by .state_pending_{estimate,exact}()
If total pending_bytes >= threshold_size, call .save_live_iterate()
+ [Data of VFIO device for pre-copy phase is copied]
Iterate till total pending bytes converge and are less than threshold
|
On migration completion, vCPU stops and calls .save_live_complete_precopy for
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 5f29dab839..1db901c194 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -67,6 +67,8 @@ typedef struct VFIOMigration {
void *data_buffer;
size_t data_buffer_size;
uint64_t mig_flags;
+ uint64_t precopy_init_size;
+ uint64_t precopy_dirty_size;
} VFIOMigration;
typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 78358ede27..b73086e17a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
}
if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
- migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+ (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+ migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
return false;
}
}
@@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
return false;
}
- if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+ if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+ migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
continue;
} else {
return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 8d33414379..d8f6a22ae1 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
return "STOP_COPY";
case VFIO_DEVICE_STATE_RESUMING:
return "RESUMING";
+ case VFIO_DEVICE_STATE_PRE_COPY:
+ return "PRE_COPY";
default:
return "UNKNOWN STATE";
}
@@ -241,6 +243,25 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
return 0;
}
+static int vfio_query_precopy_size(VFIOMigration *migration)
+{
+ struct vfio_precopy_info precopy = {
+ .argsz = sizeof(precopy),
+ };
+
+ migration->precopy_init_size = 0;
+ migration->precopy_dirty_size = 0;
+
+ if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
+ return -errno;
+ }
+
+ migration->precopy_init_size = precopy.initial_bytes;
+ migration->precopy_dirty_size = precopy.dirty_bytes;
+
+ return 0;
+}
+
/* Returns the size of saved data on success and -errno on error */
static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
{
@@ -249,6 +270,14 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
data_size = read(migration->data_fd, migration->data_buffer,
migration->data_buffer_size);
if (data_size < 0) {
+ /*
+ * Pre-copy emptied all the device state for now. For more information,
+ * please refer to the Linux kernel VFIO uAPI.
+ */
+ if (errno == ENOMSG) {
+ return 0;
+ }
+
return -errno;
}
if (data_size == 0) {
@@ -265,6 +294,38 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
return qemu_file_get_error(f) ?: data_size;
}
+static void vfio_update_estimated_pending_data(VFIOMigration *migration,
+ uint64_t data_size)
+{
+ if (!data_size) {
+ /*
+ * Pre-copy emptied all the device state for now, update estimated sizes
+ * accordingly.
+ */
+ migration->precopy_init_size = 0;
+ migration->precopy_dirty_size = 0;
+
+ return;
+ }
+
+ if (migration->precopy_init_size) {
+ uint64_t init_size = MIN(migration->precopy_init_size, data_size);
+
+ migration->precopy_init_size -= init_size;
+ data_size -= init_size;
+ }
+
+ migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
+ data_size);
+}
+
+static bool vfio_precopy_supported(VFIODevice *vbasedev)
+{
+ VFIOMigration *migration = vbasedev->migration;
+
+ return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
+}
+
/* ---------------------------------------------------------------------- */
static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -285,6 +346,28 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
return -ENOMEM;
}
+ if (vfio_precopy_supported(vbasedev)) {
+ int ret;
+
+ switch (migration->device_state) {
+ case VFIO_DEVICE_STATE_RUNNING:
+ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
+ VFIO_DEVICE_STATE_RUNNING);
+ if (ret) {
+ return ret;
+ }
+
+ vfio_query_precopy_size(migration);
+
+ break;
+ case VFIO_DEVICE_STATE_STOP:
+ /* vfio_save_complete_precopy() will go to STOP_COPY */
+ break;
+ default:
+ return -EINVAL;
+ }
+ }
+
trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
@@ -299,26 +382,42 @@ static void vfio_save_cleanup(void *opaque)
g_free(migration->data_buffer);
migration->data_buffer = NULL;
+ migration->precopy_init_size = 0;
+ migration->precopy_dirty_size = 0;
vfio_migration_cleanup(vbasedev);
trace_vfio_save_cleanup(vbasedev->name);
}
+static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy,
+ uint64_t *can_postcopy)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
+ return;
+ }
+
+ *must_precopy +=
+ migration->precopy_init_size + migration->precopy_dirty_size;
+
+ trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
+ *can_postcopy,
+ migration->precopy_init_size,
+ migration->precopy_dirty_size);
+}
+
/*
* Migration size of VFIO devices can be as little as a few KBs or as big as
* many GBs. This value should be big enough to cover the worst case.
*/
#define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
-/*
- * Only exact function is implemented and not estimate function. The reason is
- * that during pre-copy phase of migration the estimate function is called
- * repeatedly while pending RAM size is over the threshold, thus migration
- * can't converge and querying the VFIO device pending data size is useless.
- */
static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
uint64_t *can_postcopy)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
/*
@@ -328,8 +427,48 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy,
vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
*must_precopy += stop_copy_size;
+ if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+ vfio_query_precopy_size(migration);
+
+ *must_precopy +=
+ migration->precopy_init_size + migration->precopy_dirty_size;
+ }
+
trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
- stop_copy_size);
+ stop_copy_size, migration->precopy_init_size,
+ migration->precopy_dirty_size);
+}
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+
+ return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ ssize_t data_size;
+
+ data_size = vfio_save_block(f, migration);
+ if (data_size < 0) {
+ return data_size;
+ }
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+ vfio_update_estimated_pending_data(migration, data_size);
+
+ trace_vfio_save_iterate(vbasedev->name, migration->precopy_init_size,
+ migration->precopy_dirty_size);
+
+ /*
+ * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
+ * Return 1 so following handlers will not be potentially blocked.
+ */
+ return 1;
}
static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
@@ -338,7 +477,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
ssize_t data_size;
int ret;
- /* We reach here with device state STOP only */
+ /* We reach here with device state STOP or STOP_COPY only */
ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
VFIO_DEVICE_STATE_STOP);
if (ret) {
@@ -457,7 +596,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
static const SaveVMHandlers savevm_vfio_handlers = {
.save_setup = vfio_save_setup,
.save_cleanup = vfio_save_cleanup,
+ .state_pending_estimate = vfio_state_pending_estimate,
.state_pending_exact = vfio_state_pending_exact,
+ .is_active_iterate = vfio_is_active_iterate,
+ .save_live_iterate = vfio_save_iterate,
.save_live_complete_precopy = vfio_save_complete_precopy,
.save_state = vfio_save_state,
.load_setup = vfio_load_setup,
@@ -470,13 +612,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
static void vfio_vmstate_change(void *opaque, bool running, RunState state)
{
VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
enum vfio_device_mig_state new_state;
int ret;
if (running) {
new_state = VFIO_DEVICE_STATE_RUNNING;
} else {
- new_state = VFIO_DEVICE_STATE_STOP;
+ new_state =
+ (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
+ (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
+ VFIO_DEVICE_STATE_STOP_COPY :
+ VFIO_DEVICE_STATE_STOP;
}
/*
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 646e42fd27..4150b59e58 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
vfio_save_cleanup(const char *name) " (%s)"
vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
-vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
+vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
--
2.26.3
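The estimate accounting in vfio_update_estimated_pending_data() above can be modeled stand-alone: saved bytes are charged against the initial-bytes estimate first, the remainder against the dirty-bytes estimate, and a zero-sized read (ENOMSG in the kernel uAPI) clears both estimates:

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Stand-alone model of the patch's pending-data update: data_size bytes
 * were just saved, so subtract them from the initial-bytes estimate
 * first and charge any remainder to the dirty-bytes estimate. A zero
 * data_size means pre-copy emptied the device state for now, so both
 * estimates are reset.
 */
typedef struct {
    uint64_t precopy_init_size;
    uint64_t precopy_dirty_size;
} Estimates;

static void update_estimates(Estimates *e, uint64_t data_size)
{
    if (!data_size) {
        e->precopy_init_size = 0;
        e->precopy_dirty_size = 0;
        return;
    }

    if (e->precopy_init_size) {
        uint64_t init = MIN(e->precopy_init_size, data_size);

        e->precopy_init_size -= init;
        data_size -= init;
    }

    e->precopy_dirty_size -= MIN(e->precopy_dirty_size, data_size);
}
```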
* [PATCH v5 8/9] vfio/migration: Add x-allow-pre-copy VFIO device property
2023-05-30 14:48 [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
` (6 preceding siblings ...)
2023-05-30 14:48 ` [PATCH v5 7/9] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
@ 2023-05-30 14:48 ` Avihai Horon
2023-06-01 20:22 ` Alex Williamson
2023-05-30 14:48 ` [PATCH v5 9/9] vfio/migration: Add support for switchover ack capability Avihai Horon
2023-06-16 9:35 ` [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support YangHang Liu
9 siblings, 1 reply; 25+ messages in thread
From: Avihai Horon @ 2023-05-30 14:48 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Add a new VFIO device property x-allow-pre-copy to keep migration
compatibility to/from older QEMU versions that don't have VFIO pre-copy
support.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
include/hw/vfio/vfio-common.h | 1 +
hw/core/machine.c | 1 +
hw/vfio/migration.c | 3 ++-
hw/vfio/pci.c | 2 ++
4 files changed, 6 insertions(+), 1 deletion(-)
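For readers unfamiliar with QEMU's compat-property machinery, the hw_compat_8_0 entry added below is what keeps older machine types behaving as before: a machine type carries { driver, property, value } overrides that replace a device's new default. The following is a minimal illustrative model of that lookup (names are mine, not QEMU's actual API; the real implementation lives in hw/core/machine.c and qdev):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for QEMU's GlobalProperty. */
typedef struct {
    const char *driver;
    const char *property;
    const char *value;
} GlobalProp;

/* Model of the hw_compat_8_0 list after this patch. */
static const GlobalProp hw_compat_8_0_model[] = {
    { "migration", "multifd-flush-after-each-section", "on" },
    { "vfio-pci", "x-allow-pre-copy", "false" },
};

/*
 * Resolve a property's effective default: a compat override for this
 * machine type wins over the device's new built-in default, so a 8.0
 * machine type forces "x-allow-pre-copy" back to "false" while newer
 * machine types keep the new default of "true".
 */
static const char *effective_default(const GlobalProp *compat, size_t n,
                                     const char *driver, const char *prop,
                                     const char *new_default)
{
    for (size_t i = 0; i < n; i++) {
        if (!strcmp(compat[i].driver, driver) &&
            !strcmp(compat[i].property, prop)) {
            return compat[i].value; /* compat override wins */
        }
    }
    return new_default;
}
```

Under this model, starting a vfio-pci device on a pc-8.0 machine resolves x-allow-pre-copy to "false", matching the behavior of QEMU builds without pre-copy support.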
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1db901c194..a53ecbe2e0 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -146,6 +146,7 @@ typedef struct VFIODevice {
VFIOMigration *migration;
Error *migration_blocker;
OnOffAuto pre_copy_dirty_page_tracking;
+ bool allow_pre_copy;
bool dirty_pages_supported;
bool dirty_tracking;
} VFIODevice;
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 1000406211..64ac3fe38e 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -41,6 +41,7 @@
GlobalProperty hw_compat_8_0[] = {
{ "migration", "multifd-flush-after-each-section", "on"},
+ { "vfio-pci", "x-allow-pre-copy", "false" },
};
const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index d8f6a22ae1..cb6923ed3f 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -323,7 +323,8 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
{
VFIOMigration *migration = vbasedev->migration;
- return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
+ return vbasedev->allow_pre_copy &&
+ migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
}
/* ---------------------------------------------------------------------- */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 73874a94de..c69813af7f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
vbasedev.pre_copy_dirty_page_tracking,
ON_OFF_AUTO_ON),
+ DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
+ vbasedev.allow_pre_copy, true),
DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
display, ON_OFF_AUTO_OFF),
DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
--
2.26.3
* Re: [PATCH v5 8/9] vfio/migration: Add x-allow-pre-copy VFIO device property
2023-05-30 14:48 ` [PATCH v5 8/9] vfio/migration: Add x-allow-pre-copy VFIO device property Avihai Horon
@ 2023-06-01 20:22 ` Alex Williamson
2023-06-04 9:33 ` Avihai Horon
0 siblings, 1 reply; 25+ messages in thread
From: Alex Williamson @ 2023-06-01 20:22 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On Tue, 30 May 2023 17:48:20 +0300
Avihai Horon <avihaih@nvidia.com> wrote:
> Add a new VFIO device property x-allow-pre-copy to keep migration
> compatibility to/from older QEMU versions that don't have VFIO pre-copy
> support.
This doesn't make sense to me, vfio migration is not currently
supported, it can only be enabled via an experimental flag. AFAIK we
have no obligation to maintain migration compatibility against
experimental features. Is there any other reason we need a flag to
disable pre-copy?
OTOH, should this series finally remove the experimental migration
flag? Do we require Joao's vIOMMU support to finally make it
supportable? Is there something else? Thanks,
Alex
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> ---
> include/hw/vfio/vfio-common.h | 1 +
> hw/core/machine.c | 1 +
> hw/vfio/migration.c | 3 ++-
> hw/vfio/pci.c | 2 ++
> 4 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 1db901c194..a53ecbe2e0 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -146,6 +146,7 @@ typedef struct VFIODevice {
> VFIOMigration *migration;
> Error *migration_blocker;
> OnOffAuto pre_copy_dirty_page_tracking;
> + bool allow_pre_copy;
> bool dirty_pages_supported;
> bool dirty_tracking;
> } VFIODevice;
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 1000406211..64ac3fe38e 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -41,6 +41,7 @@
>
> GlobalProperty hw_compat_8_0[] = {
> { "migration", "multifd-flush-after-each-section", "on"},
> + { "vfio-pci", "x-allow-pre-copy", "false" },
> };
> const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index d8f6a22ae1..cb6923ed3f 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -323,7 +323,8 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
> {
> VFIOMigration *migration = vbasedev->migration;
>
> - return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
> + return vbasedev->allow_pre_copy &&
> + migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
> }
>
> /* ---------------------------------------------------------------------- */
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 73874a94de..c69813af7f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
> DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
> vbasedev.pre_copy_dirty_page_tracking,
> ON_OFF_AUTO_ON),
> + DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
> + vbasedev.allow_pre_copy, true),
> DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
> display, ON_OFF_AUTO_OFF),
> DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
* Re: [PATCH v5 8/9] vfio/migration: Add x-allow-pre-copy VFIO device property
2023-06-01 20:22 ` Alex Williamson
@ 2023-06-04 9:33 ` Avihai Horon
2023-06-05 14:56 ` Alex Williamson
0 siblings, 1 reply; 25+ messages in thread
From: Avihai Horon @ 2023-06-04 9:33 UTC (permalink / raw)
To: Alex Williamson
Cc: qemu-devel, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On 01/06/2023 23:22, Alex Williamson wrote:
> On Tue, 30 May 2023 17:48:20 +0300
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> Add a new VFIO device property x-allow-pre-copy to keep migration
>> compatibility to/from older QEMU versions that don't have VFIO pre-copy
>> support.
> This doesn't make sense to me, vfio migration is not currently
> supported, it can only be enabled via an experimental flag. AFAIK we
> have no obligation to maintain migration compatibility against
> experimental features. Is there any other reason we need a flag to
> disable pre-copy?
This could give flexibility to migrate between hosts without
matching VFIO device kernel drivers. E.g., the source driver doesn't
have precopy support but the dest driver does, or vice versa.
>
> OTOH, should this series finally remove the experimental migration
> flag? Do we require Joao's vIOMMU support to finally make it
> supportable? Is there something else?
I think that after precopy is accepted we can remove the experimental
flag, as we'll have the major parts of VFIO migration upstream.
After that we will still need to add Joao's vIOMMU support and P2P support.
Do you want me to add a patch to this series that makes VFIO migration
non-experimental?
Thanks.
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> Reviewed-by: Cédric Le Goater <clg@redhat.com>
>> ---
>> include/hw/vfio/vfio-common.h | 1 +
>> hw/core/machine.c | 1 +
>> hw/vfio/migration.c | 3 ++-
>> hw/vfio/pci.c | 2 ++
>> 4 files changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 1db901c194..a53ecbe2e0 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -146,6 +146,7 @@ typedef struct VFIODevice {
>> VFIOMigration *migration;
>> Error *migration_blocker;
>> OnOffAuto pre_copy_dirty_page_tracking;
>> + bool allow_pre_copy;
>> bool dirty_pages_supported;
>> bool dirty_tracking;
>> } VFIODevice;
>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>> index 1000406211..64ac3fe38e 100644
>> --- a/hw/core/machine.c
>> +++ b/hw/core/machine.c
>> @@ -41,6 +41,7 @@
>>
>> GlobalProperty hw_compat_8_0[] = {
>> { "migration", "multifd-flush-after-each-section", "on"},
>> + { "vfio-pci", "x-allow-pre-copy", "false" },
>> };
>> const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index d8f6a22ae1..cb6923ed3f 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -323,7 +323,8 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
>> {
>> VFIOMigration *migration = vbasedev->migration;
>>
>> - return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>> + return vbasedev->allow_pre_copy &&
>> + migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>> }
>>
>> /* ---------------------------------------------------------------------- */
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 73874a94de..c69813af7f 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>> DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
>> vbasedev.pre_copy_dirty_page_tracking,
>> ON_OFF_AUTO_ON),
>> + DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
>> + vbasedev.allow_pre_copy, true),
>> DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>> display, ON_OFF_AUTO_OFF),
>> DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
* Re: [PATCH v5 8/9] vfio/migration: Add x-allow-pre-copy VFIO device property
2023-06-04 9:33 ` Avihai Horon
@ 2023-06-05 14:56 ` Alex Williamson
2023-06-06 11:59 ` Avihai Horon
0 siblings, 1 reply; 25+ messages in thread
From: Alex Williamson @ 2023-06-05 14:56 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On Sun, 4 Jun 2023 12:33:43 +0300
Avihai Horon <avihaih@nvidia.com> wrote:
> On 01/06/2023 23:22, Alex Williamson wrote:
> > On Tue, 30 May 2023 17:48:20 +0300
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >
> >> Add a new VFIO device property x-allow-pre-copy to keep migration
> >> compatibility to/from older QEMU versions that don't have VFIO pre-copy
> >> support.
> > This doesn't make sense to me, vfio migration is not currently
> > supported, it can only be enabled via an experimental flag. AFAIK we
> > have no obligation to maintain migration compatibility against
> > experimental features. Is there any other reason we need a flag to
> > disable pre-copy?
>
> This could give flexibility to migrate between hosts without
> matching VFIO device kernel drivers. E.g., the source driver doesn't
> have precopy support but the dest driver does, or vice versa.
If these are valid scenarios, the protocol should support negotiation
without requiring an experimental flag to do so.
> > OTOH, should this series finally remove the experimental migration
> > flag? Do we require Joao's vIOMMU support to finally make it
> > supportable? Is there something else?
>
> I think that after precopy is accepted we can remove the experimental
> flag, as we'll have the major parts of VFIO migration upstream.
> After that we will still need to add Joao's vIOMMU support and P2P support.
> Do you want me to add a patch to this series that makes VFIO migration
> non-experimental?
I'd keep it as a separate patch with a clearly described dependency on
this series so that we can discuss it separately. Thanks,
Alex
> >> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> >> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> >> ---
> >> include/hw/vfio/vfio-common.h | 1 +
> >> hw/core/machine.c | 1 +
> >> hw/vfio/migration.c | 3 ++-
> >> hw/vfio/pci.c | 2 ++
> >> 4 files changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> >> index 1db901c194..a53ecbe2e0 100644
> >> --- a/include/hw/vfio/vfio-common.h
> >> +++ b/include/hw/vfio/vfio-common.h
> >> @@ -146,6 +146,7 @@ typedef struct VFIODevice {
> >> VFIOMigration *migration;
> >> Error *migration_blocker;
> >> OnOffAuto pre_copy_dirty_page_tracking;
> >> + bool allow_pre_copy;
> >> bool dirty_pages_supported;
> >> bool dirty_tracking;
> >> } VFIODevice;
> >> diff --git a/hw/core/machine.c b/hw/core/machine.c
> >> index 1000406211..64ac3fe38e 100644
> >> --- a/hw/core/machine.c
> >> +++ b/hw/core/machine.c
> >> @@ -41,6 +41,7 @@
> >>
> >> GlobalProperty hw_compat_8_0[] = {
> >> { "migration", "multifd-flush-after-each-section", "on"},
> >> + { "vfio-pci", "x-allow-pre-copy", "false" },
> >> };
> >> const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index d8f6a22ae1..cb6923ed3f 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -323,7 +323,8 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
> >> {
> >> VFIOMigration *migration = vbasedev->migration;
> >>
> >> - return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
> >> + return vbasedev->allow_pre_copy &&
> >> + migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
> >> }
> >>
> >> /* ---------------------------------------------------------------------- */
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index 73874a94de..c69813af7f 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
> >> DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
> >> vbasedev.pre_copy_dirty_page_tracking,
> >> ON_OFF_AUTO_ON),
> >> + DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
> >> + vbasedev.allow_pre_copy, true),
> >> DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
> >> display, ON_OFF_AUTO_OFF),
> >> DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>
* Re: [PATCH v5 8/9] vfio/migration: Add x-allow-pre-copy VFIO device property
2023-06-05 14:56 ` Alex Williamson
@ 2023-06-06 11:59 ` Avihai Horon
2023-06-06 13:40 ` Cédric Le Goater
0 siblings, 1 reply; 25+ messages in thread
From: Avihai Horon @ 2023-06-06 11:59 UTC (permalink / raw)
To: Alex Williamson
Cc: qemu-devel, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On 05/06/2023 17:56, Alex Williamson wrote:
> On Sun, 4 Jun 2023 12:33:43 +0300
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> On 01/06/2023 23:22, Alex Williamson wrote:
>>> On Tue, 30 May 2023 17:48:20 +0300
>>> Avihai Horon <avihaih@nvidia.com> wrote:
>>>
>>>> Add a new VFIO device property x-allow-pre-copy to keep migration
>>>> compatibility to/from older QEMU versions that don't have VFIO pre-copy
>>>> support.
>>> This doesn't make sense to me, vfio migration is not currently
>>> supported, it can only be enabled via an experimental flag. AFAIK we
>>> have no obligation to maintain migration compatibility against
>>> experimental features. Is there any other reason we need a flag to
>>> disable pre-copy?
>> This could give flexibility to migrate between hosts without
>> matching VFIO device kernel drivers. E.g., the source driver doesn't
>> have precopy support but the dest driver does, or vice versa.
> If these are valid scenarios, the protocol should support negotiation
> without requiring an experimental flag to do so.
Thinking again, the two in-tree drivers we have support precopy, and as
you said, this is still experimental, so I assume we can drop this
patch in v6.
>
>>> OTOH, should this series finally remove the experimental migration
>>> flag? Do we require Joao's vIOMMU support to finally make it
>>> supportable? Is there something else?
>> I think that after precopy is accepted we can remove the experimental
>> flag, as we'll have the major parts of VFIO migration upstream.
>> After that we will still need to add Joao's vIOMMU support and P2P support.
>> Do you want me to add a patch to this series that makes VFIO migration
>> non-experimental?
> I'd keep it as a separate patch with a clearly described dependency on
> this series so that we can discuss it separately.
Sure, so I will post it separately.
Thanks.
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>> Reviewed-by: Cédric Le Goater <clg@redhat.com>
>>>> ---
>>>> include/hw/vfio/vfio-common.h | 1 +
>>>> hw/core/machine.c | 1 +
>>>> hw/vfio/migration.c | 3 ++-
>>>> hw/vfio/pci.c | 2 ++
>>>> 4 files changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>> index 1db901c194..a53ecbe2e0 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -146,6 +146,7 @@ typedef struct VFIODevice {
>>>> VFIOMigration *migration;
>>>> Error *migration_blocker;
>>>> OnOffAuto pre_copy_dirty_page_tracking;
>>>> + bool allow_pre_copy;
>>>> bool dirty_pages_supported;
>>>> bool dirty_tracking;
>>>> } VFIODevice;
>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>>> index 1000406211..64ac3fe38e 100644
>>>> --- a/hw/core/machine.c
>>>> +++ b/hw/core/machine.c
>>>> @@ -41,6 +41,7 @@
>>>>
>>>> GlobalProperty hw_compat_8_0[] = {
>>>> { "migration", "multifd-flush-after-each-section", "on"},
>>>> + { "vfio-pci", "x-allow-pre-copy", "false" },
>>>> };
>>>> const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>>>>
>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>> index d8f6a22ae1..cb6923ed3f 100644
>>>> --- a/hw/vfio/migration.c
>>>> +++ b/hw/vfio/migration.c
>>>> @@ -323,7 +323,8 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
>>>> {
>>>> VFIOMigration *migration = vbasedev->migration;
>>>>
>>>> - return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>>>> + return vbasedev->allow_pre_copy &&
>>>> + migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>>>> }
>>>>
>>>> /* ---------------------------------------------------------------------- */
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 73874a94de..c69813af7f 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>>>> DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
>>>> vbasedev.pre_copy_dirty_page_tracking,
>>>> ON_OFF_AUTO_ON),
>>>> + DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
>>>> + vbasedev.allow_pre_copy, true),
>>>> DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>>>> display, ON_OFF_AUTO_OFF),
>>>> DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
* Re: [PATCH v5 8/9] vfio/migration: Add x-allow-pre-copy VFIO device property
2023-06-06 11:59 ` Avihai Horon
@ 2023-06-06 13:40 ` Cédric Le Goater
2023-06-07 7:41 ` Avihai Horon
0 siblings, 1 reply; 25+ messages in thread
From: Cédric Le Goater @ 2023-06-06 13:40 UTC (permalink / raw)
To: Avihai Horon, Alex Williamson
Cc: qemu-devel, Eduardo Habkost, Marcel Apfelbaum,
Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
Hello Avihai
On 6/6/23 13:59, Avihai Horon wrote:
>
> On 05/06/2023 17:56, Alex Williamson wrote:
>> On Sun, 4 Jun 2023 12:33:43 +0300
>> Avihai Horon <avihaih@nvidia.com> wrote:
>>
>>> On 01/06/2023 23:22, Alex Williamson wrote:
>>>> On Tue, 30 May 2023 17:48:20 +0300
>>>> Avihai Horon <avihaih@nvidia.com> wrote:
>>>>
>>>>> Add a new VFIO device property x-allow-pre-copy to keep migration
>>>>> compatibility to/from older QEMU versions that don't have VFIO pre-copy
>>>>> support.
>>>> This doesn't make sense to me, vfio migration is not currently
>>>> supported, it can only be enabled via an experimental flag. AFAIK we
>>>> have no obligation to maintain migration compatibility against
>>>> experimental features. Is there any other reason we need a flag to
>>>> disable pre-copy?
>>> This could give flexibility to migrate between hosts without
>>> matching VFIO device kernel drivers. E.g., the source driver doesn't
>>> have precopy support but the dest driver does, or vice versa.
>> If these are valid scenarios, the protocol should support negotiation
>> without requiring an experimental flag to do so.
>
> Thinking again, the two in-tree drivers we have support precopy, and as you said, this is still experimental, so I assume we can drop this patch in v6.
No need to resend if there are no other changes.
Thanks,
C.
>
>>
>>>> OTOH, should this series finally remove the experimental migration
>>>> flag? Do we require Joao's vIOMMU support to finally make it
>>>> supportable? Is there something else?
>>> I think that after precopy is accepted we can remove the experimental
>>> flag, as we'll have the major parts of VFIO migration upstream.
>>> After that we will still need to add Joao's vIOMMU support and P2P support.
>>> Do you want me to add a patch to this series that makes VFIO migration
>>> non-experimental?
>> I'd keep it as a separate patch with a clearly described dependency on
>> this series so that we can discuss it separately.
>
> Sure, so I will post it separately.
>
> Thanks.
>
>>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>>> Reviewed-by: Cédric Le Goater <clg@redhat.com>
>>>>> ---
>>>>> include/hw/vfio/vfio-common.h | 1 +
>>>>> hw/core/machine.c | 1 +
>>>>> hw/vfio/migration.c | 3 ++-
>>>>> hw/vfio/pci.c | 2 ++
>>>>> 4 files changed, 6 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>>>> index 1db901c194..a53ecbe2e0 100644
>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>> @@ -146,6 +146,7 @@ typedef struct VFIODevice {
>>>>> VFIOMigration *migration;
>>>>> Error *migration_blocker;
>>>>> OnOffAuto pre_copy_dirty_page_tracking;
>>>>> + bool allow_pre_copy;
>>>>> bool dirty_pages_supported;
>>>>> bool dirty_tracking;
>>>>> } VFIODevice;
>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>>>> index 1000406211..64ac3fe38e 100644
>>>>> --- a/hw/core/machine.c
>>>>> +++ b/hw/core/machine.c
>>>>> @@ -41,6 +41,7 @@
>>>>>
>>>>> GlobalProperty hw_compat_8_0[] = {
>>>>> { "migration", "multifd-flush-after-each-section", "on"},
>>>>> + { "vfio-pci", "x-allow-pre-copy", "false" },
>>>>> };
>>>>> const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>>>>>
>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>>> index d8f6a22ae1..cb6923ed3f 100644
>>>>> --- a/hw/vfio/migration.c
>>>>> +++ b/hw/vfio/migration.c
>>>>> @@ -323,7 +323,8 @@ static bool vfio_precopy_supported(VFIODevice *vbasedev)
>>>>> {
>>>>> VFIOMigration *migration = vbasedev->migration;
>>>>>
>>>>> - return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>>>>> + return vbasedev->allow_pre_copy &&
>>>>> + migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>>>>> }
>>>>>
>>>>> /* ---------------------------------------------------------------------- */
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index 73874a94de..c69813af7f 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>>>>> DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
>>>>> vbasedev.pre_copy_dirty_page_tracking,
>>>>> ON_OFF_AUTO_ON),
>>>>> + DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
>>>>> + vbasedev.allow_pre_copy, true),
>>>>> DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>>>>> display, ON_OFF_AUTO_OFF),
>>>>> DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>
* Re: [PATCH v5 8/9] vfio/migration: Add x-allow-pre-copy VFIO device property
2023-06-06 13:40 ` Cédric Le Goater
@ 2023-06-07 7:41 ` Avihai Horon
0 siblings, 0 replies; 25+ messages in thread
From: Avihai Horon @ 2023-06-07 7:41 UTC (permalink / raw)
To: Cédric Le Goater, Alex Williamson
Cc: qemu-devel, Eduardo Habkost, Marcel Apfelbaum,
Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On 06/06/2023 16:40, Cédric Le Goater wrote:
> Hello Avihai
>
> On 6/6/23 13:59, Avihai Horon wrote:
>>
>> On 05/06/2023 17:56, Alex Williamson wrote:
>>> On Sun, 4 Jun 2023 12:33:43 +0300
>>> Avihai Horon <avihaih@nvidia.com> wrote:
>>>
>>>> On 01/06/2023 23:22, Alex Williamson wrote:
>>>>> On Tue, 30 May 2023 17:48:20 +0300
>>>>> Avihai Horon <avihaih@nvidia.com> wrote:
>>>>>
>>>>>> Add a new VFIO device property x-allow-pre-copy to keep migration
>>>>>> compatibility to/from older QEMU versions that don't have VFIO
>>>>>> pre-copy
>>>>>> support.
>>>>> This doesn't make sense to me, vfio migration is not currently
>>>>> supported, it can only be enabled via an experimental flag. AFAIK we
>>>>> have no obligation to maintain migration compatibility against
>>>>> experimental features. Is there any other reason we need a flag to
>>>>> disable pre-copy?
>>>> This could give flexibility to migrate between hosts without
>>>> matching VFIO device kernel drivers. E.g., the source driver doesn't
>>>> have precopy support but the dest driver does, or vice versa.
>>> If these are valid scenarios, the protocol should support negotiation
>>> without requiring an experimental flag to do so.
>>
>> Thinking again, the two in-tree drivers we have support precopy, and
>> as you said, this is still experimental, so I assume we can drop
>> this patch in v6.
>
> No need to resend if there are no other changes.
I think that just the last paragraph of the commit message in patch #9
should be removed.
Thanks.
>
>>
>>>
>>>>> OTOH, should this series finally remove the experimental migration
>>>>> flag? Do we require Joao's vIOMMU support to finally make it
>>>>> supportable? Is there something else?
>>>> I think that after precopy is accepted we can remove the experimental
>>>> flag, as we'll have the major parts of VFIO migration upstream.
>>>> After that we will still need to add Joao's vIOMMU support and P2P
>>>> support.
>>>> Do you want me to add a patch to this series that makes VFIO migration
>>>> non-experimental?
>>> I'd keep it as a separate patch with a clearly described dependency on
>>> this series so that we can discuss it separately.
>>
>> Sure, so I will post it separately.
>>
>> Thanks.
>>
>>>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>>>> Reviewed-by: Cédric Le Goater <clg@redhat.com>
>>>>>> ---
>>>>>> include/hw/vfio/vfio-common.h | 1 +
>>>>>> hw/core/machine.c | 1 +
>>>>>> hw/vfio/migration.c | 3 ++-
>>>>>> hw/vfio/pci.c | 2 ++
>>>>>> 4 files changed, 6 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/include/hw/vfio/vfio-common.h
>>>>>> b/include/hw/vfio/vfio-common.h
>>>>>> index 1db901c194..a53ecbe2e0 100644
>>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>>> @@ -146,6 +146,7 @@ typedef struct VFIODevice {
>>>>>> VFIOMigration *migration;
>>>>>> Error *migration_blocker;
>>>>>> OnOffAuto pre_copy_dirty_page_tracking;
>>>>>> + bool allow_pre_copy;
>>>>>> bool dirty_pages_supported;
>>>>>> bool dirty_tracking;
>>>>>> } VFIODevice;
>>>>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>>>>> index 1000406211..64ac3fe38e 100644
>>>>>> --- a/hw/core/machine.c
>>>>>> +++ b/hw/core/machine.c
>>>>>> @@ -41,6 +41,7 @@
>>>>>>
>>>>>> GlobalProperty hw_compat_8_0[] = {
>>>>>> { "migration", "multifd-flush-after-each-section", "on"},
>>>>>> + { "vfio-pci", "x-allow-pre-copy", "false" },
>>>>>> };
>>>>>> const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);
>>>>>>
>>>>>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>>>>>> index d8f6a22ae1..cb6923ed3f 100644
>>>>>> --- a/hw/vfio/migration.c
>>>>>> +++ b/hw/vfio/migration.c
>>>>>> @@ -323,7 +323,8 @@ static bool vfio_precopy_supported(VFIODevice
>>>>>> *vbasedev)
>>>>>> {
>>>>>> VFIOMigration *migration = vbasedev->migration;
>>>>>>
>>>>>> - return migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>>>>>> + return vbasedev->allow_pre_copy &&
>>>>>> + migration->mig_flags & VFIO_MIGRATION_PRE_COPY;
>>>>>> }
>>>>>>
>>>>>> /*
>>>>>> ----------------------------------------------------------------------
>>>>>> */
>>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>>> index 73874a94de..c69813af7f 100644
>>>>>> --- a/hw/vfio/pci.c
>>>>>> +++ b/hw/vfio/pci.c
>>>>>> @@ -3335,6 +3335,8 @@ static Property vfio_pci_dev_properties[] = {
>>>>>> DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking",
>>>>>> VFIOPCIDevice,
>>>>>> vbasedev.pre_copy_dirty_page_tracking,
>>>>>> ON_OFF_AUTO_ON),
>>>>>> + DEFINE_PROP_BOOL("x-allow-pre-copy", VFIOPCIDevice,
>>>>>> + vbasedev.allow_pre_copy, true),
>>>>>> DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
>>>>>> display, ON_OFF_AUTO_OFF),
>>>>>> DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
>>
>
* [PATCH v5 9/9] vfio/migration: Add support for switchover ack capability
2023-05-30 14:48 [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
` (7 preceding siblings ...)
2023-05-30 14:48 ` [PATCH v5 8/9] vfio/migration: Add x-allow-pre-copy VFIO device property Avihai Horon
@ 2023-05-30 14:48 ` Avihai Horon
2023-05-30 15:15 ` Cédric Le Goater
2023-06-16 9:35 ` [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support YangHang Liu
9 siblings, 1 reply; 25+ messages in thread
From: Avihai Horon @ 2023-05-30 14:48 UTC (permalink / raw)
To: qemu-devel
Cc: Alex Williamson, Cédric Le Goater, Eduardo Habkost,
Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Avihai Horon,
Kirti Wankhede, Tarun Gupta, Joao Martins
Loading of a VFIO device's data can take a substantial amount of time as
the device may need to allocate resources, prepare internal data
structures, etc. This can increase migration downtime, especially for
VFIO devices with a lot of resources.
To solve this, VFIO migration uAPI defines "initial bytes" as part of
its precopy data stream. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, it can be used to
transfer device metadata to pre-allocate resources in the destination.
However, for this to work we need to make sure that all initial bytes
are sent and loaded in the destination before the source VM is stopped.
Use migration switchover ack capability to make sure a VFIO device's
initial bytes are sent and loaded in the destination before the source
stops the VM and attempts to complete the migration.
This can significantly reduce migration downtime for some devices.
As precopy support and precopy initial bytes support come together in
VFIO migration, use x-allow-pre-copy device property to control usage of
this feature as well.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
docs/devel/vfio-migration.rst | 10 +++++++++
include/hw/vfio/vfio-common.h | 1 +
hw/vfio/migration.c | 39 ++++++++++++++++++++++++++++++++++-
3 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index e896b2a673..b433cb5bb2 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -16,6 +16,13 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
VFIO_DEVICE_FEATURE_MIGRATION ioctl.
+When pre-copy is supported, it's possible to further reduce downtime by
+enabling "switchover-ack" migration capability.
+VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
+and recommends that the initial bytes are sent and loaded in the destination
+before stopping the source VM. Enabling this migration capability will
+guarantee that and thus, can potentially reduce downtime even further.
+
Note that currently VFIO migration is supported only for a single device. This
is due to VFIO migration's lack of P2P support. However, P2P support is planned
to be added later on.
@@ -45,6 +52,9 @@ VFIO implements the device hooks for the iterative approach as follows:
* A ``save_live_iterate`` function that reads the VFIO device's data from the
vendor driver during iterative pre-copy phase.
+* A ``switchover_ack_needed`` function that checks if the VFIO device uses
+ "switchover-ack" migration capability when this capability is enabled.
+
* A ``save_state`` function to save the device config space if it is present.
* A ``save_live_complete_precopy`` function that sets the VFIO device in
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index a53ecbe2e0..3677aba4f4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -69,6 +69,7 @@ typedef struct VFIOMigration {
uint64_t mig_flags;
uint64_t precopy_init_size;
uint64_t precopy_dirty_size;
+ bool initial_data_sent;
} VFIOMigration;
typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index cb6923ed3f..53f5787f0e 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -18,6 +18,8 @@
#include "sysemu/runstate.h"
#include "hw/vfio/vfio-common.h"
#include "migration/migration.h"
+#include "migration/options.h"
+#include "migration/savevm.h"
#include "migration/vmstate.h"
#include "migration/qemu-file.h"
#include "migration/register.h"
@@ -45,6 +47,7 @@
#define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xffffffffef100002ULL)
#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
#define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
+#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
/*
* This is an arbitrary size based on migration of mlx5 devices, where typically
@@ -385,6 +388,7 @@ static void vfio_save_cleanup(void *opaque)
migration->data_buffer = NULL;
migration->precopy_init_size = 0;
migration->precopy_dirty_size = 0;
+ migration->initial_data_sent = false;
vfio_migration_cleanup(vbasedev);
trace_vfio_save_cleanup(vbasedev->name);
}
@@ -458,10 +462,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
if (data_size < 0) {
return data_size;
}
- qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
vfio_update_estimated_pending_data(migration, data_size);
+ if (migrate_switchover_ack() && !migration->precopy_init_size &&
+ !migration->initial_data_sent) {
+ qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
+ migration->initial_data_sent = true;
+ } else {
+ qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+ }
+
trace_vfio_save_iterate(vbasedev->name, migration->precopy_init_size,
migration->precopy_dirty_size);
@@ -580,6 +591,24 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
}
break;
}
+ case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
+ {
+ if (!vfio_precopy_supported(vbasedev) ||
+ !migrate_switchover_ack()) {
+ error_report("%s: Received INIT_DATA_SENT but switchover ack "
+ "is not used", vbasedev->name);
+ return -EINVAL;
+ }
+
+ ret = qemu_loadvm_approve_switchover();
+ if (ret) {
+ error_report(
+ "%s: qemu_loadvm_approve_switchover failed, err=%d (%s)",
+ vbasedev->name, ret, strerror(-ret));
+ }
+
+ return ret;
+ }
default:
error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
return -EINVAL;
@@ -594,6 +623,13 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
return ret;
}
+static bool vfio_switchover_ack_needed(void *opaque)
+{
+ VFIODevice *vbasedev = opaque;
+
+ return vfio_precopy_supported(vbasedev);
+}
+
static const SaveVMHandlers savevm_vfio_handlers = {
.save_setup = vfio_save_setup,
.save_cleanup = vfio_save_cleanup,
@@ -606,6 +642,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
.load_setup = vfio_load_setup,
.load_cleanup = vfio_load_cleanup,
.load_state = vfio_load_state,
+ .switchover_ack_needed = vfio_switchover_ack_needed,
};
/* ---------------------------------------------------------------------- */
--
2.26.3
* Re: [PATCH v5 9/9] vfio/migration: Add support for switchover ack capability
2023-05-30 14:48 ` [PATCH v5 9/9] vfio/migration: Add support for switchover ack capability Avihai Horon
@ 2023-05-30 15:15 ` Cédric Le Goater
0 siblings, 0 replies; 25+ messages in thread
From: Cédric Le Goater @ 2023-05-30 15:15 UTC (permalink / raw)
To: Avihai Horon, qemu-devel
Cc: Alex Williamson, Eduardo Habkost, Marcel Apfelbaum,
Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
On 5/30/23 16:48, Avihai Horon wrote:
> Loading of a VFIO device's data can take a substantial amount of time as
> the device may need to allocate resources, prepare internal data
> structures, etc. This can increase migration downtime, especially for
> VFIO devices with a lot of resources.
>
> To solve this, VFIO migration uAPI defines "initial bytes" as part of
> its precopy data stream. Initial bytes can be used in various ways to
> improve VFIO migration performance. For example, it can be used to
> transfer device metadata to pre-allocate resources in the destination.
> However, for this to work we need to make sure that all initial bytes
> are sent and loaded in the destination before the source VM is stopped.
>
> Use migration switchover ack capability to make sure a VFIO device's
> initial bytes are sent and loaded in the destination before the source
> stops the VM and attempts to complete the migration.
> This can significantly reduce migration downtime for some devices.
>
> As precopy support and precopy initial bytes support come together in
> VFIO migration, use x-allow-pre-copy device property to control usage of
> this feature as well.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Thanks,
C.
* Re: [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support
2023-05-30 14:48 [PATCH v5 0/9] migration: Add switchover ack capability and VFIO precopy support Avihai Horon
` (8 preceding siblings ...)
2023-05-30 14:48 ` [PATCH v5 9/9] vfio/migration: Add support for switchover ack capability Avihai Horon
@ 2023-06-16 9:35 ` YangHang Liu
9 siblings, 0 replies; 25+ messages in thread
From: YangHang Liu @ 2023-06-16 9:35 UTC (permalink / raw)
To: Avihai Horon
Cc: qemu-devel, Alex Williamson, Cédric Le Goater,
Eduardo Habkost, Marcel Apfelbaum, Philippe Mathieu-Daudé,
Yanan Wang, Juan Quintela, Peter Xu, Leonardo Bras, Eric Blake,
Markus Armbruster, Thomas Huth, Laurent Vivier, Paolo Bonzini,
Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Kirti Wankhede,
Tarun Gupta, Joao Martins
Tested-by: YangHang Liu <yanghliu@redhat.com>
On Wed, May 31, 2023 at 1:01 AM Avihai Horon <avihaih@nvidia.com> wrote:
>
> Hello everyone,
>
> This is v5 of the switchover ack series.
>
> Changes from v4 [6]:
> * Removed superfluous '"' in vfio_save_iterate() trace. (Cedric)
> * Removed VFIOMigration->switchover_ack_needed and computed it locally
> when needed. (Cedric)
> * Added R-bs.
>
> Changes from v3 [5]:
> * Rebased on latest master branch.
> * Simplified switchover ack logic (call switchover_ack_needed only in
> destination). (Peter)
> * Moved caching of VFIO migration flags to a separate patch. (Cedric)
> * Moved adding of x-allow-pre-copy property to a separate patch. (Cedric)
> * Reset VFIOMigration->precopy_{init,dirty}_size in vfio_query_precopy_size()
> and in vfio_save_cleanup(). (Cedric)
> * Added a reference to VFIO uAPI in vfio_save_block() ENOMSG comment. (Cedric)
> * Added VFIOMigration->precopy_{init,dirty}_size to trace_vfio_save_iterate(). (Cedric)
> * Adapted VFIO migration to switchover ack logic simplification:
> - Checked migrate_switchover_ack() in vfio_{save,load}_setup() and set
> VFIOMigration->switchover_ack_needed accordingly.
> - vfio_switchover_ack_needed() doesn't set VFIOMigration->switchover_ack_needed
> and only returns its value.
> * Move VFIOMigration->switchover_ack_needed = false to vfio_migration_cleanup()
> so it will be set to false both in src and dest.
> * Fixed a few typos/coding style. (Peter/Cedric)
> * Added R-b/A-b (didn't add Cedric's R-b on patch #7 as switchover ack
> changes in patch #2 introduced some changes to patch #7 as well).
>
> Changes from v2 [4]:
> * Rebased on latest master branch.
> * Changed the capability name to "switchover-ack" and the related
> code/docs accordingly. (Peter)
> * Added a counter for the number of switchover ack users in the source
> and used it to skip switchover ack if there are no users (instead of
> setting the switchover acked flag to true). (Peter)
> * Added R-bs.
>
> Changes from v1 [3]:
> * Rebased on latest master branch.
> * Updated to latest QAPI doc comment conventions and refined
> QAPI docs and capability error message. (Markus)
> * Followed Peter/Juan suggestion and removed the handshake between
> source and destination.
> Now the capability must be set on both source and destination.
> Compatibility of this feature between different QEMU versions or
> different host capabilities (i.e., kernel) is achieved in the regular
> way of device properties and hw_compat_x_y.
> * Replaced is_initial_data_active() and initial_data_loaded()
> SaveVMHandlers handlers with a notification mechanism. (Peter)
> * Set the capability also in destination in the migration test.
> * Added VFIO device property x-allow-pre-copy to be able to preserve
> compatibility between different QEMU versions or different host
> capabilities (i.e., kernel).
> * Changed VFIO precopy initial data implementation according to the
> above changes.
> * Documented VFIO precopy initial data support in VFIO migration
> documentation.
> * Added R-bs.
>
> ===
>
> This series adds a new migration capability called "switchover ack". The
> purpose of this capability is to reduce migration downtime in cases
> where loading of migration data in the destination can take a lot of
> time, such as with VFIO migration data.
>
> The series then moves to add precopy support and switchover ack support
> for VFIO migration.
>
> Switchover ack is used by VFIO migration, but other migrated devices can
> add support for it and use it as well.
>
> === Background ===
>
> Migration downtime estimation is calculated based on bandwidth and
> remaining migration data. This assumes that loading of migration data in
> the destination takes a negligible amount of time and that downtime
> depends only on network speed.
>
> While this may be true for RAM, it's not necessarily true for other
> migrated devices. For example, loading the data of a VFIO device in the
> destination might require the device to allocate resources and prepare
> internal data structures, which can take a significant amount of time.
>
> This poses a problem, as the source may think that the remaining
> migration data is small enough to meet the downtime limit, so it will
> stop the VM and complete the migration, but in fact sending and loading
> the data in the destination may take longer than the downtime limit.
>
> To solve this, VFIO migration uAPI defines "initial bytes" as part of
> its precopy stream [1]. Initial bytes can be used in various ways to
> improve VFIO migration performance. For example, it can be used to
> transfer device metadata to pre-allocate resources in the destination.
> However, for this to work we need to make sure that all initial bytes
> are sent and loaded in the destination before the source VM is stopped.
>
> The new switchover ack migration capability helps us achieve this.
> It prevents the source from stopping the VM and completing the migration
> until an ACK is received from the destination that it's OK to do so.
> Thus, a VFIO device can make sure that its initial bytes were sent
> and loaded in the destination before the source VM is stopped.
>
> Note that this relies on the return path capability to communicate from
> the destination back to the source.
>
> === Flow of operation ===
>
> To use switchover ack, the capability must be enabled in both the source
> and the destination.
>
> During migration setup, migration code in the destination calls the
> switchover_ack_needed() SaveVMHandlers handler of the migrated devices
> to check if switchover ack is used by them.
> A "switchover_ack_pending_num" counter is increased for each migrated
> device that supports this feature. It will be used later to mark when an
> ACK should be sent to the source.
>
> Migration is active and the source starts to send precopy data as usual.
> In the destination, when a migrated device thinks it's OK to do
> switchover, it notifies the migration code about it and the
> "switchover_ack_pending_num" counter is decreased. For example, for a
> VFIO device it's when the device receives and loads its initial bytes.
>
> When the "switchover_ack_pending_num" counter reaches zero, it means
> that all devices agree to do switchover and an ACK is sent to the
> source, which will now be able to complete the migration when
> appropriate.
>
> === Test results ===
>
> The below table shows the downtime of two identical migrations. In the
> first migration switchover ack is disabled and in the second it is
> enabled. The migrated VM is assigned with a mlx5 VFIO device which has
> 300MB of device data to be migrated.
>
> +----------------------+-----------------------+----------+
> | Switchover ack | VFIO device data size | Downtime |
> +----------------------+-----------------------+----------+
> | Disabled | 300MB | 1900ms |
> | Enabled | 300MB | 420ms |
> +----------------------+-----------------------+----------+
>
> Switchover ack gives a roughly 4.5 times improvement in downtime.
> The 1480ms difference is the time spent on resource allocation for the
> VFIO device in the destination. Without switchover ack, this time is
> spent when the source VM is stopped and thus the downtime is much
> higher. With switchover ack, the time is spent when the source VM is
> still running.
>
> === Patch breakdown ===
>
> - Patches 1-4 add the switchover ack capability.
> - Patches 5-8 add VFIO migration precopy support. Similar version of
> them was previously sent here [2].
> - Patch 9 adds switchover ack support for VFIO migration.
>
> Thanks for reviewing!
>
> [1]
> https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/vfio.h#L1048
>
> [2]
> https://lore.kernel.org/qemu-devel/20230222174915.5647-3-avihaih@nvidia.com/
>
> [3]
> https://lore.kernel.org/qemu-devel/20230501140141.11743-1-avihaih@nvidia.com/
>
> [4]
> https://lore.kernel.org/qemu-devel/20230517155219.10691-1-avihaih@nvidia.com/
>
> [5]
> https://lore.kernel.org/qemu-devel/20230521151808.24804-1-avihaih@nvidia.com/
>
> [6]
> https://lore.kernel.org/qemu-devel/20230528140652.8693-1-avihaih@nvidia.com/
>
> Avihai Horon (9):
> migration: Add switchover ack capability
> migration: Implement switchover ack logic
> migration: Enable switchover ack capability
> tests: Add migration switchover ack capability test
> vfio/migration: Refactor vfio_save_block() to return saved data size
> vfio/migration: Store VFIO migration flags in VFIOMigration
> vfio/migration: Add VFIO migration pre-copy support
> vfio/migration: Add x-allow-pre-copy VFIO device property
> vfio/migration: Add support for switchover ack capability
>
> docs/devel/vfio-migration.rst | 45 +++++--
> qapi/migration.json | 12 +-
> include/hw/vfio/vfio-common.h | 5 +
> include/migration/register.h | 2 +
> migration/migration.h | 14 +++
> migration/options.h | 1 +
> migration/savevm.h | 1 +
> hw/core/machine.c | 1 +
> hw/vfio/common.c | 6 +-
> hw/vfio/migration.c | 221 +++++++++++++++++++++++++++++++---
> hw/vfio/pci.c | 2 +
> migration/migration.c | 32 ++++-
> migration/options.c | 17 +++
> migration/savevm.c | 54 +++++++++
> tests/qtest/migration-test.c | 26 ++++
> hw/vfio/trace-events | 4 +-
> migration/trace-events | 3 +
> 17 files changed, 413 insertions(+), 33 deletions(-)
>
> --
> 2.26.3
>
>