All of lore.kernel.org
 help / color / mirror / Atom feed
* VFIO Migration
@ 2020-11-02 11:11 Stefan Hajnoczi
  2020-11-02 12:28 ` Cornelia Huck
                   ` (6 more replies)
  0 siblings, 7 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-02 11:11 UTC (permalink / raw)
  To: Jason Wang
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Alex Williamson, qemu-devel, Kirti Wankhede,
	Thanos Makatos, Felipe Franciosi, Paolo Bonzini,
	Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 9334 bytes --]

There is discussion about VFIO migration in the "Re: Out-of-Process
Device Emulation session at KVM Forum 2020" thread. The current status
is that Kirti proposed a VFIO device region type for saving and loading
device state. There is currently no guidance on migrating between
different device versions or device implementations from different
vendors. This is known to be non-trivial and raised discussion about
whether it should really be handled by VFIO or centralized in QEMU.

Below is a document that describes how to ensure migration compatibility
in VFIO. It does not require changes to the VFIO migration interface. It
can be used for both VFIO/mdev kernel devices and vfio-user devices.

The idea is that the device state blob is opaque to the VMM but the same
level of migration compatibility that exists today is still available.

I hope this will help us reach consensus and let us discuss specifics.

If you followed the previous discussion, I changed the approach from
sending a magic constant in the device state blob to identifying device
models by URIs. Therefore the device state structure does not need to be
defined here - the critical information for ensuring device migration
compatibility is the device model and configuration defined below.

Stefan
---
VFIO Migration
==============
This document describes how to save and load VFIO device states. Saving a
device state produces a snapshot of a VFIO device's state that can be loaded
again at a later point in time to resume the device from the snapshot.

The data representation of the device state is outside the scope of this
document.

Overview
--------
The purpose of device states is to save the device at a point in time and then
restore the device back to the saved state later. This is more challenging than
it first appears.

The process of saving a device state and loading it later is called
*migration*. The state may be loaded by the same device that saved it or by a
new instance of the device, possibly running on a different computer.

It must be possible to migrate to a newer implementation of the device
as well as to an older implementation of the device. This allows users
to upgrade and roll back their systems.

Migration can fail if loading the device state is not possible. It should fail
early with a clear error message. It must not appear to complete but leave the
device inoperable due to a migration problem.

The rest of this document describes how these requirements can be met.

Device Models
-------------
Devices have a *hardware interface* consisting of hardware registers,
interrupts, and so on.

The hardware interface together with the device state representation is called
a *device model*. Device models can be assigned URIs such as
https://qemu.org/devices/e1000e to uniquely identify them.

Multiple implementations of a device model may exist. They are they are
interchangeable if they follow the same hardware interface and device
state representation.

Multiple implementations of the same hardware interface may exist with
different device state representations, in which case the device models are not
interchangeable and must be assigned different URIs.

Migration is only possible when the same device model is supported by the
*source* and the *destination* devices.

Device Configuration
--------------------
Device models may have parameters that affect the hardware interface or device
state representation. For example, a network card may have a configurable
address filtering table size parameter called ``rx-filter-size``. A
device state saved with ``rx-filter-size=32`` cannot be safely loaded
into a device with ``rx-filter-size=0``, because changing the size from
32 to 0 may disrupt device operation.

A list of configuration parameters is called the *device configuration*.
Migration is expected to succeed when the same device model and configuration
that was used for saving the device state is used again to load it.

Note that not all parameters used to instantiate a device need to be part of
the device configuration. For example, assigning a network card to a specific
physical port is not part of the device configuration since it is not part of
the device's hardware interface or the device state representation. The device
state can be loaded and run on a different physical port without affecting the
operation of the device. Therefore the physical port is not part of the device
configuration.

However, secondary aspects related to the physical port may affect the device's
hardware interface and need to be reflected in the device configuration. The
link speed may depend on the physical port and be reported through the device's
hardware interface. In that case a ``link-speed`` configuration parameter is
required to prevent unexpected changes to the link speed after migration.

Note that the device configuration is a conservative bound on device
states that can be migrated successfully since not all configuration
parameters may be strictly required to match on the source and
destination devices. For example, if the device's hardware interface has
not yet been initialized then changes to the link speed may not be
noticed. However, accurately representing runtime constraints is complex
and risks introducing migration bugs, so no attempt is made to support
them to achieve more relaxed bounds on successful migrations.

Device Versions
---------------
As a device evolves, the number of configuration parameters required may become
inconvenient for users to express in full. A device configuration can be
aliased by a *device version*, which is a shorthand for the full device
configuration. This makes it easy to apply a standard device configuration
without listing every configuration parameter explicitly.

For example, if address filtering support was added to a network card then
device versions and the corresponding configurations may look like this:
* ``version=1`` - Behaves as if ``rx-filter-size=0``
* ``version=2`` - ``rx-filter-size=32``

Device States
-------------
The details of the device state representation are not covered in this document
but the general requirements are discussed here.

The device state consists of data accessible through the device's hardware
interface and internal state that is needed to restore device operation.
State in the hardware interface includes the values of hardware registers.
An example of internal state is an index value needed to avoid processing
queued requests more than once.

Changes can be made to the device state representation as follows. Each change
to device state must have a corresponding device configuration parameter that
allows the change to toggled:

* When the parameter is disabled the hardware interface and device state
  representation are unchanged. This allows old device states to be loaded.

* When the parameter is enabled the change comes into effect.

* The parameter's default value disables the change. Therefore old versions do
  not have to explicitly specify the parameter.

The following example illustrates migration from an old device
implementation to a new one. A version=1 network card is migrated to a
new device implementation that is also capable of version=2 and adds the
rx-filter-size=32 parameter. The new device is instantiated with
version=1, which disables rx-filter-size and is capable of loading the
version=1 device state. The migration completes successfully but note
the device is still operating at version=1 level in the new device.

The following example illustrates migration from a new device
implementation back to an older one. The new device implementation
supports version=1 and version=2. The old device implementation supports
version=1 only. Therefore the device can only be migrated when
instantiated with version=1 or the equivalent full configuration
parameters.

Orchestrating Migrations
------------------------
The following steps must be followed to migrate devices:

1. Check that the source and destination devices support the same device model.

2. Check that the destination device supports the source device's
   configuration. Each configuration parameter must be accepted by the
   destination in order to ensure that it will be possible to load the device
   state.

3. The device state is saved on the source and loaded on the destination.

4. If migration succeeds then the destination resumes operation and the source
   must not resume operation. If the migration fails then the source resumes
   operation and the destination must not resume operation.

VFIO Implementation
-------------------
The following applies both to kernel VFIO/mdev drivers and vfio-user device
backends.

Devices are instantiated based on a version and/or configuration parameters:
* ``version=1`` - use the device configuration aliased by version 1
* ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
* ``rx-filter-size=0`` - directly set configuration parameters without using a version

Device creation fails if the version and/or configuration parameters are not
supported.

There must be a mechanism to query the "latest" configuration for a device
model. It may simply report the ``version=5`` where 5 is the latest version but
it could also report all configuration parameters instead of using a version
alias.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-02 11:11 VFIO Migration Stefan Hajnoczi
@ 2020-11-02 12:28 ` Cornelia Huck
  2020-11-02 14:56   ` Stefan Hajnoczi
  2020-11-02 19:38 ` Alex Williamson
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: Cornelia Huck @ 2020-11-02 12:28 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, qemu-devel, Kirti Wankhede,
	Dr. David Alan Gilbert, Alex Williamson, Paolo Bonzini,
	Felipe Franciosi, Thanos Makatos

On Mon, 2 Nov 2020 11:11:53 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> There is discussion about VFIO migration in the "Re: Out-of-Process
> Device Emulation session at KVM Forum 2020" thread. The current status
> is that Kirti proposed a VFIO device region type for saving and loading
> device state. There is currently no guidance on migrating between
> different device versions or device implementations from different
> vendors. This is known to be non-trivial and raised discussion about
> whether it should really be handled by VFIO or centralized in QEMU.

Ok, so I won't dig through that thread, but instead comment here.

> 
> Below is a document that describes how to ensure migration compatibility
> in VFIO. It does not require changes to the VFIO migration interface. It
> can be used for both VFIO/mdev kernel devices and vfio-user devices.
> 
> The idea is that the device state blob is opaque to the VMM but the same
> level of migration compatibility that exists today is still available.
> 
> I hope this will help us reach consensus and let us discuss specifics.
> 
> If you followed the previous discussion, I changed the approach from
> sending a magic constant in the device state blob to identifying device
> models by URIs. Therefore the device state structure does not need to be
> defined here - the critical information for ensuring device migration
> compatibility is the device model and configuration defined below.
> 
> Stefan
> ---
> VFIO Migration
> ==============
> This document describes how to save and load VFIO device states. Saving a
> device state produces a snapshot of a VFIO device's state that can be loaded
> again at a later point in time to resume the device from the snapshot.
> 
> The data representation of the device state is outside the scope of this
> document.

[Is this document supposed to live in the QEMU source tree later?]

> 
> Overview
> --------
> The purpose of device states is to save the device at a point in time and then
> restore the device back to the saved state later. This is more challenging than
> it first appears.
> 
> The process of saving a device state and loading it later is called
> *migration*. The state may be loaded by the same device that saved it or by a
> new instance of the device, possibly running on a different computer.
> 
> It must be possible to migrate to a newer implementation of the device
> as well as to an older implementation of the device. This allows users
> to upgrade and roll back their systems.
> 
> Migration can fail if loading the device state is not possible. It should fail
> early with a clear error message. It must not appear to complete but leave the
> device inoperable due to a migration problem.
> 
> The rest of this document describes how these requirements can be met.
> 
> Device Models
> -------------
> Devices have a *hardware interface* consisting of hardware registers,
> interrupts, and so on.
> 
> The hardware interface together with the device state representation is called
> a *device model*. Device models can be assigned URIs such as
> https://qemu.org/devices/e1000e to uniquely identify them.

Is that something that needs to be put together for every device where we
want to support migration? How do you create the URI?

For mdev devices, would this refer to the "base" device, or to the
device specified by a certain mdev type?

> 
> Multiple implementations of a device model may exist. They are they are
> interchangeable if they follow the same hardware interface and device
> state representation.
> 
> Multiple implementations of the same hardware interface may exist with
> different device state representations, in which case the device models are not
> interchangeable and must be assigned different URIs.
> 
> Migration is only possible when the same device model is supported by the
> *source* and the *destination* devices.
> 
> Device Configuration
> --------------------
> Device models may have parameters that affect the hardware interface or device
> state representation. For example, a network card may have a configurable
> address filtering table size parameter called ``rx-filter-size``. A
> device state saved with ``rx-filter-size=32`` cannot be safely loaded
> into a device with ``rx-filter-size=0``, because changing the size from
> 32 to 0 may disrupt device operation.
> 
> A list of configuration parameters is called the *device configuration*.
> Migration is expected to succeed when the same device model and configuration
> that was used for saving the device state is used again to load it.
> 
> Note that not all parameters used to instantiate a device need to be part of
> the device configuration. For example, assigning a network card to a specific
> physical port is not part of the device configuration since it is not part of
> the device's hardware interface or the device state representation. The device
> state can be loaded and run on a different physical port without affecting the
> operation of the device. Therefore the physical port is not part of the device
> configuration.
> 
> However, secondary aspects related to the physical port may affect the device's
> hardware interface and need to be reflected in the device configuration. The
> link speed may depend on the physical port and be reported through the device's
> hardware interface. In that case a ``link-speed`` configuration parameter is
> required to prevent unexpected changes to the link speed after migration.
> 
> Note that the device configuration is a conservative bound on device
> states that can be migrated successfully since not all configuration
> parameters may be strictly required to match on the source and
> destination devices. For example, if the device's hardware interface has
> not yet been initialized then changes to the link speed may not be
> noticed. However, accurately representing runtime constraints is complex
> and risks introducing migration bugs, so no attempt is made to support
> them to achieve more relaxed bounds on successful migrations.

Do we want a "I know what I'm doing" override?

> 
> Device Versions
> ---------------
> As a device evolves, the number of configuration parameters required may become
> inconvenient for users to express in full. A device configuration can be
> aliased by a *device version*, which is a shorthand for the full device
> configuration. This makes it easy to apply a standard device configuration
> without listing every configuration parameter explicitly.
> 
> For example, if address filtering support was added to a network card then
> device versions and the corresponding configurations may look like this:
> * ``version=1`` - Behaves as if ``rx-filter-size=0``
> * ``version=2`` - ``rx-filter-size=32``

Is versioning supposed to be an ascending number, with a migration from
n->n+m possible, but not the other way around?

Are these device versions supposed to be independent of machine versions?

> 
> Device States
> -------------
> The details of the device state representation are not covered in this document
> but the general requirements are discussed here.
> 
> The device state consists of data accessible through the device's hardware
> interface and internal state that is needed to restore device operation.
> State in the hardware interface includes the values of hardware registers.
> An example of internal state is an index value needed to avoid processing
> queued requests more than once.
> 
> Changes can be made to the device state representation as follows. Each change
> to device state must have a corresponding device configuration parameter that
> allows the change to toggled:

s/to/to be/ :)

> 
> * When the parameter is disabled the hardware interface and device state
>   representation are unchanged. This allows old device states to be loaded.
> 
> * When the parameter is enabled the change comes into effect.
> 
> * The parameter's default value disables the change. Therefore old versions do
>   not have to explicitly specify the parameter.
> 
> The following example illustrates migration from an old device
> implementation to a new one. A version=1 network card is migrated to a
> new device implementation that is also capable of version=2 and adds the
> rx-filter-size=32 parameter. The new device is instantiated with
> version=1, which disables rx-filter-size and is capable of loading the
> version=1 device state. The migration completes successfully but note
> the device is still operating at version=1 level in the new device.
> 
> The following example illustrates migration from a new device
> implementation back to an older one. The new device implementation
> supports version=1 and version=2. The old device implementation supports
> version=1 only. Therefore the device can only be migrated when
> instantiated with version=1 or the equivalent full configuration
> parameters.
> 
> Orchestrating Migrations
> ------------------------
> The following steps must be followed to migrate devices:
> 
> 1. Check that the source and destination devices support the same device model.
> 
> 2. Check that the destination device supports the source device's
>    configuration. Each configuration parameter must be accepted by the
>    destination in order to ensure that it will be possible to load the device
>    state.
> 
> 3. The device state is saved on the source and loaded on the destination.
> 
> 4. If migration succeeds then the destination resumes operation and the source
>    must not resume operation. If the migration fails then the source resumes
>    operation and the destination must not resume operation.
> 
> VFIO Implementation
> -------------------
> The following applies both to kernel VFIO/mdev drivers and vfio-user device
> backends.
> 
> Devices are instantiated based on a version and/or configuration parameters:
> * ``version=1`` - use the device configuration aliased by version 1
> * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> * ``rx-filter-size=0`` - directly set configuration parameters without using a version

I think some of this would be encapsulated in the mdev type for
mediated devices.

> 
> Device creation fails if the version and/or configuration parameters are not
> supported.
> 
> There must be a mechanism to query the "latest" configuration for a device
> model. It may simply report the ``version=5`` where 5 is the latest version but
> it could also report all configuration parameters instead of using a version
> alias.

Thanks for putting this together!



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-02 12:28 ` Cornelia Huck
@ 2020-11-02 14:56   ` Stefan Hajnoczi
  2020-11-04  8:07     ` Gerd Hoffmann
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-02 14:56 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, qemu-devel, Kirti Wankhede,
	Dr. David Alan Gilbert, Alex Williamson, Paolo Bonzini,
	Felipe Franciosi, Thanos Makatos

[-- Attachment #1: Type: text/plain, Size: 7758 bytes --]

On Mon, Nov 02, 2020 at 01:28:44PM +0100, Cornelia Huck wrote:
> On Mon, 2 Nov 2020 11:11:53 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > VFIO Migration
> > ==============
> > This document describes how to save and load VFIO device states. Saving a
> > device state produces a snapshot of a VFIO device's state that can be loaded
> > again at a later point in time to resume the device from the snapshot.
> > 
> > The data representation of the device state is outside the scope of this
> > document.
> 
> [Is this document supposed to live in the QEMU source tree later?]

It should live alongside the VFIO documentation. For vfio-user the spec
will live in qemu.git and we could also keep this document there. The
kernel VFIO/mdev drivers also need this information. They could link to
the QEMU document.

> > Device Models
> > -------------
> > Devices have a *hardware interface* consisting of hardware registers,
> > interrupts, and so on.
> > 
> > The hardware interface together with the device state representation is called
> > a *device model*. Device models can be assigned URIs such as
> > https://qemu.org/devices/e1000e to uniquely identify them.
> 
> Is that something that needs to be put together for every device where we
> want to support migration? How do you create the URI?

Yes. If you are creating a custom device that no one else needs to
emulate then you can simply pick a unique URL:

  https://vendor.com/my-dev

There doesn't need to be anything at the URL. It's just a unique string
that no one else will use and therefore web URLs are handy because no
one else will accidentally pick your string.

If your intention is to define a standard device model that others can
emulate and migrate, then it's good practice to publish a web page about
the device model at the URL, including the hardware datasheet and the
device state representation (e.g. a spec describing the migration data
stream).

For example, https://virtio-spec.org/devices/pci/virtio-net would
contain a link to the VIRTIO specification and the device state
representation. This allows others to implement devices that are
compatible and support migration between implementations. This is
getting beyond the scope of this document, but I imagine the VIRTIO
device state representation would be QEMU's current vmstate
representation so that migration between QEMU and out-of-process devices
is possible...

> For mdev devices, would this refer to the "base" device, or to the
> device specified by a certain mdev type?

The device synthesized by the mdev driver, because that's the
guest-visible hardware interface. I didn't want to say "guest-visible"
or refer to VMs in the document, but maybe that would make things
clearer.

> > Note that the device configuration is a conservative bound on device
> > states that can be migrated successfully since not all configuration
> > parameters may be strictly required to match on the source and
> > destination devices. For example, if the device's hardware interface has
> > not yet been initialized then changes to the link speed may not be
> > noticed. However, accurately representing runtime constraints is complex
> > and risks introducing migration bugs, so no attempt is made to support
> > them to achieve more relaxed bounds on successful migrations.
> 
> Do we want a "I know what I'm doing" override?

I think that could be implementation-defined. Maybe it will be useful if
a device is very broken and you want to offer users a command-line that
allows them to migrate to safety.

> > Device Versions
> > ---------------
> > As a device evolves, the number of configuration parameters required may become
> > inconvenient for users to express in full. A device configuration can be
> > aliased by a *device version*, which is a shorthand for the full device
> > configuration. This makes it easy to apply a standard device configuration
> > without listing every configuration parameter explicitly.
> > 
> > For example, if address filtering support was added to a network card then
> > device versions and the corresponding configurations may look like this:
> > * ``version=1`` - Behaves as if ``rx-filter-size=0``
> > * ``version=2`` - ``rx-filter-size=32``
> 
> Is versioning supposed to be an ascending number, with a migration from
> n->n+m possible, but not the other way around?

The actual version string does not matter. Ascending integers is a
reasonable convention because it's easy to type and for humans to
compare.

Slightly pedantic but important point: migration from n->m is not
possible according to this document. It's always migration from n->n.
Migrating does not upgrade the guest-visible aspects of the device. The
device instance always remains at its current version. Of course the
destination device implementation may contain bug fixes, etc that come
into effect right away, but they don't change the guest-visible hardware
interface or device state representation.

To actually upgrade from n->m the user must explicitly reconfigure the
guest and hotplug or reboot.

In other words, the device version (always stays the same throughout the
lifetime of a device instance) and the device implementation version
(e.g. my-virtio-net-pci-v1.1) are two different concepts.

> Are these device versions supposed to be independent of machine versions?

Yes. Since VFIO devices are passthrough devices that can be implemented
without introducing code into QEMU, they are separate from versioned
machine types. This is similar to going out and buying a PCI adapter and
putting it into a machine. The machine itself may be a Dell Foo Bar
server with a certain hardware spec, but the PCI adapter is a completely
separate device with no relation to the machine type.

> > 
> > Device States
> > -------------
> > The details of the device state representation are not covered in this document
> > but the general requirements are discussed here.
> > 
> > The device state consists of data accessible through the device's hardware
> > interface and internal state that is needed to restore device operation.
> > State in the hardware interface includes the values of hardware registers.
> > An example of internal state is an index value needed to avoid processing
> > queued requests more than once.
> > 
> > Changes can be made to the device state representation as follows. Each change
> > to device state must have a corresponding device configuration parameter that
> > allows the change to toggled:
> 
> s/to/to be/ :)

To be or not to be! Thanks.

> > VFIO Implementation
> > -------------------
> > The following applies both to kernel VFIO/mdev drivers and vfio-user device
> > backends.
> > 
> > Devices are instantiated based on a version and/or configuration parameters:
> > * ``version=1`` - use the device configuration aliased by version 1
> > * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> > * ``rx-filter-size=0`` - directly set configuration parameters without using a version
> 
> I think some of this would be encapsulated in the mdev type for
> mediated devices.

Yes, the device model and configuration need to be provided when
creating the mdev instance. This assumption is built into this design:

You decide the device model and configuration at creation time, not at
migration time. In other words, each device instance is fully specified
at all times and there is no choice of "in which format should we
save/load this?".

This approach is simple and easy to troubleshoot, but if someone can
think of a reason why it's too limited, please share.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-02 11:11 VFIO Migration Stefan Hajnoczi
  2020-11-02 12:28 ` Cornelia Huck
@ 2020-11-02 19:38 ` Alex Williamson
  2020-11-03 11:03   ` Stefan Hajnoczi
  2020-11-03  8:46 ` Jason Wang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2020-11-02 19:38 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, Tian, Kevin, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Zeng, Xin, qemu-devel,
	Dr. David Alan Gilbert, Yan Zhao, Kirti Wankhede, Thanos Makatos,
	Felipe Franciosi, Paolo Bonzini


Cc+ Intel folks as this really bumps into the migration compatibility
discussion[1][2][3]

On Mon, 2 Nov 2020 11:11:53 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> There is discussion about VFIO migration in the "Re: Out-of-Process
> Device Emulation session at KVM Forum 2020" thread. The current status
> is that Kirti proposed a VFIO device region type for saving and loading
> device state. There is currently no guidance on migrating between
> different device versions or device implementations from different
> vendors. This is known to be non-trivial and raised discussion about
> whether it should really be handled by VFIO or centralized in QEMU.
> 
> Below is a document that describes how to ensure migration compatibility
> in VFIO. It does not require changes to the VFIO migration interface. It
> can be used for both VFIO/mdev kernel devices and vfio-user devices.
> 
> The idea is that the device state blob is opaque to the VMM but the same
> level of migration compatibility that exists today is still available.
> 
> I hope this will help us reach consensus and let us discuss specifics.
> 
> If you followed the previous discussion, I changed the approach from
> sending a magic constant in the device state blob to identifying device
> models by URIs. Therefore the device state structure does not need to be
> defined here - the critical information for ensuring device migration
> compatibility is the device model and configuration defined below.
> 
> Stefan
> ---
> VFIO Migration
> ==============
> This document describes how to save and load VFIO device states. Saving a
> device state produces a snapshot of a VFIO device's state that can be loaded
> again at a later point in time to resume the device from the snapshot.
> 
> The data representation of the device state is outside the scope of this
> document.
> 
> Overview
> --------
> The purpose of device states is to save the device at a point in time and then
> restore the device back to the saved state later. This is more challenging than
> it first appears.
> 
> The process of saving a device state and loading it later is called
> *migration*. The state may be loaded by the same device that saved it or by a
> new instance of the device, possibly running on a different computer.
> 
> It must be possible to migrate to a newer implementation of the device
> as well as to an older implementation of the device. This allows users
> to upgrade and roll back their systems.


It must be possible to specify, but we can't necessarily force a vendor
to support it.  It must also be possible to describe incompatibilities,
whether due to lack of support or forks in the migration format.

 
> Migration can fail if loading the device state is not possible. It should fail
> early with a clear error message. It must not appear to complete but leave the
> device inoperable due to a migration problem.
> 
> The rest of this document describes how these requirements can be met.
> 
> Device Models
> -------------
> Devices have a *hardware interface* consisting of hardware registers,
> interrupts, and so on.
> 
> The hardware interface together with the device state representation is called
> a *device model*. Device models can be assigned URIs such as
> https://qemu.org/devices/e1000e to uniquely identify them.
> 
> Multiple implementations of a device model may exist. They are they are
> interchangeable if they follow the same hardware interface and device
> state representation.
> 
> Multiple implementations of the same hardware interface may exist with
> different device state representations, in which case the device models are not
> interchangeable and must be assigned different URIs.
> 
> Migration is only possible when the same device model is supported by the
> *source* and the *destination* devices.
> 
> Device Configuration
> --------------------
> Device models may have parameters that affect the hardware interface or device
> state representation. For example, a network card may have a configurable
> address filtering table size parameter called ``rx-filter-size``. A
> device state saved with ``rx-filter-size=32`` cannot be safely loaded
> into a device with ``rx-filter-size=0``, because changing the size from
> 32 to 0 may disrupt device operation.
> 
> A list of configuration parameters is called the *device configuration*.
> Migration is expected to succeed when the same device model and configuration
> that was used for saving the device state is used again to load it.
> 
> Note that not all parameters used to instantiate a device need to be part of
> the device configuration. For example, assigning a network card to a specific
> physical port is not part of the device configuration since it is not part of
> the device's hardware interface or the device state representation. The device
> state can be loaded and run on a different physical port without affecting the
> operation of the device. Therefore the physical port is not part of the device
> configuration.
> 
> However, secondary aspects related to the physical port may affect the device's
> hardware interface and need to be reflected in the device configuration. The
> link speed may depend on the physical port and be reported through the device's
> hardware interface. In that case a ``link-speed`` configuration parameter is
> required to prevent unexpected changes to the link speed after migration.
> 
> Note that the device configuration is a conservative bound on device
> states that can be migrated successfully since not all configuration
> parameters may be strictly required to match on the source and
> destination devices. For example, if the device's hardware interface has
> not yet been initialized then changes to the link speed may not be
> noticed. However, accurately representing runtime constraints is complex
> and risks introducing migration bugs, so no attempt is made to support
> them to achieve more relaxed bounds on successful migrations.
> 
> Device Versions
> ---------------
> As a device evolves, the number of configuration parameters required may become
> inconvenient for users to express in full. A device configuration can be
> aliased by a *device version*, which is a shorthand for the full device
> configuration. This makes it easy to apply a standard device configuration
> without listing every configuration parameter explicitly.
> 
> For example, if address filtering support was added to a network card then
> device versions and the corresponding configurations may look like this:
> * ``version=1`` - Behaves as if ``rx-filter-size=0``
> * ``version=2`` - ``rx-filter-size=32``
> 
> Device States
> -------------
> The details of the device state representation are not covered in this document
> but the general requirements are discussed here.
> 
> The device state consists of data accessible through the device's hardware
> interface and internal state that is needed to restore device operation.
> State in the hardware interface includes the values of hardware registers.
> An example of internal state is an index value needed to avoid processing
> queued requests more than once.
> 
> Changes can be made to the device state representation as follows. Each change
> to device state must have a corresponding device configuration parameter that
> allows the change to toggled:
> 
> * When the parameter is disabled the hardware interface and device state
>   representation are unchanged. This allows old device states to be loaded.
> 
> * When the parameter is enabled the change comes into effect.
> 
> * The parameter's default value disables the change. Therefore old versions do
>   not have to explicitly specify the parameter.
> 
> The following example illustrates migration from an old device
> implementation to a new one. A version=1 network card is migrated to a
> new device implementation that is also capable of version=2 and adds the
> rx-filter-size=32 parameter. The new device is instantiated with
> version=1, which disables rx-filter-size and is capable of loading the
> version=1 device state. The migration completes successfully but note
> the device is still operating at version=1 level in the new device.
> 
> The following example illustrates migration from a new device
> implementation back to an older one. The new device implementation
> supports version=1 and version=2. The old device implementation supports
> version=1 only. Therefore the device can only be migrated when
> instantiated with version=1 or the equivalent full configuration
> parameters.
> 
> Orchestrating Migrations
> ------------------------
> The following steps must be followed to migrate devices:
> 
> 1. Check that the source and destination devices support the same device model.
> 
> 2. Check that the destination device supports the source device's
>    configuration. Each configuration parameter must be accepted by the
>    destination in order to ensure that it will be possible to load the device
>    state.
> 
> 3. The device state is saved on the source and loaded on the destination.
> 
> 4. If migration succeeds then the destination resumes operation and the source
>    must not resume operation. If the migration fails then the source resumes
>    operation and the destination must not resume operation.
> 
> VFIO Implementation
> -------------------
> The following applies both to kernel VFIO/mdev drivers and vfio-user device
> backends.
> 
> Devices are instantiated based on a version and/or configuration parameters:
> * ``version=1`` - use the device configuration aliased by version 1
> * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> * ``rx-filter-size=0`` - directly set configuration parameters without using a version
> 
> Device creation fails if the version and/or configuration parameters are not
> supported.
> 
> There must be a mechanism to query the "latest" configuration for a device
> model. It may simply report the ``version=5`` where 5 is the latest version but
> it could also report all configuration parameters instead of using a version
> alias.

When we talk about "instantiating" a device here, are we referring to
managing the device on the host or within QEMU via something like
vfio_realize()?  We create an instance of an mdev on the host via an
mdev type using operations on the host sysfs.  That mdev type doesn't
really seem to map to your idea if a device model represented by a URI.
How are supported URIs exposed and specified when the device is
instantiated?

Same for device configuration, we might have per device attributes in
host sysfs defining the configuration of a given mdev device, are these
the device configuration values?  It seems like you're referring to
something much more QEMU centric, but vfio-pci in QEMU handles all
devices the same, aside from quirks.

Likewise, I don't know where versions would be exposed in the current
vfio interface.

There's also a desire to support the vfio migration interface on
non-mdev vfio devices.  We don't know yet if those will be separate,
device specific vfio bus drivers or be integrated into existing
vfio-pci, but the host device is likely instantiated by binding to a
driver, so again I don't really understand where you're proposing this
negotiation occurs.  Will management tools be required to create a
device on-demand to fulfill a migration request or can we manipulate an
existing device into that desired.  Some management layers embrace the
idea of device pools rather than dynamic creation.  Thanks,

Alex

[1]https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg04519.html
[2]https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg00293.html
[3]https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg02983.html



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-02 11:11 VFIO Migration Stefan Hajnoczi
  2020-11-02 12:28 ` Cornelia Huck
  2020-11-02 19:38 ` Alex Williamson
@ 2020-11-03  8:46 ` Jason Wang
  2020-11-03 12:15   ` Stefan Hajnoczi
  2020-11-03 11:39 ` Daniel P. Berrangé
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2020-11-03  8:46 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, qemu-devel, Kirti Wankhede, Dr. David Alan Gilbert,
	Alex Williamson, Paolo Bonzini, Felipe Franciosi, Thanos Makatos


On 2020/11/2 下午7:11, Stefan Hajnoczi wrote:
> There is discussion about VFIO migration in the "Re: Out-of-Process
> Device Emulation session at KVM Forum 2020" thread. The current status
> is that Kirti proposed a VFIO device region type for saving and loading
> device state. There is currently no guidance on migrating between
> different device versions or device implementations from different
> vendors. This is known to be non-trivial and raised discussion about
> whether it should really be handled by VFIO or centralized in QEMU.
>
> Below is a document that describes how to ensure migration compatibility
> in VFIO. It does not require changes to the VFIO migration interface. It
> can be used for both VFIO/mdev kernel devices and vfio-user devices.
>
> The idea is that the device state blob is opaque to the VMM but the same
> level of migration compatibility that exists today is still available.


So if we can't mandate this or there's no way to validate this. Vendor 
is still free to implement their own protocol which could lead a lot of 
maintaining burden.


>
> I hope this will help us reach consensus and let us discuss specifics.
>
> If you followed the previous discussion, I changed the approach from
> sending a magic constant in the device state blob to identifying device
> models by URIs. Therefore the device state structure does not need to be
> defined here - the critical information for ensuring device migration
> compatibility is the device model and configuration defined below.
>
> Stefan
> ---
> VFIO Migration
> ==============
> This document describes how to save and load VFIO device states. Saving a
> device state produces a snapshot of a VFIO device's state that can be loaded
> again at a later point in time to resume the device from the snapshot.
>
> The data representation of the device state is outside the scope of this
> document.
>
> Overview
> --------
> The purpose of device states is to save the device at a point in time and then
> restore the device back to the saved state later. This is more challenging than
> it first appears.
>
> The process of saving a device state and loading it later is called
> *migration*. The state may be loaded by the same device that saved it or by a
> new instance of the device, possibly running on a different computer.
>
> It must be possible to migrate to a newer implementation of the device
> as well as to an older implementation of the device. This allows users
> to upgrade and roll back their systems.
>
> Migration can fail if loading the device state is not possible. It should fail
> early with a clear error message. It must not appear to complete but leave the
> device inoperable due to a migration problem.


For VFIO-user, how management know that a VM can be migrated from src to 
dst? For kernel, we have sysfs.


>
> The rest of this document describes how these requirements can be met.
>
> Device Models
> -------------
> Devices have a *hardware interface* consisting of hardware registers,
> interrupts, and so on.
>
> The hardware interface together with the device state representation is called
> a *device model*. Device models can be assigned URIs such as
> https://qemu.org/devices/e1000e to uniquely identify them.


It looks worse than 
"pci://vendor_id.device_id.subvendor_id.subdevice_id". "e1000e" means a 
lot of different 8275X implementations that have subtle but easy to be 
ignored differences.

And is it possible to have a list of URIs here?


>
> Multiple implementations of a device model may exist. They are they are
> interchangeable if they follow the same hardware interface and device
> state representation.
>
> Multiple implementations of the same hardware interface may exist with
> different device state representations, in which case the device models are not
> interchangeable and must be assigned different URIs.
>
> Migration is only possible when the same device model is supported by the
> *source* and the *destination* devices.
>
> Device Configuration
> --------------------
> Device models may have parameters that affect the hardware interface or device
> state representation. For example, a network card may have a configurable
> address filtering table size parameter called ``rx-filter-size``. A
> device state saved with ``rx-filter-size=32`` cannot be safely loaded
> into a device with ``rx-filter-size=0``, because changing the size from
> 32 to 0 may disrupt device operation.


Do we allow the migration from "rx-filter-size=16" to 
"rx-filter-size=32" (I guess not?) And should we extend the concept to 
"device capability" instead of just state representation.  E.g src has 
CAP_X=on,CAP_Y=off but dst has CAP_X=on,CAP_Y=on, so we disallow the 
migration from src to dst.


>
> A list of configuration parameters is called the *device configuration*.
> Migration is expected to succeed when the same device model and configuration
> that was used for saving the device state is used again to load it.
>
> Note that not all parameters used to instantiate a device need to be part of
> the device configuration. For example, assigning a network card to a specific
> physical port is not part of the device configuration since it is not part of
> the device's hardware interface or the device state representation.


Yes, but the task needs to be done by management somehow. So do you 
expect a vendor specific provisioning API here?


> The device
> state can be loaded and run on a different physical port without affecting the
> operation of the device. Therefore the physical port is not part of the device
> configuration.
>
> However, secondary aspects related to the physical port may affect the device's
> hardware interface and need to be reflected in the device configuration. The
> link speed may depend on the physical port and be reported through the device's
> hardware interface. In that case a ``link-speed`` configuration parameter is
> required to prevent unexpected changes to the link speed after migration.
>
> Note that the device configuration is a conservative bound on device
> states that can be migrated successfully since not all configuration
> parameters may be strictly required to match on the source and
> destination devices. For example, if the device's hardware interface has
> not yet been initialized then changes to the link speed may not be
> noticed. However, accurately representing runtime constraints is complex
> and risks introducing migration bugs, so no attempt is made to support
> them to achieve more relaxed bounds on successful migrations.
>
> Device Versions
> ---------------
> As a device evolves, the number of configuration parameters required may become
> inconvenient for users to express in full. A device configuration can be
> aliased by a *device version*, which is a shorthand for the full device
> configuration. This makes it easy to apply a standard device configuration
> without listing every configuration parameter explicitly.


I'm not sure how to apply the device versions consider the device state 
is opaque or the device needs to export another API to do this?


>
> For example, if address filtering support was added to a network card then
> device versions and the corresponding configurations may look like this:
> * ``version=1`` - Behaves as if ``rx-filter-size=0``
> * ``version=2`` - ``rx-filter-size=32``
>
> Device States
> -------------
> The details of the device state representation are not covered in this document
> but the general requirements are discussed here.
>
> The device state consists of data accessible through the device's hardware
> interface and internal state that is needed to restore device operation.
> State in the hardware interface includes the values of hardware registers.
> An example of internal state is an index value needed to avoid processing
> queued requests more than once.
>
> Changes can be made to the device state representation as follows. Each change
> to device state must have a corresponding device configuration parameter that
> allows the change to toggled:
>
> * When the parameter is disabled the hardware interface and device state
>    representation are unchanged. This allows old device states to be loaded.
>
> * When the parameter is enabled the change comes into effect.
>
> * The parameter's default value disables the change. Therefore old versions do
>    not have to explicitly specify the parameter.
>
> The following example illustrates migration from an old device
> implementation to a new one. A version=1 network card is migrated to a
> new device implementation that is also capable of version=2 and adds the
> rx-filter-size=32 parameter. The new device is instantiated with
> version=1, which disables rx-filter-size and is capable of loading the
> version=1 device state. The migration completes successfully but note
> the device is still operating at version=1 level in the new device.
>
> The following example illustrates migration from a new device
> implementation back to an older one. The new device implementation
> supports version=1 and version=2. The old device implementation supports
> version=1 only. Therefore the device can only be migrated when
> instantiated with version=1 or the equivalent full configuration
> parameters.


In qemu we have subsection to facilitate the case when some fields were 
forgot to migrate. Do we need something similar here?


>
> Orchestrating Migrations
> ------------------------
> The following steps must be followed to migrate devices:
>
> 1. Check that the source and destination devices support the same device model.
>
> 2. Check that the destination device supports the source device's
>     configuration. Each configuration parameter must be accepted by the
>     destination in order to ensure that it will be possible to load the device
>     state.
>
> 3. The device state is saved on the source and loaded on the destination.
>
> 4. If migration succeeds then the destination resumes operation and the source
>     must not resume operation. If the migration fails then the source resumes
>     operation and the destination must not resume operation.
>
> VFIO Implementation
> -------------------
> The following applies both to kernel VFIO/mdev drivers and vfio-user device
> backends.
>
> Devices are instantiated based on a version and/or configuration parameters:
> * ``version=1`` - use the device configuration aliased by version 1
> * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> * ``rx-filter-size=0`` - directly set configuration parameters without using a version
>
> Device creation fails if the version and/or configuration parameters are not
> supported.
>
> There must be a mechanism to query the "latest" configuration for a device
> model. It may simply report the ``version=5`` where 5 is the latest version but
> it could also report all configuration parameters instead of using a version
> alias.


Thanks



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-02 19:38 ` Alex Williamson
@ 2020-11-03 11:03   ` Stefan Hajnoczi
  2020-11-03 17:13     ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-03 11:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: John G Johnson, Tian, Kevin, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Zeng, Xin, qemu-devel,
	Dr. David Alan Gilbert, Yan Zhao, Kirti Wankhede, Thanos Makatos,
	Felipe Franciosi, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 21009 bytes --]

On Mon, Nov 02, 2020 at 12:38:23PM -0700, Alex Williamson wrote:
> 
> Cc+ Intel folks as this really bumps into the migration compatibility
> discussion[1][2][3]
> 
> On Mon, 2 Nov 2020 11:11:53 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > There is discussion about VFIO migration in the "Re: Out-of-Process
> > Device Emulation session at KVM Forum 2020" thread. The current status
> > is that Kirti proposed a VFIO device region type for saving and loading
> > device state. There is currently no guidance on migrating between
> > different device versions or device implementations from different
> > vendors. This is known to be non-trivial and raised discussion about
> > whether it should really be handled by VFIO or centralized in QEMU.
> > 
> > Below is a document that describes how to ensure migration compatibility
> > in VFIO. It does not require changes to the VFIO migration interface. It
> > can be used for both VFIO/mdev kernel devices and vfio-user devices.
> > 
> > The idea is that the device state blob is opaque to the VMM but the same
> > level of migration compatibility that exists today is still available.
> > 
> > I hope this will help us reach consensus and let us discuss specifics.
> > 
> > If you followed the previous discussion, I changed the approach from
> > sending a magic constant in the device state blob to identifying device
> > models by URIs. Therefore the device state structure does not need to be
> > defined here - the critical information for ensuring device migration
> > compatibility is the device model and configuration defined below.
> > 
> > Stefan
> > ---
> > VFIO Migration
> > ==============
> > This document describes how to save and load VFIO device states. Saving a
> > device state produces a snapshot of a VFIO device's state that can be loaded
> > again at a later point in time to resume the device from the snapshot.
> > 
> > The data representation of the device state is outside the scope of this
> > document.
> > 
> > Overview
> > --------
> > The purpose of device states is to save the device at a point in time and then
> > restore the device back to the saved state later. This is more challenging than
> > it first appears.
> > 
> > The process of saving a device state and loading it later is called
> > *migration*. The state may be loaded by the same device that saved it or by a
> > new instance of the device, possibly running on a different computer.
> > 
> > It must be possible to migrate to a newer implementation of the device
> > as well as to an older implementation of the device. This allows users
> > to upgrade and roll back their systems.
> 
> 
> It must be possible to specify, but we can't necessarily force a vendor
> to support it.

The wording is unclear. This migration scheme makes it possible but does
not require implementations to support advanced migration scenarios. A
VFIO/mdev driver or vfio-user device backend can refuse to instantiate
with certain device configuration parameters. For example, if version=1
is no longer supported in the latest device implementation then it can
return an error.

> It must also be possible to describe incompatibilities,
> whether due to lack of support or forks in the migration format.

Compatibility is handled by the device model and configuration
parameters that are used to instantiate devices. If device model URIs
differ then the devices are incompatible (e.g.
https://vendor-a.com/rtl8139 and https://vendor-b.com/rtl8139). When
changes are made to the guest-visible hardware interface or device state
representation then they are toggled via configuration parameters (e.g.
rss=on|off).

Here is an example:

The device model is a network card as defined by the
https://vendor-a.com/my-nic device model. Receive Side Scaling (RSS) is
an optional feature and the configuration parameter rss=on|off toggles
its availability. When rss=on the RSS feature is available in the
hardware interface, but it doesn't necessarily mean that the guest
driver has to enable the feature.

Now we wish to migrate to another implementation of the same device
model. On the destination machine RSS is not available, so trying to
instantiate https://vendor-a.com/my-nic with rss=on will fail with an
error because the feature is unavailable (this could be because the
implementation doesn't support the feature or because the host lacks the
capability).

The following combinations are possible:

Source    Available   Result
          on Dest?
-------------------------------------------------------------------
rss=off          no   OK.
rss=off         yes   OK. rss=on is supported but we don't need it.
rss=on           no   FAIL. rss=on is not supported on destination!
rss=on          yes   OK.

By the way, this shows why this scheme is a conservative bound on
migration compatibility. If the guest driver hasn't enabled RSS and
won't be using it then we could potentially migrate rss=on even when the
destination does not support rss=on. But doing this reliably isn't
tractable so instead we use strict migration compatibility.

Regarding forking, if you want complete freedom you can pick a new
device model URI. Device instances using the old device model URI are
not considered compatible with the new device model URI. However, you
can then introduce changes to the hardware interface or device state
representation without agreement from the owner of the old device model
URI.

If instead you want to collaborate you can agree on changes with the
device model URI owner. You can change the device's hardware interface
and device state representation as described in this document.
Basically, each change must be reflect in a device configuration
parameter.

> > Migration can fail if loading the device state is not possible. It should fail
> > early with a clear error message. It must not appear to complete but leave the
> > device inoperable due to a migration problem.
> > 
> > The rest of this document describes how these requirements can be met.
> > 
> > Device Models
> > -------------
> > Devices have a *hardware interface* consisting of hardware registers,
> > interrupts, and so on.
> > 
> > The hardware interface together with the device state representation is called
> > a *device model*. Device models can be assigned URIs such as
> > https://qemu.org/devices/e1000e to uniquely identify them.
> > 
> > Multiple implementations of a device model may exist. They are they are
> > interchangeable if they follow the same hardware interface and device
> > state representation.
> > 
> > Multiple implementations of the same hardware interface may exist with
> > different device state representations, in which case the device models are not
> > interchangeable and must be assigned different URIs.
> > 
> > Migration is only possible when the same device model is supported by the
> > *source* and the *destination* devices.
> > 
> > Device Configuration
> > --------------------
> > Device models may have parameters that affect the hardware interface or device
> > state representation. For example, a network card may have a configurable
> > address filtering table size parameter called ``rx-filter-size``. A
> > device state saved with ``rx-filter-size=32`` cannot be safely loaded
> > into a device with ``rx-filter-size=0``, because changing the size from
> > 32 to 0 may disrupt device operation.
> > 
> > A list of configuration parameters is called the *device configuration*.
> > Migration is expected to succeed when the same device model and configuration
> > that was used for saving the device state is used again to load it.
> > 
> > Note that not all parameters used to instantiate a device need to be part of
> > the device configuration. For example, assigning a network card to a specific
> > physical port is not part of the device configuration since it is not part of
> > the device's hardware interface or the device state representation. The device
> > state can be loaded and run on a different physical port without affecting the
> > operation of the device. Therefore the physical port is not part of the device
> > configuration.
> > 
> > However, secondary aspects related to the physical port may affect the device's
> > hardware interface and need to be reflected in the device configuration. The
> > link speed may depend on the physical port and be reported through the device's
> > hardware interface. In that case a ``link-speed`` configuration parameter is
> > required to prevent unexpected changes to the link speed after migration.
> > 
> > Note that the device configuration is a conservative bound on device
> > states that can be migrated successfully since not all configuration
> > parameters may be strictly required to match on the source and
> > destination devices. For example, if the device's hardware interface has
> > not yet been initialized then changes to the link speed may not be
> > noticed. However, accurately representing runtime constraints is complex
> > and risks introducing migration bugs, so no attempt is made to support
> > them to achieve more relaxed bounds on successful migrations.
> > 
> > Device Versions
> > ---------------
> > As a device evolves, the number of configuration parameters required may become
> > inconvenient for users to express in full. A device configuration can be
> > aliased by a *device version*, which is a shorthand for the full device
> > configuration. This makes it easy to apply a standard device configuration
> > without listing every configuration parameter explicitly.
> > 
> > For example, if address filtering support was added to a network card then
> > device versions and the corresponding configurations may look like this:
> > * ``version=1`` - Behaves as if ``rx-filter-size=0``
> > * ``version=2`` - ``rx-filter-size=32``
> > 
> > Device States
> > -------------
> > The details of the device state representation are not covered in this document
> > but the general requirements are discussed here.
> > 
> > The device state consists of data accessible through the device's hardware
> > interface and internal state that is needed to restore device operation.
> > State in the hardware interface includes the values of hardware registers.
> > An example of internal state is an index value needed to avoid processing
> > queued requests more than once.
> > 
> > Changes can be made to the device state representation as follows. Each change
> > to device state must have a corresponding device configuration parameter that
> > allows the change to toggled:
> > 
> > * When the parameter is disabled the hardware interface and device state
> >   representation are unchanged. This allows old device states to be loaded.
> > 
> > * When the parameter is enabled the change comes into effect.
> > 
> > * The parameter's default value disables the change. Therefore old versions do
> >   not have to explicitly specify the parameter.
> > 
> > The following example illustrates migration from an old device
> > implementation to a new one. A version=1 network card is migrated to a
> > new device implementation that is also capable of version=2 and adds the
> > rx-filter-size=32 parameter. The new device is instantiated with
> > version=1, which disables rx-filter-size and is capable of loading the
> > version=1 device state. The migration completes successfully but note
> > the device is still operating at version=1 level in the new device.
> > 
> > The following example illustrates migration from a new device
> > implementation back to an older one. The new device implementation
> > supports version=1 and version=2. The old device implementation supports
> > version=1 only. Therefore the device can only be migrated when
> > instantiated with version=1 or the equivalent full configuration
> > parameters.
> > 
> > Orchestrating Migrations
> > ------------------------
> > The following steps must be followed to migrate devices:
> > 
> > 1. Check that the source and destination devices support the same device model.
> > 
> > 2. Check that the destination device supports the source device's
> >    configuration. Each configuration parameter must be accepted by the
> >    destination in order to ensure that it will be possible to load the device
> >    state.
> > 
> > 3. The device state is saved on the source and loaded on the destination.
> > 
> > 4. If migration succeeds then the destination resumes operation and the source
> >    must not resume operation. If the migration fails then the source resumes
> >    operation and the destination must not resume operation.
> > 
> > VFIO Implementation
> > -------------------
> > The following applies both to kernel VFIO/mdev drivers and vfio-user device
> > backends.
> > 
> > Devices are instantiated based on a version and/or configuration parameters:
> > * ``version=1`` - use the device configuration aliased by version 1
> > * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> > * ``rx-filter-size=0`` - directly set configuration parameters without using a version
> > 
> > Device creation fails if the version and/or configuration parameters are not
> > supported.
> > 
> > There must be a mechanism to query the "latest" configuration for a device
> > model. It may simply report the ``version=5`` where 5 is the latest version but
> > it could also report all configuration parameters instead of using a version
> > alias.
> 
> When we talk about "instantiating" a device here, are we referring to
> managing the device on the host or within QEMU via something like
> vfio_realize()?  We create an instance of an mdev on the host via an
> mdev type using operations on the host sysfs.  That mdev type doesn't
> really seem to map to your idea if a device model represented by a URI.
> How are supported URIs exposed and specified when the device is
> instantiated?
> 
> Same for device configuration, we might have per device attributes in
> host sysfs defining the configuration of a given mdev device, are these
> the device configuration values?  It seems like you're referring to
> something much more QEMU centric, but vfio-pci in QEMU handles all
> devices the same, aside from quirks.
> 
> Likewise, I don't know where versions would be exposed in the current
> vfio interface.

"Instantiating" means writing to the mdev "create" sysfs attr. I am not
very familiar with mdev so this could be totally wrong, but I'll try to
define a mapping:

1. The mdev driver sets up struct
   mdev_parent_opts->supported_type_groups as follows:

  /* Device model URI */
  static ssize_t model_show(struct kobject *kobj,
                            struct device *dev,
                            char *buf)
  {
      return sprintf(buf, "https://vendor-a.com/my-nic\n");
  }
  static MDEV_TYPE_ATTR_RO(model);

  /* Receive Side Scaling (RSS) */
  static ssize_t rss_show(struct kobject *kobj,
                          struct dev *dev,
			  char *buf)
  {
      return sprintf(buf, "%d\n", ...->rss);
  }
  static ssize_t rss_store(struct kobject *kobj,
                           struct attribute *attr,
			   const char *page,
			   size_t count)
  {
      char *p = (char *) page;
      unsigned long val = simple_strtoul(p, &p, 10);

      ...->rss = !!val;
      return count;
  }
  static MDEV_TYPE_ATTR_RW(rss);

  /* Device version */
  static ssize_t version_show(struct kobject *kobj,
                              struct dev *dev,
			      char *buf)
  {
      return sprintf(buf, "%u\n", ...->version);
  }
  static ssize_t version_store(struct kobject *kobj,
                               struct attribute *attr,
			       const char *page,
			       size_t count)
  {
      char *p = (char *) page;
      unsigned long val = simple_strtoul(p, &p, 10);

      /* Set device configuration parameters to their defaults */
      switch (version) {
      case 1:
          ...->rss = false;
	  ...->version = 1;
	  break;

      case 2:
          ...->rss = true;
	  ...->version = 2;
	  break;

      default:
          return -ENOTSUPP;
      }

      return count;
  }
  static MDEV_TYPE_ATTR_RW(rss);

  static struct attribute *mdev_type_my_nic_attrs[] = {
      &mdev_type_attr_model.attr,
      &mdev_type_attr_rss.attr,
      &mdev_type_attr_version.attr,
      NULL,
  };

  static struct attribute_group mdev_type_group_my_nic = {
      .name  = "my-nic", /* shorthand name */
      .attrs = mdev_type_my_nic_attrs,
  };

  struct attribute_group *supported_type_groups[] = {
      &mdev_type_group_my_nic,
      NULL,
  };

2. The userspace tooling enumerates supported device models by reading
   the "model" sysfs attr from each supported type attr group.

3. Userspace picks the device model it wishes to instantiate and sets
   the "version" sysfs attr and other device configuration parameters as
   desired.

4. Userspace instantiates the device by writing to the mdev "create" sysfs
   attr. If instantiation succeeds then migrating a device state saved
   by the same device model with the same configuration parameters is
   possible.

Maybe a cleaner way to structure this is to include the version as part
of the supported type group. So "my-nic" becomes "my-nic-1", "my-nic-2",
etc. There would still be a "version" sysfs attr but it would be
read-only. Device configuration parameters would only be present if they
were actually available in that version. For example, "my-nic-1" would
not expose an "rss" sysfs attr because it was introduced in "my-nic-2".
I see pros and cons to both the approach I outlined above and this
alternative, maybe someone more familiar with mdev has a preference?

> There's also a desire to support the vfio migration interface on
> non-mdev vfio devices.  We don't know yet if those will be separate,
> device specific vfio bus drivers or be integrated into existing
> vfio-pci, but the host device is likely instantiated by binding to a
> driver, so again I don't really understand where you're proposing this
> negotiation occurs.  Will management tools be required to create a
> device on-demand to fulfill a migration request or can we manipulate an
> existing device into that desired.  Some management layers embrace the
> idea of device pools rather than dynamic creation.  Thanks,

The concept of device instantiation is natural for mdev and vfio-user,
but not essential.

When dealing with physical devices (even PCI SR-IOV), we don't need to
instantiate them explicitly. Device instances can already exist. As long
as we know their device model URI and configuration parameters we can
ensure migration compatibility.

For example, imagine a physical PCI NIC accompanied by a non-mdev VFIO
migration driver. The device model URI and configuration parameter
information can be distributed alongside the VFIO migration driver. It
could be available via modinfo(8), as a separate metadata file, via a
vendor-specific tool, etc.

Management tools need to match the device model/configuration from the
source device against the destination device. If the destination is
capable of supporting the source's device model/configuration then
migration can proceed safely.

Let's look at the case where we are migration from an older version of a
device to a newer version. On the source we have:

  model = https://vendor-a.com/my-nic

On the destination we have:

  model = https://vendor-a.com/my-nic
  rss = on

The two devices are incompatible because the destination exposes the RSS
feature that is not present on the source. The RSS feature involves
guest-visible hardware interface changes and a change to the device
state representation. It is not safe to migrate!

In this case an extra configuration step is necessary so that the
destination device can accept the device state from the source. The
management tool invokes a vendor-specific tool to put the device into
the right configuration:

  # vendor-tool set-migration-config --device 0000:00:04.0 \
                                     --model https://vendor-a.com/my-nic

(This tool only succeeds when the device is bound to VFIO but not yet
opened.)

The tool invokes ioctls on the vendor-specific VFIO driver that does two
things:
1. Tells the device to present the old hardware interface without RSS
2. Uses the old device state representation without RSS support

Does this approach fit?

> [1]https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg04519.html
> [2]https://lists.gnu.org/archive/html/qemu-devel/2020-08/msg00293.html
> [3]https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg02983.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-02 11:11 VFIO Migration Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2020-11-03  8:46 ` Jason Wang
@ 2020-11-03 11:39 ` Daniel P. Berrangé
  2020-11-03 15:05   ` Stefan Hajnoczi
  2020-11-03 12:17 ` Dr. David Alan Gilbert
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: Daniel P. Berrangé @ 2020-11-03 11:39 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, quintela, Jason Wang, Felipe Franciosi,
	Kirti Wankhede, qemu-devel, Alex Williamson, Thanos Makatos,
	Paolo Bonzini, Dr. David Alan Gilbert

On Mon, Nov 02, 2020 at 11:11:53AM +0000, Stefan Hajnoczi wrote:
> There is discussion about VFIO migration in the "Re: Out-of-Process
> Device Emulation session at KVM Forum 2020" thread. The current status
> is that Kirti proposed a VFIO device region type for saving and loading
> device state. There is currently no guidance on migrating between
> different device versions or device implementations from different
> vendors. This is known to be non-trivial and raised discussion about
> whether it should really be handled by VFIO or centralized in QEMU.
> 
> Below is a document that describes how to ensure migration compatibility
> in VFIO. It does not require changes to the VFIO migration interface. It
> can be used for both VFIO/mdev kernel devices and vfio-user devices.
> 
> The idea is that the device state blob is opaque to the VMM but the same
> level of migration compatibility that exists today is still available.
> 
> I hope this will help us reach consensus and let us discuss specifics.
> 
> If you followed the previous discussion, I changed the approach from
> sending a magic constant in the device state blob to identifying device
> models by URIs. Therefore the device state structure does not need to be
> defined here - the critical information for ensuring device migration
> compatibility is the device model and configuration defined below.
> 
> Stefan
> ---
> VFIO Migration
> ==============
> This document describes how to save and load VFIO device states. Saving a
> device state produces a snapshot of a VFIO device's state that can be loaded
> again at a later point in time to resume the device from the snapshot.
> 
> The data representation of the device state is outside the scope of this
> document.
> 
> Overview
> --------
> The purpose of device states is to save the device at a point in time and then
> restore the device back to the saved state later. This is more challenging than
> it first appears.
> 
> The process of saving a device state and loading it later is called
> *migration*. The state may be loaded by the same device that saved it or by a
> new instance of the device, possibly running on a different computer.
> 
> It must be possible to migrate to a newer implementation of the device
> as well as to an older implementation of the device. This allows users
> to upgrade and roll back their systems.
> 
> Migration can fail if loading the device state is not possible. It should fail
> early with a clear error message. It must not appear to complete but leave the
> device inoperable due to a migration problem.

I think there needs to be an addition requirement.

 It must be possible for a management application to query the supported
 versions, independantly of execution of a migration  operation.

This is important to large scale data center / cloud management applications
because before initiating a migration they need to *automatically* select
a target host with high level of confidence that is will be compatible with
the source host.

Today QEMU migration compatibility is largely determined by the machine
type version. Apps can query the supported machine types for host to
check whether it is compatible. Similarly they will query CPU model
features to check compatiblity.

Validation and error checking at time of migration is of course still
required, but the goal should be that an mgmt application will *NEVER*
hit these errors because they will have pre-selected a host that is
known to be compatible based on reported versions that are supported.

> Device Versions
> ---------------
> As a device evolves, the number of configuration parameters required may become
> inconvenient for users to express in full. A device configuration can be
> aliased by a *device version*, which is a shorthand for the full device
> configuration. This makes it easy to apply a standard device configuration
> without listing every configuration parameter explicitly.
> 
> For example, if address filtering support was added to a network card then
> device versions and the corresponding configurations may look like this:
> * ``version=1`` - Behaves as if ``rx-filter-size=0``
> * ``version=2`` - ``rx-filter-size=32``
> 
> Device States
> -------------
> The details of the device state representation are not covered in this document
> but the general requirements are discussed here.
> 
> The device state consists of data accessible through the device's hardware
> interface and internal state that is needed to restore device operation.
> State in the hardware interface includes the values of hardware registers.
> An example of internal state is an index value needed to avoid processing
> queued requests more than once.
> 
> Changes can be made to the device state representation as follows. Each change
> to device state must have a corresponding device configuration parameter that
> allows the change to toggled:
> 
> * When the parameter is disabled the hardware interface and device state
>   representation are unchanged. This allows old device states to be loaded.
> 
> * When the parameter is enabled the change comes into effect.
> 
> * The parameter's default value disables the change. Therefore old versions do
>   not have to explicitly specify the parameter.
> 
> The following example illustrates migration from an old device
> implementation to a new one. A version=1 network card is migrated to a
> new device implementation that is also capable of version=2 and adds the
> rx-filter-size=32 parameter. The new device is instantiated with
> version=1, which disables rx-filter-size and is capable of loading the
> version=1 device state. The migration completes successfully but note
> the device is still operating at version=1 level in the new device.
> 
> The following example illustrates migration from a new device
> implementation back to an older one. The new device implementation
> supports version=1 and version=2. The old device implementation supports
> version=1 only. Therefore the device can only be migrated when
> instantiated with version=1 or the equivalent full configuration
> parameters.
> 
> Orchestrating Migrations
> ------------------------
> The following steps must be followed to migrate devices:
> 
> 1. Check that the source and destination devices support the same device model.
> 
> 2. Check that the destination device supports the source device's
>    configuration. Each configuration parameter must be accepted by the
>    destination in order to ensure that it will be possible to load the device
>    state.
> 
> 3. The device state is saved on the source and loaded on the destination.
> 
> 4. If migration succeeds then the destination resumes operation and the source
>    must not resume operation. If the migration fails then the source resumes
>    operation and the destination must not resume operation.
> 
> VFIO Implementation
> -------------------
> The following applies both to kernel VFIO/mdev drivers and vfio-user device
> backends.
> 
> Devices are instantiated based on a version and/or configuration parameters:
> * ``version=1`` - use the device configuration aliased by version 1
> * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> * ``rx-filter-size=0`` - directly set configuration parameters without using a version
> 
> Device creation fails if the version and/or configuration parameters are not
> supported.
> 
> There must be a mechanism to query the "latest" configuration for a device
> model. It may simply report the ``version=5`` where 5 is the latest version but
> it could also report all configuration parameters instead of using a version
> alias.

The mechanism needs to be able to report all supported versions strings,
not simple the latest version string. I think we need to specify the
actual mechanism todo this query too, because we can't end up in a place
where there's a different approach to queries for each device type.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03  8:46 ` Jason Wang
@ 2020-11-03 12:15   ` Stefan Hajnoczi
  2020-11-04  3:32     ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-03 12:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, qemu-devel, Kirti Wankhede, Dr. David Alan Gilbert,
	Alex Williamson, Paolo Bonzini, Felipe Franciosi, Thanos Makatos

[-- Attachment #1: Type: text/plain, Size: 13919 bytes --]

On Tue, Nov 03, 2020 at 04:46:53PM +0800, Jason Wang wrote:
> 
> On 2020/11/2 下午7:11, Stefan Hajnoczi wrote:
> > There is discussion about VFIO migration in the "Re: Out-of-Process
> > Device Emulation session at KVM Forum 2020" thread. The current status
> > is that Kirti proposed a VFIO device region type for saving and loading
> > device state. There is currently no guidance on migrating between
> > different device versions or device implementations from different
> > vendors. This is known to be non-trivial and raised discussion about
> > whether it should really be handled by VFIO or centralized in QEMU.
> > 
> > Below is a document that describes how to ensure migration compatibility
> > in VFIO. It does not require changes to the VFIO migration interface. It
> > can be used for both VFIO/mdev kernel devices and vfio-user devices.
> > 
> > The idea is that the device state blob is opaque to the VMM but the same
> > level of migration compatibility that exists today is still available.
> 
> 
> So if we can't mandate this or there's no way to validate this. Vendor is
> still free to implement their own protocol which could lead a lot of
> maintaining burden.

Yes, the device state representation is their responsibility. We can't
do that for them since they define the hardware interface and internal
state.

As Michael and Paolo have mentioned in the other thread, we can provide
guidelines and standardize common aspects.

> > Migration can fail if loading the device state is not possible. It should fail
> > early with a clear error message. It must not appear to complete but leave the
> > device inoperable due to a migration problem.
> 
> 
> For VFIO-user, how management know that a VM can be migrated from src to
> dst? For kernel, we have sysfs.

vfio-user devices will normally be instantiated in one of two ways:

1. Launching a device backend and passing command-line parameters:

     $ my-nic --socket-path /tmp/my-nic-vfio-user.sock \
              --model https://vendor-a.com/my-nic \
	      --rss on

   Here "model" is the device model URL. The program could support
   multiple device models.

   The "rss" device configuration parameter enables Receive Side Scaling
   (RSS) as an example of a configuration parameter.

2. Creating a device using an RPC interface:

     (qemu) device-add my-nic,rss=on

If the device instantiation succeeds then it is safe to live migrate.
The device is exposing the desired hardware interface and expecting the
right device state representation.

> > 
> > The rest of this document describes how these requirements can be met.
> > 
> > Device Models
> > -------------
> > Devices have a *hardware interface* consisting of hardware registers,
> > interrupts, and so on.
> > 
> > The hardware interface together with the device state representation is called
> > a *device model*. Device models can be assigned URIs such as
> > https://qemu.org/devices/e1000e to uniquely identify them.
> 
> 
> It looks worse than "pci://vendor_id.device_id.subvendor_id.subdevice_id".
> "e1000e" means a lot of different 8275X implementations that have subtle but
> easy to be ignored differences.

If you wish to reflect those differences in the device model URI then
you can use:

  https://qemu.org/devices/pci/<vendor-id>/<device-id>/<subvendor-id>/<subdevice-id>

Another option is to use device configuration parameters to express
differences.

The important thing is that this device model URI has one owner. No one
else will use qemu.org. There can be many different e1000e device model
URIs, if necessary (with slightly different hardware interfaces and/or
device state representations). This avoids collisions.

> And is it possible to have a list of URIs here?

A device implementation (mdev driver, vfio-user device backend, etc) may
support multiple device model URIs.

A device instance has an immutable device model URI and list of
configuration parameters. In other words, once the device is created its
ABI is fixed for the lifetime of the device. A new device instance can
be configured by powering off the machine, hotplug, etc.

> > Multiple implementations of a device model may exist. They are they are
> > interchangeable if they follow the same hardware interface and device
> > state representation.
> > 
> > Multiple implementations of the same hardware interface may exist with
> > different device state representations, in which case the device models are not
> > interchangeable and must be assigned different URIs.
> > 
> > Migration is only possible when the same device model is supported by the
> > *source* and the *destination* devices.
> > 
> > Device Configuration
> > --------------------
> > Device models may have parameters that affect the hardware interface or device
> > state representation. For example, a network card may have a configurable
> > address filtering table size parameter called ``rx-filter-size``. A
> > device state saved with ``rx-filter-size=32`` cannot be safely loaded
> > into a device with ``rx-filter-size=0``, because changing the size from
> > 32 to 0 may disrupt device operation.
> 
> 
> Do we allow the migration from "rx-filter-size=16" to "rx-filter-size=32" (I
> guess not?) And should we extend the concept to "device capability" instead
> of just state representation.  E.g src has CAP_X=on,CAP_Y=off but dst has
> CAP_X=on,CAP_Y=on, so we disallow the migration from src to dst.

A device instance's configuration parameters are immutable.
rx-filter-size=16 cannot be migrated to rx-filter-size=32.

Yes, configuration parameters can describe capabilities. I think of
capabilities as something that affects the guest-visible hardware
interface (e.g. the RSS feature bit is enabled) that is mentioned in the
text, but it would be clearer to mention them explicitly.

> > A list of configuration parameters is called the *device configuration*.
> > Migration is expected to succeed when the same device model and configuration
> > that was used for saving the device state is used again to load it.
> > 
> > Note that not all parameters used to instantiate a device need to be part of
> > the device configuration. For example, assigning a network card to a specific
> > physical port is not part of the device configuration since it is not part of
> > the device's hardware interface or the device state representation.
> 
> 
> Yes, but the task needs to be done by management somehow. So do you expect a
> vendor specific provisioning API here?

There seems to be no consensus on this yet. It's the question of how to
manage the lifecycle of VFIO, mdev, vhost-user, and vfio-user devices.
There are attempts to standardize in some of these areas.

For mdev drivers we can standardize the sysfs interface so management
tools can query source devices and instantiate destination devices
without device-specific code.

For vhost-user devices there is the backend program conventions
specification, which aims to standardize common parameters. This makes
integrating support for new device implementations easier (there is less
device implementation-specific code).

For vfio-user devices something based on the vhost-user backend program
conventions spec could work well.

The main issue could be that avoiding vendor-specific provisioning code
in management software either requires you to restrict yourself to a few
standard device types or to pass through configuration data.

A libvirt opinion would be interesting.

> > The device
> > state can be loaded and run on a different physical port without affecting the
> > operation of the device. Therefore the physical port is not part of the device
> > configuration.
> > 
> > However, secondary aspects related to the physical port may affect the device's
> > hardware interface and need to be reflected in the device configuration. The
> > link speed may depend on the physical port and be reported through the device's
> > hardware interface. In that case a ``link-speed`` configuration parameter is
> > required to prevent unexpected changes to the link speed after migration.
> > 
> > Note that the device configuration is a conservative bound on device
> > states that can be migrated successfully since not all configuration
> > parameters may be strictly required to match on the source and
> > destination devices. For example, if the device's hardware interface has
> > not yet been initialized then changes to the link speed may not be
> > noticed. However, accurately representing runtime constraints is complex
> > and risks introducing migration bugs, so no attempt is made to support
> > them to achieve more relaxed bounds on successful migrations.
> > 
> > Device Versions
> > ---------------
> > As a device evolves, the number of configuration parameters required may become
> > inconvenient for users to express in full. A device configuration can be
> > aliased by a *device version*, which is a shorthand for the full device
> > configuration. This makes it easy to apply a standard device configuration
> > without listing every configuration parameter explicitly.
> 
> 
> I'm not sure how to apply the device versions consider the device state is
> opaque or the device needs to export another API to do this?

Versions are just aliases for a list of configuration parameters. For
example, version=2 expands to rx-filter-size=32. The purpose of versions
is to provide a human-readable shorthand notation.

Versions are not involved in migration compatibility checking, instead
the device model URI and expanded configuration parameters are compared.

The version has no direct effect on the device state representation. It
has an indirect effect due to the configuration parameters that it
expands to. For example, the rx-filter-size=32 configuration parameter
may change the device state representation to include the 32 addresses
that the device is filtering on.

No "version check" is necessary when loading the device state
representation because the device was already instantiated with the
exact configuration parameters that determine the device state
representation.

> > For example, if address filtering support was added to a network card then
> > device versions and the corresponding configurations may look like this:
> > * ``version=1`` - Behaves as if ``rx-filter-size=0``
> > * ``version=2`` - ``rx-filter-size=32``
> > 
> > Device States
> > -------------
> > The details of the device state representation are not covered in this document
> > but the general requirements are discussed here.
> > 
> > The device state consists of data accessible through the device's hardware
> > interface and internal state that is needed to restore device operation.
> > State in the hardware interface includes the values of hardware registers.
> > An example of internal state is an index value needed to avoid processing
> > queued requests more than once.
> > 
> > Changes can be made to the device state representation as follows. Each change
> > to device state must have a corresponding device configuration parameter that
> > allows the change to toggled:
> > 
> > * When the parameter is disabled the hardware interface and device state
> >    representation are unchanged. This allows old device states to be loaded.
> > 
> > * When the parameter is enabled the change comes into effect.
> > 
> > * The parameter's default value disables the change. Therefore old versions do
> >    not have to explicitly specify the parameter.
> > 
> > The following example illustrates migration from an old device
> > implementation to a new one. A version=1 network card is migrated to a
> > new device implementation that is also capable of version=2 and adds the
> > rx-filter-size=32 parameter. The new device is instantiated with
> > version=1, which disables rx-filter-size and is capable of loading the
> > version=1 device state. The migration completes successfully but note
> > the device is still operating at version=1 level in the new device.
> > 
> > The following example illustrates migration from a new device
> > implementation back to an older one. The new device implementation
> > supports version=1 and version=2. The old device implementation supports
> > version=1 only. Therefore the device can only be migrated when
> > instantiated with version=1 or the equivalent full configuration
> > parameters.
> 
> 
> In qemu we have subsection to facilitate the case when some fields were
> forgot to migrate. Do we need something similar here?

This is an important question and I'm not sure.

The problem with subsection semantics is that they break rollback. Once
the old device state has been loaded by the new device implementation,
saving the device state produces the new device state representation.
The old device implementation can no longer load it :(.  Manual
intervention is necessary to tell the new device implementation to save
in the old representation.

In the migration model described in this document it works the other
way around: back and forth migration is always safe. If you wish to
change the device you need to create a new instance (after poweroff or
through hotplug).

One way of achieving something similar is to provide additional
information about safe transitions between configuration parameter
lists. It is not safe to change arbitrary device configuration
parameters, but certain parameters can be safely changed.

I'm not sure if the complexity is worth it though. The downside to the
current approach is that devices must eventually be reconfigured to
upgrade to new versions, even if there is no guest-visible hardware
interface change.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-02 11:11 VFIO Migration Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2020-11-03 11:39 ` Daniel P. Berrangé
@ 2020-11-03 12:17 ` Dr. David Alan Gilbert
  2020-11-03 15:27   ` Stefan Hajnoczi
  2020-11-03 15:23 ` Christophe de Dinechin
  2020-11-04  7:50 ` Michael S. Tsirkin
  6 siblings, 1 reply; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-03 12:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Kirti Wankhede,
	qemu-devel, Alex Williamson, Thanos Makatos, Paolo Bonzini

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> There is discussion about VFIO migration in the "Re: Out-of-Process
> Device Emulation session at KVM Forum 2020" thread. The current status
> is that Kirti proposed a VFIO device region type for saving and loading
> device state. There is currently no guidance on migrating between
> different device versions or device implementations from different
> vendors. This is known to be non-trivial and raised discussion about
> whether it should really be handled by VFIO or centralized in QEMU.
> 
> Below is a document that describes how to ensure migration compatibility
> in VFIO. It does not require changes to the VFIO migration interface. It
> can be used for both VFIO/mdev kernel devices and vfio-user devices.
> 
> The idea is that the device state blob is opaque to the VMM but the same
> level of migration compatibility that exists today is still available.
> 
> I hope this will help us reach consensus and let us discuss specifics.
> 
> If you followed the previous discussion, I changed the approach from
> sending a magic constant in the device state blob to identifying device
> models by URIs. Therefore the device state structure does not need to be
> defined here - the critical information for ensuring device migration
> compatibility is the device model and configuration defined below.
> 
> Stefan
> ---
> VFIO Migration
> ==============
> This document describes how to save and load VFIO device states. Saving a
> device state produces a snapshot of a VFIO device's state that can be loaded
> again at a later point in time to resume the device from the snapshot.
> 
> The data representation of the device state is outside the scope of this
> document.
> 
> Overview
> --------
> The purpose of device states is to save the device at a point in time and then
> restore the device back to the saved state later. This is more challenging than
> it first appears.
> 
> The process of saving a device state and loading it later is called
> *migration*. The state may be loaded by the same device that saved it or by a
> new instance of the device, possibly running on a different computer.
> 
> It must be possible to migrate to a newer implementation of the device
> as well as to an older implementation of the device. This allows users
> to upgrade and roll back their systems.
> 
> Migration can fail if loading the device state is not possible. It should fail
> early with a clear error message. It must not appear to complete but leave the
> device inoperable due to a migration problem.
> 
> The rest of this document describes how these requirements can be met.
> 
> Device Models
> -------------
> Devices have a *hardware interface* consisting of hardware registers,
> interrupts, and so on.
> 
> The hardware interface together with the device state representation is called
> a *device model*. Device models can be assigned URIs such as
> https://qemu.org/devices/e1000e to uniquely identify them.

I think this is a unique identifier, not actually a URI; the https://
isn't needed since no one expects to ever connect to this.

> Multiple implementations of a device model may exist. They are they are
> interchangeable if they follow the same hardware interface and device
> state representation.
> 
> Multiple implementations of the same hardware interface may exist with
> different device state representations, in which case the device models are not
> interchangeable and must be assigned different URIs.
> 
> Migration is only possible when the same device model is supported by the
> *source* and the *destination* devices.
> 
> Device Configuration
> --------------------
> Device models may have parameters that affect the hardware interface or device
> state representation. For example, a network card may have a configurable
> address filtering table size parameter called ``rx-filter-size``. A
> device state saved with ``rx-filter-size=32`` cannot be safely loaded
> into a device with ``rx-filter-size=0``, because changing the size from
> 32 to 0 may disrupt device operation.
> 
> A list of configuration parameters is called the *device configuration*.
> Migration is expected to succeed when the same device model and configuration
> that was used for saving the device state is used again to load it.
> 
> Note that not all parameters used to instantiate a device need to be part of
> the device configuration. For example, assigning a network card to a specific
> physical port is not part of the device configuration since it is not part of
> the device's hardware interface or the device state representation. The device
> state can be loaded and run on a different physical port without affecting the
> operation of the device. Therefore the physical port is not part of the device
> configuration.
> 
> However, secondary aspects related to the physical port may affect the device's
> hardware interface and need to be reflected in the device configuration. The
> link speed may depend on the physical port and be reported through the device's
> hardware interface. In that case a ``link-speed`` configuration parameter is
> required to prevent unexpected changes to the link speed after migration.

That's an interesting example; because depending on the device, it might
be:
    a) Completely virtualised so that the guest *shouldn't* know what
the physical link speed is, precisely to allow the physical network on
the destination to be different.

    b) Part of the migrated state

    c) Something that's allowed to be reloaded after migration

    d) Configurable

so I'm not sure whether it's a good example in this case or not.

Maybe what's needed is a stronger instruction to abstract external
device state so that it's not part of the configuration in most cases.

> Note that the device configuration is a conservative bound on device
> states that can be migrated successfully since not all configuration
> parameters may be strictly required to match on the source and
> destination devices. For example, if the device's hardware interface has
> not yet been initialized then changes to the link speed may not be
> noticed. However, accurately representing runtime constraints is complex
> and risks introducing migration bugs, so no attempt is made to support
> them to achieve more relaxed bounds on successful migrations.
> 
> Device Versions
> ---------------
> As a device evolves, the number of configuration parameters required may become
> inconvenient for users to express in full. A device configuration can be
> aliased by a *device version*, which is a shorthand for the full device
> configuration. This makes it easy to apply a standard device configuration
> without listing every configuration parameter explicitly.

> For example, if address filtering support was added to a network card then
> device versions and the corresponding configurations may look like this:
> * ``version=1`` - Behaves as if ``rx-filter-size=0``
> * ``version=2`` - ``rx-filter-size=32``

Note configuration parameters might have been added during the life of
the device; e.g. if the original card had no support for rx-filters, it
might not have a rx-filter-size parameter.

> Device States
> -------------
> The details of the device state representation are not covered in this document
> but the general requirements are discussed here.
> 
> The device state consists of data accessible through the device's hardware
> interface and internal state that is needed to restore device operation.
> State in the hardware interface includes the values of hardware registers.
> An example of internal state is an index value needed to avoid processing
> queued requests more than once.

I try and emphasise that 'internal state' should be represented in a way
that reflects the problem rather than the particular implementation;
this gives it a better chance of migrating to future versions.

> Changes can be made to the device state representation as follows. Each change
> to device state must have a corresponding device configuration parameter that
> allows the change to toggled:
> 
> * When the parameter is disabled the hardware interface and device state
>   representation are unchanged. This allows old device states to be loaded.
> 
> * When the parameter is enabled the change comes into effect.
> 
> * The parameter's default value disables the change. Therefore old versions do
>   not have to explicitly specify the parameter.
> 
> The following example illustrates migration from an old device
> implementation to a new one. A version=1 network card is migrated to a
> new device implementation that is also capable of version=2 and adds the
> rx-filter-size=32 parameter. The new device is instantiated with
> version=1, which disables rx-filter-size and is capable of loading the
> version=1 device state. The migration completes successfully but note
> the device is still operating at version=1 level in the new device.
> 
> The following example illustrates migration from a new device
> implementation back to an older one. The new device implementation
> supports version=1 and version=2. The old device implementation supports
> version=1 only. Therefore the device can only be migrated when
> instantiated with version=1 or the equivalent full configuration
> parameters.

I'm sometimes asked for 'ways out' of buggy migration cases; e.g. what
happens if version=1 forgot to migrate the X register; or what happens
if verison=1 forgot to handle the special, rare case when X=5 and we
now need to migrate some extra state.

> Orchestrating Migrations
> ------------------------
> The following steps must be followed to migrate devices:
> 
> 1. Check that the source and destination devices support the same device model.
> 
> 2. Check that the destination device supports the source device's
>    configuration. Each configuration parameter must be accepted by the
>    destination in order to ensure that it will be possible to load the device
>    state.

This is written in terms of a 'check'; there are at least three tricky
things:

  a) Where they both have the same parameter, do they accept the same
range of values; e.g. a newer version of the card might allow
rx-filter-size to go upto 128

  b) In cloud cases, the problem is not a 'check' the problem is a
'find' - find me a host with a spare card that matches the one on my
source host and is capable of taking the set of device config that my
source has.  That's trickier.  Finding the set of device models a host
supports on all it's cards isn't too bad.

  c) There may be resource limits; e.g. a host might be able to
handle only some combination of device models on a given card.

> 3. The device state is saved on the source and loaded on the destination.
> 
> 4. If migration succeeds then the destination resumes operation and the source
>    must not resume operation. If the migration fails then the source resumes
>    operation and the destination must not resume operation.
> 
> VFIO Implementation
> -------------------
> The following applies both to kernel VFIO/mdev drivers and vfio-user device
> backends.
> 
> Devices are instantiated based on a version and/or configuration parameters:
> * ``version=1`` - use the device configuration aliased by version 1
> * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> * ``rx-filter-size=0`` - directly set configuration parameters without using a version
> 
> Device creation fails if the version and/or configuration parameters are not
> supported.
> 
> There must be a mechanism to query the "latest" configuration for a device
> model. It may simply report the ``version=5`` where 5 is the latest version but
> it could also report all configuration parameters instead of using a version
> alias.

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 11:39 ` Daniel P. Berrangé
@ 2020-11-03 15:05   ` Stefan Hajnoczi
  2020-11-03 15:23     ` Daniel P. Berrangé
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-03 15:05 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: John G Johnson, mtsirkin, quintela, Jason Wang, Felipe Franciosi,
	Kirti Wankhede, qemu-devel, Alex Williamson, Thanos Makatos,
	Paolo Bonzini, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 4410 bytes --]

On Tue, Nov 03, 2020 at 11:39:29AM +0000, Daniel P. Berrangé wrote:
> On Mon, Nov 02, 2020 at 11:11:53AM +0000, Stefan Hajnoczi wrote:
> > Overview
> > --------
> > The purpose of device states is to save the device at a point in time and then
> > restore the device back to the saved state later. This is more challenging than
> > it first appears.
> > 
> > The process of saving a device state and loading it later is called
> > *migration*. The state may be loaded by the same device that saved it or by a
> > new instance of the device, possibly running on a different computer.
> > 
> > It must be possible to migrate to a newer implementation of the device
> > as well as to an older implementation of the device. This allows users
> > to upgrade and roll back their systems.
> > 
> > Migration can fail if loading the device state is not possible. It should fail
> > early with a clear error message. It must not appear to complete but leave the
> > device inoperable due to a migration problem.
> 
> I think there needs to be an addition requirement.
> 
>  It must be possible for a management application to query the supported
>  versions, independantly of execution of a migration  operation.
> 
> This is important to large scale data center / cloud management applications
> because before initiating a migration they need to *automatically* select
> a target host with high level of confidence that is will be compatible with
> the source host.
> 
> Today QEMU migration compatibility is largely determined by the machine
> type version. Apps can query the supported machine types for host to
> check whether it is compatible. Similarly they will query CPU model
> features to check compatiblity.
> 
> Validation and error checking at time of migration is of course still
> required, but the goal should be that an mgmt application will *NEVER*
> hit these errors because they will have pre-selected a host that is
> known to be compatible based on reported versions that are supported.

Okay. What do you think of the following?

  [
    {
      "model": "https://qemu.org/devices/e1000e",
      "params": [
        "rss",
	...more configuration parameters...
      ],
      "versions": [
        {
	  "name": "1",
	  "params": [],
	},
	{
	  "name": "2",
	  "params": ["rss=on"],
	},
	...more versions...
      ]
    },
    ...more device models...
  ]

The management tool can generate the configuration parameter list by
expanding a version into its params.

Configuration parameter types and input ranges need more thought. For
example, version 1 of the device might not have rx-table-size (it's
effectively 0). Version 2 introduces rx-table-size and sets it to 32.
Version 3 raises the value to 64. In addition, the user can set a custom
value like rx-table-size=48. I haven't defined the rules for this yet,
but it's clear there needs to be a way to extend configuration
parameters.

To check migration compatibility:
1. Verify that the device model URL matches the JSON data[n].model
   field.
2. For every configuration parameter name from the source device,
   check that it is contained within the JSON data[n].params list.

> > VFIO Implementation
> > -------------------
> > The following applies both to kernel VFIO/mdev drivers and vfio-user device
> > backends.
> > 
> > Devices are instantiated based on a version and/or configuration parameters:
> > * ``version=1`` - use the device configuration aliased by version 1
> > * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> > * ``rx-filter-size=0`` - directly set configuration parameters without using a version
> > 
> > Device creation fails if the version and/or configuration parameters are not
> > supported.
> > 
> > There must be a mechanism to query the "latest" configuration for a device
> > model. It may simply report the ``version=5`` where 5 is the latest version but
> > it could also report all configuration parameters instead of using a version
> > alias.
> 
> The mechanism needs to be able to report all supported versions strings,
> not simple the latest version string. I think we need to specify the
> actual mechanism todo this query too, because we can't end up in a place
> where there's a different approach to queries for each device type.

Makes sense.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 15:05   ` Stefan Hajnoczi
@ 2020-11-03 15:23     ` Daniel P. Berrangé
  2020-11-03 18:16       ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel P. Berrangé @ 2020-11-03 15:23 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, quintela, Jason Wang, Felipe Franciosi,
	Kirti Wankhede, qemu-devel, Alex Williamson, Thanos Makatos,
	Paolo Bonzini, Dr. David Alan Gilbert

On Tue, Nov 03, 2020 at 03:05:08PM +0000, Stefan Hajnoczi wrote:
> On Tue, Nov 03, 2020 at 11:39:29AM +0000, Daniel P. Berrangé wrote:
> > On Mon, Nov 02, 2020 at 11:11:53AM +0000, Stefan Hajnoczi wrote:
> > > Overview
> > > --------
> > > The purpose of device states is to save the device at a point in time and then
> > > restore the device back to the saved state later. This is more challenging than
> > > it first appears.
> > > 
> > > The process of saving a device state and loading it later is called
> > > *migration*. The state may be loaded by the same device that saved it or by a
> > > new instance of the device, possibly running on a different computer.
> > > 
> > > It must be possible to migrate to a newer implementation of the device
> > > as well as to an older implementation of the device. This allows users
> > > to upgrade and roll back their systems.
> > > 
> > > Migration can fail if loading the device state is not possible. It should fail
> > > early with a clear error message. It must not appear to complete but leave the
> > > device inoperable due to a migration problem.
> > 
> > I think there needs to be an addition requirement.
> > 
> >  It must be possible for a management application to query the supported
> >  versions, independantly of execution of a migration  operation.
> > 
> > This is important to large scale data center / cloud management applications
> > because before initiating a migration they need to *automatically* select
> > a target host with high level of confidence that is will be compatible with
> > the source host.
> > 
> > Today QEMU migration compatibility is largely determined by the machine
> > type version. Apps can query the supported machine types for host to
> > check whether it is compatible. Similarly they will query CPU model
> > features to check compatiblity.
> > 
> > Validation and error checking at time of migration is of course still
> > required, but the goal should be that an mgmt application will *NEVER*
> > hit these errors because they will have pre-selected a host that is
> > known to be compatible based on reported versions that are supported.
> 
> Okay. What do you think of the following?
> 
>   [
>     {
>       "model": "https://qemu.org/devices/e1000e",
>       "params": [
>         "rss",
> 	...more configuration parameters...
>       ],
>       "versions": [
>         {
> 	  "name": "1",
> 	  "params": [],
> 	},
> 	{
> 	  "name": "2",
> 	  "params": ["rss=on"],
> 	},
> 	...more versions...
>       ]
>     },
>     ...more device models...
>   ]
> 
> The management tool can generate the configuration parameter list by
> expanding a version into its params.
> 
> Configuration parameter types and input ranges need more thought. For
> example, version 1 of the device might not have rx-table-size (it's
> effectively 0). Version 2 introduces rx-table-size and sets it to 32.
> Version 3 raises the value to 64. In addition, the user can set a custom
> value like rx-table-size=48. I haven't defined the rules for this yet,
> but it's clear there needs to be a way to extend configuration
> parameters.
> 
> To check migration compatibility:
> 1. Verify that the device model URL matches the JSON data[n].model
>    field.
> 2. For every configuration parameter name from the source device,
>    check that it is contained within the JSON data[n].params list.

I'm not convinced that this makes sense. A matching set of parameter
names + values does not imply that the migration data stream is
actually compatible.

ie implementations may need to change the internal migration data
stream to fix bugs, without adding/removing a config parameter.
The migration version string alone expresses data stream compatibility.

This is similar to how 2 QEMU command lines can have identical set
of configuration parameters, aside from the machine type version,
and thus be migration *incompatible.

Basically the version string should be considered an opaque blob
that expresses compatibility on its own.

> > > VFIO Implementation
> > > -------------------
> > > The following applies both to kernel VFIO/mdev drivers and vfio-user device
> > > backends.
> > > 
> > > Devices are instantiated based on a version and/or configuration parameters:
> > > * ``version=1`` - use the device configuration aliased by version 1
> > > * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> > > * ``rx-filter-size=0`` - directly set configuration parameters without using a version
> > > 
> > > Device creation fails if the version and/or configuration parameters are not
> > > supported.
> > > 
> > > There must be a mechanism to query the "latest" configuration for a device
> > > model. It may simply report the ``version=5`` where 5 is the latest version but
> > > it could also report all configuration parameters instead of using a version
> > > alias.
> > 
> > The mechanism needs to be able to report all supported versions strings,
> > not simple the latest version string. I think we need to specify the
> > actual mechanism todo this query too, because we can't end up in a place
> > where there's a different approach to queries for each device type.
> 
> Makes sense.
> 
> Stefan



Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-02 11:11 VFIO Migration Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2020-11-03 12:17 ` Dr. David Alan Gilbert
@ 2020-11-03 15:23 ` Christophe de Dinechin
  2020-11-03 15:33   ` Daniel P. Berrangé
  2020-11-04 11:10   ` Stefan Hajnoczi
  2020-11-04  7:50 ` Michael S. Tsirkin
  6 siblings, 2 replies; 40+ messages in thread
From: Christophe de Dinechin @ 2020-11-03 15:23 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, qemu-devel, Jason Wang, Kirti Wankhede,
	Dr. David Alan Gilbert, Alex Williamson, Paolo Bonzini,
	Felipe Franciosi, Thanos Makatos


On 2020-11-02 at 12:11 CET, Stefan Hajnoczi wrote...
> There is discussion about VFIO migration in the "Re: Out-of-Process
> Device Emulation session at KVM Forum 2020" thread. The current status
> is that Kirti proposed a VFIO device region type for saving and loading
> device state. There is currently no guidance on migrating between
> different device versions or device implementations from different
> vendors. This is known to be non-trivial and raised discussion about
> whether it should really be handled by VFIO or centralized in QEMU.
>
> Below is a document that describes how to ensure migration compatibility
> in VFIO. It does not require changes to the VFIO migration interface. It
> can be used for both VFIO/mdev kernel devices and vfio-user devices.
>
> The idea is that the device state blob is opaque to the VMM but the same
> level of migration compatibility that exists today is still available.
>
> I hope this will help us reach consensus and let us discuss specifics.
>
> If you followed the previous discussion, I changed the approach from
> sending a magic constant in the device state blob to identifying device
> models by URIs. Therefore the device state structure does not need to be
> defined here - the critical information for ensuring device migration
> compatibility is the device model and configuration defined below.
>
> Stefan
> ---
> VFIO Migration
> ==============
> This document describes how to save and load VFIO device states. Saving a
> device state produces a snapshot of a VFIO device's state that can be loaded
> again at a later point in time to resume the device from the snapshot.
>
> The data representation of the device state is outside the scope of this
> document.
>
> Overview
> --------
> The purpose of device states is to save the device at a point in time and then
> restore the device back to the saved state later. This is more challenging than
> it first appears.
>
> The process of saving a device state and loading it later is called
> *migration*. The state may be loaded by the same device that saved it or by a
> new instance of the device, possibly running on a different computer.
>
> It must be possible to migrate to a newer implementation of the device
> as well as to an older implementation of the device. This allows users
> to upgrade and roll back their systems.
>
> Migration can fail if loading the device state is not possible. It should fail
> early with a clear error message. It must not appear to complete but leave the
> device inoperable due to a migration problem.
>
> The rest of this document describes how these requirements can be met.
>
> Device Models
> -------------
> Devices have a *hardware interface* consisting of hardware registers,
> interrupts, and so on.
>
> The hardware interface together with the device state representation is called
> a *device model*. Device models can be assigned URIs such as
> https://qemu.org/devices/e1000e to uniquely identify them.

Like others, I think we should either

a) Give a relatively strong requirement regarding what is at the URL in
question, e.g. docs, maybe even a machine-readable schema describing
configuration and state for the device. Leaving the option "there can be
nothing here" is IMO asking for trouble.

b) simply call that a unique ID, and then either drop the https: entirely or
use something else, like pci:// or, to be more specific, vfio://

I'd favor option (b) for a different practical reason. URLs are subject to
redirection and other mishaps. For example, using https:// begs the question
whether
https://qemu.org/devices/e1000e and
https://www.qemu.org/devices/e1000e
should be treated as the same device. I believe that your intent is that
they shouldn't, but if the qemu web server redirects to www, and someone
wants to copy-paste their web browser's URL bar to the command line, they'd
get the wrong one.


>
> Multiple implementations of a device model may exist. They are they are

dup "they are"

> interchangeable if they follow the same hardware interface and device
> state representation.
>
> Multiple implementations of the same hardware interface may exist with
> different device state representations, in which case the device models are not
> interchangeable and must be assigned different URIs.
>
> Migration is only possible when the same device model is supported by the
> *source* and the *destination* devices.
>
> Device Configuration
> --------------------

I find "device configuration" to be a bit confusing and ambiguous here.
From the discussion, it appears that you are not talking about the active
meaning of "configuration", as in "configuring" the device after migration,
but talking about a passive meaning of "this device exists in multiple
variant, which one am I talking about".

I've scratched my head looking for a less ambiguous wording, but could not
find any.

> Device models may have parameters that affect the hardware interface or device
> state representation. For example, a network card may have a configurable
> address filtering table size parameter called ``rx-filter-size``. A
> device state saved with ``rx-filter-size=32`` cannot be safely loaded
> into a device with ``rx-filter-size=0``, because changing the size from
> 32 to 0 may disrupt device operation.
>
> A list of configuration parameters is called the *device configuration*.
> Migration is expected to succeed when the same device model and configuration
> that was used for saving the device state is used again to load it.

If that's intended for a static decision, are you thinking about making it
part of the URI?

Something like vfio://qemu.org/devices/e1000e?version=2


>
> Note that not all parameters used to instantiate a device need to be part of
> the device configuration. For example, assigning a network card to a specific
> physical port is not part of the device configuration since it is not part of
> the device's hardware interface or the device state representation.

I'd replace "since" with "when". There are cases where all ports are not
equivalent. Or maybe you are saying that this is covered by other more
relevant parts of the configuration like link speed?

What about the topology used to access the card? Would you want to be able
to refer to things like IOMMU groups, etc?


> The device state can be loaded and run on a different physical port
> without affecting the operation of the device. Therefore the physical port
> is not part of the device configuration.

I would prefer if we could offer a mechanism here, rather than a policy, and
let the upper layers in the stack be able to specify the policy.

Imagine for example that you have allocated ports between internal and
external networks? The upper stack would probably want to migrate an
"internal network" vfio to another "internal network" port, no?


>
> However, secondary aspects related to the physical port may affect the device's
> hardware interface and need to be reflected in the device configuration. The
> link speed may depend on the physical port and be reported through the device's
> hardware interface. In that case a ``link-speed`` configuration parameter is
> required to prevent unexpected changes to the link speed after migration.

Again, I think that we should provide mechanism rather than policy here.

Imagine someone who wants to migrate _precisely_ to get a different link
speed. Would we want to preclude that if nothing else was blocking the
migration?

The way I see it, it is not entirely clear that the validation of whether
the migration is OK or not should occur entirely within qemu. It might be
good to make room in the spec for some external validation, which could be
implemented in practice through some optional plug-in. It does not need to
be done in the first iteration, but I think the spec should be ready for it.


>
> Note that the device configuration is a conservative bound on device
> states that can be migrated successfully since not all configuration
> parameters may be strictly required to match on the source and
> destination devices. For example, if the device's hardware interface has
> not yet been initialized then changes to the link speed may not be
> noticed. However, accurately representing runtime constraints is complex
> and risks introducing migration bugs, so no attempt is made to support
> them to achieve more relaxed bounds on successful migrations.

That makes me wonder if the distinction between configuration, version and
state is really tight.

Consider a vGPU for example. It looks to me like the "shape" of the target
vGPU would be part of "configuration" at first sight. But then, it might be
instead a "state" request, "this is what I need", that could cause the
target to reconfigure the vGPUs to match the description.

Notice that such a reconfiguration might be impossible. So this is still a
migration validation, but it's a bit more dynamic.

Similarly, if we get to network cards and "upper stacks", you could consider
the MAC address as part of the state or configuration, depending on the
scenario. You could either want to "transport" the MAC address, or to
have the upper stack follow some rules on which one to pick for the target.
My understanding is that IPv6 DAD for example somewhat relies on the MAC
address, and that this makes things complicated for OpenShift. Ask Stefano
Brivio about that, he understands the problem much better than I do.

The bottom line is that IMO the line between configuration and state may be
a bit fuzzy, even for a single device model, depending on the use case.

>
> Device Versions
> ---------------
> As a device evolves, the number of configuration parameters required may become
> inconvenient for users to express in full. A device configuration can be
> aliased by a *device version*, which is a shorthand for the full device
> configuration. This makes it easy to apply a standard device configuration
> without listing every configuration parameter explicitly.
>
> For example, if address filtering support was added to a network card then
> device versions and the corresponding configurations may look like this:
> * ``version=1`` - Behaves as if ``rx-filter-size=0``
> * ``version=2`` - ``rx-filter-size=32``

To me, this corresponds to default settings, see below.

If two devices have different versions, do you allow migration?

>
> Device States
> -------------
> The details of the device state representation are not covered in this document
> but the general requirements are discussed here.
>
> The device state consists of data accessible through the device's hardware
> interface and internal state that is needed to restore device operation.
> State in the hardware interface includes the values of hardware registers.
> An example of internal state is an index value needed to avoid processing
> queued requests more than once.



>
> Changes can be made to the device state representation as follows. Each change
> to device state must have a corresponding device configuration parameter that
> allows the change to toggled:
>
> * When the parameter is disabled the hardware interface and device state
>   representation are unchanged. This allows old device states to be loaded.
>
> * When the parameter is enabled the change comes into effect.
>
> * The parameter's default value disables the change. Therefore old versions do
>   not have to explicitly specify the parameter.

I see a problem with this. Imagine a new card has new parameter foo.
Now, you once had a VM on this card that had foo=42. So it has
foo-enabled=true and foo=42. Then you migrate there something that does not
know about foo. Most likely, that would not even touch foo-enabled.

So I think that you need to add that the migration starts with a "reset
state" where all featured are disabled by default.

If that's the case, then you don't really need the "enabled" flag. You
simply need to state that the reset state is compatible with earlier
versions. If you know about the new feature, you set it. If you don't know,
you have the state of the previous version.

Setting a version could allow you to quickly change the defaults.



>
> The following example illustrates migration from an old device
> implementation to a new one. A version=1 network card is migrated to a
> new device implementation that is also capable of version=2 and adds the
> rx-filter-size=32 parameter. The new device is instantiated with
> version=1, which disables rx-filter-size and is capable of loading the
> version=1 device state. The migration completes successfully but note
> the device is still operating at version=1 level in the new device.
>
> The following example illustrates migration from a new device
> implementation back to an older one. The new device implementation
> supports version=1 and version=2. The old device implementation supports
> version=1 only. Therefore the device can only be migrated when
> instantiated with version=1 or the equivalent full configuration
> parameters.
>
> Orchestrating Migrations
> ------------------------
> The following steps must be followed to migrate devices:
>
> 1. Check that the source and destination devices support the same device model.
>
> 2. Check that the destination device supports the source device's
>    configuration. Each configuration parameter must be accepted by the
>    destination in order to ensure that it will be possible to load the device
>    state.
>
> 3. The device state is saved on the source and loaded on the destination.
>
> 4. If migration succeeds then the destination resumes operation and the source
>    must not resume operation. If the migration fails then the source resumes
>    operation and the destination must not resume operation.
>
> VFIO Implementation
> -------------------
> The following applies both to kernel VFIO/mdev drivers and vfio-user device
> backends.
>
> Devices are instantiated based on a version and/or configuration parameters:
> * ``version=1`` - use the device configuration aliased by version 1
> * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> * ``rx-filter-size=0`` - directly set configuration parameters without using a version
>
> Device creation fails if the version and/or configuration parameters are not
> supported.
>
> There must be a mechanism to query the "latest" configuration for a device
> model. It may simply report the ``version=5`` where 5 is the latest version but
> it could also report all configuration parameters instead of using a version
> alias.

Instead of "latest", we could have a query that lists the "supported"
configurations. Again, vGPUs are a good example where this would be
useful. A same card can be partitioned in a number of ways, and you can't
really claim that "M10-2B" or "M10-0Q" is "latest".

You could arguably assign a unique URI to each sub-model. Maybe that's how
you were envisioning things?

--
Cheers,
Christophe de Dinechin (IRC c3d)



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 12:17 ` Dr. David Alan Gilbert
@ 2020-11-03 15:27   ` Stefan Hajnoczi
  2020-11-03 18:49     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-03 15:27 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Kirti Wankhede,
	qemu-devel, Alex Williamson, Thanos Makatos, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 7733 bytes --]

On Tue, Nov 03, 2020 at 12:17:09PM +0000, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > Device Models
> > -------------
> > Devices have a *hardware interface* consisting of hardware registers,
> > interrupts, and so on.
> > 
> > The hardware interface together with the device state representation is called
> > a *device model*. Device models can be assigned URIs such as
> > https://qemu.org/devices/e1000e to uniquely identify them.
> 
> I think this is a unique identifier, not actually a URI; the https://
> isn't needed since no one expects to ever connect to this.

Yes, it could be any unique string. If the URI idea is not popular we
can use any similar scheme.

> > However, secondary aspects related to the physical port may affect the device's
> > hardware interface and need to be reflected in the device configuration. The
> > link speed may depend on the physical port and be reported through the device's
> > hardware interface. In that case a ``link-speed`` configuration parameter is
> > required to prevent unexpected changes to the link speed after migration.
> 
> That's an interesting example; because depending on the device, it might
> be:
>     a) Completely virtualised so that the guest *shouldn't* know what
> the physical link speed is, precisely to allow the physical network on
> the destination to be different.
> 
>     b) Part of the migrated state
> 
>     c) Something that's allowed to be reloaded after migration
> 
>     d) Configurable
> 
> so I'm not sure whether it's a good example in this case or not.

Can you think of an example that has only one option?

I tried but couldn't. For example take a sound card. The guest is aware
the device supports stereo playback (2 output channels), but which exact
stereo host device is used doesn't matter, they are all suitable.

Now imagine migrating to a 7.1 surround-sound device. Similar options
come into play:

a) Emulate stereo and mix it to 7.1 surround-sound on the physical
   device. The guest still sees the stereo device.

b) Refuse migration.

c) Indicate that the output has switched and let the guest reconfigure
   itself (e.g. a sound card with multiple outputs, where one of them is
   stereo and another is 7.1 surround sound).

Which option is desirable depends on the use case.

> Maybe what's needed is a stronger instruction to abstract external
> device state so that it's not part of the configuration in most cases.

Do you want to propose something?

> > For example, if address filtering support was added to a network card then
> > device versions and the corresponding configurations may look like this:
> > * ``version=1`` - Behaves as if ``rx-filter-size=0``
> > * ``version=2`` - ``rx-filter-size=32``
> 
> Note configuration parameters might have been added during the life of
> the device; e.g. if the original card had no support for rx-filters, it
> might not have a rx-filter-size parameter.

version=1 does not explicitly set rx-filter-size=0. When a new parameter
is introduced it must have a default value that disables its effect on
the hardware interface and/or device state representation. This is
described in a bit more detail in the next section, maybe it should be
reordered.

> > Device States
> > -------------
> > The details of the device state representation are not covered in this document
> > but the general requirements are discussed here.
> > 
> > The device state consists of data accessible through the device's hardware
> > interface and internal state that is needed to restore device operation.
> > State in the hardware interface includes the values of hardware registers.
> > An example of internal state is an index value needed to avoid processing
> > queued requests more than once.
> 
> I try and emphasise that 'internal state' should be represented in a way
> that reflects the problem rather than the particular implementation;
> this gives it a better chance of migrating to future versions.

Sounds like a good idea.

> > Changes can be made to the device state representation as follows. Each change
> > to device state must have a corresponding device configuration parameter that
> > allows the change to toggled:
> > 
> > * When the parameter is disabled the hardware interface and device state
> >   representation are unchanged. This allows old device states to be loaded.
> > 
> > * When the parameter is enabled the change comes into effect.
> > 
> > * The parameter's default value disables the change. Therefore old versions do
> >   not have to explicitly specify the parameter.
> > 
> > The following example illustrates migration from an old device
> > implementation to a new one. A version=1 network card is migrated to a
> > new device implementation that is also capable of version=2 and adds the
> > rx-filter-size=32 parameter. The new device is instantiated with
> > version=1, which disables rx-filter-size and is capable of loading the
> > version=1 device state. The migration completes successfully but note
> > the device is still operating at version=1 level in the new device.
> > 
> > The following example illustrates migration from a new device
> > implementation back to an older one. The new device implementation
> > supports version=1 and version=2. The old device implementation supports
> > version=1 only. Therefore the device can only be migrated when
> > instantiated with version=1 or the equivalent full configuration
> > parameters.
> 
> I'm sometimes asked for 'ways out' of buggy migration cases; e.g. what
> happens if version=1 forgot to migrate the X register; or what happens
> if verison=1 forgot to handle the special, rare case when X=5 and we
> now need to migrate some extra state.

Can these cases be handled by adding additional configuration parameters?

If version=1 is lacks essential state then version=2 can add it. The
user must configure the device to use version before they can save the
full state.

If version=1 didn't handle the X=5 case then the same solution is
needed. A new configuration parameter is introduced and the user needs
to configure the device to be the new version before migrating.

Unfortunately this requires poweroff or hotplugging a new device
instance. But some disruption is probably necessarily anyway so the
migration code on the host side can be patched to use the updated device
state representation.

> > Orchestrating Migrations
> > ------------------------
> > The following steps must be followed to migrate devices:
> > 
> > 1. Check that the source and destination devices support the same device model.
> > 
> > 2. Check that the destination device supports the source device's
> >    configuration. Each configuration parameter must be accepted by the
> >    destination in order to ensure that it will be possible to load the device
> >    state.
> 
> This is written in terms of a 'check'; there are at least three tricky
> things:
> 
>   a) Where they both have the same parameter, do they accept the same
> range of values; e.g. a newer version of the card might allow
> rx-filter-size to go upto 128

The easy way to handle that without lots of metadata is by instantiating
the destination device to see if it works.

But in the next point you mention cloud where we need a way to find a
host that supports a given device. Metadata is probably needed to make
that check easy. In the email reply to Daniel Berrange I posted the
beginning of a JSON schema that describes device models for this
purpose. I think that offers a solution for the cloud case.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 15:23 ` Christophe de Dinechin
@ 2020-11-03 15:33   ` Daniel P. Berrangé
  2020-11-03 17:31     ` Alex Williamson
  2020-11-04 11:10   ` Stefan Hajnoczi
  1 sibling, 1 reply; 40+ messages in thread
From: Daniel P. Berrangé @ 2020-11-03 15:33 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: John G Johnson, mtsirkin, quintela, qemu-devel, Jason Wang,
	Kirti Wankhede, Dr. David Alan Gilbert, Alex Williamson,
	Paolo Bonzini, Stefan Hajnoczi, Felipe Franciosi, Thanos Makatos

On Tue, Nov 03, 2020 at 04:23:43PM +0100, Christophe de Dinechin wrote:
> 
> On 2020-11-02 at 12:11 CET, Stefan Hajnoczi wrote...
> > There is discussion about VFIO migration in the "Re: Out-of-Process
> > Device Emulation session at KVM Forum 2020" thread. The current status
> > is that Kirti proposed a VFIO device region type for saving and loading
> > device state. There is currently no guidance on migrating between
> > different device versions or device implementations from different
> > vendors. This is known to be non-trivial and raised discussion about
> > whether it should really be handled by VFIO or centralized in QEMU.
> >
> > Below is a document that describes how to ensure migration compatibility
> > in VFIO. It does not require changes to the VFIO migration interface. It
> > can be used for both VFIO/mdev kernel devices and vfio-user devices.
> >
> > The idea is that the device state blob is opaque to the VMM but the same
> > level of migration compatibility that exists today is still available.
> >
> > I hope this will help us reach consensus and let us discuss specifics.
> >
> > If you followed the previous discussion, I changed the approach from
> > sending a magic constant in the device state blob to identifying device
> > models by URIs. Therefore the device state structure does not need to be
> > defined here - the critical information for ensuring device migration
> > compatibility is the device model and configuration defined below.
> >
> > Stefan
> > ---
> > VFIO Migration
> > ==============
> > This document describes how to save and load VFIO device states. Saving a
> > device state produces a snapshot of a VFIO device's state that can be loaded
> > again at a later point in time to resume the device from the snapshot.
> >
> > The data representation of the device state is outside the scope of this
> > document.
> >
> > Overview
> > --------
> > The purpose of device states is to save the device at a point in time and then
> > restore the device back to the saved state later. This is more challenging than
> > it first appears.
> >
> > The process of saving a device state and loading it later is called
> > *migration*. The state may be loaded by the same device that saved it or by a
> > new instance of the device, possibly running on a different computer.
> >
> > It must be possible to migrate to a newer implementation of the device
> > as well as to an older implementation of the device. This allows users
> > to upgrade and roll back their systems.
> >
> > Migration can fail if loading the device state is not possible. It should fail
> > early with a clear error message. It must not appear to complete but leave the
> > device inoperable due to a migration problem.
> >
> > The rest of this document describes how these requirements can be met.
> >
> > Device Models
> > -------------
> > Devices have a *hardware interface* consisting of hardware registers,
> > interrupts, and so on.
> >
> > The hardware interface together with the device state representation is called
> > a *device model*. Device models can be assigned URIs such as
> > https://qemu.org/devices/e1000e to uniquely identify them.
> 
> Like others, I think we should either
> 
> a) Give a relatively strong requirement regarding what is at the URL in
> question, e.g. docs, maybe even a machine-readable schema describing
> configuration and state for the device. Leaving the option "there can be
> nothing here" is IMO asking for trouble.
> 
> b) simply call that a unique ID, and then either drop the https: entirely or
> use something else, like pci:// or, to be more specific, vfio://
> 
> I'd favor option (b) for a different practical reason. URLs are subject to
> redirection and other mishaps. For example, using https:// begs the question
> whether
> https://qemu.org/devices/e1000e and
> https://www.qemu.org/devices/e1000e
> should be treated as the same device. I believe that your intent is that
> they shouldn't, but if the qemu web server redirects to www, and someone
> wants to copy-paste their web browser's URL bar to the command line, they'd
> get the wrong one.

That's not a real world problem IMHO, because neither of these URLs
ever need resolve to a real webpage, and thus not need to be cut +
paste from a browser.

They are simply expressing a resource identifier using a URI as a
convenient format. This is the same as an XML namespace using a URI,
and rarely, if ever, resolving to any actual web page.

This is a good thing, because if you say there needs to be a real page
there, then it creates a pile of corporate beaurocracy for contributors.
I can freely create a URI under https://redhat.com for purposes of being
a identifier, but I cannot get any content published there without jumping
through many tedious corporate approvals and stand a good chance of being
rejected.

If we're truely treating the URIs as an opaque string, we don't especially
need to define any rules other than to say it should be under a domain that
you have authority over either directly, or via membership of a project
that delegates. We can suggest "https" since seeing "http" is a red flag
for many people these days.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 11:03   ` Stefan Hajnoczi
@ 2020-11-03 17:13     ` Alex Williamson
  2020-11-03 18:09       ` Stefan Hajnoczi
  2020-11-05 23:37       ` Yan Zhao
  0 siblings, 2 replies; 40+ messages in thread
From: Alex Williamson @ 2020-11-03 17:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, Tian, Kevin, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Zeng, Xin, qemu-devel,
	Dr. David Alan Gilbert, Yan Zhao, Kirti Wankhede, Thanos Makatos,
	Felipe Franciosi, Paolo Bonzini

On Tue, 3 Nov 2020 11:03:24 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Mon, Nov 02, 2020 at 12:38:23PM -0700, Alex Williamson wrote:
> > 
> > Cc+ Intel folks as this really bumps into the migration compatibility
> > discussion[1][2][3]
> > 
> > On Mon, 2 Nov 2020 11:11:53 +0000
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:

> > > VFIO Implementation
> > > -------------------
> > > The following applies both to kernel VFIO/mdev drivers and vfio-user device
> > > backends.
> > > 
> > > Devices are instantiated based on a version and/or configuration parameters:
> > > * ``version=1`` - use the device configuration aliased by version 1
> > > * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> > > * ``rx-filter-size=0`` - directly set configuration parameters without using a version
> > > 
> > > Device creation fails if the version and/or configuration parameters are not
> > > supported.
> > > 
> > > There must be a mechanism to query the "latest" configuration for a device
> > > model. It may simply report the ``version=5`` where 5 is the latest version but
> > > it could also report all configuration parameters instead of using a version
> > > alias.  
> > 
> > When we talk about "instantiating" a device here, are we referring to
> > managing the device on the host or within QEMU via something like
> > vfio_realize()?  We create an instance of an mdev on the host via an
> > mdev type using operations on the host sysfs.  That mdev type doesn't
> > really seem to map to your idea if a device model represented by a URI.
> > How are supported URIs exposed and specified when the device is
> > instantiated?
> > 
> > Same for device configuration, we might have per device attributes in
> > host sysfs defining the configuration of a given mdev device, are these
> > the device configuration values?  It seems like you're referring to
> > something much more QEMU centric, but vfio-pci in QEMU handles all
> > devices the same, aside from quirks.
> > 
> > Likewise, I don't know where versions would be exposed in the current
> > vfio interface.  
> 
> "Instantiating" means writing to the mdev "create" sysfs attr. I am not
> very familiar with mdev so this could be totally wrong, but I'll try to
> define a mapping:
> 
> 1. The mdev driver sets up struct
>    mdev_parent_opts->supported_type_groups as follows:
> 
>   /* Device model URI */
>   static ssize_t model_show(struct kobject *kobj,
>                             struct device *dev,
>                             char *buf)
>   {
>       return sprintf(buf, "https://vendor-a.com/my-nic\n");
>   }
>   static MDEV_TYPE_ATTR_RO(model);
> 
>   /* Receive Side Scaling (RSS) */
>   static ssize_t rss_show(struct kobject *kobj,
>                           struct dev *dev,
> 			  char *buf)
>   {
>       return sprintf(buf, "%d\n", ...->rss);
>   }
>   static ssize_t rss_store(struct kobject *kobj,
>                            struct attribute *attr,
> 			   const char *page,
> 			   size_t count)
>   {
>       char *p = (char *) page;
>       unsigned long val = simple_strtoul(p, &p, 10);
> 
>       ...->rss = !!val;
>       return count;
>   }
>   static MDEV_TYPE_ATTR_RW(rss);
> 
>   /* Device version */
>   static ssize_t version_show(struct kobject *kobj,
>                               struct dev *dev,
> 			      char *buf)
>   {
>       return sprintf(buf, "%u\n", ...->version);
>   }
>   static ssize_t version_store(struct kobject *kobj,
>                                struct attribute *attr,
> 			       const char *page,
> 			       size_t count)
>   {
>       char *p = (char *) page;
>       unsigned long val = simple_strtoul(p, &p, 10);
> 
>       /* Set device configuration parameters to their defaults */
>       switch (version) {
>       case 1:
>           ...->rss = false;
> 	  ...->version = 1;
> 	  break;
> 
>       case 2:
>           ...->rss = true;
> 	  ...->version = 2;
> 	  break;
> 
>       default:
>           return -ENOTSUPP;
>       }
> 
>       return count;
>   }
>   static MDEV_TYPE_ATTR_RW(rss);
> 
>   static struct attribute *mdev_type_my_nic_attrs[] = {
>       &mdev_type_attr_model.attr,
>       &mdev_type_attr_rss.attr,
>       &mdev_type_attr_version.attr,
>       NULL,
>   };
> 
>   static struct attribute_group mdev_type_group_my_nic = {
>       .name  = "my-nic", /* shorthand name */
>       .attrs = mdev_type_my_nic_attrs,
>   };
> 
>   struct attribute_group *supported_type_groups[] = {
>       &mdev_type_group_my_nic,
>       NULL,
>   };
> 
> 2. The userspace tooling enumerates supported device models by reading
>    the "model" sysfs attr from each supported type attr group.


So a given mdev type can only support a single model, model just gives
us some independence from the vendor driver association of the mdev
type?  I wonder how "model" is really different from the "name"
attribute on an mdev type other than being more formalized.

 
> 3. Userspace picks the device model it wishes to instantiate and sets
>    the "version" sysfs attr and other device configuration parameters as
>    desired.
> 
> 4. Userspace instantiates the device by writing to the mdev "create" sysfs
>    attr. If instantiation succeeds then migrating a device state saved
>    by the same device model with the same configuration parameters is
>    possible.


These are not feasible semantics, multiple tasks may be instantiating
devices simultaneously, we can't lock a sub-hierarchy of sysfs while
one process configures features to their liking.  It seems more like
these attributes would be read-only to advertise support, but be
applied as part of the write to create, ie. we'd append device specific
attributes after the uuid string, similar to how we've previously
discussed supporting device aggregation.

> 
> Maybe a cleaner way to structure this is to include the version as part
> of the supported type group. So "my-nic" becomes "my-nic-1", "my-nic-2",
> etc. There would still be a "version" sysfs attr but it would be
> read-only. Device configuration parameters would only be present if they
> were actually available in that version. For example, "my-nic-1" would
> not expose an "rss" sysfs attr because it was introduced in "my-nic-2".
> I see pros and cons to both the approach I outlined above and this
> alternative, maybe someone more familiar with mdev has a preference?


How exactly is this different from an mdev type?  An mdev type is
supposed to define a software compatible device configuration.  If
version 2 is not compatible with version 1, we'd expect a vendor to
define a new type.  In practice vendors are often defining new types to
indicate the scale of a device, for example with vGPUs types may
different in the amount of graphics memory per instance.  The topic of
"aggregation" came about as a generic way to describe this within a
single type, when for example we might have an untenable number of
scaling increments to describe each as a separate type.  Unfortunately
"aggregation" is also too generic, "aggregation of what" needs to be
more clearly defined.

 
> > There's also a desire to support the vfio migration interface on
> > non-mdev vfio devices.  We don't know yet if those will be separate,
> > device specific vfio bus drivers or be integrated into existing
> > vfio-pci, but the host device is likely instantiated by binding to a
> > driver, so again I don't really understand where you're proposing this
> > negotiation occurs.  Will management tools be required to create a
> > device on-demand to fulfill a migration request or can we manipulate an
> > existing device into that desired.  Some management layers embrace the
> > idea of device pools rather than dynamic creation.  Thanks,  
> 
> The concept of device instantiation is natural for mdev and vfio-user,
> but not essential.
> 
> When dealing with physical devices (even PCI SR-IOV), we don't need to
> instantiate them explicitly. Device instances can already exist. As long
> as we know their device model URI and configuration parameters we can
> ensure migration compatibility.
> 
> For example, imagine a physical PCI NIC accompanied by a non-mdev VFIO
> migration driver. The device model URI and configuration parameter
> information can be distributed alongside the VFIO migration driver. It
> could be available via modinfo(8), as a separate metadata file, via a
> vendor-specific tool, etc.


I think we want instances of objects to expose their device and
configuration through sysfs, we don't want to require userspace to use
different methods for different flavors of devices, nor should it be
required for someone to remember how a device was instantiated.

 
> Management tools need to match the device model/configuration from the
> source device against the destination device. If the destination is
> capable of supporting the source's device model/configuration then
> migration can proceed safely.
> 
> Let's look at the case where we are migration from an older version of a
> device to a newer version. On the source we have:
> 
>   model = https://vendor-a.com/my-nic
> 
> On the destination we have:
> 
>   model = https://vendor-a.com/my-nic
>   rss = on
> 
> The two devices are incompatible because the destination exposes the RSS
> feature that is not present on the source. The RSS feature involves
> guest-visible hardware interface changes and a change to the device
> state representation. It is not safe to migrate!
> 
> In this case an extra configuration step is necessary so that the
> destination device can accept the device state from the source. The
> management tool invokes a vendor-specific tool to put the device into
> the right configuration:
> 
>   # vendor-tool set-migration-config --device 0000:00:04.0 \
>                                      --model https://vendor-a.com/my-nic
> 
> (This tool only succeeds when the device is bound to VFIO but not yet
> opened.)
> 
> The tool invokes ioctls on the vendor-specific VFIO driver that does two
> things:
> 1. Tells the device to present the old hardware interface without RSS
> 2. Uses the old device state representation without RSS support
> 
> Does this approach fit?


Should we not require that any sort of configuration like this occurs
through sysfs?  We must be able to create an instance with a specific
configuration without using vendor specific tools, therefore in the
worse case we should be able to remove and recreate an instance as we
desire without invoking vendor specific tools.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 15:33   ` Daniel P. Berrangé
@ 2020-11-03 17:31     ` Alex Williamson
  2020-11-04 10:13       ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2020-11-03 17:31 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: John G Johnson, mtsirkin, quintela, Jason Wang,
	Dr. David Alan Gilbert, qemu-devel, Kirti Wankhede,
	Paolo Bonzini, Stefan Hajnoczi, Felipe Franciosi,
	Christophe de Dinechin, Thanos Makatos

On Tue, 3 Nov 2020 15:33:56 +0000
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Tue, Nov 03, 2020 at 04:23:43PM +0100, Christophe de Dinechin wrote:
> > 
> > On 2020-11-02 at 12:11 CET, Stefan Hajnoczi wrote...  
> > > There is discussion about VFIO migration in the "Re: Out-of-Process
> > > Device Emulation session at KVM Forum 2020" thread. The current status
> > > is that Kirti proposed a VFIO device region type for saving and loading
> > > device state. There is currently no guidance on migrating between
> > > different device versions or device implementations from different
> > > vendors. This is known to be non-trivial and raised discussion about
> > > whether it should really be handled by VFIO or centralized in QEMU.
> > >
> > > Below is a document that describes how to ensure migration compatibility
> > > in VFIO. It does not require changes to the VFIO migration interface. It
> > > can be used for both VFIO/mdev kernel devices and vfio-user devices.
> > >
> > > The idea is that the device state blob is opaque to the VMM but the same
> > > level of migration compatibility that exists today is still available.
> > >
> > > I hope this will help us reach consensus and let us discuss specifics.
> > >
> > > If you followed the previous discussion, I changed the approach from
> > > sending a magic constant in the device state blob to identifying device
> > > models by URIs. Therefore the device state structure does not need to be
> > > defined here - the critical information for ensuring device migration
> > > compatibility is the device model and configuration defined below.
> > >
> > > Stefan
> > > ---
> > > VFIO Migration
> > > ==============
> > > This document describes how to save and load VFIO device states. Saving a
> > > device state produces a snapshot of a VFIO device's state that can be loaded
> > > again at a later point in time to resume the device from the snapshot.
> > >
> > > The data representation of the device state is outside the scope of this
> > > document.
> > >
> > > Overview
> > > --------
> > > The purpose of device states is to save the device at a point in time and then
> > > restore the device back to the saved state later. This is more challenging than
> > > it first appears.
> > >
> > > The process of saving a device state and loading it later is called
> > > *migration*. The state may be loaded by the same device that saved it or by a
> > > new instance of the device, possibly running on a different computer.
> > >
> > > It must be possible to migrate to a newer implementation of the device
> > > as well as to an older implementation of the device. This allows users
> > > to upgrade and roll back their systems.
> > >
> > > Migration can fail if loading the device state is not possible. It should fail
> > > early with a clear error message. It must not appear to complete but leave the
> > > device inoperable due to a migration problem.
> > >
> > > The rest of this document describes how these requirements can be met.
> > >
> > > Device Models
> > > -------------
> > > Devices have a *hardware interface* consisting of hardware registers,
> > > interrupts, and so on.
> > >
> > > The hardware interface together with the device state representation is called
> > > a *device model*. Device models can be assigned URIs such as
> > > https://qemu.org/devices/e1000e to uniquely identify them.  
> > 
> > Like others, I think we should either
> > 
> > a) Give a relatively strong requirement regarding what is at the URL in
> > question, e.g. docs, maybe even a machine-readable schema describing
> > configuration and state for the device. Leaving the option "there can be
> > nothing here" is IMO asking for trouble.
> > 
> > b) simply call that a unique ID, and then either drop the https: entirely or
> > use something else, like pci:// or, to be more specific, vfio://
> > 
> > I'd favor option (b) for a different practical reason. URLs are subject to
> > redirection and other mishaps. For example, using https:// begs the question
> > whether
> > https://qemu.org/devices/e1000e and
> > https://www.qemu.org/devices/e1000e
> > should be treated as the same device. I believe that your intent is that
> > they shouldn't, but if the qemu web server redirects to www, and someone
> > wants to copy-paste their web browser's URL bar to the command line, they'd
> > get the wrong one.  
> 
> That's not a real world problem IMHO, because neither of these URLs
> ever need resolve to a real webpage, and thus not need to be cut +
> paste from a browser.
> 
> They are simply expressing a resource identifier using a URI as a
> convenient format. This is the same as an XML namespace using a URI,
> and rarely, if ever, resolving to any actual web page.
> 
> This is a good thing, because if you say there needs to be a real page
> there, then it creates a pile of corporate beaurocracy for contributors.
> I can freely create a URI under https://redhat.com for purposes of being
> a identifier, but I cannot get any content published there without jumping
> through many tedious corporate approvals and stand a good chance of being
> rejected.
> 
> If we're truely treating the URIs as an opaque string, we don't especially
> need to define any rules other than to say it should be under a domain that
> you have authority over either directly, or via membership of a project
> that delegates. We can suggest "https" since seeing "http" is a red flag
> for many people these days.

Hmm, an opaque string, sort of like the existing "name" attribute we
have now where Christophe quoted some examples in his message.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 17:13     ` Alex Williamson
@ 2020-11-03 18:09       ` Stefan Hajnoczi
  2020-11-05 23:37       ` Yan Zhao
  1 sibling, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-03 18:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: John G Johnson, Tian, Kevin, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Zeng, Xin, qemu-devel,
	Dr. David Alan Gilbert, Yan Zhao, Kirti Wankhede, Thanos Makatos,
	Felipe Franciosi, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 12831 bytes --]

On Tue, Nov 03, 2020 at 10:13:05AM -0700, Alex Williamson wrote:
> On Tue, 3 Nov 2020 11:03:24 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Mon, Nov 02, 2020 at 12:38:23PM -0700, Alex Williamson wrote:
> > > 
> > > Cc+ Intel folks as this really bumps into the migration compatibility
> > > discussion[1][2][3]
> > > 
> > > On Mon, 2 Nov 2020 11:11:53 +0000
> > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > > > VFIO Implementation
> > > > -------------------
> > > > The following applies both to kernel VFIO/mdev drivers and vfio-user device
> > > > backends.
> > > > 
> > > > Devices are instantiated based on a version and/or configuration parameters:
> > > > * ``version=1`` - use the device configuration aliased by version 1
> > > > * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> > > > * ``rx-filter-size=0`` - directly set configuration parameters without using a version
> > > > 
> > > > Device creation fails if the version and/or configuration parameters are not
> > > > supported.
> > > > 
> > > > There must be a mechanism to query the "latest" configuration for a device
> > > > model. It may simply report the ``version=5`` where 5 is the latest version but
> > > > it could also report all configuration parameters instead of using a version
> > > > alias.  
> > > 
> > > When we talk about "instantiating" a device here, are we referring to
> > > managing the device on the host or within QEMU via something like
> > > vfio_realize()?  We create an instance of an mdev on the host via an
> > > mdev type using operations on the host sysfs.  That mdev type doesn't
> > > really seem to map to your idea if a device model represented by a URI.
> > > How are supported URIs exposed and specified when the device is
> > > instantiated?
> > > 
> > > Same for device configuration, we might have per device attributes in
> > > host sysfs defining the configuration of a given mdev device, are these
> > > the device configuration values?  It seems like you're referring to
> > > something much more QEMU centric, but vfio-pci in QEMU handles all
> > > devices the same, aside from quirks.
> > > 
> > > Likewise, I don't know where versions would be exposed in the current
> > > vfio interface.  
> > 
> > "Instantiating" means writing to the mdev "create" sysfs attr. I am not
> > very familiar with mdev so this could be totally wrong, but I'll try to
> > define a mapping:
> > 
> > 1. The mdev driver sets up struct
> >    mdev_parent_opts->supported_type_groups as follows:
> > 
> >   /* Device model URI */
> >   static ssize_t model_show(struct kobject *kobj,
> >                             struct device *dev,
> >                             char *buf)
> >   {
> >       return sprintf(buf, "https://vendor-a.com/my-nic\n");
> >   }
> >   static MDEV_TYPE_ATTR_RO(model);
> > 
> >   /* Receive Side Scaling (RSS) */
> >   static ssize_t rss_show(struct kobject *kobj,
> >                           struct dev *dev,
> > 			  char *buf)
> >   {
> >       return sprintf(buf, "%d\n", ...->rss);
> >   }
> >   static ssize_t rss_store(struct kobject *kobj,
> >                            struct attribute *attr,
> > 			   const char *page,
> > 			   size_t count)
> >   {
> >       char *p = (char *) page;
> >       unsigned long val = simple_strtoul(p, &p, 10);
> > 
> >       ...->rss = !!val;
> >       return count;
> >   }
> >   static MDEV_TYPE_ATTR_RW(rss);
> > 
> >   /* Device version */
> >   static ssize_t version_show(struct kobject *kobj,
> >                               struct dev *dev,
> > 			      char *buf)
> >   {
> >       return sprintf(buf, "%u\n", ...->version);
> >   }
> >   static ssize_t version_store(struct kobject *kobj,
> >                                struct attribute *attr,
> > 			       const char *page,
> > 			       size_t count)
> >   {
> >       char *p = (char *) page;
> >       unsigned long val = simple_strtoul(p, &p, 10);
> > 
> >       /* Set device configuration parameters to their defaults */
> >       switch (version) {
> >       case 1:
> >           ...->rss = false;
> > 	  ...->version = 1;
> > 	  break;
> > 
> >       case 2:
> >           ...->rss = true;
> > 	  ...->version = 2;
> > 	  break;
> > 
> >       default:
> >           return -ENOTSUPP;
> >       }
> > 
> >       return count;
> >   }
> >   static MDEV_TYPE_ATTR_RW(rss);
> > 
> >   static struct attribute *mdev_type_my_nic_attrs[] = {
> >       &mdev_type_attr_model.attr,
> >       &mdev_type_attr_rss.attr,
> >       &mdev_type_attr_version.attr,
> >       NULL,
> >   };
> > 
> >   static struct attribute_group mdev_type_group_my_nic = {
> >       .name  = "my-nic", /* shorthand name */
> >       .attrs = mdev_type_my_nic_attrs,
> >   };
> > 
> >   struct attribute_group *supported_type_groups[] = {
> >       &mdev_type_group_my_nic,
> >       NULL,
> >   };
> > 
> > 2. The userspace tooling enumerates supported device models by reading
> >    the "model" sysfs attr from each supported type attr group.
> 
> 
> So a given mdev type can only support a single model, model just gives
> us some independence from the vendor driver association of the mdev
> type?  I wonder how "model" is really different from the "name"
> attribute on an mdev type other than being more formalized.

Two reasons, neither of them critical:
1. Short names are more human-friendly than full device model URIs.
2. I was concerned about escaping characters if the type name is used as
   a sysfs directory name. "https://qemu.org/devices/e1000" cannot be a
   path component, it needs to be escaped somehow to avoid the
   backslashes.

> > 3. Userspace picks the device model it wishes to instantiate and sets
> >    the "version" sysfs attr and other device configuration parameters as
> >    desired.
> > 
> > 4. Userspace instantiates the device by writing to the mdev "create" sysfs
> >    attr. If instantiation succeeds then migrating a device state saved
> >    by the same device model with the same configuration parameters is
> >    possible.
> 
> 
> These are not feasible semantics, multiple tasks may be instantiating
> devices simultaneously, we can't lock a sub-hierarchy of sysfs while
> one process configures features to their liking.  It seems more like
> these attributes would be read-only to advertise support, but be
> applied as part of the write to create, ie. we'd append device specific
> attributes after the uuid string, similar to how we've previously
> discussed supporting device aggregation.

Thanks for explaining and suggesting a solution.

> > 
> > Maybe a cleaner way to structure this is to include the version as part
> > of the supported type group. So "my-nic" becomes "my-nic-1", "my-nic-2",
> > etc. There would still be a "version" sysfs attr but it would be
> > read-only. Device configuration parameters would only be present if they
> > were actually available in that version. For example, "my-nic-1" would
> > not expose an "rss" sysfs attr because it was introduced in "my-nic-2".
> > I see pros and cons to both the approach I outlined above and this
> > alternative, maybe someone more familiar with mdev has a preference?
> 
> 
> How exactly is this different from an mdev type?  An mdev type is
> supposed to define a software compatible device configuration.  If
> version 2 is not compatible with version 1, we'd expect a vendor to
> define a new type.  In practice vendors are often defining new types to
> indicate the scale of a device, for example with vGPUs types may
> different in the amount of graphics memory per instance.  The topic of
> "aggregation" came about as a generic way to describe this within a
> single type, when for example we might have an untenable number of
> scaling increments to describe each as a separate type.  Unfortunately
> "aggregation" is also too generic, "aggregation of what" needs to be
> more clearly defined.

Yes, this is a 1:1 mapping of device versions to mdev types. Based on
QEMU's device models I would estimate that devices may have over 100
versions but less than 1000. Does that sound reasonable?

If having so many mdev types is problematic we can use the multiplexing
approach that we discussed above with a "version" sysfs attr.

> > > There's also a desire to support the vfio migration interface on
> > > non-mdev vfio devices.  We don't know yet if those will be separate,
> > > device specific vfio bus drivers or be integrated into existing
> > > vfio-pci, but the host device is likely instantiated by binding to a
> > > driver, so again I don't really understand where you're proposing this
> > > negotiation occurs.  Will management tools be required to create a
> > > device on-demand to fulfill a migration request or can we manipulate an
> > > existing device into that desired.  Some management layers embrace the
> > > idea of device pools rather than dynamic creation.  Thanks,  
> > 
> > The concept of device instantiation is natural for mdev and vfio-user,
> > but not essential.
> > 
> > When dealing with physical devices (even PCI SR-IOV), we don't need to
> > instantiate them explicitly. Device instances can already exist. As long
> > as we know their device model URI and configuration parameters we can
> > ensure migration compatibility.
> > 
> > For example, imagine a physical PCI NIC accompanied by a non-mdev VFIO
> > migration driver. The device model URI and configuration parameter
> > information can be distributed alongside the VFIO migration driver. It
> > could be available via modinfo(8), as a separate metadata file, via a
> > vendor-specific tool, etc.
> 
> 
> I think we want instances of objects to expose their device and
> configuration through sysfs, we don't want to require userspace to use
> different methods for different flavors of devices, nor should it be
> required for someone to remember how a device was instantiated.

Sounds good. The sysfs layout can be the same as for mdev types.

> > Management tools need to match the device model/configuration from the
> > source device against the destination device. If the destination is
> > capable of supporting the source's device model/configuration then
> > migration can proceed safely.
> > 
> > Let's look at the case where we are migration from an older version of a
> > device to a newer version. On the source we have:
> > 
> >   model = https://vendor-a.com/my-nic
> > 
> > On the destination we have:
> > 
> >   model = https://vendor-a.com/my-nic
> >   rss = on
> > 
> > The two devices are incompatible because the destination exposes the RSS
> > feature that is not present on the source. The RSS feature involves
> > guest-visible hardware interface changes and a change to the device
> > state representation. It is not safe to migrate!
> > 
> > In this case an extra configuration step is necessary so that the
> > destination device can accept the device state from the source. The
> > management tool invokes a vendor-specific tool to put the device into
> > the right configuration:
> > 
> >   # vendor-tool set-migration-config --device 0000:00:04.0 \
> >                                      --model https://vendor-a.com/my-nic
> > 
> > (This tool only succeeds when the device is bound to VFIO but not yet
> > opened.)
> > 
> > The tool invokes ioctls on the vendor-specific VFIO driver that does two
> > things:
> > 1. Tells the device to present the old hardware interface without RSS
> > 2. Uses the old device state representation without RSS support
> > 
> > Does this approach fit?
> 
> 
> Should we not require that any sort of configuration like this occurs
> through sysfs?  We must be able to create an instance with a specific
> configuration without using vendor specific tools, therefore in the
> worse case we should be able to remove and recreate an instance as we
> desire without invoking vendor specific tools.

Yes, sysfs sounds good.

One thing that is a little ugly is that the version=1 device above does
not have the "rss" configuration parameter and we have no value to set
on the destination device, which is newer and supports the "rss"
configuration parameter. It may be necessary to implement a special case
for disabling configuration parameters like writing an empty string or
magic string to the "rss" sysfs attr.

Or maybe there would be a standard sysfs attr called "config_params"
where userspace writes a space-separated list of configuration parameter
names that are active. Then userspace could omit "rss" from the
"config_params" sysfs attr to disable it.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 15:23     ` Daniel P. Berrangé
@ 2020-11-03 18:16       ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-03 18:16 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: John G Johnson, mtsirkin, quintela, Jason Wang, Felipe Franciosi,
	Kirti Wankhede, qemu-devel, Alex Williamson, Thanos Makatos,
	Paolo Bonzini, Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 4959 bytes --]

On Tue, Nov 03, 2020 at 03:23:03PM +0000, Daniel P. Berrangé wrote:
> On Tue, Nov 03, 2020 at 03:05:08PM +0000, Stefan Hajnoczi wrote:
> > On Tue, Nov 03, 2020 at 11:39:29AM +0000, Daniel P. Berrangé wrote:
> > > On Mon, Nov 02, 2020 at 11:11:53AM +0000, Stefan Hajnoczi wrote:
> > > > Overview
> > > > --------
> > > > The purpose of device states is to save the device at a point in time and then
> > > > restore the device back to the saved state later. This is more challenging than
> > > > it first appears.
> > > > 
> > > > The process of saving a device state and loading it later is called
> > > > *migration*. The state may be loaded by the same device that saved it or by a
> > > > new instance of the device, possibly running on a different computer.
> > > > 
> > > > It must be possible to migrate to a newer implementation of the device
> > > > as well as to an older implementation of the device. This allows users
> > > > to upgrade and roll back their systems.
> > > > 
> > > > Migration can fail if loading the device state is not possible. It should fail
> > > > early with a clear error message. It must not appear to complete but leave the
> > > > device inoperable due to a migration problem.
> > > 
> > > I think there needs to be an addition requirement.
> > > 
> > >  It must be possible for a management application to query the supported
> > >  versions, independantly of execution of a migration  operation.
> > > 
> > > This is important to large scale data center / cloud management applications
> > > because before initiating a migration they need to *automatically* select
> > > a target host with high level of confidence that is will be compatible with
> > > the source host.
> > > 
> > > Today QEMU migration compatibility is largely determined by the machine
> > > type version. Apps can query the supported machine types for host to
> > > check whether it is compatible. Similarly they will query CPU model
> > > features to check compatiblity.
> > > 
> > > Validation and error checking at time of migration is of course still
> > > required, but the goal should be that an mgmt application will *NEVER*
> > > hit these errors because they will have pre-selected a host that is
> > > known to be compatible based on reported versions that are supported.
> > 
> > Okay. What do you think of the following?
> > 
> >   [
> >     {
> >       "model": "https://qemu.org/devices/e1000e",
> >       "params": [
> >         "rss",
> > 	...more configuration parameters...
> >       ],
> >       "versions": [
> >         {
> > 	  "name": "1",
> > 	  "params": [],
> > 	},
> > 	{
> > 	  "name": "2",
> > 	  "params": ["rss=on"],
> > 	},
> > 	...more versions...
> >       ]
> >     },
> >     ...more device models...
> >   ]
> > 
> > The management tool can generate the configuration parameter list by
> > expanding a version into its params.
> > 
> > Configuration parameter types and input ranges need more thought. For
> > example, version 1 of the device might not have rx-table-size (it's
> > effectively 0). Version 2 introduces rx-table-size and sets it to 32.
> > Version 3 raises the value to 64. In addition, the user can set a custom
> > value like rx-table-size=48. I haven't defined the rules for this yet,
> > but it's clear there needs to be a way to extend configuration
> > parameters.
> > 
> > To check migration compatibility:
> > 1. Verify that the device model URL matches the JSON data[n].model
> >    field.
> > 2. For every configuration parameter name from the source device,
> >    check that it is contained within the JSON data[n].params list.
> 
> I'm not convinced that this makes sense. A matching set of parameter
> names + values does not imply that the migration data stream is
> actually compatible.
> 
> ie implementations may need to change the internal migration data
> stream to fix bugs, without adding/removing a config parameter.
> The migration version string alone expresses data stream compatibility.

This is not the approach described in this document. The point of this
approach is precisely that migration is known to be safe when the device
model URI and configuration parameters match on source and destination.

Changes to the guest-visible hardware interface and/or device state
representation always require a new configuration parameter under this
approach.

> This is similar to how 2 QEMU command lines can have identical set
> of configuration parameters, aside from the machine type version,
> and thus be migration *incompatible.

That is not possible under this approach.

> Basically the version string should be considered an opaque blob
> that expresses compatibility on its own.

The version string is not directly part of the migration compatibility
check under this approach. It's is simply an alias for a list of
configuration parameters.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 15:27   ` Stefan Hajnoczi
@ 2020-11-03 18:49     ` Dr. David Alan Gilbert
  2020-11-04  7:36       ` Stefan Hajnoczi
  2020-11-04 11:05       ` Christophe de Dinechin
  0 siblings, 2 replies; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-03 18:49 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Kirti Wankhede,
	qemu-devel, Alex Williamson, Thanos Makatos, Paolo Bonzini

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Tue, Nov 03, 2020 at 12:17:09PM +0000, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > Device Models
> > > -------------
> > > Devices have a *hardware interface* consisting of hardware registers,
> > > interrupts, and so on.
> > > 
> > > The hardware interface together with the device state representation is called
> > > a *device model*. Device models can be assigned URIs such as
> > > https://qemu.org/devices/e1000e to uniquely identify them.
> > 
> > I think this is a unique identifier, not actually a URI; the https://
> > isn't needed since no one expects to ever connect to this.
> 
> Yes, it could be any unique string. If the URI idea is not popular we
> can use any similar scheme.

I'm OK with it being a URI; just drop the https.

> > > However, secondary aspects related to the physical port may affect the device's
> > > hardware interface and need to be reflected in the device configuration. The
> > > link speed may depend on the physical port and be reported through the device's
> > > hardware interface. In that case a ``link-speed`` configuration parameter is
> > > required to prevent unexpected changes to the link speed after migration.
> > 
> > That's an interesting example; because depending on the device, it might
> > be:
> >     a) Completely virtualised so that the guest *shouldn't* know what
> > the physical link speed is, precisely to allow the physical network on
> > the destination to be different.
> > 
> >     b) Part of the migrated state
> > 
> >     c) Something that's allowed to be reloaded after migration
> > 
> >     d) Configurable
> > 
> > so I'm not sure whether it's a good example in this case or not.
> 
> Can you think of an example that has only one option?
> 
> I tried but couldn't. For example take a sound card. The guest is aware
> the device supports stereo playback (2 output channels), but which exact
> stereo host device is used doesn't matter, they are all suitable.
> 
> Now imagine migrating to a 7.1 surround-sound device. Similar options
> come into play:
> 
> a) Emulate stereo and mix it to 7.1 surround-sound on the physical
>    device. The guest still sees the stereo device.
> 
> b) Refuse migration.
> 
> c) Indicate that the output has switched and let the guest reconfigure
>    itself (e.g. a sound card with multiple outputs, where one of them is
>    stereo and another is 7.1 surround sound).
> 
> Which option is desirable depends on the use case.

Yes, but I think it might be worth calling out these differences;  there
are explicitly cases where you don't want external changes to be visible
and other cases where you do; both are valid, but both need thinking
about. (Another one, GPU whether you have a monitor plugged in!)

> > Maybe what's needed is a stronger instruction to abstract external
> > device state so that it's not part of the configuration in most cases.
> 
> Do you want to propose something?

I think something like 'Some part of a devices state may be irrelevant
to a migration; for example on some NICs it might be preferable to hide
the physical characteristics of the link from the guest.'

> > > For example, if address filtering support was added to a network card then
> > > device versions and the corresponding configurations may look like this:
> > > * ``version=1`` - Behaves as if ``rx-filter-size=0``
> > > * ``version=2`` - ``rx-filter-size=32``
> > 
> > Note configuration parameters might have been added during the life of
> > the device; e.g. if the original card had no support for rx-filters, it
> > might not have a rx-filter-size parameter.
> 
> version=1 does not explicitly set rx-filter-size=0. When a new parameter
> is introduced it must have a default value that disables its effect on
> the hardware interface and/or device state representation. This is
> described in a bit more detail in the next section, maybe it should be
> reordered.

We've generally found the definition of devices tends in practice to be
done newer->older; i.e. you define the current machine, and then define
the next older machine setting the defaults that used to be true; then
define the older version behind that....

> > > Device States
> > > -------------
> > > The details of the device state representation are not covered in this document
> > > but the general requirements are discussed here.
> > > 
> > > The device state consists of data accessible through the device's hardware
> > > interface and internal state that is needed to restore device operation.
> > > State in the hardware interface includes the values of hardware registers.
> > > An example of internal state is an index value needed to avoid processing
> > > queued requests more than once.
> > 
> > I try and emphasise that 'internal state' should be represented in a way
> > that reflects the problem rather than the particular implementation;
> > this gives it a better chance of migrating to future versions.
> 
> Sounds like a good idea.
> 
> > > Changes can be made to the device state representation as follows. Each change
> > > to device state must have a corresponding device configuration parameter that
> > > allows the change to toggled:
> > > 
> > > * When the parameter is disabled the hardware interface and device state
> > >   representation are unchanged. This allows old device states to be loaded.
> > > 
> > > * When the parameter is enabled the change comes into effect.
> > > 
> > > * The parameter's default value disables the change. Therefore old versions do
> > >   not have to explicitly specify the parameter.
> > > 
> > > The following example illustrates migration from an old device
> > > implementation to a new one. A version=1 network card is migrated to a
> > > new device implementation that is also capable of version=2 and adds the
> > > rx-filter-size=32 parameter. The new device is instantiated with
> > > version=1, which disables rx-filter-size and is capable of loading the
> > > version=1 device state. The migration completes successfully but note
> > > the device is still operating at version=1 level in the new device.
> > > 
> > > The following example illustrates migration from a new device
> > > implementation back to an older one. The new device implementation
> > > supports version=1 and version=2. The old device implementation supports
> > > version=1 only. Therefore the device can only be migrated when
> > > instantiated with version=1 or the equivalent full configuration
> > > parameters.
> > 
> > I'm sometimes asked for 'ways out' of buggy migration cases; e.g. what
> > happens if version=1 forgot to migrate the X register; or what happens
> > if verison=1 forgot to handle the special, rare case when X=5 and we
> > now need to migrate some extra state.
> 
> Can these cases be handled by adding additional configuration parameters?
> 
> If version=1 is lacks essential state then version=2 can add it. The
> user must configure the device to use version before they can save the
> full state.
> 
> If version=1 didn't handle the X=5 case then the same solution is
> needed. A new configuration parameter is introduced and the user needs
> to configure the device to be the new version before migrating.
> 
> Unfortunately this requires poweroff or hotplugging a new device
> instance. But some disruption is probably necessarily anyway so the
> migration code on the host side can be patched to use the updated device
> state representation.

There are some corner cases that people sometimes prefer; for example
lets say the X=5 case is actually really rare - but when it happens the
device is hopelessly broken, some device authors prefer to fix it and
send the extra data and let the migration fail if the destination
doesn't understand it (it would break anyway).  I've also been asked
by mst for a 'unexpected data' mechanism to send data that the
destination might not expect if it didn't know about it, for similar
cases.

> > > Orchestrating Migrations
> > > ------------------------
> > > The following steps must be followed to migrate devices:
> > > 
> > > 1. Check that the source and destination devices support the same device model.
> > > 
> > > 2. Check that the destination device supports the source device's
> > >    configuration. Each configuration parameter must be accepted by the
> > >    destination in order to ensure that it will be possible to load the device
> > >    state.
> > 
> > This is written in terms of a 'check'; there are at least three tricky
> > things:
> > 
> >   a) Where they both have the same parameter, do they accept the same
> > range of values; e.g. a newer version of the card might allow
> > rx-filter-size to go upto 128
> 
> The easy way to handle that without lots of metadata is by instantiating
> the destination device to see if it works.
> 
> But in the next point you mention cloud where we need a way to find a
> host that supports a given device. Metadata is probably needed to make
> that check easy. In the email reply to Daniel Berrange I posted the
> beginning of a JSON schema that describes device models for this
> purpose. I think that offers a solution for the cloud case.

A similar suggestion had come up in the vfio thread with Nvidia
some months ago; I can't remember the outcome of that.
(Much of this thread repeats the repeated long discussions on that
thread!)

Dave

> 
> Stefan


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 12:15   ` Stefan Hajnoczi
@ 2020-11-04  3:32     ` Jason Wang
  2020-11-04  7:16       ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2020-11-04  3:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, qemu-devel, Kirti Wankhede, Dr. David Alan Gilbert,
	Alex Williamson, Paolo Bonzini, Felipe Franciosi, Thanos Makatos


On 2020/11/3 下午8:15, Stefan Hajnoczi wrote:
> On Tue, Nov 03, 2020 at 04:46:53PM +0800, Jason Wang wrote:
>> On 2020/11/2 下午7:11, Stefan Hajnoczi wrote:
>>> There is discussion about VFIO migration in the "Re: Out-of-Process
>>> Device Emulation session at KVM Forum 2020" thread. The current status
>>> is that Kirti proposed a VFIO device region type for saving and loading
>>> device state. There is currently no guidance on migrating between
>>> different device versions or device implementations from different
>>> vendors. This is known to be non-trivial and raised discussion about
>>> whether it should really be handled by VFIO or centralized in QEMU.
>>>
>>> Below is a document that describes how to ensure migration compatibility
>>> in VFIO. It does not require changes to the VFIO migration interface. It
>>> can be used for both VFIO/mdev kernel devices and vfio-user devices.
>>>
>>> The idea is that the device state blob is opaque to the VMM but the same
>>> level of migration compatibility that exists today is still available.
>>
>> So if we can't mandate this or there's no way to validate this. Vendor is
>> still free to implement their own protocol which could lead a lot of
>> maintaining burden.
> Yes, the device state representation is their responsibility. We can't
> do that for them since they define the hardware interface and internal
> state.
>
> As Michael and Paolo have mentioned in the other thread, we can provide
> guidelines and standardize common aspects.
>
>>> Migration can fail if loading the device state is not possible. It should fail
>>> early with a clear error message. It must not appear to complete but leave the
>>> device inoperable due to a migration problem.
>>
>> For VFIO-user, how management know that a VM can be migrated from src to
>> dst? For kernel, we have sysfs.
> vfio-user devices will normally be instantiated in one of two ways:
>
> 1. Launching a device backend and passing command-line parameters:
>
>       $ my-nic --socket-path /tmp/my-nic-vfio-user.sock \
>                --model https://vendor-a.com/my-nic \
> 	      --rss on
>
>     Here "model" is the device model URL. The program could support
>     multiple device models.
>
>     The "rss" device configuration parameter enables Receive Side Scaling
>     (RSS) as an example of a configuration parameter.
>
> 2. Creating a device using an RPC interface:
>
>       (qemu) device-add my-nic,rss=on
>
> If the device instantiation succeeds then it is safe to live migrate.
> The device is exposing the desired hardware interface and expecting the
> right device state representation.


Does this mean there will still be a "my-nic" stub in qemu? (I thought 
it should be a generic one like device-add "vfio-user-pci")


>
>>> The rest of this document describes how these requirements can be met.
>>>
>>> Device Models
>>> -------------
>>> Devices have a *hardware interface* consisting of hardware registers,
>>> interrupts, and so on.
>>>
>>> The hardware interface together with the device state representation is called
>>> a *device model*. Device models can be assigned URIs such as
>>> https://qemu.org/devices/e1000e to uniquely identify them.
>>
>> It looks worse than "pci://vendor_id.device_id.subvendor_id.subdevice_id".
>> "e1000e" means a lot of different 8275X implementations that have subtle but
>> easy to be ignored differences.
> If you wish to reflect those differences in the device model URI then
> you can use:
>
>    https://qemu.org/devices/pci/<vendor-id>/<device-id>/<subvendor-id>/<subdevice-id>
>
> Another option is to use device configuration parameters to express
> differences.
>
> The important thing is that this device model URI has one owner. No one
> else will use qemu.org. There can be many different e1000e device model
> URIs, if necessary (with slightly different hardware interfaces and/or
> device state representations). This avoids collisions.
>
>> And is it possible to have a list of URIs here?
> A device implementation (mdev driver, vfio-user device backend, etc) may
> support multiple device model URIs.
>
> A device instance has an immutable device model URI and list of
> configuration parameters. In other words, once the device is created its
> ABI is fixed for the lifetime of the device. A new device instance can
> be configured by powering off the machine, hotplug, etc.
>
>>> Multiple implementations of a device model may exist. They are they are
>>> interchangeable if they follow the same hardware interface and device
>>> state representation.
>>>
>>> Multiple implementations of the same hardware interface may exist with
>>> different device state representations, in which case the device models are not
>>> interchangeable and must be assigned different URIs.
>>>
>>> Migration is only possible when the same device model is supported by the
>>> *source* and the *destination* devices.
>>>
>>> Device Configuration
>>> --------------------
>>> Device models may have parameters that affect the hardware interface or device
>>> state representation. For example, a network card may have a configurable
>>> address filtering table size parameter called ``rx-filter-size``. A
>>> device state saved with ``rx-filter-size=32`` cannot be safely loaded
>>> into a device with ``rx-filter-size=0``, because changing the size from
>>> 32 to 0 may disrupt device operation.
>>
>> Do we allow the migration from "rx-filter-size=16" to "rx-filter-size=32" (I
>> guess not?) And should we extend the concept to "device capability" instead
>> of just state representation.  E.g src has CAP_X=on,CAP_Y=off but dst has
>> CAP_X=on,CAP_Y=on, so we disallow the migration from src to dst.
> A device instance's configuration parameters are immutable.
> rx-filter-size=16 cannot be migrated to rx-filter-size=32.


But then it looks to me we can't migrate back, or do you mean it is 
required to have the ability to change the max rx-filter-size.


>
> Yes, configuration parameters can describe capabilities. I think of
> capabilities as something that affects the guest-visible hardware
> interface (e.g. the RSS feature bit is enabled) that is mentioned in the
> text, but it would be clearer to mention them explicitly.
>
>>> A list of configuration parameters is called the *device configuration*.
>>> Migration is expected to succeed when the same device model and configuration
>>> that was used for saving the device state is used again to load it.
>>>
>>> Note that not all parameters used to instantiate a device need to be part of
>>> the device configuration. For example, assigning a network card to a specific
>>> physical port is not part of the device configuration since it is not part of
>>> the device's hardware interface or the device state representation.
>>
>> Yes, but the task needs to be done by management somehow. So do you expect a
>> vendor specific provisioning API here?
> There seems to be no consensus on this yet. It's the question of how to
> manage the lifecycle of VFIO, mdev, vhost-user, and vfio-user devices.
> There are attempts to standardize in some of these areas.
>
> For mdev drivers we can standardize the sysfs interface so management
> tools can query source devices and instantiate destination devices
> without device-specific code.


Even for mdev, it should be have some class defined for sysfs which 
could be a standard way to configure NVME or virtio device.


>
> For vhost-user devices there is the backend program conventions
> specification, which aims to standardize common parameters. This makes
> integrating support for new device implementations easier (there is less
> device implementation-specific code).
>
> For vfio-user devices something based on the vhost-user backend program
> conventions spec could work well.
>
> The main issue could be that avoiding vendor-specific provisioning code
> in management software either requires you to restrict yourself to a few
> standard device types or to pass through configuration data.
>
> A libvirt opinion would be interesting.
>
>>> The device
>>> state can be loaded and run on a different physical port without affecting the
>>> operation of the device. Therefore the physical port is not part of the device
>>> configuration.
>>>
>>> However, secondary aspects related to the physical port may affect the device's
>>> hardware interface and need to be reflected in the device configuration. The
>>> link speed may depend on the physical port and be reported through the device's
>>> hardware interface. In that case a ``link-speed`` configuration parameter is
>>> required to prevent unexpected changes to the link speed after migration.
>>>
>>> Note that the device configuration is a conservative bound on device
>>> states that can be migrated successfully since not all configuration
>>> parameters may be strictly required to match on the source and
>>> destination devices. For example, if the device's hardware interface has
>>> not yet been initialized then changes to the link speed may not be
>>> noticed. However, accurately representing runtime constraints is complex
>>> and risks introducing migration bugs, so no attempt is made to support
>>> them to achieve more relaxed bounds on successful migrations.
>>>
>>> Device Versions
>>> ---------------
>>> As a device evolves, the number of configuration parameters required may become
>>> inconvenient for users to express in full. A device configuration can be
>>> aliased by a *device version*, which is a shorthand for the full device
>>> configuration. This makes it easy to apply a standard device configuration
>>> without listing every configuration parameter explicitly.
>>
>> I'm not sure how to apply the device versions consider the device state is
>> opaque or the device needs to export another API to do this?
> Versions are just aliases for a list of configuration parameters. For
> example, version=2 expands to rx-filter-size=32. The purpose of versions
> is to provide a human-readable shorthand notation.
>
> Versions are not involved in migration compatibility checking, instead
> the device model URI and expanded configuration parameters are compared.
>
> The version has no direct effect on the device state representation. It
> has an indirect effect due to the configuration parameters that it
> expands to. For example, the rx-filter-size=32 configuration parameter
> may change the device state representation to include the 32 addresses
> that the device is filtering on.
>
> No "version check" is necessary when loading the device state
> representation because the device was already instantiated with the
> exact configuration parameters that determine the device state
> representation.
>
>>> For example, if address filtering support was added to a network card then
>>> device versions and the corresponding configurations may look like this:
>>> * ``version=1`` - Behaves as if ``rx-filter-size=0``
>>> * ``version=2`` - ``rx-filter-size=32``
>>>
>>> Device States
>>> -------------
>>> The details of the device state representation are not covered in this document
>>> but the general requirements are discussed here.
>>>
>>> The device state consists of data accessible through the device's hardware
>>> interface and internal state that is needed to restore device operation.
>>> State in the hardware interface includes the values of hardware registers.
>>> An example of internal state is an index value needed to avoid processing
>>> queued requests more than once.
>>>
>>> Changes can be made to the device state representation as follows. Each change
>>> to device state must have a corresponding device configuration parameter that
>>> allows the change to toggled:
>>>
>>> * When the parameter is disabled the hardware interface and device state
>>>     representation are unchanged. This allows old device states to be loaded.
>>>
>>> * When the parameter is enabled the change comes into effect.
>>>
>>> * The parameter's default value disables the change. Therefore old versions do
>>>     not have to explicitly specify the parameter.
>>>
>>> The following example illustrates migration from an old device
>>> implementation to a new one. A version=1 network card is migrated to a
>>> new device implementation that is also capable of version=2 and adds the
>>> rx-filter-size=32 parameter. The new device is instantiated with
>>> version=1, which disables rx-filter-size and is capable of loading the
>>> version=1 device state. The migration completes successfully but note
>>> the device is still operating at version=1 level in the new device.
>>>
>>> The following example illustrates migration from a new device
>>> implementation back to an older one. The new device implementation
>>> supports version=1 and version=2. The old device implementation supports
>>> version=1 only. Therefore the device can only be migrated when
>>> instantiated with version=1 or the equivalent full configuration
>>> parameters.
>>
>> In qemu we have subsection to facilitate the case when some fields were
>> forgot to migrate. Do we need something similar here?
> This is an important question and I'm not sure.
>
> The problem with subsection semantics is that they break rollback. Once
> the old device state has been loaded by the new device implementation,
> saving the device state produces the new device state representation.
> The old device implementation can no longer load it :(.


Only when subsection is needed.


>    Manual
> intervention is necessary to tell the new device implementation to save
> in the old representation.


If we don't support subsection, could we end up with a deadlock like we 
do migration since want upgrade the kernel, but if we don't upgrade the 
kernel, we can't do live migration.


>
> In the migration model described in this document it works the other
> way around: back and forth migration is always safe. If you wish to
> change the device you need to create a new instance (after poweroff or
> through hotplug).
>
> One way of achieving something similar is to provide additional
> information about safe transitions between configuration parameter
> lists. It is not safe to change arbitrary device configuration
> parameters, but certain parameters can be safely changed.
>
> I'm not sure if the complexity is worth it though. The downside to the
> current approach is that devices must eventually be reconfigured to
> upgrade to new versions, even if there is no guest-visible hardware
> interface change.
>
> Stefan


Thanks



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-04  3:32     ` Jason Wang
@ 2020-11-04  7:16       ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-04  7:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, qemu-devel, Kirti Wankhede, Dr. David Alan Gilbert,
	Alex Williamson, Paolo Bonzini, Felipe Franciosi, Thanos Makatos

[-- Attachment #1: Type: text/plain, Size: 10041 bytes --]

On Wed, Nov 04, 2020 at 11:32:34AM +0800, Jason Wang wrote:
> 
> On 2020/11/3 下午8:15, Stefan Hajnoczi wrote:
> > On Tue, Nov 03, 2020 at 04:46:53PM +0800, Jason Wang wrote:
> > > On 2020/11/2 下午7:11, Stefan Hajnoczi wrote:
> > > > There is discussion about VFIO migration in the "Re: Out-of-Process
> > > > Device Emulation session at KVM Forum 2020" thread. The current status
> > > > is that Kirti proposed a VFIO device region type for saving and loading
> > > > device state. There is currently no guidance on migrating between
> > > > different device versions or device implementations from different
> > > > vendors. This is known to be non-trivial and raised discussion about
> > > > whether it should really be handled by VFIO or centralized in QEMU.
> > > > 
> > > > Below is a document that describes how to ensure migration compatibility
> > > > in VFIO. It does not require changes to the VFIO migration interface. It
> > > > can be used for both VFIO/mdev kernel devices and vfio-user devices.
> > > > 
> > > > The idea is that the device state blob is opaque to the VMM but the same
> > > > level of migration compatibility that exists today is still available.
> > > 
> > > So if we can't mandate this or there's no way to validate this. Vendor is
> > > still free to implement their own protocol which could lead a lot of
> > > maintaining burden.
> > Yes, the device state representation is their responsibility. We can't
> > do that for them since they define the hardware interface and internal
> > state.
> > 
> > As Michael and Paolo have mentioned in the other thread, we can provide
> > guidelines and standardize common aspects.
> > 
> > > > Migration can fail if loading the device state is not possible. It should fail
> > > > early with a clear error message. It must not appear to complete but leave the
> > > > device inoperable due to a migration problem.
> > > 
> > > For VFIO-user, how management know that a VM can be migrated from src to
> > > dst? For kernel, we have sysfs.
> > vfio-user devices will normally be instantiated in one of two ways:
> > 
> > 1. Launching a device backend and passing command-line parameters:
> > 
> >       $ my-nic --socket-path /tmp/my-nic-vfio-user.sock \
> >                --model https://vendor-a.com/my-nic \
> > 	      --rss on
> > 
> >     Here "model" is the device model URL. The program could support
> >     multiple device models.
> > 
> >     The "rss" device configuration parameter enables Receive Side Scaling
> >     (RSS) as an example of a configuration parameter.
> > 
> > 2. Creating a device using an RPC interface:
> > 
> >       (qemu) device-add my-nic,rss=on
> > 
> > If the device instantiation succeeds then it is safe to live migrate.
> > The device is exposing the desired hardware interface and expecting the
> > right device state representation.
> 
> 
> Does this mean there will still be a "my-nic" stub in qemu? (I thought it
> should be a generic one like device-add "vfio-user-pci")

No, sorry for the confusing example. I was thinking of
qemu-storage-daemon or multi-process QEMU where devices could be
configured over a QMP/HMP monitor. The device happens to be implemented
in the QEMU codebase but the VMM doesn't need a stub device.

A D-Bus or gRPC example would have been clearer because it's not
associated with a VMM.

> > 
> > > > The rest of this document describes how these requirements can be met.
> > > > 
> > > > Device Models
> > > > -------------
> > > > Devices have a *hardware interface* consisting of hardware registers,
> > > > interrupts, and so on.
> > > > 
> > > > The hardware interface together with the device state representation is called
> > > > a *device model*. Device models can be assigned URIs such as
> > > > https://qemu.org/devices/e1000e to uniquely identify them.
> > > 
> > > It looks worse than "pci://vendor_id.device_id.subvendor_id.subdevice_id".
> > > "e1000e" means a lot of different 8275X implementations that have subtle but
> > > easy to be ignored differences.
> > If you wish to reflect those differences in the device model URI then
> > you can use:
> > 
> >    https://qemu.org/devices/pci/<vendor-id>/<device-id>/<subvendor-id>/<subdevice-id>
> > 
> > Another option is to use device configuration parameters to express
> > differences.
> > 
> > The important thing is that this device model URI has one owner. No one
> > else will use qemu.org. There can be many different e1000e device model
> > URIs, if necessary (with slightly different hardware interfaces and/or
> > device state representations). This avoids collisions.
> > 
> > > And is it possible to have a list of URIs here?
> > A device implementation (mdev driver, vfio-user device backend, etc) may
> > support multiple device model URIs.
> > 
> > A device instance has an immutable device model URI and list of
> > configuration parameters. In other words, once the device is created its
> > ABI is fixed for the lifetime of the device. A new device instance can
> > be configured by powering off the machine, hotplug, etc.
> > 
> > > > Multiple implementations of a device model may exist. They are they are
> > > > interchangeable if they follow the same hardware interface and device
> > > > state representation.
> > > > 
> > > > Multiple implementations of the same hardware interface may exist with
> > > > different device state representations, in which case the device models are not
> > > > interchangeable and must be assigned different URIs.
> > > > 
> > > > Migration is only possible when the same device model is supported by the
> > > > *source* and the *destination* devices.
> > > > 
> > > > Device Configuration
> > > > --------------------
> > > > Device models may have parameters that affect the hardware interface or device
> > > > state representation. For example, a network card may have a configurable
> > > > address filtering table size parameter called ``rx-filter-size``. A
> > > > device state saved with ``rx-filter-size=32`` cannot be safely loaded
> > > > into a device with ``rx-filter-size=0``, because changing the size from
> > > > 32 to 0 may disrupt device operation.
> > > 
> > > Do we allow the migration from "rx-filter-size=16" to "rx-filter-size=32" (I
> > > guess not?) And should we extend the concept to "device capability" instead
> > > of just state representation.  E.g src has CAP_X=on,CAP_Y=off but dst has
> > > CAP_X=on,CAP_Y=on, so we disallow the migration from src to dst.
> > A device instance's configuration parameters are immutable.
> > rx-filter-size=16 cannot be migrated to rx-filter-size=32.
> 
> 
> But then it looks to me we can't migrate back, or do you mean it is required
> to have the ability to change the max rx-filter-size.

We can migrate a device with rx-filter-size=16 from old -> new if the
new device implementation supports rx-filter-size=16. We can migrate
back to the old device implementation because it supports
rx-filter-size=16.

If you want to change the configuration parameters then new device must
be instantiated during poweroff or hotplug. This is how
rx-filter-size=16 can be changed to rx-filter-size=32, but it must be
done explicitly (configuration parameters don't change across
migration).

> > Yes, configuration parameters can describe capabilities. I think of
> > capabilities as something that affects the guest-visible hardware
> > interface (e.g. the RSS feature bit is enabled) that is mentioned in the
> > text, but it would be clearer to mention them explicitly.
> > 
> > > > A list of configuration parameters is called the *device configuration*.
> > > > Migration is expected to succeed when the same device model and configuration
> > > > that was used for saving the device state is used again to load it.
> > > > 
> > > > Note that not all parameters used to instantiate a device need to be part of
> > > > the device configuration. For example, assigning a network card to a specific
> > > > physical port is not part of the device configuration since it is not part of
> > > > the device's hardware interface or the device state representation.
> > > 
> > > Yes, but the task needs to be done by management somehow. So do you expect a
> > > vendor specific provisioning API here?
> > There seems to be no consensus on this yet. It's the question of how to
> > manage the lifecycle of VFIO, mdev, vhost-user, and vfio-user devices.
> > There are attempts to standardize in some of these areas.
> > 
> > For mdev drivers we can standardize the sysfs interface so management
> > tools can query source devices and instantiate destination devices
> > without device-specific code.
> 
> 
> Even for mdev, it should be have some class defined for sysfs which could be
> a standard way to configure NVME or virtio device.

Discussion on the mdev sysfs interface has started in the sub-thread
with Alex Williamson.

> > The problem with subsection semantics is that they break rollback. Once
> > the old device state has been loaded by the new device implementation,
> > saving the device state produces the new device state representation.
> > The old device implementation can no longer load it :(.
> 
> 
> Only when subsection is needed.

Good point. Most rollback migrations still work, only the ones that
introduce new subsections fail.

> >    Manual
> > intervention is necessary to tell the new device implementation to save
> > in the old representation.
> 
> 
> If we don't support subsection, could we end up with a deadlock like we do
> migration since want upgrade the kernel, but if we don't upgrade the kernel,
> we can't do live migration.

Can you explain in more detail?

I think the approach described in this document works, except it
requires manual intervention to change device configuration parameters
whereas subsections are automatically applied by the new QEMU.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 18:49     ` Dr. David Alan Gilbert
@ 2020-11-04  7:36       ` Stefan Hajnoczi
  2020-11-04 10:14         ` Dr. David Alan Gilbert
  2020-11-04 11:05       ` Christophe de Dinechin
  1 sibling, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-04  7:36 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Kirti Wankhede,
	qemu-devel, Alex Williamson, Thanos Makatos, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 10073 bytes --]

On Tue, Nov 03, 2020 at 06:49:51PM +0000, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Tue, Nov 03, 2020 at 12:17:09PM +0000, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > Device Models
> > > > -------------
> > > > Devices have a *hardware interface* consisting of hardware registers,
> > > > interrupts, and so on.
> > > > 
> > > > The hardware interface together with the device state representation is called
> > > > a *device model*. Device models can be assigned URIs such as
> > > > https://qemu.org/devices/e1000e to uniquely identify them.
> > > 
> > > I think this is a unique identifier, not actually a URI; the https://
> > > isn't needed since no one expects to ever connect to this.
> > 
> > Yes, it could be any unique string. If the URI idea is not popular we
> > can use any similar scheme.
> 
> I'm OK with it being a URI; just drop the https.

Okay.

> > > > However, secondary aspects related to the physical port may affect the device's
> > > > hardware interface and need to be reflected in the device configuration. The
> > > > link speed may depend on the physical port and be reported through the device's
> > > > hardware interface. In that case a ``link-speed`` configuration parameter is
> > > > required to prevent unexpected changes to the link speed after migration.
> > > 
> > > That's an interesting example; because depending on the device, it might
> > > be:
> > >     a) Completely virtualised so that the guest *shouldn't* know what
> > > the physical link speed is, precisely to allow the physical network on
> > > the destination to be different.
> > > 
> > >     b) Part of the migrated state
> > > 
> > >     c) Something that's allowed to be reloaded after migration
> > > 
> > >     d) Configurable
> > > 
> > > so I'm not sure whether it's a good example in this case or not.
> > 
> > Can you think of an example that has only one option?
> > 
> > I tried but couldn't. For example take a sound card. The guest is aware
> > the device supports stereo playback (2 output channels), but which exact
> > stereo host device is used doesn't matter, they are all suitable.
> > 
> > Now imagine migrating to a 7.1 surround-sound device. Similar options
> > come into play:
> > 
> > a) Emulate stereo and mix it to 7.1 surround-sound on the physical
> >    device. The guest still sees the stereo device.
> > 
> > b) Refuse migration.
> > 
> > c) Indicate that the output has switched and let the guest reconfigure
> >    itself (e.g. a sound card with multiple outputs, where one of them is
> >    stereo and another is 7.1 surround sound).
> > 
> > Which option is desirable depends on the use case.
> 
> Yes, but I think it might be worth calling out these differences;  there
> are explicitly cases where you don't want external changes to be visible
> and other cases where you do; both are valid, but both need thinking
> about. (Another one, GPU whether you have a monitor plugged in!)

Okay.

> > > Maybe what's needed is a stronger instruction to abstract external
> > > device state so that it's not part of the configuration in most cases.
> > 
> > Do you want to propose something?
> 
> I think something like 'Some part of a devices state may be irrelevant
> to a migration; for example on some NICs it might be preferable to hide
> the physical characteristics of the link from the guest.'

Got it.

> > > > For example, if address filtering support was added to a network card then
> > > > device versions and the corresponding configurations may look like this:
> > > > * ``version=1`` - Behaves as if ``rx-filter-size=0``
> > > > * ``version=2`` - ``rx-filter-size=32``
> > > 
> > > Note configuration parameters might have been added during the life of
> > > the device; e.g. if the original card had no support for rx-filters, it
> > > might not have a rx-filter-size parameter.
> > 
> > version=1 does not explicitly set rx-filter-size=0. When a new parameter
> > is introduced it must have a default value that disables its effect on
> > the hardware interface and/or device state representation. This is
> > described in a bit more detail in the next section, maybe it should be
> > reordered.
> 
> We've generally found the definition of devices tends in practice to be
> done newer->older; i.e. you define the current machine, and then define
> the next older machine setting the defaults that used to be true; then
> define the older version behind that....

That is not possible here because an older device implementation is
unaware of new configuration parameters.

Looking at the example above, imagine a version=1 device is instantiated
on a device implementation that supports both version=1 and version=2.
Should the configuration parameter list for version=1 be empty or
rx-filter-size=0?

It must to be empty, otherwise an older device implementation that only
supports version=1 cannot instantiate the device. The older device
implementation does not recognize the rx-filter-size configuration
parameter (it was introduced in version=2) so we cannot set it to 0.

> > > > Device States
> > > > -------------
> > > > The details of the device state representation are not covered in this document
> > > > but the general requirements are discussed here.
> > > > 
> > > > The device state consists of data accessible through the device's hardware
> > > > interface and internal state that is needed to restore device operation.
> > > > State in the hardware interface includes the values of hardware registers.
> > > > An example of internal state is an index value needed to avoid processing
> > > > queued requests more than once.
> > > 
> > > I try and emphasise that 'internal state' should be represented in a way
> > > that reflects the problem rather than the particular implementation;
> > > this gives it a better chance of migrating to future versions.
> > 
> > Sounds like a good idea.
> > 
> > > > Changes can be made to the device state representation as follows. Each change
> > > > to device state must have a corresponding device configuration parameter that
> > > > allows the change to toggled:
> > > > 
> > > > * When the parameter is disabled the hardware interface and device state
> > > >   representation are unchanged. This allows old device states to be loaded.
> > > > 
> > > > * When the parameter is enabled the change comes into effect.
> > > > 
> > > > * The parameter's default value disables the change. Therefore old versions do
> > > >   not have to explicitly specify the parameter.
> > > > 
> > > > The following example illustrates migration from an old device
> > > > implementation to a new one. A version=1 network card is migrated to a
> > > > new device implementation that is also capable of version=2 and adds the
> > > > rx-filter-size=32 parameter. The new device is instantiated with
> > > > version=1, which disables rx-filter-size and is capable of loading the
> > > > version=1 device state. The migration completes successfully but note
> > > > the device is still operating at version=1 level in the new device.
> > > > 
> > > > The following example illustrates migration from a new device
> > > > implementation back to an older one. The new device implementation
> > > > supports version=1 and version=2. The old device implementation supports
> > > > version=1 only. Therefore the device can only be migrated when
> > > > instantiated with version=1 or the equivalent full configuration
> > > > parameters.
> > > 
> > > I'm sometimes asked for 'ways out' of buggy migration cases; e.g. what
> > > happens if version=1 forgot to migrate the X register; or what happens
> > > if verison=1 forgot to handle the special, rare case when X=5 and we
> > > now need to migrate some extra state.
> > 
> > Can these cases be handled by adding additional configuration parameters?
> > 
> > If version=1 is lacks essential state then version=2 can add it. The
> > user must configure the device to use version before they can save the
> > full state.
> > 
> > If version=1 didn't handle the X=5 case then the same solution is
> > needed. A new configuration parameter is introduced and the user needs
> > to configure the device to be the new version before migrating.
> > 
> > Unfortunately this requires poweroff or hotplugging a new device
> > instance. But some disruption is probably necessarily anyway so the
> > migration code on the host side can be patched to use the updated device
> > state representation.
> 
> There are some corner cases that people sometimes prefer; for example
> lets say the X=5 case is actually really rare - but when it happens the
> device is hopelessly broken, some device authors prefer to fix it and
> send the extra data and let the migration fail if the destination
> doesn't understand it (it would break anyway).

The device implementation needs to be updated to send the extra data. At
that point a new device configuration parameter should be introduced and
if the user wishes to run the new version of the device then the extra
data will be sent.

If the destination doesn't support the new parameter then migration will
be refused. That matches what you've described, so I think the approach
in this document handles this case.

> I've also been asked
> by mst for a 'unexpected data' mechanism to send data that the
> destination might not expect if it didn't know about it, for similar
> cases.

Do you mean optional data that can be more or less safely dropped? A new
device configuration parameter is not needed because the hardware
interface and device state representation remain compatible. That
feature can be defined in the device state representation spec and is
not visible at the layer discussed in this document. But I think it's
worth adding an explanation into this document explaining what to do.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-02 11:11 VFIO Migration Stefan Hajnoczi
                   ` (5 preceding siblings ...)
  2020-11-03 15:23 ` Christophe de Dinechin
@ 2020-11-04  7:50 ` Michael S. Tsirkin
  2020-11-04 16:37   ` Stefan Hajnoczi
  6 siblings, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2020-11-04  7:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, Daniel P. Berrangé,
	quintela, Jason Wang, Kirti Wankhede, qemu-devel,
	Alex Williamson, Thanos Makatos, Felipe Franciosi, Paolo Bonzini,
	Dr. David Alan Gilbert

On Mon, Nov 02, 2020 at 11:11:53AM +0000, Stefan Hajnoczi wrote:
> Device States
> -------------
> The details of the device state representation are not covered in this document
> but the general requirements are discussed here.
> 
> The device state consists of data accessible through the device's hardware
> interface and internal state that is needed to restore device operation.
> State in the hardware interface includes the values of hardware registers.
> An example of internal state is an index value needed to avoid processing
> queued requests more than once.
> 
> Changes can be made to the device state representation as follows. Each change
> to device state must have a corresponding device configuration parameter that
> allows the change to toggled:
> 
> * When the parameter is disabled the hardware interface and device state
>   representation are unchanged. This allows old device states to be loaded.
> 
> * When the parameter is enabled the change comes into effect.
> 
> * The parameter's default value disables the change. Therefore old versions do
>   not have to explicitly specify the parameter.
> 
> The following example illustrates migration from an old device
> implementation to a new one. A version=1 network card is migrated to a
> new device implementation that is also capable of version=2 and adds the
> rx-filter-size=32 parameter. The new device is instantiated with
> version=1, which disables rx-filter-size and is capable of loading the
> version=1 device state. The migration completes successfully but note
> the device is still operating at version=1 level in the new device.
> 
> The following example illustrates migration from a new device
> implementation back to an older one. The new device implementation
> supports version=1 and version=2. The old device implementation supports
> version=1 only. Therefore the device can only be migrated when
> instantiated with version=1 or the equivalent full configuration
> parameters.

So all this is pretty complex and easy for vendors to get wrong.  How
about we introduce a directory under docs/interop/ where each supported
device can list the format of its state and parameters and what is tied
to what?

I am a bit unsure about the usefulness of the version shortcut.
It would be handy if all this was used directly by users
but these are unlikely to want to orchestrate cross version
migrations, and tools do not need shortcuts like these ...

-- 
MST



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-02 14:56   ` Stefan Hajnoczi
@ 2020-11-04  8:07     ` Gerd Hoffmann
  2020-11-04 16:40       ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Gerd Hoffmann @ 2020-11-04  8:07 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Cornelia Huck, qemu-devel,
	Dr. David Alan Gilbert, Kirti Wankhede, Thanos Makatos,
	Alex Williamson, Felipe Franciosi, Paolo Bonzini

  Hi,

> > > The hardware interface together with the device state representation is called
> > > a *device model*. Device models can be assigned URIs such as
> > > https://qemu.org/devices/e1000e to uniquely identify them.
> > 
> > Is that something that needs to be put together for every device where we
> > want to support migration? How do you create the URI?
> 
> Yes. If you are creating a custom device that no one else needs to
> emulate then you can simply pick a unique URL:
> 
>   https://vendor.com/my-dev
> 
> There doesn't need to be anything at the URL. It's just a unique string
> that no one else will use and therefore web URLs are handy because no
> one else will accidentally pick your string.

If this is just a string I think it would be better to use the reverse
domain name scheme (as used by virtio-serial too), i.e.

 - org.qemu.devices.e1000e
 - com.vendor.my-dev

take care,
  Gerd



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 17:31     ` Alex Williamson
@ 2020-11-04 10:13       ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-04 10:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Dr. David Alan Gilbert, qemu-devel,
	Kirti Wankhede, Paolo Bonzini, Felipe Franciosi,
	Christophe de Dinechin, Thanos Makatos

[-- Attachment #1: Type: text/plain, Size: 6420 bytes --]

On Tue, Nov 03, 2020 at 10:31:35AM -0700, Alex Williamson wrote:
> On Tue, 3 Nov 2020 15:33:56 +0000
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> > On Tue, Nov 03, 2020 at 04:23:43PM +0100, Christophe de Dinechin wrote:
> > > 
> > > On 2020-11-02 at 12:11 CET, Stefan Hajnoczi wrote...  
> > > > There is discussion about VFIO migration in the "Re: Out-of-Process
> > > > Device Emulation session at KVM Forum 2020" thread. The current status
> > > > is that Kirti proposed a VFIO device region type for saving and loading
> > > > device state. There is currently no guidance on migrating between
> > > > different device versions or device implementations from different
> > > > vendors. This is known to be non-trivial and raised discussion about
> > > > whether it should really be handled by VFIO or centralized in QEMU.
> > > >
> > > > Below is a document that describes how to ensure migration compatibility
> > > > in VFIO. It does not require changes to the VFIO migration interface. It
> > > > can be used for both VFIO/mdev kernel devices and vfio-user devices.
> > > >
> > > > The idea is that the device state blob is opaque to the VMM but the same
> > > > level of migration compatibility that exists today is still available.
> > > >
> > > > I hope this will help us reach consensus and let us discuss specifics.
> > > >
> > > > If you followed the previous discussion, I changed the approach from
> > > > sending a magic constant in the device state blob to identifying device
> > > > models by URIs. Therefore the device state structure does not need to be
> > > > defined here - the critical information for ensuring device migration
> > > > compatibility is the device model and configuration defined below.
> > > >
> > > > Stefan
> > > > ---
> > > > VFIO Migration
> > > > ==============
> > > > This document describes how to save and load VFIO device states. Saving a
> > > > device state produces a snapshot of a VFIO device's state that can be loaded
> > > > again at a later point in time to resume the device from the snapshot.
> > > >
> > > > The data representation of the device state is outside the scope of this
> > > > document.
> > > >
> > > > Overview
> > > > --------
> > > > The purpose of device states is to save the device at a point in time and then
> > > > restore the device back to the saved state later. This is more challenging than
> > > > it first appears.
> > > >
> > > > The process of saving a device state and loading it later is called
> > > > *migration*. The state may be loaded by the same device that saved it or by a
> > > > new instance of the device, possibly running on a different computer.
> > > >
> > > > It must be possible to migrate to a newer implementation of the device
> > > > as well as to an older implementation of the device. This allows users
> > > > to upgrade and roll back their systems.
> > > >
> > > > Migration can fail if loading the device state is not possible. It should fail
> > > > early with a clear error message. It must not appear to complete but leave the
> > > > device inoperable due to a migration problem.
> > > >
> > > > The rest of this document describes how these requirements can be met.
> > > >
> > > > Device Models
> > > > -------------
> > > > Devices have a *hardware interface* consisting of hardware registers,
> > > > interrupts, and so on.
> > > >
> > > > The hardware interface together with the device state representation is called
> > > > a *device model*. Device models can be assigned URIs such as
> > > > https://qemu.org/devices/e1000e to uniquely identify them.  
> > > 
> > > Like others, I think we should either
> > > 
> > > a) Give a relatively strong requirement regarding what is at the URL in
> > > question, e.g. docs, maybe even a machine-readable schema describing
> > > configuration and state for the device. Leaving the option "there can be
> > > nothing here" is IMO asking for trouble.
> > > 
> > > b) simply call that a unique ID, and then either drop the https: entirely or
> > > use something else, like pci:// or, to be more specific, vfio://
> > > 
> > > I'd favor option (b) for a different practical reason. URLs are subject to
> > > redirection and other mishaps. For example, using https:// begs the question
> > > whether
> > > https://qemu.org/devices/e1000e and
> > > https://www.qemu.org/devices/e1000e
> > > should be treated as the same device. I believe that your intent is that
> > > they shouldn't, but if the qemu web server redirects to www, and someone
> > > wants to copy-paste their web browser's URL bar to the command line, they'd
> > > get the wrong one.  
> > 
> > That's not a real world problem IMHO, because neither of these URLs
> > ever need resolve to a real webpage, and thus not need to be cut +
> > paste from a browser.
> > 
> > They are simply expressing a resource identifier using a URI as a
> > convenient format. This is the same as an XML namespace using a URI,
> > and rarely, if ever, resolving to any actual web page.
> > 
> > This is a good thing, because if you say there needs to be a real page
> > there, then it creates a pile of corporate beaurocracy for contributors.
> > I can freely create a URI under https://redhat.com for purposes of being
> > a identifier, but I cannot get any content published there without jumping
> > through many tedious corporate approvals and stand a good chance of being
> > rejected.
> > 
> > If we're truely treating the URIs as an opaque string, we don't especially
> > need to define any rules other than to say it should be under a domain that
> > you have authority over either directly, or via membership of a project
> > that delegates. We can suggest "https" since seeing "http" is a red flag
> > for many people these days.
> 
> Hmm, an opaque string, sort of like the existing "name" attribute we
> have now where Christophe quoted some examples in his message.  Thanks,

Let's go for b) in the next revision.

There will still be a structure <domain>/<path> but it won't be a URI
with a scheme ("https"). The reason for keeping a structure and not
simply declaring it a unique opaque string is that it's hard to
ensure uniqueness if there is no structure. Two people might
accidentally choose the same name, so let's keep the domain and path
there.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-04  7:36       ` Stefan Hajnoczi
@ 2020-11-04 10:14         ` Dr. David Alan Gilbert
  2020-11-04 16:47           ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-04 10:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Kirti Wankhede,
	qemu-devel, Alex Williamson, Thanos Makatos, Paolo Bonzini

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Tue, Nov 03, 2020 at 06:49:51PM +0000, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Tue, Nov 03, 2020 at 12:17:09PM +0000, Dr. David Alan Gilbert wrote:
> > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > > Device Models
> > > > > -------------
> > > > > Devices have a *hardware interface* consisting of hardware registers,
> > > > > interrupts, and so on.
> > > > > 
> > > > > The hardware interface together with the device state representation is called
> > > > > a *device model*. Device models can be assigned URIs such as
> > > > > https://qemu.org/devices/e1000e to uniquely identify them.
> > > > 
> > > > I think this is a unique identifier, not actually a URI; the https://
> > > > isn't needed since no one expects to ever connect to this.
> > > 
> > > Yes, it could be any unique string. If the URI idea is not popular we
> > > can use any similar scheme.
> > 
> > I'm OK with it being a URI; just drop the https.
> 
> Okay.
> 
> > > > > However, secondary aspects related to the physical port may affect the device's
> > > > > hardware interface and need to be reflected in the device configuration. The
> > > > > link speed may depend on the physical port and be reported through the device's
> > > > > hardware interface. In that case a ``link-speed`` configuration parameter is
> > > > > required to prevent unexpected changes to the link speed after migration.
> > > > 
> > > > That's an interesting example; because depending on the device, it might
> > > > be:
> > > >     a) Completely virtualised so that the guest *shouldn't* know what
> > > > the physical link speed is, precisely to allow the physical network on
> > > > the destination to be different.
> > > > 
> > > >     b) Part of the migrated state
> > > > 
> > > >     c) Something that's allowed to be reloaded after migration
> > > > 
> > > >     d) Configurable
> > > > 
> > > > so I'm not sure whether it's a good example in this case or not.
> > > 
> > > Can you think of an example that has only one option?
> > > 
> > > I tried but couldn't. For example take a sound card. The guest is aware
> > > the device supports stereo playback (2 output channels), but which exact
> > > stereo host device is used doesn't matter, they are all suitable.
> > > 
> > > Now imagine migrating to a 7.1 surround-sound device. Similar options
> > > come into play:
> > > 
> > > a) Emulate stereo and mix it to 7.1 surround-sound on the physical
> > >    device. The guest still sees the stereo device.
> > > 
> > > b) Refuse migration.
> > > 
> > > c) Indicate that the output has switched and let the guest reconfigure
> > >    itself (e.g. a sound card with multiple outputs, where one of them is
> > >    stereo and another is 7.1 surround sound).
> > > 
> > > Which option is desirable depends on the use case.
> > 
> > Yes, but I think it might be worth calling out these differences;  there
> > are explicitly cases where you don't want external changes to be visible
> > and other cases where you do; both are valid, but both need thinking
> > about. (Another one, GPU whether you have a monitor plugged in!)
> 
> Okay.
> 
> > > > Maybe what's needed is a stronger instruction to abstract external
> > > > device state so that it's not part of the configuration in most cases.
> > > 
> > > Do you want to propose something?
> > 
> > I think something like 'Some part of a devices state may be irrelevant
> > to a migration; for example on some NICs it might be preferable to hide
> > the physical characteristics of the link from the guest.'
> 
> Got it.
> 
> > > > > For example, if address filtering support was added to a network card then
> > > > > device versions and the corresponding configurations may look like this:
> > > > > * ``version=1`` - Behaves as if ``rx-filter-size=0``
> > > > > * ``version=2`` - ``rx-filter-size=32``
> > > > 
> > > > Note configuration parameters might have been added during the life of
> > > > the device; e.g. if the original card had no support for rx-filters, it
> > > > might not have a rx-filter-size parameter.
> > > 
> > > version=1 does not explicitly set rx-filter-size=0. When a new parameter
> > > is introduced it must have a default value that disables its effect on
> > > the hardware interface and/or device state representation. This is
> > > described in a bit more detail in the next section, maybe it should be
> > > reordered.
> > 
> > We've generally found the definition of devices tends in practice to be
> > done newer->older; i.e. you define the current machine, and then define
> > the next older machine setting the defaults that used to be true; then
> > define the older version behind that....
> 
> That is not possible here because an older device implementation is
> unaware of new configuration parameters.
> 
> Looking at the example above, imagine a version=1 device is instantiated
> on a device implementation that supports both version=1 and version=2.
> Should the configuration parameter list for version=1 be empty or
> rx-filter-size=0?
> 
> It must to be empty, otherwise an older device implementation that only
> supports version=1 cannot instantiate the device. The older device
> implementation does not recognize the rx-filter-size configuration
> parameter (it was introduced in version=2) so we cannot set it to 0.

I think this question might come down to who expands the device version
definition.
If it's the device itself that expands that, then a version 2 device
knows about what it needs to do for version 1 compatibility.
But if you're saying someone outside the device needs to be able to
expand that list then I'm not sure how you'd keep that expansion in line
with the implementation of a device.

> > > > > Device States
> > > > > -------------
> > > > > The details of the device state representation are not covered in this document
> > > > > but the general requirements are discussed here.
> > > > > 
> > > > > The device state consists of data accessible through the device's hardware
> > > > > interface and internal state that is needed to restore device operation.
> > > > > State in the hardware interface includes the values of hardware registers.
> > > > > An example of internal state is an index value needed to avoid processing
> > > > > queued requests more than once.
> > > > 
> > > > I try and emphasise that 'internal state' should be represented in a way
> > > > that reflects the problem rather than the particular implementation;
> > > > this gives it a better chance of migrating to future versions.
> > > 
> > > Sounds like a good idea.
> > > 
> > > > > Changes can be made to the device state representation as follows. Each change
> > > > > to device state must have a corresponding device configuration parameter that
> > > > > allows the change to toggled:
> > > > > 
> > > > > * When the parameter is disabled the hardware interface and device state
> > > > >   representation are unchanged. This allows old device states to be loaded.
> > > > > 
> > > > > * When the parameter is enabled the change comes into effect.
> > > > > 
> > > > > * The parameter's default value disables the change. Therefore old versions do
> > > > >   not have to explicitly specify the parameter.
> > > > > 
> > > > > The following example illustrates migration from an old device
> > > > > implementation to a new one. A version=1 network card is migrated to a
> > > > > new device implementation that is also capable of version=2 and adds the
> > > > > rx-filter-size=32 parameter. The new device is instantiated with
> > > > > version=1, which disables rx-filter-size and is capable of loading the
> > > > > version=1 device state. The migration completes successfully but note
> > > > > the device is still operating at version=1 level in the new device.
> > > > > 
> > > > > The following example illustrates migration from a new device
> > > > > implementation back to an older one. The new device implementation
> > > > > supports version=1 and version=2. The old device implementation supports
> > > > > version=1 only. Therefore the device can only be migrated when
> > > > > instantiated with version=1 or the equivalent full configuration
> > > > > parameters.
> > > > 
> > > > I'm sometimes asked for 'ways out' of buggy migration cases; e.g. what
> > > > happens if version=1 forgot to migrate the X register; or what happens
> > > > if verison=1 forgot to handle the special, rare case when X=5 and we
> > > > now need to migrate some extra state.
> > > 
> > > Can these cases be handled by adding additional configuration parameters?
> > > 
> > > If version=1 is lacks essential state then version=2 can add it. The
> > > user must configure the device to use version before they can save the
> > > full state.
> > > 
> > > If version=1 didn't handle the X=5 case then the same solution is
> > > needed. A new configuration parameter is introduced and the user needs
> > > to configure the device to be the new version before migrating.
> > > 
> > > Unfortunately this requires poweroff or hotplugging a new device
> > > instance. But some disruption is probably necessarily anyway so the
> > > migration code on the host side can be patched to use the updated device
> > > state representation.
> > 
> > There are some corner cases that people sometimes prefer; for example
> > lets say the X=5 case is actually really rare - but when it happens the
> > device is hopelessly broken, some device authors prefer to fix it and
> > send the extra data and let the migration fail if the destination
> > doesn't understand it (it would break anyway).
> 
> The device implementation needs to be updated to send the extra data. At
> that point a new device configuration parameter should be introduced and
> if the user wishes to run the new version of the device then the extra
> data will be sent.
> 
> If the destination doesn't support the new parameter then migration will
> be refused. That matches what you've described, so I think the approach
> in this document handles this case.

Well that's the ideal; but the case I'm describing is where you're
recovering from a screwup in which the migration is going to fail in a
rare (runtime defined) corner case, and only sending the extra data
in that rare case before you get a chance to define a new version.

> > I've also been asked
> > by mst for a 'unexpected data' mechanism to send data that the
> > destination might not expect if it didn't know about it, for similar
> > cases.
> 
> Do you mean optional data that can be more or less safely dropped? A new
> device configuration parameter is not needed because the hardware
> interface and device state representation remain compatible. That
> feature can be defined in the device state representation spec and is
> not visible at the layer discussed in this document. But I think it's
> worth adding an explanation into this document explaining what to do.

I mean a way to send optional data that the destination can drop; but
that the destination doesn't know what it means and at the time the
destination was written, wasn't yet defined. It is part of the device
state;  it's similar to the X=5 case above - but in this case it allows
the migration not to fail even when you start sending the extra data.

Dave

> 
> Stefan


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 18:49     ` Dr. David Alan Gilbert
  2020-11-04  7:36       ` Stefan Hajnoczi
@ 2020-11-04 11:05       ` Christophe de Dinechin
  1 sibling, 0 replies; 40+ messages in thread
From: Christophe de Dinechin @ 2020-11-04 11:05 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: John G Johnson, mtsirkin, "Daniel P. Berrangé",
	Juan Quintela, Jason Wang, BALATON Zoltan via, Kirti Wankhede,
	Paolo Bonzini, Alex Williamson, Stefan Hajnoczi,
	Felipe Franciosi, Thanos Makatos

[-- Attachment #1: Type: text/plain, Size: 1209 bytes --]



> On 3 Nov 2020, at 19:49, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> 
> * Stefan Hajnoczi (stefanha@redhat.com <mailto:stefanha@redhat.com>) wrote:
>> On Tue, Nov 03, 2020 at 12:17:09PM +0000, Dr. David Alan Gilbert wrote:
>>> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
>>>> Device Models
>>>> -------------
>>>> Devices have a *hardware interface* consisting of hardware registers,
>>>> interrupts, and so on.
>>>> 
>>>> The hardware interface together with the device state representation is called
>>>> a *device model*. Device models can be assigned URIs such as
>>>> https://qemu.org/devices/e1000e to uniquely identify them.
>>> 
>>> I think this is a unique identifier, not actually a URI; the https://
>>> isn't needed since no one expects to ever connect to this.
>> 
>> Yes, it could be any unique string. If the URI idea is not popular we
>> can use any similar scheme.
> 
> I'm OK with it being a URI; just drop the https.

I completely agree. https gives the wrong idea about what this represents. Unless you give it https semantics, by requiring a doc or a schema or whatever to be at the URL, but then you enter another universe of cans of worms.



[-- Attachment #2: Type: text/html, Size: 4882 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 15:23 ` Christophe de Dinechin
  2020-11-03 15:33   ` Daniel P. Berrangé
@ 2020-11-04 11:10   ` Stefan Hajnoczi
  1 sibling, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-04 11:10 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, qemu-devel, Jason Wang, Kirti Wankhede,
	Dr. David Alan Gilbert, Alex Williamson, Paolo Bonzini,
	Felipe Franciosi, Thanos Makatos

[-- Attachment #1: Type: text/plain, Size: 16591 bytes --]

On Tue, Nov 03, 2020 at 04:23:43PM +0100, Christophe de Dinechin wrote:
> On 2020-11-02 at 12:11 CET, Stefan Hajnoczi wrote...
> > There is discussion about VFIO migration in the "Re: Out-of-Process
> > Device Emulation session at KVM Forum 2020" thread. The current status
> > is that Kirti proposed a VFIO device region type for saving and loading
> > device state. There is currently no guidance on migrating between
> > different device versions or device implementations from different
> > vendors. This is known to be non-trivial and raised discussion about
> > whether it should really be handled by VFIO or centralized in QEMU.
> >
> > Below is a document that describes how to ensure migration compatibility
> > in VFIO. It does not require changes to the VFIO migration interface. It
> > can be used for both VFIO/mdev kernel devices and vfio-user devices.
> >
> > The idea is that the device state blob is opaque to the VMM but the same
> > level of migration compatibility that exists today is still available.
> >
> > I hope this will help us reach consensus and let us discuss specifics.
> >
> > If you followed the previous discussion, I changed the approach from
> > sending a magic constant in the device state blob to identifying device
> > models by URIs. Therefore the device state structure does not need to be
> > defined here - the critical information for ensuring device migration
> > compatibility is the device model and configuration defined below.
> >
> > Stefan
> > ---
> > VFIO Migration
> > ==============
> > This document describes how to save and load VFIO device states. Saving a
> > device state produces a snapshot of a VFIO device's state that can be loaded
> > again at a later point in time to resume the device from the snapshot.
> >
> > The data representation of the device state is outside the scope of this
> > document.
> >
> > Overview
> > --------
> > The purpose of device states is to save the device at a point in time and then
> > restore the device back to the saved state later. This is more challenging than
> > it first appears.
> >
> > The process of saving a device state and loading it later is called
> > *migration*. The state may be loaded by the same device that saved it or by a
> > new instance of the device, possibly running on a different computer.
> >
> > It must be possible to migrate to a newer implementation of the device
> > as well as to an older implementation of the device. This allows users
> > to upgrade and roll back their systems.
> >
> > Migration can fail if loading the device state is not possible. It should fail
> > early with a clear error message. It must not appear to complete but leave the
> > device inoperable due to a migration problem.
> >
> > The rest of this document describes how these requirements can be met.
> >
> > Device Models
> > -------------
> > Devices have a *hardware interface* consisting of hardware registers,
> > interrupts, and so on.
> >
> > The hardware interface together with the device state representation is called
> > a *device model*. Device models can be assigned URIs such as
> > https://qemu.org/devices/e1000e to uniquely identify them.
> 
> Like others, I think we should either
> 
> a) Give a relatively strong requirement regarding what is at the URL in
> question, e.g. docs, maybe even a machine-readable schema describing
> configuration and state for the device. Leaving the option "there can be
> nothing here" is IMO asking for trouble.
> 
> b) simply call that a unique ID, and then either drop the https: entirely or
> use something else, like pci:// or, to be more specific, vfio://
> 
> I'd favor option (b) for a different practical reason. URLs are subject to
> redirection and other mishaps. For example, using https:// begs the question
> whether
> https://qemu.org/devices/e1000e and
> https://www.qemu.org/devices/e1000e
> should be treated as the same device. I believe that your intent is that
> they shouldn't, but if the qemu web server redirects to www, and someone
> wants to copy-paste their web browser's URL bar to the command line, they'd
> get the wrong one.
> 
> 
> >
> > Multiple implementations of a device model may exist. They are they are
> 
> dup "they are"

Thanks, will fix.

> > interchangeable if they follow the same hardware interface and device
> > state representation.
> >
> > Multiple implementations of the same hardware interface may exist with
> > different device state representations, in which case the device models are not
> > interchangeable and must be assigned different URIs.
> >
> > Migration is only possible when the same device model is supported by the
> > *source* and the *destination* devices.
> >
> > Device Configuration
> > --------------------
> 
> I find "device configuration" to be a bit confusing and ambiguous here.
> From the discussion, it appears that you are not talking about the active
> meaning of "configuration", as in "configuring" the device after migration,
> but talking about a passive meaning of "this device exists in multiple
> variant, which one am I talking about".
> 
> I've scratched my head looking for a less ambiguous wording, but could not
> find any.

The "configuration parameters" describe variations in the hardware
interface and device state representation. I'll rework this section and
just call them "device parameters" with a fuller explanation of their
purpose.

> > Device models may have parameters that affect the hardware interface or device
> > state representation. For example, a network card may have a configurable
> > address filtering table size parameter called ``rx-filter-size``. A
> > device state saved with ``rx-filter-size=32`` cannot be safely loaded
> > into a device with ``rx-filter-size=0``, because changing the size from
> > 32 to 0 may disrupt device operation.
> >
> > A list of configuration parameters is called the *device configuration*.
> > Migration is expected to succeed when the same device model and configuration
> > that was used for saving the device state is used again to load it.
> 
> If that's intended for a static decision, are you thinking about making it
> part of the URI?
> 
> Something like vfio://qemu.org/devices/e1000e?version=2

Neat idea, it might come in handy.

> > Note that not all parameters used to instantiate a device need to be part of
> > the device configuration. For example, assigning a network card to a specific
> > physical port is not part of the device configuration since it is not part of
> > the device's hardware interface or the device state representation.
> 
> I'd replace "since" with "when". There are cases where all ports are not
> equivalent. Or maybe you are saying that this is covered by other more
> relevant parts of the configuration like link speed?

Yes, the next part that talks about link speed is an example of how to
represent cases where certain aspects of the port do matter.

Based on the feedback I've gotten about this section, I think it was a
confusing example. I will rework this to make it clearer.

> What about the topology used to access the card? Would you want to be able
> to refer to things like IOMMU groups, etc?

Do you mean guest-visible IOMMU groups? In that case the vIOMMU (which
is a separate device) contains that information.

> > The device state can be loaded and run on a different physical port
> > without affecting the operation of the device. Therefore the physical port
> > is not part of the device configuration.
> 
> I would prefer if we could offer a mechanism here, rather than a policy, and
> let the upper layers in the stack be able to specify the policy.
> 
> Imagine for example that you have allocated ports between internal and
> external networks? The upper stack would probably want to migrate an
> "internal network" vfio to another "internal network" port, no?

Yes, the example is problematic. I will rework it.

> > Note that the device configuration is a conservative bound on device
> > states that can be migrated successfully since not all configuration
> > parameters may be strictly required to match on the source and
> > destination devices. For example, if the device's hardware interface has
> > not yet been initialized then changes to the link speed may not be
> > noticed. However, accurately representing runtime constraints is complex
> > and risks introducing migration bugs, so no attempt is made to support
> > them to achieve more relaxed bounds on successful migrations.
> 
> That makes me wonder if the distinction between configuration, version and
> state is really tight.
> 
> Consider a vGPU for example. It looks to me like the "shape" of the target
> vGPU would be part of "configuration" at first sight. But then, it might be
> instead a "state" request, "this is what I need", that could cause the
> target to reconfigure the vGPUs to match the description.
> 
> Notice that such a reconfiguration might be impossible. So this is still a
> migration validation, but it's a bit more dynamic.

The configuration describes the maximum set of features and device
resources. Some of them may not be utilized at runtime, but the
destination still has to provide them just in case they are used.

It's conservative because it refuses all migrations that could run into
trouble. However, it also refuses some migrations that could succeed.

As mentioned, detecting those cases is too complex and risky. Better
safe than sorry.

> Similarly, if we get to network cards and "upper stacks", you could consider
> the MAC address as part of the state or configuration, depending on the
> scenario. You could either want to "transport" the MAC address, or to
> have the upper stack follow some rules on which one to pick for the target.
> My understanding is that IPv6 DAD for example somewhat relies on the MAC
> address, and that this makes things complicated for OpenShift. Ask Stefano
> Brivio about that, he understands the problem much better than I do.
> 
> The bottom line is that IMO the line between configuration and state may be
> a bit fuzzy, even for a single device model, depending on the use case.

Yes, there is some freedom for the device designer to choose whether to
migrate values as part of the device state or to make them configuration
parameters.

> > Device Versions
> > ---------------
> > As a device evolves, the number of configuration parameters required may become
> > inconvenient for users to express in full. A device configuration can be
> > aliased by a *device version*, which is a shorthand for the full device
> > configuration. This makes it easy to apply a standard device configuration
> > without listing every configuration parameter explicitly.
> >
> > For example, if address filtering support was added to a network card then
> > device versions and the corresponding configurations may look like this:
> > * ``version=1`` - Behaves as if ``rx-filter-size=0``
> > * ``version=2`` - ``rx-filter-size=32``
> 
> To me, this corresponds to default settings, see below.
> 
> If two devices have different versions, do you allow migration?

Versions are just a shorthand. The configuration parameters are what is
compared to check migration compatibility.

If the configuration parameters differ then migration is not allowed.

Usually two versions specific different configuration parameters, so
migration between versions is not allowed.

> > Changes can be made to the device state representation as follows. Each change
> > to device state must have a corresponding device configuration parameter that
> > allows the change to toggled:
> >
> > * When the parameter is disabled the hardware interface and device state
> >   representation are unchanged. This allows old device states to be loaded.
> >
> > * When the parameter is enabled the change comes into effect.
> >
> > * The parameter's default value disables the change. Therefore old versions do
> >   not have to explicitly specify the parameter.
> 
> I see a problem with this. Imagine a new card has new parameter foo.
> Now, you once had a VM on this card that had foo=42. So it has
> foo-enabled=true and foo=42. Then you migrate there something that does not
> know about foo. Most likely, that would not even touch foo-enabled.
> 
> So I think that you need to add that the migration starts with a "reset
> state" where all featured are disabled by default.

You are right. When I wrote the document I assumed the destination will
always be a freshly-instantiated device, but it was pointed out that
devices may be pre-existing (e.g. from a pool of available VFIO
devices).

The sub-thread with Alex Williamson discusses how to set parameters
using mdev and VFIO sysfs attrs. It will be possible to configure an
existing device without creating a new instance.

> > The following example illustrates migration from an old device
> > implementation to a new one. A version=1 network card is migrated to a
> > new device implementation that is also capable of version=2 and adds the
> > rx-filter-size=32 parameter. The new device is instantiated with
> > version=1, which disables rx-filter-size and is capable of loading the
> > version=1 device state. The migration completes successfully but note
> > the device is still operating at version=1 level in the new device.
> >
> > The following example illustrates migration from a new device
> > implementation back to an older one. The new device implementation
> > supports version=1 and version=2. The old device implementation supports
> > version=1 only. Therefore the device can only be migrated when
> > instantiated with version=1 or the equivalent full configuration
> > parameters.
> >
> > Orchestrating Migrations
> > ------------------------
> > The following steps must be followed to migrate devices:
> >
> > 1. Check that the source and destination devices support the same device model.
> >
> > 2. Check that the destination device supports the source device's
> >    configuration. Each configuration parameter must be accepted by the
> >    destination in order to ensure that it will be possible to load the device
> >    state.
> >
> > 3. The device state is saved on the source and loaded on the destination.
> >
> > 4. If migration succeeds then the destination resumes operation and the source
> >    must not resume operation. If the migration fails then the source resumes
> >    operation and the destination must not resume operation.
> >
> > VFIO Implementation
> > -------------------
> > The following applies both to kernel VFIO/mdev drivers and vfio-user device
> > backends.
> >
> > Devices are instantiated based on a version and/or configuration parameters:
> > * ``version=1`` - use the device configuration aliased by version 1
> > * ``version=2,rx-filter-size=64`` - use version 1 and override ``rx-filter-size``
> > * ``rx-filter-size=0`` - directly set configuration parameters without using a version
> >
> > Device creation fails if the version and/or configuration parameters are not
> > supported.
> >
> > There must be a mechanism to query the "latest" configuration for a device
> > model. It may simply report the ``version=5`` where 5 is the latest version but
> > it could also report all configuration parameters instead of using a version
> > alias.
> 
> Instead of "latest", we could have a query that lists the "supported"
> configurations. Again, vGPUs are a good example where this would be
> useful. A same card can be partitioned in a number of ways, and you can't
> really claim that "M10-2B" or "M10-0Q" is "latest".

Thanks. I agree, there needs to be a way to report all available
configuration parameters and device versions.

> You could arguably assign a unique URI to each sub-model. Maybe that's how
> you were envisioning things?

It could be done either way. The device model is parameterized so it's
not necessary to define a unique URI for each sub-model.

If the device state representation and the hardware interface is similar
then using a single device model URI and expressing the differences
using configuration parameters seems reasonable.

A device model does not necessarily correspond to a single PCI
Device/Vendor ID, just like a Linux driver does can support many PCI
Device/Vendor IDs.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-04  7:50 ` Michael S. Tsirkin
@ 2020-11-04 16:37   ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-04 16:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: John G Johnson, Daniel P. Berrangé,
	quintela, Jason Wang, Kirti Wankhede, qemu-devel,
	Alex Williamson, Thanos Makatos, Felipe Franciosi, Paolo Bonzini,
	Dr. David Alan Gilbert

[-- Attachment #1: Type: text/plain, Size: 3138 bytes --]

On Wed, Nov 04, 2020 at 02:50:58AM -0500, Michael S. Tsirkin wrote:
> On Mon, Nov 02, 2020 at 11:11:53AM +0000, Stefan Hajnoczi wrote:
> > Device States
> > -------------
> > The details of the device state representation are not covered in this document
> > but the general requirements are discussed here.
> > 
> > The device state consists of data accessible through the device's hardware
> > interface and internal state that is needed to restore device operation.
> > State in the hardware interface includes the values of hardware registers.
> > An example of internal state is an index value needed to avoid processing
> > queued requests more than once.
> > 
> > Changes can be made to the device state representation as follows. Each change
> > to device state must have a corresponding device configuration parameter that
> > allows the change to toggled:
> > 
> > * When the parameter is disabled the hardware interface and device state
> >   representation are unchanged. This allows old device states to be loaded.
> > 
> > * When the parameter is enabled the change comes into effect.
> > 
> > * The parameter's default value disables the change. Therefore old versions do
> >   not have to explicitly specify the parameter.
> > 
> > The following example illustrates migration from an old device
> > implementation to a new one. A version=1 network card is migrated to a
> > new device implementation that is also capable of version=2 and adds the
> > rx-filter-size=32 parameter. The new device is instantiated with
> > version=1, which disables rx-filter-size and is capable of loading the
> > version=1 device state. The migration completes successfully but note
> > the device is still operating at version=1 level in the new device.
> > 
> > The following example illustrates migration from a new device
> > implementation back to an older one. The new device implementation
> > supports version=1 and version=2. The old device implementation supports
> > version=1 only. Therefore the device can only be migrated when
> > instantiated with version=1 or the equivalent full configuration
> > parameters.
> 
> So all this is pretty complex and easy for vendors to get wrong.  How
> about we introduce a directory under docs/interop/ where each supported
> device can list the format of its state and parameters and what is tied
> to what?

Yes, that would be great for standardizing the device state
representations and migration parameters. I'm not aware of any devices
that need standardization yet but let's do it for vfio-user VIRTIO
devices.

> I am a bit unsure about the usefulness of the version shortcut.
> It would be handy if all this was used directly by users
> but these are unlikely to want to orchestrate cross version
> migrations, and tools do not need shortcuts like these ...

Me too. It's much easier for humans to compare version 1 and 2 than to
compare a potentially long list of parameters, but if it's always done
by the tooling then it doesn't matter. The device version can be dropped
for now and we can bring it back if we need it.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-04  8:07     ` Gerd Hoffmann
@ 2020-11-04 16:40       ` Stefan Hajnoczi
  2020-11-05  6:47         ` Gerd Hoffmann
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-04 16:40 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Cornelia Huck, qemu-devel,
	Dr. David Alan Gilbert, Kirti Wankhede, Thanos Makatos,
	Alex Williamson, Felipe Franciosi, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1219 bytes --]

On Wed, Nov 04, 2020 at 09:07:45AM +0100, Gerd Hoffmann wrote:
> > > > The hardware interface together with the device state representation is called
> > > > a *device model*. Device models can be assigned URIs such as
> > > > https://qemu.org/devices/e1000e to uniquely identify them.
> > > 
> > > Is that something that needs to be put together for every device where we
> > > want to support migration? How do you create the URI?
> > 
> > Yes. If you are creating a custom device that no one else needs to
> > emulate then you can simply pick a unique URL:
> > 
> >   https://vendor.com/my-dev
> > 
> > There doesn't need to be anything at the URL. It's just a unique string
> > that no one else will use and therefore web URLs are handy because no
> > one else will accidentally pick your string.
> 
> If this is just a string I think it would be better to use the reverse
> domain name scheme (as used by virtio-serial too), i.e.
> 
>  - org.qemu.devices.e1000e
>  - com.vendor.my-dev

This is the Java syntax. Go uses gitlab.com/my-user/foo and I think it's
nicer but I think I'm bikeshedding.

Is there any particular reason why you prefer the reverse domain name
approach?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-04 10:14         ` Dr. David Alan Gilbert
@ 2020-11-04 16:47           ` Stefan Hajnoczi
  2020-11-04 17:32             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-04 16:47 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Kirti Wankhede,
	qemu-devel, Alex Williamson, Thanos Makatos, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 12972 bytes --]

On Wed, Nov 04, 2020 at 10:14:23AM +0000, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Tue, Nov 03, 2020 at 06:49:51PM +0000, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > On Tue, Nov 03, 2020 at 12:17:09PM +0000, Dr. David Alan Gilbert wrote:
> > > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > > > Device Models
> > > > > > -------------
> > > > > > Devices have a *hardware interface* consisting of hardware registers,
> > > > > > interrupts, and so on.
> > > > > > 
> > > > > > The hardware interface together with the device state representation is called
> > > > > > a *device model*. Device models can be assigned URIs such as
> > > > > > https://qemu.org/devices/e1000e to uniquely identify them.
> > > > > 
> > > > > I think this is a unique identifier, not actually a URI; the https://
> > > > > isn't needed since no one expects to ever connect to this.
> > > > 
> > > > Yes, it could be any unique string. If the URI idea is not popular we
> > > > can use any similar scheme.
> > > 
> > > I'm OK with it being a URI; just drop the https.
> > 
> > Okay.
> > 
> > > > > > However, secondary aspects related to the physical port may affect the device's
> > > > > > hardware interface and need to be reflected in the device configuration. The
> > > > > > link speed may depend on the physical port and be reported through the device's
> > > > > > hardware interface. In that case a ``link-speed`` configuration parameter is
> > > > > > required to prevent unexpected changes to the link speed after migration.
> > > > > 
> > > > > That's an interesting example; because depending on the device, it might
> > > > > be:
> > > > >     a) Completely virtualised so that the guest *shouldn't* know what
> > > > > the physical link speed is, precisely to allow the physical network on
> > > > > the destination to be different.
> > > > > 
> > > > >     b) Part of the migrated state
> > > > > 
> > > > >     c) Something that's allowed to be reloaded after migration
> > > > > 
> > > > >     d) Configurable
> > > > > 
> > > > > so I'm not sure whether it's a good example in this case or not.
> > > > 
> > > > Can you think of an example that has only one option?
> > > > 
> > > > I tried but couldn't. For example take a sound card. The guest is aware
> > > > the device supports stereo playback (2 output channels), but which exact
> > > > stereo host device is used doesn't matter, they are all suitable.
> > > > 
> > > > Now imagine migrating to a 7.1 surround-sound device. Similar options
> > > > come into play:
> > > > 
> > > > a) Emulate stereo and mix it to 7.1 surround-sound on the physical
> > > >    device. The guest still sees the stereo device.
> > > > 
> > > > b) Refuse migration.
> > > > 
> > > > c) Indicate that the output has switched and let the guest reconfigure
> > > >    itself (e.g. a sound card with multiple outputs, where one of them is
> > > >    stereo and another is 7.1 surround sound).
> > > > 
> > > > Which option is desirable depends on the use case.
> > > 
> > > Yes, but I think it might be worth calling out these differences;  there
> > > are explicitly cases where you don't want external changes to be visible
> > > and other cases where you do; both are valid, but both need thinking
> > > about. (Another one, GPU whether you have a monitor plugged in!)
> > 
> > Okay.
> > 
> > > > > Maybe what's needed is a stronger instruction to abstract external
> > > > > device state so that it's not part of the configuration in most cases.
> > > > 
> > > > Do you want to propose something?
> > > 
> > > I think something like 'Some part of a devices state may be irrelevant
> > > to a migration; for example on some NICs it might be preferable to hide
> > > the physical characteristics of the link from the guest.'
> > 
> > Got it.
> > 
> > > > > > For example, if address filtering support was added to a network card then
> > > > > > device versions and the corresponding configurations may look like this:
> > > > > > * ``version=1`` - Behaves as if ``rx-filter-size=0``
> > > > > > * ``version=2`` - ``rx-filter-size=32``
> > > > > 
> > > > > Note configuration parameters might have been added during the life of
> > > > > the device; e.g. if the original card had no support for rx-filters, it
> > > > > might not have a rx-filter-size parameter.
> > > > 
> > > > version=1 does not explicitly set rx-filter-size=0. When a new parameter
> > > > is introduced it must have a default value that disables its effect on
> > > > the hardware interface and/or device state representation. This is
> > > > described in a bit more detail in the next section, maybe it should be
> > > > reordered.
> > > 
> > > We've generally found the definition of devices tends in practice to be
> > > done newer->older; i.e. you define the current machine, and then define
> > > the next older machine setting the defaults that used to be true; then
> > > define the older version behind that....
> > 
> > That is not possible here because an older device implementation is
> > unaware of new configuration parameters.
> > 
> > Looking at the example above, imagine a version=1 device is instantiated
> > on a device implementation that supports both version=1 and version=2.
> > Should the configuration parameter list for version=1 be empty or
> > rx-filter-size=0?
> > 
> > It must to be empty, otherwise an older device implementation that only
> > supports version=1 cannot instantiate the device. The older device
> > implementation does not recognize the rx-filter-size configuration
> > parameter (it was introduced in version=2) so we cannot set it to 0.
> 
> I think this question might come down to who expands the device version
> definition.
> If it's the device itself that expands that, then a version 2 device
> knows about what it needs to do for version 1 compatibility.
> But if you're saying someone outside the device needs to be able to
> expand that list then I'm not sure how you'd keep that expansion in line
> with the implementation of a device.

The current approach is that the version is expanded into configuration
parameters when the device is instantiated. Those parameters are then
used to check migration compatibility of the destination (versions don't
play a role once the device has been created).

Michael replied in another sub-thread wondering if versions are really
necessary since tools do the migration checks. Let's try dropping
versions to simplify things. We can bring them back if needed later.

> > > > > > Device States
> > > > > > -------------
> > > > > > The details of the device state representation are not covered in this document
> > > > > > but the general requirements are discussed here.
> > > > > > 
> > > > > > The device state consists of data accessible through the device's hardware
> > > > > > interface and internal state that is needed to restore device operation.
> > > > > > State in the hardware interface includes the values of hardware registers.
> > > > > > An example of internal state is an index value needed to avoid processing
> > > > > > queued requests more than once.
> > > > > 
> > > > > I try and emphasise that 'internal state' should be represented in a way
> > > > > that reflects the problem rather than the particular implementation;
> > > > > this gives it a better chance of migrating to future versions.
> > > > 
> > > > Sounds like a good idea.
> > > > 
> > > > > > Changes can be made to the device state representation as follows. Each change
> > > > > > to device state must have a corresponding device configuration parameter that
> > > > > > allows the change to toggled:
> > > > > > 
> > > > > > * When the parameter is disabled the hardware interface and device state
> > > > > >   representation are unchanged. This allows old device states to be loaded.
> > > > > > 
> > > > > > * When the parameter is enabled the change comes into effect.
> > > > > > 
> > > > > > * The parameter's default value disables the change. Therefore old versions do
> > > > > >   not have to explicitly specify the parameter.
> > > > > > 
> > > > > > The following example illustrates migration from an old device
> > > > > > implementation to a new one. A version=1 network card is migrated to a
> > > > > > new device implementation that is also capable of version=2 and adds the
> > > > > > rx-filter-size=32 parameter. The new device is instantiated with
> > > > > > version=1, which disables rx-filter-size and is capable of loading the
> > > > > > version=1 device state. The migration completes successfully but note
> > > > > > the device is still operating at version=1 level in the new device.
> > > > > > 
> > > > > > The following example illustrates migration from a new device
> > > > > > implementation back to an older one. The new device implementation
> > > > > > supports version=1 and version=2. The old device implementation supports
> > > > > > version=1 only. Therefore the device can only be migrated when
> > > > > > instantiated with version=1 or the equivalent full configuration
> > > > > > parameters.
> > > > > 
> > > > > I'm sometimes asked for 'ways out' of buggy migration cases; e.g. what
> > > > > happens if version=1 forgot to migrate the X register; or what happens
> > > > > if verison=1 forgot to handle the special, rare case when X=5 and we
> > > > > now need to migrate some extra state.
> > > > 
> > > > Can these cases be handled by adding additional configuration parameters?
> > > > 
> > > > If version=1 is lacks essential state then version=2 can add it. The
> > > > user must configure the device to use version before they can save the
> > > > full state.
> > > > 
> > > > If version=1 didn't handle the X=5 case then the same solution is
> > > > needed. A new configuration parameter is introduced and the user needs
> > > > to configure the device to be the new version before migrating.
> > > > 
> > > > Unfortunately this requires poweroff or hotplugging a new device
> > > > instance. But some disruption is probably necessarily anyway so the
> > > > migration code on the host side can be patched to use the updated device
> > > > state representation.
> > > 
> > > There are some corner cases that people sometimes prefer; for example
> > > lets say the X=5 case is actually really rare - but when it happens the
> > > device is hopelessly broken, some device authors prefer to fix it and
> > > send the extra data and let the migration fail if the destination
> > > doesn't understand it (it would break anyway).
> > 
> > The device implementation needs to be updated to send the extra data. At
> > that point a new device configuration parameter should be introduced and
> > if the user wishes to run the new version of the device then the extra
> > data will be sent.
> > 
> > If the destination doesn't support the new parameter then migration will
> > be refused. That matches what you've described, so I think the approach
> > in this document handles this case.
> 
> Well that's the ideal; but the case I'm describing is where you're
> recovering from a screwup in which the migration is going to fail in a
> rare (runtime defined) corner case, and only sending the extra data
> in that rare case before you get a chance to define a new version.

You need to upgrade the migration code in order to produce that extra
data. Why not define a configuration parameter alongside this code
change?

> > > I've also been asked
> > > by mst for a 'unexpected data' mechanism to send data that the
> > > destination might not expect if it didn't know about it, for similar
> > > cases.
> > 
> > Do you mean optional data that can be more or less safely dropped? A new
> > device configuration parameter is not needed because the hardware
> > interface and device state representation remain compatible. That
> > feature can be defined in the device state representation spec and is
> > not visible at the layer discussed in this document. But I think it's
> > worth adding an explanation into this document explaining what to do.
> 
> I mean a way to send optional data that the destination can drop; but
> that the destination doesn't know what it means and at the time the
> destination was written, wasn't yet defined. It is part of the device
> state;  it's similar to the X=5 case above - but in this case it allows
> the migration not to fail even when you start sending the extra data.

The device state representation may have a way of sending optional data.
Since it just gets dropped if the destination doesn't recognize it there
is no need to introduce a configuration parameter and it doesn't play a
part in migration compatibility checks.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-04 16:47           ` Stefan Hajnoczi
@ 2020-11-04 17:32             ` Dr. David Alan Gilbert
  2020-11-05 11:40               ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-04 17:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Kirti Wankhede,
	qemu-devel, Alex Williamson, Thanos Makatos, Paolo Bonzini

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Nov 04, 2020 at 10:14:23AM +0000, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Tue, Nov 03, 2020 at 06:49:51PM +0000, Dr. David Alan Gilbert wrote:
> > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > > On Tue, Nov 03, 2020 at 12:17:09PM +0000, Dr. David Alan Gilbert wrote:
> > > > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > > > > Device Models
> > > > > > > -------------
> > > > > > > Devices have a *hardware interface* consisting of hardware registers,
> > > > > > > interrupts, and so on.
> > > > > > > 
> > > > > > > The hardware interface together with the device state representation is called
> > > > > > > a *device model*. Device models can be assigned URIs such as
> > > > > > > https://qemu.org/devices/e1000e to uniquely identify them.
> > > > > > 
> > > > > > I think this is a unique identifier, not actually a URI; the https://
> > > > > > isn't needed since no one expects to ever connect to this.
> > > > > 
> > > > > Yes, it could be any unique string. If the URI idea is not popular we
> > > > > can use any similar scheme.
> > > > 
> > > > I'm OK with it being a URI; just drop the https.
> > > 
> > > Okay.
> > > 
> > > > > > > However, secondary aspects related to the physical port may affect the device's
> > > > > > > hardware interface and need to be reflected in the device configuration. The
> > > > > > > link speed may depend on the physical port and be reported through the device's
> > > > > > > hardware interface. In that case a ``link-speed`` configuration parameter is
> > > > > > > required to prevent unexpected changes to the link speed after migration.
> > > > > > 
> > > > > > That's an interesting example; because depending on the device, it might
> > > > > > be:
> > > > > >     a) Completely virtualised so that the guest *shouldn't* know what
> > > > > > the physical link speed is, precisely to allow the physical network on
> > > > > > the destination to be different.
> > > > > > 
> > > > > >     b) Part of the migrated state
> > > > > > 
> > > > > >     c) Something that's allowed to be reloaded after migration
> > > > > > 
> > > > > >     d) Configurable
> > > > > > 
> > > > > > so I'm not sure whether it's a good example in this case or not.
> > > > > 
> > > > > Can you think of an example that has only one option?
> > > > > 
> > > > > I tried but couldn't. For example take a sound card. The guest is aware
> > > > > the device supports stereo playback (2 output channels), but which exact
> > > > > stereo host device is used doesn't matter, they are all suitable.
> > > > > 
> > > > > Now imagine migrating to a 7.1 surround-sound device. Similar options
> > > > > come into play:
> > > > > 
> > > > > a) Emulate stereo and mix it to 7.1 surround-sound on the physical
> > > > >    device. The guest still sees the stereo device.
> > > > > 
> > > > > b) Refuse migration.
> > > > > 
> > > > > c) Indicate that the output has switched and let the guest reconfigure
> > > > >    itself (e.g. a sound card with multiple outputs, where one of them is
> > > > >    stereo and another is 7.1 surround sound).
> > > > > 
> > > > > Which option is desirable depends on the use case.
> > > > 
> > > > Yes, but I think it might be worth calling out these differences;  there
> > > > are explicitly cases where you don't want external changes to be visible
> > > > and other cases where you do; both are valid, but both need thinking
> > > > about. (Another one, GPU whether you have a monitor plugged in!)
> > > 
> > > Okay.
> > > 
> > > > > > Maybe what's needed is a stronger instruction to abstract external
> > > > > > device state so that it's not part of the configuration in most cases.
> > > > > 
> > > > > Do you want to propose something?
> > > > 
> > > > I think something like 'Some part of a devices state may be irrelevant
> > > > to a migration; for example on some NICs it might be preferable to hide
> > > > the physical characteristics of the link from the guest.'
> > > 
> > > Got it.
> > > 
> > > > > > > For example, if address filtering support was added to a network card then
> > > > > > > device versions and the corresponding configurations may look like this:
> > > > > > > * ``version=1`` - Behaves as if ``rx-filter-size=0``
> > > > > > > * ``version=2`` - ``rx-filter-size=32``
> > > > > > 
> > > > > > Note configuration parameters might have been added during the life of
> > > > > > the device; e.g. if the original card had no support for rx-filters, it
> > > > > > might not have a rx-filter-size parameter.
> > > > > 
> > > > > version=1 does not explicitly set rx-filter-size=0. When a new parameter
> > > > > is introduced it must have a default value that disables its effect on
> > > > > the hardware interface and/or device state representation. This is
> > > > > described in a bit more detail in the next section, maybe it should be
> > > > > reordered.
> > > > 
> > > > We've generally found the definition of devices tends in practice to be
> > > > done newer->older; i.e. you define the current machine, and then define
> > > > the next older machine setting the defaults that used to be true; then
> > > > define the older version behind that....
> > > 
> > > That is not possible here because an older device implementation is
> > > unaware of new configuration parameters.
> > > 
> > > Looking at the example above, imagine a version=1 device is instantiated
> > > on a device implementation that supports both version=1 and version=2.
> > > Should the configuration parameter list for version=1 be empty or
> > > rx-filter-size=0?
> > > 
> > > It must to be empty, otherwise an older device implementation that only
> > > supports version=1 cannot instantiate the device. The older device
> > > implementation does not recognize the rx-filter-size configuration
> > > parameter (it was introduced in version=2) so we cannot set it to 0.
> > 
> > I think this question might come down to who expands the device version
> > definition.
> > If it's the device itself that expands that, then a version 2 device
> > knows about what it needs to do for version 1 compatibility.
> > But if you're saying someone outside the device needs to be able to
> > expand that list then I'm not sure how you'd keep that expansion in line
> > with the implementation of a device.
> 
> The current approach is that the version is expanded into configuration
> parameters when the device is instantiated. Those parameters are then
> used to check migration compatibility of the destination (versions don't
> play a role once the device has been created).
> 
> Michael replied in another sub-thread wondering if versions are really
> necessary since tools do the migration checks. Let's try dropping
> versions to simplify things. We can bring them back if needed later.

What does a user facing tool do?  If I say I want one of these NICs
and I'm on the latest QEMU machine type, who sets all these parameters?

Dave

> > > > > > > Device States
> > > > > > > -------------
> > > > > > > The details of the device state representation are not covered in this document
> > > > > > > but the general requirements are discussed here.
> > > > > > > 
> > > > > > > The device state consists of data accessible through the device's hardware
> > > > > > > interface and internal state that is needed to restore device operation.
> > > > > > > State in the hardware interface includes the values of hardware registers.
> > > > > > > An example of internal state is an index value needed to avoid processing
> > > > > > > queued requests more than once.
> > > > > > 
> > > > > > I try and emphasise that 'internal state' should be represented in a way
> > > > > > that reflects the problem rather than the particular implementation;
> > > > > > this gives it a better chance of migrating to future versions.
> > > > > 
> > > > > Sounds like a good idea.
> > > > > 
> > > > > > > Changes can be made to the device state representation as follows. Each change
> > > > > > > to device state must have a corresponding device configuration parameter that
> > > > > > > allows the change to toggled:
> > > > > > > 
> > > > > > > * When the parameter is disabled the hardware interface and device state
> > > > > > >   representation are unchanged. This allows old device states to be loaded.
> > > > > > > 
> > > > > > > * When the parameter is enabled the change comes into effect.
> > > > > > > 
> > > > > > > * The parameter's default value disables the change. Therefore old versions do
> > > > > > >   not have to explicitly specify the parameter.
> > > > > > > 
> > > > > > > The following example illustrates migration from an old device
> > > > > > > implementation to a new one. A version=1 network card is migrated to a
> > > > > > > new device implementation that is also capable of version=2 and adds the
> > > > > > > rx-filter-size=32 parameter. The new device is instantiated with
> > > > > > > version=1, which disables rx-filter-size and is capable of loading the
> > > > > > > version=1 device state. The migration completes successfully but note
> > > > > > > the device is still operating at version=1 level in the new device.
> > > > > > > 
> > > > > > > The following example illustrates migration from a new device
> > > > > > > implementation back to an older one. The new device implementation
> > > > > > > supports version=1 and version=2. The old device implementation supports
> > > > > > > version=1 only. Therefore the device can only be migrated when
> > > > > > > instantiated with version=1 or the equivalent full configuration
> > > > > > > parameters.
> > > > > > 
> > > > > > I'm sometimes asked for 'ways out' of buggy migration cases; e.g. what
> > > > > > happens if version=1 forgot to migrate the X register; or what happens
> > > > > > if verison=1 forgot to handle the special, rare case when X=5 and we
> > > > > > now need to migrate some extra state.
> > > > > 
> > > > > Can these cases be handled by adding additional configuration parameters?
> > > > > 
> > > > > If version=1 is lacks essential state then version=2 can add it. The
> > > > > user must configure the device to use version before they can save the
> > > > > full state.
> > > > > 
> > > > > If version=1 didn't handle the X=5 case then the same solution is
> > > > > needed. A new configuration parameter is introduced and the user needs
> > > > > to configure the device to be the new version before migrating.
> > > > > 
> > > > > Unfortunately this requires poweroff or hotplugging a new device
> > > > > instance. But some disruption is probably necessarily anyway so the
> > > > > migration code on the host side can be patched to use the updated device
> > > > > state representation.
> > > > 
> > > > There are some corner cases that people sometimes prefer; for example
> > > > lets say the X=5 case is actually really rare - but when it happens the
> > > > device is hopelessly broken, some device authors prefer to fix it and
> > > > send the extra data and let the migration fail if the destination
> > > > doesn't understand it (it would break anyway).
> > > 
> > > The device implementation needs to be updated to send the extra data. At
> > > that point a new device configuration parameter should be introduced and
> > > if the user wishes to run the new version of the device then the extra
> > > data will be sent.
> > > 
> > > If the destination doesn't support the new parameter then migration will
> > > be refused. That matches what you've described, so I think the approach
> > > in this document handles this case.
> > 
> > Well that's the ideal; but the case I'm describing is where you're
> > recovering from a screwup in which the migration is going to fail in a
> > rare (runtime defined) corner case, and only sending the extra data
> > in that rare case before you get a chance to define a new version.
> 
> You need to upgrade the migration code in order to produce that extra
> data. Why not define a configuration parameter alongside this code
> change?
> 
> > > > I've also been asked
> > > > by mst for a 'unexpected data' mechanism to send data that the
> > > > destination might not expect if it didn't know about it, for similar
> > > > cases.
> > > 
> > > Do you mean optional data that can be more or less safely dropped? A new
> > > device configuration parameter is not needed because the hardware
> > > interface and device state representation remain compatible. That
> > > feature can be defined in the device state representation spec and is
> > > not visible at the layer discussed in this document. But I think it's
> > > worth adding an explanation into this document explaining what to do.
> > 
> > I mean a way to send optional data that the destination can drop; but
> > that the destination doesn't know what it means and at the time the
> > destination was written, wasn't yet defined. It is part of the device
> > state;  it's similar to the X=5 case above - but in this case it allows
> > the migration not to fail even when you start sending the extra data.
> 
> The device state representation may have a way of sending optional data.
> Since it just gets dropped if the destination doesn't recognize it there
> is no need to introduce a configuration parameter and it doesn't play a
> part in migration compatibility checks.
> 
> Stefan


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-04 16:40       ` Stefan Hajnoczi
@ 2020-11-05  6:47         ` Gerd Hoffmann
  2020-11-05 11:42           ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Gerd Hoffmann @ 2020-11-05  6:47 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Cornelia Huck, qemu-devel,
	Dr. David Alan Gilbert, Kirti Wankhede, Thanos Makatos,
	Alex Williamson, Felipe Franciosi, Paolo Bonzini

  Hi,

> > > Yes. If you are creating a custom device that no one else needs to
> > > emulate then you can simply pick a unique URL:
> > > 
> > >   https://vendor.com/my-dev
> > > 
> > > There doesn't need to be anything at the URL. It's just a unique string
> > > that no one else will use and therefore web URLs are handy because no
> > > one else will accidentally pick your string.
> > 
> > If this is just a string I think it would be better to use the reverse
> > domain name scheme (as used by virtio-serial too), i.e.
> > 
> >  - org.qemu.devices.e1000e
> >  - com.vendor.my-dev
> 
> This is the Java syntax.

I think both android and ios use that too, for app naming (but maybe that
comes from java).

> Go uses gitlab.com/my-user/foo and I think it's
> nicer but I think I'm bikeshedding.
> 
> Is there any particular reason why you prefer the reverse domain name
> approach?

Having "https://" at the start is odd, especially if we don't require
that the given URL returns something useful.  Other that that I don't
mind that much whenever we use go-style or java-style strings, with a
slight preference for the latter for consistency with virtio-serial.

take care,
  Gerd



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-04 17:32             ` Dr. David Alan Gilbert
@ 2020-11-05 11:40               ` Stefan Hajnoczi
  2020-11-05 12:13                 ` Dr. David Alan Gilbert
  2020-11-05 12:53                 ` Michael S. Tsirkin
  0 siblings, 2 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-05 11:40 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Kirti Wankhede,
	qemu-devel, Alex Williamson, Thanos Makatos, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1243 bytes --]

On Wed, Nov 04, 2020 at 05:32:02PM +0000, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > Michael replied in another sub-thread wondering if versions are really
> > necessary since tools do the migration checks. Let's try dropping
> > versions to simplify things. We can bring them back if needed later.
> 
> What does a user facing tool do?  If I say I want one of these NICs
> and I'm on the latest QEMU machine type, who sets all these parameters?

The machine type is orthogonal since QEMU doesn't know about every
possible VFIO device. The device is like a PCI adapter that is added to
a physical machine aftermarket, it's not part of the base machine's
specs.

The migration tool queries the parameters from the source device.
VFIO/mdev will provide sysfs attrs. For vfio-user I'm not sure whether
to print the parameters during device instantiation, require a
VFIO-compatible FUSE directory, or to use a query-migration-params RPC
command.

Let's discuss this more when the next revision of the document is sent
out, because it modifies the approach so that migration parameters are
logically separate from device configuration parameters. That changes
things a bit.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-05  6:47         ` Gerd Hoffmann
@ 2020-11-05 11:42           ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-05 11:42 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Cornelia Huck, qemu-devel,
	Dr. David Alan Gilbert, Kirti Wankhede, Thanos Makatos,
	Alex Williamson, Felipe Franciosi, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 1426 bytes --]

On Thu, Nov 05, 2020 at 07:47:24AM +0100, Gerd Hoffmann wrote:
> > > > Yes. If you are creating a custom device that no one else needs to
> > > > emulate then you can simply pick a unique URL:
> > > > 
> > > >   https://vendor.com/my-dev
> > > > 
> > > > There doesn't need to be anything at the URL. It's just a unique string
> > > > that no one else will use and therefore web URLs are handy because no
> > > > one else will accidentally pick your string.
> > > 
> > > If this is just a string I think it would be better to use the reverse
> > > domain name scheme (as used by virtio-serial too), i.e.
> > > 
> > >  - org.qemu.devices.e1000e
> > >  - com.vendor.my-dev
> > 
> > This is the Java syntax.
> 
> I think both android and ios use that too, for app naming (but maybe that
> comes from java).
> 
> > Go uses gitlab.com/my-user/foo and I think it's
> > nicer but I think I'm bikeshedding.
> > 
> > Is there any particular reason why you prefer the reverse domain name
> > approach?
> 
> Having "https://" at the start is odd, especially if we don't require
> that the given URL returns something useful.  Other that that I don't
> mind that much whenever we use go-style or java-style strings, with a
> slight preference for the latter for consistency with virtio-serial.

Thanks for explaining. We can discuss the exact format in the next
revision if there are opinions.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-05 11:40               ` Stefan Hajnoczi
@ 2020-11-05 12:13                 ` Dr. David Alan Gilbert
  2020-11-05 12:47                   ` Michael S. Tsirkin
  2020-11-05 12:53                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-05 12:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Kirti Wankhede,
	qemu-devel, Alex Williamson, Thanos Makatos, Paolo Bonzini

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Nov 04, 2020 at 05:32:02PM +0000, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > Michael replied in another sub-thread wondering if versions are really
> > > necessary since tools do the migration checks. Let's try dropping
> > > versions to simplify things. We can bring them back if needed later.
> > 
> > What does a user facing tool do?  If I say I want one of these NICs
> > and I'm on the latest QEMU machine type, who sets all these parameters?
> 
> The machine type is orthogonal since QEMU doesn't know about every
> possible VFIO device. The device is like a PCI adapter that is added to
> a physical machine aftermarket, it's not part of the base machine's
> specs.

OK, but ignoring migration, I think the same problem holds; if I'm a
tool creating one of these VMs, and I plug this device in, what do I do
with all it's configuration parameters?  I'd assume most of the time
that they don't know about or dont care about most of the parameters,
they just want the sane defaults unless told otherwise.

> The migration tool queries the parameters from the source device.
> VFIO/mdev will provide sysfs attrs. For vfio-user I'm not sure whether
> to print the parameters during device instantiation, require a
> VFIO-compatible FUSE directory, or to use a query-migration-params RPC
> command.

But on VM creation we have to answer the question of what config do we
want; so for example lets say I'm creating a new VM in my cluster,
but I want to be sure that later I can migrate it.  I can read the 
config off one of the other machines;  can I just use that even if my
new machine has a later device implementation?

Dave

> Let's discuss this more when the next revision of the document is sent
> out, because it modifies the approach so that migration parameters are
> logically separate from device configuration parameters. That changes
> things a bit.
> 
> Stefan


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-05 12:13                 ` Dr. David Alan Gilbert
@ 2020-11-05 12:47                   ` Michael S. Tsirkin
  2020-11-05 14:17                     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2020-11-05 12:47 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: John G Johnson, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Kirti Wankhede,
	qemu-devel, Alex Williamson, Stefan Hajnoczi, Thanos Makatos,
	Paolo Bonzini

On Thu, Nov 05, 2020 at 12:13:24PM +0000, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Wed, Nov 04, 2020 at 05:32:02PM +0000, Dr. David Alan Gilbert wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > Michael replied in another sub-thread wondering if versions are really
> > > > necessary since tools do the migration checks. Let's try dropping
> > > > versions to simplify things. We can bring them back if needed later.
> > > 
> > > What does a user facing tool do?  If I say I want one of these NICs
> > > and I'm on the latest QEMU machine type, who sets all these parameters?
> > 
> > The machine type is orthogonal since QEMU doesn't know about every
> > possible VFIO device. The device is like a PCI adapter that is added to
> > a physical machine aftermarket, it's not part of the base machine's
> > specs.
> 
> OK, but ignoring migration, I think the same problem holds; if I'm a
> tool creating one of these VMs, and I plug this device in, what do I do
> with all it's configuration parameters?  I'd assume most of the time
> that they don't know about or dont care about most of the parameters,
> they just want the sane defaults unless told otherwise.

I think that if you ignore migration then you can ignore parameters.

> > The migration tool queries the parameters from the source device.
> > VFIO/mdev will provide sysfs attrs. For vfio-user I'm not sure whether
> > to print the parameters during device instantiation, require a
> > VFIO-compatible FUSE directory, or to use a query-migration-params RPC
> > command.
> 
> But on VM creation we have to answer the question of what config do we
> want; so for example lets say I'm creating a new VM in my cluster,
> but I want to be sure that later I can migrate it.  I can read the 
> config off one of the other machines;  can I just use that even if my
> new machine has a later device implementation?
> 
> Dave

I don't think so - we need a tool that can query a set of machines and then
produce a safe configuration.

The same problem exists with vhost as well it will just
explode exponentially with lots more devices and backends ...
Talked about it a bit in my kvm forum preso ...

> > Let's discuss this more when the next revision of the document is sent
> > out, because it modifies the approach so that migration parameters are
> > logically separate from device configuration parameters. That changes
> > things a bit.
> > 
> > Stefan
> 
> 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-05 11:40               ` Stefan Hajnoczi
  2020-11-05 12:13                 ` Dr. David Alan Gilbert
@ 2020-11-05 12:53                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2020-11-05 12:53 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: John G Johnson, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Dr. David Alan Gilbert,
	Kirti Wankhede, qemu-devel, Alex Williamson, Thanos Makatos,
	Paolo Bonzini

On Thu, Nov 05, 2020 at 11:40:37AM +0000, Stefan Hajnoczi wrote:
> On Wed, Nov 04, 2020 at 05:32:02PM +0000, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > Michael replied in another sub-thread wondering if versions are really
> > > necessary since tools do the migration checks. Let's try dropping
> > > versions to simplify things. We can bring them back if needed later.
> > 
> > What does a user facing tool do?  If I say I want one of these NICs
> > and I'm on the latest QEMU machine type, who sets all these parameters?
> 
> The machine type is orthogonal since QEMU doesn't know about every
> possible VFIO device. The device is like a PCI adapter that is added to
> a physical machine aftermarket, it's not part of the base machine's
> specs.

I think at least at the first stage, it is a smart thing to do
to have a list of allowed devices in QEMU. This way we can ask
for a spec of the migration format, include it in qemu
(or a subtree? I don't mind ...) and check it is sane.
And we can be reasonably sure we can make changes
without breaking the world - we will know whom to
contact if we change the protocol.

-- 
MST



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-05 12:47                   ` Michael S. Tsirkin
@ 2020-11-05 14:17                     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 40+ messages in thread
From: Dr. David Alan Gilbert @ 2020-11-05 14:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: John G Johnson, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Kirti Wankhede,
	qemu-devel, Alex Williamson, Stefan Hajnoczi, Thanos Makatos,
	Paolo Bonzini

* Michael S. Tsirkin (mst@redhat.com) wrote:
> On Thu, Nov 05, 2020 at 12:13:24PM +0000, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Wed, Nov 04, 2020 at 05:32:02PM +0000, Dr. David Alan Gilbert wrote:
> > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > > Michael replied in another sub-thread wondering if versions are really
> > > > > necessary since tools do the migration checks. Let's try dropping
> > > > > versions to simplify things. We can bring them back if needed later.
> > > > 
> > > > What does a user facing tool do?  If I say I want one of these NICs
> > > > and I'm on the latest QEMU machine type, who sets all these parameters?
> > > 
> > > The machine type is orthogonal since QEMU doesn't know about every
> > > possible VFIO device. The device is like a PCI adapter that is added to
> > > a physical machine aftermarket, it's not part of the base machine's
> > > specs.
> > 
> > OK, but ignoring migration, I think the same problem holds; if I'm a
> > tool creating one of these VMs, and I plug this device in, what do I do
> > with all it's configuration parameters?  I'd assume most of the time
> > that they don't know about or dont care about most of the parameters,
> > they just want the sane defaults unless told otherwise.
> 
> I think that if you ignore migration then you can ignore parameters.

So if I ingore parameters, do I get the latest, greatest config from the
device implementation I have?

> > > The migration tool queries the parameters from the source device.
> > > VFIO/mdev will provide sysfs attrs. For vfio-user I'm not sure whether
> > > to print the parameters during device instantiation, require a
> > > VFIO-compatible FUSE directory, or to use a query-migration-params RPC
> > > command.
> > 
> > But on VM creation we have to answer the question of what config do we
> > want; so for example lets say I'm creating a new VM in my cluster,
> > but I want to be sure that later I can migrate it.  I can read the 
> > config off one of the other machines;  can I just use that even if my
> > new machine has a later device implementation?
> > 
> > Dave
> 
> I don't think so - we need a tool that can query a set of machines and then
> produce a safe configuration.

That's why I liked the 'version' idea; you just have to find the lowest
version in your set.

Dave

> The same problem exists with vhost as well it will just
> explode exponentially with lots more devices and backends ...
> Talked about it a bit in my kvm forum preso ...
> 
> > > Let's discuss this more when the next revision of the document is sent
> > > out, because it modifies the approach so that migration parameters are
> > > logically separate from device configuration parameters. That changes
> > > things a bit.
> > > 
> > > Stefan
> > 
> > 
> > -- 
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: VFIO Migration
  2020-11-03 17:13     ` Alex Williamson
  2020-11-03 18:09       ` Stefan Hajnoczi
@ 2020-11-05 23:37       ` Yan Zhao
  1 sibling, 0 replies; 40+ messages in thread
From: Yan Zhao @ 2020-11-05 23:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: John G Johnson, Tian, Kevin, mtsirkin, Daniel P. Berrangé,
	quintela, Jason Wang, Felipe Franciosi, Zeng, Xin, qemu-devel,
	Dr. David Alan Gilbert, Kirti Wankhede, Stefan Hajnoczi,
	Thanos Makatos, Paolo Bonzini

On Tue, Nov 03, 2020 at 10:13:05AM -0700, Alex Williamson wrote:
> On Tue, 3 Nov 2020 11:03:24 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:

<...>
>  
> > Management tools need to match the device model/configuration from the
> > source device against the destination device. If the destination is
> > capable of supporting the source's device model/configuration then
> > migration can proceed safely.
> > 
> > Let's look at the case where we are migration from an older version of a
> > device to a newer version. On the source we have:
> > 
> >   model = https://vendor-a.com/my-nic
> > 
> > On the destination we have:
> > 
> >   model = https://vendor-a.com/my-nic
> >   rss = on
> > 
> > The two devices are incompatible because the destination exposes the RSS
> > feature that is not present on the source. The RSS feature involves
> > guest-visible hardware interface changes and a change to the device
> > state representation. It is not safe to migrate!
> > 
> > In this case an extra configuration step is necessary so that the
> > destination device can accept the device state from the source. The
> > management tool invokes a vendor-specific tool to put the device into
> > the right configuration:
> > 
> >   # vendor-tool set-migration-config --device 0000:00:04.0 \
> >                                      --model https://vendor-a.com/my-nic
> > 
> > (This tool only succeeds when the device is bound to VFIO but not yet
> > opened.)
> > 
> > The tool invokes ioctls on the vendor-specific VFIO driver that does two
> > things:
> > 1. Tells the device to present the old hardware interface without RSS
> > 2. Uses the old device state representation without RSS support
> > 
> > Does this approach fit?
> 
> 
> Should we not require that any sort of configuration like this occurs
> through sysfs?  We must be able to create an instance with a specific
> configuration without using vendor specific tools, therefore in the
> worse case we should be able to remove and recreate an instance as we
> desire without invoking vendor specific tools.  Thanks,
> 
hi Alex,
could mdevctl serve as a general configuration tool to create/destroy/config
mdev devices?

I think previously the main debate is on what is an easy way for management
tool to find and create a compatible target mdev device according to sysfs
info of source mdev device, is it?
as in [1], we have simplified the method to 1:1 matching of mdev_type
in src and target. and we can further force it to be 1:1 matching of
vendor_specific attributes (e.g. pci id) and dynamic resources
(e.g. aggregator, fps,...) and have mdevctl to create a compatible target
for management tools.

Given management tools like openstack are still in their preliminary
stage of supporting mdev devices, could we first settle down the
compatibility sysfs protocol and treat mdevctl as userspace tool
currently?

[1]: https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg03273.html

Thanks
Yan


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2020-11-05 23:39 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-02 11:11 VFIO Migration Stefan Hajnoczi
2020-11-02 12:28 ` Cornelia Huck
2020-11-02 14:56   ` Stefan Hajnoczi
2020-11-04  8:07     ` Gerd Hoffmann
2020-11-04 16:40       ` Stefan Hajnoczi
2020-11-05  6:47         ` Gerd Hoffmann
2020-11-05 11:42           ` Stefan Hajnoczi
2020-11-02 19:38 ` Alex Williamson
2020-11-03 11:03   ` Stefan Hajnoczi
2020-11-03 17:13     ` Alex Williamson
2020-11-03 18:09       ` Stefan Hajnoczi
2020-11-05 23:37       ` Yan Zhao
2020-11-03  8:46 ` Jason Wang
2020-11-03 12:15   ` Stefan Hajnoczi
2020-11-04  3:32     ` Jason Wang
2020-11-04  7:16       ` Stefan Hajnoczi
2020-11-03 11:39 ` Daniel P. Berrangé
2020-11-03 15:05   ` Stefan Hajnoczi
2020-11-03 15:23     ` Daniel P. Berrangé
2020-11-03 18:16       ` Stefan Hajnoczi
2020-11-03 12:17 ` Dr. David Alan Gilbert
2020-11-03 15:27   ` Stefan Hajnoczi
2020-11-03 18:49     ` Dr. David Alan Gilbert
2020-11-04  7:36       ` Stefan Hajnoczi
2020-11-04 10:14         ` Dr. David Alan Gilbert
2020-11-04 16:47           ` Stefan Hajnoczi
2020-11-04 17:32             ` Dr. David Alan Gilbert
2020-11-05 11:40               ` Stefan Hajnoczi
2020-11-05 12:13                 ` Dr. David Alan Gilbert
2020-11-05 12:47                   ` Michael S. Tsirkin
2020-11-05 14:17                     ` Dr. David Alan Gilbert
2020-11-05 12:53                 ` Michael S. Tsirkin
2020-11-04 11:05       ` Christophe de Dinechin
2020-11-03 15:23 ` Christophe de Dinechin
2020-11-03 15:33   ` Daniel P. Berrangé
2020-11-03 17:31     ` Alex Williamson
2020-11-04 10:13       ` Stefan Hajnoczi
2020-11-04 11:10   ` Stefan Hajnoczi
2020-11-04  7:50 ` Michael S. Tsirkin
2020-11-04 16:37   ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.