KVM Archive on lore.kernel.org
* device compatibility interface for live migration with assigned devices
@ 2020-07-13 23:29 Yan Zhao
  2020-07-14 10:21 ` Daniel P. Berrangé
  2020-07-16  4:16 ` Jason Wang
  0 siblings, 2 replies; 48+ messages in thread
From: Yan Zhao @ 2020-07-13 23:29 UTC
  To: devel, openstack-discuss, libvir-list
  Cc: intel-gvt-dev, kvm, qemu-devel, berrange, smooney, eskultet,
	alex.williamson, cohuck, dinechin, corbet, kwankhede, dgilbert,
	eauger, jian-feng.ding, hejie.xu, kevin.tian, zhenyuw,
	bao.yumeng, xin-ran.wang, shaohe.feng

hi folks,
we are defining a device migration compatibility interface that helps upper-layer
stacks such as openstack/ovirt/libvirt check whether two devices are
live-migration compatible.
The "devices" here could be MDEVs, physical devices, or a hybrid of the two.
e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV.
  (e.g. SIOV/SRIOV backward compatibility case)

The upper layer stack could use this interface as the last step to check
whether one device is able to migrate to another device before triggering a real
live migration procedure.
we are not sure whether this interface is of value or help to you. please don't
hesitate to share your comments.


(1) interface definition
The interface is defined as follows:

             __    userspace
              /\              \
             /                 \write
            / read              \
   ________/__________       ___\|/_____________
  | migration_version |     | migration_version |-->check migration
  ---------------------     ---------------------   compatibility
     device A                    device B


a device attribute named migration_version is defined under each device's
sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
userspace tools read the migration_version as a string from the source device,
and write it to the migration_version sysfs attribute in the target device.

Userspace should treat ANY of the conditions below as meaning the two devices are not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to
  migration_version attribute of the other device
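
As a rough illustration of the userspace side, the read-then-write sequence
and the three failure conditions could look like the Python sketch below.
The function name and its arguments are hypothetical; only the
migration_version attribute name comes from this proposal, and the sketch
assumes the target driver rejects an incompatible string by failing the write.

```python
import os


def migration_compatible(src_attr: str, dst_attr: str) -> bool:
    """Return True only if the target accepts the source's version string.

    src_attr and dst_attr are paths to the two migration_version
    attributes. A missing attribute, a read error, or a write error is
    treated as "not compatible".
    """
    # Condition 1: both devices must expose the attribute.
    if not (os.path.exists(src_attr) and os.path.exists(dst_attr)):
        return False
    try:
        # Condition 2: reading from the source must succeed.
        with open(src_attr, "r") as src:
            version = src.read()
        # Condition 3: the target driver is assumed to fail the write
        # (for example with EINVAL) when it cannot accept the string.
        with open(dst_attr, "w") as dst:
            dst.write(version)
    except OSError:
        return False
    return True
```

The string is passed through untouched, since userspace is not supposed to
interpret it.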

The string read from the migration_version attribute is defined by the device vendor
driver and is completely opaque to userspace.
for an Intel vGPU, the string format could be defined like
"parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".

for an NVMe VF connected to remote storage, it could be
"PCI ID" + "driver version" + "configured remote storage URL"

for a QAT VF, it may be
"PCI ID" + "driver version" + "supported encryption set".

(to avoid namespace conflicts between vendors, we may prefix a driver name to
each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)


(2) background

The reason we want the migration_version string to be opaque to userspace
is that it is hard to generalize standard comparison fields and comparison
methods across different devices from different vendors.
Userspace could still do a simple string comparison to check whether
two devices are compatible, and a match would give a correct result, but that is
too limited because it excludes compatible candidates whose migration_version
strings are not equal.
e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible
with another MDEV with mdev_type_3, aggregator count 1, even though their
migration_version strings are not equal
(assuming mdev_type_3 has 3 times the resources of mdev_type_1).
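
To make the aggregator example concrete, here is a toy Python sketch of the
kind of comparison a vendor driver could apply instead of string equality.
The resource table, the type names, and both functions are made up for
illustration; a real driver would do this in the kernel with its own rules.

```python
# Hypothetical resource units per mdev type (illustrative values only).
RESOURCE_UNITS = {"mdev_type_1": 1, "mdev_type_3": 3}


def total_resources(mdev_type: str, aggregator_count: int) -> int:
    """Resources of one MDEV: units of its type times its aggregator count."""
    return RESOURCE_UNITS[mdev_type] * aggregator_count


def resources_match(src: tuple, dst: tuple) -> bool:
    """Compare two (mdev_type, aggregator_count) pairs by total resources."""
    return total_resources(*src) == total_resources(*dst)
```

With this table, ("mdev_type_1", 3) and ("mdev_type_3", 1) compare as equal
even though a plain string comparison of the two descriptions would fail.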

besides that, the driver version and the configured resources are also elements that need
to be taken into account.

So, we hope to leave this freedom to the vendor driver and let it make the final decision,
simply by reading the string on the source side and writing it on the target side as a test.


we then think the device compatibility issues for live migration with assigned
devices can be divided into two steps:
a. management tools narrow down the possible migration target devices.
   Tags could be created according to info from the product specification.
   we think openstack/ovirt may have vendor proprietary components to create
   those customized tags for each product from each vendor.
   e.g.
   for an Intel vGPU, with a vGPU (an MDEV device) on the source side, the tags to
   search for a target vGPU are like:
   a tag for compatible parent PCI IDs,
   a tag for a range of gvt driver versions,
   a tag for a range of mdev type + aggregator count

   for an NVMe VF, the tags to search for a target VF may be like:
   a tag for compatible PCI IDs,
   a tag for a range of driver versions,
   a tag for the URL of the configured remote storage.

b. with the output from step a, openstack/ovirt/libvirt could use our proposed
   device migration compatibility interface to make sure the two devices are
   indeed live migration compatible before launching the real live migration
   process (stream copying, stopping the src device, and resuming the target
   device).
   This step is not expected to bring any performance penalty, as
   - in the kernel it is just simple string decoding and comparison
   - in openstack/ovirt, it could be done by extending the current function
     check_can_live_migrate_destination, alongside claiming target resources.[1]


[1] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvirt-neutron-sriov-livemigration.html

Thanks
Yan



* Re: device compatibility interface for live migration with assigned devices
  2020-07-13 23:29 device compatibility interface for live migration with assigned devices Yan Zhao
@ 2020-07-14 10:21 ` Daniel P. Berrangé
  2020-07-14 12:33   ` Sean Mooney
  2020-07-14 16:16   ` Alex Williamson
  2020-07-16  4:16 ` Jason Wang
  1 sibling, 2 replies; 48+ messages in thread
From: Daniel P. Berrangé @ 2020-07-14 10:21 UTC
  To: Yan Zhao
  Cc: devel, openstack-discuss, libvir-list, intel-gvt-dev, kvm,
	qemu-devel, smooney, eskultet, alex.williamson, cohuck, dinechin,
	corbet, kwankhede, dgilbert, eauger, jian-feng.ding, hejie.xu,
	kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang, shaohe.feng

On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
> hi folks,
> we are defining a device migration compatibility interface that helps upper
> layer stack like openstack/ovirt/libvirt to check if two devices are
> live migration compatible.
> The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> e.g. we could use it to check whether
> - a src MDEV can migrate to a target MDEV,
> - a src VF in SRIOV can migrate to a target VF in SRIOV,
> - a src MDEV can migration to a target VF in SRIOV.
>   (e.g. SIOV/SRIOV backward compatibility case)
> 
> The upper layer stack could use this interface as the last step to check
> if one device is able to migrate to another device before triggering a real
> live migration procedure.
> we are not sure if this interface is of value or help to you. please don't
> hesitate to drop your valuable comments.
> 
> 
> (1) interface definition
> The interface is defined in below way:
> 
>              __    userspace
>               /\              \
>              /                 \write
>             / read              \
>    ________/__________       ___\|/_____________
>   | migration_version |     | migration_version |-->check migration
>   ---------------------     ---------------------   compatibility
>      device A                    device B
> 
> 
> a device attribute named migration_version is defined under each device's
> sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> userspace tools read the migration_version as a string from the source device,
> and write it to the migration_version sysfs attribute in the target device.
> 
> The userspace should treat ANY of below conditions as two devices not compatible:
> - any one of the two devices does not have a migration_version attribute
> - error when reading from migration_version attribute of one device
> - error when writing migration_version string of one device to
>   migration_version attribute of the other device
> 
> The string read from migration_version attribute is defined by device vendor
> driver and is completely opaque to the userspace.
> for a Intel vGPU, string format can be defined like
> "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> 
> for an NVMe VF connecting to a remote storage. it could be
> "PCI ID" + "driver version" + "configured remote storage URL"
> 
> for a QAT VF, it may be
> "PCI ID" + "driver version" + "supported encryption set".
> 
> (to avoid namespace confliction from each vendor, we may prefix a driver name to
> each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
> 
> 
> (2) backgrounds
> 
> The reason we hope the migration_version string is opaque to the userspace
> is that it is hard to generalize standard comparing fields and comparing
> methods for different devices from different vendors.
> Though userspace now could still do a simple string compare to check if
> two devices are compatible, and result should also be right, it's still
> too limited as it excludes the possible candidate whose migration_version
> string fails to be equal.
> e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible
> with another MDEV with mdev_type_3, aggregator count 1, even their
> migration_version strings are not equal.
> (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
> 
> besides that, driver version + configured resources are all elements demanding
> to take into account.
> 
> So, we hope leaving the freedom to vendor driver and let it make the final decision
> in a simple reading from source side and writing for test in the target side way.
> 
> 
> we then think the device compatibility issues for live migration with assigned
> devices can be divided into two steps:
> a. management tools filter out possible migration target devices.
>    Tags could be created according to info from product specification.
>    we think openstack/ovirt may have vendor proprietary components to create
>    those customized tags for each product from each vendor.

>    for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to
>    search target vGPU are like:
>    a tag for compatible parent PCI IDs,
>    a tag for a range of gvt driver versions,
>    a tag for a range of mdev type + aggregator count
> 
>    for NVMe VF, the tags to search target VF may be like:
>    a tag for compatible PCI IDs,
>    a tag for a range of driver versions,
>    a tag for URL of configured remote storage.

Requiring management application developers to figure out this possible
compatibility based on prod specs is really unrealistic. Product specs
are typically as clear as mud and, with the suggestion that we consider
different rules for different types of devices, add up to a huge amount
of complexity. This isn't something app developers should have to spend
their time figuring out.

The suggestion that we make use of vendor proprietary helper components
is totally unacceptable. We need to be able to build a solution that
works with exclusively an open source software stack.

IMHO there needs to be a mechanism for the kernel to report via sysfs
what versions are supported on a given device. This puts the job of
reporting compatible versions directly under the responsibility of the
vendor who writes the kernel driver for it. They are the ones with the
best knowledge of the hardware they've built and the rules around its
compatibility.

> b. with the output from step a, openstack/ovirt/libvirt could use our proposed
>    device migration compatibility interface to make sure the two devices are
>    indeed live migration compatible before launching the real live migration
>    process to start stream copying, src device stopping and target device
>    resuming.
>    It is supposed that this step would not bring any performance penalty as
>    -in kernel it's just a simple string decoding and comparing
>    -in openstack/ovirt, it could be done by extending current function
>     check_can_live_migrate_destination, along side claiming target resources.[1]




> 
> 
> [1] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvirt-neutron-sriov-livemigration.html
> 
> Thanks
> Yan
> 

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: device compatibility interface for live migration with assigned devices
  2020-07-14 10:21 ` Daniel P. Berrangé
@ 2020-07-14 12:33   ` Sean Mooney
       [not found]     ` <20200714110148.0471c03c@x1.home>
  2020-07-14 16:16   ` Alex Williamson
  1 sibling, 1 reply; 48+ messages in thread
From: Sean Mooney @ 2020-07-14 12:33 UTC
  To: Daniel P. Berrangé, Yan Zhao
  Cc: devel, openstack-discuss, libvir-list, intel-gvt-dev, kvm,
	qemu-devel, eskultet, alex.williamson, cohuck, dinechin, corbet,
	kwankhede, dgilbert, eauger, jian-feng.ding, hejie.xu,
	kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang, shaohe.feng

On Tue, 2020-07-14 at 11:21 +0100, Daniel P. Berrangé wrote:
> On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
> > hi folks,
> > we are defining a device migration compatibility interface that helps upper
> > layer stack like openstack/ovirt/libvirt to check if two devices are
> > live migration compatible.
> > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > e.g. we could use it to check whether
> > - a src MDEV can migrate to a target MDEV,
mdev live migration is entirely possible, but I agree with Dan Berrangé's comments.
From the point of view of openstack integration, I don't see calling out to a vendor-specific
tool as an acceptable solution for device compatibility checking. The sysfs tree
that describes the mdevs that can be created should also contain the relevant information, so
that nova could consume it via the libvirt xml representation or retrieve it directly
from sysfs.
> > - a src VF in SRIOV can migrate to a target VF in SRIOV,
VF-to-VF migration is not possible in the general case, as the SR-IOV specs produced by the
PCI-SIG define no standardised way to transfer the device state. As such, there is no
vendor-neutral way to support SR-IOV live migration.
> > - a src MDEV can migration to a target VF in SRIOV.
That also makes this case unviable.
> >   (e.g. SIOV/SRIOV backward compatibility case)
> > 
> > The upper layer stack could use this interface as the last step to check
> > if one device is able to migrate to another device before triggering a real
> > live migration procedure.
Well, actually that is already too late. Ideally we would want to do this compatibility
check much sooner to avoid the migration failing. In an openstack environment, at least,
by the time we invoke libvirt (assuming you're using the libvirt driver) to do the migration, we have already
finished scheduling the instance to the new host. If we do the compatibility check at this point
and it fails, then the live migration is aborted and will not be retried. These kinds of late checks lead to a
poor user experience: unless you check the migration details, it basically looks like the migration was ignored,
as the instance starts to migrate and then continues running on the original host.

When using generic PCI passthrough with openstack, the PCI alias is intended to reference a single vendor id/product
id, so you will have one or more aliases for each type of device. That allows openstack to schedule based on the availability of a
compatible device, because we track inventories of PCI devices and can query them when selecting a host.

If we were to support mdev live migration in the future, we would want to take the same declarative approach:
1 introspect the capabilities of the devices we manage
2 create inventories of the allocatable devices and their capabilities
3 schedule the instance to a host based on the device type/capabilities and claim the device atomically to prevent races
4 have the lower-level hypervisors do additional validation, if needed, before the live migration.

This proposal seems to target extending step 4, whereas ideally we should focus on providing the info that would
be relevant in step 1, preferably in a vendor-neutral way via a kernel interface like /sys.
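
A rough Python sketch of steps 2 and 3, using a made-up in-memory inventory rather than
nova's real data model. The Device class, its field names, and claim_device are all
hypothetical and only illustrate the "query an inventory, then claim" pattern:

```python
from dataclasses import dataclass


@dataclass
class Device:
    host: str
    mdev_type: str
    allocated: bool = False


def claim_device(inventory: list, mdev_type: str, exclude_host: str):
    """Pick and claim a free device of the wanted type on another host.

    Returns the claimed Device, or None if no host has a match.
    A real scheduler would make the claim atomic; this sketch does not.
    """
    for dev in inventory:
        if (dev.host != exclude_host and dev.mdev_type == mdev_type
                and not dev.allocated):
            dev.allocated = True
            return dev
    return None
```

The point of the pattern is that the scheduler only reads data it already tracks; nothing
has to be written to a remote host's sysfs to pick a candidate.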
 
> > we are not sure if this interface is of value or help to you. please don't
> > hesitate to drop your valuable comments.
> > 
> > 
> > (1) interface definition
> > The interface is defined in below way:
> > 
> >              __    userspace
> >               /\              \
> >              /                 \write
> >             / read              \
> >    ________/__________       ___\|/_____________
> >   | migration_version |     | migration_version |-->check migration
> >   ---------------------     ---------------------   compatibility
> >      device A                    device B
> > 
> > 
> > a device attribute named migration_version is defined under each device's
> > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
This might be useful, as we could tag the inventory with the migration version and only migrate to
devices with the same version.
> > userspace tools read the migration_version as a string from the source device,
> > and write it to the migration_version sysfs attribute in the target device.
This would not be useful, as the scheduler cannot directly connect to the compute host,
and even if it could, it would be extremely slow to do this for 1000s of hosts and potentially
multiple devices per host.
> > 
> > The userspace should treat ANY of below conditions as two devices not compatible:
> > - any one of the two devices does not have a migration_version attribute
> > - error when reading from migration_version attribute of one device
> > - error when writing migration_version string of one device to
> >   migration_version attribute of the other device
> > 
> > The string read from migration_version attribute is defined by device vendor
> > driver and is completely opaque to the userspace.
Opaque vendor-specific strings that higher-level orchestrators have to pass from host
to host and can't reason about are evil. When allowed, they proliferate and
make any idea of a vendor-neutral abstraction and interoperability between systems
impossible to reason about. That said, there is a way to make it opaque but still useful
to userspace; see below.
> > for a Intel vGPU, string format can be defined like
> > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > 
> > for an NVMe VF connecting to a remote storage. it could be
> > "PCI ID" + "driver version" + "configured remote storage URL"
> > 
> > for a QAT VF, it may be
> > "PCI ID" + "driver version" + "supported encryption set".
> > 
> > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
Honestly, I would much prefer it if the version string were just a semver string,
e.g. {major}.{minor}.{bugfix}

If you do a driver/firmware update that breaks compatibility with an older version, bump the
major version.

If you add an optional feature that does not break backwards compatibility when you migrate
an older instance to the new host, then just bump the minor/feature number.

If you have a fix for a bug that does not change the feature set or compatibility, backwards or
forwards, then bump the bugfix number.

Then the check is as simple as:
1.) is the mdev type the same?
2.) is the major version the same?
3.) am I going from a version to the same version or to a newer version?

If all 3 are true we can migrate.
e.g.
2.0.1 -> 2.1.1 (ok: same major version, migrating from an older feature release to a newer feature release)
2.1.1 -> 2.0.1 (not ok: same major version, but migrating from a newer feature release to an older feature release may be
incompatible)
2.0.0 -> 3.0.0 (not ok: changing major version)
2.0.1 -> 2.0.0 (ok: same major and minor version; all bugfixes in the same minor release should be compatible)
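
A minimal Python sketch of that check, assuming version strings of exactly the form
{major}.{minor}.{bugfix}; the function names are mine, and the bugfix number is ignored
as the last example implies:

```python
def parse_semver(version: str) -> tuple:
    """Split "major.minor.bugfix" into a tuple of three ints."""
    major, minor, bugfix = (int(part) for part in version.split("."))
    return major, minor, bugfix


def can_migrate(src_type: str, src_ver: str, dst_type: str, dst_ver: str) -> bool:
    """Apply the three checks; the bugfix number plays no part."""
    src_major, src_minor, _ = parse_semver(src_ver)
    dst_major, dst_minor, _ = parse_semver(dst_ver)
    return (src_type == dst_type          # 1) same mdev type
            and src_major == dst_major    # 2) same major version
            and dst_minor >= src_minor)   # 3) same or newer feature release
```

Run against the four examples, this gives ok, not ok, not ok, ok, in that order.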

We don't need the vendor to re-encode the driver name or the vendor id and product id in the string. That info is already
available both to the device driver and to userspace via /sys. We just need to know whether versions of
the same mdev are compatible, so a simple semver version string, which is well known in the software world
at least, is a clean abstraction we can reuse.

> > (2) backgrounds
> > 
> > The reason we hope the migration_version string is opaque to the userspace
> > is that it is hard to generalize standard comparing fields and comparing
> > methods for different devices from different vendors.
> > Though userspace now could still do a simple string compare to check if
> > two devices are compatible, and result should also be right, it's still
> > too limited as it excludes the possible candidate whose migration_version
> > string fails to be equal.
> > e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible
> > with another MDEV with mdev_type_3, aggregator count 1, even their
> > migration_version strings are not equal.
> > (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
> > 
> > besides that, driver version + configured resources are all elements demanding
> > to take into account.
> > 
> > So, we hope leaving the freedom to vendor driver and let it make the final decision
> > in a simple reading from source side and writing for test in the target side way.
> > 
> > 
> > we then think the device compatibility issues for live migration with assigned
> > devices can be divided into two steps:
> > a. management tools filter out possible migration target devices.
> >    Tags could be created according to info from product specification.
> >    we think openstack/ovirt may have vendor proprietary components to create
> >    those customized tags for each product from each vendor.
> >    for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to
> >    search target vGPU are like:
> >    a tag for compatible parent PCI IDs,
> >    a tag for a range of gvt driver versions,
> >    a tag for a range of mdev type + aggregator count
> > 
> >    for NVMe VF, the tags to search target VF may be like:
> >    a tag for compatible PCI IDs,
> >    a tag for a range of driver versions,
> >    a tag for URL of configured remote storage.
> 
> Requiring management application developers to figure out this possible
> compatibility based on prod specs is really unrealistic. Product specs
> are typically as clear as mud, and with the suggestion we consider
> different rules for different types of devices, add up to a huge amount
> of complexity. This isn't something app developers should have to spend
> their time figuring out.
> 
> The suggestion that we make use of vendor proprietary helper components
> is totally unacceptable. We need to be able to build a solution that
> works with exclusively an open source software stack.
> 
> IMHO there needs to be a mechanism for the kernel to report via sysfs
> what versions are supported on a given device. This puts the job of
> reporting compatible versions directly under the responsibility of the
> vendor who writes the kernel driver for it. They are the ones with the
> best knowledge of the hardware they've built and the rules around its
> compatibility.
yep totally agree with that statement.
> 
> > b. with the output from step a, openstack/ovirt/libvirt could use our proposed
> >    device migration compatibility interface to make sure the two devices are
> >    indeed live migration compatible before launching the real live migration
> >    process to start stream copying, src device stopping and target device
> >    resuming.
> >    It is supposed that this step would not bring any performance penalty as
> >    -in kernel it's just a simple string decoding and comparing
> >    -in openstack/ovirt, it could be done by extending current function
> >     check_can_live_migrate_destination, along side claiming target resources.[1]
That is a compute driver function
https://github.com/openstack/nova/blob/8988316b8c132c9662dea6cf0345975e87ce7344/nova/virt/driver.py#L1261-L1278
that is called in the conductor here:

https://github.com/openstack/nova/blob/8988316b8c132c9662dea6cf0345975e87ce7344/nova/conductor/tasks/live_migrate.py#L360-L364
If the check fails (ignoring the fact that it is expensive to do an RPC to the compute host), we raise an exception and
move on to the next host in the alternate host list.

https://github.com/openstack/nova/blob/8988316b8c132c9662dea6cf0345975e87ce7344/nova/conductor/tasks/live_migrate.py#L556-L567
By default the alternate host list length is 3:
https://docs.openstack.org/nova/latest/configuration/config.html#scheduler.max_attempts
so there would be a pretty high likelihood that, if we only checked compatibility at this point, it would fail to
migrate. Realistically speaking this is too late. We can do a final safety check at this point, but this should
not be the first time we check compatibility. At a minimum we would want to have selected a host with the same mdev
type first. We can do that from the info we have today, but I hope I have made the point that declarative interfaces
which we can introspect without having opaque vendor-specific blobs are vastly more consumable than imperative
interfaces we have to probe. From a security and packaging point of view this is better too: if I only need
read-only access to sysfs instead of write access, and if I don't need to package a bunch of additional vendor tools
in a containerised deployment, that significantly decreases the potential attack surface.
> 
> 
> 
> 
> > 
> > 
> > [1] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvirt-neutron-sriov-livemigration.html
> > 
> > Thanks
> > Yan
> > 
> 
> Regards,
> Daniel



* Re: device compatibility interface for live migration with assigned devices
  2020-07-14 10:21 ` Daniel P. Berrangé
  2020-07-14 12:33   ` Sean Mooney
@ 2020-07-14 16:16   ` Alex Williamson
  2020-07-14 16:47     ` Daniel P. Berrangé
  2020-07-14 17:19     ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 48+ messages in thread
From: Alex Williamson @ 2020-07-14 16:16 UTC
  To: Daniel P. Berrangé
  Cc: Yan Zhao, devel, openstack-discuss, libvir-list, intel-gvt-dev,
	kvm, qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, dgilbert, eauger, jian-feng.ding, hejie.xu,
	kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang, shaohe.feng

On Tue, 14 Jul 2020 11:21:29 +0100
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
> > hi folks,
> > we are defining a device migration compatibility interface that helps upper
> > layer stack like openstack/ovirt/libvirt to check if two devices are
> > live migration compatible.
> > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > e.g. we could use it to check whether
> > - a src MDEV can migrate to a target MDEV,
> > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > - a src MDEV can migration to a target VF in SRIOV.
> >   (e.g. SIOV/SRIOV backward compatibility case)
> > 
> > The upper layer stack could use this interface as the last step to check
> > if one device is able to migrate to another device before triggering a real
> > live migration procedure.
> > we are not sure if this interface is of value or help to you. please don't
> > hesitate to drop your valuable comments.
> > 
> > 
> > (1) interface definition
> > The interface is defined in below way:
> > 
> >              __    userspace
> >               /\              \
> >              /                 \write
> >             / read              \
> >    ________/__________       ___\|/_____________
> >   | migration_version |     | migration_version |-->check migration
> >   ---------------------     ---------------------   compatibility
> >      device A                    device B
> > 
> > 
> > a device attribute named migration_version is defined under each device's
> > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > userspace tools read the migration_version as a string from the source device,
> > and write it to the migration_version sysfs attribute in the target device.
> > 
> > The userspace should treat ANY of below conditions as two devices not compatible:
> > - any one of the two devices does not have a migration_version attribute
> > - error when reading from migration_version attribute of one device
> > - error when writing migration_version string of one device to
> >   migration_version attribute of the other device
> > 
> > The string read from migration_version attribute is defined by device vendor
> > driver and is completely opaque to the userspace.
> > for a Intel vGPU, string format can be defined like
> > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > 
> > for an NVMe VF connecting to a remote storage. it could be
> > "PCI ID" + "driver version" + "configured remote storage URL"
> > 
> > for a QAT VF, it may be
> > "PCI ID" + "driver version" + "supported encryption set".
> > 
> > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)

It's very strange to define it as opaque and then proceed to describe
the contents of that opaque string.  The point is that its contents
are defined by the vendor driver to describe the device, driver version,
and possibly metadata about the configuration of the device.  One
instance of a device might generate a different string from another.
The string that a device produces is not necessarily the only string
the vendor driver will accept, for example the driver might support
backwards compatible migrations.

> > (2) backgrounds
> > 
> > The reason we hope the migration_version string is opaque to the userspace
> > is that it is hard to generalize standard comparing fields and comparing
> > methods for different devices from different vendors.
> > Though userspace now could still do a simple string compare to check if
> > two devices are compatible, and result should also be right, it's still
> > too limited as it excludes the possible candidate whose migration_version
> > string fails to be equal.
> > e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible
> > with another MDEV with mdev_type_3, aggregator count 1, even their
> > migration_version strings are not equal.
> > (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
> > 
> > besides that, driver version + configured resources are all elements demanding
> > to take into account.
> > 
> > So, we hope leaving the freedom to vendor driver and let it make the final decision
> > in a simple reading from source side and writing for test in the target side way.
> > 
> > 
> > we then think the device compatibility issues for live migration with assigned
> > devices can be divided into two steps:
> > a. management tools filter out possible migration target devices.
> >    Tags could be created according to info from product specification.
> >    we think openstack/ovirt may have vendor proprietary components to create
> >    those customized tags for each product from each vendor.  
> 
> >    for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to
> >    search target vGPU are like:
> >    a tag for compatible parent PCI IDs,
> >    a tag for a range of gvt driver versions,
> >    a tag for a range of mdev type + aggregator count
> > 
> >    for NVMe VF, the tags to search target VF may be like:
> >    a tag for compatible PCI IDs,
> >    a tag for a range of driver versions,
> >    a tag for URL of configured remote storage.  

I interpret this as hand waving, ie. the first step is for management
tools to make a good guess :-\  We don't seem to be willing to say that
a given mdev type can only migrate to a device with that same type.
There's this aggregation discussion happening separately where a base
mdev type might be created or later configured to be equivalent to a
different type.  The vfio migration API we've defined is also not
limited to mdev devices, for example we could create vendor specific
quirks or hooks to provide migration support for a physical PF/VF
device.  Within the realm of possibility then is that we could migrate
between a physical device and an mdev device, which are simply
different degrees of creating a virtualization layer in front of the
device.
 
> Requiring management application developers to figure out this possible
> compatibility based on prod specs is really unrealistic. Product specs
> are typically as clear as mud, and with the suggestion we consider
> different rules for different types of devices, add up to a huge amount
> of complexity. This isn't something app developers should have to spend
> their time figuring out.

Agreed.

> The suggestion that we make use of vendor proprietary helper components
> is totally unacceptable. We need to be able to build a solution that
> works with exclusively an open source software stack.

I'm surprised to see this as well, but I'm not sure if Yan was really
suggesting proprietary software so much as just vendor specific
knowledge.

> IMHO there needs to be a mechanism for the kernel to report via sysfs
> what versions are supported on a given device. This puts the job of
> reporting compatible versions directly under the responsibility of the
> vendor who writes the kernel driver for it. They are the ones with the
> best knowledge of the hardware they've built and the rules around its
> compatibility.

The version string discussed previously is the version string that
represents a given device, possibly including driver information,
configuration, etc.  I think what you're asking for here is an
enumeration of every possible version string that a given device could
accept as an incoming migration stream.  If we consider the string as
opaque, that means the vendor driver needs to generate a separate
string for every possible version it could accept, for every possible
configuration option.  That potentially becomes an excessive amount of
data to either generate or manage.

Am I overestimating how vendors intend to use the version string?

We'd also need to consider devices that we could create, for instance
providing the same interface enumeration prior to creating an mdev
device to have a confidence level that the new device would be a valid
target.

We defined the string as opaque to allow vendor flexibility and because
defining a common format is hard.  Do we need to revisit this part of
the discussion to define the version string as non-opaque with parsing
rules, probably with separate incoming vs outgoing interfaces?  Thanks,

Alex



* Re: device compatibility interface for live migration with assigned devices
  2020-07-14 16:16   ` Alex Williamson
@ 2020-07-14 16:47     ` Daniel P. Berrangé
  2020-07-14 20:47       ` Alex Williamson
  2020-07-14 17:19     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 48+ messages in thread
From: Daniel P. Berrangé @ 2020-07-14 16:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yan Zhao, devel, openstack-discuss, libvir-list, intel-gvt-dev,
	kvm, qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, dgilbert, eauger, jian-feng.ding, hejie.xu,
	kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang, shaohe.feng

On Tue, Jul 14, 2020 at 10:16:16AM -0600, Alex Williamson wrote:
> On Tue, 14 Jul 2020 11:21:29 +0100
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
> > > 
> > > The string read from migration_version attribute is defined by device vendor
> > > driver and is completely opaque to the userspace.
> > > for a Intel vGPU, string format can be defined like
> > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > 
> > > for an NVMe VF connecting to a remote storage. it could be
> > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > 
> > > for a QAT VF, it may be
> > > "PCI ID" + "driver version" + "supported encryption set".
> > > 
> > > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
> 
> It's very strange to define it as opaque and then proceed to describe
> the contents of that opaque string.  The point is that its contents
> are defined by the vendor driver to describe the device, driver version,
> and possibly metadata about the configuration of the device.  One
> instance of a device might generate a different string from another.
> The string that a device produces is not necessarily the only string
> the vendor driver will accept, for example the driver might support
> backwards compatible migrations.


> > IMHO there needs to be a mechanism for the kernel to report via sysfs
> > what versions are supported on a given device. This puts the job of
> > reporting compatible versions directly under the responsibility of the
> > vendor who writes the kernel driver for it. They are the ones with the
> > best knowledge of the hardware they've built and the rules around its
> > compatibility.
> 
> The version string discussed previously is the version string that
> represents a given device, possibly including driver information,
> configuration, etc.  I think what you're asking for here is an
> enumeration of every possible version string that a given device could
> accept as an incoming migration stream.  If we consider the string as
> opaque, that means the vendor driver needs to generate a separate
> string for every possible version it could accept, for every possible
> configuration option.  That potentially becomes an excessive amount of
> data to either generate or manage.
> 
> Am I overestimating how vendors intend to use the version string?

If I'm interpreting your reply & the quoted text correctly, the version
string isn't really a version string in any normal sense of the word
"version".

Instead it sounds like a string encoding a set of features in some arbitrary
vendor specific format, which they parse to do compatibility checks on
individual pieces?  One or more parts may contain a version number, but
it's much more than just a version.

If that's correct, then I'd prefer we didn't call it a version string,
instead call it a "capability string" to make it clear it is expressing
a much more general concept, but...

> We'd also need to consider devices that we could create, for instance
> providing the same interface enumeration prior to creating an mdev
> device to have a confidence level that the new device would be a valid
> target.
> 
> We defined the string as opaque to allow vendor flexibility and because
> defining a common format is hard.  Do we need to revisit this part of
> the discussion to define the version string as non-opaque with parsing
> rules, probably with separate incoming vs outgoing interfaces?  Thanks,

..even if the huge amount of flexibility is technically relevant from the
POV of the hardware/drivers, we should consider whether management apps
actually want, or can use, that level of flexibility.

The task of picking which host to place a VM on has a lot of factors to
consider, and when there are a large number of hosts, the total amount
of information to check gets correspondingly large.  The placement
process is also fairly performance critical.

Running complex algorithmic logic to check compatibility of devices
based on an arbitrary set of rules is likely to be a performance
challenge. A flat list of supported strings is a much simpler
thing to check as it reduces down to a simple set membership test.
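
As a minimal sketch of that membership test on the management side,
assuming each host reports a flat set of strings its device would accept
(the function name, host names and sample strings below are hypothetical,
not part of any existing interface):

```python
def compatible_hosts(source_string, accepted_by_host):
    """Return hosts whose reported set of accepted strings contains source_string."""
    return sorted(
        host for host, accepted in accepted_by_host.items()
        if source_string in accepted
    )


# Hypothetical data: each host reports the strings its device would accept.
accepted_by_host = {
    "host-a": {"i915-v1-8086-591d-i915-GVTg_V5_8-1"},
    "host-b": {"i915-v1-8086-591d-i915-GVTg_V5_8-1",
               "i915-v1-8086-591d-i915-GVTg_V5_4-2"},
    "host-c": {"i915-v2-8086-591d-i915-GVTg_V5_8-1"},
}

print(compatible_hosts("i915-v1-8086-591d-i915-GVTg_V5_8-1", accepted_by_host))
```

The per-host cost is a single hash lookup, which is the property that
makes a flat list attractive for a scheduler.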

IOW, even if there's some complex set of device type / vendor specific
rules to check for compatibility, I fear apps will ignore them and
just define a very simplified list of compatible strings, discarding
all the extra flexibility.

I'm sure OpenStack maintainers can speak to this more, as they've put
a lot of work into their scheduling engine to optimize the way it places
VMs largely driven from simple structured data reported from hosts.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: device compatibility interface for live migration with assigned devices
  2020-07-14 16:16   ` Alex Williamson
  2020-07-14 16:47     ` Daniel P. Berrangé
@ 2020-07-14 17:19     ` Dr. David Alan Gilbert
  2020-07-14 20:59       ` Alex Williamson
  1 sibling, 1 reply; 48+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-14 17:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrangé,
	Yan Zhao, devel, openstack-discuss, libvir-list, intel-gvt-dev,
	kvm, qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, eauger, jian-feng.ding, hejie.xu, kevin.tian, zhenyuw,
	bao.yumeng, xin-ran.wang, shaohe.feng

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Tue, 14 Jul 2020 11:21:29 +0100
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
> > > hi folks,
> > > we are defining a device migration compatibility interface that helps upper
> > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > live migration compatible.
> > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > e.g. we could use it to check whether
> > > - a src MDEV can migrate to a target MDEV,
> > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > - a src MDEV can migration to a target VF in SRIOV.
> > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > 
> > > The upper layer stack could use this interface as the last step to check
> > > if one device is able to migrate to another device before triggering a real
> > > live migration procedure.
> > > we are not sure if this interface is of value or help to you. please don't
> > > hesitate to drop your valuable comments.
> > > 
> > > 
> > > (1) interface definition
> > > The interface is defined in below way:
> > > 
> > >              __    userspace
> > >               /\              \
> > >              /                 \write
> > >             / read              \
> > >    ________/__________       ___\|/_____________
> > >   | migration_version |     | migration_version |-->check migration
> > >   ---------------------     ---------------------   compatibility
> > >      device A                    device B
> > > 
> > > 
> > > a device attribute named migration_version is defined under each device's
> > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > > userspace tools read the migration_version as a string from the source device,
> > > and write it to the migration_version sysfs attribute in the target device.
> > > 
> > > The userspace should treat ANY of below conditions as two devices not compatible:
> > > - any one of the two devices does not have a migration_version attribute
> > > - error when reading from migration_version attribute of one device
> > > - error when writing migration_version string of one device to
> > >   migration_version attribute of the other device
> > > 
> > > The string read from migration_version attribute is defined by device vendor
> > > driver and is completely opaque to the userspace.
> > > for a Intel vGPU, string format can be defined like
> > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > 
> > > for an NVMe VF connecting to a remote storage. it could be
> > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > 
> > > for a QAT VF, it may be
> > > "PCI ID" + "driver version" + "supported encryption set".
> > > 
> > > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
> 
> It's very strange to define it as opaque and then proceed to describe
> the contents of that opaque string.  The point is that its contents
> are defined by the vendor driver to describe the device, driver version,
> and possibly metadata about the configuration of the device.  One
> instance of a device might generate a different string from another.
> The string that a device produces is not necessarily the only string
> the vendor driver will accept, for example the driver might support
> backwards compatible migrations.

(As I've said in the previous discussion, off one of the patch series)

My view is it makes sense to have a half-way house on the opaqueness of
this string; I'd expect to have an ID and version that are human
readable, maybe a device ID/name that's human interpretable and then a
bunch of other cruft that may be device/vendor/version specific.

I'm thinking that we want to be able to report problems and include the
string, and for the user to be able to easily identify the device that
was complaining and notice a difference in versions, and perhaps also use
it in compatibility patterns to find compatible hosts; but that does
get tricky when it's an 'ask the device if it's compatible' interface.

Dave

> > > (2) backgrounds
> > > 
> > > The reason we hope the migration_version string is opaque to the userspace
> > > is that it is hard to generalize standard comparing fields and comparing
> > > methods for different devices from different vendors.
> > > Though userspace now could still do a simple string compare to check if
> > > two devices are compatible, and result should also be right, it's still
> > > too limited as it excludes the possible candidate whose migration_version
> > > string fails to be equal.
> > > e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible
> > > with another MDEV with mdev_type_3, aggregator count 1, even their
> > > migration_version strings are not equal.
> > > (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
> > > 
> > > besides that, driver version + configured resources are all elements demanding
> > > to take into account.
> > > 
> > > So, we hope leaving the freedom to vendor driver and let it make the final decision
> > > in a simple reading from source side and writing for test in the target side way.
> > > 
> > > 
> > > we then think the device compatibility issues for live migration with assigned
> > > devices can be divided into two steps:
> > > a. management tools filter out possible migration target devices.
> > >    Tags could be created according to info from product specification.
> > >    we think openstack/ovirt may have vendor proprietary components to create
> > >    those customized tags for each product from each vendor.  
> > 
> > >    for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to
> > >    search target vGPU are like:
> > >    a tag for compatible parent PCI IDs,
> > >    a tag for a range of gvt driver versions,
> > >    a tag for a range of mdev type + aggregator count
> > > 
> > >    for NVMe VF, the tags to search target VF may be like:
> > >    a tag for compatible PCI IDs,
> > >    a tag for a range of driver versions,
> > >    a tag for URL of configured remote storage.  
> 
> I interpret this as hand waving, ie. the first step is for management
> tools to make a good guess :-\  We don't seem to be willing to say that
> a given mdev type can only migrate to a device with that same type.
> There's this aggregation discussion happening separately where a base
> mdev type might be created or later configured to be equivalent to a
> different type.  The vfio migration API we've defined is also not
> limited to mdev devices, for example we could create vendor specific
> quirks or hooks to provide migration support for a physical PF/VF
> device.  Within the realm of possibility then is that we could migrate
> between a physical device and an mdev device, which are simply
> different degrees of creating a virtualization layer in front of the
> device.
>  
> > Requiring management application developers to figure out this possible
> > compatibility based on prod specs is really unrealistic. Product specs
> > are typically as clear as mud, and with the suggestion we consider
> > different rules for different types of devices, add up to a huge amount
> > of complexity. This isn't something app developers should have to spend
> > their time figuring out.
> 
> Agreed.
> 
> > The suggestion that we make use of vendor proprietary helper components
> > is totally unacceptable. We need to be able to build a solution that
> > works with exclusively an open source software stack.
> 
> I'm surprised to see this as well, but I'm not sure if Yan was really
> suggesting proprietary software so much as just vendor specific
> knowledge.
> 
> > IMHO there needs to be a mechanism for the kernel to report via sysfs
> > what versions are supported on a given device. This puts the job of
> > reporting compatible versions directly under the responsibility of the
> > vendor who writes the kernel driver for it. They are the ones with the
> > best knowledge of the hardware they've built and the rules around its
> > compatibility.
> 
> The version string discussed previously is the version string that
> represents a given device, possibly including driver information,
> configuration, etc.  I think what you're asking for here is an
> enumeration of every possible version string that a given device could
> accept as an incoming migration stream.  If we consider the string as
> opaque, that means the vendor driver needs to generate a separate
> string for every possible version it could accept, for every possible
> configuration option.  That potentially becomes an excessive amount of
> data to either generate or manage.
> 
> Am I overestimating how vendors intend to use the version string?
> 
> We'd also need to consider devices that we could create, for instance
> providing the same interface enumeration prior to creating an mdev
> device to have a confidence level that the new device would be a valid
> target.
> 
> We defined the string as opaque to allow vendor flexibility and because
> defining a common format is hard.  Do we need to revisit this part of
> the discussion to define the version string as non-opaque with parsing
> rules, probably with separate incoming vs outgoing interfaces?  Thanks,
> 
> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: device compatibility interface for live migration with assigned devices
  2020-07-14 16:47     ` Daniel P. Berrangé
@ 2020-07-14 20:47       ` Alex Williamson
  2020-07-15  9:16         ` Daniel P. Berrangé
  0 siblings, 1 reply; 48+ messages in thread
From: Alex Williamson @ 2020-07-14 20:47 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Yan Zhao, devel, openstack-discuss, libvir-list, intel-gvt-dev,
	kvm, qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, dgilbert, eauger, jian-feng.ding, hejie.xu,
	kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang, shaohe.feng

On Tue, 14 Jul 2020 17:47:22 +0100
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Tue, Jul 14, 2020 at 10:16:16AM -0600, Alex Williamson wrote:
> > On Tue, 14 Jul 2020 11:21:29 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> >   
> > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > 
> > > > The string read from migration_version attribute is defined by device vendor
> > > > driver and is completely opaque to the userspace.
> > > > for a Intel vGPU, string format can be defined like
> > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > > 
> > > > for an NVMe VF connecting to a remote storage. it could be
> > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > 
> > > > for a QAT VF, it may be
> > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > 
> > > > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)  
> > 
> > It's very strange to define it as opaque and then proceed to describe
> > the contents of that opaque string.  The point is that its contents
> > are defined by the vendor driver to describe the device, driver version,
> > and possibly metadata about the configuration of the device.  One
> > instance of a device might generate a different string from another.
> > The string that a device produces is not necessarily the only string
> > the vendor driver will accept, for example the driver might support
> > backwards compatible migrations.  
> 
> 
> > > IMHO there needs to be a mechanism for the kernel to report via sysfs
> > > what versions are supported on a given device. This puts the job of
> > > reporting compatible versions directly under the responsibility of the
> > > vendor who writes the kernel driver for it. They are the ones with the
> > > best knowledge of the hardware they've built and the rules around its
> > > compatibility.  
> > 
> > The version string discussed previously is the version string that
> > represents a given device, possibly including driver information,
> > configuration, etc.  I think what you're asking for here is an
> > enumeration of every possible version string that a given device could
> > accept as an incoming migration stream.  If we consider the string as
> > opaque, that means the vendor driver needs to generate a separate
> > string for every possible version it could accept, for every possible
> > configuration option.  That potentially becomes an excessive amount of
> > data to either generate or manage.
> > 
> > Am I overestimating how vendors intend to use the version string?  
> 
> If I'm interpreting your reply & the quoted text correctly, the version
> string isn't really a version string in any normal sense of the word
> "version".
> 
> Instead it sounds like a string encoding a set of features in some arbitrary
> vendor specific format, which they parse to do compatibility checks on
> individual pieces?  One or more parts may contain a version number, but
> it's much more than just a version.
> 
> If that's correct, then I'd prefer we didn't call it a version string,
> instead call it a "capability string" to make it clear it is expressing
> a much more general concept, but...

I'd agree with that.  The intent of the previous proposal was to
provide an interface for reading a string and writing a string back in,
where the result of that write indicated migration compatibility with
the device.  So yes, "version" is not the right term.
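
For reference, a minimal sketch of how userspace might have driven that
read-then-write probe.  The attribute paths are supplied by the caller
and are assumptions here (the migration_version attribute is only a
proposal), and the helper name is hypothetical:

```python
import os


def probe_compatible(src_attr, dst_attr):
    """Read the string from the source attribute and write it to the target.

    Any failure (missing attribute, read error, write error) is treated as
    "not compatible", following the rules in the original proposal.
    """
    try:
        with open(src_attr, "rb") as src:
            value = src.read()
        # Open without O_CREAT so a missing target attribute is an error
        # rather than silently creating a regular file.
        fd = os.open(dst_attr, os.O_WRONLY)
        try:
            os.write(fd, value)
        finally:
            os.close(fd)
    except OSError:
        return False
    return True
```

Note that every check costs a round trip to the target device, which is
the overhead discussed further down.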
 
> > We'd also need to consider devices that we could create, for instance
> > providing the same interface enumeration prior to creating an mdev
> > device to have a confidence level that the new device would be a valid
> > target.
> > 
> > We defined the string as opaque to allow vendor flexibility and because
> > defining a common format is hard.  Do we need to revisit this part of
> > the discussion to define the version string as non-opaque with parsing
> > rules, probably with separate incoming vs outgoing interfaces?  Thanks,  
> 
> ..even if the huge amount of flexibility is technically relevant from the
> POV of the hardware/drivers, we should consider whether management apps
> actually want, or can use, that level of flexibility.
> 
> The task of picking which host to place a VM on has a lot of factors to
> consider, and when there are a large number of hosts, the total amount
> of information to check gets correspondingly large.  The placement
> process is also fairly performance critical.
> 
> Running complex algorithmic logic to check compatibility of devices
> based on an arbitrary set of rules is likely to be a performance
> challenge. A flat list of supported strings is a much simpler
> thing to check as it reduces down to a simple set membership test.
> 
> IOW, even if there's some complex set of device type / vendor specific
> rules to check for compatibility, I fear apps will ignore them and
> just define a very simplified list of compatible strings, discarding
> all the extra flexibility.

There's always the "try it and see if it works" interface, which is
essentially what we have currently.  With even a simple version of what
we're trying to accomplish here, there's still a risk that a management
engine might rather just ignore it and restrict themselves to 1:1 mdev
type matches, with or without knowing anything about the vendor driver
version, relying on the migration to fail quickly if the devices are
incompatible.  If the complexity of the interface makes it too
complicated or time consuming to provide sufficient value above such an
algorithm, there's not much point to implementing it, which is why Yan
has included so many people in this discussion.

> I'm sure OpenStack maintainers can speak to this more, as they've put
> a lot of work into their scheduling engine to optimize the way it places
> VMs largely driven from simple structured data reported from hosts.

I think we've established that our intended approach is not worthwhile:
testing a compatibility string at a device is too much overhead.  We
need to provide enough information to the management engine to predict
the response without interaction beyond the initial capability probing.

As you've identified above, we're really dealing with more than a
simple version, we need to construct a compatibility string and we need
to start defining what goes into that.

The first item seems to be that we're defining compatibility relative
to a vfio migration stream.  vfio devices have a device API, such as
vfio-pci, so the first attribute might simply define the device API.
Once we have a class of devices we might then be able to use bus
specific attributes, for example the PCI vendor and device ID (other
bus types TBD).

We probably also need driver version numbers, so we need to include
both the driver name as well as version major and minor numbers.  Rules
need to be put in place around what we consider to be viable version
matches, potentially as Sean described.  For example, does the major
version require a match?  Do we restrict to only forward, ie.
increasing, minor number matches within that major version?
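
To make those two questions concrete, here is a minimal sketch that
assumes the answer to both is "yes".  That is only one possible policy,
and the function name is hypothetical:

```python
def version_compatible(src, dst):
    """src and dst are (major, minor) tuples for source and target drivers."""
    src_major, src_minor = src
    dst_major, dst_minor = dst
    # Assumed rule 1: major versions must match exactly.
    if src_major != dst_major:
        return False
    # Assumed rule 2: forward only, the target minor must not be older
    # than the source minor.
    return dst_minor >= src_minor
```

With this policy 0.1 -> 0.2 is accepted, while 0.2 -> 0.1 and 0.x -> 1.x
are both rejected.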

Do we then also have a section that includes any device attributes
required to result in a compatible device?  This would be largely
focused on mdev, but I wouldn't rule out others.  For example if an
aggregation parameter is required to maintain compatibility, we'd want
to specify that as a required attribute.

So maybe we end up with something like:

{
  "device_api": "vfio-pci",
  "vendor": "vendor-driver-name",
  "version": {
    "major": 0,
    "minor": 1
  },
  "vfio-pci": { // Based on above device_api
    "vendor": 0x1234, // Values for the exposed device
    "device": 0x5678,
      // Possibly further parameters for a more specific match
  },
  "mdev_attrs": [
    { "attribute0": "VALUE" }
  ]
}
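
As a rough sketch of how a management tool might consume an array of
such entries, here is one possible matcher.  It assumes a strict-JSON
rendering of the structure above (no comments, hex IDs as strings) and a
placeholder version rule; the helper names and the exact layout are
hypothetical:

```python
import json

# Strict-JSON rendering of the structure above: comments are dropped and
# the hex IDs become strings, since JSON has neither.  Layout is assumed.
TARGET_SUPPORTED = json.loads("""
[
  {
    "device_api": "vfio-pci",
    "vendor": "vendor-driver-name",
    "version": {"major": 0, "minor": 1},
    "vfio-pci": {"vendor": "0x1234", "device": "0x5678"},
    "mdev_attrs": [{"attribute0": "VALUE"}]
  }
]
""")


def matches(source, entry):
    """True if a source descriptor fits one entry the target supports."""
    api = source["device_api"]
    return (
        api == entry["device_api"]
        and source["vendor"] == entry["vendor"]
        and source.get(api) == entry.get(api)
        # Placeholder version rule: same major, target minor not older.
        and source["version"]["major"] == entry["version"]["major"]
        and source["version"]["minor"] <= entry["version"]["minor"]
    )


def find_match(source, supported):
    """Return the first supported entry matching source, or None."""
    return next((e for e in supported if matches(source, e)), None)
```

A matching entry's mdev_attrs could then tell the tool which attributes
to set when creating the target device.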

The sysfs interface would return an array containing one or more of
these for each device supported.  I'm trying to account for things like
aggregation via the mdev_attrs section, but I haven't really put it all
together yet.  I think Intel folks want to be able to say mdev type
foo-3 is compatible with mdev type foo-1 so long as foo-1 is created
with an aggregation attribute value of 3, but I expect both foo-1 and
foo-3 would have the same user visible PCI vendor:device IDs.  If we
use mdev type rather than the resulting device IDs, then we introduce
a barrier to phys<->mdev migration.  We could specify the subsystem
values though, for example foo-1 might correspond to subsystem IDs
8086:0001 and foo-3 8086:0003, then we can specify that creating a
foo-1 from this device doesn't require any attributes, but creating a
foo-3 does.  I'm nervous how that scales though.

NB. I'm also considering how portions of this might be compatible with
mdevctl such that we could direct mdevctl to create a compatible device
using information from this compatibility interface.

Thanks,
Alex



* Re: device compatibility interface for live migration with assigned devices
  2020-07-14 17:19     ` Dr. David Alan Gilbert
@ 2020-07-14 20:59       ` Alex Williamson
  2020-07-15  8:20         ` Yan Zhao
                           ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Alex Williamson @ 2020-07-14 20:59 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrangé,
	Yan Zhao, devel, openstack-discuss, libvir-list, intel-gvt-dev,
	kvm, qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, eauger, jian-feng.ding, hejie.xu, kevin.tian, zhenyuw,
	bao.yumeng, xin-ran.wang, shaohe.feng

On Tue, 14 Jul 2020 18:19:46 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > On Tue, 14 Jul 2020 11:21:29 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> >   
> > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > hi folks,
> > > > we are defining a device migration compatibility interface that helps upper
> > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > live migration compatible.
> > > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > > e.g. we could use it to check whether
> > > > - a src MDEV can migrate to a target MDEV,
> > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > - a src MDEV can migration to a target VF in SRIOV.
> > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > 
> > > > The upper layer stack could use this interface as the last step to check
> > > > if one device is able to migrate to another device before triggering a real
> > > > live migration procedure.
> > > > we are not sure if this interface is of value or help to you. please don't
> > > > hesitate to drop your valuable comments.
> > > > 
> > > > 
> > > > (1) interface definition
> > > > The interface is defined in below way:
> > > > 
> > > >              __    userspace
> > > >               /\              \
> > > >              /                 \write
> > > >             / read              \
> > > >    ________/__________       ___\|/_____________
> > > >   | migration_version |     | migration_version |-->check migration
> > > >   ---------------------     ---------------------   compatibility
> > > >      device A                    device B
> > > > 
> > > > 
> > > > a device attribute named migration_version is defined under each device's
> > > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > > > userspace tools read the migration_version as a string from the source device,
> > > > and write it to the migration_version sysfs attribute in the target device.
> > > > 
> > > > The userspace should treat ANY of below conditions as two devices not compatible:
> > > > - any one of the two devices does not have a migration_version attribute
> > > > - error when reading from migration_version attribute of one device
> > > > - error when writing migration_version string of one device to
> > > >   migration_version attribute of the other device
> > > > 
> > > > The string read from migration_version attribute is defined by device vendor
> > > > driver and is completely opaque to the userspace.
> > > > for a Intel vGPU, string format can be defined like
> > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > > 
> > > > for an NVMe VF connecting to a remote storage. it could be
> > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > 
> > > > for a QAT VF, it may be
> > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > 
> > > > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)  
> > 
> > It's very strange to define it as opaque and then proceed to describe
> > the contents of that opaque string.  The point is that its contents
> > are defined by the vendor driver to describe the device, driver version,
> > and possibly metadata about the configuration of the device.  One
> > instance of a device might generate a different string from another.
> > The string that a device produces is not necessarily the only string
> > the vendor driver will accept, for example the driver might support
> > backwards compatible migrations.  
> 
> (As I've said in the previous discussion, off one of the patch series)
> 
> My view is it makes sense to have a half-way house on the opaqueness of
> this string; I'd expect to have an ID and version that are human
> readable, maybe a device ID/name that's human interpretable and then a
> bunch of other cruft that maybe device/vendor/version specific.
> 
> I'm thinking that we want to be able to report problems and include the
> string and the user to be able to easily identify the device that was
> complaining and notice a difference in versions, and perhaps also use
> it in compatibility patterns to find compatible hosts; but that does
> get tricky when it's a 'ask the device if it's compatible'.

In the reply I just sent to Dan, I gave this example of what a
"compatibility string" might look like represented as json:

{
  "device_api": "vfio-pci",
  "vendor": "vendor-driver-name",
  "version": {
    "major": 0,
    "minor": 1
  },
  "vfio-pci": { // Based on above device_api
    "vendor": 0x1234, // Values for the exposed device
    "device": 0x5678,
      // Possibly further parameters for a more specific match
  },
  "mdev_attrs": [
    { "attribute0": "VALUE" }
  ]
}

Are you thinking that we might allow the vendor to include a vendor
specific array where we'd simply require that both sides have matching
fields and values?  ie.

  "vendor_fields": [
    { "unknown_field0": "unknown_value0" },
    { "unknown_field1": "unknown_value1" },
  ]

We could certainly make that part of the spec, but I can't really
figure the value of it other than to severely restrict compatibility,
which the vendor could already do via the version.major value.  Maybe
they'd want to put a build timestamp, random uuid, or source sha1 into
such a field to make absolutely certain compatibility is only determined
between identical builds?  Thanks,

Alex
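
A minimal sketch, in Python, of the "both sides have matching fields and
values" rule discussed above. The function name and the exact matching
policy (equal device_api, vendor, version.major, device_api-specific block
and vendor_fields; version.minor ignored) are assumptions made for
illustration, not part of any agreed spec:

```python
def descriptors_match(src: dict, dst: dict) -> bool:
    """Compare two already-parsed compatibility descriptors (hypothetical policy)."""
    # Top-level identity fields must be present and identical on both sides.
    for key in ("device_api", "vendor"):
        if src.get(key) is None or src.get(key) != dst.get(key):
            return False

    # Assumption: only version.major restricts compatibility here.
    src_major = src.get("version", {}).get("major")
    if src_major is None or src_major != dst.get("version", {}).get("major"):
        return False

    # The block named after device_api (e.g. "vfio-pci") must match exactly.
    api = src["device_api"]
    if src.get(api) != dst.get(api):
        return False

    # Opaque vendor fields: require identical fields and values on both sides.
    return src.get("vendor_fields", []) == dst.get("vendor_fields", [])


source = {
    "device_api": "vfio-pci",
    "vendor": "vendor-driver-name",
    "version": {"major": 0, "minor": 1},
    "vfio-pci": {"vendor": 0x1234, "device": 0x5678},
    "vendor_fields": [{"unknown_field0": "unknown_value0"}],
}
same_major = dict(source, version={"major": 0, "minor": 2})
other_build = dict(source, vendor_fields=[{"unknown_field0": "other"}])
```

Under this policy `same_major` matches `source`, while `other_build` does
not, which shows how strict a vendor_fields comparison would be.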



* Re: device compatibility interface for live migration with assigned devices
       [not found]       ` <eb705c72cdc8b6b8959b6ebaeeac6069a718d524.camel@redhat.com>
@ 2020-07-14 21:15         ` Sean Mooney
  0 siblings, 0 replies; 48+ messages in thread
From: Sean Mooney @ 2020-07-14 21:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel Berrange, Yan Zhao, devel, openstack-discuss, libvir-list,
	intel-gvt-dev, kvm, qemu-devel, smooney, eskultet, cohuck,
	dinechin, corbet, kwankhede, dgilbert, eauger, jian-feng.ding,
	hejie.xu, kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang,
	Shaohe Feng

resending with the full cc list since i had this typed up.
i would blame my email provider, but my email client does not seem to like long cc lists.
we probably want to continue on alex's thread so as not to split the discussion,
but i have responded inline with some examples of how openstack schedules and what i meant by different mdev_types.


On Tue, 2020-07-14 at 20:29 +0100, Sean Mooney wrote:
> On Tue, 2020-07-14 at 11:01 -0600, Alex Williamson wrote:
> > On Tue, 14 Jul 2020 13:33:24 +0100
> > Sean Mooney <smooney@redhat.com> wrote:
> > 
> > > On Tue, 2020-07-14 at 11:21 +0100, Daniel P. Berrangé wrote:
> > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > > hi folks,
> > > > > we are defining a device migration compatibility interface that helps upper
> > > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > > live migration compatible.
> > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > > > e.g. we could use it to check whether
> > > > > - a src MDEV can migrate to a target MDEV,  
> > > 
> > > mdev live migration is completely possible to do, but i agree with Dan Berrangé's comments.
> > > from the point of view of openstack integration i don't see calling out to a vendor-specific
> > > tool to be an acceptable
> > 
> > As I replied to Dan, I'm hoping Yan was referring more to vendor
> > specific knowledge rather than actual tools.
> > 
> > > solution for device compatibility checking. the sys filesystem
> > > that describes the mdevs that can be created should also
> > > contain the relevant information, such
> > > that nova could integrate it via the libvirt xml representation or directly retrieve the
> > > info from sysfs.
> > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,  
> > > 
> > > so vf to vf migration is not possible in the general case, as there is no standardised
> > > way to transfer the device state as part of the sriov specs produced by the pci-sig.
> > > as such there is no vendor-neutral way to support sriov live migration.
> > 
> > We're not talking about a general case, we're talking about physical
> > devices which have vfio wrappers or hooks with device specific
> > knowledge in order to support the vfio migration interface.  The point
> > is that a discussion around vfio device migration cannot be limited to
> > mdev devices.
> 
> ok, upstream in openstack at least, we do not plan to support generic live migration
> for passthrough devices. we cheat with network interfaces: since operating systems in general
> handle hotplug of a nic somewhat safely, where no abstraction layer like
> an mdev or a macvtap device is present we hot unplug the nic before the migration
> and attach a new one after. for gpus or crypto cards this likely would not be viable,
> since you can't bond generic hardware devices to hide the removal and re-addition of a generic
> pci device. we were hoping that there would be a convergence around MDEVs as a way to provide
> that abstraction going forward for generic devices, or some other new mechanism in the future.
> > 
> > > > > - a src MDEV can migration to a target VF in SRIOV.  
> > > 
> > > that also makes this unviable
> > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > 
> > > > > The upper layer stack could use this interface as the last step to check
> > > > > if one device is able to migrate to another device before triggering a real
> > > > > live migration procedure.  
> > > 
> > > well, actually that is already too late really. ideally we would want to do this compatibility
> > > check much sooner to avoid the migration failing. in an openstack environment at least,
> > > by the time we invoke libvirt (assuming you're using the libvirt driver) to do the migration we have already
> > > finished scheduling the instance to the new host. if we do the compatibility check at this point
> > > and it fails, then the live migration is aborted and will not be retried. these types of late checks lead to a
> > > poor user experience: unless you check the migration details, it basically looks like the migration was ignored,
> > > as the instance starts to migrate and then continues running on the original host.
> > > 
> > > when using generic pci passthrough with openstack, the pci alias is intended to reference a single
> > > vendor id/product id, so you will have 1+ aliases for each type of device. that allows openstack to
> > > schedule based on the availability of a compatible device, because we track inventories of pci devices
> > > and can query them when selecting a host.
> > > 
> > > if we were to support mdev live migration in the future we would want to take the same declarative approach:
> > > 1 introspect the capabilities of the devices we manage
> > > 2 create inventories of the allocatable devices and their capabilities
> > > 3 schedule the instance to a host based on the device-type/capabilities and claim it atomically to prevent races
> > > 4 have the lower level hypervisor do additional validation, if needed, pre live migration.
> > > 
> > > this proposal seems to be targeting extending step 4, whereas ideally we should focus on providing the
> > > info that would be relevant in step 1, preferably in a vendor-neutral way via a kernel interface like /sys.
> > 
> > I think this is reading a whole lot into the phrase "last step".  We
> > want to make the information available for a management engine to
> > consume as needed to make informed decisions regarding likely
> > compatible target devices.
> 
> well, openstack as a management engine has 3 stages for scheduling and assignment.
> in response to a live migration request the api does minimal validation, then hands the task off to the
> conductor service to orchestrate. the conductor invokes an rpc to the scheduler service, which makes a rest
> call to the placement service. the placement service generates a set of allocation candidates for hosts based
> on quantitative and qualitative queries against an abstract resource provider tree model of the hosts.
> currently device passthrough is not modeled in placement, so placement is basically returning a set of hosts
> that have enough cpu, ram and disk for the instance. in the special case of vGPUs, they technically are
> modelled in placement, but not in a way that would guarantee compatibility for migration. a generic pci
> device request is handled in the second phase of scheduling, called filtering and weighing. in this phase the
> nova scheduler applies a series of filters to the list of hosts returned by placement to assert things like
> anti-affinity, tenant isolation or, in the case of this conversation, numa affinity and pci device
> availability. when we have filtered the possible set of hosts down to X, we weigh the list to select an
> optimal host and a set of alternative hosts. we then enter the code that this mail suggests modifying, which
> does an rpc call to the destination host from the conductor to have it assert compatibility, which
> internally calls back to the source host.
> 
> so my point is we have done a lot of work by the time we call check_can_live_migrate_destination, and failing
> at this point is considered quite a late failure, but it's still better than failing when qemu actually tries to migrate.
> in general we would prefer to move the compatibility check as early in that workflow as possible, but to be fair we don't
> actually check cpu model compatibility until check_can_live_migrate_destination.
> 
> https://github.com/openstack/nova/blob/8988316b8c132c9662dea6cf0345975e87ce7344/nova/virt/libvirt/driver.py#L8325-L8331
> 
> if we needed to, we could read the version string on the source and write the version string on the dest
> at this point. doing so, however, would be considered inelegant; we have found this does not scale as the
> first compatibility check. for cpus, for example, there are ways to filter hosts by grouping sets of hosts
> with the same cpu, or by filtering on cpu feature flags; these happen in the placement or filter stage,
> both of which are very early and cheap to do at runtime.
> 
> the "read for version, write for compatibility" workflow could be used as a final safety check if required, but
> probing for compatibility via writes is basically considered an anti-pattern in openstack. we try to always
> assert compatibility by reading available info and asserting requirements over it, not by testing to see if it works.
> 
> this has come up in the past in the context of virtio feature flags, where the idea of spawning an instance or trying
> to add a virtio port to ovs dpdk that requested a specific feature flag was rejected as unacceptable from a performance
> and security point of view.
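
For reference, a minimal sketch of the "read for version, write for
compatibility" probe as described in the proposal. migration_version is the
proposed attribute name; the function name and the idea of passing each
device's sysfs directory as a path are assumptions made for illustration:

```python
from pathlib import Path


def devices_compatible(src_dev: Path, dst_dev: Path) -> bool:
    """Probe compatibility by writing the source's version string to the target.

    src_dev and dst_dev are the sysfs directories of the two devices
    (hypothetical arguments). Any missing attribute, read error or write
    error is treated as "not compatible", as the proposal requires.
    """
    src_attr = src_dev / "migration_version"
    dst_attr = dst_dev / "migration_version"

    # Either device lacking the attribute means not compatible.
    if not src_attr.is_file() or not dst_attr.is_file():
        return False

    try:
        version = src_attr.read_text()
        # The target's vendor driver is expected to reject the write
        # with an error if it cannot accept this version string.
        dst_attr.write_text(version)
    except OSError:
        return False
    return True
```

Note that this only works on a host that can reach both devices, which is
exactly the scaling concern raised above for a scheduler.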
> 
> >  
> > > > > we are not sure if this interface is of value or help to you. please don't
> > > > > hesitate to drop your valuable comments.
> > > > > 
> > > > > 
> > > > > (1) interface definition
> > > > > The interface is defined in below way:
> > > > > 
> > > > >              __    userspace
> > > > >               /\              \
> > > > >              /                 \write
> > > > >             / read              \
> > > > >    ________/__________       ___\|/_____________
> > > > >   | migration_version |     | migration_version |-->check migration
> > > > >   ---------------------     ---------------------   compatibility
> > > > >      device A                    device B
> > > > > 
> > > > > 
> > > > > a device attribute named migration_version is defined under each device's
> > > > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).  
> > > 
> > > this might be useful, as we could tag the inventory with the migration version and only migrate to
> > > devices with the same version
> > 
> > Is cross version compatibility something that you'd consider using?
> 
> yes, but it would depend on what cross version actually meant.
> 
> the version of an mdev is not something we would want to be exposed to end users.
> it would be a security risk to do so, as the version string would potentially allow an untrusted user
> to discover if a device has an unpatched vulnerability. as a result, in the context of live migration
> we can only support cross version compatibility if the device in the guest does not alter as
> part of the migration and the behavior does not change.
> 
> going from version 1.0 with feature X to version 1.1 with features X and Y but only X enabled would
> be fine. going from 1.0 to 2.0 where there is only feature Y would not be ok.
> being abstract makes it a little harder to reason about, but i guess i would summarise it as: if it's
> transparent to the guest for the lifetime of the qemu process then it's ok for the backing version to change.
> if a vm is rebooted it's also ok for the vm to pick up feature Y from the 1.1 device, although at that point
> it could not be migrated back to the 1.0 host, as it now has features X and Y and 1.0 only has X, so that would be
> an observable change if Y was dropped as a result of the live migration.
> > 
> > > > > userspace tools read the migration_version as a string from the source device,
> > > > > and write it to the migration_version sysfs attribute in the target device.  
> > > 
> > > this would not be useful, as the scheduler cannot directly connect to the compute host,
> > > and even if it could it would be extremely slow to do this for 1000s of hosts and potentially
> > > multiple devices per host.
> > 
> > Seems similar to Dan's requirement, looks like the 'read for version,
> > write for compatibility' test idea isn't really viable.
> 
> it's inefficient and we have rejected adding such tests in the case of virtio feature flag compatibility
> in the past, so it's more an option of last resort if we have no other way to support compatibility
> checking.
> > 
> > > > > 
> > > > > The userspace should treat ANY of below conditions as two devices not compatible:
> > > > > - any one of the two devices does not have a migration_version attribute
> > > > > - error when reading from migration_version attribute of one device
> > > > > - error when writing migration_version string of one device to
> > > > >   migration_version attribute of the other device
> > > > > 
> > > > > The string read from migration_version attribute is defined by device vendor
> > > > > driver and is completely opaque to the userspace.  
> > > 
> > > opaque vendor-specific strings that higher level orchestrators have to pass from host
> > > to host and can't reason about are evil. when allowed they proliferate and
> > > make any idea of a vendor-neutral abstraction and interoperability between systems
> > > impossible to reason about. that said, there is a way to make it opaque but still useful
> > > to userspace. see below
> > > > > for a Intel vGPU, string format can be defined like
> > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > > > 
> > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > 
> > > > > for a QAT VF, it may be
> > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > 
> > > > > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > > > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)  
> > > 
> > > honestly i would much prefer if the version string was just a semver string.
> > > e.g. {major}.{minor}.{bugfix} 
> > > 
> > > if you do a driver/firmware update and break compatibility with an older version, bump the
> > > major version.
> > > 
> > > if you add an optional feature that does not break backwards compatibility if you migrate
> > > an older instance to the new host, then just bump the minor/feature number.
> > > 
> > > if you have a fix for a bug that does not change the feature set or compatibility backwards or
> > > forwards, then bump the bugfix number
> > > 
> > > then the check is as simple as
> > > 1.) is the mdev type the same
> > > 2.) is the major version the same
> > > 3.) am i going from the same version to the same version, or from the same version to a newer version
> > > 
> > > if all 3 are true we can migrate.
> > > e.g.
> > > 2.0.1 -> 2.1.1 (ok: same major version and migrating from older feature release to newer feature release)
> > > 2.1.1 -> 2.0.1 (not ok: same major version, but migrating from new feature release to old feature release
> > >                 may be incompatible)
> > > 2.0.0 -> 3.0.0 (not ok: changing major version)
> > > 2.0.1 -> 2.0.0 (ok: same major and minor version, all bugfixes in the same minor release should be compatible)
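
The three-part check above can be sketched in a few lines of Python. The
function and type names are hypothetical, and the version strings are
assumed to be plain {major}.{minor}.{bugfix} integers:

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    bugfix: int


def parse_version(text: str) -> Version:
    """Parse a '{major}.{minor}.{bugfix}' string into its integer parts."""
    major, minor, bugfix = (int(part) for part in text.strip().split("."))
    return Version(major, minor, bugfix)


def can_migrate(src_type: str, src_version: str,
                dst_type: str, dst_version: str) -> bool:
    """Apply the three rules: same mdev type, same major, minor not older."""
    src = parse_version(src_version)
    dst = parse_version(dst_version)
    return (
        src_type == dst_type           # 1.) same mdev type
        and src.major == dst.major     # 2.) same major version
        and dst.minor >= src.minor     # 3.) same or newer feature release
    )                                  # the bugfix number is deliberately ignored
```

With the examples from the list, `can_migrate("t", "2.0.1", "t", "2.1.1")`
and `can_migrate("t", "2.0.1", "t", "2.0.0")` are allowed, while
"2.1.1" -> "2.0.1" and "2.0.0" -> "3.0.0" are refused.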
> > 
> > What's the value of the bugfix field in this scheme?
> 
> it's not required, but really it's for a non-visible change from a feature standpoint.
> a rather contrived example, but if it was quadratic to initialise a set of queues or device buffers
> in 1.0.0 and you made it linear in 1.0.1, that is a performance improvement in the device initialisation time,
> which is great, but it would not affect the feature set or compatibility in any way. you could call it
> a feature, but it's really just an internal change, though you might still want to bump the version number.
> > 
> > The simplicity is good, but is it too simple.  It's not immediately
> > clear to me whether all features can be hidden behind a minor version.
> > For instance, if we have an mdev device that supports this notion of
> > aggregation, which is proposed as a solution to the problem that
> > physical hardware might support lots and lots of assignable interfaces
> > which can be combined into arbitrary sets for mdev devices, making it
> > impractical to expose an mdev type for every possible enumeration of
> > assignable interfaces within a device.
> 
> so this is a modeling problem and likely a limitation of the current way an mdev_type is exposed.
> stealing some linux doc examples
> 
> 
>   |- [parent physical device]
>   |--- Vendor-specific-attributes [optional]
>   |--- [mdev_supported_types]
>   |     |--- [<type-id>]
>   |     |   |--- create
>   |     |   |--- name
>   |     |   |--- available_instances
>   |     |   |--- device_api
>   |     |   |--- description
> 
> you could address this in 1 of at least 3 ways.
> 1.) an mdev type for each enumeration, which is fine for 1-2 variables; otherwise it's a combinatorial explosion.
> 2.) report each of the consumable sub-components as an mdev type, create multiple mdevs and assign them to the vm.
> 3.) provide an api to dynamically compose mdev types which statically partition the resources and can then be consumed,
> preferably embedding the resource information in the description field in a human/machine readable form.
> 
> 2 and 3 would work well with openstack, however they both have their challenges.
> 1 doesn't really work for anyone outside of a demo.
> >   We therefore expose a base type
> > where the aggregation is built later.  This essentially puts us in a
> > scenario where even within an mdev type running on the same driver,
> > there are devices that are not directly compatible with each other.
> >  
> > > we don't need vendors to re-encode the driver name or vendor id and product id in the string. that info is
> > > already available both to the device driver and to userspace via /sys. we just need to know if versions of
> > > the same mdev are compatible, so a simple semver version string, which is well known in the software world
> > > at least, is a clean abstraction we can reuse.
> > 
> > This presumes there's no cross device migration.
> 
> no, but it does assume no cross mdev_type migration.
> it assumes that nvida_mdev_type_x on host 1 is the same as nvida_mdev_type_x on host 2.
> if the parent devices differ but support the same mdev type, we are asserting that they
> should be compatible, or a different mdev_type name should be used on each device.
> 
> so we are presuming the mdev type can't change as part of a live migration, and if the type
> was to change it would no longer be a live migration operation, it would be something else.
> that is based on the premise that changing the mdev type would change the capabilities of the mdev
> 
> >   An mdev type can only
> > be migrated to the same mdev type, all of the devices within that type
> > have some base compatibility, a physical device can only be migrated to
> > the same physical device.  In the latter case what defines the type?
> 
> the type-id in /sysfs
> 
>     /sys/devices/virtual/mtty/mtty/
>         |-- mdev_supported_types
>         |   |-- mtty-1 <---- this is an mdev type
>         |   |   |-- available_instances
>         |   |   |-- create
>         |   |   |-- device_api
>         |   |   |-- devices
>         |   |   `-- name
>         |   `-- mtty-2 <---- as is this
>         |       |-- available_instances
>         |       |-- create
>         |       |-- device_api
>         |       |-- devices
>         |       `-- name
> 
>   |- [parent phy device]
>   |--- [$MDEV_UUID]
>          |--- remove
>          |--- mdev_type {link to its type} <-- here
>          |--- vendor-specific-attributes [optional]
> 
> >   If
> > it's a PCI device, is it only vendor:device IDs?
> 
> no, the mdev type is not defined by the vendor:device id of the parent device,
> although the capabilities of that device will determine what mdev types, if any, it supports.
> >   What about revision?
> > What about subsystem IDs?
> 
> at least for nvidia gpus, i don't think the capability would change if you buy an evga branded v100 vs a pny
> branded one, but i do know that the capabilities of a dell branded intel nic and an intel branded
> one certainly can. e.g. i have seen oem sku nics without sriov even though the same nic from intel supports it.
> sriov was deliberately disabled in the dell firmware even though it shared the same vendor and product id but a different
> subsystem id.
> 
> if the odm made an incompatible change like that which affects an mdev type in some way, i guess i would expect them to
> change the name or the description field content to signal that.
> 
> >   What about possibly an onboard ROM or
> > internal firmware?
> 
> i would expect that updating the firmware/rom could result in changing a version string. that is how i was imagining
> it would change. 
> >   The information may be available, but which things
> > are relevant to migration?
> 
> that i don't know, and i really would not like to encode that knowledge in a vendor-specific way in higher level
> tools like openstack or even libvirt. declarative version string comparisons or even simple feature flag
> checks, where an abstract heuristic can be applied across vendors, would be fine. but yes, i don't know
> what info would be needed in this case.
> >   We already see desires to allow migration
> > between physical and mdev,
> 
> migration between a physical device and an mdev would not generally be considered a live migration in openstack.
> that would be a different operation, as it would be user visible within the guest vm.
> >  but also to expose mdev types that might be
> > composable to be compatible with other types.  Thanks,
> 
> i think composable mdev types are really challenging without some kind of feature flag concept,
> like cpu flags or ethtool nic capabilities, that are both human readable and easily parsable.
> 
> we have the capability to schedule on cpu flags or gpu cuda level using a traits abstraction,
> so instead of saying i want a vm on a host with an intel 2695v3 to ensure it has AVX,
> you say i want a vm that is capable of using AVX
> https://github.com/openstack/os-traits/blob/master/os_traits/hw/cpu/x86/__init__.py#L18
> 
> we also have traits for cuda level, so instead of asking for a specific mdev type or nvidia
> gpu, the idea was you would describe what feature (cuda in this example) you need
> https://github.com/openstack/os-traits/blob/master/os_traits/hw/gpu/cuda.py#L16-L45
> 
> that is what we call qualitative scheduling and is why we created the placement service.
> without going into the weeds, we try to decouple quantitative requests such as 4 cpus and 1G of ram
> from the qualitative "i need AVX support"
> 
> e.g. resources:VCPU=4,resources:MEMORY_MB=1024 traits:required=HW_CPU_X86_AVX
> 
> declarative quantitative and capability reporting of resources fits easily into that model.
> dynamic quantities that change as other mdevs are allocated from the parent device, or as
> new mdev types are composed on the fly, are very challenging.
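
A toy illustration, in Python, of the split between quantitative and
qualitative requests described above. This is not the placement API: the
host records, the function name and the in-memory data layout are all
hypothetical, and only the resource class and trait names are taken from
the example request:

```python
def host_satisfies(host: dict, resources: dict, required_traits: set) -> bool:
    """Return True if the host has enough of every resource and all traits."""
    inventory = host["inventory"]
    # Quantitative part: every requested amount must fit in the inventory.
    has_capacity = all(
        inventory.get(resource_class, 0) >= amount
        for resource_class, amount in resources.items()
    )
    # Qualitative part: every required trait must be reported by the host.
    has_traits = required_traits <= host["traits"]
    return has_capacity and has_traits


# Hypothetical inventories, as a management layer might have collected them.
hosts = [
    {"name": "host1", "inventory": {"VCPU": 16, "MEMORY_MB": 32768},
     "traits": {"HW_CPU_X86_AVX"}},
    {"name": "host2", "inventory": {"VCPU": 2, "MEMORY_MB": 4096},
     "traits": {"HW_CPU_X86_AVX"}},
    {"name": "host3", "inventory": {"VCPU": 16, "MEMORY_MB": 32768},
     "traits": set()},
]

candidates = [
    host["name"]
    for host in hosts
    if host_satisfies(host, {"VCPU": 4, "MEMORY_MB": 1024}, {"HW_CPU_X86_AVX"})
]
```

Both checks only read previously reported data, which is why this style of
filtering is cheap compared to probing each candidate device with a write.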
> 
> > 
> > Alex
> > 
> 
> 



* Re: device compatibility interface for live migration with assigned devices
  2020-07-14 20:59       ` Alex Williamson
@ 2020-07-15  8:20         ` Yan Zhao
  2020-07-15  8:49           ` Feng, Shaohe
  2020-07-17 14:59           ` Alex Williamson
  2020-07-15  8:23         ` Dr. David Alan Gilbert
       [not found]         ` <CAH7mGatPWsczh_rbVhx4a+psJXvkZgKou3r5HrEQTqE7SqZkKA@mail.gmail.com>
  2 siblings, 2 replies; 48+ messages in thread
From: Yan Zhao @ 2020-07-15  8:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Dr. David Alan Gilbert, Daniel P. Berrangé,
	devel, openstack-discuss, libvir-list, intel-gvt-dev, kvm,
	qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, eauger, jian-feng.ding, hejie.xu, kevin.tian, zhenyuw,
	bao.yumeng, xin-ran.wang, shaohe.feng

On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:
> On Tue, 14 Jul 2020 18:19:46 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >   
> > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > > hi folks,
> > > > > we are defining a device migration compatibility interface that helps upper
> > > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > > live migration compatible.
> > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > > > e.g. we could use it to check whether
> > > > > - a src MDEV can migrate to a target MDEV,
> > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > 
> > > > > The upper layer stack could use this interface as the last step to check
> > > > > if one device is able to migrate to another device before triggering a real
> > > > > live migration procedure.
> > > > > we are not sure if this interface is of value or help to you. please don't
> > > > > hesitate to drop your valuable comments.
> > > > > 
> > > > > 
> > > > > (1) interface definition
> > > > > The interface is defined in below way:
> > > > > 
> > > > >              __    userspace
> > > > >               /\              \
> > > > >              /                 \write
> > > > >             / read              \
> > > > >    ________/__________       ___\|/_____________
> > > > >   | migration_version |     | migration_version |-->check migration
> > > > >   ---------------------     ---------------------   compatibility
> > > > >      device A                    device B
> > > > > 
> > > > > 
> > > > > a device attribute named migration_version is defined under each device's
> > > > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > > > > userspace tools read the migration_version as a string from the source device,
> > > > > and write it to the migration_version sysfs attribute in the target device.
> > > > > 
> > > > > The userspace should treat ANY of below conditions as two devices not compatible:
> > > > > - any one of the two devices does not have a migration_version attribute
> > > > > - error when reading from migration_version attribute of one device
> > > > > - error when writing migration_version string of one device to
> > > > >   migration_version attribute of the other device
> > > > > 
> > > > > The string read from migration_version attribute is defined by device vendor
> > > > > driver and is completely opaque to the userspace.
> > > > > for a Intel vGPU, string format can be defined like
> > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > > > 
> > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > 
> > > > > for a QAT VF, it may be
> > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > 
> > > > > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > > > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)  
> > > 
> > > It's very strange to define it as opaque and then proceed to describe
> > > the contents of that opaque string.  The point is that its contents
> > > are defined by the vendor driver to describe the device, driver version,
> > > and possibly metadata about the configuration of the device.  One
> > > instance of a device might generate a different string from another.
> > > The string that a device produces is not necessarily the only string
> > > the vendor driver will accept, for example the driver might support
> > > backwards compatible migrations.  
> > 
> > (As I've said in the previous discussion, off one of the patch series)
> > 
> > My view is it makes sense to have a half-way house on the opaqueness of
> > this string; I'd expect to have an ID and version that are human
> > readable, maybe a device ID/name that's human interpretable and then a
> > bunch of other cruft that maybe device/vendor/version specific.
> > 
> > I'm thinking that we want to be able to report problems and include the
> > string and the user to be able to easily identify the device that was
> > complaining and notice a difference in versions, and perhaps also use
> > it in compatibility patterns to find compatible hosts; but that does
> > get tricky when it's a 'ask the device if it's compatible'.
> 
> In the reply I just sent to Dan, I gave this example of what a
> "compatibility string" might look like represented as json:
> 
> {
>   "device_api": "vfio-pci",
>   "vendor": "vendor-driver-name",
>   "version": {
>     "major": 0,
>     "minor": 1
>   },
>   "vfio-pci": { // Based on above device_api
>     "vendor": 0x1234, // Values for the exposed device
>     "device": 0x5678,
>       // Possibly further parameters for a more specific match
>   },
>   "mdev_attrs": [
>     { "attribute0": "VALUE" }
>   ]
> }
> 
> Are you thinking that we might allow the vendor to include a vendor
> specific array where we'd simply require that both sides have matching
> fields and values?  ie.
> 
>   "vendor_fields": [
>     { "unknown_field0": "unknown_value0" },
>     { "unknown_field1": "unknown_value1" },
>   ]
> 
> We could certainly make that part of the spec, but I can't really
> figure the value of it other than to severely restrict compatibility,
> which the vendor could already do via the version.major value.  Maybe
> they'd want to put a build timestamp, random uuid, or source sha1 into
> such a field to make absolutely certain compatibility is only determined
> between identical builds?  Thanks,
>
Yes, I agree the kernel could expose such a sysfs interface to tell
openstack how to filter out devices. But I think the proposed
migration_version (or, renamed, migration_compatibility) interface is
still required for libvirt to do a double check.

In the following scenario:
1. openstack chooses the target device by reading the sysfs interface (in json
format) of the source device. Openstack is now pretty sure the two
devices are migration compatible.
2. openstack asks libvirt to create the target VM with the target device
and start live migration.
3. libvirt now receives the request, so it has two choices:
(1) create the target VM & target device and start live migration directly
(2) double check if the target device is compatible with the source
device before doing the remaining tasks.

Because the factors that determine whether two devices are live migration
compatible are complicated and may change dynamically (e.g. driver
upgrade or configuration changes), and also because libvirt should not
totally rely on the input from openstack, I think the cost for libvirt is
relatively lower if it chooses (2) rather than (1). At least it has no
need to cancel the migration and destroy the VM if it learns of the
incompatibility earlier.

So, it means the kernel may need to expose two parallel interfaces:
(1) one in json format, enumerating all possible fields and comparing
methods, to tell openstack how to find a matching target device
(2) an opaque driver-defined string, requiring write and test on the target,
which is used by libvirt to make sure the devices are compatible, rather than
relying on the accuracy of the input from openstack or relying on the kernel
driver performing the compatibility detection immediately after migration
start.
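
A minimal sketch of the double check in (2), assuming the attribute behaves
as described in the proposal: userspace reads the opaque string from the
source device and writes it to the target device, and a missing attribute,
a failed read, or a failed write all mean "not compatible". The function
names and the idea of passing both attribute paths to one helper are only
illustrative; in a real migration the two devices sit on different hosts and
the string has to be carried between them.

```python
import os
from pathlib import Path
from typing import Optional


def read_migration_version(attr: Path) -> Optional[bytes]:
    """Return the opaque string from a migration_version attribute, or None on any error."""
    try:
        return attr.read_bytes()
    except OSError:
        return None


def write_migration_version(attr: Path, version: bytes) -> bool:
    """Offer the source string to the target attribute; a failed write means rejection."""
    try:
        # No O_CREAT: a missing attribute must be an error, not a new regular file.
        fd = os.open(attr, os.O_WRONLY)
    except OSError:
        return False
    try:
        os.write(fd, version)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)


def devices_compatible(src_attr: Path, dst_attr: Path) -> bool:
    """Apply the three failure conditions: missing attribute, read error, write error."""
    version = read_migration_version(src_attr)
    if version is None:
        return False
    return write_migration_version(dst_attr, version)
```

libvirt would call something like devices_compatible() before creating the
target VM, so a rejection costs one sysfs write instead of a cancelled
migration.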

Does it make sense?

Thanks
Yan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device compatibility interface for live migration with assigned devices
  2020-07-14 20:59       ` Alex Williamson
  2020-07-15  8:20         ` Yan Zhao
@ 2020-07-15  8:23         ` Dr. David Alan Gilbert
       [not found]         ` <CAH7mGatPWsczh_rbVhx4a+psJXvkZgKou3r5HrEQTqE7SqZkKA@mail.gmail.com>
  2 siblings, 0 replies; 48+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-15  8:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Daniel P. Berrangé,
	Yan Zhao, devel, openstack-discuss, libvir-list, intel-gvt-dev,
	kvm, qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, eauger, jian-feng.ding, hejie.xu, kevin.tian, zhenyuw,
	bao.yumeng, xin-ran.wang, shaohe.feng

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Tue, 14 Jul 2020 18:19:46 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >   
> > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > > hi folks,
> > > > > we are defining a device migration compatibility interface that helps upper
> > > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > > live migration compatible.
> > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > > > e.g. we could use it to check whether
> > > > > - a src MDEV can migrate to a target MDEV,
> > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > 
> > > > > The upper layer stack could use this interface as the last step to check
> > > > > if one device is able to migrate to another device before triggering a real
> > > > > live migration procedure.
> > > > > we are not sure if this interface is of value or help to you. please don't
> > > > > hesitate to drop your valuable comments.
> > > > > 
> > > > > 
> > > > > (1) interface definition
> > > > > The interface is defined in below way:
> > > > > 
> > > > >              __    userspace
> > > > >               /\              \
> > > > >              /                 \write
> > > > >             / read              \
> > > > >    ________/__________       ___\|/_____________
> > > > >   | migration_version |     | migration_version |-->check migration
> > > > >   ---------------------     ---------------------   compatibility
> > > > >      device A                    device B
> > > > > 
> > > > > 
> > > > > a device attribute named migration_version is defined under each device's
> > > > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > > > > userspace tools read the migration_version as a string from the source device,
> > > > > and write it to the migration_version sysfs attribute in the target device.
> > > > > 
> > > > > The userspace should treat ANY of below conditions as two devices not compatible:
> > > > > - any one of the two devices does not have a migration_version attribute
> > > > > - error when reading from migration_version attribute of one device
> > > > > - error when writing migration_version string of one device to
> > > > >   migration_version attribute of the other device
> > > > > 
> > > > > The string read from migration_version attribute is defined by device vendor
> > > > > driver and is completely opaque to the userspace.
> > > > > for a Intel vGPU, string format can be defined like
> > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > > > 
> > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > 
> > > > > for a QAT VF, it may be
> > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > 
> > > > > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > > > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)  
> > > 
> > > It's very strange to define it as opaque and then proceed to describe
> > > the contents of that opaque string.  The point is that its contents
> > > are defined by the vendor driver to describe the device, driver version,
> > > and possibly metadata about the configuration of the device.  One
> > > instance of a device might generate a different string from another.
> > > The string that a device produces is not necessarily the only string
> > > the vendor driver will accept, for example the driver might support
> > > backwards compatible migrations.  
> > 
> > (As I've said in the previous discussion, off one of the patch series)
> > 
> > My view is it makes sense to have a half-way house on the opaqueness of
> > this string; I'd expect to have an ID and version that are human
> > readable, maybe a device ID/name that's human interpretable and then a
> > bunch of other cruft that maybe device/vendor/version specific.
> > 
> > I'm thinking that we want to be able to report problems and include the
> > string and the user to be able to easily identify the device that was
> > complaining and notice a difference in versions, and perhaps also use
> > it in compatibility patterns to find compatible hosts; but that does
> > get tricky when it's a 'ask the device if it's compatible'.
> 
> In the reply I just sent to Dan, I gave this example of what a
> "compatibility string" might look like represented as json:
> 
> {
>   "device_api": "vfio-pci",
>   "vendor": "vendor-driver-name",
>   "version": {
>     "major": 0,
>     "minor": 1
>   },
>   "vfio-pci": { // Based on above device_api
>     "vendor": 0x1234, // Values for the exposed device
>     "device": 0x5678,
>       // Possibly further parameters for a more specific match
>   },
>   "mdev_attrs": [
>     { "attribute0": "VALUE" }
>   ]
> }
> 
> Are you thinking that we might allow the vendor to include a vendor
> specific array where we'd simply require that both sides have matching
> fields and values?  ie.
> 
>   "vendor_fields": [
>     { "unknown_field0": "unknown_value0" },
>     { "unknown_field1": "unknown_value1" },
>   ]
> 
> We could certainly make that part of the spec, but I can't really
> figure the value of it other than to severely restrict compatibility,
> which the vendor could already do via the version.major value.  Maybe
> they'd want to put a build timestamp, random uuid, or source sha1 into
> such a field to make absolutely certain compatibility is only determined
> between identical builds?  Thanks,

No, I'd mostly anticipated matching on the vendor and device and maybe a
version number for the bit the user specifies; I had assumed all that
'vendor cruft' was still mostly opaque; having said that, if it did
become a list of attributes like that (some of which were vendor
specific) that would make sense to me.

Dave

> 
> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: device compatibility interface for live migration with assigned devices
  2020-07-15  8:20         ` Yan Zhao
@ 2020-07-15  8:49           ` Feng, Shaohe
  2020-07-17 14:59           ` Alex Williamson
  1 sibling, 0 replies; 48+ messages in thread
From: Feng, Shaohe @ 2020-07-15  8:49 UTC (permalink / raw)
  To: Zhao, Yan Y, Alex Williamson
  Cc: Dr. David Alan Gilbert, Daniel P. Berrangé,
	devel, openstack-discuss, libvir-list, intel-gvt-dev, kvm,
	qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, eauger, Ding, Jian-feng, Xu, Hejie, Tian, Kevin,
	zhenyuw, bao.yumeng, Wang, Xin-ran, Feng, Shaohe



-----Original Message-----
From: Zhao, Yan Y <yan.y.zhao@intel.com> 
Sent: July 15, 2020 16:21
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>; Daniel P. Berrangé <berrange@redhat.com>; devel@ovirt.org; openstack-discuss@lists.openstack.org; libvir-list@redhat.com; intel-gvt-dev@lists.freedesktop.org; kvm@vger.kernel.org; qemu-devel@nongnu.org; smooney@redhat.com; eskultet@redhat.com; cohuck@redhat.com; dinechin@redhat.com; corbet@lwn.net; kwankhede@nvidia.com; eauger@redhat.com; Ding, Jian-feng <jian-feng.ding@intel.com>; Xu, Hejie <hejie.xu@intel.com>; Tian, Kevin <kevin.tian@intel.com>; zhenyuw@linux.intel.com; bao.yumeng@zte.com.cn; Wang, Xin-ran <xin-ran.wang@intel.com>; Feng, Shaohe <shaohe.feng@intel.com>
Subject: Re: device compatibility interface for live migration with assigned devices

On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:
> On Tue, 14 Jul 2020 18:19:46 +0100
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > On Tue, 14 Jul 2020 11:21:29 +0100 Daniel P. Berrangé 
> > > <berrange@redhat.com> wrote:
> > >   
> > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > > hi folks,
> > > > > we are defining a device migration compatibility interface 
> > > > > that helps upper layer stack like openstack/ovirt/libvirt to 
> > > > > check if two devices are live migration compatible.
> > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > > > e.g. we could use it to check whether
> > > > > - a src MDEV can migrate to a target MDEV,
> > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > 
> > > > > The upper layer stack could use this interface as the last 
> > > > > step to check if one device is able to migrate to another 
> > > > > device before triggering a real live migration procedure.
> > > > > we are not sure if this interface is of value or help to you. 
> > > > > please don't hesitate to drop your valuable comments.
> > > > > 
> > > > > 
> > > > > (1) interface definition
> > > > > The interface is defined in below way:
> > > > > 
> > > > >              __    userspace
> > > > >               /\              \
> > > > >              /                 \write
> > > > >             / read              \
> > > > >    ________/__________       ___\|/_____________
> > > > >   | migration_version |     | migration_version |-->check migration
> > > > >   ---------------------     ---------------------   compatibility
> > > > >      device A                    device B
> > > > > 
> > > > > 
> > > > > a device attribute named migration_version is defined under 
> > > > > each device's sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > > > > userspace tools read the migration_version as a string from 
> > > > > the source device, and write it to the migration_version sysfs attribute in the target device.
> > > > > 
> > > > > The userspace should treat ANY of below conditions as two devices not compatible:
> > > > > - any one of the two devices does not have a migration_version 
> > > > > attribute
> > > > > - error when reading from migration_version attribute of one 
> > > > > device
> > > > > - error when writing migration_version string of one device to
> > > > >   migration_version attribute of the other device
> > > > > 
> > > > > The string read from migration_version attribute is defined by 
> > > > > device vendor driver and is completely opaque to the userspace.
> > > > > for a Intel vGPU, string format can be defined like "parent 
> > > > > device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > > > 
> > > > > for an NVMe VF connecting to a remote storage. it could be 
> > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > 
> > > > > for a QAT VF, it may be
> > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > 
> > > > > (to avoid namespace confliction from each vendor, we may 
> > > > > prefix a driver name to each migration_version string. e.g. 
> > > > > i915-v1-8086-591d-i915-GVTg_V5_8-1)
> > > 
> > > It's very strange to define it as opaque and then proceed to 
> > > describe the contents of that opaque string.  The point is that 
> > > its contents are defined by the vendor driver to describe the 
> > > device, driver version, and possibly metadata about the 
> > > configuration of the device.  One instance of a device might generate a different string from another.
> > > The string that a device produces is not necessarily the only 
> > > string the vendor driver will accept, for example the driver might 
> > > support backwards compatible migrations.
> > 
> > (As I've said in the previous discussion, off one of the patch 
> > series)
> > 
> > My view is it makes sense to have a half-way house on the opaqueness 
> > of this string; I'd expect to have an ID and version that are human 
> > readable, maybe a device ID/name that's human interpretable and then 
> > a bunch of other cruft that maybe device/vendor/version specific.
> > 
> > I'm thinking that we want to be able to report problems and include 
> > the string and the user to be able to easily identify the device 
> > that was complaining and notice a difference in versions, and 
> > perhaps also use it in compatibility patterns to find compatible 
> > hosts; but that does get tricky when it's a 'ask the device if it's compatible'.
> 
> In the reply I just sent to Dan, I gave this example of what a 
> "compatibility string" might look like represented as json:
> 
> {
>   "device_api": "vfio-pci",
>   "vendor": "vendor-driver-name",
>   "version": {
>     "major": 0,
>     "minor": 1
>   },
>   "vfio-pci": { // Based on above device_api
>     "vendor": 0x1234, // Values for the exposed device
>     "device": 0x5678,
>       // Possibly further parameters for a more specific match
>   },
>   "mdev_attrs": [
>     { "attribute0": "VALUE" }
>   ]
> }
> 
> Are you thinking that we might allow the vendor to include a vendor 
> specific array where we'd simply require that both sides have matching 
> fields and values?  ie.
> 
>   "vendor_fields": [
>     { "unknown_field0": "unknown_value0" },
>     { "unknown_field1": "unknown_value1" },
>   ]
> 
> We could certainly make that part of the spec, but I can't really 
> figure the value of it other than to severely restrict compatibility, 
> which the vendor could already do via the version.major value.  Maybe 
> they'd want to put a build timestamp, random uuid, or source sha1 into 
> such a field to make absolutely certain compatibility is only 
> determined between identical builds?  Thanks,
>
Yes, I agree the kernel could expose such a sysfs interface to tell openstack how to filter out devices. But I think the proposed migration_version (or, renamed, migration_compatibility) interface is still required for libvirt to do a double check.

In the following scenario:
1. openstack chooses the target device by reading the sysfs interface (in json
format) of the source device. Openstack is now pretty sure the two devices are migration compatible.
2. openstack asks libvirt to create the target VM with the target device and start live migration.
3. libvirt now receives the request, so it has two choices:
(1) create the target VM & target device and start live migration directly
(2) double check if the target device is compatible with the source device before doing the remaining tasks.

Because the factors that determine whether two devices are live migration compatible are complicated and may change dynamically (e.g. driver upgrade or configuration changes), and also because libvirt should not totally rely on the input from openstack, I think the cost for libvirt is relatively lower if it chooses (2) rather than (1). At least it has no need to cancel the migration and destroy the VM if it learns of the incompatibility earlier.

So, it means the kernel may need to expose two parallel interfaces:
(1) one in json format, enumerating all possible fields and comparing methods, to tell openstack how to find a matching target device
(2) an opaque driver-defined string, requiring write and test on the target, which is used by libvirt to make sure the devices are compatible, rather than relying on the accuracy of the input from openstack or relying on the kernel driver performing the compatibility detection immediately after migration start.

Does it make sense?

[Feng, Shaohe]
Yes, we had better have two interfaces for the different phases of live migration.
For (1), the scheduler can leverage this information to minimize the failure rate of migration. The open problem is which values should be used to guide the scheduler. The values should be human readable.
For (2), yes, we can't assume that the migration always succeeds, so a double check is needed.
BR
Shaohe

Thanks
Yan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device compatibility interface for live migration with assigned devices
  2020-07-14 20:47       ` Alex Williamson
@ 2020-07-15  9:16         ` Daniel P. Berrangé
  0 siblings, 0 replies; 48+ messages in thread
From: Daniel P. Berrangé @ 2020-07-15  9:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yan Zhao, devel, openstack-discuss, libvir-list, intel-gvt-dev,
	kvm, qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, dgilbert, eauger, jian-feng.ding, hejie.xu,
	kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang, shaohe.feng

On Tue, Jul 14, 2020 at 02:47:15PM -0600, Alex Williamson wrote:
> On Tue, 14 Jul 2020 17:47:22 +0100
> Daniel P. Berrangé <berrange@redhat.com> wrote:

> > I'm sure OpenStack maintainers can speak to this more, as they've put
> > alot of work into their scheduling engine to optimize the way it places
> > VMs largely driven from simple structured data reported from hosts.
> 
> I think we've weeded out that our intended approach is not worthwhile,
> testing a compatibility string at a device is too much overhead, we
> need to provide enough information to the management engine to predict
> the response without interaction beyond the initial capability probing.

Just to clarify in case people mis-interpreted my POV...

I think that testing a compatibility string at a device *is* useful, as
it allows for a final accurate safety check to be performed before the
migration stream starts. Libvirt could use that reasonably easily I
believe.

It just isn't sufficient for a complete solution.

In parallel with the device level test in sysfs, we need something else
to support the host placement selection problems in an efficient way, as
you are trying to address in the remainder of your mail.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device compatibility interface for live migration with assigned devices
  2020-07-13 23:29 device compatibility interface for live migration with assigned devices Yan Zhao
  2020-07-14 10:21 ` Daniel P. Berrangé
@ 2020-07-16  4:16 ` Jason Wang
  2020-07-16  8:32   ` Yan Zhao
  1 sibling, 1 reply; 48+ messages in thread
From: Jason Wang @ 2020-07-16  4:16 UTC (permalink / raw)
  To: Yan Zhao, devel, openstack-discuss, libvir-list
  Cc: intel-gvt-dev, kvm, qemu-devel, berrange, smooney, eskultet,
	alex.williamson, cohuck, dinechin, corbet, kwankhede, dgilbert,
	eauger, jian-feng.ding, hejie.xu, kevin.tian, zhenyuw,
	bao.yumeng, xin-ran.wang, shaohe.feng


On 2020/7/14 7:29 AM, Yan Zhao wrote:
> hi folks,
> we are defining a device migration compatibility interface that helps upper
> layer stack like openstack/ovirt/libvirt to check if two devices are
> live migration compatible.
> The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> e.g. we could use it to check whether
> - a src MDEV can migrate to a target MDEV,
> - a src VF in SRIOV can migrate to a target VF in SRIOV,
> - a src MDEV can migration to a target VF in SRIOV.
>    (e.g. SIOV/SRIOV backward compatibility case)
>
> The upper layer stack could use this interface as the last step to check
> if one device is able to migrate to another device before triggering a real
> live migration procedure.
> we are not sure if this interface is of value or help to you. please don't
> hesitate to drop your valuable comments.
>
>
> (1) interface definition
> The interface is defined in below way:
>
>               __    userspace
>                /\              \
>               /                 \write
>              / read              \
>     ________/__________       ___\|/_____________
>    | migration_version |     | migration_version |-->check migration
>    ---------------------     ---------------------   compatibility
>       device A                    device B
>
>
> a device attribute named migration_version is defined under each device's
> sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).


Are you aware of the devlink based device management interface that is 
proposed upstream? I think it has many advantages over sysfs. Have you 
considered switching to that?


> userspace tools read the migration_version as a string from the source device,
> and write it to the migration_version sysfs attribute in the target device.
>
> The userspace should treat ANY of below conditions as two devices not compatible:
> - any one of the two devices does not have a migration_version attribute
> - error when reading from migration_version attribute of one device
> - error when writing migration_version string of one device to
>    migration_version attribute of the other device
>
> The string read from migration_version attribute is defined by device vendor
> driver and is completely opaque to the userspace.


My understanding is that something opaque to userspace is not the 
philosophy of Linux. Instead of having a generic API but an opaque value, 
why not do it in a vendor specific way, like:

1) exposing the device capability in a vendor specific way via 
sysfs/devlink or another API
2) management reads the capability on both src and dst and determines 
whether we can do the migration

This is what we plan to do with vDPA.
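
A rough sketch of steps 1) and 2), for illustration only. The capability
layout ("device_type", "features") and the rule that dst must offer every
feature src exposes are assumptions made for this example; they are not
taken from vDPA or from any existing sysfs/devlink interface.

```python
def can_migrate(src_caps, dst_caps):
    """Hypothetical management-side policy applied to capabilities read from both hosts."""
    # Assumed rule 1: the two devices must be of the same type.
    if src_caps["device_type"] != dst_caps["device_type"]:
        return False
    # Assumed rule 2: dst must offer at least the features src exposes.
    return set(src_caps["features"]) <= set(dst_caps["features"])


# Hypothetical capabilities, as management might read them on src and dst.
src_caps = {"device_type": "net", "features": ["csum", "mq"]}
dst_caps = {"device_type": "net", "features": ["csum", "mq", "tso"]}

print(can_migrate(src_caps, dst_caps))
print(can_migrate(dst_caps, src_caps))
```

Here the decision is made entirely in management, so the comparison rules
have to be known to userspace for each vendor.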

Thanks


> for a Intel vGPU, string format can be defined like
> "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
>
> for an NVMe VF connecting to a remote storage. it could be
> "PCI ID" + "driver version" + "configured remote storage URL"
>
> for a QAT VF, it may be
> "PCI ID" + "driver version" + "supported encryption set".
>
> (to avoid namespace confliction from each vendor, we may prefix a driver name to
> each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
>
>
> (2) backgrounds
>
> The reason we hope the migration_version string is opaque to the userspace
> is that it is hard to generalize standard comparing fields and comparing
> methods for different devices from different vendors.
> Though userspace now could still do a simple string compare to check if
> two devices are compatible, and result should also be right, it's still
> too limited as it excludes the possible candidate whose migration_version
> string fails to be equal.
> e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible
> with another MDEV with mdev_type_3, aggregator count 1, even their
> migration_version strings are not equal.
> (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
>
> besides that, driver version + configured resources are all elements demanding
> to take into account.
>
> So, we hope leaving the freedom to vendor driver and let it make the final decision
> in a simple reading from source side and writing for test in the target side way.
>
>
> we then think the device compatibility issues for live migration with assigned
> devices can be divided into two steps:
> a. management tools filter out possible migration target devices.
>     Tags could be created according to info from product specification.
>     we think openstack/ovirt may have vendor proprietary components to create
>     those customized tags for each product from each vendor.
>     e.g.
>     for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to
>     search target vGPU are like:
>     a tag for compatible parent PCI IDs,
>     a tag for a range of gvt driver versions,
>     a tag for a range of mdev type + aggregator count
>
>     for NVMe VF, the tags to search target VF may be like:
>     a tag for compatible PCI IDs,
>     a tag for a range of driver versions,
>     a tag for URL of configured remote storage.
>
> b. with the output from step a, openstack/ovirt/libvirt could use our proposed
>     device migration compatibility interface to make sure the two devices are
>     indeed live migration compatible before launching the real live migration
>     process to start stream copying, src device stopping and target device
>     resuming.
>     It is supposed that this step would not bring any performance penalty as
>     -in kernel it's just a simple string decoding and comparing
>     -in openstack/ovirt, it could be done by extending current function
>      check_can_live_migrate_destination, along side claiming target resources.[1]
>
>
> [1] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvirt-neutron-sriov-livemigration.html
>
> Thanks
> Yan
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device compatibility interface for live migration with assigned devices
  2020-07-16  4:16 ` Jason Wang
@ 2020-07-16  8:32   ` Yan Zhao
  2020-07-16  9:30     ` Jason Wang
  2020-07-17 16:12     ` Alex Williamson
  0 siblings, 2 replies; 48+ messages in thread
From: Yan Zhao @ 2020-07-16  8:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: devel, openstack-discuss, libvir-list, intel-gvt-dev, kvm,
	qemu-devel, berrange, smooney, eskultet, alex.williamson, cohuck,
	dinechin, corbet, kwankhede, dgilbert, eauger, jian-feng.ding,
	hejie.xu, kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang,
	shaohe.feng

On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
> 
> On 2020/7/14 7:29 AM, Yan Zhao wrote:
> > hi folks,
> > we are defining a device migration compatibility interface that helps upper
> > layer stack like openstack/ovirt/libvirt to check if two devices are
> > live migration compatible.
> > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > e.g. we could use it to check whether
> > - a src MDEV can migrate to a target MDEV,
> > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > - a src MDEV can migration to a target VF in SRIOV.
> >    (e.g. SIOV/SRIOV backward compatibility case)
> > 
> > The upper layer stack could use this interface as the last step to check
> > if one device is able to migrate to another device before triggering a real
> > live migration procedure.
> > we are not sure if this interface is of value or help to you. please don't
> > hesitate to drop your valuable comments.
> > 
> > 
> > (1) interface definition
> > The interface is defined in below way:
> > 
> >               __    userspace
> >                /\              \
> >               /                 \write
> >              / read              \
> >     ________/__________       ___\|/_____________
> >    | migration_version |     | migration_version |-->check migration
> >    ---------------------     ---------------------   compatibility
> >       device A                    device B
> > 
> > 
> > a device attribute named migration_version is defined under each device's
> > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> 
> 
> Are you aware of the devlink based device management interface that is
> proposed upstream? I think it has many advantages over sysfs, do you
> consider to switch to that?
I'm not familiar with devlink. I will do some research on it.
> 
> 
> > userspace tools read the migration_version as a string from the source device,
> > and write it to the migration_version sysfs attribute in the target device.
> > 
> > The userspace should treat ANY of below conditions as two devices not compatible:
> > - any one of the two devices does not have a migration_version attribute
> > - error when reading from migration_version attribute of one device
> > - error when writing migration_version string of one device to
> >    migration_version attribute of the other device
> > 
> > The string read from migration_version attribute is defined by device vendor
> > driver and is completely opaque to the userspace.
> 
> 
> My understanding is that something opaque to userspace is not the philosophy

But VFIO live migration is itself essentially one big opaque stream to userspace.

> of Linux. Instead of having a generic API but opaque value, why not do in a
> vendor specific way like:
> 
> 1) exposing the device capability in a vendor specific way via sysfs/devlink
> or other API
> 2) management read capability in both src and dst and determine whether we
> can do the migration
> 
> This is the way we plan to do with vDPA.
>
Yes, in another reply Alex proposed using an interface in JSON format.
I guess we could define something like:

{ "self" :
  [
    { "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type"   : "i915-GVTg_V5_2",
      "aggregator"  : "1",
      "pv-mode"     : "none"
    }
  ],
  "compatible" :
  [
    { "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type"   : "i915-GVTg_V5_2",
      "aggregator"  : "1",
      "pv-mode"     : "none"
    },
    { "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v1",
      "mdev_type"   : "i915-GVTg_V5_4",
      "aggregator"  : "2",
      "pv-mode"     : "none"
    },
    { "pciid" : "8086591d",
      "driver" : "i915",
      "gvt-version" : "v2",
      "mdev_type"   : "i915-GVTg_V5_4",
      "aggregator"  : "2",
      "pv-mode"     : "none, ppgtt, context"
    }
    ...
  ]
}

But as those fields are mostly vendor specific, userspace can only do
simple string comparison, so I guess the list would be very long, as it
needs to enumerate all possible targets.
Also, for some fields like "gvt-version", is there a simple way to express
things like v2+?

If userspace can read this interface on both src and target and check
whether each of them is in the other's compatible list, I think it will
work for us.
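
A minimal sketch of that userspace check, assuming each device exposes a
document shaped like the JSON example above and that every field is compared
by plain string equality. The helper names (describe, listed_in,
migration_compatible) are hypothetical and not part of any existing interface:

```python
# Fields compared by plain string equality. The field names come from the
# JSON example above; the helper names below are hypothetical.
FIELDS = ("pciid", "driver", "gvt-version", "mdev_type", "aggregator", "pv-mode")


def describe(mdev_type, aggregator, gvt_version="v1", pv_mode="none"):
    """Build one device description entry (illustrative values only)."""
    return {
        "pciid": "8086591d",
        "driver": "i915",
        "gvt-version": gvt_version,
        "mdev_type": mdev_type,
        "aggregator": aggregator,
        "pv-mode": pv_mode,
    }


def matches(entry, candidate):
    """Return True if every compared field is string-equal in both entries."""
    return all(entry.get(f) == candidate.get(f) for f in FIELDS)


def listed_in(device, other):
    """Return True if device's "self" entry appears in other's "compatible" list."""
    me = device["self"][0]
    return any(matches(me, entry) for entry in other["compatible"])


def migration_compatible(src, dst):
    """Both directions must hold: each side has to list the other."""
    return listed_in(src, dst) and listed_in(dst, src)


src = {
    "self": [describe("i915-GVTg_V5_2", "1")],
    "compatible": [describe("i915-GVTg_V5_2", "1"), describe("i915-GVTg_V5_4", "2")],
}
dst = {
    "self": [describe("i915-GVTg_V5_4", "2")],
    "compatible": [describe("i915-GVTg_V5_2", "1")],
}
result = migration_compatible(src, dst)
```

Because the comparison is pure string equality, every acceptable target has
to be enumerated explicitly, which is the list-length concern raised above.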

But still, the kernel should not rely on userspace's choice, so the opaque
compatibility string is still required in the kernel. Whether or not it is
exposed to userspace as a compatibility checking interface, the vendor
driver would keep this part of the code and embed the string into the
migration stream. So exposing it as an interface for libvirt to do a
safety check before a real live migration only lets the kernel's part of
the check happen earlier.


Thanks
Yan


> 
> 
> > for a Intel vGPU, string format can be defined like
> > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > 
> > for an NVMe VF connecting to a remote storage. it could be
> > "PCI ID" + "driver version" + "configured remote storage URL"
> > 
> > for a QAT VF, it may be
> > "PCI ID" + "driver version" + "supported encryption set".
> > 
> > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
> > 
> > 
> > (2) backgrounds
> > 
> > The reason we hope the migration_version string is opaque to the userspace
> > is that it is hard to generalize standard comparing fields and comparing
> > methods for different devices from different vendors.
> > Though userspace now could still do a simple string compare to check if
> > two devices are compatible, and result should also be right, it's still
> > too limited as it excludes the possible candidate whose migration_version
> > string fails to be equal.
> > e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible
> > with another MDEV with mdev_type_3, aggregator count 1, even their
> > migration_version strings are not equal.
> > (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
> > 
> > besides that, driver version + configured resources are all elements demanding
> > to take into account.
> > 
> > So, we hope leaving the freedom to vendor driver and let it make the final decision
> > in a simple reading from source side and writing for test in the target side way.
> > 
> > 
> > we then think the device compatibility issues for live migration with assigned
> > devices can be divided into two steps:
> > a. management tools filter out possible migration target devices.
> >     Tags could be created according to info from product specification.
> >     we think openstack/ovirt may have vendor proprietary components to create
> >     those customized tags for each product from each vendor.
> >     e.g.
> >     for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to
> >     search target vGPU are like:
> >     a tag for compatible parent PCI IDs,
> >     a tag for a range of gvt driver versions,
> >     a tag for a range of mdev type + aggregator count
> > 
> >     for NVMe VF, the tags to search target VF may be like:
> >     a tag for compatible PCI IDs,
> >     a tag for a range of driver versions,
> >     a tag for URL of configured remote storage.
> > 
> > b. with the output from step a, openstack/ovirt/libvirt could use our proposed
> >     device migration compatibility interface to make sure the two devices are
> >     indeed live migration compatible before launching the real live migration
> >     process to start stream copying, src device stopping and target device
> >     resuming.
> >     It is supposed that this step would not bring any performance penalty as
> >     -in kernel it's just a simple string decoding and comparing
> >     -in openstack/ovirt, it could be done by extending current function
> >      check_can_live_migrate_destination, along side claiming target resources.[1]
> > 
> > 
> > [1] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvirt-neutron-sriov-livemigration.html
> > 
> > Thanks
> > Yan
> > 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device compatibility interface for live migration with assigned devices
  2020-07-16  8:32   ` Yan Zhao
@ 2020-07-16  9:30     ` Jason Wang
  2020-07-17 16:12     ` Alex Williamson
  1 sibling, 0 replies; 48+ messages in thread
From: Jason Wang @ 2020-07-16  9:30 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, libvir-list, qemu-devel, kwankhede, eauger, xin-ran.wang,
	corbet, openstack-discuss, shaohe.feng, kevin.tian, eskultet,
	jian-feng.ding, dgilbert, zhenyuw, hejie.xu, bao.yumeng,
	alex.williamson, smooney, intel-gvt-dev, berrange, cohuck,
	dinechin, devel


On 2020/7/16 4:32 PM, Yan Zhao wrote:
> On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
>> On 2020/7/14 7:29 AM, Yan Zhao wrote:
>>> hi folks,
>>> we are defining a device migration compatibility interface that helps upper
>>> layer stack like openstack/ovirt/libvirt to check if two devices are
>>> live migration compatible.
>>> The "devices" here could be MDEVs, physical devices, or hybrid of the two.
>>> e.g. we could use it to check whether
>>> - a src MDEV can migrate to a target MDEV,
>>> - a src VF in SRIOV can migrate to a target VF in SRIOV,
>>> - a src MDEV can migration to a target VF in SRIOV.
>>>     (e.g. SIOV/SRIOV backward compatibility case)
>>>
>>> The upper layer stack could use this interface as the last step to check
>>> if one device is able to migrate to another device before triggering a real
>>> live migration procedure.
>>> we are not sure if this interface is of value or help to you. please don't
>>> hesitate to drop your valuable comments.
>>>
>>>
>>> (1) interface definition
>>> The interface is defined in below way:
>>>
>>>                __    userspace
>>>                 /\              \
>>>                /                 \write
>>>               / read              \
>>>      ________/__________       ___\|/_____________
>>>     | migration_version |     | migration_version |-->check migration
>>>     ---------------------     ---------------------   compatibility
>>>        device A                    device B
>>>
>>>
>>> a device attribute named migration_version is defined under each device's
>>> sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
>>
>> Are you aware of the devlink based device management interface that is
>> proposed upstream? I think it has many advantages over sysfs, do you
>> consider to switch to that?
> not familiar with the devlink. will do some research of it.
>>
>>> userspace tools read the migration_version as a string from the source device,
>>> and write it to the migration_version sysfs attribute in the target device.
>>>
>>> The userspace should treat ANY of below conditions as two devices not compatible:
>>> - any one of the two devices does not have a migration_version attribute
>>> - error when reading from migration_version attribute of one device
>>> - error when writing migration_version string of one device to
>>>     migration_version attribute of the other device
>>>
>>> The string read from migration_version attribute is defined by device vendor
>>> driver and is completely opaque to the userspace.
>>
>> My understanding is that something opaque to userspace is not the philosophy
> but the VFIO live migration in itself is essentially a big opaque stream to userspace.


I think it's better not to limit the kernel interface to one specific use
case. This is basically device introspection.


>
>> of Linux. Instead of having a generic API but opaque value, why not do in a
>> vendor specific way like:
>>
>> 1) exposing the device capability in a vendor specific way via sysfs/devlink
>> or other API
>> 2) management read capability in both src and dst and determine whether we
>> can do the migration
>>
>> This is the way we plan to do with vDPA.
>>
> yes, in another reply, Alex proposed to use an interface in json format.
> I guess we can define something like
>
> { "self" :
>    [
>      { "pciid" : "8086591d",
>        "driver" : "i915",
>        "gvt-version" : "v1",
>        "mdev_type"   : "i915-GVTg_V5_2",
>        "aggregator"  : "1",
>        "pv-mode"     : "none",
>      }
>    ],
>    "compatible" :
>    [
>      { "pciid" : "8086591d",
>        "driver" : "i915",
>        "gvt-version" : "v1",
>        "mdev_type"   : "i915-GVTg_V5_2",
>        "aggregator"  : "1"
>        "pv-mode"     : "none",
>      },
>      { "pciid" : "8086591d",
>        "driver" : "i915",
>        "gvt-version" : "v1",
>        "mdev_type"   : "i915-GVTg_V5_4",
>        "aggregator"  : "2"
>        "pv-mode"     : "none",
>      },
>      { "pciid" : "8086591d",
>        "driver" : "i915",
>        "gvt-version" : "v2",
>        "mdev_type"   : "i915-GVTg_V5_4",
>        "aggregator"  : "2"
>        "pv-mode"     : "none, ppgtt, context",
>      }
>      ...
>    ]
> }


This is probably another argument for a devlink-based interface.


>
> But as those fields are mostly vendor specific, the userspace can
> only do simple string comparing, I guess the list would be very long as
> it needs to enumerate all possible targets.
> also, in some fileds like "gvt-version", is there a simple way to express
> things like v2+?


That's totally vendor specific, I think. If "v2+" means it only supports
version 2 and later, we can introduce fields like min_version and max_version.
But again, the point is to leave such interfaces vendor specific instead
of trying to have a generic format.
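
As a rough sketch of the min_version/max_version idea, assuming versions are
written like "v1" or "v2" and that a missing max_version means "no upper
bound" (which is how "v2+" would be expressed). The function names are
hypothetical:

```python
def parse_version(text):
    """Turn a version written as "v2" (or "2") into the integer 2."""
    return int(text.lstrip("v"))


def version_in_range(version, min_version, max_version=None):
    """Check min_version <= version <= max_version.

    max_version=None is an assumption of this sketch: it stands for an
    open-ended range, i.e. "min_version or anything later".
    """
    value = parse_version(version)
    if value < parse_version(min_version):
        return False
    return max_version is None or value <= parse_version(max_version)


# "v2+" expressed as min_version="v2" with no max_version.
accepts_v3 = version_in_range("v3", min_version="v2")
accepts_v1 = version_in_range("v1", min_version="v2")
```

A vendor with a different version scheme would need its own parse step,
which fits the point that such fields stay vendor specific.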


>
> If the userspace can read this interface both in src and target and
> check whether both src and target are in corresponding compatible list, I
> think it will work for us.
>
> But still, kernel should not rely on userspace's choice, the opaque
> compatibility string is still required in kernel. No matter whether
> it would be exposed to userspace as an compatibility checking interface,
> vendor driver would keep this part of code and embed the string into the
> migration stream.


Why? Can we simply do:

1) Src supports features A, B, C (version 1.0)
2) Dst supports features A, B, C, D (version 2.0)
3) Only enable features A, B, C on the destination in a version-specific way
(set version to 1.0)
4) Migrate metadata A, B, C
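
The four steps above can be sketched roughly as follows. The version table
and the function name are illustrative assumptions; a real device would
report its own versions and features through whatever capability interface
it exposes:

```python
# Hypothetical mapping from device version to the features it supports.
FEATURES_BY_VERSION = {
    "1.0": frozenset("ABC"),
    "2.0": frozenset("ABCD"),
}


def plan_migration(src_version, dst_version):
    """Return how to configure the destination, or None if it cannot host src.

    The destination is usable only if it supports every feature the source
    has. In that case it is pinned to the source's version so that only the
    source's features are enabled, and only their metadata is migrated.
    """
    src_features = FEATURES_BY_VERSION[src_version]
    dst_features = FEATURES_BY_VERSION[dst_version]
    if not src_features <= dst_features:
        return None
    return {"dst_version": src_version, "metadata": sorted(src_features)}


forward = plan_migration("1.0", "2.0")
backward = plan_migration("2.0", "1.0")
```

Here forward pins the destination to version 1.0, while backward is None
because the 1.0 destination lacks feature D.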


>   so exposing it as an interface to be used by libvirt to
> do a safety check before a real live migration is only about enabling
> the kernel part of check to happen ahead.


If we've already exposed the capability, there's no need for an extra
check like a compatibility string.

Thanks


>
>
> Thanks
> Yan
>
>
>>
>>> for a Intel vGPU, string format can be defined like
>>> "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
>>>
>>> for an NVMe VF connecting to a remote storage. it could be
>>> "PCI ID" + "driver version" + "configured remote storage URL"
>>>
>>> for a QAT VF, it may be
>>> "PCI ID" + "driver version" + "supported encryption set".
>>>
>>> (to avoid namespace confliction from each vendor, we may prefix a driver name to
>>> each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
>>>
>>>
>>> (2) backgrounds
>>>
>>> The reason we hope the migration_version string is opaque to the userspace
>>> is that it is hard to generalize standard comparing fields and comparing
>>> methods for different devices from different vendors.
>>> Though userspace now could still do a simple string compare to check if
>>> two devices are compatible, and result should also be right, it's still
>>> too limited as it excludes the possible candidate whose migration_version
>>> string fails to be equal.
>>> e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible
>>> with another MDEV with mdev_type_3, aggregator count 1, even their
>>> migration_version strings are not equal.
>>> (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
>>>
>>> besides that, driver version + configured resources are all elements demanding
>>> to take into account.
>>>
>>> So, we hope leaving the freedom to vendor driver and let it make the final decision
>>> in a simple reading from source side and writing for test in the target side way.
>>>
>>>
>>> we then think the device compatibility issues for live migration with assigned
>>> devices can be divided into two steps:
>>> a. management tools filter out possible migration target devices.
>>>      Tags could be created according to info from product specification.
>>>      we think openstack/ovirt may have vendor proprietary components to create
>>>      those customized tags for each product from each vendor.
>>>      e.g.
>>>      for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to
>>>      search target vGPU are like:
>>>      a tag for compatible parent PCI IDs,
>>>      a tag for a range of gvt driver versions,
>>>      a tag for a range of mdev type + aggregator count
>>>
>>>      for NVMe VF, the tags to search target VF may be like:
>>>      a tag for compatible PCI IDs,
>>>      a tag for a range of driver versions,
>>>      a tag for URL of configured remote storage.
>>>
>>> b. with the output from step a, openstack/ovirt/libvirt could use our proposed
>>>      device migration compatibility interface to make sure the two devices are
>>>      indeed live migration compatible before launching the real live migration
>>>      process to start stream copying, src device stopping and target device
>>>      resuming.
>>>      It is supposed that this step would not bring any performance penalty as
>>>      -in kernel it's just a simple string decoding and comparing
>>>      -in openstack/ovirt, it could be done by extending current function
>>>       check_can_live_migrate_destination, along side claiming target resources.[1]
>>>
>>>
>>> [1] https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/libvirt-neutron-sriov-livemigration.html
>>>
>>> Thanks
>>> Yan
>>>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device compatibility interface for live migration with assigned devices
  2020-07-15  8:20         ` Yan Zhao
  2020-07-15  8:49           ` Feng, Shaohe
@ 2020-07-17 14:59           ` Alex Williamson
  2020-07-17 18:03             ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 48+ messages in thread
From: Alex Williamson @ 2020-07-17 14:59 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Dr. David Alan Gilbert, Daniel P. Berrangé,
	devel, openstack-discuss, libvir-list, intel-gvt-dev, kvm,
	qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, eauger, jian-feng.ding, hejie.xu, kevin.tian, zhenyuw,
	bao.yumeng, xin-ran.wang, shaohe.feng

On Wed, 15 Jul 2020 16:20:41 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:
> > On Tue, 14 Jul 2020 18:19:46 +0100
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >   
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >     
> > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:    
> > > > > > hi folks,
> > > > > > we are defining a device migration compatibility interface that helps upper
> > > > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > > > live migration compatible.
> > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > > > > e.g. we could use it to check whether
> > > > > > - a src MDEV can migrate to a target MDEV,
> > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > > 
> > > > > > The upper layer stack could use this interface as the last step to check
> > > > > > if one device is able to migrate to another device before triggering a real
> > > > > > live migration procedure.
> > > > > > we are not sure if this interface is of value or help to you. please don't
> > > > > > hesitate to drop your valuable comments.
> > > > > > 
> > > > > > 
> > > > > > (1) interface definition
> > > > > > The interface is defined in below way:
> > > > > > 
> > > > > >              __    userspace
> > > > > >               /\              \
> > > > > >              /                 \write
> > > > > >             / read              \
> > > > > >    ________/__________       ___\|/_____________
> > > > > >   | migration_version |     | migration_version |-->check migration
> > > > > >   ---------------------     ---------------------   compatibility
> > > > > >      device A                    device B
> > > > > > 
> > > > > > 
> > > > > > a device attribute named migration_version is defined under each device's
> > > > > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > > > > > userspace tools read the migration_version as a string from the source device,
> > > > > > and write it to the migration_version sysfs attribute in the target device.
> > > > > > 
> > > > > > The userspace should treat ANY of below conditions as two devices not compatible:
> > > > > > - any one of the two devices does not have a migration_version attribute
> > > > > > - error when reading from migration_version attribute of one device
> > > > > > - error when writing migration_version string of one device to
> > > > > >   migration_version attribute of the other device
> > > > > > 
> > > > > > The string read from migration_version attribute is defined by device vendor
> > > > > > driver and is completely opaque to the userspace.
> > > > > > for a Intel vGPU, string format can be defined like
> > > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > > > > 
> > > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > > 
> > > > > > for a QAT VF, it may be
> > > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > > 
> > > > > > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > > > > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)    
> > > > 
> > > > It's very strange to define it as opaque and then proceed to describe
> > > > the contents of that opaque string.  The point is that its contents
> > > > are defined by the vendor driver to describe the device, driver version,
> > > > and possibly metadata about the configuration of the device.  One
> > > > instance of a device might generate a different string from another.
> > > > The string that a device produces is not necessarily the only string
> > > > the vendor driver will accept, for example the driver might support
> > > > backwards compatible migrations.    
> > > 
> > > (As I've said in the previous discussion, off one of the patch series)
> > > 
> > > My view is it makes sense to have a half-way house on the opaqueness of
> > > this string; I'd expect to have an ID and version that are human
> > > readable, maybe a device ID/name that's human interpretable and then a
> > > bunch of other cruft that maybe device/vendor/version specific.
> > > 
> > > I'm thinking that we want to be able to report problems and include the
> > > string and the user to be able to easily identify the device that was
> > > complaining and notice a difference in versions, and perhaps also use
> > > it in compatibility patterns to find compatible hosts; but that does
> > > get tricky when it's a 'ask the device if it's compatible'.  
> > 
> > In the reply I just sent to Dan, I gave this example of what a
> > "compatibility string" might look like represented as json:
> > 
> > {
> >   "device_api": "vfio-pci",
> >   "vendor": "vendor-driver-name",
> >   "version": {
> >     "major": 0,
> >     "minor": 1
> >   },
> >   "vfio-pci": { // Based on above device_api
> >     "vendor": 0x1234, // Values for the exposed device
> >     "device": 0x5678,
> >       // Possibly further parameters for a more specific match
> >   },
> >   "mdev_attrs": [
> >     { "attribute0": "VALUE" }
> >   ]
> > }
> > 
> > Are you thinking that we might allow the vendor to include a vendor
> > specific array where we'd simply require that both sides have matching
> > fields and values?  ie.
> > 
> >   "vendor_fields": [
> >     { "unknown_field0": "unknown_value0" },
> >     { "unknown_field1": "unknown_value1" },
> >   ]
> > 
> > We could certainly make that part of the spec, but I can't really
> > figure the value of it other than to severely restrict compatibility,
> > which the vendor could already do via the version.major value.  Maybe
> > they'd want to put a build timestamp, random uuid, or source sha1 into
> > such a field to make absolutely certain compatibility is only determined
> > between identical builds?  Thanks,
> >  
> Yes, I agree kernel could expose such sysfs interface to educate
> openstack how to filter out devices. But I still think the proposed
> migration_version (or rename to migration_compatibility) interface is
> still required for libvirt to do double check.
> 
> In the following scenario: 
> 1. openstack chooses the target device by reading sysfs interface (of json
> format) of the source device. And Openstack are now pretty sure the two
> devices are migration compatible.
> 2. openstack asks libvirt to create the target VM with the target device
> and start live migration.
> 3. libvirt now receives the request. so it now has two choices:
> (1) create the target VM & target device and start live migration directly
> (2) double check if the target device is compatible with the source
> device before doing the remaining tasks.
> 
> Because the factors to determine whether two devices are live migration
> compatible are complicated and may be dynamically changing, (e.g. driver
> upgrade or configuration changes), and also because libvirt should not
> totally rely on the input from openstack, I think the cost for libvirt is
> relatively lower if it chooses to go (2) than (1). At least it has no
> need to cancel migration and destroy the VM if it knows it earlier.
> 
> So, it means the kernel may need to expose two parallel interfaces:
> (1) with json format, enumerating all possible fields and comparing
> methods, so as to indicate openstack how to find a matching target device
> (2) an opaque driver defined string, requiring write and test in target,
> which is used by libvirt to make sure device compatibility, rather than
> rely on the input accurateness from openstack or rely on kernel driver
> implementing the compatibility detection immediately after migration
> start.
> 
> Does it make sense?

No, libvirt is not responsible for the success or failure of the
migration, it's the vendor driver's responsibility to encode
compatibility information early in the migration stream and error
should the incoming device prove to be incompatible.  It's not
libvirt's job to second guess the management engine and I would not
support a duplicate interface only for that purpose.  Thanks,

Alex
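
The split of responsibility described above, where the vendor driver puts
compatibility information at the start of the migration stream and the
receiving side errors out on a mismatch, could look roughly like this. The
header layout (a 4-byte marker, a length, then JSON) and all names are
assumptions for illustration, not an existing VFIO format:

```python
import json
import struct

MAGIC = b"MIGC"  # hypothetical 4-byte marker for the compatibility header


def write_header(info):
    """Serialize compatibility info as the first bytes of a migration stream."""
    payload = json.dumps(info, sort_keys=True).encode()
    return MAGIC + struct.pack("<I", len(payload)) + payload


def read_header(stream, accepts):
    """Parse the header and fail early if the source is not acceptable.

    accepts is a caller-supplied predicate standing in for the vendor
    driver's own compatibility logic. Returns (info, remaining_stream).
    """
    if stream[:4] != MAGIC:
        raise ValueError("missing compatibility header")
    (length,) = struct.unpack("<I", stream[4:8])
    info = json.loads(stream[8:8 + length])
    if not accepts(info):
        raise ValueError("incompatible source device: %r" % (info,))
    return info, stream[8 + length:]


stream = write_header({"driver": "i915", "version": 1}) + b"device-state"
info, rest = read_header(stream, lambda i: i["version"] <= 1)
```

With this arrangement the management layer never interprets the header. An
incompatible pairing surfaces as an early error on the destination side.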


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device compatibility interface for live migration with assigned devices
       [not found]         ` <CAH7mGatPWsczh_rbVhx4a+psJXvkZgKou3r5HrEQTqE7SqZkKA@mail.gmail.com>
@ 2020-07-17 15:18           ` Alex Williamson
  0 siblings, 0 replies; 48+ messages in thread
From: Alex Williamson @ 2020-07-17 15:18 UTC (permalink / raw)
  To: Alex Xu
  Cc: Dr. David Alan Gilbert, kvm, libvir-list, qemu-devel, kwankhede,
	eauger, Wang, Xin-ran, corbet, openstack-discuss, shaohe.feng,
	kevin.tian, Yan Zhao, eskultet, Ding, Jian-feng, zhenyuw, Xu,
	Hejie, bao.yumeng, Sean Mooney, intel-gvt-dev, cohuck, dinechin,
	devel

On Wed, 15 Jul 2020 15:37:19 +0800
Alex Xu <soulxu@gmail.com> wrote:

> Alex Williamson <alex.williamson@redhat.com> wrote on Wed, Jul 15, 2020 at 5:00 AM:
> 
> > On Tue, 14 Jul 2020 18:19:46 +0100
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> >  
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >  
> > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > > > hi folks,
> > > > > > > we are defining a device migration compatibility interface that helps upper
> > > > > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > > > > live migration compatible.
> > > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > > > > e.g. we could use it to check whether
> > > > > > - a src MDEV can migrate to a target MDEV,
> > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > >
> > > > > > > The upper layer stack could use this interface as the last step to check
> > > > > > > if one device is able to migrate to another device before triggering a real
> > > > > > > live migration procedure.
> > > > > > > we are not sure if this interface is of value or help to you. please don't
> > > > > > > hesitate to drop your valuable comments.
> > > > > >
> > > > > >
> > > > > > (1) interface definition
> > > > > > The interface is defined in below way:
> > > > > >
> > > > > >              __    userspace
> > > > > >               /\              \
> > > > > >              /                 \write
> > > > > >             / read              \
> > > > > >    ________/__________       ___\|/_____________
> > > > > >   | migration_version |     | migration_version |-->check migration
> > > > > >   ---------------------     ---------------------   compatibility
> > > > > >      device A                    device B
> > > > > >
> > > > > >
> > > > > > > a device attribute named migration_version is defined under each device's
> > > > > > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > > > > > > userspace tools read the migration_version as a string from the source device,
> > > > > > > and write it to the migration_version sysfs attribute in the target device.
> > > > > >
> > > > > > > The userspace should treat ANY of below conditions as two devices not compatible:
> > > > > > > - any one of the two devices does not have a migration_version attribute
> > > > > > - error when reading from migration_version attribute of one device
> > > > > > - error when writing migration_version string of one device to
> > > > > >   migration_version attribute of the other device
> > > > > >
> > > > > > > The string read from migration_version attribute is defined by device vendor
> > > > > > > driver and is completely opaque to the userspace.
> > > > > > > for a Intel vGPU, string format can be defined like
> > > > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > > > >
> > > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > >
> > > > > > for a QAT VF, it may be
> > > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > >
> > > > > > > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > > > > > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
> > > >
> > > > It's very strange to define it as opaque and then proceed to describe
> > > > the contents of that opaque string.  The point is that its contents
> > > > are defined by the vendor driver to describe the device, driver version,
> > > > and possibly metadata about the configuration of the device.  One
> > > > instance of a device might generate a different string from another.
> > > > The string that a device produces is not necessarily the only string
> > > > the vendor driver will accept, for example the driver might support
> > > > backwards compatible migrations.  
> > >
> > > (As I've said in the previous discussion, off one of the patch series)
> > >
> > > My view is it makes sense to have a half-way house on the opaqueness of
> > > this string; I'd expect to have an ID and version that are human
> > > readable, maybe a device ID/name that's human interpretable and then a
> > > bunch of other cruft that maybe device/vendor/version specific.
> > >
> > > I'm thinking that we want to be able to report problems and include the
> > > string and the user to be able to easily identify the device that was
> > > complaining and notice a difference in versions, and perhaps also use
> > > it in compatibility patterns to find compatible hosts; but that does
> > > get tricky when it's a 'ask the device if it's compatible'.  
> >
> > In the reply I just sent to Dan, I gave this example of what a
> > "compatibility string" might look like represented as json:
> >
> > {
> >   "device_api": "vfio-pci",
> >   "vendor": "vendor-driver-name",
> >   "version": {
> >     "major": 0,
> >     "minor": 1
> >   },
> >  
> 
> The OpenStack Placement service doesn't support filtering the target
> host by semver syntax, although we can code this filtering logic inside
> a scheduler filter in Python. Basically, placement only supports
> filtering hosts by traits (the same thing as labels or tags). The nova
> scheduler calls the placement service to filter the hosts first, then
> goes through all the scheduler filters. It would be better if the
> placement service could filter out more of the incompatible hosts in
> that first step.
> 
> 
> >   "vfio-pci": { // Based on above device_api
> >     "vendor": 0x1234, // Values for the exposed device
> >     "device": 0x5678,
> >       // Possibly further parameters for a more specific match
> >   },
> >  
> 
> OpenStack already uses the vendor and device ID to separate devices
> into different resource pools, and the scheduler filters hosts based on
> that, so I don't think it needs to be part of this compatibility string.


This is the part of the string that actually says what the resulting
device is, so it's a rather fundamental part of the description.  This
is where we'd determine that a physical to mdev migration is possible
or that different mdev types result in the same guest PCI device,
possibly with attributes set as specified later in the output.
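
As a rough sketch of that idea, comparing the guest-visible part of two
descriptors could look like the Python below.  The field names follow the
json example quoted in this thread; the helper names and the sample
descriptors are hypothetical, and none of this is an agreed interface.

```python
from typing import Any, Dict, Tuple

Descriptor = Dict[str, Any]


def exposed_device(descriptor: Descriptor) -> Tuple[str, int, int]:
    """Return (device_api, vendor ID, device ID) of the guest-visible device."""
    api = descriptor["device_api"]
    ids = descriptor[api]  # e.g. the "vfio-pci" block, keyed by device_api
    return (api, ids["vendor"], ids["device"])


def same_guest_device(src: Descriptor, dst: Descriptor) -> bool:
    """True if both descriptors expose the same device to the guest."""
    return exposed_device(src) == exposed_device(dst)


# Hypothetical descriptors: a physical VF and an mdev exposing the same IDs.
physical_vf = {
    "device_api": "vfio-pci",
    "vendor": "vendor-driver-name",
    "vfio-pci": {"vendor": 0x1234, "device": 0x5678},
}
mdev = {
    "device_api": "vfio-pci",
    "vendor": "vendor-driver-name",
    "vfio-pci": {"vendor": 0x1234, "device": 0x5678},
    "mdev_attrs": [{"attribute0": "VALUE"}],
}
other = {
    "device_api": "vfio-pci",
    "vendor": "vendor-driver-name",
    "vfio-pci": {"vendor": 0x1234, "device": 0x9999},
}

print(same_guest_device(physical_vf, mdev))
print(same_guest_device(physical_vf, other))
```

This only covers the "what is the resulting device" part; the version
and any required attributes would still need to be checked separately.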


> >   "mdev_attrs": [
> >     { "attribute0": "VALUE" }
> >   ]
> > }
> >
> > Are you thinking that we might allow the vendor to include a vendor
> > specific array where we'd simply require that both sides have matching
> > fields and values?  ie.


That's what I'm defining in the vendor_fields below.  The mdev_attrs
above would specify attributes of the device that must be set in
order to create the device with the compatibility described.  For
example, if we're describing compatibility for type foo-1, which is a
base type that can be equivalent to type foo-3 if type foo-1 is created
with aggregation=3, this is where that would be defined.
Thanks,

Alex

> >   "vendor_fields": [
> >     { "unknown_field0": "unknown_value0" },
> >     { "unknown_field1": "unknown_value1" },
> >   ]
> >  
> 
> Since placement supports traits (labels, tags), placement just needs to
> match those fields, so this isn't a problem for OpenStack, since OpenStack
> needn't know the meaning of those fields. But a trait is just a label,
> not a key-value pair. If we have to, we can code this as a scheduler
> filter in Python, but as above, the invalid hosts then can't be filtered
> out in the first step, the placement service filtering.
> 
> 
> > We could certainly make that part of the spec, but I can't really
> > figure the value of it other than to severely restrict compatibility,
> > which the vendor could already do via the version.major value.  Maybe
> > they'd want to put a build timestamp, random uuid, or source sha1 into
> > such a field to make absolutely certain compatibility is only determined
> > between identical builds?  Thanks,
> >
> > Alex
> >
> >  



* Re: device compatibility interface for live migration with assigned devices
  2020-07-16  8:32   ` Yan Zhao
  2020-07-16  9:30     ` Jason Wang
@ 2020-07-17 16:12     ` Alex Williamson
  2020-07-20  3:41       ` Jason Wang
  2020-07-21  0:51       ` Yan Zhao
  1 sibling, 2 replies; 48+ messages in thread
From: Alex Williamson @ 2020-07-17 16:12 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Jason Wang, devel, openstack-discuss, libvir-list, intel-gvt-dev,
	kvm, qemu-devel, berrange, smooney, eskultet, cohuck, dinechin,
	corbet, kwankhede, dgilbert, eauger, jian-feng.ding, hejie.xu,
	kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang, shaohe.feng

On Thu, 16 Jul 2020 16:32:30 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
> > 
> > On 2020/7/14 7:29 AM, Yan Zhao wrote:
> > > hi folks,
> > > we are defining a device migration compatibility interface that helps upper
> > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > live migration compatible.
> > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > e.g. we could use it to check whether
> > > - a src MDEV can migrate to a target MDEV,
> > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > - a src MDEV can migration to a target VF in SRIOV.
> > >    (e.g. SIOV/SRIOV backward compatibility case)
> > > 
> > > The upper layer stack could use this interface as the last step to check
> > > if one device is able to migrate to another device before triggering a real
> > > live migration procedure.
> > > we are not sure if this interface is of value or help to you. please don't
> > > hesitate to drop your valuable comments.
> > > 
> > > 
> > > (1) interface definition
> > > The interface is defined in below way:
> > > 
> > >               __    userspace
> > >                /\              \
> > >               /                 \write
> > >              / read              \
> > >     ________/__________       ___\|/_____________
> > >    | migration_version |     | migration_version |-->check migration
> > >    ---------------------     ---------------------   compatibility
> > >       device A                    device B
> > > 
> > > 
> > > a device attribute named migration_version is defined under each device's
> > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).  
> > 
> > 
> > Are you aware of the devlink based device management interface that is
> > proposed upstream? I think it has many advantages over sysfs, do you
> > consider to switch to that?  


Advantages, such as?


> not familiar with the devlink. will do some research of it.
> > 
> >   
> > > userspace tools read the migration_version as a string from the source device,
> > > and write it to the migration_version sysfs attribute in the target device.
> > > 
> > > The userspace should treat ANY of below conditions as two devices not compatible:
> > > - any one of the two devices does not have a migration_version attribute
> > > - error when reading from migration_version attribute of one device
> > > - error when writing migration_version string of one device to
> > >    migration_version attribute of the other device
> > > 
> > > The string read from migration_version attribute is defined by device vendor
> > > driver and is completely opaque to the userspace.  
> > 
> > 
> > My understanding is that something opaque to userspace is not the philosophy  
> 
> but the VFIO live migration in itself is essentially a big opaque stream to userspace.
> 
> > of Linux. Instead of having a generic API but opaque value, why not do in a
> > vendor specific way like:
> > 
> > 1) exposing the device capability in a vendor specific way via sysfs/devlink
> > or other API
> > 2) management read capability in both src and dst and determine whether we
> > can do the migration
> > 
> > This is the way we plan to do with vDPA.
> >  
> yes, in another reply, Alex proposed to use an interface in json format.
> I guess we can define something like
> 
> { "self" :
>   [
>     { "pciid" : "8086591d",
>       "driver" : "i915",
>       "gvt-version" : "v1",
>       "mdev_type"   : "i915-GVTg_V5_2",
>       "aggregator"  : "1",
>       "pv-mode"     : "none"
>     }
>   ],
>   "compatible" :
>   [
>     { "pciid" : "8086591d",
>       "driver" : "i915",
>       "gvt-version" : "v1",
>       "mdev_type"   : "i915-GVTg_V5_2",
>       "aggregator"  : "1",
>       "pv-mode"     : "none"
>     },
>     { "pciid" : "8086591d",
>       "driver" : "i915",
>       "gvt-version" : "v1",
>       "mdev_type"   : "i915-GVTg_V5_4",
>       "aggregator"  : "2",
>       "pv-mode"     : "none"
>     },
>     { "pciid" : "8086591d",
>       "driver" : "i915",
>       "gvt-version" : "v2",
>       "mdev_type"   : "i915-GVTg_V5_4",
>       "aggregator"  : "2",
>       "pv-mode"     : "none, ppgtt, context"
>     }
>     ...
>   ]
> }
> 
> But as those fields are mostly vendor specific, the userspace can
> only do simple string comparing, I guess the list would be very long as
> it needs to enumerate all possible targets.


This ignores so much of what I tried to achieve in my example :(


> also, in some fields like "gvt-version", is there a simple way to express
> things like v2+?


That's not a reasonable thing to express anyway; how can you be certain
that v3 won't break compatibility with v2?  Sean proposed a versioning
scheme that accounts for this, using an x.y.z version expressing the
major, minor, and bugfix versions, where there is no compatibility
across major versions, minor versions have forward compatibility (e.g.
1 -> 2 is ok, 2 -> 1 is not), and the bugfix version number indicates
some degree of internal improvement that is not visible to the user in
terms of features or compatibility, but provides a basis for preferring
equally compatible candidates.
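
A minimal sketch of those rules in Python, for illustration only.  The
names (Version, can_migrate, pick_target) are made up here, and the
tie-break of preferring the highest bugfix number is an assumption, since
the scheme only says the bugfix number gives a basis for preference.

```python
from typing import List, NamedTuple, Optional


class Version(NamedTuple):
    major: int
    minor: int
    bugfix: int


def parse_version(text: str) -> Version:
    """Parse an 'x.y.z' string into a Version."""
    major, minor, bugfix = (int(part) for part in text.split("."))
    return Version(major, minor, bugfix)


def can_migrate(src: Version, dst: Version) -> bool:
    """No compatibility across majors; minor may only stay equal or go up."""
    return src.major == dst.major and src.minor <= dst.minor


def pick_target(src: Version, candidates: List[Version]) -> Optional[Version]:
    """Among compatible candidates, prefer the highest bugfix (assumption)."""
    compatible = [c for c in candidates if can_migrate(src, c)]
    return max(compatible, key=lambda c: c.bugfix, default=None)


src = parse_version("1.1.0")
print(can_migrate(src, parse_version("1.2.0")))  # minor 1 -> 2
print(can_migrate(parse_version("1.2.0"), src))  # minor 2 -> 1
```

Note that the bugfix number never affects whether a migration is
allowed in this sketch; it is only used to rank candidates.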

 
> If the userspace can read this interface both in src and target and
> check whether both src and target are in corresponding compatible list, I
> think it will work for us.
> 
> But still, kernel should not rely on userspace's choice, the opaque
> compatibility string is still required in kernel. No matter whether
> it would be exposed to userspace as an compatibility checking interface,
> vendor driver would keep this part of code and embed the string into the
> migration stream. so exposing it as an interface to be used by libvirt to
> do a safety check before a real live migration is only about enabling
> the kernel part of check to happen ahead.

As you indicate, the vendor driver is responsible for checking version
information embedded within the migration stream.  Therefore a
migration should fail early if the devices are incompatible.  Is it
really libvirt's place to second guess what it has been directed to do?
Why would we even proceed to design a user parse-able version interface
if we still have a dependency on an opaque interface?  Thanks,

Alex



* Re: device compatibility interface for live migration with assigned devices
  2020-07-17 14:59           ` Alex Williamson
@ 2020-07-17 18:03             ` Dr. David Alan Gilbert
  2020-07-17 18:30               ` Alex Williamson
  0 siblings, 1 reply; 48+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-17 18:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yan Zhao, Daniel P. Berrangé,
	devel, openstack-discuss, libvir-list, intel-gvt-dev, kvm,
	qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, eauger, jian-feng.ding, hejie.xu, kevin.tian, zhenyuw,
	bao.yumeng, xin-ran.wang, shaohe.feng

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Wed, 15 Jul 2020 16:20:41 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:
> > > On Tue, 14 Jul 2020 18:19:46 +0100
> > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > >   
> > > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > > > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > >     
> > > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:    
> > > > > > > hi folks,
> > > > > > > we are defining a device migration compatibility interface that helps upper
> > > > > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > > > > live migration compatible.
> > > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > > > > > e.g. we could use it to check whether
> > > > > > > - a src MDEV can migrate to a target MDEV,
> > > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > > > 
> > > > > > > The upper layer stack could use this interface as the last step to check
> > > > > > > if one device is able to migrate to another device before triggering a real
> > > > > > > live migration procedure.
> > > > > > > we are not sure if this interface is of value or help to you. please don't
> > > > > > > hesitate to drop your valuable comments.
> > > > > > > 
> > > > > > > 
> > > > > > > (1) interface definition
> > > > > > > The interface is defined in below way:
> > > > > > > 
> > > > > > >              __    userspace
> > > > > > >               /\              \
> > > > > > >              /                 \write
> > > > > > >             / read              \
> > > > > > >    ________/__________       ___\|/_____________
> > > > > > >   | migration_version |     | migration_version |-->check migration
> > > > > > >   ---------------------     ---------------------   compatibility
> > > > > > >      device A                    device B
> > > > > > > 
> > > > > > > 
> > > > > > > a device attribute named migration_version is defined under each device's
> > > > > > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > > > > > > userspace tools read the migration_version as a string from the source device,
> > > > > > > and write it to the migration_version sysfs attribute in the target device.
> > > > > > > 
> > > > > > > The userspace should treat ANY of below conditions as two devices not compatible:
> > > > > > > - any one of the two devices does not have a migration_version attribute
> > > > > > > - error when reading from migration_version attribute of one device
> > > > > > > - error when writing migration_version string of one device to
> > > > > > >   migration_version attribute of the other device
> > > > > > > 
> > > > > > > The string read from migration_version attribute is defined by device vendor
> > > > > > > driver and is completely opaque to the userspace.
> > > > > > > for a Intel vGPU, string format can be defined like
> > > > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > > > > > 
> > > > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > > > 
> > > > > > > for a QAT VF, it may be
> > > > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > > > 
> > > > > > > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > > > > > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)    
> > > > > 
> > > > > It's very strange to define it as opaque and then proceed to describe
> > > > > the contents of that opaque string.  The point is that its contents
> > > > > are defined by the vendor driver to describe the device, driver version,
> > > > > and possibly metadata about the configuration of the device.  One
> > > > > instance of a device might generate a different string from another.
> > > > > The string that a device produces is not necessarily the only string
> > > > > the vendor driver will accept, for example the driver might support
> > > > > backwards compatible migrations.    
> > > > 
> > > > (As I've said in the previous discussion, off one of the patch series)
> > > > 
> > > > My view is it makes sense to have a half-way house on the opaqueness of
> > > > this string; I'd expect to have an ID and version that are human
> > > > readable, maybe a device ID/name that's human interpretable and then a
> > > > bunch of other cruft that maybe device/vendor/version specific.
> > > > 
> > > > I'm thinking that we want to be able to report problems and include the
> > > > string and the user to be able to easily identify the device that was
> > > > complaining and notice a difference in versions, and perhaps also use
> > > > it in compatibility patterns to find compatible hosts; but that does
> > > > get tricky when it's a 'ask the device if it's compatible'.  
> > > 
> > > In the reply I just sent to Dan, I gave this example of what a
> > > "compatibility string" might look like represented as json:
> > > 
> > > {
> > >   "device_api": "vfio-pci",
> > >   "vendor": "vendor-driver-name",
> > >   "version": {
> > >     "major": 0,
> > >     "minor": 1
> > >   },
> > >   "vfio-pci": { // Based on above device_api
> > >     "vendor": 0x1234, // Values for the exposed device
> > >     "device": 0x5678,
> > >       // Possibly further parameters for a more specific match
> > >   },
> > >   "mdev_attrs": [
> > >     { "attribute0": "VALUE" }
> > >   ]
> > > }
> > > 
> > > Are you thinking that we might allow the vendor to include a vendor
> > > specific array where we'd simply require that both sides have matching
> > > fields and values?  ie.
> > > 
> > >   "vendor_fields": [
> > >     { "unknown_field0": "unknown_value0" },
> > >     { "unknown_field1": "unknown_value1" },
> > >   ]
> > > 
> > > We could certainly make that part of the spec, but I can't really
> > > figure the value of it other than to severely restrict compatibility,
> > > which the vendor could already do via the version.major value.  Maybe
> > > they'd want to put a build timestamp, random uuid, or source sha1 into
> > > such a field to make absolutely certain compatibility is only determined
> > > between identical builds?  Thanks,
> > >  
> > Yes, I agree kernel could expose such sysfs interface to educate
> > openstack how to filter out devices. But I still think the proposed
> > migration_version (or rename to migration_compatibility) interface is
> > still required for libvirt to do double check.
> > 
> > In the following scenario: 
> > 1. openstack chooses the target device by reading sysfs interface (of json
> > format) of the source device. And Openstack are now pretty sure the two
> > devices are migration compatible.
> > 2. openstack asks libvirt to create the target VM with the target device
> > and start live migration.
> > 3. libvirt now receives the request. so it now has two choices:
> > (1) create the target VM & target device and start live migration directly
> > (2) double check if the target device is compatible with the source
> > device before doing the remaining tasks.
> > 
> > Because the factors to determine whether two devices are live migration
> > compatible are complicated and may be dynamically changing, (e.g. driver
> > upgrade or configuration changes), and also because libvirt should not
> > totally rely on the input from openstack, I think the cost for libvirt is
> > relatively lower if it chooses to go (2) than (1). At least it has no
> > need to cancel migration and destroy the VM if it knows it earlier.
> > 
> > So, it means the kernel may need to expose two parallel interfaces:
> > (1) with json format, enumerating all possible fields and comparing
> > methods, so as to indicate openstack how to find a matching target device
> > (2) an opaque driver defined string, requiring write and test in target,
> > which is used by libvirt to make sure device compatibility, rather than
> > rely on the input accurateness from openstack or rely on kernel driver
> > implementing the compatibility detection immediately after migration
> > start.
> > 
> > Does it make sense?
> 
> No, libvirt is not responsible for the success or failure of the
> migration, it's the vendor driver's responsibility to encode
> compatibility information early in the migration stream and error
> should the incoming device prove to be incompatible.  It's not
> libvirt's job to second guess the management engine and I would not
> support a duplicate interface only for that purpose.  Thanks,

libvirt does try to enforce it for other things; trying to stop a bad
migration from starting.

Dave

> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: device compatibility interface for live migration with assigned devices
  2020-07-17 18:03             ` Dr. David Alan Gilbert
@ 2020-07-17 18:30               ` Alex Williamson
  0 siblings, 0 replies; 48+ messages in thread
From: Alex Williamson @ 2020-07-17 18:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Yan Zhao, Daniel P. Berrangé,
	devel, openstack-discuss, libvir-list, intel-gvt-dev, kvm,
	qemu-devel, smooney, eskultet, cohuck, dinechin, corbet,
	kwankhede, eauger, jian-feng.ding, hejie.xu, kevin.tian, zhenyuw,
	bao.yumeng, xin-ran.wang, shaohe.feng

On Fri, 17 Jul 2020 19:03:44 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Alex Williamson (alex.williamson@redhat.com) wrote:
> > On Wed, 15 Jul 2020 16:20:41 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:  
> > > > On Tue, 14 Jul 2020 18:19:46 +0100
> > > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> > > >     
> > > > > * Alex Williamson (alex.williamson@redhat.com) wrote:    
> > > > > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > > > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > >       
> > > > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:      
> > > > > > > > hi folks,
> > > > > > > > we are defining a device migration compatibility interface that helps upper
> > > > > > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > > > > > live migration compatible.
> > > > > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > > > > > > e.g. we could use it to check whether
> > > > > > > > - a src MDEV can migrate to a target MDEV,
> > > > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > > > > 
> > > > > > > > The upper layer stack could use this interface as the last step to check
> > > > > > > > if one device is able to migrate to another device before triggering a real
> > > > > > > > live migration procedure.
> > > > > > > > we are not sure if this interface is of value or help to you. please don't
> > > > > > > > hesitate to drop your valuable comments.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > (1) interface definition
> > > > > > > > The interface is defined in below way:
> > > > > > > > 
> > > > > > > >              __    userspace
> > > > > > > >               /\              \
> > > > > > > >              /                 \write
> > > > > > > >             / read              \
> > > > > > > >    ________/__________       ___\|/_____________
> > > > > > > >   | migration_version |     | migration_version |-->check migration
> > > > > > > >   ---------------------     ---------------------   compatibility
> > > > > > > >      device A                    device B
> > > > > > > > 
> > > > > > > > 
> > > > > > > > a device attribute named migration_version is defined under each device's
> > > > > > > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > > > > > > > userspace tools read the migration_version as a string from the source device,
> > > > > > > > and write it to the migration_version sysfs attribute in the target device.
> > > > > > > > 
> > > > > > > > The userspace should treat ANY of below conditions as two devices not compatible:
> > > > > > > > - any one of the two devices does not have a migration_version attribute
> > > > > > > > - error when reading from migration_version attribute of one device
> > > > > > > > - error when writing migration_version string of one device to
> > > > > > > >   migration_version attribute of the other device
> > > > > > > > 
> > > > > > > > The string read from migration_version attribute is defined by device vendor
> > > > > > > > driver and is completely opaque to the userspace.
> > > > > > > > for a Intel vGPU, string format can be defined like
> > > > > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".
> > > > > > > > 
> > > > > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > > > > 
> > > > > > > > for a QAT VF, it may be
> > > > > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > > > > 
> > > > > > > > (to avoid namespace confliction from each vendor, we may prefix a driver name to
> > > > > > > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)      
> > > > > > 
> > > > > > It's very strange to define it as opaque and then proceed to describe
> > > > > > the contents of that opaque string.  The point is that its contents
> > > > > > are defined by the vendor driver to describe the device, driver version,
> > > > > > and possibly metadata about the configuration of the device.  One
> > > > > > instance of a device might generate a different string from another.
> > > > > > The string that a device produces is not necessarily the only string
> > > > > > the vendor driver will accept, for example the driver might support
> > > > > > backwards compatible migrations.      
> > > > > 
> > > > > (As I've said in the previous discussion, off one of the patch series)
> > > > > 
> > > > > My view is it makes sense to have a half-way house on the opaqueness of
> > > > > this string; I'd expect to have an ID and version that are human
> > > > > readable, maybe a device ID/name that's human interpretable and then a
> > > > > bunch of other cruft that maybe device/vendor/version specific.
> > > > > 
> > > > > I'm thinking that we want to be able to report problems and include the
> > > > > string and the user to be able to easily identify the device that was
> > > > > complaining and notice a difference in versions, and perhaps also use
> > > > > it in compatibility patterns to find compatible hosts; but that does
> > > > > get tricky when it's a 'ask the device if it's compatible'.    
> > > > 
> > > > In the reply I just sent to Dan, I gave this example of what a
> > > > "compatibility string" might look like represented as json:
> > > > 
> > > > {
> > > >   "device_api": "vfio-pci",
> > > >   "vendor": "vendor-driver-name",
> > > >   "version": {
> > > >     "major": 0,
> > > >     "minor": 1
> > > >   },
> > > >   "vfio-pci": { // Based on above device_api
> > > >     "vendor": 0x1234, // Values for the exposed device
> > > >     "device": 0x5678,
> > > >       // Possibly further parameters for a more specific match
> > > >   },
> > > >   "mdev_attrs": [
> > > >     { "attribute0": "VALUE" }
> > > >   ]
> > > > }
> > > > 
> > > > Are you thinking that we might allow the vendor to include a vendor
> > > > specific array where we'd simply require that both sides have matching
> > > > fields and values?  ie.
> > > > 
> > > >   "vendor_fields": [
> > > >     { "unknown_field0": "unknown_value0" },
> > > >     { "unknown_field1": "unknown_value1" },
> > > >   ]
> > > > 
> > > > We could certainly make that part of the spec, but I can't really
> > > > figure the value of it other than to severely restrict compatibility,
> > > > which the vendor could already do via the version.major value.  Maybe
> > > > they'd want to put a build timestamp, random uuid, or source sha1 into
> > > > such a field to make absolutely certain compatibility is only determined
> > > > between identical builds?  Thanks,
> > > >    
> > > Yes, I agree kernel could expose such sysfs interface to educate
> > > openstack how to filter out devices. But I still think the proposed
> > > migration_version (or rename to migration_compatibility) interface is
> > > still required for libvirt to do double check.
> > > 
> > > In the following scenario: 
> > > 1. openstack chooses the target device by reading sysfs interface (of json
> > > format) of the source device. And Openstack are now pretty sure the two
> > > devices are migration compatible.
> > > 2. openstack asks libvirt to create the target VM with the target device
> > > and start live migration.
> > > 3. libvirt now receives the request. so it now has two choices:
> > > (1) create the target VM & target device and start live migration directly
> > > (2) double check if the target device is compatible with the source
> > > device before doing the remaining tasks.
> > > 
> > > Because the factors to determine whether two devices are live migration
> > > compatible are complicated and may be dynamically changing, (e.g. driver
> > > upgrade or configuration changes), and also because libvirt should not
> > > totally rely on the input from openstack, I think the cost for libvirt is
> > > relatively lower if it chooses to go (2) than (1). At least it has no
> > > need to cancel migration and destroy the VM if it knows it earlier.
> > > 
> > > So, it means the kernel may need to expose two parallel interfaces:
> > > (1) with json format, enumerating all possible fields and comparing
> > > methods, so as to indicate openstack how to find a matching target device
> > > (2) an opaque driver defined string, requiring write and test in target,
> > > which is used by libvirt to make sure device compatibility, rather than
> > > rely on the input accurateness from openstack or rely on kernel driver
> > > implementing the compatibility detection immediately after migration
> > > start.
> > > 
> > > Does it make sense?  
> > 
> > No, libvirt is not responsible for the success or failure of the
> > migration, it's the vendor driver's responsibility to encode
> > compatibility information early in the migration stream and error
> > should the incoming device prove to be incompatible.  It's not
> > libvirt's job to second guess the management engine and I would not
> > support a duplicate interface only for that purpose.  Thanks,  
> 
> libvirt does try to enforce it for other things; trying to stop a bad
> migration from starting.

Even if libvirt did want to verify, why would we want to support a
separate opaque interface for that purpose versus a parse-able
interface?  If we get different results, we've failed.  Thanks,

Alex



* Re: device compatibility interface for live migration with assigned devices
  2020-07-17 16:12     ` Alex Williamson
@ 2020-07-20  3:41       ` Jason Wang
  2020-07-20 10:39         ` Sean Mooney
  2020-07-21  0:51       ` Yan Zhao
  1 sibling, 1 reply; 48+ messages in thread
From: Jason Wang @ 2020-07-20  3:41 UTC (permalink / raw)
  To: Alex Williamson, Yan Zhao
  Cc: devel, openstack-discuss, libvir-list, intel-gvt-dev, kvm,
	qemu-devel, berrange, smooney, eskultet, cohuck, dinechin,
	corbet, kwankhede, dgilbert, eauger, jian-feng.ding, hejie.xu,
	kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang, shaohe.feng


On 2020/7/18 12:12 AM, Alex Williamson wrote:
> On Thu, 16 Jul 2020 16:32:30 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
>
>> On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
>>> On 2020/7/14 7:29 AM, Yan Zhao wrote:
>>>> hi folks,
>>>> we are defining a device migration compatibility interface that helps upper
>>>> layer stack like openstack/ovirt/libvirt to check if two devices are
>>>> live migration compatible.
>>>> The "devices" here could be MDEVs, physical devices, or hybrid of the two.
>>>> e.g. we could use it to check whether
>>>> - a src MDEV can migrate to a target MDEV,
>>>> - a src VF in SRIOV can migrate to a target VF in SRIOV,
>>>> - a src MDEV can migration to a target VF in SRIOV.
>>>>     (e.g. SIOV/SRIOV backward compatibility case)
>>>>
>>>> The upper layer stack could use this interface as the last step to check
>>>> if one device is able to migrate to another device before triggering a real
>>>> live migration procedure.
>>>> we are not sure if this interface is of value or help to you. please don't
>>>> hesitate to drop your valuable comments.
>>>>
>>>>
>>>> (1) interface definition
>>>> The interface is defined in below way:
>>>>
>>>>                __    userspace
>>>>                 /\              \
>>>>                /                 \write
>>>>               / read              \
>>>>      ________/__________       ___\|/_____________
>>>>     | migration_version |     | migration_version |-->check migration
>>>>     ---------------------     ---------------------   compatibility
>>>>        device A                    device B
>>>>
>>>>
>>>> a device attribute named migration_version is defined under each device's
>>>> sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
>>>
>>> Are you aware of the devlink based device management interface that is
>>> proposed upstream? I think it has many advantages over sysfs, do you
>>> consider to switch to that?
>
> Advantages, such as?


My understanding of the advantages of devlink (netlink) over sysfs (some
were mentioned during the vDPA sysfs mgmt API discussion):

- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack rather than a string or errno)
- namespace aware
- not coupled with kobject

Thanks



* Re: device compatibility interface for live migration with assigned devices
  2020-07-20  3:41       ` Jason Wang
@ 2020-07-20 10:39         ` Sean Mooney
  2020-07-21  2:11           ` Jason Wang
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Mooney @ 2020-07-20 10:39 UTC (permalink / raw)
  To: Jason Wang, Alex Williamson, Yan Zhao
  Cc: devel, openstack-discuss, libvir-list, intel-gvt-dev, kvm,
	qemu-devel, berrange, eskultet, cohuck, dinechin, corbet,
	kwankhede, dgilbert, eauger, jian-feng.ding, hejie.xu,
	kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang, shaohe.feng

On Mon, 2020-07-20 at 11:41 +0800, Jason Wang wrote:
> On 2020/7/18 上午12:12, Alex Williamson wrote:
> > On Thu, 16 Jul 2020 16:32:30 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > 
> > > On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
> > > > On 2020/7/14 上午7:29, Yan Zhao wrote:
> > > > > hi folks,
> > > > > we are defining a device migration compatibility interface that helps upper
> > > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > > live migration compatible.
> > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > > > > e.g. we could use it to check whether
> > > > > - a src MDEV can migrate to a target MDEV,
> > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > >     (e.g. SIOV/SRIOV backward compatibility case)
> > > > > 
> > > > > The upper layer stack could use this interface as the last step to check
> > > > > if one device is able to migrate to another device before triggering a real
> > > > > live migration procedure.
> > > > > we are not sure if this interface is of value or help to you. please don't
> > > > > hesitate to drop your valuable comments.
> > > > > 
> > > > > 
> > > > > (1) interface definition
> > > > > The interface is defined in below way:
> > > > > 
> > > > >                __    userspace
> > > > >                 /\              \
> > > > >                /                 \write
> > > > >               / read              \
> > > > >      ________/__________       ___\|/_____________
> > > > >     | migration_version |     | migration_version |-->check migration
> > > > >     ---------------------     ---------------------   compatibility
> > > > >        device A                    device B
> > > > > 
> > > > > 
> > > > > a device attribute named migration_version is defined under each device's
> > > > > sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > > > 
> > > > Are you aware of the devlink based device management interface that is
> > > > proposed upstream? I think it has many advantages over sysfs, do you
> > > > consider to switch to that?
> > 
> > Advantages, such as?
> 
> 
> My understanding for devlink(netlink) over sysfs (some are mentioned at 
> the time of vDPA sysfs mgmt API discussion) are:
I thought netlink was used more as a configuration protocol to query and
configure NICs and, I guess, other devices in its devlink form, requiring a
tool to be written that can speak the protocol in order to interact with it.
The primary advantage of sysfs is that everything is just a file. There are no
additional dependencies needed, and unlike netlink there are no
interoperability issues in a containerised env. If you are using different
versions of libc and gcc in the container vs. the host, my understanding is
that tools like ethtool from Ubuntu deployed in a container on a CentOS host
can have issues communicating with the host kernel. If it's just a file, then
unless the format the data is returned in changes or the layout of sysfs
changes, it's compatible regardless of what you use to read it.
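
To illustrate the "just a file" point, here is a minimal Python sketch of the
read-then-write check from the original proposal. The function names and the
paths passed in are hypothetical, and treating a failed write as "not
compatible" is an assumption about how the attribute would report a mismatch.

```python
from pathlib import Path


def read_attr(path):
    """Read a sysfs-style attribute and return it as a stripped string."""
    return Path(path).read_text().strip()


def check_migration_version(src_attr, dst_attr):
    """Read the version string from the source attribute and write it to the
    target attribute. A failed write is taken to mean "incompatible"
    (assumption, not confirmed by the thread)."""
    version = read_attr(src_attr)
    try:
        Path(dst_attr).write_text(version)
    except OSError:
        return False
    return True
```

Nothing beyond ordinary file I/O is needed, which is the point being made
about sysfs above.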
> 
> - existing users (NIC, crypto, SCSI, ib), mature and stable
> - much better error reporting (ext_ack other than string or errno)
> - namespace aware
> - do not couple with kobject
> 
> Thanks
> 



* Re: device compatibility interface for live migration with assigned devices
  2020-07-17 16:12     ` Alex Williamson
  2020-07-20  3:41       ` Jason Wang
@ 2020-07-21  0:51       ` Yan Zhao
  2020-07-27  7:24         ` Yan Zhao
  1 sibling, 1 reply; 48+ messages in thread
From: Yan Zhao @ 2020-07-21  0:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Wang, devel, openstack-discuss, libvir-list, intel-gvt-dev,
	kvm, qemu-devel, berrange, smooney, eskultet, cohuck, dinechin,
	corbet, kwankhede, dgilbert, eauger, jian-feng.ding, hejie.xu,
	kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang, shaohe.feng

On Fri, Jul 17, 2020 at 10:12:58AM -0600, Alex Williamson wrote:
<...>
> > yes, in another reply, Alex proposed to use an interface in json format.
> > I guess we can define something like
> > 
> > { "self" :
> >   [
> >     { "pciid" : "8086591d",
> >       "driver" : "i915",
> >       "gvt-version" : "v1",
> >       "mdev_type"   : "i915-GVTg_V5_2",
> >       "aggregator"  : "1",
> >       "pv-mode"     : "none"
> >     }
> >   ],
> >   "compatible" :
> >   [
> >     { "pciid" : "8086591d",
> >       "driver" : "i915",
> >       "gvt-version" : "v1",
> >       "mdev_type"   : "i915-GVTg_V5_2",
> >       "aggregator"  : "1",
> >       "pv-mode"     : "none"
> >     },
> >     { "pciid" : "8086591d",
> >       "driver" : "i915",
> >       "gvt-version" : "v1",
> >       "mdev_type"   : "i915-GVTg_V5_4",
> >       "aggregator"  : "2",
> >       "pv-mode"     : "none"
> >     },
> >     { "pciid" : "8086591d",
> >       "driver" : "i915",
> >       "gvt-version" : "v2",
> >       "mdev_type"   : "i915-GVTg_V5_4",
> >       "aggregator"  : "2",
> >       "pv-mode"     : "none, ppgtt, context"
> >     }
> >     ...
> >   ]
> > }
> > 
> > But as those fields are mostly vendor specific, the userspace can
> > only do simple string comparing, I guess the list would be very long as
> > it needs to enumerate all possible targets.
> 
> 
> This ignores so much of what I tried to achieve in my example :(
> 
Sorry, I was just eager to show and confirm the way to list all compatible
combinations of mdev_type and mdev attributes.

> 
> > also, in some fileds like "gvt-version", is there a simple way to express
> > things like v2+?
> 
> 
> That's not a reasonable thing to express anyway, how can you be certain
> that v3 won't break compatibility with v2?  Sean proposed a versioning
> scheme that accounts for this, using an x.y.z version expressing the
> major, minor, and bugfix versions, where there is no compatibility
> across major versions, minor versions have forward compatibility (ex. 1
> -> 2 is ok, 2 -> 1 is not) and bugfix version number indicates some
> degree of internal improvement that is not visible to the user in terms
> of features or compatibility, but provides a basis for preferring
> equally compatible candidates.
>
Right. If the self version is v1, it can't know that v2 is compatible with it.
It can only be done in reverse, i.e. when the self version is v2, it can list
v1 and v2 as its compatible versions.
And maybe later, when the self version is v3, there's no v1 in its compatible
list.

Given this, do you think we still need the complex x.y.z versioning scheme?
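
For reference, a minimal Python sketch of the x.y.z rule quoted above: no
compatibility across major versions, forward-only compatibility across minor
versions, and the bugfix number used only to rank candidates. All names are
illustrative, and preferring the highest bugfix number is an assumption; the
thread only says the bugfix number gives a basis for preference.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    bugfix: int


def parse_version(text):
    """Parse an "x.y.z" string into a Version."""
    major, minor, bugfix = (int(part) for part in text.split("."))
    return Version(major, minor, bugfix)


def can_migrate(src, dst):
    """Same major version required; minor may only stay equal or increase
    (1 -> 2 is ok, 2 -> 1 is not). The bugfix number is ignored here."""
    return src.major == dst.major and src.minor <= dst.minor


def best_target(src, candidates):
    """Among compatible candidates, prefer the highest bugfix number
    (assumption). Returns None if nothing is compatible."""
    compatible = [c for c in candidates if can_migrate(src, c)]
    return max(compatible, key=lambda v: v.bugfix, default=None)
```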

>  
> > If the userspace can read this interface both in src and target and
> > check whether both src and target are in corresponding compatible list, I
> > think it will work for us.
> > 
> > But still, kernel should not rely on userspace's choice, the opaque
> > compatibility string is still required in kernel. No matter whether
> > it would be exposed to userspace as an compatibility checking interface,
> > vendor driver would keep this part of code and embed the string into the
> > migration stream. so exposing it as an interface to be used by libvirt to
> > do a safety check before a real live migration is only about enabling
> > the kernel part of check to happen ahead.
> 
> As you indicate, the vendor driver is responsible for checking version
> information embedded within the migration stream.  Therefore a
> migration should fail early if the devices are incompatible.  Is it
But as far as I know, the VFIO migration protocol currently gives us no way to
get the vendor-specific compatibility checking string in the migration setup
stage (i.e. the .save_setup stage) before the device is set to the _SAVING
state. As a result, for devices that do not save device data in the precopy
stage, the migration compatibility check happens as late as the stop-and-copy
stage, which is too late.
Do you think we need to add getting/checking of the vendor-specific
compatibility string early, in the save_setup stage?

> really libvirt's place to second guess what it has been directed to do?
If libvirt uses the scheme of reading the compatibility string at the source
and writing it at the target for checking, that cannot be called "a second
guess". It's not a guess, but a confirmation.

> Why would we even proceed to design a user parse-able version interface
> if we still have a dependency on an opaque interface?  Thanks,
One reason is that libvirt can't trust the parsing result from
openstack.
Another reason is that libvirt can use this opaque interface more easily than
doing another round of parsing by itself, and it would not add any burden to
the kernel, which has to implement this part of the code anyway, whether
libvirt uses it or not.
 
Thanks
Yan


* Re: device compatibility interface for live migration with assigned devices
  2020-07-20 10:39         ` Sean Mooney
@ 2020-07-21  2:11           ` Jason Wang
  0 siblings, 0 replies; 48+ messages in thread
From: Jason Wang @ 2020-07-21  2:11 UTC (permalink / raw)
  To: Sean Mooney, Alex Williamson, Yan Zhao
  Cc: devel, openstack-discuss, libvir-list, intel-gvt-dev, kvm,
	qemu-devel, berrange, eskultet, cohuck, dinechin, corbet,
	kwankhede, dgilbert, eauger, jian-feng.ding, hejie.xu,
	kevin.tian, zhenyuw, bao.yumeng, xin-ran.wang, shaohe.feng


On 2020/7/20 下午6:39, Sean Mooney wrote:
> On Mon, 2020-07-20 at 11:41 +0800, Jason Wang wrote:
>> On 2020/7/18 上午12:12, Alex Williamson wrote:
>>> On Thu, 16 Jul 2020 16:32:30 +0800
>>> Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>
>>>> On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
>>>>> On 2020/7/14 上午7:29, Yan Zhao wrote:
>>>>>> hi folks,
>>>>>> we are defining a device migration compatibility interface that helps upper
>>>>>> layer stack like openstack/ovirt/libvirt to check if two devices are
>>>>>> live migration compatible.
>>>>>> The "devices" here could be MDEVs, physical devices, or hybrid of the two.
>>>>>> e.g. we could use it to check whether
>>>>>> - a src MDEV can migrate to a target MDEV,
>>>>>> - a src VF in SRIOV can migrate to a target VF in SRIOV,
>>>>>> - a src MDEV can migration to a target VF in SRIOV.
>>>>>>      (e.g. SIOV/SRIOV backward compatibility case)
>>>>>>
>>>>>> The upper layer stack could use this interface as the last step to check
>>>>>> if one device is able to migrate to another device before triggering a real
>>>>>> live migration procedure.
>>>>>> we are not sure if this interface is of value or help to you. please don't
>>>>>> hesitate to drop your valuable comments.
>>>>>>
>>>>>>
>>>>>> (1) interface definition
>>>>>> The interface is defined in below way:
>>>>>>
>>>>>>                 __    userspace
>>>>>>                  /\              \
>>>>>>                 /                 \write
>>>>>>                / read              \
>>>>>>       ________/__________       ___\|/_____________
>>>>>>      | migration_version |     | migration_version |-->check migration
>>>>>>      ---------------------     ---------------------   compatibility
>>>>>>         device A                    device B
>>>>>>
>>>>>>
>>>>>> a device attribute named migration_version is defined under each device's
>>>>>> sysfs node. e.g. (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
>>>>> Are you aware of the devlink based device management interface that is
>>>>> proposed upstream? I think it has many advantages over sysfs, do you
>>>>> consider to switch to that?
>>> Advantages, such as?
>>
>> My understanding for devlink(netlink) over sysfs (some are mentioned at
>> the time of vDPA sysfs mgmt API discussion) are:
> I thought netlink was used more as a configuration protocol to query and
> configure NICs and, I guess, other devices in its devlink form, requiring a
> tool to be written that can speak the protocol in order to interact with it.
> The primary advantage of sysfs is that everything is just a file. There are
> no additional dependencies needed


Well, if you try to build logic like introspection on top of it for
sophisticated hardware, you probably need a library on top anyway. And
having one attribute per file is pretty inefficient.


>   and unlike netlink there are no interoperability issues in a containerised
> env. If you are using different versions of libc and gcc in the container vs.
> the host, my understanding is that tools like ethtool from Ubuntu deployed in
> a container on a CentOS host can have issues communicating with the host kernel.


The kernel provides a stable ABI for userspace, so it's not something that we
can't fix.


> If it's just a file, then unless the format the data is returned in changes
> or the layout of sysfs changes, it's compatible regardless of what you use to
> read it.


I believe you can't change the sysfs layout, which is part of the uABI. But as
I mentioned below, sysfs has several drawbacks. There's no harm in comparing
different approaches when you start a new device management API.

Thanks


>> - existing users (NIC, crypto, SCSI, ib), mature and stable
>> - much better error reporting (ext_ack other than string or errno)
>> - namespace aware
>> - do not couple with kobject
>>
>> Thanks
>>



* Re: device compatibility interface for live migration with assigned devices
  2020-07-21  0:51       ` Yan Zhao
@ 2020-07-27  7:24         ` Yan Zhao
  2020-07-27 22:23           ` Alex Williamson
  0 siblings, 1 reply; 48+ messages in thread
From: Yan Zhao @ 2020-07-27  7:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, libvir-list, Jason Wang, qemu-devel, kwankhede, eauger,
	xin-ran.wang, corbet, openstack-discuss, shaohe.feng, kevin.tian,
	eskultet, jian-feng.ding, dgilbert, zhenyuw, hejie.xu,
	bao.yumeng, smooney, intel-gvt-dev, berrange, cohuck, dinechin,
	devel

> > As you indicate, the vendor driver is responsible for checking version
> > information embedded within the migration stream.  Therefore a
> > migration should fail early if the devices are incompatible.  Is it
> but as I know, currently in VFIO migration protocol, we have no way to
> get vendor specific compatibility checking string in migration setup stage
> (i.e. .save_setup stage) before the device is set to _SAVING state.
> In this way, for devices who does not save device data in precopy stage,
> the migration compatibility checking is as late as in stop-and-copy
> stage, which is too late.
> do you think we need to add the getting/checking of vendor specific
> compatibility string early in save_setup stage?
>
hi Alex,
after an offline discussion with Kevin, I realized that it may not be a
problem if the migration compatibility check in the vendor driver occurs late,
in the stop-and-copy phase, for some devices, because if we report device
compatibility attributes clearly in an interface, the chance of
libvirt/openstack making a wrong decision is small.
So, do you think we have now reached an agreement to give up
the read-and-test scheme and start defining one interface (perhaps in
json format) from which libvirt/openstack is able to parse and find out
the compatibility list of a source mdev/physical device?

Thanks
Yan


* Re: device compatibility interface for live migration with assigned devices
  2020-07-27  7:24         ` Yan Zhao
@ 2020-07-27 22:23           ` Alex Williamson
  2020-07-29  8:05             ` Yan Zhao
  2020-07-29 19:05             ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 48+ messages in thread
From: Alex Williamson @ 2020-07-27 22:23 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, libvir-list, Jason Wang, qemu-devel, kwankhede, eauger,
	xin-ran.wang, corbet, openstack-discuss, shaohe.feng, kevin.tian,
	eskultet, jian-feng.ding, dgilbert, zhenyuw, hejie.xu,
	bao.yumeng, smooney, intel-gvt-dev, berrange, cohuck, dinechin,
	devel

On Mon, 27 Jul 2020 15:24:40 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> > > As you indicate, the vendor driver is responsible for checking version
> > > information embedded within the migration stream.  Therefore a
> > > migration should fail early if the devices are incompatible.  Is it  
> > but as I know, currently in VFIO migration protocol, we have no way to
> > get vendor specific compatibility checking string in migration setup stage
> > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > In this way, for devices who does not save device data in precopy stage,
> > the migration compatibility checking is as late as in stop-and-copy
> > stage, which is too late.
> > do you think we need to add the getting/checking of vendor specific
> > compatibility string early in save_setup stage?
> >  
> hi Alex,
> after an offline discussion with Kevin, I realized that it may not be a
> problem if migration compatibility check in vendor driver occurs late in
> stop-and-copy phase for some devices, because if we report device
> compatibility attributes clearly in an interface, the chances for
> libvirt/openstack to make a wrong decision is little.

I think it would be wise for a vendor driver to implement a pre-copy
phase, even if only to send version information and verify it at the
target.  Deciding you have no device state to send during pre-copy does
not mean your vendor driver needs to opt-out of the pre-copy phase
entirely.  Please also note that pre-copy is at the user's discretion,
we've defined that we can enter stop-and-copy at any point, including
without a pre-copy phase, so I would recommend that vendor drivers
validate compatibility at the start of both the pre-copy and the
stop-and-copy phases.

> so, do you think we are now arriving at an agreement that we'll give up
> the read-and-test scheme and start to defining one interface (perhaps in
> json format), from which libvirt/openstack is able to parse and find out
> compatibility list of a source mdev/physical device?

Based on the feedback we've received, the previously proposed interface
is not viable.  I think there's agreement that the user needs to be
able to parse and interpret the version information.  Using json seems
viable, but I don't know if it's the best option.  Is there any
precedent of markup strings returned via sysfs we could follow?

Your idea of having both a "self" object and an array of "compatible"
objects is perhaps something we can build on, but we must not assume
PCI devices at the root level of the object.  Providing both the
mdev-type and the driver is a bit redundant, since the former includes
the latter.  We can't have vendor specific versioning schemes though,
ie. gvt-version. We need to agree on a common scheme and decide which
fields the version is relative to, ex. just the mdev type?

I had also proposed fields that provide information to create a
compatible type, for example to create a type_x2 device from a type_x1
mdev type, they need to know to apply an aggregation attribute.  If we
need to explicitly list every aggregation value and the resulting type,
I think we run aground of what aggregation was trying to avoid anyway,
so we might need to pick a language that defines variable substitution
or some kind of tagging.  For example if we could define ${aggr} as an
integer within a specified range, then we might be able to define a type
relative to that value (type_x${aggr}) which requires an aggregation
attribute using the same value.  I dunno, just spit balling.  Thanks,

Alex



* Re: device compatibility interface for live migration with assigned devices
  2020-07-27 22:23           ` Alex Williamson
@ 2020-07-29  8:05             ` Yan Zhao
  2020-07-29 11:28               ` Sean Mooney
  2020-08-04 16:35               ` Cornelia Huck
  2020-07-29 19:05             ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 48+ messages in thread
From: Yan Zhao @ 2020-07-29  8:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, libvir-list, Jason Wang, qemu-devel, kwankhede, eauger,
	xin-ran.wang, corbet, openstack-discuss, shaohe.feng, kevin.tian,
	eskultet, jian-feng.ding, dgilbert, zhenyuw, hejie.xu,
	bao.yumeng, smooney, intel-gvt-dev, berrange, cohuck, dinechin,
	devel

On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> On Mon, 27 Jul 2020 15:24:40 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > > > As you indicate, the vendor driver is responsible for checking version
> > > > information embedded within the migration stream.  Therefore a
> > > > migration should fail early if the devices are incompatible.  Is it  
> > > but as I know, currently in VFIO migration protocol, we have no way to
> > > get vendor specific compatibility checking string in migration setup stage
> > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > In this way, for devices who does not save device data in precopy stage,
> > > the migration compatibility checking is as late as in stop-and-copy
> > > stage, which is too late.
> > > do you think we need to add the getting/checking of vendor specific
> > > compatibility string early in save_setup stage?
> > >  
> > hi Alex,
> > after an offline discussion with Kevin, I realized that it may not be a
> > problem if migration compatibility check in vendor driver occurs late in
> > stop-and-copy phase for some devices, because if we report device
> > compatibility attributes clearly in an interface, the chances for
> > libvirt/openstack to make a wrong decision is little.
> 
> I think it would be wise for a vendor driver to implement a pre-copy
> phase, even if only to send version information and verify it at the
> target.  Deciding you have no device state to send during pre-copy does
> not mean your vendor driver needs to opt-out of the pre-copy phase
> entirely.  Please also note that pre-copy is at the user's discretion,
> we've defined that we can enter stop-and-copy at any point, including
> without a pre-copy phase, so I would recommend that vendor drivers
> validate compatibility at the start of both the pre-copy and the
> stop-and-copy phases.
>
ok. got it!

> > so, do you think we are now arriving at an agreement that we'll give up
> > the read-and-test scheme and start to defining one interface (perhaps in
> > json format), from which libvirt/openstack is able to parse and find out
> > compatibility list of a source mdev/physical device?
> 
> Based on the feedback we've received, the previously proposed interface
> is not viable.  I think there's agreement that the user needs to be
> able to parse and interpret the version information.  Using json seems
> viable, but I don't know if it's the best option.  Is there any
> precedent of markup strings returned via sysfs we could follow?
I found some examples of formatted strings under /sys, mostly under
tracing. Maybe we can do a similar implementation.

#cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format

name: kvm_mmio
ID: 32
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:u32 type; offset:8;       size:4; signed:0;
        field:u32 len;  offset:12;      size:4; signed:0;
        field:u64 gpa;  offset:16;      size:8; signed:0;
        field:u64 val;  offset:24;      size:8; signed:0;

print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" }, { 2, "write" }), REC->len, REC->gpa, REC->val


#cat /sys/devices/pci0000:00/0000:00:02.0/uevent
DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00

> 
> Your idea of having both a "self" object and an array of "compatible"
> objects is perhaps something we can build on, but we must not assume
> PCI devices at the root level of the object.  Providing both the
> mdev-type and the driver is a bit redundant, since the former includes
> the latter.  We can't have vendor specific versioning schemes though,
> ie. gvt-version. We need to agree on a common scheme and decide which
> fields the version is relative to, ex. just the mdev type?
What about making all compared fields vendor specific?
Userspace like openstack would only need to parse them and check whether the
target device is within the source's compatible list, without understanding
the meaning of each field.

> I had also proposed fields that provide information to create a
> compatible type, for example to create a type_x2 device from a type_x1
> mdev type, they need to know to apply an aggregation attribute.  If we
> need to explicitly list every aggregation value and the resulting type,
> I think we run aground of what aggregation was trying to avoid anyway,
> so we might need to pick a language that defines variable substitution
> or some kind of tagging.  For example if we could define ${aggr} as an
> integer within a specified range, then we might be able to define a type
> relative to that value (type_x${aggr}) which requires an aggregation
> attribute using the same value.  I dunno, just spit balling.  Thanks,
What about a migration_compatible attribute under the device node, like
below?

#cat /sys/bus/pci/devices/0000\:00\:02.0/UUID1/migration_compatible
SELF:
	device_type=pci
	device_id=8086591d
	mdev_type=i915-GVTg_V5_2
	aggregator=1
	pv_mode="none+ppgtt+context"
	interface_version=3
COMPATIBLE:
	device_type=pci
	device_id=8086591d
	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
	aggregator={val1}/2
	pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"} 
	interface_version={val3:int:2,3}
COMPATIBLE:
	device_type=pci
	device_id=8086591d
	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
	aggregator={val1}/2
	pv_mode=""  #"" meaning empty, could be absent in a compatible device
	interface_version=1


#cat /sys/bus/pci/devices/0000\:00\:02.0/UUID2/migration_compatible
SELF:
	device_type=pci
	device_id=8086591d
	mdev_type=i915-GVTg_V5_4
	aggregator=2
	interface_version=1
COMPATIBLE: 
	device_type=pci
	device_id=8086591d
	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
	aggregator={val1}/2
	interface_version=1


Notes:
- A COMPATIBLE object is introduced by a line starting with COMPATIBLE.
  It specifies a list of compatible devices that are allowed to migrate
  in.
  Multiple COMPATIBLE objects are allowed so that, when it is hard to
  express complex compatibility logic in one COMPATIBLE object, simple
  enumeration is still available as a fallback.
  In the above example, device UUID2 is in the compatible list of
  device UUID1, but device UUID1 is not in the compatible list of device
  UUID2, so device UUID2 is able to migrate to device UUID1, but device
  UUID1 is not able to migrate to device UUID2.

- Fields under each object have an "and" relationship to each other, meaning
  all fields of the SELF object of a target device must be equal to the
  corresponding fields of a COMPATIBLE object of the source device; otherwise
  the devices are regarded as not compatible.

- Each field, however, is able to specify multiple allowed values, using
  variables as explained below.

- Variables are represented with {}. The first appearance of a variable
  specifies its type and list of allowed values, e.g.
  {val1:int:1,2,4,8} represents val1, whose type is integer and whose allowed
  values are 1, 2, 4, 8.

- Vendors are able to specify which fields are within the comparison list
  and which fields are not, e.g. for physical VF migration, a vendor may not
  choose mdev_type as a compared field, and may use the driver name instead.
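
The notes above could be sketched in userspace roughly as follows. This is a
minimal Python sketch under several assumptions that the proposal does not
settle: all function names are hypothetical, "{val1}/2" is read as integer
division, text after whitespace followed by "#" is treated as a comment, and a
field missing from either object is treated as the empty string. Variable
definitions must appear before the fields that reference them.

```python
import re

VAR_DEF = re.compile(r"\{(\w+):(\w+):([^}]*)\}")  # {name:type:v1,v2,...}
VAR_REF = re.compile(r"\{(\w+)\}")                 # {name}


def parse_objects(text):
    """Return (SELF dict, list of COMPATIBLE dicts) from the attribute text."""
    self_obj, compat, current = {}, [], None
    for raw in text.splitlines():
        line = re.split(r"\s+#", raw, maxsplit=1)[0].strip()  # drop comments
        if line == "SELF:":
            current = self_obj
        elif line == "COMPATIBLE:":
            current = {}
            compat.append(current)
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return self_obj, compat


def unquote(value):
    return value.strip().strip('"')


def match_field(pattern, value, bound):
    """Check one SELF value against one COMPATIBLE pattern; record variables."""
    value = unquote(value)
    definition = VAR_DEF.search(pattern)
    if definition:
        name, _type, allowed = definition.groups()
        prefix, suffix = pattern[:definition.start()], pattern[definition.end():]
        if len(value) < len(prefix) + len(suffix):
            return False
        if not (value.startswith(prefix) and value.endswith(suffix)):
            return False
        got = value[len(prefix):len(value) - len(suffix)]
        if got not in [unquote(choice) for choice in allowed.split(",")]:
            return False
        bound[name] = got
        return True
    refs = VAR_REF.findall(pattern)
    if any(name not in bound for name in refs):
        return False  # reference to a variable that was never defined
    expr = VAR_REF.sub(lambda ref: bound[ref.group(1)], pattern)
    division = re.fullmatch(r"(\d+)/(\d+)", expr)
    if refs and division:  # assumed meaning of "{val1}/2": integer division
        expr = str(int(division.group(1)) // int(division.group(2)))
    return unquote(expr) == value


def matches(self_obj, compat_obj):
    """All fields must match ("and" relationship); missing fields count as ""."""
    bound = {}
    keys = list(compat_obj) + [k for k in self_obj if k not in compat_obj]
    return all(match_field(compat_obj.get(k, ""), self_obj.get(k, ""), bound)
               for k in keys)


def in_compatible_list(self_obj, compat_list):
    """True if the SELF object matches at least one COMPATIBLE object."""
    return any(matches(self_obj, compat) for compat in compat_list)
```

Fed the two example attributes above, this sketch reports UUID2's SELF object
as being in UUID1's compatible list but not the reverse, which agrees with the
first note.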
 

Thanks
Yan




* Re: device compatibility interface for live migration with assigned devices
  2020-07-29  8:05             ` Yan Zhao
@ 2020-07-29 11:28               ` Sean Mooney
  2020-07-29 19:12                 ` Alex Williamson
  2020-07-30  1:56                 ` Yan Zhao
  2020-08-04 16:35               ` Cornelia Huck
  1 sibling, 2 replies; 48+ messages in thread
From: Sean Mooney @ 2020-07-29 11:28 UTC (permalink / raw)
  To: Yan Zhao, Alex Williamson
  Cc: kvm, libvir-list, Jason Wang, qemu-devel, kwankhede, eauger,
	xin-ran.wang, corbet, openstack-discuss, shaohe.feng, kevin.tian,
	eskultet, jian-feng.ding, dgilbert, zhenyuw, hejie.xu,
	bao.yumeng, intel-gvt-dev, berrange, cohuck, dinechin, devel

On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > On Mon, 27 Jul 2020 15:24:40 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > 
> > > > > As you indicate, the vendor driver is responsible for checking version
> > > > > information embedded within the migration stream.  Therefore a
> > > > > migration should fail early if the devices are incompatible.  Is it  
> > > > 
> > > > but as I know, currently in VFIO migration protocol, we have no way to
> > > > get vendor specific compatibility checking string in migration setup stage
> > > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > > In this way, for devices who does not save device data in precopy stage,
> > > > the migration compatibility checking is as late as in stop-and-copy
> > > > stage, which is too late.
> > > > do you think we need to add the getting/checking of vendor specific
> > > > compatibility string early in save_setup stage?
> > > >  
> > > 
> > > hi Alex,
> > > after an offline discussion with Kevin, I realized that it may not be a
> > > problem if migration compatibility check in vendor driver occurs late in
> > > stop-and-copy phase for some devices, because if we report device
> > > compatibility attributes clearly in an interface, the chances for
> > > libvirt/openstack to make a wrong decision is little.
> > 
> > I think it would be wise for a vendor driver to implement a pre-copy
> > phase, even if only to send version information and verify it at the
> > target.  Deciding you have no device state to send during pre-copy does
> > not mean your vendor driver needs to opt-out of the pre-copy phase
> > entirely.  Please also note that pre-copy is at the user's discretion,
> > we've defined that we can enter stop-and-copy at any point, including
> > without a pre-copy phase, so I would recommend that vendor drivers
> > validate compatibility at the start of both the pre-copy and the
> > stop-and-copy phases.
> > 
> 
> ok. got it!
> 
> > > so, do you think we are now arriving at an agreement that we'll give up
> > > the read-and-test scheme and start to defining one interface (perhaps in
> > > json format), from which libvirt/openstack is able to parse and find out
> > > compatibility list of a source mdev/physical device?
> > 
> > Based on the feedback we've received, the previously proposed interface
> > is not viable.  I think there's agreement that the user needs to be
> > able to parse and interpret the version information.  Using json seems
> > viable, but I don't know if it's the best option.  Is there any
> > precedent of markup strings returned via sysfs we could follow?
> 
> I found some examples of using formatted string under /sys, mostly under
> tracing. maybe we can do a similar implementation.
> 
> #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> 
> name: kvm_mmio
> ID: 32
> format:
>         field:unsigned short common_type;       offset:0;       size:2; signed:0;
>         field:unsigned char common_flags;       offset:2;       size:1; signed:0;
>         field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
>         field:int common_pid;   offset:4;       size:4; signed:1;
> 
>         field:u32 type; offset:8;       size:4; signed:0;
>         field:u32 len;  offset:12;      size:4; signed:0;
>         field:u64 gpa;  offset:16;      size:8; signed:0;
>         field:u64 val;  offset:24;      size:8; signed:0;
> 
> print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read"
> }, { 2, "write" }), REC->len, REC->gpa, REC->val
> 
this is not json format and it's not super friendly to parse.
> 
> #cat /sys/devices/pci0000:00/0000:00:02.0/uevent
> DRIVER=vfio-pci
> PCI_CLASS=30000
> PCI_ID=8086:591D
> PCI_SUBSYS_ID=8086:2212
> PCI_SLOT_NAME=0000:00:02.0
> MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
> 
this is ini or conf format.
this is pretty simple to parse, which would be fine.
that said, you could also have a version or capability directory with a file
for each key and a single value.

personally i would prefer to only have to do one read rather than list the files in
a directory and then read them all to build the data structure myself, but that is
doable. the simple ini format used for uevent seems the best of the 3 options
provided above.
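
To illustrate how little code the uevent-style format needs, here is a
minimal Python sketch. It parses the exact uevent text quoted above; the
function name parse_key_value is made up for this example and is not part
of any existing tool.

```python
def parse_key_value(text):
    """Parse uevent-style KEY=value lines into a dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)  # split on the first '=' only
        result[key] = value
    return result


# Sample copied from the uevent output quoted in this mail.
UEVENT_SAMPLE = """\
DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
"""

fields = parse_key_value(UEVENT_SAMPLE)
print(fields["DRIVER"], fields["PCI_ID"])
```

A single read of one attribute gives the whole dictionary, which is the
property i care about here.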
> > 
> > Your idea of having both a "self" object and an array of "compatible"
> > objects is perhaps something we can build on, but we must not assume
> > PCI devices at the root level of the object.  Providing both the
> > mdev-type and the driver is a bit redundant, since the former includes
> > the latter.  We can't have vendor specific versioning schemes though,
> > ie. gvt-version. We need to agree on a common scheme and decide which
> > fields the version is relative to, ex. just the mdev type?
> 
> what about making all comparing fields vendor specific?
> userspace like openstack only needs to parse and compare if target
> device is within source compatible list without understanding the meaning
> of each field.
that kind of defeats the reason for having them be parsable.
the reason openstack wants to be able to understand the capabilities is so
we can statically declare the capabilities of devices ahead of time so our scheduler
can select hosts based on that. if the keys and data are opaque to userspace
because they are just random vendor-specific blobs, we can't do that.
> 
> > I had also proposed fields that provide information to create a
> > compatible type, for example to create a type_x2 device from a type_x1
> > mdev type, they need to know to apply an aggregation attribute.  If we
> > need to explicitly list every aggregation value and the resulting type,
> > I think we run aground of what aggregation was trying to avoid anyway,
> > so we might need to pick a language that defines variable substitution
> > or some kind of tagging.  For example if we could define ${aggr} as an
> > integer within a specified range, then we might be able to define a type
> > relative to that value (type_x${aggr}) which requires an aggregation
> > attribute using the same value.  I dunno, just spit balling.  Thanks,
> 
> what about a migration_compatible attribute under device node like
> below?
rather than listing compatible devices, it would be better if you could declaratively
list the features supported, and we could compare those along with a simple semver version string.
> 
> #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID1/migration_compatible
> SELF:
> 	device_type=pci
> 	device_id=8086591d
> 	mdev_type=i915-GVTg_V5_2
> 	aggregator=1
> 	pv_mode="none+ppgtt+context"
> 	interface_version=3
> COMPATIBLE:
> 	device_type=pci
> 	device_id=8086591d
> 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
this mixed notation will be hard to parse so i would avoid that.
> 	aggregator={val1}/2
> 	pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
>  
> 	interface_version={val3:int:2,3}
> COMPATIBLE:
> 	device_type=pci
> 	device_id=8086591d
> 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> 	aggregator={val1}/2
> 	pv_mode=""  #"" meaning empty, could be absent in a compatible device
> 	interface_version=1
if you presented this information, the only way i could see to use it would be to
extract the mdev_type name and interface_version and build a database table as follows:

source_mdev_type | source_version | target_mdev_type | target_version
i915-GVTg_V5_2 | 3 | i915-GVTg_V5_{val1:int:1,2,4,8} | {val3:int:2,3}
i915-GVTg_V5_2 | 3 | i915-GVTg_V5_{val1:int:1,2,4,8} | 1

this would either require us to use a post-placement scheduler filter to introspect this database,
or transform the target_mdev_type and target_version column data into CUSTOM_* traits we apply to
our placement resource providers, and we would have to perform multiple requests, one for each possible compatible
alternative. if the vm has multiple mdevs this is combinatorially problematic, as it is 1 query for each
device * the number of possible compatible devices for that device.

in other words, if this is just opaque data we can't ever represent it efficiently in our placement service and
have to fall back to an expensive post-placement scheduler filter based on the db table approach.

this also ignores the fact that at present the mdev_type cannot change during a migration, so a compatible
device with a different mdev type would not be considered an acceptable choice in openstack. the way you select a host
with a specific vgpu mdev type today is to apply a custom trait, which is CUSTOM_<mdev_type_goes_here>, to the vGPU
resource provider, and then in the flavor you request 1 allocation of vGPU and require the CUSTOM_<mdev_type_goes_here>
trait. so going from i915-GVTg_V5_2 to i915-GVTg_V5_{val1:int:1,2,4,8} would not currently be compatible with that
workflow.


> #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID2/migration_compatible
> SELF:
> 	device_type=pci
> 	device_id=8086591d
> 	mdev_type=i915-GVTg_V5_4
> 	aggregator=2
> 	interface_version=1
> COMPATIBLE: 
> 	device_type=pci
> 	device_id=8086591d
> 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> 	aggregator={val1}/2
> 	interface_version=1
by the way, this is closer to yaml format than it is to json, but it does not align with any existing
format i know of, so that just makes the representation needlessly hard to consume.
if we are going to use a markup language, let's use a standard one like yaml, json or toml and not invent a new one.
> 
> Notes:
> - A COMPATIBLE object is a line starting with COMPATIBLE.
>   It specifies a list of compatible devices that are allowed to migrate
>   in.
>   The reason to allow multiple COMPATIBLE objects is that when it
>   is hard to express a complex compatible logic in one COMPATIBLE
>   object, a simple enumeration is still a fallback.
>   in the above example, device UUID2 is in the compatible list of
>   device UUID1, but device UUID1 is not in the compatible list of device
>   UUID2, so device UUID2 is able to migrate to device UUID1, but device
>   UUID1 is not able to migrate to device UUID2.
> 
> - fields under each object are of "and" relationship to each other,  meaning
>   all fields of SELF object of a target device must be equal to corresponding
>   fields of a COMPATIBLE object of source device, otherwise it is regarded as not
>   compatible.
> 
> - each field, however, is able to specify multiple allowed values, using
>   variables as explained below.
> 
> - variables are represented with {}, the first appearance of one variable
>   specifies its type and allowed list. e.g.
>   {val1:int:1,2,4,8} represents val1 whose type is integer and allowed
>   values are 1, 2, 4, 8.
> 
> - vendors are able to specify which fields are within the comparing list
>   and which fields are not. e.g. for physical VF migration, it may not
>   choose mdev_type as a comparing field, and maybe use driver name instead.
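
For readers following the notes above, the "and" matching rule can be
sketched in Python roughly as below. This is only an illustration under
stated assumptions: variables are already unrolled into explicit sets of
allowed values, the linked aggregator={val1}/2 field is left out because
the sketch does not model linked variables, None stands for "field
absent", and a field missing from a COMPATIBLE object is treated as "must
be absent". The function names are made up for the example.

```python
def entry_matches(compatible, self_fields):
    """Return True if every field agrees (the "and" relationship).

    `compatible` maps a field name to the set of allowed values.
    `self_fields` maps a field name to the single value of a SELF object.
    """
    for field in set(compatible) | set(self_fields):
        allowed = compatible.get(field, {None})  # assumption: missing == absent
        if self_fields.get(field) not in allowed:
            return False
    return True


def in_compatible_list(compatible_entries, self_fields):
    """A SELF object is accepted if any one COMPATIBLE object matches."""
    return any(entry_matches(entry, self_fields) for entry in compatible_entries)


# {val1:int:1,2,4,8} unrolled into the four concrete type names.
TYPES = {"i915-GVTg_V5_%d" % n for n in (1, 2, 4, 8)}

uuid1_self = {"device_type": "pci", "device_id": "8086591d",
              "mdev_type": "i915-GVTg_V5_2",
              "pv_mode": "none+ppgtt+context", "interface_version": "3"}
uuid1_compatible = [
    {"device_type": {"pci"}, "device_id": {"8086591d"}, "mdev_type": TYPES,
     "pv_mode": {"none+ppgtt", "none+context", "none+ppgtt+context"},
     "interface_version": {"2", "3"}},
    {"device_type": {"pci"}, "device_id": {"8086591d"}, "mdev_type": TYPES,
     "pv_mode": {"", None},  # pv_mode="" may also be absent
     "interface_version": {"1"}},
]

uuid2_self = {"device_type": "pci", "device_id": "8086591d",
              "mdev_type": "i915-GVTg_V5_4", "interface_version": "1"}
uuid2_compatible = [
    {"device_type": {"pci"}, "device_id": {"8086591d"}, "mdev_type": TYPES,
     "interface_version": {"1"}},
]

uuid2_in_uuid1_list = in_compatible_list(uuid1_compatible, uuid2_self)
uuid1_in_uuid2_list = in_compatible_list(uuid2_compatible, uuid1_self)
print(uuid2_in_uuid1_list, uuid1_in_uuid2_list)
```

With the data from the two examples, UUID2 lands in UUID1's compatible
list and UUID1 does not land in UUID2's, which agrees with the first note.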
this format might be useful to vendors, but from an orchestrator perspective i don't think this has
value to us. likely we would not use this api if it was added, as it does not help us with scheduling.
ideally, instead of declaring which other mdev types a device is compatible with (which could presumably change over
time as new devices and firmwares are released), i would prefer to see a declarative non vendor specific api that declares
the feature set provided by each mdev_type, from which we can infer compatibility similar to cpu feature flags.
for devices of the same mdev_type name, additionally a declarative version string could also be used if required for
additional compatibility checks.
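
A rough sketch of what "similar to cpu feature flags" could look like on
the consumer side, in Python. The rule used here (the target must offer
every feature the source uses) is an assumption borrowed from how cpu
flags are usually compared, and the feature names are hypothetical, taken
loosely from the pv_mode values in the example above.

```python
def features_compatible(source_features, target_features):
    """True if the target offers at least every feature the source has."""
    return set(source_features) <= set(target_features)


# Hypothetical feature sets for two hosts.
source = {"ppgtt"}
target = {"ppgtt", "context"}

forward = features_compatible(source, target)   # source -> target
backward = features_compatible(target, source)  # target -> source
print(forward, backward)
```

A flat set like this maps directly onto traits a scheduler can filter on
ahead of time, with no per-device lookup of compatibility lists.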
>  
> 
> Thanks
> Yan
> 
> 



* Re: device compatibility interface for live migration with assigned devices
  2020-07-27 22:23           ` Alex Williamson
  2020-07-29  8:05             ` Yan Zhao
@ 2020-07-29 19:05             ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 48+ messages in thread
From: Dr. David Alan Gilbert @ 2020-07-29 19:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yan Zhao, kvm, libvir-list, Jason Wang, qemu-devel, kwankhede,
	eauger, xin-ran.wang, corbet, openstack-discuss, shaohe.feng,
	kevin.tian, eskultet, jian-feng.ding, zhenyuw, hejie.xu,
	bao.yumeng, smooney, intel-gvt-dev, berrange, cohuck, dinechin,
	devel

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Mon, 27 Jul 2020 15:24:40 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > > > As you indicate, the vendor driver is responsible for checking version
> > > > information embedded within the migration stream.  Therefore a
> > > > migration should fail early if the devices are incompatible.  Is it  
> > > but as I know, currently in VFIO migration protocol, we have no way to
> > > get vendor specific compatibility checking string in migration setup stage
> > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > In this way, for devices who does not save device data in precopy stage,
> > > the migration compatibility checking is as late as in stop-and-copy
> > > stage, which is too late.
> > > do you think we need to add the getting/checking of vendor specific
> > > compatibility string early in save_setup stage?
> > >  
> > hi Alex,
> > after an offline discussion with Kevin, I realized that it may not be a
> > problem if migration compatibility check in vendor driver occurs late in
> > stop-and-copy phase for some devices, because if we report device
> > compatibility attributes clearly in an interface, the chances for
> > libvirt/openstack to make a wrong decision is little.
> 
> I think it would be wise for a vendor driver to implement a pre-copy
> phase, even if only to send version information and verify it at the
> target.  Deciding you have no device state to send during pre-copy does
> not mean your vendor driver needs to opt-out of the pre-copy phase
> entirely.  Please also note that pre-copy is at the user's discretion,
> we've defined that we can enter stop-and-copy at any point, including
> without a pre-copy phase, so I would recommend that vendor drivers
> validate compatibility at the start of both the pre-copy and the
> stop-and-copy phases.

That's quite curious; from a migration point of view I'd expect if you
did want to skip pre-copy, that you'd go through the motions of entering
it and then not saving any data and then going to stop-and-copy,
rather than having two flows.

Note that failing at a late stage of stop-and-copy is a pain; if you've
just spent an hour migrating your huge busy VM over, you're going to be
pretty annoyed when it goes pop near the end.

Dave

> > so, do you think we are now arriving at an agreement that we'll give up
> > the read-and-test scheme and start to defining one interface (perhaps in
> > json format), from which libvirt/openstack is able to parse and find out
> > compatibility list of a source mdev/physical device?
> 
> Based on the feedback we've received, the previously proposed interface
> is not viable.  I think there's agreement that the user needs to be
> able to parse and interpret the version information.  Using json seems
> viable, but I don't know if it's the best option.  Is there any
> precedent of markup strings returned via sysfs we could follow?
> 
> Your idea of having both a "self" object and an array of "compatible"
> objects is perhaps something we can build on, but we must not assume
> PCI devices at the root level of the object.  Providing both the
> mdev-type and the driver is a bit redundant, since the former includes
> the latter.  We can't have vendor specific versioning schemes though,
> ie. gvt-version. We need to agree on a common scheme and decide which
> fields the version is relative to, ex. just the mdev type?
> 
> I had also proposed fields that provide information to create a
> compatible type, for example to create a type_x2 device from a type_x1
> mdev type, they need to know to apply an aggregation attribute.  If we
> need to explicitly list every aggregation value and the resulting type,
> I think we run aground of what aggregation was trying to avoid anyway,
> so we might need to pick a language that defines variable substitution
> or some kind of tagging.  For example if we could define ${aggr} as an
> integer within a specified range, then we might be able to define a type
> relative to that value (type_x${aggr}) which requires an aggregation
> attribute using the same value.  I dunno, just spit balling.  Thanks,
> 
> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: device compatibility interface for live migration with assigned devices
  2020-07-29 11:28               ` Sean Mooney
@ 2020-07-29 19:12                 ` Alex Williamson
  2020-07-30  3:41                   ` Yan Zhao
  2020-07-30  1:56                 ` Yan Zhao
  1 sibling, 1 reply; 48+ messages in thread
From: Alex Williamson @ 2020-07-29 19:12 UTC (permalink / raw)
  To: Sean Mooney
  Cc: Yan Zhao, kvm, libvir-list, Jason Wang, qemu-devel, kwankhede,
	eauger, xin-ran.wang, corbet, openstack-discuss, shaohe.feng,
	kevin.tian, eskultet, jian-feng.ding, dgilbert, zhenyuw,
	hejie.xu, bao.yumeng, intel-gvt-dev, berrange, cohuck, dinechin,
	devel

On Wed, 29 Jul 2020 12:28:46 +0100
Sean Mooney <smooney@redhat.com> wrote:

> On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:  
> > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > > > As you indicate, the vendor driver is responsible for checking version
> > > > > > information embedded within the migration stream.  Therefore a
> > > > > > migration should fail early if the devices are incompatible.  Is it    
> > > > > 
> > > > > but as I know, currently in VFIO migration protocol, we have no way to
> > > > > get vendor specific compatibility checking string in migration setup stage
> > > > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > > > In this way, for devices who does not save device data in precopy stage,
> > > > > the migration compatibility checking is as late as in stop-and-copy
> > > > > stage, which is too late.
> > > > > do you think we need to add the getting/checking of vendor specific
> > > > > compatibility string early in save_setup stage?
> > > > >    
> > > > 
> > > > hi Alex,
> > > > after an offline discussion with Kevin, I realized that it may not be a
> > > > problem if migration compatibility check in vendor driver occurs late in
> > > > stop-and-copy phase for some devices, because if we report device
> > > > compatibility attributes clearly in an interface, the chances for
> > > > libvirt/openstack to make a wrong decision is little.  
> > > 
> > > I think it would be wise for a vendor driver to implement a pre-copy
> > > phase, even if only to send version information and verify it at the
> > > target.  Deciding you have no device state to send during pre-copy does
> > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > entirely.  Please also note that pre-copy is at the user's discretion,
> > > we've defined that we can enter stop-and-copy at any point, including
> > > without a pre-copy phase, so I would recommend that vendor drivers
> > > validate compatibility at the start of both the pre-copy and the
> > > stop-and-copy phases.
> > >   
> > 
> > ok. got it!
> >   
> > > > so, do you think we are now arriving at an agreement that we'll give up
> > > > the read-and-test scheme and start to defining one interface (perhaps in
> > > > json format), from which libvirt/openstack is able to parse and find out
> > > > compatibility list of a source mdev/physical device?  
> > > 
> > > Based on the feedback we've received, the previously proposed interface
> > > is not viable.  I think there's agreement that the user needs to be
> > > able to parse and interpret the version information.  Using json seems
> > > viable, but I don't know if it's the best option.  Is there any
> > > precedent of markup strings returned via sysfs we could follow?  
> > 
> > I found some examples of using formatted string under /sys, mostly under
> > tracing. maybe we can do a similar implementation.
> > 
> > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > 
> > name: kvm_mmio
> > ID: 32
> > format:
> >         field:unsigned short common_type;       offset:0;       size:2; signed:0;
> >         field:unsigned char common_flags;       offset:2;       size:1; signed:0;
> >         field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
> >         field:int common_pid;   offset:4;       size:4; signed:1;
> > 
> >         field:u32 type; offset:8;       size:4; signed:0;
> >         field:u32 len;  offset:12;      size:4; signed:0;
> >         field:u64 gpa;  offset:16;      size:8; signed:0;
> >         field:u64 val;  offset:24;      size:8; signed:0;
> > 
> > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read"
> > }, { 2, "write" }), REC->len, REC->gpa, REC->val
> >   
> this is not json format and it's not super friendly to parse.
> > 
> > #cat /sys/devices/pci0000:00/0000:00:02.0/uevent
> > DRIVER=vfio-pci
> > PCI_CLASS=30000
> > PCI_ID=8086:591D
> > PCI_SUBSYS_ID=8086:2212
> > PCI_SLOT_NAME=0000:00:02.0
> > MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
> >   
> this is ini or conf format.
> this is pretty simple to parse, which would be fine.
> that said, you could also have a version or capability directory with a file
> for each key and a single value.
> 
> personally i would prefer to only have to do one read rather than list the files in
> a directory and then read them all to build the data structure myself, but that is
> doable. the simple ini format used for uevent seems the best of the 3 options
> provided above.
> > > 
> > > Your idea of having both a "self" object and an array of "compatible"
> > > objects is perhaps something we can build on, but we must not assume
> > > PCI devices at the root level of the object.  Providing both the
> > > mdev-type and the driver is a bit redundant, since the former includes
> > > the latter.  We can't have vendor specific versioning schemes though,
> > > ie. gvt-version. We need to agree on a common scheme and decide which
> > > fields the version is relative to, ex. just the mdev type?  
> > 
> > what about making all comparing fields vendor specific?
> > userspace like openstack only needs to parse and compare if target
> > device is within source compatible list without understanding the meaning
> > of each field.  
> that kind of defeats the reason for having them be parsable.
> the reason openstack wants to be able to understand the capabilities is so
> we can statically declare the capabilities of devices ahead of time so our scheduler
> can select hosts based on that. if the keys and data are opaque to userspace
> because they are just random vendor-specific blobs, we can't do that.

Agreed, I'm not sure I'm willing to rule out that there could be vendor
specific direct match fields, as I included in my example earlier in
the thread, but entirely vendor specific defeats much of the purpose
here.

> > > I had also proposed fields that provide information to create a
> > > compatible type, for example to create a type_x2 device from a type_x1
> > > mdev type, they need to know to apply an aggregation attribute.  If we
> > > need to explicitly list every aggregation value and the resulting type,
> > > I think we run aground of what aggregation was trying to avoid anyway,
> > > so we might need to pick a language that defines variable substitution
> > > or some kind of tagging.  For example if we could define ${aggr} as an
> > > integer within a specified range, then we might be able to define a type
> > > relative to that value (type_x${aggr}) which requires an aggregation
> > > attribute using the same value.  I dunno, just spit balling.  Thanks,  
> > 
> > what about a migration_compatible attribute under device node like
> > below?  
> rather than listing compatible devices, it would be better if you could declaratively
> list the features supported, and we could compare those along with a simple semver version string.
> > 
> > #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID1/migration_compatible

Note that we're defining compatibility relative to a vfio migration
interface, so we should include that in the name, we don't know what
other migration interfaces might exist.

> > SELF:
> > 	device_type=pci

Why not the device_api here, ie. vfio-pci.  The device doesn't provide
a pci interface directly, it's wrapped in a vfio API.

> > 	device_id=8086591d

Is device_id interpreted relative to device_type?  How does this
relate to mdev_type?  If we have an mdev_type, doesn't that fully
define the software API?

> > 	mdev_type=i915-GVTg_V5_2

And how are non-mdev devices represented?

> > 	aggregator=1
> > 	pv_mode="none+ppgtt+context"

These are meaningless vendor specific matches afaict.

> > 	interface_version=3

Not much granularity here, I prefer Sean's previous
<major>.<minor>[.bugfix] scheme.

> > COMPATIBLE:
> > 	device_type=pci
> > 	device_id=8086591d
> > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}  
> this mixed notation will be hard to parse so i would avoid that.

Some background, Intel has been proposing aggregation as a solution to
how we scale mdev devices when hardware exposes large numbers of
assignable objects that can be composed in essentially arbitrary ways.
So for instance, if we have a workqueue (wq), we might have an mdev
type for 1wq, 2wq, 3wq,... Nwq.  It's not really practical to expose a
discrete mdev type for each of those, so they want to define a base
type which is composable to other types via this aggregation.  This is
what this substitution and tagging is attempting to accomplish.  So
imagine this set of values for cases where it's not practical to unroll
the values for N discrete types.

> > 	aggregator={val1}/2

So the {val1} above would be substituted here, though an aggregation
factor of 1/2 is a head scratcher...

> > 	pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}

I'm lost on this one though.  I think maybe it's indicating that it's
compatible with any of these, so do we need to list it?  Couldn't this
be handled by Sean's version proposal where the minor version
represents feature compatibility?
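
As a sketch of how a <major>.<minor>[.bugfix] string might be consumed,
assuming a semver-like rule that has not been agreed in this thread (same
major required, target minor at least the source minor, bugfix ignored),
something like the following Python would do. The function names are
hypothetical.

```python
def parse_version(text):
    """Parse '<major>.<minor>[.bugfix]' into a 3-tuple of ints."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) == 2:
        parts.append(0)  # bugfix is optional and defaults to 0
    if len(parts) != 3:
        raise ValueError("expected <major>.<minor>[.bugfix]: %r" % text)
    return tuple(parts)


def version_compatible(source, target):
    """Assumed rule: equal major, and target minor >= source minor."""
    s_major, s_minor, _ = parse_version(source)
    t_major, t_minor, _ = parse_version(target)
    return s_major == t_major and t_minor >= s_minor


print(version_compatible("1.2", "1.3.4"))
print(version_compatible("1.3", "1.2"))
```

Under that rule a feature addition bumps the minor number and a list of
accepted values such as 2,3 collapses into a single comparison.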

> >  
> > 	interface_version={val3:int:2,3}

What does this turn into in a few years, 2,7,12,23,75,96,...

> > COMPATIBLE:
> > 	device_type=pci
> > 	device_id=8086591d
> > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > 	aggregator={val1}/2
> > 	pv_mode=""  #"" meaning empty, could be absent in a compatible device
> > 	interface_version=1  

Why can't this be represented within the previous compatible
description?

> if you presented this information, the only way i could see to use it would be to
> extract the mdev_type name and interface_version and build a database table as follows:
> 
> source_mdev_type | source_version | target_mdev_type | target_version
> i915-GVTg_V5_2 | 3 | i915-GVTg_V5_{val1:int:1,2,4,8} | {val3:int:2,3}
> i915-GVTg_V5_2 | 3 | i915-GVTg_V5_{val1:int:1,2,4,8} | 1
> 
> this would either require us to use a post-placement scheduler filter to introspect this database,
> or transform the target_mdev_type and target_version column data into CUSTOM_* traits we apply to
> our placement resource providers, and we would have to perform multiple requests, one for each possible compatible
> alternative. if the vm has multiple mdevs this is combinatorially problematic, as it is 1 query for each
> device * the number of possible compatible devices for that device.
> 
> in other words, if this is just opaque data we can't ever represent it efficiently in our placement service and
> have to fall back to an expensive post-placement scheduler filter based on the db table approach.
> 
> this also ignores the fact that at present the mdev_type cannot change during a migration, so a compatible
> device with a different mdev type would not be considered an acceptable choice in openstack. the way you select a host
> with a specific vgpu mdev type today is to apply a custom trait, which is CUSTOM_<mdev_type_goes_here>, to the vGPU
> resource provider, and then in the flavor you request 1 allocation of vGPU and require the CUSTOM_<mdev_type_goes_here>
> trait. so going from i915-GVTg_V5_2 to i915-GVTg_V5_{val1:int:1,2,4,8} would not currently be compatible with that
> workflow.

The latter would need to be parsed into:

i915-GVTg_V5_1
i915-GVTg_V5_2
i915-GVTg_V5_4
i915-GVTg_V5_8
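
A small Python sketch of that parsing step, for illustration only. It
handles a single {name:int:a,b,c} variable per string, which is all the
mdev_type field in the example needs; the linked form {val1} used by the
aggregator field is not handled. The helper name expand is made up.

```python
import re

# Matches {name:type:comma,separated,values} as used in the proposal.
VARIABLE = re.compile(r"\{(\w+):(\w+):([^}]*)\}")


def expand(template):
    """Unroll one {name:int:...} variable into concrete strings."""
    match = VARIABLE.search(template)
    if match is None:
        return [template]  # nothing to substitute
    _name, kind, values = match.groups()
    if kind != "int":
        raise ValueError("only int variables are handled in this sketch")
    prefix = template[:match.start()]
    suffix = template[match.end():]
    # int() validates each value; str() puts it back into the name.
    return [prefix + str(int(v)) + suffix for v in values.split(",")]


types = expand("i915-GVTg_V5_{val1:int:1,2,4,8}")
print(types)
```

Each resulting name could then be mapped to one trait, at the cost of one
entry per allowed value.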

There is also on the table, migration from physical devices to mdev
devices (or vice versa), which is not represented in these examples,
nor do I see how we'd represent it.  This is where I started exposing
the resulting PCI device from the mdev in my example so we could have
some commonality between devices, but the migration stream provider is
just as important as the type of device, we could have different host
drivers providing the same device with incompatible migration streams.
The mdev_type encompasses both the driver and device, but we wouldn't
have mdev_types for physical devices, per our current thinking.


> > #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID2/migration_compatible
> > SELF:
> > 	device_type=pci
> > 	device_id=8086591d
> > 	mdev_type=i915-GVTg_V5_4
> > 	aggregator=2
> > 	interface_version=1
> > COMPATIBLE: 
> > 	device_type=pci
> > 	device_id=8086591d
> > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > 	aggregator={val1}/2
> > 	interface_version=1  
> by the way, this is closer to yaml format than it is to json, but it does not align with any existing
> format i know of, so that just makes the representation needlessly hard to consume.
> if we are going to use a markup language, let's use a standard one like yaml, json or toml and not invent a new one.
> > 
> > Notes:
> > - A COMPATIBLE object is a line starting with COMPATIBLE.
> >   It specifies a list of compatible devices that are allowed to migrate
> >   in.
> >   The reason to allow multiple COMPATIBLE objects is that when it
> >   is hard to express a complex compatible logic in one COMPATIBLE
> >   object, a simple enumeration is still a fallback.
> >   in the above example, device UUID2 is in the compatible list of
> >   device UUID1, but device UUID1 is not in the compatible list of device
> >   UUID2, so device UUID2 is able to migrate to device UUID1, but device
> >   UUID1 is not able to migrate to device UUID2.
> > 
> > - fields under each object are of "and" relationship to each other,  meaning
> >   all fields of SELF object of a target device must be equal to corresponding
> >   fields of a COMPATIBLE object of source device, otherwise it is regarded as not
> >   compatible.
> > 
> > - each field, however, is able to specify multiple allowed values, using
> >   variables as explained below.
> > 
> > - variables are represented with {}, the first appearance of one variable
> >   specifies its type and allowed list. e.g.
> >   {val1:int:1,2,4,8} represents val1 whose type is integer and allowed
> >   values are 1, 2, 4, 8.
> > 
> > - vendors are able to specify which fields are within the comparing list
> >   and which fields are not. e.g. for physical VF migration, it may not
> >   choose mdev_type as a comparing field, and maybe use driver name instead.  
> this format might be useful to vendors, but from an orchestrator
> perspective i don't think this has value to us. likely we would not use
> this api if it was added, as it does not help us with scheduling.
> ideally, instead of declaring which other mdev types a device is
> compatible with (which could presumably change over time as new
> devices and firmwares are released), i would prefer to see a
> declarative non vendor specific api that declares the feature set
> provided by each mdev_type, from which we can infer compatibility
> similar to cpu feature flags. for devices of the same mdev_type name,
> additionally a declarative version string could also be used if
> required for additional compatibility checks.

"non vendor specific api that declares the feature set", aren't
features generally vendor specific?  What we're trying to describe is,
by its very nature, vendor specific.  We don't have an ISO body
defining a graphics adapter and enumerating features for that adapter.
I think what we have is mdev_types.  Each type is supposed to define a
specific software interface, perhaps even more so than is done by a PCI
vendor:device ID.  Maybe that mdev_type needs to be abstracted as
something more like a vendor signature, such that a physical device
could provide or accept a vendor signature that's compatible with an
mdev device.  For example, a physically assigned Intel GPU might expose
a migration signature of i915-GVTg_v5_8 if it were designed to be
compatible with that mdev_type.  Thanks,

Alex



* Re: device compatibility interface for live migration with assigned devices
  2020-07-29 11:28               ` Sean Mooney
  2020-07-29 19:12                 ` Alex Williamson
@ 2020-07-30  1:56                 ` Yan Zhao
  2020-07-30 13:14                   ` Sean Mooney
  1 sibling, 1 reply; 48+ messages in thread
From: Yan Zhao @ 2020-07-30  1:56 UTC (permalink / raw)
  To: Sean Mooney
  Cc: Alex Williamson, kvm, libvir-list, Jason Wang, qemu-devel,
	kwankhede, eauger, xin-ran.wang, corbet, openstack-discuss,
	shaohe.feng, kevin.tian, eskultet, jian-feng.ding, dgilbert,
	zhenyuw, hejie.xu, bao.yumeng, intel-gvt-dev, berrange, cohuck,
	dinechin, devel

On Wed, Jul 29, 2020 at 12:28:46PM +0100, Sean Mooney wrote:
> On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > 
> > > > > > As you indicate, the vendor driver is responsible for checking version
> > > > > > information embedded within the migration stream.  Therefore a
> > > > > > migration should fail early if the devices are incompatible.  Is it  
> > > > > 
> > > > > > but as far as I know, the VFIO migration protocol currently has no way to
> > > > > > get a vendor specific compatibility checking string in the migration setup stage
> > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > > > > As a result, for devices that do not save device data in the precopy stage,
> > > > > > the migration compatibility check happens as late as the stop-and-copy
> > > > > > stage, which is too late.
> > > > > > do you think we need to add the getting/checking of a vendor specific
> > > > > > compatibility string early in the save_setup stage?
> > > > >  
> > > > 
> > > > hi Alex,
> > > > after an offline discussion with Kevin, I realized that it may not be a
> > > > problem if the migration compatibility check in the vendor driver occurs late in
> > > > the stop-and-copy phase for some devices, because if we report device
> > > > compatibility attributes clearly in an interface, the chance of
> > > > libvirt/openstack making a wrong decision is small.
> > > 
> > > I think it would be wise for a vendor driver to implement a pre-copy
> > > phase, even if only to send version information and verify it at the
> > > target.  Deciding you have no device state to send during pre-copy does
> > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > entirely.  Please also note that pre-copy is at the user's discretion,
> > > we've defined that we can enter stop-and-copy at any point, including
> > > without a pre-copy phase, so I would recommend that vendor drivers
> > > validate compatibility at the start of both the pre-copy and the
> > > stop-and-copy phases.
> > > 
> > 
> > ok. got it!
> > 
> > > > so, do you think we are now arriving at an agreement that we'll give up
> > > > the read-and-test scheme and start to defining one interface (perhaps in
> > > > json format), from which libvirt/openstack is able to parse and find out
> > > > compatibility list of a source mdev/physical device?
> > > 
> > > Based on the feedback we've received, the previously proposed interface
> > > is not viable.  I think there's agreement that the user needs to be
> > > able to parse and interpret the version information.  Using json seems
> > > viable, but I don't know if it's the best option.  Is there any
> > > precedent of markup strings returned via sysfs we could follow?
> > 
> > I found some examples of using formatted string under /sys, mostly under
> > tracing. maybe we can do a similar implementation.
> > 
> > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > 
> > name: kvm_mmio
> > ID: 32
> > format:
> >         field:unsigned short common_type;       offset:0;       size:2; signed:0;
> >         field:unsigned char common_flags;       offset:2;       size:1; signed:0;
> >         field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
> >         field:int common_pid;   offset:4;       size:4; signed:1;
> > 
> >         field:u32 type; offset:8;       size:4; signed:0;
> >         field:u32 len;  offset:12;      size:4; signed:0;
> >         field:u64 gpa;  offset:16;      size:8; signed:0;
> >         field:u64 val;  offset:24;      size:8; signed:0;
> > 
> > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read"
> > }, { 2, "write" }), REC->len, REC->gpa, REC->val
> > 
> this is not json format and it's not super friendly to parse.
yes, it's just an example. It's exported to be used by userspace perf &
trace_cmd.

> > 
> > #cat /sys/devices/pci0000:00/0000:00:02.0/uevent
> > DRIVER=vfio-pci
> > PCI_CLASS=30000
> > PCI_ID=8086:591D
> > PCI_SUBSYS_ID=8086:2212
> > PCI_SLOT_NAME=0000:00:02.0
> > MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
> > 
> this is ini format or conf format,
> which is pretty simple to parse and would be fine.
> that said, you could also have a version or capability directory with a file
> for each key and a single value.
> 
if this is easy for openstack, maybe we can organize the data in the way below?
 
 |- [device]
    |- migration
        |- self
        |- compatible1
        |- compatible2

e.g. 
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration/self
 	field1=xxx
 	field2=xxx
 	field3=xxx
 	field4=xxx
#cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration/compatible1
 	field1=xxx
 	field2=xxx
 	field3=xxx
 	field4=xxx

or in a flat layer
 |- [device]
    |- migration-self-traits
    |- migration-compatible-traits

I'm not sure whether json format in a single file is better, as I didn't
find any precedent.
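
To make the key=value option concrete, here is a minimal sketch of how
userspace could consume a uevent-style attribute. The function name
parse_kv_attribute and the sample text are assumptions for illustration only;
no migration attribute with this layout exists today.

```python
def parse_kv_attribute(text):
    """Parse uevent-style KEY=VALUE lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        # split only on the first '=' so values may contain '='
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


# Sample text copied from the uevent output quoted earlier in this mail
sample = """DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
"""
parsed = parse_kv_attribute(sample)
print(parsed["PCI_ID"])
```

With one file per object (self, compatible1, ...), the caller would run the
same parser once per file and keep the results in a list.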

> personally i would prefer to only have to do one read rather than list the files in
> a directory and then read them all to build the data structure myself, but that is
> doable. the simple ini format used for uevent seems the best of the 3 options
> provided above.
> > > 
> > > Your idea of having both a "self" object and an array of "compatible"
> > > objects is perhaps something we can build on, but we must not assume
> > > PCI devices at the root level of the object.  Providing both the
> > > mdev-type and the driver is a bit redundant, since the former includes
> > > the latter.  We can't have vendor specific versioning schemes though,
> > > ie. gvt-version. We need to agree on a common scheme and decide which
> > > fields the version is relative to, ex. just the mdev type?
> > 
> > what about making all comparing fields vendor specific?
> > userspace like openstack only needs to parse and compare if target
> > device is within source compatible list without understanding the meaning
> > of each field.
> that kind of defeats the reason for having them be parsable.
> the reason openstack wants to be able to understand the capabilities is so
> we can statically declare the capabilities of devices ahead of time so our scheduler
> can select a host based on that. if the keys and data are opaque to userspace
> because they are just random vendor specific blobs we can't do that.
I heard that cyborg can parse the kernel interface and generate several
traits without understanding the meaning of each trait. Then it reports
those traits to placement for scheduling.

but I agree if mdev creation is involved, those traits need to match
to mdev attributes and mdev_type.

could you explain a little how you plan to create a target mdev device?
is it dynamically created during searching of compatible mdevs or just statically
created before migration?

> > 
> > > I had also proposed fields that provide information to create a
> > > compatible type, for example to create a type_x2 device from a type_x1
> > > mdev type, they need to know to apply an aggregation attribute.  If we
> > > need to explicitly list every aggregation value and the resulting type,
> > > I think we run aground of what aggregation was trying to avoid anyway,
> > > so we might need to pick a language that defines variable substitution
> > > or some kind of tagging.  For example if we could define ${aggr} as an
> > > integer within a specified range, then we might be able to define a type
> > > relative to that value (type_x${aggr}) which requires an aggregation
> > > attribute using the same value.  I dunno, just spit balling.  Thanks,
> > 
> > what about a migration_compatible attribute under device node like
> > below?
> rather than listing compatible devices it would be better if you could declaratively
> list the features supported and we could compare those along with a simple semver version string.
I think the example below is already a way of listing the features supported.
The reason I also want to declare compatible lists of features is that
sometimes it's not a simple 1:1 matching of source list and target list.
as I demonstrate below,
source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible with
target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
               (mdev_type i915-GVTg_V5_8 + aggregator 4)

and aggregator may be just one example where 1:1 matching does not
fit.
so I guess it's best not to leave the hard decision to openstack.
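
As a rough sketch of what a consumer would have to do with the templated
fields quoted below (mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8} together with
aggregator={val1}/2), the following expands them into concrete pairs. The
function name expand, the reading of "{val1}/2" as integer division, and the
choice to skip non-integer results are all assumptions, not part of any
agreed interface.

```python
import re

# First use of a variable declares it: {name:int:v1,v2,...}
DECL = re.compile(r"\{(\w+):int:([\d,]+)\}")


def expand(mdev_template, aggregator_template):
    """Return (mdev_type, aggregator) pairs for every allowed value."""
    m = DECL.search(mdev_template)
    if m is None:
        raise ValueError("no variable declaration found")
    name = m.group(1)
    values = [int(v) for v in m.group(2).split(",")]
    results = []
    for v in values:
        mdev_type = mdev_template.replace(m.group(0), str(v))
        # assumption: "{val1}/2" means "divide the variable by 2"
        expr = aggregator_template.replace("{%s}" % name, str(v))
        num, den = (int(part) for part in expr.split("/"))
        if num % den != 0:
            continue  # assumption: drop fractional factors such as 1/2
        results.append((mdev_type, num // den))
    return results


pairs = expand("i915-GVTg_V5_{val1:int:1,2,4,8}", "{val1}/2")
for mdev_type, aggr in pairs:
    print(mdev_type, aggr)
```

Under these assumptions the expansion yields the V5_2, V5_4 and V5_8 types
with aggregator 1, 2 and 4, which is the pairing described above.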

Thanks
Yan
> > 
> > #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID1/migration_compatible
> > SELF:
> > 	device_type=pci
> > 	device_id=8086591d
> > 	mdev_type=i915-GVTg_V5_2
> > 	aggregator=1
> > 	pv_mode="none+ppgtt+context"
> > 	interface_version=3
> > COMPATIBLE:
> > 	device_type=pci
> > 	device_id=8086591d
> > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> this mixed notation will be hard to parse so i would avoid that.
> > 	aggregator={val1}/2
> > 	pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
> >  
> > 	interface_version={val3:int:2,3}
> > COMPATIBLE:
> > 	device_type=pci
> > 	device_id=8086591d
> > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > 	aggregator={val1}/2
> > 	pv_mode=""  #"" meaning empty, could be absent in a compatible device
> > 	interface_version=1
> if you presented this information the only way i could see to use it would be to
> extract the mdev_type name and interface_version and build a database table as follows
> 
> source_mdev_type | source_version | target_mdev_type | target_version
> i915-GVTg_V5_2 | 3 | i915-GVTg_V5_{val1:int:1,2,4,8} | {val3:int:2,3}
> i915-GVTg_V5_2 | 3 | i915-GVTg_V5_{val1:int:1,2,4,8} | 1
> 
> this would either require us to use a post placement scheduler filter to introspect this database
> or transform the target_mdev_type and target_version column data into CUSTOM_* traits we apply to
> our placement resource providers, and we would have to perform multiple requests, one for each possible compatible
> alternative.  if the vm has multiple mdevs this is combinatorially problematic as it is 1 query for each
> device * the number of possible compatible devices for that device.
> 
> in other words if this is just opaque data we can't ever represent it efficiently in our placement service and
> have to fall back to an expensive post placement scheduler filter based on the db table approach.
> 
> this also ignores the fact that at present the mdev_type cannot change during a migration so a compatible
> device with a different mdev type would not be considered an acceptable choice in openstack. the way you select a host
> with a specific vgpu mdev type today is to apply a custom trait which is CUSTOM_<mdev_type_goes_here> to the vGPU
> resource provider and then in the flavor you request 1 allocation of vGPU and require the CUSTOM_<mdev_type_goes_here>
> trait. so going from i915-GVTg_V5_2 to i915-GVTg_V5_{val1:int:1,2,4,8} would not currently be compatible with that
> workflow.
> 
> 
> > #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID2/migration_compatible
> > SELF:
> > 	device_type=pci
> > 	device_id=8086591d
> > 	mdev_type=i915-GVTg_V5_4
> > 	aggregator=2
> > 	interface_version=1
> > COMPATIBLE: 
> > 	device_type=pci
> > 	device_id=8086591d
> > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > 	aggregator={val1}/2
> > 	interface_version=1
> by the way this is closer to yaml format than it is to json but it does not align with any existing
> format i know of, so that just makes the representation needlessly hard to consume.
> if we are going to use a markup language let's use a standard one like yaml, json or toml and not invent a new one.
> > 
> > Notes:
> > - A COMPATIBLE object is a line starting with COMPATIBLE.
> >   It specifies a list of compatible devices that are allowed to migrate
> >   in.
> >   The reason to allow multiple COMPATIBLE objects is that when it
> >   is hard to express a complex compatible logic in one COMPATIBLE
> >   object, a simple enumeration is still a fallback.
> >   in the above example, device UUID2 is in the compatible list of
> >   device UUID1, but device UUID1 is not in the compatible list of device
> >   UUID2, so device UUID2 is able to migrate to device UUID1, but device
> >   UUID1 is not able to migrate to device UUID2.
> > 
> > - fields under each object are of "and" relationship to each other,  meaning
> >   all fields of SELF object of a target device must be equal to corresponding
> >   fields of a COMPATIBLE object of source device, otherwise it is regarded as not
> >   compatible.
> > 
> > - each field, however, is able to specify multiple allowed values, using
> >   variables as explained below.
> > 
> > - variables are represented with {}, the first appearance of one variable
> >   specifies its type and allowed list. e.g.
> >   {val1:int:1,2,4,8} represents val1 whose type is integer and allowed
> >   values are 1, 2, 4, 8.
> > 
> > - vendors are able to specify which fields are within the comparing list
> >   and which fields are not. e.g. for physical VF migration, it may not
> >   choose mdev_type as a comparing field, and maybe use driver name instead.
> this format might be useful to vendors but from an orchestrator perspective i don't think this has
> value to us. likely we would not use this api if it was added as it does not help us with scheduling.
> ideally instead of declaring which other mdev types a device is compatible with (which could presumably change over
> time as new devices and firmwares are released) i would prefer to see a declarative non vendor specific api that declares
> the feature set provided by each mdev_type from which we can infer compatibility similar to cpu feature flags.
> for devices of the same mdev_type name, additionally a declarative version string could also be used if required for
> additional compatibility checks.
> >  
> > 
> > Thanks
> > Yan
> > 
> > 
> 


* Re: device compatibility interface for live migration with assigned devices
  2020-07-29 19:12                 ` Alex Williamson
@ 2020-07-30  3:41                   ` Yan Zhao
  2020-07-30 13:24                     ` Sean Mooney
  2020-07-30 17:29                     ` Alex Williamson
  0 siblings, 2 replies; 48+ messages in thread
From: Yan Zhao @ 2020-07-30  3:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Sean Mooney, kvm, libvir-list, Jason Wang, qemu-devel, kwankhede,
	eauger, xin-ran.wang, corbet, openstack-discuss, shaohe.feng,
	kevin.tian, eskultet, jian-feng.ding, dgilbert, zhenyuw,
	hejie.xu, bao.yumeng, intel-gvt-dev, berrange, cohuck, dinechin,
	devel

On Wed, Jul 29, 2020 at 01:12:55PM -0600, Alex Williamson wrote:
> On Wed, 29 Jul 2020 12:28:46 +0100
> Sean Mooney <smooney@redhat.com> wrote:
> 
> > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:  
> > > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >   
> > > > > > > As you indicate, the vendor driver is responsible for checking version
> > > > > > > information embedded within the migration stream.  Therefore a
> > > > > > > migration should fail early if the devices are incompatible.  Is it    
> > > > > > 
> > > > > > > but as far as I know, the VFIO migration protocol currently has no way to
> > > > > > > get a vendor specific compatibility checking string in the migration setup stage
> > > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > > > > > As a result, for devices that do not save device data in the precopy stage,
> > > > > > > the migration compatibility check happens as late as the stop-and-copy
> > > > > > > stage, which is too late.
> > > > > > > do you think we need to add the getting/checking of a vendor specific
> > > > > > > compatibility string early in the save_setup stage?
> > > > > >    
> > > > > 
> > > > > hi Alex,
> > > > > > after an offline discussion with Kevin, I realized that it may not be a
> > > > > > problem if the migration compatibility check in the vendor driver occurs late in
> > > > > > the stop-and-copy phase for some devices, because if we report device
> > > > > > compatibility attributes clearly in an interface, the chance of
> > > > > > libvirt/openstack making a wrong decision is small.
> > > > 
> > > > I think it would be wise for a vendor driver to implement a pre-copy
> > > > phase, even if only to send version information and verify it at the
> > > > target.  Deciding you have no device state to send during pre-copy does
> > > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > > entirely.  Please also note that pre-copy is at the user's discretion,
> > > > we've defined that we can enter stop-and-copy at any point, including
> > > > without a pre-copy phase, so I would recommend that vendor drivers
> > > > validate compatibility at the start of both the pre-copy and the
> > > > stop-and-copy phases.
> > > >   
> > > 
> > > ok. got it!
> > >   
> > > > > so, do you think we are now arriving at an agreement that we'll give up
> > > > > the read-and-test scheme and start to defining one interface (perhaps in
> > > > > json format), from which libvirt/openstack is able to parse and find out
> > > > > compatibility list of a source mdev/physical device?  
> > > > 
> > > > Based on the feedback we've received, the previously proposed interface
> > > > is not viable.  I think there's agreement that the user needs to be
> > > > able to parse and interpret the version information.  Using json seems
> > > > viable, but I don't know if it's the best option.  Is there any
> > > > precedent of markup strings returned via sysfs we could follow?  
> > > 
> > > I found some examples of using formatted string under /sys, mostly under
> > > tracing. maybe we can do a similar implementation.
> > > 
> > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > > 
> > > name: kvm_mmio
> > > ID: 32
> > > format:
> > >         field:unsigned short common_type;       offset:0;       size:2; signed:0;
> > >         field:unsigned char common_flags;       offset:2;       size:1; signed:0;
> > >         field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
> > >         field:int common_pid;   offset:4;       size:4; signed:1;
> > > 
> > >         field:u32 type; offset:8;       size:4; signed:0;
> > >         field:u32 len;  offset:12;      size:4; signed:0;
> > >         field:u64 gpa;  offset:16;      size:8; signed:0;
> > >         field:u64 val;  offset:24;      size:8; signed:0;
> > > 
> > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read"
> > > }, { 2, "write" }), REC->len, REC->gpa, REC->val
> > >   
> > this is not json format and it's not super friendly to parse.
> > > 
> > > #cat /sys/devices/pci0000:00/0000:00:02.0/uevent
> > > DRIVER=vfio-pci
> > > PCI_CLASS=30000
> > > PCI_ID=8086:591D
> > > PCI_SUBSYS_ID=8086:2212
> > > PCI_SLOT_NAME=0000:00:02.0
> > > MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
> > >   
> > this is ini format or conf format,
> > which is pretty simple to parse and would be fine.
> > that said, you could also have a version or capability directory with a file
> > for each key and a single value.
> > 
> > personally i would prefer to only have to do one read rather than list the files in
> > a directory and then read them all to build the data structure myself, but that is
> > doable. the simple ini format used for uevent seems the best of the 3 options
> > provided above.
> > > > 
> > > > Your idea of having both a "self" object and an array of "compatible"
> > > > objects is perhaps something we can build on, but we must not assume
> > > > PCI devices at the root level of the object.  Providing both the
> > > > mdev-type and the driver is a bit redundant, since the former includes
> > > > the latter.  We can't have vendor specific versioning schemes though,
> > > > ie. gvt-version. We need to agree on a common scheme and decide which
> > > > fields the version is relative to, ex. just the mdev type?  
> > > 
> > > what about making all comparing fields vendor specific?
> > > userspace like openstack only needs to parse and compare if target
> > > device is within source compatible list without understanding the meaning
> > > of each field.  
> > that kind of defeats the reason for having them be parsable.
> > the reason openstack wants to be able to understand the capabilities is so
> > we can statically declare the capabilities of devices ahead of time so our scheduler
> > can select a host based on that. if the keys and data are opaque to userspace
> > because they are just random vendor specific blobs we can't do that.
> 
> Agreed, I'm not sure I'm willing to rule out that there could be vendor
> specific direct match fields, as I included in my example earlier in
> the thread, but entirely vendor specific defeats much of the purpose
> here.
> 
> > > > I had also proposed fields that provide information to create a
> > > > compatible type, for example to create a type_x2 device from a type_x1
> > > > mdev type, they need to know to apply an aggregation attribute.  If we
> > > > need to explicitly list every aggregation value and the resulting type,
> > > > I think we run aground of what aggregation was trying to avoid anyway,
> > > > so we might need to pick a language that defines variable substitution
> > > > or some kind of tagging.  For example if we could define ${aggr} as an
> > > > integer within a specified range, then we might be able to define a type
> > > > relative to that value (type_x${aggr}) which requires an aggregation
> > > > attribute using the same value.  I dunno, just spit balling.  Thanks,  
> > > 
> > > what about a migration_compatible attribute under device node like
> > > below?  
> > rather than listing compatible devices it would be better if you could declaratively
> > list the features supported and we could compare those along with a simple semver version string.
> > > 
> > > #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID1/migration_compatible
> 
> Note that we're defining compatibility relative to a vfio migration
> interface, so we should include that in the name, we don't know what
> other migration interfaces might exist.
do you mean we need to name it as vfio_migration, e.g.
 /sys/bus/pci/devices/0000\:00\:02.0/UUID1/vfio_migration ?
> 
> > > SELF:
> > > 	device_type=pci
> 
> Why not the device_api here, ie. vfio-pci.  The device doesn't provide
> a pci interface directly, it's wrapped in a vfio API.
> 
the device_type is to indicate that the device_id below is a pci id.

yes, including a device_api field is better.
for mdev, "device_type=vfio-mdev", is it right?

> > > 	device_id=8086591d
> 
> Is device_id interpreted relative to device_type?  How does this
> relate to mdev_type?  If we have an mdev_type, doesn't that fully
> define the software API?
> 
it's actually the parent's pci id for an mdev.


> > > 	mdev_type=i915-GVTg_V5_2
> 
> And how are non-mdev devices represented?
> 
non-mdev devices can opt not to include this field or, as you said below,
provide a vendor signature.

> > > 	aggregator=1
> > > 	pv_mode="none+ppgtt+context"
> 
> These are meaningless vendor specific matches afaict.
> 
yes, pv_mode and aggregator are vendor specific fields.
but they are important for deciding whether two devices are compatible.
pv_mode indicates which guest paravirtualized (pv) apis a vGPU supports.
"none+ppgtt+context" means the guest can run without pv, with ppgtt mode pv, or
with context mode pv.

> > > 	interface_version=3
> 
> Not much granularity here, I prefer Sean's previous
> <major>.<minor>[.bugfix] scheme.
> 
yes, a <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if
it works for a complicated scenario.
e.g. for pv_mode,
(1) initially, pv is not supported, so it's pv_mode=none, version 0.0.0,
(2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", version 0.1.0,
indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice versa.
(3) later, pv_mode=context is also supported,
pv_mode="none+ppgtt+context", so it's 0.2.0.

But if pv_mode=ppgtt is removed later, giving pv_mode="none+context", how do
we name its version? "none+ppgtt" (0.1.0) is not compatible with
"none+context", but "none+ppgtt+context" (0.2.0) is compatible with
"none+context".

Maintaining such a scheme is painful for the vendor driver.
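
A small sketch of why a single increasing version number struggles here,
assuming the rule implied by step (2): the target must offer every pv mode
that the source offers. The function name can_migrate and that subset rule
are my assumptions for illustration, not an agreed definition.

```python
def can_migrate(source_modes, target_modes):
    """Assumed rule: the target must offer every pv mode the source offers."""
    return set(source_modes.split("+")) <= set(target_modes.split("+"))


v010 = "none+ppgtt"          # 0.1.0 in the numbering above
v020 = "none+ppgtt+context"  # 0.2.0 in the numbering above
new = "none+context"         # ppgtt removed: which version number?

print(can_migrate(new, v020))
print(can_migrate(new, v010))
print(can_migrate(v010, new))
```

Under that assumed rule the new mode set fits under 0.2.0 but is incomparable
with 0.1.0 in both directions, so it has no natural position on a line of
minor versions.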



> > > COMPATIBLE:
> > > 	device_type=pci
> > > 	device_id=8086591d
> > > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}  
> > this mixed notation will be hard to parse so i would avoid that.
> 
> Some background, Intel has been proposing aggregation as a solution to
> how we scale mdev devices when hardware exposes large numbers of
> assignable objects that can be composed in essentially arbitrary ways.
> So for instance, if we have a workqueue (wq), we might have an mdev
> type for 1wq, 2wq, 3wq,... Nwq.  It's not really practical to expose a
> discrete mdev type for each of those, so they want to define a base
> type which is composable to other types via this aggregation.  This is
> what this substitution and tagging is attempting to accomplish.  So
> imagine this set of values for cases where it's not practical to unroll
> the values for N discrete types.
> 
> > > 	aggregator={val1}/2
> 
> So the {val1} above would be substituted here, though an aggregation
> factor of 1/2 is a head scratcher...
> 
> > > 	pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
> 
> I'm lost on this one though.  I think maybe it's indicating that it's
> compatible with any of these, so do we need to list it?  Couldn't this
> be handled by Sean's version proposal where the minor version
> represents feature compatibility?
yes, it's indicating that it's compatible with any of these.
Sean's version proposal may also work, but it would be painful for
the vendor driver to maintain the versions when multiple similar features
are involved.

> 
> > >  
> > > 	interface_version={val3:int:2,3}
> 
> What does this turn into in a few years, 2,7,12,23,75,96,...
> 
is a range better?

> > > COMPATIBLE:
> > > 	device_type=pci
> > > 	device_id=8086591d
> > > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > > 	aggregator={val1}/2
> > > 	pv_mode=""  #"" meaning empty, could be absent in a compatible device
> > > 	interface_version=1  
> 
> Why can't this be represented within the previous compatible
> description?
> 
actually it can be merged with the previous one :)
But I guess there will be cases that cannot be merged, so I put it here as an
example to demo multiple COMPATIBLE objects.
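
To illustrate how multiple COMPATIBLE objects would be evaluated, here is a
minimal sketch using the UUID1/UUID2 values from the example quoted above.
The function names, the dict representation, and the treatment of an absent
field as an empty string are assumptions; the variables are pre-expanded into
sets, which drops the link between mdev_type and aggregator.

```python
def matches(self_fields, compatible):
    """Every field of one COMPATIBLE object must accept the SELF value ("and")."""
    for key, allowed in compatible.items():
        # assumption: a field missing from SELF is treated as the empty string
        if self_fields.get(key, "") not in allowed:
            return False
    return True


def in_compatible_list(self_fields, compatible_objects):
    """A device is accepted if any one COMPATIBLE object matches ("or")."""
    return any(matches(self_fields, c) for c in compatible_objects)


# SELF of UUID2 from the example, as plain strings
uuid2_self = {"device_type": "pci", "device_id": "8086591d",
              "mdev_type": "i915-GVTg_V5_4", "aggregator": "2",
              "interface_version": "1"}

# The two COMPATIBLE objects of UUID1, with variables expanded into sets
types = {"i915-GVTg_V5_%d" % n for n in (1, 2, 4, 8)}
uuid1_compatible = [
    {"device_type": {"pci"}, "device_id": {"8086591d"}, "mdev_type": types,
     "pv_mode": {"none+ppgtt", "none+context", "none+ppgtt+context"},
     "interface_version": {"2", "3"}},
    {"device_type": {"pci"}, "device_id": {"8086591d"}, "mdev_type": types,
     "pv_mode": {""}, "interface_version": {"1"}},
]

print(in_compatible_list(uuid2_self, uuid1_compatible))
```

Here UUID2 fails the first object (it has no pv_mode and its
interface_version is 1) and is accepted only by the second one.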

Thanks
Yan

> > if you presented this information the only way i could see to use it would be to
> > extract the mdev_type name and interface_version and build a database table as follows
> > 
> > source_mdev_type | source_version | target_mdev_type | target_version
> > i915-GVTg_V5_2 | 3 | i915-GVTg_V5_{val1:int:1,2,4,8} | {val3:int:2,3}
> > i915-GVTg_V5_2 | 3 | i915-GVTg_V5_{val1:int:1,2,4,8} | 1
> > 
> > this would either require us to use a post placement scheduler filter to introspect this database
> > or transform the target_mdev_type and target_version column data into CUSTOM_* traits we apply to
> > our placement resource providers, and we would have to perform multiple requests, one for each possible compatible
> > alternative.  if the vm has multiple mdevs this is combinatorially problematic as it is 1 query for each
> > device * the number of possible compatible devices for that device.
> > 
> > in other words if this is just opaque data we can't ever represent it efficiently in our placement service and
> > have to fall back to an expensive post placement scheduler filter based on the db table approach.
> > 
> > this also ignores the fact that at present the mdev_type cannot change during a migration so a compatible
> > device with a different mdev type would not be considered an acceptable choice in openstack. the way you select a host
> > with a specific vgpu mdev type today is to apply a custom trait which is CUSTOM_<mdev_type_goes_here> to the vGPU
> > resource provider and then in the flavor you request 1 allocation of vGPU and require the CUSTOM_<mdev_type_goes_here>
> > trait. so going from i915-GVTg_V5_2 to i915-GVTg_V5_{val1:int:1,2,4,8} would not currently be compatible with that
> > workflow.
> 
> The latter would need to be parsed into:
> 
> i915-GVTg_V5_1
> i915-GVTg_V5_2
> i915-GVTg_V5_4
> i915-GVTg_V5_8
> 
> There is also on the table, migration from physical devices to mdev
> devices (or vice versa), which is not represented in these examples,
> nor do I see how we'd represent it.  This is where I started exposing
> the resulting PCI device from the mdev in my example so we could have
> some commonality between devices, but the migration stream provider is
> just as important as the type of device, we could have different host
> drivers providing the same device with incompatible migration streams.
> The mdev_type encompasses both the driver and device, but we wouldn't
> have mdev_types for physical devices, per our current thinking.
> 
> 
> > > #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID2/migration_compatible
> > > SELF:
> > > 	device_type=pci
> > > 	device_id=8086591d
> > > 	mdev_type=i915-GVTg_V5_4
> > > 	aggregator=2
> > > 	interface_version=1
> > > COMPATIBLE: 
> > > 	device_type=pci
> > > 	device_id=8086591d
> > > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > > 	aggregator={val1}/2
> > > 	interface_version=1  
> > by the way this is closer to yaml format than it is to json but it does not align with any existing
> > format i know of, so that just makes the representation needlessly hard to consume.
> > if we are going to use a markup language let's use a standard one like yaml, json or toml and not invent a new one.
> > > 
> > > Notes:
> > > - A COMPATIBLE object is a line starting with COMPATIBLE.
> > >   It specifies a list of compatible devices that are allowed to migrate
> > >   in.
> > >   The reason to allow multiple COMPATIBLE objects is that when it
> > >   is hard to express a complex compatible logic in one COMPATIBLE
> > >   object, a simple enumeration is still a fallback.
> > >   in the above example, device UUID2 is in the compatible list of
> > >   device UUID1, but device UUID1 is not in the compatible list of device
> > >   UUID2, so device UUID2 is able to migrate to device UUID1, but device
> > >   UUID1 is not able to migrate to device UUID2.
> > > 
> > > - fields under each object are of "and" relationship to each other,  meaning
> > >   all fields of SELF object of a target device must be equal to corresponding
> > >   fields of a COMPATIBLE object of source device, otherwise it is regarded as not
> > >   compatible.
> > > 
> > > - each field, however, is able to specify multiple allowed values, using
> > >   variables as explained below.
> > > 
> > > - variables are represented with {}, the first appearance of one variable
> > >   specifies its type and allowed list. e.g.
> > >   {val1:int:1,2,4,8} represents val1 whose type is integer and allowed
> > >   values are 1, 2, 4, 8.
> > > 
> > > - vendors are able to specify which fields are within the comparing list
> > >   and which fields are not. e.g. for physical VF migration, it may not
> > >   choose mdev_type as a comparing field, and maybe use driver name instead.  
> > this format might be useful to vendors but from an orchestrator
> > perspective i don't think this has value to us. likely we would not use
> > this api if it was added as it does not help us with scheduling.
> > ideally instead of declaring which other mdev types a device is
> > compatible with (which could presumably change over time as new
> > devices and firmwares are released) i would prefer to see a
> > declarative non vendor specific api that declares the feature set
> > provided by each mdev_type from which we can infer compatibility
> > similar to cpu feature flags. for devices of the same mdev_type name,
> > additionally a declarative version string could also be used if
> > required for additional compatibility checks.
> 
> "non vendor specific api that declares the feature set", aren't
> features generally vendor specific?  What we're trying to describe is,
> by its very nature, vendor specific.  We don't have an ISO body
> defining a graphics adapter and enumerating features for that adapter.
> I think what we have is mdev_types.  Each type is supposed to define a
> specific software interface, perhaps even more so than is done by a PCI
> vendor:device ID.  Maybe that mdev_type needs to be abstracted as
> something more like a vendor signature, such that a physical device
> could provide or accept a vendor signature that's compatible with an
> mdev device.  For example, a physically assigned Intel GPU might expose
> a migration signature of i915-GVTg_v5_8 if it were designed to be
> compatible with that mdev_type.  Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device compatibility interface for live migration with assigned devices
  2020-07-30  1:56                 ` Yan Zhao
@ 2020-07-30 13:14                   ` Sean Mooney
  0 siblings, 0 replies; 48+ messages in thread
From: Sean Mooney @ 2020-07-30 13:14 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Alex Williamson, kvm, libvir-list, Jason Wang, qemu-devel,
	kwankhede, eauger, xin-ran.wang, corbet, openstack-discuss,
	shaohe.feng, kevin.tian, eskultet, jian-feng.ding, dgilbert,
	zhenyuw, hejie.xu, bao.yumeng, intel-gvt-dev, berrange, cohuck,
	dinechin, devel

On Thu, 2020-07-30 at 09:56 +0800, Yan Zhao wrote:
> On Wed, Jul 29, 2020 at 12:28:46PM +0100, Sean Mooney wrote:
> > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > 
> > > > > > > As you indicate, the vendor driver is responsible for checking version
> > > > > > > information embedded within the migration stream.  Therefore a
> > > > > > > migration should fail early if the devices are incompatible.  Is it  
> > > > > > 
> > > > > > but as I know, currently in VFIO migration protocol, we have no way to
> > > > > > get vendor specific compatibility checking string in migration setup stage
> > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > > > > In this way, for devices who does not save device data in precopy stage,
> > > > > > the migration compatibility checking is as late as in stop-and-copy
> > > > > > stage, which is too late.
> > > > > > do you think we need to add the getting/checking of vendor specific
> > > > > > compatibility string early in save_setup stage?
> > > > > >  
> > > > > 
> > > > > hi Alex,
> > > > > after an offline discussion with Kevin, I realized that it may not be a
> > > > > problem if migration compatibility check in vendor driver occurs late in
> > > > > stop-and-copy phase for some devices, because if we report device
> > > > > compatibility attributes clearly in an interface, the chances for
> > > > > libvirt/openstack to make a wrong decision is little.
> > > > 
> > > > I think it would be wise for a vendor driver to implement a pre-copy
> > > > phase, even if only to send version information and verify it at the
> > > > target.  Deciding you have no device state to send during pre-copy does
> > > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > > entirely.  Please also note that pre-copy is at the user's discretion,
> > > > we've defined that we can enter stop-and-copy at any point, including
> > > > without a pre-copy phase, so I would recommend that vendor drivers
> > > > validate compatibility at the start of both the pre-copy and the
> > > > stop-and-copy phases.
> > > > 
> > > 
> > > ok. got it!
> > > 
> > > > > so, do you think we are now arriving at an agreement that we'll give up
> > > > > the read-and-test scheme and start to defining one interface (perhaps in
> > > > > json format), from which libvirt/openstack is able to parse and find out
> > > > > compatibility list of a source mdev/physical device?
> > > > 
> > > > Based on the feedback we've received, the previously proposed interface
> > > > is not viable.  I think there's agreement that the user needs to be
> > > > able to parse and interpret the version information.  Using json seems
> > > > viable, but I don't know if it's the best option.  Is there any
> > > > precedent of markup strings returned via sysfs we could follow?
> > > 
> > > I found some examples of using formatted string under /sys, mostly under
> > > tracing. maybe we can do a similar implementation.
> > > 
> > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > > 
> > > name: kvm_mmio
> > > ID: 32
> > > format:
> > >         field:unsigned short common_type;       offset:0;       size:2; signed:0;
> > >         field:unsigned char common_flags;       offset:2;       size:1; signed:0;
> > >         field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
> > >         field:int common_pid;   offset:4;       size:4; signed:1;
> > > 
> > >         field:u32 type; offset:8;       size:4; signed:0;
> > >         field:u32 len;  offset:12;      size:4; signed:0;
> > >         field:u64 gpa;  offset:16;      size:8; signed:0;
> > >         field:u64 val;  offset:24;      size:8; signed:0;
> > > 
> > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1,
> > > "read"
> > > }, { 2, "write" }), REC->len, REC->gpa, REC->val
> > > 
> > 
> > > this is not json format and it's not super friendly to parse.
> 
> yes, it's just an example. It's exported to be used by userspace perf &
> trace_cmd.
> 
> > > 
> > > #cat /sys/devices/pci0000:00/0000:00:02.0/uevent
> > > DRIVER=vfio-pci
> > > PCI_CLASS=30000
> > > PCI_ID=8086:591D
> > > PCI_SUBSYS_ID=8086:2212
> > > PCI_SLOT_NAME=0000:00:02.0
> > > MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
> > > 
> > 
> > > this is ini format or conf format.
> > > this is pretty simple to parse which would be fine.
> > > that said you could also have a version or capability directory with a file
> > > for each key and a single value.
> > 
> 
> if this is easy for openstack, maybe we can organize the data like below way?
>  
>  |- [device]
>     |- migration
>         |- self
>         |- compatible1
>         |- compatible2
> 
> e.g. 
> #cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration/self
>  	field1=xxx
>  	field2=xxx
>  	field3=xxx
>  	field3=xxx
> #cat /sys/bus/pci/devices/0000:00:02.0/UUID1/migration/compatible
>  	field1=xxx
>  	field2=xxx
>  	field3=xxx
>  	field3=xxx

ya this would work.
nova, specifically the libvirt driver, tries to avoid reading sysfs directly if libvirt
has an api that provides the information, but where it does not it can read it, and that
structure would be easy for us to consume.

libs like os-vif which can't depend on libvirt use it a little more,
for example to look up a PF from one of its VFs
https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/linux_net.py#L384-L391

we are careful not to overuse sysfs as it can change over time based on kernel version in some cases,
but it is generally seen as preferable to calling an ever growing list of command line clients to retrieve
the same info.
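
to give a rough idea of how little code a uevent style key=value file needs on the
consumer side, here is a minimal python sketch. the function name parse_key_value and
the sample text are made up for illustration; the migration/self file itself is only
a proposal in this thread and does not exist in any kernel.

```python
def parse_key_value(text):
    """Parse uevent-style KEY=value lines into a dict.

    Blank lines and lines without '=' are skipped. Surrounding whitespace
    (including the leading tab used in the examples in this thread) is
    stripped from both key and value.
    """
    result = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line, ignore it
        result[key.strip()] = value.strip()
    return result


# Sample taken from the uevent output quoted above.
sample = """DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
"""

attrs = parse_key_value(sample)
print(attrs["DRIVER"], attrs["PCI_ID"])
```

with the nested directory layout, the same helper would be applied to the contents of
each file under the proposed migration/ directory in turn.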
> 
> or in a flat layer
>  |- [device]
>     |- migration-self-traits
>     |- migration-compatible-traits
> 
> I'm not sure whether json format in a single file is better, as I didn't
> find any precedent.
i think i prefer the nested directories to this flattened style, but there isn't really any significant increase
in complexity. from a bash scripting point of view, if i was manually debugging something, the multi layer representation is
slightly simpler to work with, but not enough so that it really matters.
> 
> > > i would prefer to only have to do one read personally rather than list the files in a
> > > directory and then read them all to build the data structure myself, but that is
> > > doable. though the simple ini format used for uevent seems the best of the 3 options
> > > provided above.
> > > > 
> > > > Your idea of having both a "self" object and an array of "compatible"
> > > > objects is perhaps something we can build on, but we must not assume
> > > > PCI devices at the root level of the object.  Providing both the
> > > > mdev-type and the driver is a bit redundant, since the former includes
> > > > the latter.  We can't have vendor specific versioning schemes though,
> > > > ie. gvt-version. We need to agree on a common scheme and decide which
> > > > fields the version is relative to, ex. just the mdev type?
> > > 
> > > what about making all comparing fields vendor specific?
> > > userspace like openstack only needs to parse and compare if target
> > > device is within source compatible list without understanding the meaning
> > > of each field.
> > 
> > > that kind of defeats the reason for having them be parsable.
> > > the reason openstack wants to be able to understand the capabilities is so
> > > we can statically declare the capabilities of devices ahead of time so our scheduler
> > > can select hosts based on that. if the keys and data are opaque to userspace
> > > because they are just random vendor specific blobs we can't do that.
> 
> I heard that cyborg can parse the kernel interface and generate several
> traits without understanding the meaning of each trait. Then it reports
> those traits to placement for scheduling.
if it is doing a raw passthrough like that, first it will break users if a vendor ever
removes a trait or renames it as part of a firmware update, and second it will require them to use
CUSTOM_ traits instead of standardised traits. in other words it is an interoperability problem between clouds.

at present cyborg does not support mdevs.
there is a proposal for adding a generic mdev driver for generic stateless devices.
it could report arbitrary capabilities to placement, although it does not exist yet so it's kind of premature to point
to it as an example.
> 
> but I agree if mdev creation is involved, those traits need to match
> to mdev attributes and mdev_type.
currently the only use of mdevs in openstack is for vGPU with nvidia devices.
in theory intel gpus can work with the existing code but it has not been tested.
> 
> could you explain a little how you plan to create a target mdev device?
> is it dynamically created during searching of compatible mdevs or just statically
> created before migration?
the mdevs are currently created dynamically when a vm is created, based on a set of predefined
flavors which have static metadata in the form of flavor extra_specs.
those extra specs request a vgpu by specifying resources:VGPU=1 in the extra specs.
e.g. openstack flavor set vgpu_1 --property "resources:VGPU=1"
if you want a specific vgpu type then you must request a custom trait in addition to the resource class
openstack --os-placement-api-version 1.6 trait create CUSTOM_NVIDIA_11
openstack flavor set --property trait:CUSTOM_NVIDIA_11=required vgpu_1

when configuring the host for vGPUs you list the enabled vgpu mdev types and the devices that can use them

   [devices]
   enabled_vgpu_types = nvidia-35, nvidia-36

   [vgpu_nvidia-35]
   device_addresses = 0000:84:00.0,0000:85:00.0

   [vgpu_nvidia-36]
   device_addresses = 0000:86:00.0

each device that is listed will be created as a resource provider in the placement service,
so to associate the custom trait with the physical gpu and mdev type you manually tag the RP with the trait

openstack --os-placement-api-version 1.6 resource provider trait set \
    --trait CUSTOM_NVIDIA_11 e2f8607b-0683-4141-a8af-f5e20682e28c

this decouples the name of the CUSTOM_ trait from the underlying mdev type,
so the operator is free to use small|medium|large or bronze|silver|gold if they want to, or they could choose to use the
mdev_type name if they want to.

currently we don't support live migration with vGPU because the required code has not been upstreamed to qemu/kvm
yet? i believe it just missed the kernel 5.7 merge window? i know it's in flight but have not been following too closely

if you do a cold/offline migration currently and you had multiple mdev types then technically the mdev type could change.
we had planned for operators to ensure that whatever trait they choose would map to the same mdev type on all hosts.
if we were to support live migration in the future without this new api we are discussing, we would make the trait to
mdev type relationship required to be 1:1 for live migration.

we have talked about auto creating traits for vgpus which would be in the form of CUSTOM_<mdev type> but shied away from it
as we are worried vendors will break us and our users by changing mdev_types in firmware updates or driver updates.
we kind of need to rely on them being stable but we are hesitant to encode them in our public api in this manner.
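
for illustration only, deriving a CUSTOM_<mdev type> trait name could look roughly like the
python sketch below. custom traits in placement are upper case, start with CUSTOM_ and use only
letters, digits and underscores, so other characters are mapped to underscores here. the helper
name mdev_type_to_trait is hypothetical and this is not code that exists in nova.

```python
import re


def mdev_type_to_trait(mdev_type):
    """Build a CUSTOM_ trait name from an mdev type name.

    Hypothetical helper: upper-cases the name and replaces every character
    outside A-Z, 0-9 and '_' with an underscore.
    """
    normalized = re.sub(r"[^A-Z0-9_]", "_", mdev_type.upper())
    return "CUSTOM_" + normalized


print(mdev_type_to_trait("nvidia-35"))
print(mdev_type_to_trait("i915-GVTg_V5_8"))
```

the obvious downside is the one described above: the generated name changes whenever the vendor
renames the mdev type, and the old trait would then no longer match anything.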

> > > 
> > > > I had also proposed fields that provide information to create a
> > > > compatible type, for example to create a type_x2 device from a type_x1
> > > > mdev type, they need to know to apply an aggregation attribute.
honestly from an openstack point of view i would prefer if each consumable resource was
exposed as a different mdev_type and we could just create multiple mdevs and attach them to
a vm. that would allow us to do the aggregation ourselves. parsing mdev attributes
and dynamically creating 1 mdev type from an aggregation of others requires detailed knowledge of the
vendor device.

the cyborg (accelerator management) project might be open to this because they have pluggable vendor specific drivers and could
write a driver that only works with a specific sku of a vendor device or a device family, e.g. a qat
driver that could have the required knowledge to do the composition.

that type of low level device management is out of scope of the nova (compute) project.
we would be far more likely to require operators to statically partition the device up front into mdevs
and pass us a list of them which we could then provide to vms.

we more or less already do this for vGPU today as the physical gpus need to be declared to support exactly 1 mdev_type
each, and the same is true for persistent memory. you need to pre-create the persistent memory namespaces and then
provide the list of namespaces to nova.

so aggregation is something i suspect will only be supported in cyborg if it eventually supports mdevs.
it has not been requested or assessed for nova yet but it seems unlikely.
in a migration workflow i would expect the nova conductor or source host to make an rpc call to the destination
host in pre live migration to create the mdev. this is before the call to libvirt to migrate the vm and before it would
do any validation, but after scheduling. so ideally we should know at this point that the destination host has a
compatible device.
> > > >   If we
> > > > need to explicitly list every aggregation value and the resulting type,
> > > > I think we run aground of what aggregation was trying to avoid anyway,
> > > > so we might need to pick a language that defines variable substitution
> > > > or some kind of tagging.  For example if we could define ${aggr} as an
> > > > integer within a specified range, then we might be able to define a type
> > > > relative to that value (type_x${aggr}) which requires an aggregation
> > > > attribute using the same value.  I dunno, just spit balling.  Thanks,
> > > 
> > > what about a migration_compatible attribute under device node like
> > > below?
> > 
> > > rather than listing compatible devices it would be better if you could declaratively
> > > list the features supported and we could compare those along with a simple semver version string.
> 
> I think below is already in a way of listing feature supported.
> The reason I also want to declare compatible lists of features is that
> sometimes it's not a simple 1:1 matching of source list and target list.
> as I demonstrated below,
> source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to
> target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
>                (mdev_type i915-GVTg_V5_8 + aggregator 4)
> 
> and aggragator may be just one of such examples that 1:1 matching is not
> fit.
so far i am not convinced that aggregators are a good concept to model at this level.
is there some document that explains why they are needed and why we can't just have multiple
mdev_types per consumable resource and attach multiple mdevs to a single vm?

i suspect this is due to limitations in composability in hardware such as nvidia multi
instance gpu tech. however (mdev_type i915-GVTg_V5_8 + aggregator 4) seems unfriendly to work with
from an orchestrator perspective.

one of our current complaints with the mdev api today is that, depending on the device, consuming
an instance of 1 mdev type may impact the availability of others or change the available capacity of others.
that makes it very hard to reason about capacity availability, and aggregators sound like they will
make that problem worse not better.

> so I guess it's best not to leave the hard decision to openstack.
> 
> Thanks
> Yan
> > > 
> > > #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID1/migration_compatible
> > > SELF:
> > > 	device_type=pci
> > > 	device_id=8086591d
> > > 	mdev_type=i915-GVTg_V5_2
> > > 	aggregator=1
> > > 	pv_mode="none+ppgtt+context"
> > > 	interface_version=3
> > > COMPATIBLE:
> > > 	device_type=pci
> > > 	device_id=8086591d
> > > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > 
> > this mixed notation will be hard to parse so i would avoid that.
> > > 	aggregator={val1}/2
> > > 	pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}
> > >  
> > > 	interface_version={val3:int:2,3}
> > > COMPATIBLE:
> > > 	device_type=pci
> > > 	device_id=8086591d
> > > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > > 	aggregator={val1}/2
> > > 	pv_mode=""  #"" meaning empty, could be absent in a compatible device
> > > 	interface_version=1
> > 
> > > if you presented this information the only way i could see to use it would be to
> > > extract the mdev_type name and interface_version and build a database table as follows
> > > 
> > > source_mdev_type | source_version | target_mdev_type | target_version
> > > i915-GVTg_V5_2 | 3 | i915-GVTg_V5_{val1:int:1,2,4,8} | {val3:int:2,3}
> > > i915-GVTg_V5_2 | 3 | i915-GVTg_V5_{val1:int:1,2,4,8} | 1
> > > 
> > > this would either require us to use a post placement scheduler filter to introspect this database
> > > or transform the target_mdev_type and target_version column data into CUSTOM_* traits we apply to
> > > our placement resource providers, and we would have to perform multiple requests for each possible compatible
> > > alternative.  if the vm has multiple mdevs this is combinatorially problematic as it is 1 query for each
> > > device * the number of possible compatible devices for that device.
> > > 
> > > in other words if this is just opaque data we can't ever represent it efficiently in our placement service and
> > > have to fall back to an expensive post placement scheduler filter based on the db table approach.
> > > 
> > > this also ignores the fact that at present the mdev_type cannot change during a migration so the compatible
> > > devices with a different mdev type would not be considered an acceptable choice in openstack. the way you select a host
> > > with a specific vgpu mdev type today is to apply a custom trait which is CUSTOM_<mdev_type_goes_here> to the vGPU
> > > resource provider and then in the flavor you request 1 allocation of vGPU and require the
> > > CUSTOM_<mdev_type_goes_here> trait.
> > > so going from i915-GVTg_V5_2 to i915-GVTg_V5_{val1:int:1,2,4,8} would not currently be compatible with that
> > > workflow.
> > 
> > 
> > > > #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID2/migration_compatible
> > > SELF:
> > > 	device_type=pci
> > > 	device_id=8086591d
> > > 	mdev_type=i915-GVTg_V5_4
> > > 	aggregator=2
> > > 	interface_version=1
> > > COMPATIBLE: 
> > > 	device_type=pci
> > > 	device_id=8086591d
> > > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > > 	aggregator={val1}/2
> > > 	interface_version=1
> > 
> > > by the way this is closer to yaml format than it is to json but it does not align with any existing
> > > format i know of so that just makes the representation needlessly hard to consume.
> > > if we are going to use a markup language let's use a standard one like yaml, json or toml and not invent a new one.
> > > 
> > > Notes:
> > > - A COMPATIBLE object is a line starting with COMPATIBLE.
> > >   It specifies a list of compatible devices that are allowed to migrate
> > >   in.
> > >   The reason to allow multiple COMPATIBLE objects is that when it
> > >   is hard to express a complex compatible logic in one COMPATIBLE
> > >   object, a simple enumeration is still a fallback.
> > >   in the above example, device UUID2 is in the compatible list of
> > >   device UUID1, but device UUID1 is not in the compatible list of device
> > >   UUID2, so device UUID2 is able to migrate to device UUID1, but device
> > >   UUID1 is not able to migrate to device UUID2.
> > > 
> > > - fields under each object are of "and" relationship to each other,  meaning
> > >   all fields of SELF object of a target device must be equal to corresponding
> > >   fields of a COMPATIBLE object of source device, otherwise it is regarded as not
> > >   compatible.
> > > 
> > > - each field, however, is able to specify multiple allowed values, using
> > >   variables as explained below.
> > > 
> > > - variables are represented with {}, the first appearance of one variable
> > >   specifies its type and allowed list. e.g.
> > > >   {val1:int:1,2,4,8} represents val1 whose type is integer and allowed
> > >   values are 1, 2, 4, 8.
> > > 
> > > - vendors are able to specify which fields are within the comparing list
> > >   and which fields are not. e.g. for physical VF migration, it may not
> > >   choose mdev_type as a comparing field, and maybe use driver name instead.
> > 
> > > this format might be useful to vendors but from an orchestrator perspective i don't think this has
> > > value to us. likely we would not use this api if it was added as it does not help us with scheduling.
> > > ideally instead of declaring which other mdev types a device is compatible with (which could presumably change over
> > > time as new devices and firmwares are released) i would prefer to see a declarative non vendor specific api that
> > > declares the feature set provided by each mdev_type from which we can infer compatibility similar to cpu feature flags.
> > > for devices of the same mdev_type name additionally a declarative version string could also be used if required for
> > > additional compatibility checks.
> > >  
> > > 
> > > Thanks
> > > Yan
> > > 
> > > 
> 
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device compatibility interface for live migration with assigned devices
  2020-07-30  3:41                   ` Yan Zhao
@ 2020-07-30 13:24                     ` Sean Mooney
  2020-07-30 17:29                     ` Alex Williamson
  1 sibling, 0 replies; 48+ messages in thread
From: Sean Mooney @ 2020-07-30 13:24 UTC (permalink / raw)
  To: Yan Zhao, Alex Williamson
  Cc: kvm, libvir-list, Jason Wang, qemu-devel, kwankhede, eauger,
	xin-ran.wang, corbet, openstack-discuss, shaohe.feng, kevin.tian,
	eskultet, jian-feng.ding, dgilbert, zhenyuw, hejie.xu,
	bao.yumeng, intel-gvt-dev, berrange, cohuck, dinechin, devel

On Thu, 2020-07-30 at 11:41 +0800, Yan Zhao wrote:
> > > >    interface_version=3
> > 
> > Not much granularity here, I prefer Sean's previous
> > <major>.<minor>[.bugfix] scheme.
> > 
> 
> yes, <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if
> it works for a complicated scenario.
> e.g for pv_mode,
> (1) initially,  pv_mode is not supported, so it's pv_mode=none, it's 0.0.0,
> (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0,
> indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice versa.
> (3) later, pv_mode=context is also supported,
> pv_mode="none+ppgtt+context", so it's 0.2.0.
> 
> But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to
> name its version?
it would become 1.0.0
addition of a feature is a minor version bump as it's backwards compatible.
if you don't request the new feature you don't need to use it and it can continue to behave like
a 0.0.0 device even if it's capable of acting as a 0.1.0 device.
removing a feature is backward incompatible, as any instance that was previously
using it would no longer work, so you have to bump the major version.
>  "none+ppgtt" (0.1.0) is not compatible to
> "none+context", but "none+ppgtt+context" (0.2.0) is compatible to
> "none+context".
> 
> Maintaining such a scheme is painful for the vendor driver.
not really, it's how most software libs are versioned today. some use other schemes,
but semantic versioning done right is a concise and easy to consume set of rules
https://semver.org/ however you are right that it forces vendors to think about backwards
and forwards compatibility with each change, which for the most part is a good thing.
it goes hand in hand with having stable abi and api definitions to ensure firmware updates and driver changes
don't break userspace that depends on the kernel interfaces they expose.
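
as a rough sketch of how an orchestrator could apply the <major>.<minor>[.bugfix] rule described
above: a source can move to a target when the major versions match and the target minor is at
least the source minor. the function names are made up for illustration, and this follows the
conservative reading where any major bump means incompatible, so it would reject the
"none+ppgtt+context" to "none+context" case mentioned in the quoted text.

```python
def parse_version(text):
    """Turn "<major>.<minor>[.bugfix]" into a (major, minor, bugfix) tuple."""
    parts = [int(p) for p in text.split(".")]
    if not 2 <= len(parts) <= 3:
        raise ValueError("expected <major>.<minor>[.bugfix]: %r" % text)
    if len(parts) == 2:
        parts.append(0)  # bugfix is optional and defaults to 0
    return tuple(parts)


def can_migrate(source, target):
    """Return True if a device at version `source` may move to `target`."""
    s_major, s_minor, _ = parse_version(source)
    t_major, t_minor, _ = parse_version(target)
    # same major: nothing the source may rely on has been removed
    # target minor >= source minor: target offers every feature the source has
    return s_major == t_major and t_minor >= s_minor


print(can_migrate("0.0.0", "0.1.0"))  # pv_mode none -> none+ppgtt
print(can_migrate("0.1.0", "0.0.0"))  # none+ppgtt -> none
print(can_migrate("0.2.0", "1.0.0"))  # ppgtt removed, major bumped
```

the bugfix component is parsed but deliberately ignored in the comparison, since a bugfix
release is not supposed to change the interface.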



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: device compatibility interface for live migration with assigned devices
  2020-07-30  3:41                   ` Yan Zhao
  2020-07-30 13:24                     ` Sean Mooney
@ 2020-07-30 17:29                     ` Alex Williamson
  2020-08-04  8:37                       ` Yan Zhao
  1 sibling, 1 reply; 48+ messages in thread
From: Alex Williamson @ 2020-07-30 17:29 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Sean Mooney, kvm, libvir-list, Jason Wang, qemu-devel, kwankhede,
	eauger, xin-ran.wang, corbet, openstack-discuss, shaohe.feng,
	kevin.tian, eskultet, jian-feng.ding, dgilbert, zhenyuw,
	hejie.xu, bao.yumeng, intel-gvt-dev, berrange, cohuck, dinechin,
	devel

On Thu, 30 Jul 2020 11:41:04 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Wed, Jul 29, 2020 at 01:12:55PM -0600, Alex Williamson wrote:
> > On Wed, 29 Jul 2020 12:28:46 +0100
> > Sean Mooney <smooney@redhat.com> wrote:
> >   
> > > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:  
> > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:    
> > > > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > > > As you indicate, the vendor driver is responsible for checking version
> > > > > > > > information embedded within the migration stream.  Therefore a
> > > > > > > > migration should fail early if the devices are incompatible.  Is it      
> > > > > > > 
> > > > > > > but as I know, currently in VFIO migration protocol, we have no way to
> > > > > > > get vendor specific compatibility checking string in migration setup stage
> > > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > > > > > In this way, for devices who does not save device data in precopy stage,
> > > > > > > the migration compatibility checking is as late as in stop-and-copy
> > > > > > > stage, which is too late.
> > > > > > > do you think we need to add the getting/checking of vendor specific
> > > > > > > compatibility string early in save_setup stage?
> > > > > > >      
> > > > > > 
> > > > > > hi Alex,
> > > > > > after an offline discussion with Kevin, I realized that it may not be a
> > > > > > problem if migration compatibility check in vendor driver occurs late in
> > > > > > stop-and-copy phase for some devices, because if we report device
> > > > > > compatibility attributes clearly in an interface, the chances for
> > > > > > libvirt/openstack to make a wrong decision is little.    
> > > > > 
> > > > > I think it would be wise for a vendor driver to implement a pre-copy
> > > > > phase, even if only to send version information and verify it at the
> > > > > target.  Deciding you have no device state to send during pre-copy does
> > > > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > > > entirely.  Please also note that pre-copy is at the user's discretion,
> > > > > we've defined that we can enter stop-and-copy at any point, including
> > > > > without a pre-copy phase, so I would recommend that vendor drivers
> > > > > validate compatibility at the start of both the pre-copy and the
> > > > > stop-and-copy phases.
> > > > >     
> > > > 
> > > > ok. got it!
> > > >     
> > > > > > so, do you think we are now arriving at an agreement that we'll give up
> > > > > > the read-and-test scheme and start to defining one interface (perhaps in
> > > > > > json format), from which libvirt/openstack is able to parse and find out
> > > > > > compatibility list of a source mdev/physical device?    
> > > > > 
> > > > > Based on the feedback we've received, the previously proposed interface
> > > > > is not viable.  I think there's agreement that the user needs to be
> > > > > able to parse and interpret the version information.  Using json seems
> > > > > viable, but I don't know if it's the best option.  Is there any
> > > > > precedent of markup strings returned via sysfs we could follow?    
> > > > 
> > > > I found some examples of using formatted string under /sys, mostly under
> > > > tracing. maybe we can do a similar implementation.
> > > > 
> > > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > > > 
> > > > name: kvm_mmio
> > > > ID: 32
> > > > format:
> > > >         field:unsigned short common_type;       offset:0;       size:2; signed:0;
> > > >         field:unsigned char common_flags;       offset:2;       size:1; signed:0;
> > > >         field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
> > > >         field:int common_pid;   offset:4;       size:4; signed:1;
> > > > 
> > > >         field:u32 type; offset:8;       size:4; signed:0;
> > > >         field:u32 len;  offset:12;      size:4; signed:0;
> > > >         field:u64 gpa;  offset:16;      size:8; signed:0;
> > > >         field:u64 val;  offset:24;      size:8; signed:0;
> > > > 
> > > > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read"
> > > > }, { 2, "write" }), REC->len, REC->gpa, REC->val
> > > >     
> > > > this is not json format and it's not super friendly to parse.  
> > > > 
> > > > #cat /sys/devices/pci0000:00/0000:00:02.0/uevent
> > > > DRIVER=vfio-pci
> > > > PCI_CLASS=30000
> > > > PCI_ID=8086:591D
> > > > PCI_SUBSYS_ID=8086:2212
> > > > PCI_SLOT_NAME=0000:00:02.0
> > > > MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
> > > >     
> > > > this is ini format or conf format.
> > > > this is pretty simple to parse which would be fine.
> > > > that said you could also have a version or capability directory with a file
> > > > for each key and a single value.
> > > > 
> > > > i would prefer to only have to do one read personally rather than list the files in a
> > > > directory and then read them all to build the data structure myself, but that is
> > > > doable. though the simple ini format used for uevent seems the best of the 3 options
> > > > provided above.  
> > > > > 
> > > > > Your idea of having both a "self" object and an array of "compatible"
> > > > > objects is perhaps something we can build on, but we must not assume
> > > > > PCI devices at the root level of the object.  Providing both the
> > > > > mdev-type and the driver is a bit redundant, since the former includes
> > > > > the latter.  We can't have vendor specific versioning schemes though,
> > > > > ie. gvt-version. We need to agree on a common scheme and decide which
> > > > > fields the version is relative to, ex. just the mdev type?    
> > > > 
> > > > what about making all comparing fields vendor specific?
> > > > userspace like openstack only needs to parse and compare if target
> > > > device is within source compatible list without understanding the meaning
> > > > of each field.    
> > > > that kind of defeats the reason for having them be parsable.
> > > > the reason openstack wants to be able to understand the capabilities is so
> > > > we can statically declare the capabilities of devices ahead of time so our scheduler
> > > > can select hosts based on that. if the keys and data are opaque to userspace
> > > > because they are just random vendor specific blobs we can't do that.  
> > 
> > Agreed, I'm not sure I'm willing to rule out that there could be vendor
> > specific direct match fields, as I included in my example earlier in
> > the thread, but entirely vendor specific defeats much of the purpose
> > here.
> >   
> > > > > I had also proposed fields that provide information to create a
> > > > > compatible type, for example to create a type_x2 device from a type_x1
> > > > > mdev type, they need to know to apply an aggregation attribute.  If we
> > > > > need to explicitly list every aggregation value and the resulting type,
> > > > > I think we run aground of what aggregation was trying to avoid anyway,
> > > > > so we might need to pick a language that defines variable substitution
> > > > > or some kind of tagging.  For example if we could define ${aggr} as an
> > > > > integer within a specified range, then we might be able to define a type
> > > > > relative to that value (type_x${aggr}) which requires an aggregation
> > > > > attribute using the same value.  I dunno, just spit balling.  Thanks,    
> > > > 
> > > > what about a migration_compatible attribute under device node like
> > > > below?    
> > > rather than listing compatible devices, it would be better if you could declaratively
> > > list the features supported, and we could compare those along with a simple semver version string.
> > > > 
> > > > #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID1/migration_compatible  
> > 
> > Note that we're defining compatibility relative to a vfio migration
> > interface, so we should include that in the name, we don't know what
> > other migration interfaces might exist.  
> do you mean we need to name it as vfio_migration, e.g.
>  /sys/bus/pci/devices/0000\:00\:02.0/UUID1/vfio_migration ?
> >   
> > > > SELF:
> > > > 	device_type=pci  
> > 
> > Why not the device_api here, ie. vfio-pci.  The device doesn't provide
> > a pci interface directly, it's wrapped in a vfio API.
> >   
> the device_type is to indicate below device_id is a pci id.
> 
> yes, include a device_api field is better.
> for mdev, "device_type=vfio-mdev", is it right?

No, vfio-mdev is not a device API, it's the driver that attaches to the
mdev bus device to expose it through vfio.  The device_api exposes the
actual interface of the vfio device, it's also vfio-pci for typical
mdev devices found on x86, but may be vfio-ccw, vfio-ap, etc...  See
VFIO_DEVICE_API_PCI_STRING and friends.
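
As a rough illustration (not part of any proposal in this thread), the
device_api string of each mdev type that a parent device supports can be read
from the existing mdev_supported_types/<type>/device_api sysfs attribute.
The Python sketch below assumes that directory layout; the helper name
supported_device_apis is made up for the example.

```python
from pathlib import Path
from typing import Dict


def supported_device_apis(parent_dev: str) -> Dict[str, str]:
    """Map each mdev type under a parent device to its device_api string.

    parent_dev is the sysfs directory of the parent device, for example
    /sys/bus/pci/devices/0000:00:02.0 (path used here only as an example).
    """
    types_dir = Path(parent_dev) / "mdev_supported_types"
    apis: Dict[str, str] = {}
    if not types_dir.is_dir():
        # Parent device does not expose any mdev types.
        return apis
    for type_dir in sorted(types_dir.iterdir()):
        api_file = type_dir / "device_api"
        if api_file.is_file():
            # The attribute holds a single string such as "vfio-pci".
            apis[type_dir.name] = api_file.read_text().strip()
    return apis
```

For a typical x86 mdev parent this would be expected to return "vfio-pci" for
every type, since vfio-mdev itself is only the bus driver and never shows up
as a device API.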
 
> > > > 	device_id=8086591d  
> > 
> > Is device_id interpreted relative to device_type?  How does this
> > relate to mdev_type?  If we have an mdev_type, doesn't that fully
> > define the software API?
> >   
> it's parent pci id for mdev actually.

If we need to specify the parent PCI ID then something is fundamentally
wrong with the mdev_type.  The mdev_type should define a unique,
software compatible interface, regardless of the parent device IDs.  If
an i915-GVTg_V5_2 means different things based on the parent device IDs,
then different mdev_types should be reported for those parent
devices.

> > > > 	mdev_type=i915-GVTg_V5_2  
> > 
> > And how are non-mdev devices represented?
> >   
> non-mdev can opt to not include this field, or as you said below, a
> vendor signature. 
> 
> > > > 	aggregator=1
> > > > 	pv_mode="none+ppgtt+context"  
> > 
> > These are meaningless vendor specific matches afaict.
> >   
> yes, pv_mode and aggregator are vendor specific fields.
> but they are important to decide whether two devices are compatible.
> pv_mode means whether a vGPU supports guest paravirtualized api.
> "none+ppgtt+context" means guest can not use pv, or use ppgtt mode pv or
> use context mode pv.
> 
> > > > 	interface_version=3  
> > 
> > Not much granularity here, I prefer Sean's previous
> > <major>.<minor>[.bugfix] scheme.
> >   
> yes, <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if
> it works for a complicated scenario.
> e.g for pv_mode,
> (1) initially,  pv_mode is not supported, so it's pv_mode=none, it's 0.0.0,
> (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0,
> indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice versa.
> (3) later, pv_mode=context is also supported,
> pv_mode="none+ppgtt+context", so it's 0.2.0.
> 
> But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to
> name its version? "none+ppgtt" (0.1.0) is not compatible to
> "none+context", but "none+ppgtt+context" (0.2.0) is compatible to
> "none+context".

If pv_mode=ppgtt is removed, then the compatible versions would be
0.0.0 or 1.0.0, ie. the major version would be incremented due to
feature removal.
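
To make the numbering concrete, here is a minimal Python sketch of one way a
management tool could apply such a scheme. The rule it encodes (same major,
target minor not lower than the source minor) and the dst_also_accepts
parameter for cross-major exceptions are assumptions for illustration, as are
the function names; nothing here is an agreed interface.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Parse "<major>.<minor>[.bugfix]" into a 3-tuple; bugfix defaults to 0."""
    parts = [int(p) for p in text.strip().split(".")]
    if len(parts) == 2:
        parts.append(0)
    if len(parts) != 3:
        raise ValueError(f"bad version string: {text!r}")
    return (parts[0], parts[1], parts[2])


def can_migrate(src: str, dst: str,
                dst_also_accepts: Iterable[str] = ()) -> bool:
    """Return True if a device at version src may migrate to one at dst."""
    s, d = parse_version(src), parse_version(dst)
    # Hypothetical: explicit older versions the target declares it accepts,
    # e.g. 0.0.0 after a feature removal bumped the major version to 1.
    if s in {parse_version(v) for v in dst_also_accepts}:
        return True
    # Assumed rule: features are only added within a major version, so the
    # target must be at the same major and at least the source's minor.
    return s[0] == d[0] and d[1] >= s[1]
```

With the pv_mode example, can_migrate("0.0.0", "0.1.0") holds while the
reverse does not, and a 1.0.0 driver that dropped ppgtt would list 0.0.0 in
dst_also_accepts so that only sources without ppgtt are admitted.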
 
> Maintaining such a scheme is painful for the vendor driver.

Migration compatibility is painful, there's no way around that.  I
think the version scheme is an attempt to push some of that low level
burden on the vendor driver, otherwise the management tools need to
work on an ever growing matrix of vendor specific features which is
going to become unwieldy and is largely meaningless outside of the
vendor driver.  Instead, the vendor driver can make strategic decisions
about where to continue to maintain a support burden and make explicit
decisions to maintain or break compatibility.  The version scheme is a
simplification and abstraction of vendor driver features in order to
create a small, logical compatibility matrix.  Compromises necessarily
need to be made for that to occur.

> > > > COMPATIBLE:
> > > > 	device_type=pci
> > > > 	device_id=8086591d
> > > > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}    
> > > this mixed notation will be hard to parse so i would avoid that.  
> > 
> > Some background, Intel has been proposing aggregation as a solution to
> > how we scale mdev devices when hardware exposes large numbers of
> > assignable objects that can be composed in essentially arbitrary ways.
> > So for instance, if we have a workqueue (wq), we might have an mdev
> > type for 1wq, 2wq, 3wq,... Nwq.  It's not really practical to expose a
> > discrete mdev type for each of those, so they want to define a base
> > type which is composable to other types via this aggregation.  This is
> > what this substitution and tagging is attempting to accomplish.  So
> > imagine this set of values for cases where it's not practical to unroll
> > the values for N discrete types.
> >   
> > > > 	aggregator={val1}/2  
> > 
> > So the {val1} above would be substituted here, though an aggregation
> > factor of 1/2 is a head scratcher...
> >   
> > > > 	pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}  
> > 
> > I'm lost on this one though.  I think maybe it's indicating that it's
> > compatible with any of these, so do we need to list it?  Couldn't this
> > be handled by Sean's version proposal where the minor version
> > represents feature compatibility?  
> yes, it's indicating that it's compatible with any of these.
> Sean's version proposal may also work, but it would be painful for
> vendor driver to maintain the versions when multiple similar features
> are involved.

This is something vendor drivers need to consider when adding and
removing features.

> > > > 	interface_version={val3:int:2,3}  
> > 
> > What does this turn into in a few years, 2,7,12,23,75,96,...
> >   
> is a range better?

I was really trying to point out that sparseness becomes an issue if
the vendor driver is largely disconnected from how their feature
addition and deprecation affects migration support.  Thanks,

Alex

> > > > COMPATIBLE:
> > > > 	device_type=pci
> > > > 	device_id=8086591d
> > > > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > > > 	aggregator={val1}/2
> > > > 	pv_mode=""  #"" meaning empty, could be absent in a compatible device
> > > > 	interface_version=1    
> > 
> > Why can't this be represented within the previous compatible
> > description?
> >   
> actually it can be merged with the previous one :)
> But I guess there must be one that cannot merge, so put it as an
> example to demo multiple COMPATIBLE objects.
> 
> Thanks
> Yan
> 
> > > if you presented this information, the only way i could see to use it would be to
> > > extract the mdev_type name and interface_version and build a database table as follows:
> > > 
> > > source_mdev_type | source_version | target_mdev_type                | target_version
> > > i915-GVTg_V5_2   | 3              | i915-GVTg_V5_{val1:int:1,2,4,8} | {val3:int:2,3}
> > > i915-GVTg_V5_2   | 3              | i915-GVTg_V5_{val1:int:1,2,4,8} | 1
> > > 
> > > this would either require us to use a post-placement scheduler filter to introspect this database,
> > > or to transform the target_mdev_type and target_version column data into CUSTOM_* traits we apply to
> > > our placement resource providers, and we would have to perform multiple requests, one for each possible compatible
> > > alternative.  if the vm has multiple mdevs this is combinatorially problematic, as it is 1 query for each
> > > device * the number of possible compatible devices for that device.
> > > 
> > > in other words, if this is just opaque data we can't ever represent it efficiently in our placement service and
> > > have to fall back to an expensive post-placement scheduler filter based on the db table approach.
> > > 
> > > this also ignores the fact that at present the mdev_type cannot change during a migration, so a compatible
> > > device with a different mdev type would not be considered an acceptable choice in openstack. the way you select a host
> > > with a specific vgpu mdev type today is to apply a custom trait, which is CUSTOM_<mdev_type_goes_here>, to the vGPU
> > > resource provider, and then in the flavor you request 1 allocation of vGPU and require the CUSTOM_<mdev_type_goes_here>
> > > trait. so going from i915-GVTg_V5_2 to i915-GVTg_V5_{val1:int:1,2,4,8} would not currently be compatible with that
> > > workflow.
> > 
> > The latter would need to be parsed into:
> > 
> > i915-GVTg_V5_1
> > i915-GVTg_V5_2
> > i915-GVTg_V5_4
> > i915-GVTg_V5_8
> > 
> > There is also on the table, migration from physical devices to mdev
> > devices (or vice versa), which is not represented in these examples,
> > nor do I see how we'd represent it.  This is where I started exposing
> > the resulting PCI device from the mdev in my example so we could have
> > some commonality between devices, but the migration stream provider is
> > just as important as the type of device, we could have different host
> > drivers providing the same device with incompatible migration streams.
> > The mdev_type encompasses both the driver and device, but we wouldn't
> > have mdev_types for physical devices, per our current thinking.
> > 
> >   
> > > > #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID2/migration_compatible
> > > > SELF:
> > > > 	device_type=pci
> > > > 	device_id=8086591d
> > > > 	mdev_type=i915-GVTg_V5_4
> > > > 	aggregator=2
> > > > 	interface_version=1
> > > > COMPATIBLE: 
> > > > 	device_type=pci
> > > > 	device_id=8086591d
> > > > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > > > 	aggregator={val1}/2
> > > > 	interface_version=1    
> > > by the way, this is closer to yaml format than it is to json, but it does not align with any existing
> > > format i know of, so that just makes the representation needlessly hard to consume.
> > > if we are going to use a markup language, let's use a standard one like yaml, json or toml and not invent a new one.
> > > > 
> > > > Notes:
> > > > - A COMPATIBLE object is a line starting with COMPATIBLE.
> > > >   It specifies a list of compatible devices that are allowed to migrate
> > > >   in.
> > > >   The reason to allow multiple COMPATIBLE objects is that when it
> > > >   is hard to express a complex compatible logic in one COMPATIBLE
> > > >   object, a simple enumeration is still a fallback.
> > > >   in the above example, device UUID2 is in the compatible list of
> > > >   device UUID1, but device UUID1 is not in the compatible list of device
> > > >   UUID2, so device UUID2 is able to migrate to device UUID1, but device
> > > >   UUID1 is not able to migrate to device UUID2.
> > > > 
> > > > - fields under each object are of "and" relationship to each other,  meaning
> > > >   all fields of SELF object of a target device must be equal to corresponding
> > > >   fields of a COMPATIBLE object of source device, otherwise it is regarded as not
> > > >   compatible.
> > > > 
> > > > - each field, however, is able to specify multiple allowed values, using
> > > >   variables as explained below.
> > > > 
> > > > - variables are represented with {}, the first appearance of one variable
> > > >   specifies its type and allowed list. e.g.
> > > >   {val1:int:1,2,4,8} represents val1 whose type is integer and allowed
> > > >   values are 1, 2, 4, 8.
> > > > 
> > > > - vendors are able to specify which fields are within the comparing list
> > > >   and which fields are not. e.g. for physical VF migration, it may not
> > > >   choose mdev_type as a comparing field, and maybe use driver name instead.    
> > > this format might be useful to vendors, but from an orchestrator
> > > perspective i don't think this has value to us. likely we would not use
> > > this api if it was added, as it does not help us with scheduling.
> > > ideally, instead of declaring which other mdev types a device is
> > > compatible with (which could presumably change over time as new
> > > devices and firmwares are released), i would prefer to see a
> > > declarative, non vendor specific api that declares the feature set
> > > provided by each mdev_type, from which we can infer compatibility
> > > similar to cpu feature flags. for devices of the same mdev_type name,
> > > a declarative version string could additionally be used if
> > > required for additional compatibility checks.
> > 
> > "non vendor specific api that declares the feature set", aren't
> > features generally vendor specific?  What we're trying to describe is,
> > by its very nature, vendor specific.  We don't have an ISO body
> > defining a graphics adapter and enumerating features for that adapter.
> > I think what we have is mdev_types.  Each type is supposed to define a
> > specific software interface, perhaps even more so than is done by a PCI
> > vendor:device ID.  Maybe that mdev_type needs to be abstracted as
> > something more like a vendor signature, such that a physical device
> > could provide or accept a vendor signature that's compatible with an
> > mdev device.  For example, a physically assigned Intel GPU might expose
> > a migration signature of i915-GVTg_v5_8 if it were designed to be
> > compatible with that mdev_type.  Thanks,
> > 
> > Alex
> >   
> 



* Re: device compatibility interface for live migration with assigned devices
  2020-07-30 17:29                     ` Alex Williamson
@ 2020-08-04  8:37                       ` Yan Zhao
  2020-08-05  9:44                         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 48+ messages in thread
From: Yan Zhao @ 2020-08-04  8:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Sean Mooney, kvm, libvir-list, Jason Wang, qemu-devel, kwankhede,
	eauger, xin-ran.wang, corbet, openstack-discuss, shaohe.feng,
	kevin.tian, eskultet, jian-feng.ding, dgilbert, zhenyuw,
	hejie.xu, bao.yumeng, intel-gvt-dev, berrange, cohuck, dinechin,
	devel

> > yes, include a device_api field is better.
> > for mdev, "device_type=vfio-mdev", is it right?
> 
> No, vfio-mdev is not a device API, it's the driver that attaches to the
> mdev bus device to expose it through vfio.  The device_api exposes the
> actual interface of the vfio device, it's also vfio-pci for typical
> mdev devices found on x86, but may be vfio-ccw, vfio-ap, etc...  See
> VFIO_DEVICE_API_PCI_STRING and friends.
> 
ok. got it.

> > > > > 	device_id=8086591d  
> > > 
> > > Is device_id interpreted relative to device_type?  How does this
> > > > relate to mdev_type?  If we have an mdev_type, doesn't that fully
> > > > define the software API?
> > >   
> > it's parent pci id for mdev actually.
>
> If we need to specify the parent PCI ID then something is fundamentally
> wrong with the mdev_type.  The mdev_type should define a unique,
> software compatible interface, regardless of the parent device IDs.  If
> an i915-GVTg_V5_2 means different things based on the parent device IDs,
> then different mdev_types should be reported for those parent
> devices.
>
hmm, then do we allow vendor specific fields?
or is it a must that a vendor specific field should have corresponding
vendor attribute?

another thing is that the definition of mdev_type in GVT currently only
corresponds to vGPU computing ability,
e.g. i915-GVTg_V5_2 is 1/2 of a gen9 IGD, and i915-GVTg_V4_2 is 1/2 of a
gen8 IGD.
It is too coarse-grained for live migration compatibility.

Do you think we need to update GVT's definition of mdev_type?
And is there any guidance on how to define an mdev_type?

> > > > > 	mdev_type=i915-GVTg_V5_2  
> > > 
> > > And how are non-mdev devices represented?
> > >   
> > non-mdev can opt to not include this field, or as you said below, a
> > vendor signature. 
> > 
> > > > > 	aggregator=1
> > > > > 	pv_mode="none+ppgtt+context"  
> > > 
> > > These are meaningless vendor specific matches afaict.
> > >   
> > yes, pv_mode and aggregator are vendor specific fields.
> > but they are important to decide whether two devices are compatible.
> > pv_mode means whether a vGPU supports guest paravirtualized api.
> > "none+ppgtt+context" means guest can not use pv, or use ppgtt mode pv or
> > use context mode pv.
> > 
> > > > > 	interface_version=3  
> > > 
> > > Not much granularity here, I prefer Sean's previous
> > > <major>.<minor>[.bugfix] scheme.
> > >   
> > yes, <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if
> > it works for a complicated scenario.
> > e.g for pv_mode,
> > (1) initially,  pv_mode is not supported, so it's pv_mode=none, it's 0.0.0,
> > (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0,
> > indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice versa.
> > (3) later, pv_mode=context is also supported,
> > pv_mode="none+ppgtt+context", so it's 0.2.0.
> > 
> > But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to
> > name its version? "none+ppgtt" (0.1.0) is not compatible to
> > "none+context", but "none+ppgtt+context" (0.2.0) is compatible to
> > "none+context".
> 
> If pv_mode=ppgtt is removed, then the compatible versions would be
> 0.0.0 or 1.0.0, ie. the major version would be incremented due to
> feature removal.
>  
> > Maintaining such a scheme is painful for the vendor driver.
> 
> Migration compatibility is painful, there's no way around that.  I
> think the version scheme is an attempt to push some of that low level
> burden on the vendor driver, otherwise the management tools need to
> work on an ever growing matrix of vendor specific features which is
> going to become unwieldy and is largely meaningless outside of the
> vendor driver.  Instead, the vendor driver can make strategic decisions
> about where to continue to maintain a support burden and make explicit
> decisions to maintain or break compatibility.  The version scheme is a
> simplification and abstraction of vendor driver features in order to
> create a small, logical compatibility matrix.  Compromises necessarily
> need to be made for that to occur.
>
ok. got it.

> > > > > COMPATIBLE:
> > > > > 	device_type=pci
> > > > > 	device_id=8086591d
> > > > > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}    
> > > > this mixed notation will be hard to parse so i would avoid that.  
> > > 
> > > Some background, Intel has been proposing aggregation as a solution to
> > > how we scale mdev devices when hardware exposes large numbers of
> > > assignable objects that can be composed in essentially arbitrary ways.
> > > So for instance, if we have a workqueue (wq), we might have an mdev
> > > type for 1wq, 2wq, 3wq,... Nwq.  It's not really practical to expose a
> > > discrete mdev type for each of those, so they want to define a base
> > > type which is composable to other types via this aggregation.  This is
> > > what this substitution and tagging is attempting to accomplish.  So
> > > imagine this set of values for cases where it's not practical to unroll
> > > the values for N discrete types.
> > >   
> > > > > 	aggregator={val1}/2  
> > > 
> > > So the {val1} above would be substituted here, though an aggregation
> > > factor of 1/2 is a head scratcher...
> > >   
> > > > > 	pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}  
> > > 
> > > I'm lost on this one though.  I think maybe it's indicating that it's
> > > compatible with any of these, so do we need to list it?  Couldn't this
> > > be handled by Sean's version proposal where the minor version
> > > represents feature compatibility?  
> > yes, it's indicating that it's compatible with any of these.
> > Sean's version proposal may also work, but it would be painful for
> > vendor driver to maintain the versions when multiple similar features
> > are involved.
> 
> This is something vendor drivers need to consider when adding and
> removing features.
> 
> > > > > 	interface_version={val3:int:2,3}  
> > > 
> > > What does this turn into in a few years, 2,7,12,23,75,96,...
> > >   
> > is a range better?
> 
> I was really trying to point out that sparseness becomes an issue if
> the vendor driver is largely disconnected from how their feature
> addition and deprecation affects migration support.  Thanks,
>
ok. we'll use the x.y.z scheme then.

Thanks
Yan


* Re: device compatibility interface for live migration with assigned devices
  2020-07-29  8:05             ` Yan Zhao
  2020-07-29 11:28               ` Sean Mooney
@ 2020-08-04 16:35               ` Cornelia Huck
  2020-08-05  2:22                 ` Jason Wang
  1 sibling, 1 reply; 48+ messages in thread
From: Cornelia Huck @ 2020-08-04 16:35 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Alex Williamson, kvm, libvir-list, Jason Wang, qemu-devel,
	kwankhede, eauger, xin-ran.wang, corbet, openstack-discuss,
	shaohe.feng, kevin.tian, eskultet, jian-feng.ding, dgilbert,
	zhenyuw, hejie.xu, bao.yumeng, smooney, intel-gvt-dev, berrange,
	dinechin, devel

[sorry about not chiming in earlier]

On Wed, 29 Jul 2020 16:05:03 +0800
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:

(...)

> > Based on the feedback we've received, the previously proposed interface
> > is not viable.  I think there's agreement that the user needs to be
> > able to parse and interpret the version information.  Using json seems
> > viable, but I don't know if it's the best option.  Is there any
> > precedent of markup strings returned via sysfs we could follow?  

I don't think encoding complex information in a sysfs file is a viable
approach. Quoting Documentation/filesystems/sysfs.rst:

"Attributes should be ASCII text files, preferably with only one value
per file. It is noted that it may not be efficient to contain only one
value per file, so it is socially acceptable to express an array of
values of the same type.

Mixing types, expressing multiple lines of data, and doing fancy
formatting of data is heavily frowned upon."

Even though this is an older file, I think these restrictions still
apply.

> I found some examples of using formatted string under /sys, mostly under
> tracing. maybe we can do a similar implementation.
> 
> #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format

Note that this is *not* sysfs (anything under debug/ follows different
rules anyway!)

> 
> name: kvm_mmio
> ID: 32
> format:
>         field:unsigned short common_type;       offset:0;       size:2; signed:0;
>         field:unsigned char common_flags;       offset:2;       size:1; signed:0;
>         field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
>         field:int common_pid;   offset:4;       size:4; signed:1;
> 
>         field:u32 type; offset:8;       size:4; signed:0;
>         field:u32 len;  offset:12;      size:4; signed:0;
>         field:u64 gpa;  offset:16;      size:8; signed:0;
>         field:u64 val;  offset:24;      size:8; signed:0;
> 
> print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" }, { 2, "write" }), REC->len, REC->gpa, REC->val
> 
> 
> #cat /sys/devices/pci0000:00/0000:00:02.0/uevent

'uevent' can probably be considered a special case, I would not really
want to copy it.

> DRIVER=vfio-pci
> PCI_CLASS=30000
> PCI_ID=8086:591D
> PCI_SUBSYS_ID=8086:2212
> PCI_SLOT_NAME=0000:00:02.0
> MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00
> 

(...)

> what about a migration_compatible attribute under device node like
> below?
> 
> #cat /sys/bus/pci/devices/0000\:00\:02.0/UUID1/migration_compatible
> SELF:
> 	device_type=pci
> 	device_id=8086591d
> 	mdev_type=i915-GVTg_V5_2
> 	aggregator=1
> 	pv_mode="none+ppgtt+context"
> 	interface_version=3
> COMPATIBLE:
> 	device_type=pci
> 	device_id=8086591d
> 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> 	aggregator={val1}/2
> 	pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"} 
> 	interface_version={val3:int:2,3}
> COMPATIBLE:
> 	device_type=pci
> 	device_id=8086591d
> 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> 	aggregator={val1}/2
> 	pv_mode=""  #"" meaning empty, could be absent in a compatible device
> 	interface_version=1

I'd consider anything of a comparable complexity to be a big no-no. If
anything, this needs to be split into individual files (with many of
them being vendor driver specific anyway.)

I think we can list compatible versions in a range/list format, though.
Something like

cat interface_version 
2.1.3

cat interface_version_compatible
2.0.2-2.0.4,2.1.0-

(indicating that versions 2.0.{2,3,4} and all versions after 2.1.0 are
compatible, considering versions <2 and >2 incompatible by default)

Possible compatibility between different mdev types feels a bit odd to
me, and should not be included by default (only if it makes sense for a
particular vendor driver.)
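
A minimal sketch of how userspace might evaluate such a range/list string,
written in Python. It assumes ranges are inclusive, that an entry ending in
"-" is open-ended within the same major version, and that a bare version
(no "-") matches exactly; these details and the function names are
assumptions for illustration, not a defined format.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn "2.1.3" into (2, 1, 3) so versions compare as tuples."""
    return tuple(int(p) for p in text.strip().split("."))


def is_compatible(candidate: str, own_version: str, compat_list: str) -> bool:
    """Check candidate against e.g. "2.0.2-2.0.4,2.1.0-"."""
    cand = parse_version(candidate)
    # Versions with a different major are incompatible by default.
    if cand[0] != parse_version(own_version)[0]:
        return False
    for entry in compat_list.strip().split(","):
        entry = entry.strip()
        if not entry:
            continue
        low, dash, high = entry.partition("-")
        lo = parse_version(low)
        if not dash:
            # Assumed: a bare version is an exact match.
            if cand == lo:
                return True
        elif not high:
            # Open-ended range: everything from lo upwards.
            if cand >= lo:
                return True
        elif lo <= cand <= parse_version(high):
            return True
    return False
```

With interface_version 2.1.3 and the list above, 2.0.3 and 2.4.7 would be
accepted while 2.0.5 and 3.0.0 would be rejected.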



* Re: device compatibility interface for live migration with assigned devices
  2020-08-05  2:22                 ` Jason Wang
@ 2020-08-05  2:16                   ` Yan Zhao
  2020-08-05  2:41                     ` Jason Wang
  0 siblings, 1 reply; 48+ messages in thread
From: Yan Zhao @ 2020-08-05  2:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: Cornelia Huck, Alex Williamson, kvm, libvir-list, qemu-devel,
	kwankhede, eauger, xin-ran.wang, corbet, openstack-discuss,
	shaohe.feng, kevin.tian, eskultet, jian-feng.ding, dgilbert,
	zhenyuw, hejie.xu, bao.yumeng, smooney, intel-gvt-dev, berrange,
	dinechin, devel

On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> 
> On 2020/8/5 上午12:35, Cornelia Huck wrote:
> > [sorry about not chiming in earlier]
> > 
> > On Wed, 29 Jul 2020 16:05:03 +0800
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > 
> > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > (...)
> > 
> > > > Based on the feedback we've received, the previously proposed interface
> > > > is not viable.  I think there's agreement that the user needs to be
> > > > able to parse and interpret the version information.  Using json seems
> > > > viable, but I don't know if it's the best option.  Is there any
> > > > precedent of markup strings returned via sysfs we could follow?
> > I don't think encoding complex information in a sysfs file is a viable
> > approach. Quoting Documentation/filesystems/sysfs.rst:
> > 
> > "Attributes should be ASCII text files, preferably with only one value
> > per file. It is noted that it may not be efficient to contain only one
> > value per file, so it is socially acceptable to express an array of
> > values of the same type.
> > Mixing types, expressing multiple lines of data, and doing fancy
> > formatting of data is heavily frowned upon."
> > 
> > Even though this is an older file, I think these restrictions still
> > apply.
> 
> 
> +1, that's another reason why devlink(netlink) is better.
>
hi Jason,
do you have any materials or sample code about devlink, so we can have a good
study of it?
I found some kernel docs about it but my preliminary study didn't show me the
advantage of devlink.

Thanks
Yan


* Re: device compatibility interface for live migration with assigned devices
  2020-08-04 16:35               ` Cornelia Huck
@ 2020-08-05  2:22                 ` Jason Wang
  2020-08-05  2:16                   ` Yan Zhao
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Wang @ 2020-08-05  2:22 UTC (permalink / raw)
  To: Cornelia Huck, Yan Zhao
  Cc: Alex Williamson, kvm, libvir-list, qemu-devel, kwankhede, eauger,
	xin-ran.wang, corbet, openstack-discuss, shaohe.feng, kevin.tian,
	eskultet, jian-feng.ding, dgilbert, zhenyuw, hejie.xu,
	bao.yumeng, smooney, intel-gvt-dev, berrange, dinechin, devel


On 2020/8/5 上午12:35, Cornelia Huck wrote:
> [sorry about not chiming in earlier]
>
> On Wed, 29 Jul 2020 16:05:03 +0800
> Yan Zhao <yan.y.zhao@intel.com> wrote:
>
>> On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> (...)
>
>>> Based on the feedback we've received, the previously proposed interface
>>> is not viable.  I think there's agreement that the user needs to be
>>> able to parse and interpret the version information.  Using json seems
>>> viable, but I don't know if it's the best option.  Is there any
>>> precedent of markup strings returned via sysfs we could follow?
> I don't think encoding complex information in a sysfs file is a viable
> approach. Quoting Documentation/filesystems/sysfs.rst:
>
> "Attributes should be ASCII text files, preferably with only one value
> per file. It is noted that it may not be efficient to contain only one
> value per file, so it is socially acceptable to express an array of
> values of the same type.
>                                                                                   
> Mixing types, expressing multiple lines of data, and doing fancy
> formatting of data is heavily frowned upon."
>
> Even though this is an older file, I think these restrictions still
> apply.


+1, that's another reason why devlink(netlink) is better.

Thanks



* Re: device compatibility interface for live migration with assigned devices
  2020-08-05  2:16                   ` Yan Zhao
@ 2020-08-05  2:41                     ` Jason Wang
  2020-08-05  7:56                       ` Jiri Pirko
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Wang @ 2020-08-05  2:41 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Cornelia Huck, Alex Williamson, kvm, libvir-list, qemu-devel,
	kwankhede, eauger, xin-ran.wang, corbet, openstack-discuss,
	shaohe.feng, kevin.tian, eskultet, jian-feng.ding, dgilbert,
	zhenyuw, hejie.xu, bao.yumeng, smooney, intel-gvt-dev, berrange,
	dinechin, devel, Jiri Pirko, Parav Pandit


On 2020/8/5 上午10:16, Yan Zhao wrote:
> On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
>> On 2020/8/5 上午12:35, Cornelia Huck wrote:
>>> [sorry about not chiming in earlier]
>>>
>>> On Wed, 29 Jul 2020 16:05:03 +0800
>>> Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>
>>>> On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
>>> (...)
>>>
>>>>> Based on the feedback we've received, the previously proposed interface
>>>>> is not viable.  I think there's agreement that the user needs to be
>>>>> able to parse and interpret the version information.  Using json seems
>>>>> viable, but I don't know if it's the best option.  Is there any
>>>>> precedent of markup strings returned via sysfs we could follow?
>>> I don't think encoding complex information in a sysfs file is a viable
>>> approach. Quoting Documentation/filesystems/sysfs.rst:
>>>
>>> "Attributes should be ASCII text files, preferably with only one value
>>> per file. It is noted that it may not be efficient to contain only one
>>> value per file, so it is socially acceptable to express an array of
>>> values of the same type.
>>> Mixing types, expressing multiple lines of data, and doing fancy
>>> formatting of data is heavily frowned upon."
>>>
>>> Even though this is an older file, I think these restrictions still
>>> apply.
>>
>> +1, that's another reason why devlink(netlink) is better.
>>
> hi Jason,
> do you have any materials or sample code about devlink, so we can have a good
> study of it?
> I found some kernel docs about it but my preliminary study didn't show me the
> advantage of devlink.


CC Jiri and Parav for a better answer for this.

My understanding is that the following advantages are obvious (as I 
replied in another thread):

- existing users (NIC, crypto, SCSI, ib), mature and stable
- much better error reporting (ext_ack rather than a string or errno)
- namespace aware
- not coupled with kobject

Thanks


>
> Thanks
> Yan
>



* Re: device compatibility interface for live migration with assigned devices
  2020-08-05  2:41                     ` Jason Wang
@ 2020-08-05  7:56                       ` Jiri Pirko
  2020-08-05  8:02                         ` Jason Wang
  0 siblings, 1 reply; 48+ messages in thread
From: Jiri Pirko @ 2020-08-05  7:56 UTC (permalink / raw)
  To: Jason Wang
  Cc: Yan Zhao, Cornelia Huck, Alex Williamson, kvm, libvir-list,
	qemu-devel, kwankhede, eauger, xin-ran.wang, corbet,
	openstack-discuss, shaohe.feng, kevin.tian, eskultet,
	jian-feng.ding, dgilbert, zhenyuw, hejie.xu, bao.yumeng, smooney,
	intel-gvt-dev, berrange, dinechin, devel, Parav Pandit

Wed, Aug 05, 2020 at 04:41:54AM CEST, jasowang@redhat.com wrote:
>
>On 2020/8/5 10:16 AM, Yan Zhao wrote:
>> On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
>> > On 2020/8/5 12:35 AM, Cornelia Huck wrote:
>> > > [sorry about not chiming in earlier]
>> > > 
>> > > On Wed, 29 Jul 2020 16:05:03 +0800
>> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
>> > > 
>> > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
>> > > (...)
>> > > 
>> > > > > Based on the feedback we've received, the previously proposed interface
>> > > > > is not viable.  I think there's agreement that the user needs to be
>> > > > > able to parse and interpret the version information.  Using json seems
>> > > > > viable, but I don't know if it's the best option.  Is there any
>> > > > > precedent of markup strings returned via sysfs we could follow?
>> > > I don't think encoding complex information in a sysfs file is a viable
>> > > approach. Quoting Documentation/filesystems/sysfs.rst:
>> > > 
>> > > "Attributes should be ASCII text files, preferably with only one value
>> > > per file. It is noted that it may not be efficient to contain only one
>> > > value per file, so it is socially acceptable to express an array of
>> > > values of the same type.
>> > > Mixing types, expressing multiple lines of data, and doing fancy
>> > > formatting of data is heavily frowned upon."
>> > > 
>> > > Even though this is an older file, I think these restrictions still
>> > > apply.
>> > 
>> > +1, that's another reason why devlink(netlink) is better.
>> > 
>> hi Jason,
>> do you have any materials or sample code about devlink, so we can have a good
>> study of it?
>> I found some kernel docs about it but my preliminary study didn't show me the
>> advantage of devlink.
>
>
>CC Jiri and Parav for a better answer for this.
>
>My understanding is that the following advantages are obvious (as I replied
>in another thread):
>
>- existing users (NIC, crypto, SCSI, ib), mature and stable
>- much better error reporting (ext_ack other than string or errno)
>- namespace aware
>- do not couple with kobject

Jason, what is your use case?



>
>Thanks
>
>
>> 
>> Thanks
>> Yan
>> 
>


* Re: device compatibility interface for live migration with assigned devices
  2020-08-05  7:56                       ` Jiri Pirko
@ 2020-08-05  8:02                         ` Jason Wang
  2020-08-05  9:33                           ` Yan Zhao
  0 siblings, 1 reply; 48+ messages in thread
From: Jason Wang @ 2020-08-05  8:02 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Yan Zhao, Cornelia Huck, Alex Williamson, kvm, libvir-list,
	qemu-devel, kwankhede, eauger, xin-ran.wang, corbet,
	openstack-discuss, shaohe.feng, kevin.tian, eskultet,
	jian-feng.ding, dgilbert, zhenyuw, hejie.xu, bao.yumeng, smooney,
	intel-gvt-dev, berrange, dinechin, devel, Parav Pandit


On 2020/8/5 3:56 PM, Jiri Pirko wrote:
> Wed, Aug 05, 2020 at 04:41:54AM CEST, jasowang@redhat.com wrote:
>> On 2020/8/5 10:16 AM, Yan Zhao wrote:
>>> On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
>>>> On 2020/8/5 12:35 AM, Cornelia Huck wrote:
>>>>> [sorry about not chiming in earlier]
>>>>>
>>>>> On Wed, 29 Jul 2020 16:05:03 +0800
>>>>> Yan Zhao <yan.y.zhao@intel.com> wrote:
>>>>>
>>>>>> On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
>>>>> (...)
>>>>>
>>>>>>> Based on the feedback we've received, the previously proposed interface
>>>>>>> is not viable.  I think there's agreement that the user needs to be
>>>>>>> able to parse and interpret the version information.  Using json seems
>>>>>>> viable, but I don't know if it's the best option.  Is there any
>>>>>>> precedent of markup strings returned via sysfs we could follow?
>>>>> I don't think encoding complex information in a sysfs file is a viable
>>>>> approach. Quoting Documentation/filesystems/sysfs.rst:
>>>>>
>>>>> "Attributes should be ASCII text files, preferably with only one value
>>>>> per file. It is noted that it may not be efficient to contain only one
>>>>> value per file, so it is socially acceptable to express an array of
>>>>> values of the same type.
>>>>> Mixing types, expressing multiple lines of data, and doing fancy
>>>>> formatting of data is heavily frowned upon."
>>>>>
>>>>> Even though this is an older file, I think these restrictions still
>>>>> apply.
>>>> +1, that's another reason why devlink(netlink) is better.
>>>>
>>> hi Jason,
>>> do you have any materials or sample code about devlink, so we can have a good
>>> study of it?
>>> I found some kernel docs about it but my preliminary study didn't show me the
>>> advantage of devlink.
>>
>> CC Jiri and Parav for a better answer for this.
>>
>> My understanding is that the following advantages are obvious (as I replied
>> in another thread):
>>
>> - existing users (NIC, crypto, SCSI, ib), mature and stable
>> - much better error reporting (ext_ack other than string or errno)
>> - namespace aware
>> - do not couple with kobject
> Jason, what is your use case?


I think the use case is to report device compatibility for live
migration. Yan first proposed a simple sysfs-based migration version,
but it does not look sufficient, and something based on JSON is being
discussed.

Yan, can you help to summarize the discussion so far for Jiri as a 
reference?

Thanks


>
>
>
>> Thanks
>>
>>
>>> Thanks
>>> Yan
>>>



* Re: device compatibility interface for live migration with assigned devices
  2020-08-05  8:02                         ` Jason Wang
@ 2020-08-05  9:33                           ` Yan Zhao
  2020-08-05 10:53                             ` Jiri Pirko
  0 siblings, 1 reply; 48+ messages in thread
From: Yan Zhao @ 2020-08-05  9:33 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jiri Pirko, Cornelia Huck, Alex Williamson, kvm, libvir-list,
	qemu-devel, kwankhede, eauger, xin-ran.wang, corbet,
	openstack-discuss, shaohe.feng, kevin.tian, eskultet,
	jian-feng.ding, dgilbert, zhenyuw, hejie.xu, bao.yumeng, smooney,
	intel-gvt-dev, berrange, dinechin, devel, Parav Pandit

On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
> 
> On 2020/8/5 3:56 PM, Jiri Pirko wrote:
> > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasowang@redhat.com wrote:
> > > On 2020/8/5 10:16 AM, Yan Zhao wrote:
> > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> > > > > On 2020/8/5 12:35 AM, Cornelia Huck wrote:
> > > > > > [sorry about not chiming in earlier]
> > > > > > 
> > > > > > On Wed, 29 Jul 2020 16:05:03 +0800
> > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > 
> > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > > > > (...)
> > > > > > 
> > > > > > > > Based on the feedback we've received, the previously proposed interface
> > > > > > > > is not viable.  I think there's agreement that the user needs to be
> > > > > > > > able to parse and interpret the version information.  Using json seems
> > > > > > > > viable, but I don't know if it's the best option.  Is there any
> > > > > > > > precedent of markup strings returned via sysfs we could follow?
> > > > > > I don't think encoding complex information in a sysfs file is a viable
> > > > > > approach. Quoting Documentation/filesystems/sysfs.rst:
> > > > > > 
> > > > > > "Attributes should be ASCII text files, preferably with only one value
> > > > > > per file. It is noted that it may not be efficient to contain only one
> > > > > > value per file, so it is socially acceptable to express an array of
> > > > > > values of the same type.
> > > > > > Mixing types, expressing multiple lines of data, and doing fancy
> > > > > > formatting of data is heavily frowned upon."
> > > > > > 
> > > > > > Even though this is an older file, I think these restrictions still
> > > > > > apply.
> > > > > +1, that's another reason why devlink(netlink) is better.
> > > > > 
> > > > hi Jason,
> > > > do you have any materials or sample code about devlink, so we can have a good
> > > > study of it?
> > > > I found some kernel docs about it but my preliminary study didn't show me the
> > > > advantage of devlink.
> > > 
> > > CC Jiri and Parav for a better answer for this.
> > > 
> > > My understanding is that the following advantages are obvious (as I replied
> > > in another thread):
> > > 
> > > - existing users (NIC, crypto, SCSI, ib), mature and stable
> > > - much better error reporting (ext_ack other than string or errno)
> > > - namespace aware
> > > - do not couple with kobject
> > Jason, what is your use case?
> 
> 
> I think the use case is to report device compatibility for live migration.
> Yan proposed a simple sysfs based migration version first, but it looks not
> sufficient and something based on JSON is discussed.
> 
> Yan, can you help to summarize the discussion so far for Jiri as a
> reference?
> 
yes.
we are currently defining a device live migration compatibility
interface so that user space such as openstack and libvirt knows
which two devices are live migration compatible.
currently the devices include mdev (a kernel emulated virtual device)
and physical devices (e.g. a VF of a PCI SRIOV device).

the attributes we want user space to compare include
common attributes:
    device_api: vfio-pci, vfio-ccw...
    mdev_type: mdev type of mdev or similar signature for physical device
               It specifies a device's hardware capability. e.g.
	       i915-GVTg_V5_4 means it's of 1/4 of a gen9 Intel graphics
	       device.
    software_version: device driver's version.
               in <major>.<minor>[.bugfix] scheme, where there is no
	       compatibility across major versions, minor versions have
	       forward compatibility (ex. 1-> 2 is ok, 2 -> 1 is not) and
	       bugfix version number indicates some degree of internal
	       improvement that is not visible to the user in terms of
	       features or compatibility,

vendor specific attributes: each vendor may define different attributes
   device_id : device id of a physical device or of an mdev's parent pci device.
               it could be equal to pci id for pci devices
   aggregator: used together with mdev_type. e.g. aggregator=2 together
               with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel
	       graphics device.
   remote_url: for a local NVMe VF, it may be configured with a remote
               url of a remote storage and all data is stored in the
	       remote side specified by the remote url.
   ...
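
The software_version rule described above (no compatibility across major
versions, forward-only compatibility across minor versions, bugfix
ignored) could be checked in user space roughly as follows. This is only
a minimal Python sketch of the rule as stated in this mail; the function
names parse_version and version_compatible are hypothetical and not part
of any proposed interface.

```python
def parse_version(text):
    """Split '<major>.<minor>[.bugfix]' into integers; bugfix defaults to 0."""
    parts = [int(p) for p in text.strip().split(".")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected <major>.<minor>[.bugfix], got {text!r}")
    major, minor = parts[0], parts[1]
    bugfix = parts[2] if len(parts) == 3 else 0
    return major, minor, bugfix


def version_compatible(src, dst):
    """Return True if a device at version src may migrate to one at dst.

    Majors must match, the target minor must not be older than the
    source minor, and the bugfix number is ignored.
    """
    s_major, s_minor, _ = parse_version(src)
    d_major, d_minor, _ = parse_version(dst)
    return s_major == d_major and s_minor <= d_minor
```

With this rule a 1.1.0 source may move to a 1.2.0 target, but not the
other way round, and never to a 2.x target.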

Comparing those attributes by user space alone is not an easy job, as it
can't simply assume an equal relationship between source attributes and
target attributes. e.g.
for a source device of mdev_type=i915-GVTg_V5_4,aggregator=2, (1/2 of
gen9), it actually could find a compatible device of
mdev_type=i915-GVTg_V5_8,aggregator=4 (also 1/2 of gen9),
if mdev_type of i915-GVTg_V5_4 is not available in the target machine.
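
To make the arithmetic in this example concrete, here is a small Python
sketch. It assumes, based only on the examples in this mail, that the
trailing "_<N>" of the mdev type name means 1/N of the device and that
aggregator multiplies that share. gpu_share is a hypothetical helper.

```python
from fractions import Fraction


def gpu_share(mdev_type, aggregator):
    """Return (base type, share of the physical device).

    Assumption: 'i915-GVTg_V5_4' means 1/4 of the device, so the share
    is aggregator * 1/N where N is the trailing number of the type name.
    """
    base, _, denom = mdev_type.rpartition("_")
    return base, Fraction(aggregator, int(denom))


src = gpu_share("i915-GVTg_V5_4", 2)   # 2 * 1/4
dst = gpu_share("i915-GVTg_V5_8", 4)   # 4 * 1/8
same_share = src == dst                # both are 1/2 of the same base type
```

A plain string comparison of mdev_type and aggregator would reject this
pair, which is why user space cannot rely on simple equality.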

So, in our current proposal, we want to create two sysfs attributes
under a device sysfs node.
/sys/<path to device>/migration/self
/sys/<path to device>/migration/compatible

#cat /sys/<path to device>/migration/self
device_type=vfio_pci
mdev_type=i915-GVTg_V5_4
device_id=8086591d
aggregator=2
software_version=1.0.0

#cat /sys/<path to device>/migration/compatible
device_type=vfio_pci
mdev_type=i915-GVTg_V5_{val1:int:2,4,8}
device_id=8086591d
aggregator={val1}/2
software_version=1.0.0

The /sys/<path to device>/migration/self specifies self attributes of
a device.
The /sys/<path to device>/migration/compatible specifies the list of
compatible devices of a device. as in the example, compatible devices
could have
	device_type == vfio_pci &&
	device_id == 8086591d   &&
	software_version == 1.0.0 &&
        (
	(mdev_type of i915-GVTg_V5_2 && aggregator==1) ||
	(mdev_type of i915-GVTg_V5_4 && aggregator==2) ||
	(mdev_type of i915-GVTg_V5_8 && aggregator==4)
	)

by checking whether a target device is in the compatible list of the
source device, user space can know whether the two devices are live
migration compatible.
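
As a rough illustration of what user space would have to do with the
format above, here is a minimal Python sketch that expands the
compatible template and checks whether a target's self attributes
appear in the expanded list. The template syntax is not formally
specified in this thread, so the handling of {name:type:v1,v2,...},
{name} and the "/2" division is an assumption drawn from the single
example; all function names are hypothetical.

```python
import itertools
import re

DEF_RE = re.compile(r"\{(\w+):(\w+):([^}]*)\}")   # e.g. {val1:int:2,4,8}
REF_RE = re.compile(r"\{(\w+)\}")                 # e.g. {val1}
DIV_RE = re.compile(r"^(\d+)/(\d+)$")             # e.g. "4/2" after substitution


def parse_attrs(text):
    """Parse 'key=value' lines into a dict."""
    attrs = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


def expand_compatible(text):
    """Yield one concrete attribute dict per combination of variable values."""
    template = parse_attrs(text)
    choices = {}
    for value in template.values():
        for name, _type, options in DEF_RE.findall(value):
            choices[name] = options.split(",")
    names = sorted(choices)
    for combo in itertools.product(*(choices[n] for n in names)):
        binding = dict(zip(names, combo))
        concrete = {}
        for key, value in template.items():
            # Substitute the definition site first, then plain references.
            value = DEF_RE.sub(lambda m: binding[m.group(1)], value)
            value = REF_RE.sub(lambda m: binding[m.group(1)], value)
            div = DIV_RE.match(value)
            if div:  # assumed meaning of "{val1}/2": integer division
                value = str(int(div.group(1)) // int(div.group(2)))
            concrete[key] = value
        yield concrete


def is_compatible(compatible_text, target_self_text):
    """True if the target's self attributes match one expanded entry."""
    target = parse_attrs(target_self_text)
    return any(entry == target for entry in expand_compatible(compatible_text))
```

Even this small sketch needs a template parser and an expression
evaluator, which illustrates the concern about putting a complex format
into a sysfs attribute.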

Additional notes:
1)software_version in the compatible list may not be necessary as it
already has a major.minor.bugfix scheme.
2)for vendor attribute like remote_url, it may not be statically
assigned and could be changed with a device interface.

So, as Cornelia pointed out that it's not good to use a complex format in
a sysfs attribute, we'd like to know whether there are other good ways to
serve our use case, e.g. splitting a single attribute into multiple simple
sysfs attributes as Cornelia suggested, or devlink, which Jason has
strongly recommended.

Thanks
Yan





* Re: device compatibility interface for live migration with assigned devices
  2020-08-04  8:37                       ` Yan Zhao
@ 2020-08-05  9:44                         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 48+ messages in thread
From: Dr. David Alan Gilbert @ 2020-08-05  9:44 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Alex Williamson, Sean Mooney, kvm, libvir-list, Jason Wang,
	qemu-devel, kwankhede, eauger, xin-ran.wang, corbet,
	openstack-discuss, shaohe.feng, kevin.tian, eskultet,
	jian-feng.ding, zhenyuw, hejie.xu, bao.yumeng, intel-gvt-dev,
	berrange, cohuck, dinechin, devel

* Yan Zhao (yan.y.zhao@intel.com) wrote:
> > > yes, include a device_api field is better.
> > > for mdev, "device_type=vfio-mdev", is it right?
> > 
> > No, vfio-mdev is not a device API, it's the driver that attaches to the
> > mdev bus device to expose it through vfio.  The device_api exposes the
> > actual interface of the vfio device, it's also vfio-pci for typical
> > mdev devices found on x86, but may be vfio-ccw, vfio-ap, etc...  See
> > VFIO_DEVICE_API_PCI_STRING and friends.
> > 
> ok. got it.
> 
> > > > > > 	device_id=8086591d  
> > > > 
> > > > Is device_id interpreted relative to device_type?  How does this
> > > > relate to mdev_type?  If we have an mdev_type, doesn't that fully
> > > > defined the software API?
> > > >   
> > > it's parent pci id for mdev actually.
> >
> > If we need to specify the parent PCI ID then something is fundamentally
> > wrong with the mdev_type.  The mdev_type should define a unique,
> > software compatible interface, regardless of the parent device IDs.  If
> > a i915-GVTg_V5_2 means different things based on the parent device IDs,
> > then then different mdev_types should be reported for those parent
> > devices.
> >
> hmm, then do we allow vendor specific fields?
> or is it a must that a vendor specific field should have corresponding
> vendor attribute?
> 
> another thing is that the definition of mdev_type in GVT only corresponds
> to vGPU computing ability currently,
> e.g. i915-GVTg_V5_2, is 1/2 of a gen9 IGD, i915-GVTg_V4_2 is 1/2 of a
> gen8 IGD.
> It is too coarse-grained to live migration compatibility.

Can you explain why that's too coarse?

Is this because it's too specific (i.e. that a i915-GVTg_V4_2 could be
migrated to a newer device?), or that it's too specific on the exact
sizings (i.e. that there may be multiple different sizes of a gen9)?

Dave

> Do you think we need to update GVT's definition of mdev_type?
> And is there any guide in mdev_type definition?
> 
> > > > > > 	mdev_type=i915-GVTg_V5_2  
> > > > 
> > > > And how are non-mdev devices represented?
> > > >   
> > > non-mdev can opt to not include this field, or as you said below, a
> > > vendor signature. 
> > > 
> > > > > > 	aggregator=1
> > > > > > 	pv_mode="none+ppgtt+context"  
> > > > 
> > > > These are meaningless vendor specific matches afaict.
> > > >   
> > > yes, pv_mode and aggregator are vendor specific fields.
> > > but they are important to decide whether two devices are compatible.
> > > pv_mode means whether a vGPU supports guest paravirtualized api.
> > > "none+ppgtt+context" means guest can not use pv, or use ppgtt mode pv or
> > > use context mode pv.
> > > 
> > > > > > 	interface_version=3  
> > > > 
> > > > Not much granularity here, I prefer Sean's previous
> > > > <major>.<minor>[.bugfix] scheme.
> > > >   
> > > yes, <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if
> > > it works for a complicated scenario.
> > > e.g for pv_mode,
> > > (1) initially,  pv_mode is not supported, so it's pv_mode=none, it's 0.0.0,
> > > (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0,
> > > indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice versa.
> > > (3) later, pv_mode=context is also supported,
> > > pv_mode="none+ppgtt+context", so it's 0.2.0.
> > > 
> > > But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to
> > > name its version? "none+ppgtt" (0.1.0) is not compatible to
> > > "none+context", but "none+ppgtt+context" (0.2.0) is compatible to
> > > "none+context".
> > 
> > If pv_mode=ppgtt is removed, then the compatible versions would be
> > 0.0.0 or 1.0.0, ie. the major version would be incremented due to
> > feature removal.
> >  
> > > Maintain such scheme is painful to vendor driver.
> > 
> > Migration compatibility is painful, there's no way around that.  I
> > think the version scheme is an attempt to push some of that low level
> > burden on the vendor driver, otherwise the management tools need to
> > work on an ever growing matrix of vendor specific features which is
> > going to become unwieldy and is largely meaningless outside of the
> > vendor driver.  Instead, the vendor driver can make strategic decisions
> > about where to continue to maintain a support burden and make explicit
> > decisions to maintain or break compatibility.  The version scheme is a
> > simplification and abstraction of vendor driver features in order to
> > create a small, logical compatibility matrix.  Compromises necessarily
> > need to be made for that to occur.
> >
> ok. got it.
> 
> > > > > > COMPATIBLE:
> > > > > > 	device_type=pci
> > > > > > 	device_id=8086591d
> > > > > > 	mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}    
> > > > > this mixed notation will be hard to parse so i would avoid that.  
> > > > 
> > > > Some background, Intel has been proposing aggregation as a solution to
> > > > how we scale mdev devices when hardware exposes large numbers of
> > > > assignable objects that can be composed in essentially arbitrary ways.
> > > > So for instance, if we have a workqueue (wq), we might have an mdev
> > > > type for 1wq, 2wq, 3wq,... Nwq.  It's not really practical to expose a
> > > > discrete mdev type for each of those, so they want to define a base
> > > > type which is composable to other types via this aggregation.  This is
> > > > what this substitution and tagging is attempting to accomplish.  So
> > > > imagine this set of values for cases where it's not practical to unroll
> > > > the values for N discrete types.
> > > >   
> > > > > > 	aggregator={val1}/2  
> > > > 
> > > > So the {val1} above would be substituted here, though an aggregation
> > > > factor of 1/2 is a head scratcher...
> > > >   
> > > > > > 	pv_mode={val2:string:"none+ppgtt","none+context","none+ppgtt+context"}  
> > > > 
> > > > I'm lost on this one though.  I think maybe it's indicating that it's
> > > > compatible with any of these, so do we need to list it?  Couldn't this
> > > > be handled by Sean's version proposal where the minor version
> > > > represents feature compatibility?  
> > > yes, it's indicating that it's compatible with any of these.
> > > Sean's version proposal may also work, but it would be painful for
> > > vendor driver to maintain the versions when multiple similar features
> > > are involved.
> > 
> > This is something vendor drivers need to consider when adding and
> > removing features.
> > 
> > > > > > 	interface_version={val3:int:2,3}  
> > > > 
> > > > What does this turn into in a few years, 2,7,12,23,75,96,...
> > > >   
> > > is a range better?
> > 
> > I was really trying to point out that sparseness becomes an issue if
> > the vendor driver is largely disconnected from how their feature
> > addition and deprecation affects migration support.  Thanks,
> >
> ok. we'll use the x.y.z scheme then.
> 
> Thanks
> Yan
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: device compatibility interface for live migration with assigned devices
  2020-08-05  9:33                           ` Yan Zhao
@ 2020-08-05 10:53                             ` Jiri Pirko
  2020-08-05 11:35                               ` Sean Mooney
  0 siblings, 1 reply; 48+ messages in thread
From: Jiri Pirko @ 2020-08-05 10:53 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Jason Wang, Cornelia Huck, Alex Williamson, kvm, libvir-list,
	qemu-devel, kwankhede, eauger, xin-ran.wang, corbet,
	openstack-discuss, shaohe.feng, kevin.tian, eskultet,
	jian-feng.ding, dgilbert, zhenyuw, hejie.xu, bao.yumeng, smooney,
	intel-gvt-dev, berrange, dinechin, devel, Parav Pandit

Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.zhao@intel.com wrote:
>On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
>> 
>> On 2020/8/5 3:56 PM, Jiri Pirko wrote:
>> > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasowang@redhat.com wrote:
>> > > On 2020/8/5 10:16 AM, Yan Zhao wrote:
>> > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
>> > > > > On 2020/8/5 12:35 AM, Cornelia Huck wrote:
>> > > > > > [sorry about not chiming in earlier]
>> > > > > > 
>> > > > > > On Wed, 29 Jul 2020 16:05:03 +0800
>> > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
>> > > > > > 
>> > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
>> > > > > > (...)
>> > > > > > 
>> > > > > > > > Based on the feedback we've received, the previously proposed interface
>> > > > > > > > is not viable.  I think there's agreement that the user needs to be
>> > > > > > > > able to parse and interpret the version information.  Using json seems
>> > > > > > > > viable, but I don't know if it's the best option.  Is there any
>> > > > > > > > precedent of markup strings returned via sysfs we could follow?
>> > > > > > I don't think encoding complex information in a sysfs file is a viable
>> > > > > > approach. Quoting Documentation/filesystems/sysfs.rst:
>> > > > > > 
>> > > > > > "Attributes should be ASCII text files, preferably with only one value
>> > > > > > per file. It is noted that it may not be efficient to contain only one
>> > > > > > value per file, so it is socially acceptable to express an array of
>> > > > > > values of the same type.
>> > > > > > Mixing types, expressing multiple lines of data, and doing fancy
>> > > > > > formatting of data is heavily frowned upon."
>> > > > > > 
>> > > > > > Even though this is an older file, I think these restrictions still
>> > > > > > apply.
>> > > > > +1, that's another reason why devlink(netlink) is better.
>> > > > > 
>> > > > hi Jason,
>> > > > do you have any materials or sample code about devlink, so we can have a good
>> > > > study of it?
>> > > > I found some kernel docs about it but my preliminary study didn't show me the
>> > > > advantage of devlink.
>> > > 
>> > > CC Jiri and Parav for a better answer for this.
>> > > 
>> > > My understanding is that the following advantages are obvious (as I replied
>> > > in another thread):
>> > > 
>> > > - existing users (NIC, crypto, SCSI, ib), mature and stable
>> > > - much better error reporting (ext_ack other than string or errno)
>> > > - namespace aware
>> > > - do not couple with kobject
>> > Jason, what is your use case?
>> 
>> 
>> I think the use case is to report device compatibility for live migration.
>> Yan proposed a simple sysfs based migration version first, but it looks not
>> sufficient and something based on JSON is discussed.
>> 
>> Yan, can you help to summarize the discussion so far for Jiri as a
>> reference?
>> 
>yes.
>we are currently defining a device live migration compatibility
>interface so that user space such as openstack and libvirt knows
>which two devices are live migration compatible.
>currently the devices include mdev (a kernel emulated virtual device)
>and physical devices (e.g.  a VF of a PCI SRIOV device).
>
>the attributes we want user space to compare include
>common attributes:
>    device_api: vfio-pci, vfio-ccw...
>    mdev_type: mdev type of mdev or similar signature for physical device
>               It specifies a device's hardware capability. e.g.
>	       i915-GVTg_V5_4 means it's of 1/4 of a gen9 Intel graphics
>	       device.
>    software_version: device driver's version.
>               in <major>.<minor>[.bugfix] scheme, where there is no
>	       compatibility across major versions, minor versions have
>	       forward compatibility (ex. 1-> 2 is ok, 2 -> 1 is not) and
>	       bugfix version number indicates some degree of internal
>	       improvement that is not visible to the user in terms of
>	       features or compatibility,
>
>vendor specific attributes: each vendor may define different attributes
>   device_id : device id of a physical device or of an mdev's parent pci device.
>               it could be equal to pci id for pci devices
>   aggregator: used together with mdev_type. e.g. aggregator=2 together
>               with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel
>	       graphics device.
>   remote_url: for a local NVMe VF, it may be configured with a remote
>               url of a remote storage and all data is stored in the
>	       remote side specified by the remote url.
>   ...
>
>Comparing those attributes by user space alone is not an easy job, as it
>can't simply assume an equal relationship between source attributes and
>target attributes. e.g.
>for a source device of mdev_type=i915-GVTg_V5_4,aggregator=2, (1/2 of
>gen9), it actually could find a compatible device of
>mdev_type=i915-GVTg_V5_8,aggregator=4 (also 1/2 of gen9),
>if mdev_type of i915-GVTg_V5_4 is not available in the target machine.
>
>So, in our current proposal, we want to create two sysfs attributes
>under a device sysfs node.
>/sys/<path to device>/migration/self
>/sys/<path to device>/migration/compatible
>
>#cat /sys/<path to device>/migration/self
>device_type=vfio_pci
>mdev_type=i915-GVTg_V5_4
>device_id=8086591d
>aggregator=2
>software_version=1.0.0
>
>#cat /sys/<path to device>/migration/compatible
>device_type=vfio_pci
>mdev_type=i915-GVTg_V5_{val1:int:2,4,8}
>device_id=8086591d
>aggregator={val1}/2
>software_version=1.0.0
>
>The /sys/<path to device>/migration/self specifies self attributes of
>a device.
>The /sys/<path to device>/migration/compatible specifies the list of
>compatible devices of a device. as in the example, compatible devices
>could have
>	device_type == vfio_pci &&
>	device_id == 8086591d   &&
>	software_version == 1.0.0 &&
>        (
>	(mdev_type of i915-GVTg_V5_2 && aggregator==1) ||
>	(mdev_type of i915-GVTg_V5_4 && aggregator==2) ||
>	(mdev_type of i915-GVTg_V5_8 && aggregator==4)
>	)
>
>by checking whether a target device is in the compatible list of the
>source device, user space can know whether the two devices are live
>migration compatible.
>
>Additional notes:
>1)software_version in the compatible list may not be necessary as it
>already has a major.minor.bugfix scheme.
>2)for vendor attribute like remote_url, it may not be statically
>assigned and could be changed with a device interface.
>
>So, as Cornelia pointed out that it's not good to use a complex format in
>a sysfs attribute, we'd like to know whether there are other good ways to
>serve our use case, e.g. splitting a single attribute into multiple simple
>sysfs attributes as Cornelia suggested, or devlink, which Jason has
>strongly recommended.

Hi Yan.

Thanks for the explanation, though I'm still fuzzy about the details.
Anyway, I suggest you check the "devlink dev info" command we have
implemented for multiple drivers. You can use netdevsim to test this.
I think the info you need to expose might be put there.

Devlink creates one instance per device. The specific device driver calls
into the devlink core to create the instance.  What device do you have?
What driver is it handled by?


>
>Thanks
>Yan
>
>
>


* Re: device compatibility interface for live migration with assigned devices
  2020-08-05 10:53                             ` Jiri Pirko
@ 2020-08-05 11:35                               ` Sean Mooney
  2020-08-07 11:59                                 ` Cornelia Huck
  0 siblings, 1 reply; 48+ messages in thread
From: Sean Mooney @ 2020-08-05 11:35 UTC (permalink / raw)
  To: Jiri Pirko, Yan Zhao
  Cc: Jason Wang, Cornelia Huck, Alex Williamson, kvm, libvir-list,
	qemu-devel, kwankhede, eauger, xin-ran.wang, corbet,
	openstack-discuss, shaohe.feng, kevin.tian, eskultet,
	jian-feng.ding, dgilbert, zhenyuw, hejie.xu, bao.yumeng,
	intel-gvt-dev, berrange, dinechin, devel, Parav Pandit

On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote:
> Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.zhao@intel.com wrote:
> > On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
> > > 
> > > On 2020/8/5 3:56 PM, Jiri Pirko wrote:
> > > > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasowang@redhat.com wrote:
> > > > > On 2020/8/5 10:16 AM, Yan Zhao wrote:
> > > > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> > > > > > > On 2020/8/5 12:35 AM, Cornelia Huck wrote:
> > > > > > > > [sorry about not chiming in earlier]
> > > > > > > > 
> > > > > > > > On Wed, 29 Jul 2020 16:05:03 +0800
> > > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > 
> > > > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > > > > > > 
> > > > > > > > (...)
> > > > > > > > 
> > > > > > > > > > Based on the feedback we've received, the previously proposed interface
> > > > > > > > > > is not viable.  I think there's agreement that the user needs to be
> > > > > > > > > > able to parse and interpret the version information.  Using json seems
> > > > > > > > > > viable, but I don't know if it's the best option.  Is there any
> > > > > > > > > > precedent of markup strings returned via sysfs we could follow?
> > > > > > > > 
> > > > > > > > I don't think encoding complex information in a sysfs file is a viable
> > > > > > > > approach. Quoting Documentation/filesystems/sysfs.rst:
> > > > > > > > 
> > > > > > > > "Attributes should be ASCII text files, preferably with only one value
> > > > > > > > per file. It is noted that it may not be efficient to contain only one
> > > > > > > > value per file, so it is socially acceptable to express an array of
> > > > > > > > values of the same type.
> > > > > > > > Mixing types, expressing multiple lines of data, and doing fancy
> > > > > > > > formatting of data is heavily frowned upon."
> > > > > > > > 
> > > > > > > > Even though this is an older file, I think these restrictions still
> > > > > > > > apply.
> > > > > > > 
> > > > > > > +1, that's another reason why devlink(netlink) is better.
> > > > > > > 
> > > > > > 
> > > > > > hi Jason,
> > > > > > do you have any materials or sample code about devlink, so we can have a good
> > > > > > study of it?
> > > > > > I found some kernel docs about it but my preliminary study didn't show me the
> > > > > > advantage of devlink.
> > > > > 
> > > > > CC Jiri and Parav for a better answer for this.
> > > > > 
> > > > > My understanding is that the following advantages are obvious (as I replied
> > > > > in another thread):
> > > > > 
> > > > > - existing users (NIC, crypto, SCSI, ib), mature and stable
> > > > > - much better error reporting (ext_ack other than string or errno)
> > > > > - namespace aware
> > > > > - do not couple with kobject
> > > > 
> > > > Jason, what is your use case?
> > > 
> > > 
> > > I think the use case is to report device compatibility for live migration.
> > > Yan proposed a simple sysfs based migration version first, but it does
> > > not look sufficient, and something based on JSON is being discussed.
> > > 
> > > Yan, can you help to summarize the discussion so far for Jiri as a
> > > reference?
> > > 
> > 
> > yes.
> > we are currently defining a device live migration compatibility
> > interface in order to let user space like openstack and libvirt know
> > which two devices are live migration compatible.
> > currently the devices include mdev (a kernel emulated virtual device)
> > and physical devices (e.g.  a VF of a PCI SRIOV device).
> > 
> > the attributes we want user space to compare include
> > common attributes:
> >    device_api: vfio-pci, vfio-ccw...
> >    mdev_type: mdev type of mdev or similar signature for physical device
> >               It specifies a device's hardware capability. e.g.
> > 	       i915-GVTg_V5_4 means it's of 1/4 of a gen9 Intel graphics
> > 	       device.
by the way, this naming scheme works the opposite of how I would have expected.
I would have expected i915-GVTg_V5 to be the same as i915-GVTg_V5_1, and
i915-GVTg_V5_4 to use 4 times the amount of resource as i915-GVTg_V5_1, not 1 quarter.

I would much rather see i915-GVTg_V5_4 expressed as aggregator:i915-GVTg_V5=4,
i.e. that it is 4 of the basic i915-GVTg_V5 type.
the inversion of the relationship makes this much harder to reason about IMO.

if i915-GVTg_V5_8 and i915-GVTg_V5_4 both actually claim the same resource
and both can be used at the same time, then with your suggested naming scheme I have
to find the mdev type with the largest value, store that, and then divide it by the suffix
of the requested type every time I want to claim the resource in our placement inventories.

if we represent it the way I suggest, we don't:
if it is i915-GVTg_V5_8, I know it's using 8 of i915-GVTg_V5.
that makes it significantly simpler.
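
A rough sketch of the bookkeeping described above under the current naming
scheme. The function names are hypothetical (they are not from nova or
placement), and the sketch assumes the type with the largest suffix is the
smallest slice, counted as one inventory unit, and that every suffix divides
the largest one evenly:

```python
def type_suffix(mdev_type):
    """Return the trailing integer of a type name such as 'i915-GVTg_V5_4'."""
    return int(mdev_type.rsplit("_", 1)[1])


def units_claimed(requested_type, supported_types):
    """Inventory units that one instance of requested_type consumes.

    Assumption: the largest suffix among supported_types is the unit size,
    so a type with suffix N consumes largest / N units.
    """
    largest = max(type_suffix(t) for t in supported_types)
    return largest // type_suffix(requested_type)


supported = ["i915-GVTg_V5_2", "i915-GVTg_V5_4", "i915-GVTg_V5_8"]
print(units_claimed("i915-GVTg_V5_4", supported))  # 8 // 4 = 2 units
```

With the suggested aggregator:i915-GVTg_V5=N form, the lookup of the largest
suffix and the division would not be needed, since N would already be the
number of units.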

> >    software_version: device driver's version.
> >               in <major>.<minor>[.bugfix] scheme, where there is no
> > 	       compatibility across major versions, minor versions have
> > 	       forward compatibility (ex. 1-> 2 is ok, 2 -> 1 is not) and
> > 	       bugfix version number indicates some degree of internal
> > 	       improvement that is not visible to the user in terms of
> > 	       features or compatibility,
> > 
> > vendor specific attributes: each vendor may define different attributes
> >   device id : device id of a physical device or mdev's parent pci device.
> >               it could be equal to pci id for pci devices
> >   aggregator: used together with mdev_type. e.g. aggregator=2 together
> >               with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel
> > 	       graphics device.
> >   remote_url: for a local NVMe VF, it may be configured with a remote
> >               url of a remote storage and all data is stored in the
> > 	       remote side specified by the remote url.
> >   ...
just a minor note that I find ^ much simpler to understand than
the current proposal with self and compatible.
if I have well-defined attributes that I can parse and understand, and that
allow me to calculate what is and is not compatible, that is likely going to
be more useful, as you won't have to keep maintaining a list of other
compatible devices every time a new SKU is released.

in any case, thanks for actually sharing ^ as it makes it simpler to reason
about what you have previously proposed.
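
As an illustration of calculating compatibility from such attributes, here is
a minimal sketch. It assumes the "1 -> 2" arrow in the software_version
description means source -> target, and that the numeric suffix of mdev_type
is the denominator of the device share as in the aggregator example; all
function names are hypothetical:

```python
from fractions import Fraction


def parse_version(text):
    """Split '<major>.<minor>[.bugfix]' into (major, minor); bugfix is ignored."""
    parts = [int(p) for p in text.split(".")]
    return parts[0], parts[1]


def version_compatible(src, dst):
    """No compatibility across majors; the minor may only stay equal or grow."""
    s_major, s_minor = parse_version(src)
    d_major, d_minor = parse_version(dst)
    return s_major == d_major and s_minor <= d_minor


def device_share(mdev_type, aggregator):
    """Share of the parent device, e.g. i915-GVTg_V5_4 with aggregator=2 is 2 * 1/4."""
    denominator = int(mdev_type.rsplit("_", 1)[1])
    return Fraction(aggregator, denominator)


# The two configurations from the example both come out as half a device.
same_share = device_share("i915-GVTg_V5_4", 2) == device_share("i915-GVTg_V5_8", 4)
print(same_share, version_compatible("1.1.0", "1.2.3"))
```
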
> > 
> > Comparing those attributes by user space alone is not an easy job, as it
> > can't simply assume an equal relationship between source attributes and
> > target attributes. e.g.
> > for a source device of mdev_type=i915-GVTg_V5_4,aggregator=2, (1/2 of
> > gen9), it actually could find a compatible device of
> > mdev_type=i915-GVTg_V5_8,aggregator=4 (also 1/2 of gen9),
> > if mdev_type of i915-GVTg_V5_4 is not available in the target machine.
> > 
> > So, in our current proposal, we want to create two sysfs attributes
> > under a device sysfs node.
> > /sys/<path to device>/migration/self
> > /sys/<path to device>/migration/compatible
> > 
> > #cat /sys/<path to device>/migration/self
> > device_type=vfio_pci
> > mdev_type=i915-GVTg_V5_4
> > device_id=8086591d
> > aggregator=2
> > software_version=1.0.0
> > 
> > #cat /sys/<path to device>/migration/compatible
> > device_type=vfio_pci
> > mdev_type=i915-GVTg_V5_{val1:int:2,4,8}
> > device_id=8086591d
> > aggregator={val1}/2
> > software_version=1.0.0
> > 
> > The /sys/<path to device>/migration/self specifies self attributes of
> > a device.
> > The /sys/<path to device>/migration/compatible specifies the list of
> > compatible devices of a device. as in the example, compatible devices
> > could have
> > 	device_type == vfio_pci &&
> > 	device_id == 8086591d   &&
> > 	software_version == 1.0.0 &&
> >        (
> > 	(mdev_type of i915-GVTg_V5_2 && aggregator==1) ||
> > 	(mdev_type of i915-GVTg_V5_4 && aggregator==2) ||
> > 	(mdev_type of i915-GVTg_V5_8 && aggregator==4)
> > 	)
> > 
> > by checking whether a target device is in the compatible list of the source
> > device, user space can know whether the two devices are live migration
> > compatible.
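
To make the expansion of the compatible template concrete, here is a small
sketch. The template semantics are inferred only from the single example in
this mail (a "{name:int:a,b,c}" placeholder declaring the allowed values and
a "{name}/N" expression meaning integer division); that reading, and the
function name, are assumptions rather than a defined format:

```python
import re

# Matches a placeholder such as "{val1:int:2,4,8}".
PLACEHOLDER = re.compile(r"\{(\w+):int:([\d,]+)\}")


def expand_compatible(mdev_template, aggregator_template):
    """Return the (mdev_type, aggregator) pairs the two templates describe."""
    match = PLACEHOLDER.search(mdev_template)
    if match is None:
        raise ValueError("no placeholder found in mdev_type template")
    name = match.group(1)
    values = [int(v) for v in match.group(2).split(",")]
    pairs = []
    for value in values:
        mdev_type = (mdev_template[:match.start()] + str(value)
                     + mdev_template[match.end():])
        # "{val1}/2" becomes e.g. "4/2", evaluated as integer division.
        numerator, denominator = aggregator_template.replace(
            "{" + name + "}", str(value)).split("/")
        pairs.append((mdev_type, int(numerator) // int(denominator)))
    return pairs


pairs = expand_compatible("i915-GVTg_V5_{val1:int:2,4,8}", "{val1}/2")
for mdev_type, aggregator in pairs:
    print(mdev_type, aggregator)
```

The three pairs this produces are the same three alternatives listed in the
condition above.
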
> > 
> > Additional notes:
> > 1)software_version in the compatible list may not be necessary as it
> > already has a major.minor.bugfix scheme.
> > 2)for vendor attribute like remote_url, it may not be statically
> > assigned and could be changed with a device interface.
> > 
> > So, as Cornelia pointed out that it's not good to use a complex format in
> > a sysfs attribute, we'd like to know whether there are other good ways to
> > handle our use case, e.g. splitting a single attribute into multiple simple
> > sysfs attributes as Cornelia suggested, or devlink, which Jason has
> > strongly recommended.
> 
> Hi Yan.
> 
> Thanks for the explanation, I'm still fuzzy about the details.
> Anyway, I suggest you check the "devlink dev info" command we have
> implemented for multiple drivers.

is devlink exposed as a filesystem we can read with just open?
openstack will likely try to leverage libvirt to get this info, but when we
can't, it's much simpler to read sysfs than it is to take a dependency on a
command line tool and have to fork a shell to execute it and parse the CLI output.
pyroute2, which we use in some openstack projects, has basic python bindings for
devlink, but I'm not sure how complete they are, as I think it's a relatively new
addition. if we need to take a dependency we will, but that would be a drawback
of devlink. not a large one, just something to keep in mind.
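
For comparison, this is roughly all that reading a sysfs-style attribute takes
from python. The key=value layout is the one from the proposed migration/self
example quoted above, which is only a proposal and not an existing kernel
interface; the function names are hypothetical and the path is left to the
caller:

```python
def parse_attributes(text):
    """Parse 'key=value' lines into a dict, skipping blank lines."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


def read_attributes(path):
    """Read an attribute file with a plain open(); no CLI tool or fork needed."""
    with open(path, encoding="ascii") as attr_file:
        return parse_attributes(attr_file.read())


# Sample contents copied from the proposed migration/self example.
sample = """device_type=vfio_pci
mdev_type=i915-GVTg_V5_4
device_id=8086591d
aggregator=2
software_version=1.0.0
"""
print(parse_attributes(sample)["mdev_type"])
```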

>  You can try netdevsim to test this.
> I think that the info you need to expose might be put there.
> 
> Devlink creates instance per-device. Specific device driver calls into
> devlink core to create the instance.  What device do you have? What
> driver is it handled by?
> 
> 
> > 
> > Thanks
> > Yan
> > 
> > 
> > 
> 
> 



* Re: device compatibility interface for live migration with assigned devices
  2020-08-05 11:35                               ` Sean Mooney
@ 2020-08-07 11:59                                 ` Cornelia Huck
  0 siblings, 0 replies; 48+ messages in thread
From: Cornelia Huck @ 2020-08-07 11:59 UTC (permalink / raw)
  To: Sean Mooney
  Cc: Jiri Pirko, Yan Zhao, Jason Wang, Alex Williamson, kvm,
	libvir-list, qemu-devel, kwankhede, eauger, xin-ran.wang, corbet,
	openstack-discuss, shaohe.feng, kevin.tian, eskultet,
	jian-feng.ding, dgilbert, zhenyuw, hejie.xu, bao.yumeng,
	intel-gvt-dev, berrange, dinechin, devel, Parav Pandit

On Wed, 05 Aug 2020 12:35:01 +0100
Sean Mooney <smooney@redhat.com> wrote:

> On Wed, 2020-08-05 at 12:53 +0200, Jiri Pirko wrote:
> > Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.zhao@intel.com wrote:  

(...)

> > >    software_version: device driver's version.
> > >               in <major>.<minor>[.bugfix] scheme, where there is no
> > > 	       compatibility across major versions, minor versions have
> > > 	       forward compatibility (ex. 1-> 2 is ok, 2 -> 1 is not) and
> > > 	       bugfix version number indicates some degree of internal
> > > 	       improvement that is not visible to the user in terms of
> > > 	       features or compatibility,
> > > 
> > > vendor specific attributes: each vendor may define different attributes
> > >   device id : device id of a physical device or mdev's parent pci device.
> > >               it could be equal to pci id for pci devices
> > >   aggregator: used together with mdev_type. e.g. aggregator=2 together
> > >               with i915-GVTg_V5_4 means 2*1/4=1/2 of a gen9 Intel
> > > 	       graphics device.
> > >   remote_url: for a local NVMe VF, it may be configured with a remote
> > >               url of a remote storage and all data is stored in the
> > > 	       remote side specified by the remote url.
> > >   ...  
> just a minor note that I find ^ much simpler to understand than
> the current proposal with self and compatible.
> if I have well-defined attributes that I can parse and understand, and that
> allow me to calculate what is and is not compatible, that is likely going to
> be more useful, as you won't have to keep maintaining a list of other
> compatible devices every time a new SKU is released.
> 
> in any case, thanks for actually sharing ^ as it makes it simpler to reason
> about what you have previously proposed.

So, what would be the most helpful format? A 'software_version' field
that follows the conventions outlined above, and other (possibly
optional) fields that have to match?

(...)

> > Thanks for the explanation, I'm still fuzzy about the details.
> > Anyway, I suggest you check the "devlink dev info" command we have
> > implemented for multiple drivers.
> 
> is devlink exposed as a filesystem we can read with just open?
> openstack will likely try to leverage libvirt to get this info, but when we
> can't, it's much simpler to read sysfs than it is to take a dependency on a
> command line tool and have to fork a shell to execute it and parse the CLI output.
> pyroute2, which we use in some openstack projects, has basic python bindings for
> devlink, but I'm not sure how complete they are, as I think it's a relatively new
> addition. if we need to take a dependency we will, but that would be a drawback
> of devlink. not a large one, just something to keep in mind.

A devlinkfs, maybe? At least for reading information (IIUC, "devlink
dev info" is only about information retrieval, right?)



Thread overview: 48+ messages
2020-07-13 23:29 device compatibility interface for live migration with assigned devices Yan Zhao
2020-07-14 10:21 ` Daniel P. Berrangé
2020-07-14 12:33   ` Sean Mooney
     [not found]     ` <20200714110148.0471c03c@x1.home>
     [not found]       ` <eb705c72cdc8b6b8959b6ebaeeac6069a718d524.camel@redhat.com>
2020-07-14 21:15         ` Sean Mooney
2020-07-14 16:16   ` Alex Williamson
2020-07-14 16:47     ` Daniel P. Berrangé
2020-07-14 20:47       ` Alex Williamson
2020-07-15  9:16         ` Daniel P. Berrangé
2020-07-14 17:19     ` Dr. David Alan Gilbert
2020-07-14 20:59       ` Alex Williamson
2020-07-15  8:20         ` Yan Zhao
2020-07-15  8:49           ` Feng, Shaohe
2020-07-17 14:59           ` Alex Williamson
2020-07-17 18:03             ` Dr. David Alan Gilbert
2020-07-17 18:30               ` Alex Williamson
2020-07-15  8:23         ` Dr. David Alan Gilbert
     [not found]         ` <CAH7mGatPWsczh_rbVhx4a+psJXvkZgKou3r5HrEQTqE7SqZkKA@mail.gmail.com>
2020-07-17 15:18           ` Alex Williamson
2020-07-16  4:16 ` Jason Wang
2020-07-16  8:32   ` Yan Zhao
2020-07-16  9:30     ` Jason Wang
2020-07-17 16:12     ` Alex Williamson
2020-07-20  3:41       ` Jason Wang
2020-07-20 10:39         ` Sean Mooney
2020-07-21  2:11           ` Jason Wang
2020-07-21  0:51       ` Yan Zhao
2020-07-27  7:24         ` Yan Zhao
2020-07-27 22:23           ` Alex Williamson
2020-07-29  8:05             ` Yan Zhao
2020-07-29 11:28               ` Sean Mooney
2020-07-29 19:12                 ` Alex Williamson
2020-07-30  3:41                   ` Yan Zhao
2020-07-30 13:24                     ` Sean Mooney
2020-07-30 17:29                     ` Alex Williamson
2020-08-04  8:37                       ` Yan Zhao
2020-08-05  9:44                         ` Dr. David Alan Gilbert
2020-07-30  1:56                 ` Yan Zhao
2020-07-30 13:14                   ` Sean Mooney
2020-08-04 16:35               ` Cornelia Huck
2020-08-05  2:22                 ` Jason Wang
2020-08-05  2:16                   ` Yan Zhao
2020-08-05  2:41                     ` Jason Wang
2020-08-05  7:56                       ` Jiri Pirko
2020-08-05  8:02                         ` Jason Wang
2020-08-05  9:33                           ` Yan Zhao
2020-08-05 10:53                             ` Jiri Pirko
2020-08-05 11:35                               ` Sean Mooney
2020-08-07 11:59                                 ` Cornelia Huck
2020-07-29 19:05             ` Dr. David Alan Gilbert
