All of lore.kernel.org
 help / color / mirror / Atom feed
* Out-of-Process Device Emulation session at KVM Forum 2020
@ 2020-10-27 15:14 Stefan Hajnoczi
  2020-10-28  9:32 ` Thanos Makatos
                   ` (3 more replies)
  0 siblings, 4 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-10-27 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Elena Ufimtseva, john.g.johnson, mst@redhat.com, jag.raman, slp,
	Marc-André Lureau, kraxel, Felipe Franciosi, thanos.makatos,
	Alex Bennée, David Gibson

[-- Attachment #1: Type: text/plain, Size: 719 bytes --]

There will be a birds-of-a-feather session at KVM Forum, a chance for
us to get together and discuss Out-of-Process Device Emulation.

Please send suggestions for the agenda!

These sessions are a good opportunity to reach agreement on topics that
are hard to discuss via mailing lists.

Ideas:
 * How will we decide that the protocol is stable? Can third-party
   applications like DPDK/SPDK use the protocol in the meantime?
 * QEMU build system requirements: how to configure and build device
   emulator binaries?
 * Common sandboxing solution shared between C and Rust-based binaries?
   minijail (https://github.com/google/minijail)? bubblewrap
   (https://github.com/containers/bubblewrap)? systemd-run?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-27 15:14 Out-of-Process Device Emulation session at KVM Forum 2020 Stefan Hajnoczi
@ 2020-10-28  9:32 ` Thanos Makatos
  2020-10-28 10:07   ` Thanos Makatos
  2020-10-28 11:09 ` Michael S. Tsirkin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 40+ messages in thread
From: Thanos Makatos @ 2020-10-28  9:32 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Elena Ufimtseva, john.g.johnson, mst@redhat.com, jag.raman, slp,
	kraxel, Felipe Franciosi, Marc-André Lureau,
	Alex Bennée, David Gibson

> -----Original Message-----
> From: Stefan Hajnoczi <stefanha@redhat.com>
> Sent: 27 October 2020 15:14
> To: qemu-devel@nongnu.org
> Cc: Alex Bennée <alex.bennee@linaro.org>; mst@redhat.com
> <mtsirkin@redhat.com>; john.g.johnson@oracle.com; Elena Ufimtseva
> <elena.ufimtseva@oracle.com>; kraxel@redhat.com;
> jag.raman@oracle.com; Thanos Makatos <thanos.makatos@nutanix.com>;
> Felipe Franciosi <felipe@nutanix.com>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; slp@redhat.com; David Gibson
> <david@gibson.dropbear.id.au>
> Subject: Out-of-Process Device Emulation session at KVM Forum 2020
> 
> There will be a birds-of-a-feather session at KVM Forum, a chance for
> us to get together and discuss Out-of-Process Device Emulation.
> 
> Please send suggestions for the agenda!
> 
> These sessions are a good opportunity to reach agreement on topics that
> are hard to discuss via mailing lists.
> 
> Ideas:
>  * How will we decide that the protocol is stable? Can third-party
>    applications like DPDK/SPDK use the protocol in the meantime?
>  * QEMU build system requirements: how to configure and build device
>    emulator binaries?
>  * Common sandboxing solution shared between C and Rust-based binaries?
>    minijail (https://github.com/google/minijail)? bubblewrap
>    (https://github.com/containers/bubblewrap)? systemd-run?
> 
> Stefan

Here are a couple of issues we'd also like to talk about:

Fast switching from polling to interrupt-based notifications: when a single
process is emulating multiple devices then it might be more efficient to poll
instead of relying on interrupts for notifications. However, during periods when
the guests are mostly idle, polling might unnecessary, so we'd like to be able
switch to interrupt-based notifications at a low cost.

Device throttling during live migration: a device can easily dirty huge amounts
of guest RAM which results in live migration taking too long or making it hard
to estimate progress. Ideally, we'd like to be able to instruct an out-of-process
device emulator to make sure it won't dirty too many guest pages during a
specified window of time.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-28  9:32 ` Thanos Makatos
@ 2020-10-28 10:07   ` Thanos Makatos
  0 siblings, 0 replies; 40+ messages in thread
From: Thanos Makatos @ 2020-10-28 10:07 UTC (permalink / raw)
  To: Thanos Makatos, Stefan Hajnoczi, qemu-devel
  Cc: Elena Ufimtseva, john.g.johnson, mst@redhat.com, jag.raman, slp,
	kraxel, Felipe Franciosi, Marc-André Lureau,
	Alex Bennée, David Gibson



> -----Original Message-----
> From: Qemu-devel <qemu-devel-
> bounces+thanos.makatos=nutanix.com@nongnu.org> On Behalf Of Thanos
> Makatos
> Sent: 28 October 2020 09:32
> To: Stefan Hajnoczi <stefanha@redhat.com>; qemu-devel@nongnu.org
> Cc: Elena Ufimtseva <elena.ufimtseva@oracle.com>;
> john.g.johnson@oracle.com; mst@redhat.com <mtsirkin@redhat.com>;
> jag.raman@oracle.com; slp@redhat.com; kraxel@redhat.com; Felipe
> Franciosi <felipe@nutanix.com>; Marc-André Lureau
> <marcandre.lureau@redhat.com>; Alex Bennée <alex.bennee@linaro.org>;
> David Gibson <david@gibson.dropbear.id.au>
> Subject: RE: Out-of-Process Device Emulation session at KVM Forum 2020
> 
> > -----Original Message-----
> > From: Stefan Hajnoczi <stefanha@redhat.com>
> > Sent: 27 October 2020 15:14
> > To: qemu-devel@nongnu.org
> > Cc: Alex Bennée <alex.bennee@linaro.org>; mst@redhat.com
> > <mtsirkin@redhat.com>; john.g.johnson@oracle.com; Elena Ufimtseva
> > <elena.ufimtseva@oracle.com>; kraxel@redhat.com;
> > jag.raman@oracle.com; Thanos Makatos <thanos.makatos@nutanix.com>;
> > Felipe Franciosi <felipe@nutanix.com>; Marc-André Lureau
> > <marcandre.lureau@redhat.com>; slp@redhat.com; David Gibson
> > <david@gibson.dropbear.id.au>
> > Subject: Out-of-Process Device Emulation session at KVM Forum 2020
> >
> > There will be a birds-of-a-feather session at KVM Forum, a chance for
> > us to get together and discuss Out-of-Process Device Emulation.
> >
> > Please send suggestions for the agenda!
> >
> > These sessions are a good opportunity to reach agreement on topics that
> > are hard to discuss via mailing lists.
> >
> > Ideas:
> >  * How will we decide that the protocol is stable? Can third-party
> >    applications like DPDK/SPDK use the protocol in the meantime?
> >  * QEMU build system requirements: how to configure and build device
> >    emulator binaries?
> >  * Common sandboxing solution shared between C and Rust-based
> binaries?
> >    minijail (https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__github.com_google_minijail-29-
> 3F&d=DwIFAw&c=s883GpUCOChKOHiocYtGcg&r=XTpYsh5Ps2zJvtw6ogtti46a
> tk736SI4vgsJiUKIyDE&m=hPc4ln1oFnCIYCRna-
> C027BO06__al6zPJhAs0_KcP8&s=dqqLRGO3GvV4gAEqkMXzbhm5TtOHqLGQ
> d_0SBlzubp0&e=  bubblewrap
> >    (https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__github.com_containers_bubblewrap-29-
> 3F&d=DwIFAw&c=s883GpUCOChKOHiocYtGcg&r=XTpYsh5Ps2zJvtw6ogtti46a
> tk736SI4vgsJiUKIyDE&m=hPc4ln1oFnCIYCRna-
> C027BO06__al6zPJhAs0_KcP8&s=Rnd-
> 6YVz2xrg0Vrm6ukannwt3kmbQ8L7upVLrEc227g&e=  systemd-run?
> >
> > Stefan
> 
> Here are a couple of issues we'd also like to talk about:
> 
> Fast switching from polling to interrupt-based notifications: when a single
> process is emulating multiple devices then it might be more efficient to poll
> instead of relying on interrupts for notifications. However, during periods
> when
> the guests are mostly idle, polling might unnecessary, so we'd like to be able
> switch to interrupt-based notifications at a low cost.

Correction: there are no interrupts involved here, just guest to device notifications.

> 
> Device throttling during live migration: a device can easily dirty huge amounts
> of guest RAM which results in live migration taking too long or making it hard
> to estimate progress. Ideally, we'd like to be able to instruct an out-of-
> process
> device emulator to make sure it won't dirty too many guest pages during a
> specified window of time.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-27 15:14 Out-of-Process Device Emulation session at KVM Forum 2020 Stefan Hajnoczi
  2020-10-28  9:32 ` Thanos Makatos
@ 2020-10-28 11:09 ` Michael S. Tsirkin
  2020-10-29  8:21 ` Stefan Hajnoczi
  2020-10-29 12:08 ` Stefan Hajnoczi
  3 siblings, 0 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2020-10-28 11:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, john.g.johnson, jag.raman, slp, qemu-devel,
	Marc-André Lureau, kraxel, Felipe Franciosi, thanos.makatos,
	Alex Bennée, David Gibson

On Tue, Oct 27, 2020 at 03:14:00PM +0000, Stefan Hajnoczi wrote:
> There will be a birds-of-a-feather session at KVM Forum, a chance for
> us to get together and discuss Out-of-Process Device Emulation.
> 
> Please send suggestions for the agenda!
> 
> These sessions are a good opportunity to reach agreement on topics that
> are hard to discuss via mailing lists.
> 
> Ideas:
>  * How will we decide that the protocol is stable? Can third-party
>    applications like DPDK/SPDK use the protocol in the meantime?

and if not how do we prevent that?

>  * QEMU build system requirements: how to configure and build device
>    emulator binaries?
>  * Common sandboxing solution shared between C and Rust-based binaries?
>    minijail (https://github.com/google/minijail)? bubblewrap
>    (https://github.com/containers/bubblewrap)? systemd-run?

disconnect
migration

> Stefan




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-27 15:14 Out-of-Process Device Emulation session at KVM Forum 2020 Stefan Hajnoczi
  2020-10-28  9:32 ` Thanos Makatos
  2020-10-28 11:09 ` Michael S. Tsirkin
@ 2020-10-29  8:21 ` Stefan Hajnoczi
  2020-10-29 12:08 ` Stefan Hajnoczi
  3 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-10-29  8:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Jag Raman,
	Sergio Lopez, qemu-devel, Thanos Makatos, Gerd Hoffmann,
	Felipe Franciosi, Marc-André Lureau, Alex Bennée,
	David Gibson

The session will be at 11:00 UTC. The meeting URL is
https://meet.jit.si/QEMUOoPDevices.

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-27 15:14 Out-of-Process Device Emulation session at KVM Forum 2020 Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2020-10-29  8:21 ` Stefan Hajnoczi
@ 2020-10-29 12:08 ` Stefan Hajnoczi
  2020-10-29 13:02   ` Jason Wang
  3 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-10-29 12:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	qemu-devel, Kirti Wankhede, Paolo Bonzini, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Alex Bennée,
	David Gibson, Kevin Wolf, Halil Pasic, Daniel P. Berrange,
	Christophe de Dinechin, Thanos Makatos, fam

Here are notes from the session:

protocol stability:
    * vhost-user already exists for existing third-party applications
    * vfio-user is more general but will take more time to develop
    * libvfio-user can be provided to allow device implementations

management:
    * Should QEMU launch device emulation processes?
        * Nicer user experience
        * Technical blockers: forking, hotplug, security is hard once
QEMU has started running
        * Probably requires a new process model with a long-running
QEMU management process proxying QMP requests to the emulator process

migration:
    * dbus-vmstate
    * VFIO live migration ioctls
        * Source device can continue if migration fails
        * Opaque blobs are transferred to destination, destination can
fail migration if it decides the blobs are incompatible
        * How does the VMM share the migration data region with the
device emulation process?
            * The vfio-user protocol can trap or mmap
    * device versioning (like versioned machine types) needed to pin
the guest-visible device ABI
    * Felipe will investigate live migration

reconnection:
    * How to support reconnection?
        * QEMU has relatively little state of a vfio-user device
        * vhost-user has more state so it's a little easier to
reconnect or migrate
    * Build in reconnection and live migration from the start to avoid
difficulties in the future
    * Relationship between migration and reconnection?
        * VFIO has a mechanism for saving/loading device state
        * Lots of different reconnection cases that need to be thought through

security & sandboxing:
    * Goal: make it easy to lock down the process so developers don't
need to reinvent sandboxing
    * minijail
        * in-process
    * firecracker jailer
    * bubblewrap
        * launcher tool
    * systemd-run
        * launcher tool


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 12:08 ` Stefan Hajnoczi
@ 2020-10-29 13:02   ` Jason Wang
  2020-10-29 13:06     ` Paolo Bonzini
                       ` (3 more replies)
  0 siblings, 4 replies; 40+ messages in thread
From: Jason Wang @ 2020-10-29 13:02 UTC (permalink / raw)
  To: Stefan Hajnoczi, Stefan Hajnoczi
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Anup Patel, Claudio Imbrenda, Christian Borntraeger,
	Roman Kagan, Felipe Franciosi, Marc-André Lureau,
	Jens Freimann, Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Thanos Makatos,
	Alex Bennée, David Gibson, Kevin Wolf, Halil Pasic,
	Daniel P. Berrange, Christophe de Dinechin, Paolo Bonzini, fam


On 2020/10/29 下午8:08, Stefan Hajnoczi wrote:
> Here are notes from the session:
>
> protocol stability:
>      * vhost-user already exists for existing third-party applications
>      * vfio-user is more general but will take more time to develop
>      * libvfio-user can be provided to allow device implementations
>
> management:
>      * Should QEMU launch device emulation processes?
>          * Nicer user experience
>          * Technical blockers: forking, hotplug, security is hard once
> QEMU has started running
>          * Probably requires a new process model with a long-running
> QEMU management process proxying QMP requests to the emulator process
>
> migration:
>      * dbus-vmstate
>      * VFIO live migration ioctls
>          * Source device can continue if migration fails
>          * Opaque blobs are transferred to destination, destination can
> fail migration if it decides the blobs are incompatible


I'm not sure this can work:

1) Reading something that is opaque to userspace is probably a hint of 
bad uAPI design
2) Did qemu even try to migrate opaque blobs before? It's probably a bad 
design of migration protocol as well.

It looks to me have a migration driver in qemu that can clearly define 
each byte in the migration stream is a better approach.


>          * How does the VMM share the migration data region with the
> device emulation process?
>              * The vfio-user protocol can trap or mmap
>      * device versioning (like versioned machine types) needed to pin
> the guest-visible device ABI
>      * Felipe will investigate live migration
>
> reconnection:
>      * How to support reconnection?
>          * QEMU has relatively little state of a vfio-user device
>          * vhost-user has more state so it's a little easier to
> reconnect or migrate


It could be even more easier, e.g for the inflight indices, we can 
design (or forcing to use in order) virtqueue carefully then we don't 
need any auxiliary data structure.

Thanks


>      * Build in reconnection and live migration from the start to avoid
> difficulties in the future
>      * Relationship between migration and reconnection?
>          * VFIO has a mechanism for saving/loading device state
>          * Lots of different reconnection cases that need to be thought through
>
> security & sandboxing:
>      * Goal: make it easy to lock down the process so developers don't
> need to reinvent sandboxing
>      * minijail
>          * in-process
>      * firecracker jailer
>      * bubblewrap
>          * launcher tool
>      * systemd-run
>          * launcher tool
>



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 13:02   ` Jason Wang
@ 2020-10-29 13:06     ` Paolo Bonzini
  2020-10-29 14:08     ` Stefan Hajnoczi
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2020-10-29 13:06 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi, Stefan Hajnoczi
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Anup Patel, Claudio Imbrenda, Christian Borntraeger,
	Roman Kagan, Felipe Franciosi, Marc-André Lureau,
	Jens Freimann, Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Alex Bennée,
	David Gibson, Kevin Wolf, Halil Pasic, Daniel P. Berrange,
	Christophe de Dinechin, Thanos Makatos, fam

On 29/10/20 14:02, Jason Wang wrote:
> 
> 
> 1) Reading something that is opaque to userspace is probably a hint of
> bad uAPI design
> 2) Did qemu even try to migrate opaque blobs before? It's probably a bad
> design of migration protocol as well.

The nested live migration data is an opaque blob.  The format is
documented by the kernel; for Intel it is also guest-visible.  However
QEMU doesn't try to parse it.

Paolo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 13:02   ` Jason Wang
  2020-10-29 13:06     ` Paolo Bonzini
@ 2020-10-29 14:08     ` Stefan Hajnoczi
  2020-10-29 14:31     ` Alex Williamson
  2020-10-29 16:15     ` David Edmondson
  3 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-10-29 14:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Anup Patel, Claudio Imbrenda, Christian Borntraeger,
	Roman Kagan, Felipe Franciosi, Marc-André Lureau,
	Jens Freimann, Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Thanos Makatos, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Paolo Bonzini, fam

On Thu, Oct 29, 2020 at 1:03 PM Jason Wang <jasowang@redhat.com> wrote:
> On 2020/10/29 下午8:08, Stefan Hajnoczi wrote:
> > Here are notes from the session:
> >
> > protocol stability:
> >      * vhost-user already exists for existing third-party applications
> >      * vfio-user is more general but will take more time to develop
> >      * libvfio-user can be provided to allow device implementations
> >
> > management:
> >      * Should QEMU launch device emulation processes?
> >          * Nicer user experience
> >          * Technical blockers: forking, hotplug, security is hard once
> > QEMU has started running
> >          * Probably requires a new process model with a long-running
> > QEMU management process proxying QMP requests to the emulator process
> >
> > migration:
> >      * dbus-vmstate
> >      * VFIO live migration ioctls
> >          * Source device can continue if migration fails
> >          * Opaque blobs are transferred to destination, destination can
> > fail migration if it decides the blobs are incompatible
>
>
> I'm not sure this can work:
>
> 1) Reading something that is opaque to userspace is probably a hint of
> bad uAPI design
> 2) Did qemu even try to migrate opaque blobs before? It's probably a bad
> design of migration protocol as well.
>
> It looks to me have a migration driver in qemu that can clearly define
> each byte in the migration stream is a better approach.

Here is the kernel patch series if you want to review it:
https://lore.kernel.org/kvm/20191203110412.055c38df@x1.home/t/

There is also a QEMU patch series:
https://patchwork.kernel.org/project/qemu-devel/cover/1566845753-18993-1-git-send-email-kwankhede@nvidia.com/

Kirti is also on the CC list if you want to discuss specific questions.

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 13:02   ` Jason Wang
  2020-10-29 13:06     ` Paolo Bonzini
  2020-10-29 14:08     ` Stefan Hajnoczi
@ 2020-10-29 14:31     ` Alex Williamson
  2020-10-29 15:09       ` Jason Wang
  2020-10-29 16:15     ` David Edmondson
  3 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2020-10-29 14:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	Stefan Hajnoczi, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Paolo Bonzini, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Thanos Makatos, fam

On Thu, 29 Oct 2020 21:02:05 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2020/10/29 下午8:08, Stefan Hajnoczi wrote:
> > Here are notes from the session:
> >
> > protocol stability:
> >      * vhost-user already exists for existing third-party applications
> >      * vfio-user is more general but will take more time to develop
> >      * libvfio-user can be provided to allow device implementations
> >
> > management:
> >      * Should QEMU launch device emulation processes?
> >          * Nicer user experience
> >          * Technical blockers: forking, hotplug, security is hard once
> > QEMU has started running
> >          * Probably requires a new process model with a long-running
> > QEMU management process proxying QMP requests to the emulator process
> >
> > migration:
> >      * dbus-vmstate
> >      * VFIO live migration ioctls
> >          * Source device can continue if migration fails
> >          * Opaque blobs are transferred to destination, destination can
> > fail migration if it decides the blobs are incompatible  
> 
> 
> I'm not sure this can work:
> 
> 1) Reading something that is opaque to userspace is probably a hint of 
> bad uAPI design
> 2) Did qemu even try to migrate opaque blobs before? It's probably a bad 
> design of migration protocol as well.
> 
> It looks to me have a migration driver in qemu that can clearly define 
> each byte in the migration stream is a better approach.

Any time during the previous two years of development might have been a
more appropriate time to express your doubts.

Note that we're not talking about vDPA devices here, we're talking
about arbitrary devices with arbitrary state.  Some degree of migration
support for assigned devices can be implemented in QEMU, Alex Graf
proved this several years ago with i40evf.  Years later, we don't have
any vendors proposing device specific migration code for QEMU.

Clearly we're also trying to account for proprietary devices where even
for suspend/resume support, proprietary drivers may be required for
manipulating that internal state.  When we move device emulation
outside of QEMU, whether in kernel or to other userspace processes,
does it still make sense to require code in QEMU to support
interpretation of that device for migration purposes?  That seems
counter to the actual goal of out-of-process devices and clearly hasn't
work for us so far.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 14:31     ` Alex Williamson
@ 2020-10-29 15:09       ` Jason Wang
  2020-10-29 15:46         ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2020-10-29 15:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	Stefan Hajnoczi, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Paolo Bonzini, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Thanos Makatos, fam


On 2020/10/29 下午10:31, Alex Williamson wrote:
> On Thu, 29 Oct 2020 21:02:05 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2020/10/29 下午8:08, Stefan Hajnoczi wrote:
>>> Here are notes from the session:
>>>
>>> protocol stability:
>>>       * vhost-user already exists for existing third-party applications
>>>       * vfio-user is more general but will take more time to develop
>>>       * libvfio-user can be provided to allow device implementations
>>>
>>> management:
>>>       * Should QEMU launch device emulation processes?
>>>           * Nicer user experience
>>>           * Technical blockers: forking, hotplug, security is hard once
>>> QEMU has started running
>>>           * Probably requires a new process model with a long-running
>>> QEMU management process proxying QMP requests to the emulator process
>>>
>>> migration:
>>>       * dbus-vmstate
>>>       * VFIO live migration ioctls
>>>           * Source device can continue if migration fails
>>>           * Opaque blobs are transferred to destination, destination can
>>> fail migration if it decides the blobs are incompatible
>>
>> I'm not sure this can work:
>>
>> 1) Reading something that is opaque to userspace is probably a hint of
>> bad uAPI design
>> 2) Did qemu even try to migrate opaque blobs before? It's probably a bad
>> design of migration protocol as well.
>>
>> It looks to me have a migration driver in qemu that can clearly define
>> each byte in the migration stream is a better approach.
> Any time during the previous two years of development might have been a
> more appropriate time to express your doubts.


Somehow I did that in this series[1]. But the main issue is still there. 
Is this legal to have a uAPI that turns out to be opaque to userspace? 
(VFIO seems to be the first). If it's not,  the only choice is to do 
that in Qemu.


>
> Note that we're not talking about vDPA devices here, we're talking
> about arbitrary devices with arbitrary state.  Some degree of migration
> support for assigned devices can be implemented in QEMU, Alex Graf
> proved this several years ago with i40evf.  Years later, we don't have
> any vendors proposing device specific migration code for QEMU.


Yes but it's not necessarily VFIO as well.


>
> Clearly we're also trying to account for proprietary devices where even
> for suspend/resume support, proprietary drivers may be required for
> manipulating that internal state.  When we move device emulation
> outside of QEMU, whether in kernel or to other userspace processes,
> does it still make sense to require code in QEMU to support
> interpretation of that device for migration purposes?


Well, we could extend Qemu to support property module (or have we 
supported that now?). And then it can talk to property drivers via 
either VFIO or vendor specific uAPI.


>   That seems
> counter to the actual goal of out-of-process devices and clearly hasn't
> work for us so far.  Thanks,
>
> Alex


Thanks

[1] 
https://lore.kernel.org/kvm/20200914084449.0182e8a9@x1.home/T/#m23b08f92a7269fa9676b91dacb6699a78d4b3949




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 15:09       ` Jason Wang
@ 2020-10-29 15:46         ` Alex Williamson
  2020-10-29 16:10           ` Paolo Bonzini
  2020-10-30  1:11           ` Jason Wang
  0 siblings, 2 replies; 40+ messages in thread
From: Alex Williamson @ 2020-10-29 15:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	Stefan Hajnoczi, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Paolo Bonzini, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Thanos Makatos, fam

On Thu, 29 Oct 2020 23:09:33 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2020/10/29 下午10:31, Alex Williamson wrote:
> > On Thu, 29 Oct 2020 21:02:05 +0800
> > Jason Wang <jasowang@redhat.com> wrote:
> >  
> >> On 2020/10/29 下午8:08, Stefan Hajnoczi wrote:  
> >>> Here are notes from the session:
> >>>
> >>> protocol stability:
> >>>       * vhost-user already exists for existing third-party applications
> >>>       * vfio-user is more general but will take more time to develop
> >>>       * libvfio-user can be provided to allow device implementations
> >>>
> >>> management:
> >>>       * Should QEMU launch device emulation processes?
> >>>           * Nicer user experience
> >>>           * Technical blockers: forking, hotplug, security is hard once
> >>> QEMU has started running
> >>>           * Probably requires a new process model with a long-running
> >>> QEMU management process proxying QMP requests to the emulator process
> >>>
> >>> migration:
> >>>       * dbus-vmstate
> >>>       * VFIO live migration ioctls
> >>>           * Source device can continue if migration fails
> >>>           * Opaque blobs are transferred to destination, destination can
> >>> fail migration if it decides the blobs are incompatible  
> >>
> >> I'm not sure this can work:
> >>
> >> 1) Reading something that is opaque to userspace is probably a hint of
> >> bad uAPI design
> >> 2) Did qemu even try to migrate opaque blobs before? It's probably a bad
> >> design of migration protocol as well.
> >>
> >> It looks to me have a migration driver in qemu that can clearly define
> >> each byte in the migration stream is a better approach.  
> > Any time during the previous two years of development might have been a
> > more appropriate time to express your doubts.  
> 
> 
> Somehow I did that in this series[1]. But the main issue is still there. 

That series is related to a migration compatibility interface, not the
migration data itself.

> Is this legal to have a uAPI that turns out to be opaque to userspace? 
> (VFIO seems to be the first). If it's not,  the only choice is to do 
> that in Qemu.

So you're suggesting that any time the kernel is passing through opaque
data that gets interpreted by some entity elsewhere, potentially with
proprietary code, that we're in legal jeopardy?  VFIO is certainly not
the first to do that (storage and network devices come to mind).
Devices are essentially opaque data themselves, vfio provides access to
(ex.) BARs, but the interpretation of what resides in that BAR is device
specific.  Sometimes it's defined in a public datasheet, sometimes not.
Suggesting that we can't move opaque data through a uAPI seems rather
absurd.

> > Note that we're not talking about vDPA devices here, we're talking
> > about arbitrary devices with arbitrary state.  Some degree of migration
> > support for assigned devices can be implemented in QEMU, Alex Graf
> > proved this several years ago with i40evf.  Years later, we don't have
> > any vendors proposing device specific migration code for QEMU.  
> 
> 
> Yes but it's not necessarily VFIO as well.

I don't know what this means.

> >
> > Clearly we're also trying to account for proprietary devices where even
> > for suspend/resume support, proprietary drivers may be required for
> > manipulating that internal state.  When we move device emulation
> > outside of QEMU, whether in kernel or to other userspace processes,
> > does it still make sense to require code in QEMU to support
> > interpretation of that device for migration purposes?  
> 
> 
> Well, we could extend Qemu to support property module (or have we 
> supported that now?). And then it can talk to property drivers via 
> either VFIO or vendor specific uAPI.

Yikes, I thought out-of-process devices was exactly the compromise
being developed to avoid QEMU supporting proprietary modules and ad-hoc
vendor specific uAPIs.  I think you're actually questioning even the
premise of developing a standardized API for out-of-process devices
here.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 15:46         ` Alex Williamson
@ 2020-10-29 16:10           ` Paolo Bonzini
  2020-10-30  1:11           ` Jason Wang
  1 sibling, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2020-10-29 16:10 UTC (permalink / raw)
  To: Alex Williamson, Jason Wang
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	Stefan Hajnoczi, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Alex Bennée, David Gibson, Kevin Wolf, Halil Pasic,
	Daniel P. Berrange, Christophe de Dinechin, Thanos Makatos, fam

On 29/10/20 16:46, Alex Williamson wrote:
>>> Clearly we're also trying to account for proprietary devices where even
>>> for suspend/resume support, proprietary drivers may be required for
>>> manipulating that internal state.  When we move device emulation
>>> outside of QEMU, whether in kernel or to other userspace processes,
>>> does it still make sense to require code in QEMU to support
>>> interpretation of that device for migration purposes?  
>>
>> Well, we could extend Qemu to support property module (or have we 
>> supported that now?). And then it can talk to property drivers via 
>> either VFIO or vendor specific uAPI.
>
> Yikes, I thought out-of-process devices was exactly the compromise
> being developed to avoid QEMU supporting proprietary modules and ad-hoc
> vendor specific uAPIs.  I think you're actually questioning even the
> premise of developing a standardized API for out-of-process devices
> here.

Strongly agreed!  Some (including me :)) would very much prefer not
having proprietary device emulation at all; at the same time
out-of-process devices make sense for _technical_ reasons (cross-VM
operation, privilege separation, isolation of less secure code) that are
strong enough to accept the reality of allowing proprietary
out-of-process code.  Especially if people could anyway go for an
inferior solution using VFIO, putting the kernel between QEMU and the
proprietary emulation just to get what they want.

Having to choose between opaque migration blobs and proprietary modules,
I would certainly go for the former.

Paolo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 13:02   ` Jason Wang
                       ` (2 preceding siblings ...)
  2020-10-29 14:31     ` Alex Williamson
@ 2020-10-29 16:15     ` David Edmondson
  2020-10-29 16:42       ` Daniel P. Berrangé
  3 siblings, 1 reply; 40+ messages in thread
From: David Edmondson @ 2020-10-29 16:15 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi, Stefan Hajnoczi
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Anup Patel, Claudio Imbrenda, Christian Borntraeger,
	Roman Kagan, Felipe Franciosi, Marc-André Lureau,
	Jens Freimann, Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Paolo Bonzini,
	Alex Bennée, David Gibson, Kevin Wolf, Halil Pasic,
	Daniel P. Berrange, Christophe de Dinechin, Thanos Makatos, fam

On Thursday, 2020-10-29 at 21:02:05 +08, Jason Wang wrote:

> 2) Did qemu even try to migrate opaque blobs before? It's probably a bad 
> design of migration protocol as well.

The TPM emulator backend migrates blobs that are only understood by
swtpm.

dme.
-- 
She's as sweet as Tupelo honey, she's an angel of the first degree.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 16:15     ` David Edmondson
@ 2020-10-29 16:42       ` Daniel P. Berrangé
  2020-10-29 17:47         ` Kirti Wankhede
  0 siblings, 1 reply; 40+ messages in thread
From: Daniel P. Berrangé @ 2020-10-29 16:42 UTC (permalink / raw)
  To: David Edmondson
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	Stefan Hajnoczi, Jason Wang, qemu-devel, Kirti Wankhede,
	Gerd Hoffmann, Yan Vugenfirer, Jag Raman, Anup Patel,
	Claudio Imbrenda, Christian Borntraeger, Roman Kagan,
	Felipe Franciosi, Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Thanos Makatos, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Christophe de Dinechin, Paolo Bonzini, fam

On Thu, Oct 29, 2020 at 04:15:30PM +0000, David Edmondson wrote:
> On Thursday, 2020-10-29 at 21:02:05 +08, Jason Wang wrote:
> 
> > 2) Did qemu even try to migrate opaque blobs before? It's probably a bad 
> > design of migration protocol as well.
> 
> The TPM emulator backend migrates blobs that are only understood by
> swtpm.

The separate slirp-helper net backend does the same too IIUC

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 16:42       ` Daniel P. Berrangé
@ 2020-10-29 17:47         ` Kirti Wankhede
  2020-10-29 18:07           ` Paolo Bonzini
  0 siblings, 1 reply; 40+ messages in thread
From: Kirti Wankhede @ 2020-10-29 17:47 UTC (permalink / raw)
  To: Daniel P. Berrangé, David Edmondson
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	Stefan Hajnoczi, Jason Wang, qemu-devel, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Thanos Makatos, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Christophe de Dinechin, Paolo Bonzini, fam



On 10/29/2020 10:12 PM, Daniel P. Berrangé wrote:
> On Thu, Oct 29, 2020 at 04:15:30PM +0000, David Edmondson wrote:
>> On Thursday, 2020-10-29 at 21:02:05 +08, Jason Wang wrote:
>>
>>> 2) Did qemu even try to migrate opaque blobs before? It's probably a bad
>>> design of migration protocol as well.
>>
>> The TPM emulator backend migrates blobs that are only understood by
>> swtpm.
> 
> The separate slirp-helper net backend does the same too IIUC
> 

When sys mem pages are marked dirty and content is copied to 
destination, content of sys mem is also opaque to QEMU.

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 17:47         ` Kirti Wankhede
@ 2020-10-29 18:07           ` Paolo Bonzini
  2020-10-30  1:15             ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2020-10-29 18:07 UTC (permalink / raw)
  To: Kirti Wankhede, Daniel P. Berrangé, David Edmondson
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	Stefan Hajnoczi, Jason Wang, qemu-devel, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Alex Bennée, David Gibson, Kevin Wolf, Halil Pasic,
	Christophe de Dinechin, Thanos Makatos, fam

On 29/10/20 18:47, Kirti Wankhede wrote:
> 
> On 10/29/2020 10:12 PM, Daniel P. Berrangé wrote:
>> On Thu, Oct 29, 2020 at 04:15:30PM +0000, David Edmondson wrote:
>>> On Thursday, 2020-10-29 at 21:02:05 +08, Jason Wang wrote:
>>>
>>>> 2) Did qemu even try to migrate opaque blobs before? It's probably a
>>>> bad
>>>> design of migration protocol as well.
>>>
>>> The TPM emulator backend migrates blobs that are only understood by
>>> swtpm.
>>
>> The separate slirp-helper net backend does the same too IIUC
> 
> When sys mem pages are marked dirty and content is copied to
> destination, content of sys mem is also opaque to QEMU.

Non-opaque RAM might be a bit too much to expect, though. :)

Paolo



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 15:46         ` Alex Williamson
  2020-10-29 16:10           ` Paolo Bonzini
@ 2020-10-30  1:11           ` Jason Wang
  2020-10-30  3:04             ` Alex Williamson
  1 sibling, 1 reply; 40+ messages in thread
From: Jason Wang @ 2020-10-30  1:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	Stefan Hajnoczi, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Thanos Makatos, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Paolo Bonzini, fam


On 2020/10/29 下午11:46, Alex Williamson wrote:
> On Thu, 29 Oct 2020 23:09:33 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2020/10/29 下午10:31, Alex Williamson wrote:
>>> On Thu, 29 Oct 2020 21:02:05 +0800
>>> Jason Wang <jasowang@redhat.com> wrote:
>>>   
>>>> On 2020/10/29 下午8:08, Stefan Hajnoczi wrote:
>>>>> Here are notes from the session:
>>>>>
>>>>> protocol stability:
>>>>>        * vhost-user already exists for existing third-party applications
>>>>>        * vfio-user is more general but will take more time to develop
>>>>>        * libvfio-user can be provided to allow device implementations
>>>>>
>>>>> management:
>>>>>        * Should QEMU launch device emulation processes?
>>>>>            * Nicer user experience
>>>>>            * Technical blockers: forking, hotplug, security is hard once
>>>>> QEMU has started running
>>>>>            * Probably requires a new process model with a long-running
>>>>> QEMU management process proxying QMP requests to the emulator process
>>>>>
>>>>> migration:
>>>>>        * dbus-vmstate
>>>>>        * VFIO live migration ioctls
>>>>>            * Source device can continue if migration fails
>>>>>            * Opaque blobs are transferred to destination, destination can
>>>>> fail migration if it decides the blobs are incompatible
>>>> I'm not sure this can work:
>>>>
>>>> 1) Reading something that is opaque to userspace is probably a hint of
>>>> bad uAPI design
>>>> 2) Did qemu even try to migrate opaque blobs before? It's probably a bad
>>>> design of migration protocol as well.
>>>>
>>>> It looks to me have a migration driver in qemu that can clearly define
>>>> each byte in the migration stream is a better approach.
>>> Any time during the previous two years of development might have been a
>>> more appropriate time to express your doubts.
>>
>> Somehow I did that in this series[1]. But the main issue is still there.
> That series is related to a migration compatibility interface, not the
> migration data itself.


They are not independent. The compatibility interface design depends on 
the migration data design. I ask the uAPI issue in that thread but 
without any response.


>
>> Is this legal to have a uAPI that turns out to be opaque to userspace?
>> (VFIO seems to be the first). If it's not,  the only choice is to do
>> that in Qemu.
> So you're suggesting that any time the kernel is passing through opaque
> data that gets interpreted by some entity elsewhere, potentially with
> proprietary code, that we're in legal jeopardy?  VFIO is certainly not
> the first to do that (storage and network devices come to mind).
> Devices are essentially opaque data themselves, vfio provides access to
> (ex.) BARs, but the interpretation of what resides in that BAR is device
> specific.  Sometimes it's defined in a public datasheet, sometimes not.
> Suggesting that we can't move opaque data through a uAPI seems rather
> absurd.


No, I think we are talking about different things. What I meant is the 
data carried via uAPI should not opaque userspace. What you said here is 
a good example for this actually. When you expose BAR to userspace, 
there should be driver that knows the semantics of BAR running in the 
userspace, so it's not opaque to userspace.


>
>>> Note that we're not talking about vDPA devices here, we're talking
>>> about arbitrary devices with arbitrary state.  Some degree of migration
>>> support for assigned devices can be implemented in QEMU, Alex Graf
>>> proved this several years ago with i40evf.  Years later, we don't have
>>> any vendors proposing device specific migration code for QEMU.
>>
>> Yes but it's not necessarily VFIO as well.
> I don't know what this means.


I meant we can't not assume VFIO is the only uAPI that will be used by Qemu.


>
>>> Clearly we're also trying to account for proprietary devices where even
>>> for suspend/resume support, proprietary drivers may be required for
>>> manipulating that internal state.  When we move device emulation
>>> outside of QEMU, whether in kernel or to other userspace processes,
>>> does it still make sense to require code in QEMU to support
>>> interpretation of that device for migration purposes?
>>
>> Well, we could extend Qemu to support property module (or have we
>> supported that now?). And then it can talk to property drivers via
>> either VFIO or vendor specific uAPI.
> Yikes, I thought out-of-process devices was exactly the compromise
> being developed to avoid QEMU supporting proprietary modules and ad-hoc
> vendor specific uAPIs.


We can't even prevent this in kernel, so I don't see how possible we can 
make it for Qemu.


> I think you're actually questioning even the
> premise of developing a standardized API for out-of-process devices
> here.  Thanks,


Actually not, it's just question in my mind when looking at VFIO 
migration compatibility patches, since vfio-user is being proposed, it's 
a good time to revisit them.

Thanks


>
> Alex
>
>



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-29 18:07           ` Paolo Bonzini
@ 2020-10-30  1:15             ` Jason Wang
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Wang @ 2020-10-30  1:15 UTC (permalink / raw)
  To: Paolo Bonzini, Kirti Wankhede, Daniel P. Berrangé, David Edmondson
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	Stefan Hajnoczi, qemu-devel, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Anup Patel, Claudio Imbrenda, Christian Borntraeger,
	Roman Kagan, Felipe Franciosi, Marc-André Lureau,
	Jens Freimann, Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Alex Bennée, David Gibson, Kevin Wolf, Halil Pasic,
	Christophe de Dinechin, Thanos Makatos, fam


On 2020/10/30 上午2:07, Paolo Bonzini wrote:
> On 29/10/20 18:47, Kirti Wankhede wrote:
>> On 10/29/2020 10:12 PM, Daniel P. Berrangé wrote:
>>> On Thu, Oct 29, 2020 at 04:15:30PM +0000, David Edmondson wrote:
>>>> On Thursday, 2020-10-29 at 21:02:05 +08, Jason Wang wrote:
>>>>
>>>>> 2) Did qemu even try to migrate opaque blobs before? It's probably a
>>>>> bad
>>>>> design of migration protocol as well.
>>>> The TPM emulator backend migrates blobs that are only understood by
>>>> swtpm.
>>> The separate slirp-helper net backend does the same too IIUC
>> When sys mem pages are marked dirty and content is copied to
>> destination, content of sys mem is also opaque to QEMU.
> Non-opaque RAM might be a bit too much to expect, though. :)
>
> Paolo
>

True, and in this case you know you don't need to care about compatibility.

Thanks




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-30  1:11           ` Jason Wang
@ 2020-10-30  3:04             ` Alex Williamson
  2020-10-30  6:21               ` Stefan Hajnoczi
                                 ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Alex Williamson @ 2020-10-30  3:04 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	Stefan Hajnoczi, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Thanos Makatos, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Paolo Bonzini, fam

On Fri, 30 Oct 2020 09:11:23 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2020/10/29 下午11:46, Alex Williamson wrote:
> > On Thu, 29 Oct 2020 23:09:33 +0800
> > Jason Wang <jasowang@redhat.com> wrote:
> >  
> >> On 2020/10/29 下午10:31, Alex Williamson wrote:  
> >>> On Thu, 29 Oct 2020 21:02:05 +0800
> >>> Jason Wang <jasowang@redhat.com> wrote:
> >>>     
> >>>> On 2020/10/29 下午8:08, Stefan Hajnoczi wrote:  
> >>>>> Here are notes from the session:
> >>>>>
> >>>>> protocol stability:
> >>>>>        * vhost-user already exists for existing third-party applications
> >>>>>        * vfio-user is more general but will take more time to develop
> >>>>>        * libvfio-user can be provided to allow device implementations
> >>>>>
> >>>>> management:
> >>>>>        * Should QEMU launch device emulation processes?
> >>>>>            * Nicer user experience
> >>>>>            * Technical blockers: forking, hotplug, security is hard once
> >>>>> QEMU has started running
> >>>>>            * Probably requires a new process model with a long-running
> >>>>> QEMU management process proxying QMP requests to the emulator process
> >>>>>
> >>>>> migration:
> >>>>>        * dbus-vmstate
> >>>>>        * VFIO live migration ioctls
> >>>>>            * Source device can continue if migration fails
> >>>>>            * Opaque blobs are transferred to destination, destination can
> >>>>> fail migration if it decides the blobs are incompatible  
> >>>> I'm not sure this can work:
> >>>>
> >>>> 1) Reading something that is opaque to userspace is probably a hint of
> >>>> bad uAPI design
> >>>> 2) Did qemu even try to migrate opaque blobs before? It's probably a bad
> >>>> design of migration protocol as well.
> >>>>
> >>>> It looks to me have a migration driver in qemu that can clearly define
> >>>> each byte in the migration stream is a better approach.  
> >>> Any time during the previous two years of development might have been a
> >>> more appropriate time to express your doubts.  
> >>
> >> Somehow I did that in this series[1]. But the main issue is still there.  
> > That series is related to a migration compatibility interface, not the
> > migration data itself.  
> 
> 
> They are not independent. The compatibility interface design depends on 
> the migration data design. I ask the uAPI issue in that thread but 
> without any response.
> 
> 
> >  
> >> Is this legal to have a uAPI that turns out to be opaque to userspace?
> >> (VFIO seems to be the first). If it's not,  the only choice is to do
> >> that in Qemu.  
> > So you're suggesting that any time the kernel is passing through opaque
> > data that gets interpreted by some entity elsewhere, potentially with
> > proprietary code, that we're in legal jeopardy?  VFIO is certainly not
> > the first to do that (storage and network devices come to mind).
> > Devices are essentially opaque data themselves, vfio provides access to
> > (ex.) BARs, but the interpretation of what resides in that BAR is device
> > specific.  Sometimes it's defined in a public datasheet, sometimes not.
> > Suggesting that we can't move opaque data through a uAPI seems rather
> > absurd.  
> 
> 
> No, I think we are talking about different things. What I meant is the 
> data carried via uAPI should not opaque userspace. What you said here is 
> a good example for this actually. When you expose BAR to userspace, 
> there should be driver that knows the semantics of BAR running in the 
> userspace, so it's not opaque to userspace.


But the thing running in userspace might be QEMU, which doesn't know
the semantics of the BAR, it might not be until a driver in the guest
that we have something that understands the BAR semantics beyond opaque
data.  We might have nested guests, so it could be passed through
multiple userspaces as opaque data.  The requirement make no sense.


> >>> Note that we're not talking about vDPA devices here, we're talking
> >>> about arbitrary devices with arbitrary state.  Some degree of migration
> >>> support for assigned devices can be implemented in QEMU, Alex Graf
> >>> proved this several years ago with i40evf.  Years later, we don't have
> >>> any vendors proposing device specific migration code for QEMU.  
> >>
> >> Yes but it's not necessarily VFIO as well.  
> > I don't know what this means.  
> 
> 
> I meant we can't not assume VFIO is the only uAPI that will be used by Qemu.

 
And we don't, DPDK, SPDK, various other userspaces exist.  All can take
advantage of the migration uAPI that we've developed rather than
implementing device specific code in their projects.  I'm not sure how
this is strengthening your argument for device specific migration code
in QEMU, which would need to be replicated in every other userspace.  As
opaque data with a well defined protocol, each userspace can implement
support for this migration protocol once and it should work independent
of the device or vendor.  It only requires support in the code
implementing the device, which is already necessarily device specific.


> >>> Clearly we're also trying to account for proprietary devices where even
> >>> for suspend/resume support, proprietary drivers may be required for
> >>> manipulating that internal state.  When we move device emulation
> >>> outside of QEMU, whether in kernel or to other userspace processes,
> >>> does it still make sense to require code in QEMU to support
> >>> interpretation of that device for migration purposes?  
> >>
> >> Well, we could extend Qemu to support property module (or have we
> >> supported that now?). And then it can talk to property drivers via
> >> either VFIO or vendor specific uAPI.  
> > Yikes, I thought out-of-process devices was exactly the compromise
> > being developed to avoid QEMU supporting proprietary modules and ad-hoc
> > vendor specific uAPIs.  
> 
> 
> We can't even prevent this in kernel, so I don't see how possible we can 
> make it for Qemu.


The kernel is a different beast, it already supports loadable modules
and due to whatever pressures or market demands of the past, it allows
non-GPL use of symbols necessary for some of those modules.  QEMU has
no module support outside of non-mainline forks.  Clearly there is
pressure to support sub-process and proprietary device emulation and
it's our choice how we enable that.  This vfio over socket approach is
the mechanism we're trying to enable to avoid proprietary modules in
QEMU proper.


> > I think you're actually questioning even the
> > premise of developing a standardized API for out-of-process devices
> > here.  Thanks,  
> 
> 
> Actually not, it's just question in my mind when looking at VFIO 
> migration compatibility patches, since vfio-user is being proposed, it's 
> a good time to revisit them.


A migration compatibility interface has not been determined for vfio.
We currently rely on the vendor drivers to provide their own internal
validation and harmlessly reject migration from an incompatible device.
It would be great if we could make progress on this, but it's a
difficult problem, and one that I hope we can further address once we
have a base level of migration support.

It's great to revisit ideas, but proclaiming a uAPI is bad solely
because the data transfer is opaque, without defining why that's bad,
evaluating the feasibility and implementation of defining a well
specified data format rather than protocol, including cross-vendor
support, or proposing any sort of alternative is not so helpful imo.

Note that we also migrate guest memory as opaque data; we don't require
knowing the data structures it holds or how regions are used, we simply
look for changes and transfer the new data.  That's not so different
from a vendor driver passing us a blob of data as "information it needs
to replicate the device state at the target."  Thanks,

Alex



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-30  3:04             ` Alex Williamson
@ 2020-10-30  6:21               ` Stefan Hajnoczi
  2020-10-30  9:45                 ` Jason Wang
  2020-10-30  7:51               ` Michael S. Tsirkin
  2020-10-30  9:31               ` Jason Wang
  2 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-10-30  6:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	Jason Wang, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Eugenio Pérez, Anup Patel,
	Claudio Imbrenda, Christian Borntraeger, Roman Kagan,
	Felipe Franciosi, Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Thanos Makatos, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Paolo Bonzini, fam

On Fri, Oct 30, 2020 at 3:04 AM Alex Williamson
<alex.williamson@redhat.com> wrote:
> It's great to revisit ideas, but proclaiming a uAPI is bad solely
> because the data transfer is opaque, without defining why that's bad,
> evaluating the feasibility and implementation of defining a well
> specified data format rather than protocol, including cross-vendor
> support, or proposing any sort of alternative is not so helpful imo.

The migration approaches in VFIO and vDPA/vhost were designed for
different requirements and I think this is why there are different
perspectives on this. Here is a comparison and how VFIO could be
extended in the future. I see 3 levels of device state compatibility:

1. The device cannot save/load state blobs, instead userspace fetches
and restores specific values of the device's runtime state (e.g. last
processed ring index). This is the vhost approach.

2. The device can save/load state in a standard format. This is
similar to #1 except that there is a single read/write blob interface
instead of fine-grained get_FOO()/set_FOO() interfaces. This approach
pushes the migration state parsing into the device so that userspace
doesn't need knowledge of every device type. With this approach it is
possible for a device from vendor A to migrate to a device from vendor
B, as long as they both implement the same standard migration format.
The limitation of this approach is that vendor-specific state cannot
be transferred.

3. The device can save/load opaque blobs. This is the initial VFIO
approach. A device from vendor A cannot migrate to a device from
vendor B because the format is incompatible. This approach works well
when devices have unique guest-visible hardware interfaces so the
guest wouldn't be able to handle migrating a device from vendor A to a
device from vendor B anyway.

I think we will see more NVMe and VIRTIO hardware VFIO devices in the
future. Those are standard guest-visible hardware interfaces. It makes
sense to define standard migration formats so it's possible to migrate
a device from vendor A to a device from vendor B.

This can be achieved as follows:
1. The VFIO migration blob starts with a unique format identifier such
as a UUID. This way the destination device can identify standard
device state formats and parse them.
2. The VFIO device state ioctl is extended so userspace can enumerate
and select device state formats. This way it's possible to check
available formats on the source and destination devices before
migration and to configure the source device to produce device state
in a common format.

To me it seems #3 makes sense as an initial approach for VFIO since
guest-visible hardware interfaces are often not compatible between PCI
devices. #2 can be added in the future, especially when VFIO drivers
from different vendors become available that present the same
guest-visible hardware interface (NVMe, VIRTIO, etc).

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-30  3:04             ` Alex Williamson
  2020-10-30  6:21               ` Stefan Hajnoczi
@ 2020-10-30  7:51               ` Michael S. Tsirkin
  2020-10-30  9:31               ` Jason Wang
  2 siblings, 0 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2020-10-30  7:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Elena Ufimtseva, Janosch Frank, John G Johnson, Stefan Hajnoczi,
	Jason Wang, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Thanos Makatos, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Paolo Bonzini, fam

> A migration compatibility interface has not been determined for vfio.
> We currently rely on the vendor drivers to provide their own internal
> validation and harmlessly reject migration from an incompatible device.
> It would be great if we could make progress on this, but it's a
> difficult problem, and one that I hope we can further address once we
> have a base level of migration support.
> 
> It's great to revisit ideas, but proclaiming a uAPI is bad solely
> because the data transfer is opaque, without defining why that's bad,


That makes sense.

I feel what is missing from all of these discussions is comparison
with an existing Out-of-Process solution - namely vhost-user.
As a result I feel the proposals tend to forget some of the
lessons learned designing that interface.

In particular I personally see cross-version and cross vendor
migration as a litmus test: it is a hard problem, one that
1. I do not believe vendors will be motivated enough to solve by themselves
2. I don't believe QEMU will be able to add after the fact
for the reason that "supporting QEMU" will come to not imply any level
of compatibility whatsoever.

That was a hard learned lesson and that's the reason I (and maybe Jason,
too) keep harping on that, not that it's so burningly important by
itself.


I think at this point we have an opportunity to make people document
their interfaces up to a point and also actually somewhat standardize
them, using upstream inclusion as a carrot. Some big vendors will
probably ignore it, small ones hopefully won't. After X years margins
become thin, vendors lose interest, and we are at that point glad we
have standards and documentation.


> evaluating the feasibility and implementation of defining a well
> specified data format rather than protocol, including cross-vendor
> support, or proposing any sort of alternative is not so helpful imo.



For example, with a registry of supported device/vendor/subsystem tuples
and a list of compatibility features and a documented migration data format for
each, maintained in QEMU, with a handshake validating that would create
a kind of a registry documenting what is compatible with what.

That could then serve for debugging, validation, and also
help push people towards more standard interfaces.

That is just one idea.

> Note that we also migrate guest memory as opaque data; we don't require
> knowing the data structures it holds or how regions are used, we simply
> look for changes and transfer the new data.  That's not so different
> from a vendor driver passing us a blob of data as "information it needs
> to replicate the device state at the target."


I don't really understand this argument. At the device level we know
exactly how is each region used: some are IO, some are RAM.
In fact one can migrate between systems released years apart.

-- 
MST



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-30  3:04             ` Alex Williamson
  2020-10-30  6:21               ` Stefan Hajnoczi
  2020-10-30  7:51               ` Michael S. Tsirkin
@ 2020-10-30  9:31               ` Jason Wang
  2 siblings, 0 replies; 40+ messages in thread
From: Jason Wang @ 2020-10-30  9:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	Stefan Hajnoczi, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Paolo Bonzini, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Thanos Makatos, fam


On 2020/10/30 上午11:04, Alex Williamson wrote:
> On Fri, 30 Oct 2020 09:11:23 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2020/10/29 下午11:46, Alex Williamson wrote:
>>> On Thu, 29 Oct 2020 23:09:33 +0800
>>> Jason Wang <jasowang@redhat.com> wrote:
>>>   
>>>> On 2020/10/29 下午10:31, Alex Williamson wrote:
>>>>> On Thu, 29 Oct 2020 21:02:05 +0800
>>>>> Jason Wang <jasowang@redhat.com> wrote:
>>>>>      
>>>>>> On 2020/10/29 下午8:08, Stefan Hajnoczi wrote:
>>>>>>> Here are notes from the session:
>>>>>>>
>>>>>>> protocol stability:
>>>>>>>         * vhost-user already exists for existing third-party applications
>>>>>>>         * vfio-user is more general but will take more time to develop
>>>>>>>         * libvfio-user can be provided to allow device implementations
>>>>>>>
>>>>>>> management:
>>>>>>>         * Should QEMU launch device emulation processes?
>>>>>>>             * Nicer user experience
>>>>>>>             * Technical blockers: forking, hotplug, security is hard once
>>>>>>> QEMU has started running
>>>>>>>             * Probably requires a new process model with a long-running
>>>>>>> QEMU management process proxying QMP requests to the emulator process
>>>>>>>
>>>>>>> migration:
>>>>>>>         * dbus-vmstate
>>>>>>>         * VFIO live migration ioctls
>>>>>>>             * Source device can continue if migration fails
>>>>>>>             * Opaque blobs are transferred to destination, destination can
>>>>>>> fail migration if it decides the blobs are incompatible
>>>>>> I'm not sure this can work:
>>>>>>
>>>>>> 1) Reading something that is opaque to userspace is probably a hint of
>>>>>> bad uAPI design
>>>>>> 2) Did qemu even try to migrate opaque blobs before? It's probably a bad
>>>>>> design of migration protocol as well.
>>>>>>
>>>>>> It looks to me have a migration driver in qemu that can clearly define
>>>>>> each byte in the migration stream is a better approach.
>>>>> Any time during the previous two years of development might have been a
>>>>> more appropriate time to express your doubts.
>>>> Somehow I did that in this series[1]. But the main issue is still there.
>>> That series is related to a migration compatibility interface, not the
>>> migration data itself.
>>
>> They are not independent. The compatibility interface design depends on
>> the migration data design. I ask the uAPI issue in that thread but
>> without any response.
>>
>>
>>>   
>>>> Is this legal to have a uAPI that turns out to be opaque to userspace?
>>>> (VFIO seems to be the first). If it's not,  the only choice is to do
>>>> that in Qemu.
>>> So you're suggesting that any time the kernel is passing through opaque
>>> data that gets interpreted by some entity elsewhere, potentially with
>>> proprietary code, that we're in legal jeopardy?  VFIO is certainly not
>>> the first to do that (storage and network devices come to mind).
>>> Devices are essentially opaque data themselves, vfio provides access to
>>> (ex.) BARs, but the interpretation of what resides in that BAR is device
>>> specific.  Sometimes it's defined in a public datasheet, sometimes not.
>>> Suggesting that we can't move opaque data through a uAPI seems rather
>>> absurd.
>>
>> No, I think we are talking about different things. What I meant is the
>> data carried via uAPI should not opaque userspace. What you said here is
>> a good example for this actually. When you expose BAR to userspace,
>> there should be driver that knows the semantics of BAR running in the
>> userspace, so it's not opaque to userspace.
>
> But the thing running in userspace might be QEMU, which doesn't know
> the semantics of the BAR, it might not be until a driver in the guest
> that we have something that understands the BAR semantics beyond opaque
> data.  We might have nested guests, so it could be passed through
> multiple userspaces as opaque data.  The requirement make no sense.


I don't see the difference. From kernel perspective they are all 
userspace drivers regardless whether it's a guest or not. No matter how 
many levels in the middle, there will always be a final endpoint that 
know clearly about the semantics of the BAR. The intermediate level just 
transports the uAPI to upper levels.


>
>
>>>>> Note that we're not talking about vDPA devices here, we're talking
>>>>> about arbitrary devices with arbitrary state.  Some degree of migration
>>>>> support for assigned devices can be implemented in QEMU, Alex Graf
>>>>> proved this several years ago with i40evf.  Years later, we don't have
>>>>> any vendors proposing device specific migration code for QEMU.
>>>> Yes but it's not necessarily VFIO as well.
>>> I don't know what this means.
>>
>> I meant we can't not assume VFIO is the only uAPI that will be used by Qemu.
>   
> And we don't, DPDK, SPDK, various other userspaces exist.  All can take
> advantage of the migration uAPI that we've developed rather than
> implementing device specific code in their projects.


Obviously, for device that has higher level of abstraction like virtio, 
using a bus level device model for migration is a burden.


>   I'm not sure how
> this is strengthening your argument for device specific migration code
> in QEMU, which would need to be replicated in every other userspace.


Any reason for such replication? Except for the devices that have well 
known interface like virtio, each device should have unique 
attributes/behaviors that needs to be dealt with during live migration.


>   As
> opaque data with a well defined protocol, each userspace can implement
> support for this migration protocol once and it should work independent
> of the device or vendor.  It only requires support in the code
> implementing the device, which is already necessarily device specific.
>
>
>>>>> Clearly we're also trying to account for proprietary devices where even
>>>>> for suspend/resume support, proprietary drivers may be required for
>>>>> manipulating that internal state.  When we move device emulation
>>>>> outside of QEMU, whether in kernel or to other userspace processes,
>>>>> does it still make sense to require code in QEMU to support
>>>>> interpretation of that device for migration purposes?
>>>> Well, we could extend Qemu to support property module (or have we
>>>> supported that now?). And then it can talk to property drivers via
>>>> either VFIO or vendor specific uAPI.
>>> Yikes, I thought out-of-process devices was exactly the compromise
>>> being developed to avoid QEMU supporting proprietary modules and ad-hoc
>>> vendor specific uAPIs.
>>
>> We can't even prevent this in kernel, so I don't see how possible we can
>> make it for Qemu.
>
> The kernel is a different beast, it already supports loadable modules
> and due to whatever pressures or market demands of the past, it allows
> non-GPL use of symbols necessary for some of those modules.


So this just answer my question. It's not hard to forecast Qemu may end 
up with similar pressure in the future. The request is simple, connect a 
guest with a vendor specific proprietary uAPI.


>    QEMU has
> no module support outside of non-mainline forks.  Clearly there is
> pressure to support sub-process and proprietary device emulation and
> it's our choice how we enable that.  This vfio over socket approach is
> the mechanism we're trying to enable to avoid proprietary modules in
> QEMU proper.


VFIO user is not the first, vhost-user can do this already. I would 
rather consider VFIO-user to cover the case that vhost-user can't cover. 
And if possible, we should encourage to use vhost-user.


>
>
>>> I think you're actually questioning even the
>>> premise of developing a standardized API for out-of-process devices
>>> here.  Thanks,
>>
>> Actually not, it's just question in my mind when looking at VFIO
>> migration compatibility patches, since vfio-user is being proposed, it's
>> a good time to revisit them.
>
> A migration compatibility interface has not been determined for vfio.
> We currently rely on the vendor drivers to provide their own internal
> validation and harmlessly reject migration from an incompatible device.


So it looks like vendor needs to implement their own migration protocol 
instead of the well defined ones in qemu? I think migration guys may 
share more experiences of how challenge it would be.


> It would be great if we could make progress on this, but it's a
> difficult problem, and one that I hope we can further address once we
> have a base level of migration support.


One thing that is missed in this summary is the way to detect migration 
compatibility. We can't not simply depend on migration failure I guess.


>
> It's great to revisit ideas, but proclaiming a uAPI is bad solely
> because the data transfer is opaque, without defining why that's bad,


Well, it should be sufficient if the opaque uAPI itself is against the 
Linux uAPI/ABI design principles. The side effect is obvious, 
maintainability, debug-ability and compatibility.


> evaluating the feasibility and implementation of defining a well
> specified data format rather than protocol, including cross-vendor
> support, or proposing any sort of alternative is not so helpful imo.


I don't get here, why proposing alternative is not helpful consider 
we're in the early stage?


>
> Note that we also migrate guest memory as opaque data; we don't require
> knowing the data structures it holds or how regions are used, we simply
> look for changes and transfer the new data.  That's not so different
> from a vendor driver passing us a blob of data as "information it needs
> to replicate the device state at the target."  Thanks,


That's complete different, for guest memory, it's:

1) not read from any uAPI
2) not opaque for guest itself

But what qemu expect to read from VFIO uAPI is completely opaque to any 
of the upper layer.

Thanks


>
> Alex
>
>



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-30  6:21               ` Stefan Hajnoczi
@ 2020-10-30  9:45                 ` Jason Wang
  2020-10-30 11:13                   ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2020-10-30  9:45 UTC (permalink / raw)
  To: Stefan Hajnoczi, Alex Williamson
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Eugenio Pérez, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Liran Alon, Stefan Hajnoczi,
	Thanos Makatos, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Paolo Bonzini, fam


On 2020/10/30 下午2:21, Stefan Hajnoczi wrote:
> On Fri, Oct 30, 2020 at 3:04 AM Alex Williamson
> <alex.williamson@redhat.com> wrote:
>> It's great to revisit ideas, but proclaiming a uAPI is bad solely
>> because the data transfer is opaque, without defining why that's bad,
>> evaluating the feasibility and implementation of defining a well
>> specified data format rather than protocol, including cross-vendor
>> support, or proposing any sort of alternative is not so helpful imo.
> The migration approaches in VFIO and vDPA/vhost were designed for
> different requirements and I think this is why there are different
> perspectives on this. Here is a comparison and how VFIO could be
> extended in the future. I see 3 levels of device state compatibility:
>
> 1. The device cannot save/load state blobs, instead userspace fetches
> and restores specific values of the device's runtime state (e.g. last
> processed ring index). This is the vhost approach.
>
> 2. The device can save/load state in a standard format. This is
> similar to #1 except that there is a single read/write blob interface
> instead of fine-grained get_FOO()/set_FOO() interfaces. This approach
> pushes the migration state parsing into the device so that userspace
> doesn't need knowledge of every device type. With this approach it is
> possible for a device from vendor A to migrate to a device from vendor
> B, as long as they both implement the same standard migration format.
> The limitation of this approach is that vendor-specific state cannot
> be transferred.
>
> 3. The device can save/load opaque blobs. This is the initial VFIO
> approach.


I still don't get why it must be opaque.


>   A device from vendor A cannot migrate to a device from
> vendor B because the format is incompatible. This approach works well
> when devices have unique guest-visible hardware interfaces so the
> guest wouldn't be able to handle migrating a device from vendor A to a
> device from vendor B anyway.


For VFIO I guess cross vendor live migration can't succeed unless we do 
some cheats in device/vendor id.


>
> I think we will see more NVMe and VIRTIO hardware VFIO devices in the
> future. Those are standard guest-visible hardware interfaces. It makes
> sense to define standard migration formats so it's possible to migrate
> a device from vendor A to a device from vendor B.


Yes.


>
> This can be achieved as follows:
> 1. The VFIO migration blob starts with a unique format identifier such
> as a UUID. This way the destination device can identify standard
> device state formats and parse them.
> 2. The VFIO device state ioctl is extended so userspace can enumerate
> and select device state formats. This way it's possible to check
> available formats on the source and destination devices before
> migration and to configure the source device to produce device state
> in a common format.
>
> To me it seems #3 makes sense as an initial approach for VFIO since
> guest-visible hardware interfaces are often not compatible between PCI
> devices. #2 can be added in the future, especially when VFIO drivers
> from different vendors become available that present the same
> guest-visible hardware interface (NVMe, VIRTIO, etc).


For at least virtio, they will still go with virtio/vDPA. The advantages 
are:

1) virtio/vDPA can serve kernel subsystems which VFIO can't, this is 
very important for containers
2) virtio/vDPA is bus independent, we can present a virtio-mmio device 
which is based on vDPA PCI hardware for e.g microvm

I'm not familiar with NVME but they should go with the same way instead 
of depending on VFIO.

Thanks


>
> Stefan
>



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-30  9:45                 ` Jason Wang
@ 2020-10-30 11:13                   ` Stefan Hajnoczi
  2020-10-30 12:07                     ` Jason Wang
                                       ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-10-30 11:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Eugenio Pérez, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Stefan Hajnoczi, Thanos Makatos, Alex Bennée, David Gibson,
	Kevin Wolf, Halil Pasic, Daniel P. Berrange,
	Christophe de Dinechin, Paolo Bonzini, fam

On Fri, Oct 30, 2020 at 9:46 AM Jason Wang <jasowang@redhat.com> wrote:
> On 2020/10/30 下午2:21, Stefan Hajnoczi wrote:
> > On Fri, Oct 30, 2020 at 3:04 AM Alex Williamson
> > <alex.williamson@redhat.com> wrote:
> >> It's great to revisit ideas, but proclaiming a uAPI is bad solely
> >> because the data transfer is opaque, without defining why that's bad,
> >> evaluating the feasibility and implementation of defining a well
> >> specified data format rather than protocol, including cross-vendor
> >> support, or proposing any sort of alternative is not so helpful imo.
> > The migration approaches in VFIO and vDPA/vhost were designed for
> > different requirements and I think this is why there are different
> > perspectives on this. Here is a comparison and how VFIO could be
> > extended in the future. I see 3 levels of device state compatibility:
> >
> > 1. The device cannot save/load state blobs, instead userspace fetches
> > and restores specific values of the device's runtime state (e.g. last
> > processed ring index). This is the vhost approach.
> >
> > 2. The device can save/load state in a standard format. This is
> > similar to #1 except that there is a single read/write blob interface
> > instead of fine-grained get_FOO()/set_FOO() interfaces. This approach
> > pushes the migration state parsing into the device so that userspace
> > doesn't need knowledge of every device type. With this approach it is
> > possible for a device from vendor A to migrate to a device from vendor
> > B, as long as they both implement the same standard migration format.
> > The limitation of this approach is that vendor-specific state cannot
> > be transferred.
> >
> > 3. The device can save/load opaque blobs. This is the initial VFIO
> > approach.
>
>
> I still don't get why it must be opaque.

If the device state format needs to be in the VMM then each device
needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).

Let's invert the question: why does the VMM need to understand the
device state of a _passthrough_ device?

> >   A device from vendor A cannot migrate to a device from
> > vendor B because the format is incompatible. This approach works well
> > when devices have unique guest-visible hardware interfaces so the
> > guest wouldn't be able to handle migrating a device from vendor A to a
> > device from vendor B anyway.
>
>
> For VFIO I guess cross vendor live migration can't succeed unless we do
> some cheats in device/vendor id.

Yes. I haven't looked into the details of PCI (Sub-)Device/Vendor IDs
and how to best enable migration but I hope that can be solved. The
simplest approach is to override the IDs and make them part of the
guest configuration.

> For at least virtio, they will still go with virtio/vDPA. The advantages
> are:
>
> 1) virtio/vDPA can serve kernel subsystems which VFIO can't, this is
> very important for containers

I'm not sure I understand this. If the kernel wants to use the device
then it doesn't use VFIO, it runs the kernel driver instead.

One part I believe is missing from VFIO/mdev is attaching an mdev
device to the kernel. That seems to be an example of the limitation
you mentioned.

> 2) virtio/vDPA is bus independent, we can present a virtio-mmio device
> which is based on vDPA PCI hardware for e.g microvm

Yes. This is neat although microvm supports PCI now
(https://www.kraxel.org/blog/2020/10/qemu-microvm-acpi/).

> I'm not familiar with NVME but they should go with the same way instead
> of depending on VFIO.

There are pros/cons with both approaches. I'm not even sure all VIRTIO
hardware vendors will use vDPA. Two examples:
1. A tiny VMM with strict security requirements. The VFIO approach is
less complex because the VMM is much less involved with the device.
2. A vendor shipping a hardware VIRTIO PCI device as a PF - no SR-IOV,
no software VFs, just a single instance. A passthrough PCI device is a
much simpler way to deliver this device than vDPA + vhost + VMM
support.

vDPA is very useful but there are situations when the VFIO approach is
attractive too.

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-30 11:13                   ` Stefan Hajnoczi
@ 2020-10-30 12:07                     ` Jason Wang
  2020-10-30 13:15                       ` Stefan Hajnoczi
  2020-10-31 21:49                     ` Michael S. Tsirkin
  2020-11-02  3:00                     ` Jason Wang
  2 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2020-10-30 12:07 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Eugenio Pérez, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Stefan Hajnoczi, Thanos Makatos, Alex Bennée, David Gibson,
	Kevin Wolf, Halil Pasic, Daniel P. Berrange,
	Christophe de Dinechin, Paolo Bonzini, fam


On 2020/10/30 下午7:13, Stefan Hajnoczi wrote:
> On Fri, Oct 30, 2020 at 9:46 AM Jason Wang <jasowang@redhat.com> wrote:
>> On 2020/10/30 下午2:21, Stefan Hajnoczi wrote:
>>> On Fri, Oct 30, 2020 at 3:04 AM Alex Williamson
>>> <alex.williamson@redhat.com> wrote:
>>>> It's great to revisit ideas, but proclaiming a uAPI is bad solely
>>>> because the data transfer is opaque, without defining why that's bad,
>>>> evaluating the feasibility and implementation of defining a well
>>>> specified data format rather than protocol, including cross-vendor
>>>> support, or proposing any sort of alternative is not so helpful imo.
>>> The migration approaches in VFIO and vDPA/vhost were designed for
>>> different requirements and I think this is why there are different
>>> perspectives on this. Here is a comparison and how VFIO could be
>>> extended in the future. I see 3 levels of device state compatibility:
>>>
>>> 1. The device cannot save/load state blobs, instead userspace fetches
>>> and restores specific values of the device's runtime state (e.g. last
>>> processed ring index). This is the vhost approach.
>>>
>>> 2. The device can save/load state in a standard format. This is
>>> similar to #1 except that there is a single read/write blob interface
>>> instead of fine-grained get_FOO()/set_FOO() interfaces. This approach
>>> pushes the migration state parsing into the device so that userspace
>>> doesn't need knowledge of every device type. With this approach it is
>>> possible for a device from vendor A to migrate to a device from vendor
>>> B, as long as they both implement the same standard migration format.
>>> The limitation of this approach is that vendor-specific state cannot
>>> be transferred.
>>>
>>> 3. The device can save/load opaque blobs. This is the initial VFIO
>>> approach.
>>
>> I still don't get why it must be opaque.
> If the device state format needs to be in the VMM then each device
> needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
>
> Let's invert the question: why does the VMM need to understand the
> device state of a _passthrough_ device?


For better manageability, compatibility and debug-ability. If we depends 
on a opaque structure, do we encourage device to implement its own 
migration protocol? It would be very challenge.

For VFIO in the kernel, I suspect a uAPI that may result a opaque data 
to be read or wrote from guest violates the Linux uAPI principle. It 
will be very hard to maintain uABI or even impossible. It looks to me 
VFIO is the first subsystem that is trying to do this.


>
>>>    A device from vendor A cannot migrate to a device from
>>> vendor B because the format is incompatible. This approach works well
>>> when devices have unique guest-visible hardware interfaces so the
>>> guest wouldn't be able to handle migrating a device from vendor A to a
>>> device from vendor B anyway.
>>
>> For VFIO I guess cross vendor live migration can't succeed unless we do
>> some cheats in device/vendor id.
> Yes. I haven't looked into the details of PCI (Sub-)Device/Vendor IDs
> and how to best enable migration but I hope that can be solved. The
> simplest approach is to override the IDs and make them part of the
> guest configuration.


That would be very tricky (or requires whitelist). E.g the opaque of the 
src may match the opaque of the dst by chance.


>
>> For at least virtio, they will still go with virtio/vDPA. The advantages
>> are:
>>
>> 1) virtio/vDPA can serve kernel subsystems which VFIO can't, this is
>> very important for containers
> I'm not sure I understand this. If the kernel wants to use the device
> then it doesn't use VFIO, it runs the kernel driver instead.


Current spec is not suitable for all type of device. We've received many 
feedbacks that virtio(pci) might not work very well. Another point is 
that there could be vendor that don't want go with virtio control path. 
Mellanox mlx5 vdpa driver is one example. Yes, they can use mlx5_en, but 
there are vendors that want to build a vendor specific control path from 
scratch.


>
> One part I believe is missing from VFIO/mdev is attaching an mdev
> device to the kernel. That seems to be an example of the limitation
> you mentioned.


Yes, exactly.


>
>> 2) virtio/vDPA is bus independent, we can present a virtio-mmio device
>> which is based on vDPA PCI hardware for e.g microvm
> Yes. This is neat although microvm supports PCI now
> (https://www.kraxel.org/blog/2020/10/qemu-microvm-acpi/).
>
>> I'm not familiar with NVME but they should go with the same way instead
>> of depending on VFIO.
> There are pros/cons with both approaches. I'm not even sure all VIRTIO
> hardware vendors will use vDPA. Two examples:
> 1. A tiny VMM with strict security requirements. The VFIO approach is
> less complex because the VMM is much less involved with the device.


I suspect VFIO could be more secure. It exposes a lot of hardware 
details while vDPA is trying to hide.


> 2. A vendor shipping a hardware VIRTIO PCI device as a PF - no SR-IOV,
> no software VFs, just a single instance. A passthrough PCI device is a
> much simpler way to deliver this device than vDPA + vhost + VMM
> support.


It could be simple but note that there's no live migration support in 
the spec. So it can't be live migrated. We could extend the spec for 
sure, but there're vendor that has already implemented the virtio plus 
their vendor specific extensions for live migration.


>
> vDPA is very useful but there are situations when the VFIO approach is
> attractive too.


Note that it's probably better to differ virtio from vDPA. For virtio 
control path compatible device, we should keep it work in both 
subsystems. For the rest vDPA devices (control path is not virtio), 
exposing them via VFIO doesn't help much or even impossible (e.g the 
abstraction requires the communication with PF).

Thanks


>
> Stefan
>



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-30 12:07                     ` Jason Wang
@ 2020-10-30 13:15                       ` Stefan Hajnoczi
  2020-11-02  2:51                         ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-10-30 13:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Eugenio Pérez, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Stefan Hajnoczi, Thanos Makatos, Alex Bennée, David Gibson,
	Kevin Wolf, Halil Pasic, Daniel P. Berrange,
	Christophe de Dinechin, Paolo Bonzini, fam

On Fri, Oct 30, 2020 at 12:08 PM Jason Wang <jasowang@redhat.com> wrote:
> On 2020/10/30 下午7:13, Stefan Hajnoczi wrote:
> > On Fri, Oct 30, 2020 at 9:46 AM Jason Wang <jasowang@redhat.com> wrote:
> >> On 2020/10/30 下午2:21, Stefan Hajnoczi wrote:
> >>> On Fri, Oct 30, 2020 at 3:04 AM Alex Williamson
> >>> <alex.williamson@redhat.com> wrote:
> >>>> It's great to revisit ideas, but proclaiming a uAPI is bad solely
> >>>> because the data transfer is opaque, without defining why that's bad,
> >>>> evaluating the feasibility and implementation of defining a well
> >>>> specified data format rather than protocol, including cross-vendor
> >>>> support, or proposing any sort of alternative is not so helpful imo.
> >>> The migration approaches in VFIO and vDPA/vhost were designed for
> >>> different requirements and I think this is why there are different
> >>> perspectives on this. Here is a comparison and how VFIO could be
> >>> extended in the future. I see 3 levels of device state compatibility:
> >>>
> >>> 1. The device cannot save/load state blobs, instead userspace fetches
> >>> and restores specific values of the device's runtime state (e.g. last
> >>> processed ring index). This is the vhost approach.
> >>>
> >>> 2. The device can save/load state in a standard format. This is
> >>> similar to #1 except that there is a single read/write blob interface
> >>> instead of fine-grained get_FOO()/set_FOO() interfaces. This approach
> >>> pushes the migration state parsing into the device so that userspace
> >>> doesn't need knowledge of every device type. With this approach it is
> >>> possible for a device from vendor A to migrate to a device from vendor
> >>> B, as long as they both implement the same standard migration format.
> >>> The limitation of this approach is that vendor-specific state cannot
> >>> be transferred.
> >>>
> >>> 3. The device can save/load opaque blobs. This is the initial VFIO
> >>> approach.
> >>
> >> I still don't get why it must be opaque.
> > If the device state format needs to be in the VMM then each device
> > needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
> >
> > Let's invert the question: why does the VMM need to understand the
> > device state of a _passthrough_ device?
>
>
> For better manageability, compatibility and debug-ability. If we depends
> on a opaque structure, do we encourage device to implement its own
> migration protocol? It would be very challenge.
>
> For VFIO in the kernel, I suspect a uAPI that may result a opaque data
> to be read or wrote from guest violates the Linux uAPI principle. It
> will be very hard to maintain uABI or even impossible. It looks to me
> VFIO is the first subsystem that is trying to do this.

I think our concepts of uAPI are different. The uAPI of read(2) and
write(2) does not define the structure of the data buffers. VFIO
device regions are exactly the same, the structure of the data is not
defined by the kernel uAPI.

Maybe microcode and firmware loading is an example we agree on?

> >>>    A device from vendor A cannot migrate to a device from
> >>> vendor B because the format is incompatible. This approach works well
> >>> when devices have unique guest-visible hardware interfaces so the
> >>> guest wouldn't be able to handle migrating a device from vendor A to a
> >>> device from vendor B anyway.
> >>
> >> For VFIO I guess cross vendor live migration can't succeed unless we do
> >> some cheats in device/vendor id.
> > Yes. I haven't looked into the details of PCI (Sub-)Device/Vendor IDs
> > and how to best enable migration but I hope that can be solved. The
> > simplest approach is to override the IDs and make them part of the
> > guest configuration.
>
>
> That would be very tricky (or requires whitelist). E.g the opaque of the
> src may match the opaque of the dst by chance.

Luckily identifying things based on magic constants has been solved
many times in the past.

A central identifier registry prevents all collisions but is a pain to
manage. Or use a 128-bit UUID and self-allocate the identifier with an
extremely low chance of collision:
https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions

> >> For at least virtio, they will still go with virtio/vDPA. The advantages
> >> are:
> >>
> >> 1) virtio/vDPA can serve kernel subsystems which VFIO can't, this is
> >> very important for containers
> > I'm not sure I understand this. If the kernel wants to use the device
> > then it doesn't use VFIO, it runs the kernel driver instead.
>
>
> Current spec is not suitable for all type of device. We've received many
> feedbacks that virtio(pci) might not work very well. Another point is
> that there could be vendor that don't want go with virtio control path.
> Mellanox mlx5 vdpa driver is one example. Yes, they can use mlx5_en, but
> there are vendors that want to build a vendor specific control path from
> scratch.

Okay, I think I understand you mean now. This is the reason why vDPA exists.

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-30 11:13                   ` Stefan Hajnoczi
  2020-10-30 12:07                     ` Jason Wang
@ 2020-10-31 21:49                     ` Michael S. Tsirkin
  2020-11-01  8:26                       ` Paolo Bonzini
  2020-11-02  3:00                     ` Jason Wang
  2 siblings, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2020-10-31 21:49 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Janosch Frank, John G Johnson, Jason Wang,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Eugenio Pérez, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Stefan Hajnoczi, Thanos Makatos, Alex Bennée, David Gibson,
	Kevin Wolf, Halil Pasic, Daniel P. Berrange,
	Christophe de Dinechin, Paolo Bonzini, fam

On Fri, Oct 30, 2020 at 11:13:59AM +0000, Stefan Hajnoczi wrote:
> > > 3. The device can save/load opaque blobs. This is the initial VFIO
> > > approach.
> >
> >
> > I still don't get why it must be opaque.
> 
> If the device state format needs to be in the VMM then each device
> needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).

And QEMU cares why exactly?

> Let's invert the question: why does the VMM need to understand the
> device state of a _passthrough_ device?

To support cross version migration and compatibility checks.
This problem is harder than it appears, I don't think vendors
will do a good job of it without any guidance and standards.

-- 
MST



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-31 21:49                     ` Michael S. Tsirkin
@ 2020-11-01  8:26                       ` Paolo Bonzini
  2020-11-02  2:54                         ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Paolo Bonzini @ 2020-11-01  8:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Elena Ufimtseva, John G Johnson, Janosch Frank, Stefan Hajnoczi,
	Jason Wang, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Eugenio Pérez, Anup Patel,
	Claudio Imbrenda, Christian Borntraeger, Roman Kagan,
	Felipe Franciosi, Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Stefan Hajnoczi, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Thanos Makatos, fam

[-- Attachment #1: Type: text/plain, Size: 1704 bytes --]

Il sab 31 ott 2020, 22:49 Michael S. Tsirkin <mst@redhat.com> ha scritto:

> > > I still don't get why it must be opaque.
> >
> > If the device state format needs to be in the VMM then each device
> > needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
>
> And QEMU cares why exactly?
>

QEMU cares for another reason. It is more code to review, and it's worth
spending the time to reviewing it only if we can do a decent job at
reviewing it.

There are several cases in which drivers migrate non-architectural,
implementation-dependent state. There are some examples in nested
virtualization (the deadline of the VMX preemption timer) or device
emulation (the RTC has quite a few example also of how those changed
through the years). We probably don't have anyway the knowledge of the
innards of the drivers to do a decent job at reviewing patches that affect
those.

> Let's invert the question: why does the VMM need to understand the
> > device state of a _passthrough_ device?
>
> To support cross version migration and compatibility checks.
>

That doesn't have to be in the VMM. We should give guidance but that can be
in terms of documentation. Also, in QEMU we chose the path of dropping
sections on the source when migrating to older versions, but that can also
be considered a deficiency of vmstate---a self-synchronizing format
(Anthony many years ago wanted to use X509 as the migration format) would
be much better. And for some specific device types we could define standard
formats, just like PCI has standard classes.

Paolo

>
This problem is harder than it appears, I don't think vendors
> will do a good job of it without any guidance and standards.
>
> --
> MST
>
>

[-- Attachment #2: Type: text/html, Size: 2897 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-30 13:15                       ` Stefan Hajnoczi
@ 2020-11-02  2:51                         ` Jason Wang
  2020-11-02 10:13                           ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2020-11-02  2:51 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Christian Borntraeger, Anup Patel, Claudio Imbrenda,
	Eugenio Pérez, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Stefan Hajnoczi, Paolo Bonzini, Alex Bennée, David Gibson,
	Kevin Wolf, Halil Pasic, Daniel P. Berrange,
	Christophe de Dinechin, Thanos Makatos, fam


On 2020/10/30 下午9:15, Stefan Hajnoczi wrote:
> On Fri, Oct 30, 2020 at 12:08 PM Jason Wang <jasowang@redhat.com> wrote:
>> On 2020/10/30 下午7:13, Stefan Hajnoczi wrote:
>>> On Fri, Oct 30, 2020 at 9:46 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2020/10/30 下午2:21, Stefan Hajnoczi wrote:
>>>>> On Fri, Oct 30, 2020 at 3:04 AM Alex Williamson
>>>>> <alex.williamson@redhat.com> wrote:
>>>>>> It's great to revisit ideas, but proclaiming a uAPI is bad solely
>>>>>> because the data transfer is opaque, without defining why that's bad,
>>>>>> evaluating the feasibility and implementation of defining a well
>>>>>> specified data format rather than protocol, including cross-vendor
>>>>>> support, or proposing any sort of alternative is not so helpful imo.
>>>>> The migration approaches in VFIO and vDPA/vhost were designed for
>>>>> different requirements and I think this is why there are different
>>>>> perspectives on this. Here is a comparison and how VFIO could be
>>>>> extended in the future. I see 3 levels of device state compatibility:
>>>>>
>>>>> 1. The device cannot save/load state blobs, instead userspace fetches
>>>>> and restores specific values of the device's runtime state (e.g. last
>>>>> processed ring index). This is the vhost approach.
>>>>>
>>>>> 2. The device can save/load state in a standard format. This is
>>>>> similar to #1 except that there is a single read/write blob interface
>>>>> instead of fine-grained get_FOO()/set_FOO() interfaces. This approach
>>>>> pushes the migration state parsing into the device so that userspace
>>>>> doesn't need knowledge of every device type. With this approach it is
>>>>> possible for a device from vendor A to migrate to a device from vendor
>>>>> B, as long as they both implement the same standard migration format.
>>>>> The limitation of this approach is that vendor-specific state cannot
>>>>> be transferred.
>>>>>
>>>>> 3. The device can save/load opaque blobs. This is the initial VFIO
>>>>> approach.
>>>> I still don't get why it must be opaque.
>>> If the device state format needs to be in the VMM then each device
>>> needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
>>>
>>> Let's invert the question: why does the VMM need to understand the
>>> device state of a _passthrough_ device?
>>
>> For better manageability, compatibility and debug-ability. If we depends
>> on a opaque structure, do we encourage device to implement its own
>> migration protocol? It would be very challenge.
>>
>> For VFIO in the kernel, I suspect a uAPI that may result a opaque data
>> to be read or wrote from guest violates the Linux uAPI principle. It
>> will be very hard to maintain uABI or even impossible. It looks to me
>> VFIO is the first subsystem that is trying to do this.
> I think our concepts of uAPI are different. The uAPI of read(2) and
> write(2) does not define the structure of the data buffers. VFIO
> device regions are exactly the same, the structure of the data is not
> defined by the kernel uAPI.


I think we're talking about different things. It's not about the data 
structure, it's about whether to data that reads from kernel can be 
understood by userspace.


>
> Maybe microcode and firmware loading is an example we agree on?


I think not. They are bytecodes that have

1) strict ABI definitions
2) understood by userspace


>
>>>>>     A device from vendor A cannot migrate to a device from
>>>>> vendor B because the format is incompatible. This approach works well
>>>>> when devices have unique guest-visible hardware interfaces so the
>>>>> guest wouldn't be able to handle migrating a device from vendor A to a
>>>>> device from vendor B anyway.
>>>> For VFIO I guess cross vendor live migration can't succeed unless we do
>>>> some cheats in device/vendor id.
>>> Yes. I haven't looked into the details of PCI (Sub-)Device/Vendor IDs
>>> and how to best enable migration but I hope that can be solved. The
>>> simplest approach is to override the IDs and make them part of the
>>> guest configuration.
>>
>> That would be very tricky (or requires whitelist). E.g the opaque of the
>> src may match the opaque of the dst by chance.
> Luckily identifying things based on magic constants has been solved
> many times in the past.
>
> A central identifier registry prevents all collisions but is a pain to
> manage. Or use a 128-bit UUID and self-allocate the identifier with an
> extremely low chance of collision:
> https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions


I may miss something. I think we're talking about cross vendor live 
migration.

Would you want src and dest have same UUID or not?

If they have different UUIDs, how could we know we can live migrate 
between them.

If they have the same UUID, what's the rule of forcing the the vendors 
to choose same UUID (a spec)?

Thanks


>
>>>> For at least virtio, they will still go with virtio/vDPA. The advantages
>>>> are:
>>>>
>>>> 1) virtio/vDPA can serve kernel subsystems which VFIO can't, this is
>>>> very important for containers
>>> I'm not sure I understand this. If the kernel wants to use the device
>>> then it doesn't use VFIO, it runs the kernel driver instead.
>>
>> Current spec is not suitable for all type of device. We've received many
>> feedbacks that virtio(pci) might not work very well. Another point is
>> that there could be vendor that don't want go with virtio control path.
>> Mellanox mlx5 vdpa driver is one example. Yes, they can use mlx5_en, but
>> there are vendors that want to build a vendor specific control path from
>> scratch.
> Okay, I think I understand you mean now. This is the reason why vDPA exists.
>
> Stefan
>



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-11-01  8:26                       ` Paolo Bonzini
@ 2020-11-02  2:54                         ` Jason Wang
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Wang @ 2020-11-02  2:54 UTC (permalink / raw)
  To: Paolo Bonzini, Michael S. Tsirkin
  Cc: Elena Ufimtseva, John G Johnson, Janosch Frank, Stefan Hajnoczi,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Eugenio Pérez, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Stefan Hajnoczi, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Thanos Makatos, fam


On 2020/11/1 下午4:26, Paolo Bonzini wrote:
>
>
> Il sab 31 ott 2020, 22:49 Michael S. Tsirkin <mst@redhat.com 
> <mailto:mst@redhat.com>> ha scritto:
>
>     > > I still don't get why it must be opaque.
>     >
>     > If the device state format needs to be in the VMM then each device
>     > needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
>
>     And QEMU cares why exactly?
>
>
> QEMU cares for another reason. It is more code to review, and it's 
> worth spending the time to reviewing it only if we can do a decent job 
> at reviewing it.
>
> There are several cases in which drivers migrate non-architectural, 
> implementation-dependent state. There are some examples in nested 
> virtualization (the deadline of the VMX preemption timer) or device 
> emulation (the RTC has quite a few example also of how those changed 
> through the years). We probably don't have anyway the knowledge of the 
> innards of the drivers to do a decent job at reviewing patches that 
> affect those.
>
>     > Let's invert the question: why does the VMM need to understand the
>     > device state of a _passthrough_ device?
>
>     To support cross version migration and compatibility checks.
>
>
> That doesn't have to be in the VMM. We should give guidance but that 
> can be in terms of documentation.


I doubt this can work well if we don't force it via ABI.

Thanks


> Also, in QEMU we chose the path of dropping sections on the source 
> when migrating to older versions, but that can also be considered a 
> deficiency of vmstate---a self-synchronizing format (Anthony many 
> years ago wanted to use X509 as the migration format) would be much 
> better. And for some specific device types we could define standard 
> formats, just like PCI has standard classes.
>
> Paolo
>
>     This problem is harder than it appears, I don't think vendors
>     will do a good job of it without any guidance and standards.
>
>     -- 
>     MST
>



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-10-30 11:13                   ` Stefan Hajnoczi
  2020-10-30 12:07                     ` Jason Wang
  2020-10-31 21:49                     ` Michael S. Tsirkin
@ 2020-11-02  3:00                     ` Jason Wang
  2020-11-02 10:27                       ` Stefan Hajnoczi
  2 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2020-11-02  3:00 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, Janosch Frank, mst@redhat.com, John G Johnson,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Christian Borntraeger, Anup Patel, Claudio Imbrenda,
	Eugenio Pérez, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Stefan Hajnoczi, Paolo Bonzini, Alex Bennée, David Gibson,
	Kevin Wolf, Halil Pasic, Daniel P. Berrange,
	Christophe de Dinechin, Thanos Makatos, fam


On 2020/10/30 下午7:13, Stefan Hajnoczi wrote:
>> I still don't get why it must be opaque.
> If the device state format needs to be in the VMM then each device
> needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
>
> Let's invert the question: why does the VMM need to understand the
> device state of a_passthrough_  device?


It's not a 100% passthrough device if you want to support live 
migration. E.g the device state save and restore is not under the 
control of drivers in the guest.

And if I understand correctly, it usually requires device emulation or 
mediation in either userspace or kernel to support e.g dirty page 
tracking and other things.

Thanks



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-11-02  2:51                         ` Jason Wang
@ 2020-11-02 10:13                           ` Stefan Hajnoczi
  2020-11-03  7:52                             ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-02 10:13 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	Stefan Hajnoczi, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Christian Borntraeger, Anup Patel,
	Claudio Imbrenda, Eugenio Pérez, Roman Kagan,
	Felipe Franciosi, Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Paolo Bonzini, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Thanos Makatos, fam

[-- Attachment #1: Type: text/plain, Size: 6976 bytes --]

On Mon, Nov 02, 2020 at 10:51:18AM +0800, Jason Wang wrote:
> 
> On 2020/10/30 下午9:15, Stefan Hajnoczi wrote:
> > On Fri, Oct 30, 2020 at 12:08 PM Jason Wang <jasowang@redhat.com> wrote:
> > > On 2020/10/30 下午7:13, Stefan Hajnoczi wrote:
> > > > On Fri, Oct 30, 2020 at 9:46 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > On 2020/10/30 下午2:21, Stefan Hajnoczi wrote:
> > > > > > On Fri, Oct 30, 2020 at 3:04 AM Alex Williamson
> > > > > > <alex.williamson@redhat.com> wrote:
> > > > > > > It's great to revisit ideas, but proclaiming a uAPI is bad solely
> > > > > > > because the data transfer is opaque, without defining why that's bad,
> > > > > > > evaluating the feasibility and implementation of defining a well
> > > > > > > specified data format rather than protocol, including cross-vendor
> > > > > > > support, or proposing any sort of alternative is not so helpful imo.
> > > > > > The migration approaches in VFIO and vDPA/vhost were designed for
> > > > > > different requirements and I think this is why there are different
> > > > > > perspectives on this. Here is a comparison and how VFIO could be
> > > > > > extended in the future. I see 3 levels of device state compatibility:
> > > > > > 
> > > > > > 1. The device cannot save/load state blobs, instead userspace fetches
> > > > > > and restores specific values of the device's runtime state (e.g. last
> > > > > > processed ring index). This is the vhost approach.
> > > > > > 
> > > > > > 2. The device can save/load state in a standard format. This is
> > > > > > similar to #1 except that there is a single read/write blob interface
> > > > > > instead of fine-grained get_FOO()/set_FOO() interfaces. This approach
> > > > > > pushes the migration state parsing into the device so that userspace
> > > > > > doesn't need knowledge of every device type. With this approach it is
> > > > > > possible for a device from vendor A to migrate to a device from vendor
> > > > > > B, as long as they both implement the same standard migration format.
> > > > > > The limitation of this approach is that vendor-specific state cannot
> > > > > > be transferred.
> > > > > > 
> > > > > > 3. The device can save/load opaque blobs. This is the initial VFIO
> > > > > > approach.
> > > > > I still don't get why it must be opaque.
> > > > If the device state format needs to be in the VMM then each device
> > > > needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
> > > > 
> > > > Let's invert the question: why does the VMM need to understand the
> > > > device state of a _passthrough_ device?
> > > 
> > > For better manageability, compatibility and debug-ability. If we depends
> > > on a opaque structure, do we encourage device to implement its own
> > > migration protocol? It would be very challenge.
> > > 
> > > For VFIO in the kernel, I suspect a uAPI that may result a opaque data
> > > to be read or wrote from guest violates the Linux uAPI principle. It
> > > will be very hard to maintain uABI or even impossible. It looks to me
> > > VFIO is the first subsystem that is trying to do this.
> > I think our concepts of uAPI are different. The uAPI of read(2) and
> > write(2) does not define the structure of the data buffers. VFIO
> > device regions are exactly the same, the structure of the data is not
> > defined by the kernel uAPI.
> 
> 
> I think we're talking about different things. It's not about the data
> structure, it's about whether to data that reads from kernel can be
> understood by userspace.
> 
> 
> > 
> > Maybe microcode and firmware loading is an example we agree on?
> 
> 
> I think not. They are bytecodes that have
> 
> 1) strict ABI definitions
> 2) understood by userspace

No, they can be proprietary formats that neither the Linux kernel nor
userspace can parse. For example, look at linux-firmware
(https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/about/)
it's just a collection of binary blobs. The format is not necessarily
public. The only restriction on that repo is that the binary blob must
be redistributable and users must be allowed to run them (i.e.
proprietary licenses can be used).

Or look at other passthrough device interfaces like /dev/i2c or libusb.
They expose data to userspace without requiring a defined format. It's
the same as VFIO.

In addition, look at kernel uAPIs where userspace acts simply as a data
transport for opaque data (e.g. where a userspace helper facilitates
communication but has no visibility of the data). I imagine that memory
encryption relies on this because the host kernel and userspace do not
have access to encrypted memory or associated state - but they need to
help migrate them to other hosts.

I hope these examples show that such APIs don't pose a problem for the
Linux uAPI and are already in use. VFIO device state isn't doing
anything new here.

> > > > > >     A device from vendor A cannot migrate to a device from
> > > > > > vendor B because the format is incompatible. This approach works well
> > > > > > when devices have unique guest-visible hardware interfaces so the
> > > > > > guest wouldn't be able to handle migrating a device from vendor A to a
> > > > > > device from vendor B anyway.
> > > > > For VFIO I guess cross vendor live migration can't succeed unless we do
> > > > > some cheats in device/vendor id.
> > > > Yes. I haven't looked into the details of PCI (Sub-)Device/Vendor IDs
> > > > and how to best enable migration but I hope that can be solved. The
> > > > simplest approach is to override the IDs and make them part of the
> > > > guest configuration.
> > > 
> > > That would be very tricky (or requires whitelist). E.g the opaque of the
> > > src may match the opaque of the dst by chance.
> > Luckily identifying things based on magic constants has been solved
> > many times in the past.
> > 
> > A central identifier registry prevents all collisions but is a pain to
> > manage. Or use a 128-bit UUID and self-allocate the identifier with an
> > extremely low chance of collision:
> > https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions
> 
> 
> I may miss something. I think we're talking about cross vendor live
> migration.
> 
> Would you want src and dest have same UUID or not?
> 
> If they have different UUIDs, how could we know we can live migrate between
> them.
> 
> If they have the same UUID, what's the rule of forcing the the vendors to
> choose same UUID (a spec)?

I will send a separate email that describes how VFIO live migration can
work in more detail. I think it's possible to do it with existing ioctl
interface that Kirti has proposed and still prevent the risk of
incorrectly interpreting data that you have pointed out.

The document that I'm sending will allow us to discuss in more detail
and make the approach clearer.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-11-02  3:00                     ` Jason Wang
@ 2020-11-02 10:27                       ` Stefan Hajnoczi
  2020-11-02 10:34                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-02 10:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	Stefan Hajnoczi, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Christian Borntraeger, Anup Patel,
	Claudio Imbrenda, Eugenio Pérez, Roman Kagan,
	Felipe Franciosi, Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Paolo Bonzini, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Thanos Makatos, fam

[-- Attachment #1: Type: text/plain, Size: 2250 bytes --]

On Mon, Nov 02, 2020 at 11:00:12AM +0800, Jason Wang wrote:
> 
> On 2020/10/30 下午7:13, Stefan Hajnoczi wrote:
> > > I still don't get why it must be opaque.
> > If the device state format needs to be in the VMM then each device
> > needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
> > 
> > Let's invert the question: why does the VMM need to understand the
> > device state of a_passthrough_  device?
> 
> 
> It's not a 100% passthrough device if you want to support live migration.
> E.g the device state save and restore is not under the control of drivers in
> the guest.

VFIO devices are already not pure passthrough (even without mdev) since
the PCI bus is emulated and device-specific quirks may be implemented.
Adding device state save/load does not change anything here.

> And if I understand correctly, it usually requires device emulation or
> mediation in either userspace or kernel to support e.g dirty page tracking
> and other things.

Breaking down the options further:

1. VFIO on physical PCI devices. Here I see two approaches:

   a. An mdev vendor driver implements the migration region described in
      Kirti's patch series. Individual device state fields are
      marshalled by the driver.

   b. The VFIO PCI core parses a PCI Capability that indicates migration
      support on the physical device and filters it out. The remainder
      of the device is passed through. The device state representation
      is saved/loaded by the physical hardware when the VFIO PCI core
      receives the ioctl and notifies the hardware. QEMU and host
      kernel code does not marshall individual device state fields.

      In the future it may be desirable to also expose the PCI
      Capability so that the guest is able to snapshot and restore the
      device. That could be useful for checkpointing AI/HPC workloads on
      GPUs, for example. I don't think there is a fundamental reason why
      device state save/load needs to be handled by the host except that
      VM live migration is supposed to be transparent to guests and
      cannot rely on guest cooperation.

2. vfio-user device backends. The device backend implements the
   save/load.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-11-02 10:27                       ` Stefan Hajnoczi
@ 2020-11-02 10:34                         ` Michael S. Tsirkin
  2020-11-02 14:59                           ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2020-11-02 10:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John G Johnson, Janosch Frank, Stefan Hajnoczi,
	Jason Wang, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Christian Borntraeger, Anup Patel,
	Claudio Imbrenda, Eugenio Pérez, Roman Kagan,
	Felipe Franciosi, Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Paolo Bonzini, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Thanos Makatos, fam

On Mon, Nov 02, 2020 at 10:27:54AM +0000, Stefan Hajnoczi wrote:
> On Mon, Nov 02, 2020 at 11:00:12AM +0800, Jason Wang wrote:
> > 
> > On 2020/10/30 下午7:13, Stefan Hajnoczi wrote:
> > > > I still don't get why it must be opaque.
> > > If the device state format needs to be in the VMM then each device
> > > needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
> > > 
> > > Let's invert the question: why does the VMM need to understand the
> > > device state of a_passthrough_  device?
> > 
> > 
> > It's not a 100% passthrough device if you want to support live migration.
> > E.g the device state save and restore is not under the control of drivers in
> > the guest.
> 
> VFIO devices are already not pure passthrough (even without mdev) since
> the PCI bus is emulated and device-specific quirks may be implemented.

So since it's not a pure passthrough anyway, let's try to
introduce some standards even if we can not always enforce
them.

> Adding device state save/load does not change anything here.

It's as good a time as any to try to standardize things and
not just let each driver do whatever it wants. In particular
if you consider things like cross version support it's
a hard problem where vendors are sure to get it wrong without
guidance.

-- 
MST



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-11-02 10:34                         ` Michael S. Tsirkin
@ 2020-11-02 14:59                           ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-02 14:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Elena Ufimtseva, John G Johnson, Janosch Frank, Stefan Hajnoczi,
	Jason Wang, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Christian Borntraeger, Anup Patel,
	Claudio Imbrenda, Eugenio Pérez, Roman Kagan,
	Felipe Franciosi, Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Paolo Bonzini, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Thanos Makatos, fam

[-- Attachment #1: Type: text/plain, Size: 1382 bytes --]

On Mon, Nov 02, 2020 at 05:34:50AM -0500, Michael S. Tsirkin wrote:
> On Mon, Nov 02, 2020 at 10:27:54AM +0000, Stefan Hajnoczi wrote:
> > On Mon, Nov 02, 2020 at 11:00:12AM +0800, Jason Wang wrote:
> > > 
> > > On 2020/10/30 下午7:13, Stefan Hajnoczi wrote:
> > > > > I still don't get why it must be opaque.
> > > > If the device state format needs to be in the VMM then each device
> > > > needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
> > > > 
> > > > Let's invert the question: why does the VMM need to understand the
> > > > device state of a_passthrough_  device?
> > > 
> > > 
> > > It's not a 100% passthrough device if you want to support live migration.
> > > E.g the device state save and restore is not under the control of drivers in
> > > the guest.
> > 
> > VFIO devices are already not pure passthrough (even without mdev) since
> > the PCI bus is emulated and device-specific quirks may be implemented.
> 
> So since it's not a pure passthrough anyway, let's try to
> introduce some standards even if we can not always enforce
> them.

Yes, I agree. I've sent a document called "VFIO Migration" in a separate
email thread that defines how to orchestrate migration with versioning.
Maybe we can discuss the details there and figure out which guidelines
and device state representations to standardize.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-11-02 10:13                           ` Stefan Hajnoczi
@ 2020-11-03  7:52                             ` Jason Wang
  2020-11-03 14:26                               ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2020-11-03  7:52 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	Stefan Hajnoczi, qemu-devel, Kirti Wankhede, Gerd Hoffmann,
	Yan Vugenfirer, Jag Raman, Eugenio Pérez, Anup Patel,
	Claudio Imbrenda, Christian Borntraeger, Roman Kagan,
	Felipe Franciosi, Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Thanos Makatos, Alex Bennée, David Gibson, Kevin Wolf,
	Halil Pasic, Daniel P. Berrange, Christophe de Dinechin,
	Paolo Bonzini, fam


On 2020/11/2 下午6:13, Stefan Hajnoczi wrote:
> On Mon, Nov 02, 2020 at 10:51:18AM +0800, Jason Wang wrote:
>> On 2020/10/30 下午9:15, Stefan Hajnoczi wrote:
>>> On Fri, Oct 30, 2020 at 12:08 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2020/10/30 下午7:13, Stefan Hajnoczi wrote:
>>>>> On Fri, Oct 30, 2020 at 9:46 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>>> On 2020/10/30 下午2:21, Stefan Hajnoczi wrote:
>>>>>>> On Fri, Oct 30, 2020 at 3:04 AM Alex Williamson
>>>>>>> <alex.williamson@redhat.com> wrote:
>>>>>>>> It's great to revisit ideas, but proclaiming a uAPI is bad solely
>>>>>>>> because the data transfer is opaque, without defining why that's bad,
>>>>>>>> evaluating the feasibility and implementation of defining a well
>>>>>>>> specified data format rather than protocol, including cross-vendor
>>>>>>>> support, or proposing any sort of alternative is not so helpful imo.
>>>>>>> The migration approaches in VFIO and vDPA/vhost were designed for
>>>>>>> different requirements and I think this is why there are different
>>>>>>> perspectives on this. Here is a comparison and how VFIO could be
>>>>>>> extended in the future. I see 3 levels of device state compatibility:
>>>>>>>
>>>>>>> 1. The device cannot save/load state blobs, instead userspace fetches
>>>>>>> and restores specific values of the device's runtime state (e.g. last
>>>>>>> processed ring index). This is the vhost approach.
>>>>>>>
>>>>>>> 2. The device can save/load state in a standard format. This is
>>>>>>> similar to #1 except that there is a single read/write blob interface
>>>>>>> instead of fine-grained get_FOO()/set_FOO() interfaces. This approach
>>>>>>> pushes the migration state parsing into the device so that userspace
>>>>>>> doesn't need knowledge of every device type. With this approach it is
>>>>>>> possible for a device from vendor A to migrate to a device from vendor
>>>>>>> B, as long as they both implement the same standard migration format.
>>>>>>> The limitation of this approach is that vendor-specific state cannot
>>>>>>> be transferred.
>>>>>>>
>>>>>>> 3. The device can save/load opaque blobs. This is the initial VFIO
>>>>>>> approach.
>>>>>> I still don't get why it must be opaque.
>>>>> If the device state format needs to be in the VMM then each device
>>>>> needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
>>>>>
>>>>> Let's invert the question: why does the VMM need to understand the
>>>>> device state of a _passthrough_ device?
>>>> For better manageability, compatibility and debug-ability. If we depends
>>>> on a opaque structure, do we encourage device to implement its own
>>>> migration protocol? It would be very challenge.
>>>>
>>>> For VFIO in the kernel, I suspect a uAPI that may result a opaque data
>>>> to be read or wrote from guest violates the Linux uAPI principle. It
>>>> will be very hard to maintain uABI or even impossible. It looks to me
>>>> VFIO is the first subsystem that is trying to do this.
>>> I think our concepts of uAPI are different. The uAPI of read(2) and
>>> write(2) does not define the structure of the data buffers. VFIO
>>> device regions are exactly the same, the structure of the data is not
>>> defined by the kernel uAPI.
>>
>> I think we're talking about different things. It's not about the data
>> structure, it's about whether to data that reads from kernel can be
>> understood by userspace.
>>
>>
>>> Maybe microcode and firmware loading is an example we agree on?
>>
>> I think not. They are bytecodes that have
>>
>> 1) strict ABI definitions
>> 2) understood by userspace
> No, they can be proprietary formats that neither the Linux kernel nor
> userspace can parse. For example, look at linux-firmware
> (https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/about/)
> it's just a collection of binary blobs. The format is not necessarily
> public. The only restriction on that repo is that the binary blob must
> be redistributable and users must be allowed to run them (i.e.
> proprietary licenses can be used).


I think not. Obviously each firmware should have its own ABI no matter 
whether its public or proprietary. For proprietary firmware, it should 
be understood by the proprietary userspace counterpart.


>
> Or look at other passthrough device interfaces like /dev/i2c or libusb.
> They expose data to userspace without requiring a defined format. It's
> the same as VFIO.


Again, it should have an ABI there (either device or spec) no matter 
whether or not it's a transport layer. And there will be an endpoint in 
the userspace know all the format.


>
> In addition, look at kernel uAPIs where userspace acts simply as a data
> transport for opaque data (e.g. where a userspace helper facilitates
> communication but has no visibility of the data). I imagine that memory
> encryption relies on this because the host kernel and userspace do not
> have access to encrypted memory or associated state - but they need to
> help migrate them to other hosts.


Which uAPI do you mean here?


>
> I hope these examples show that such APIs don't pose a problem for the
> Linux uAPI and are already in use. VFIO device state isn't doing
> anything new here.


I feel that you tried to explain "why it can be" but not "why it must 
be". Trying to find one or two subsystems that have opaque uAPI without 
ABI (though I suspect there will be one) may not convince here.

Thanks


>
>>>>>>>      A device from vendor A cannot migrate to a device from
>>>>>>> vendor B because the format is incompatible. This approach works well
>>>>>>> when devices have unique guest-visible hardware interfaces so the
>>>>>>> guest wouldn't be able to handle migrating a device from vendor A to a
>>>>>>> device from vendor B anyway.
>>>>>> For VFIO I guess cross vendor live migration can't succeed unless we do
>>>>>> some cheats in device/vendor id.
>>>>> Yes. I haven't looked into the details of PCI (Sub-)Device/Vendor IDs
>>>>> and how to best enable migration but I hope that can be solved. The
>>>>> simplest approach is to override the IDs and make them part of the
>>>>> guest configuration.
>>>> That would be very tricky (or requires whitelist). E.g the opaque of the
>>>> src may match the opaque of the dst by chance.
>>> Luckily identifying things based on magic constants has been solved
>>> many times in the past.
>>>
>>> A central identifier registry prevents all collisions but is a pain to
>>> manage. Or use a 128-bit UUID and self-allocate the identifier with an
>>> extremely low chance of collision:
>>> https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions
>>
>> I may miss something. I think we're talking about cross vendor live
>> migration.
>>
>> Would you want src and dest have same UUID or not?
>>
>> If they have different UUIDs, how could we know we can live migrate between
>> them.
>>
>> If they have the same UUID, what's the rule of forcing the the vendors to
>> choose same UUID (a spec)?
> I will send a separate email that describes how VFIO live migration can
> work in more detail. I think it's possible to do it with existing ioctl
> interface that Kirti has proposed and still prevent the risk of
> incorrectly interpreting data that you have pointed out.
>
> The document that I'm sending will allow us to discuss in more detail
> and make the approach clearer.
>
> Stefan



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-11-03  7:52                             ` Jason Wang
@ 2020-11-03 14:26                               ` Stefan Hajnoczi
  2020-11-04  6:50                                 ` Gerd Hoffmann
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2020-11-03 14:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	qemu-devel, Kirti Wankhede, Gerd Hoffmann, Yan Vugenfirer,
	Jag Raman, Eugenio Pérez, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Stefan Hajnoczi, Thanos Makatos, Alex Bennée, David Gibson,
	Kevin Wolf, Halil Pasic, Daniel P. Berrange,
	Christophe de Dinechin, Paolo Bonzini, fam

On Tue, Nov 3, 2020 at 7:53 AM Jason Wang <jasowang@redhat.com> wrote:
> On 2020/11/2 下午6:13, Stefan Hajnoczi wrote:
> > On Mon, Nov 02, 2020 at 10:51:18AM +0800, Jason Wang wrote:
> >> On 2020/10/30 下午9:15, Stefan Hajnoczi wrote:
> >>> On Fri, Oct 30, 2020 at 12:08 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2020/10/30 下午7:13, Stefan Hajnoczi wrote:
> >>>>> On Fri, Oct 30, 2020 at 9:46 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2020/10/30 下午2:21, Stefan Hajnoczi wrote:
> >>>>>>> On Fri, Oct 30, 2020 at 3:04 AM Alex Williamson
> >>>>>>> <alex.williamson@redhat.com> wrote:
> >>>>>>>> It's great to revisit ideas, but proclaiming a uAPI is bad solely
> >>>>>>>> because the data transfer is opaque, without defining why that's bad,
> >>>>>>>> evaluating the feasibility and implementation of defining a well
> >>>>>>>> specified data format rather than protocol, including cross-vendor
> >>>>>>>> support, or proposing any sort of alternative is not so helpful imo.
> >>>>>>> The migration approaches in VFIO and vDPA/vhost were designed for
> >>>>>>> different requirements and I think this is why there are different
> >>>>>>> perspectives on this. Here is a comparison and how VFIO could be
> >>>>>>> extended in the future. I see 3 levels of device state compatibility:
> >>>>>>>
> >>>>>>> 1. The device cannot save/load state blobs, instead userspace fetches
> >>>>>>> and restores specific values of the device's runtime state (e.g. last
> >>>>>>> processed ring index). This is the vhost approach.
> >>>>>>>
> >>>>>>> 2. The device can save/load state in a standard format. This is
> >>>>>>> similar to #1 except that there is a single read/write blob interface
> >>>>>>> instead of fine-grained get_FOO()/set_FOO() interfaces. This approach
> >>>>>>> pushes the migration state parsing into the device so that userspace
> >>>>>>> doesn't need knowledge of every device type. With this approach it is
> >>>>>>> possible for a device from vendor A to migrate to a device from vendor
> >>>>>>> B, as long as they both implement the same standard migration format.
> >>>>>>> The limitation of this approach is that vendor-specific state cannot
> >>>>>>> be transferred.
> >>>>>>>
> >>>>>>> 3. The device can save/load opaque blobs. This is the initial VFIO
> >>>>>>> approach.
> >>>>>> I still don't get why it must be opaque.
> >>>>> If the device state format needs to be in the VMM then each device
> >>>>> needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
> >>>>>
> >>>>> Let's invert the question: why does the VMM need to understand the
> >>>>> device state of a _passthrough_ device?
> >>>> For better manageability, compatibility and debug-ability. If we depends
> >>>> on a opaque structure, do we encourage device to implement its own
> >>>> migration protocol? It would be very challenge.
> >>>>
> >>>> For VFIO in the kernel, I suspect a uAPI that may result a opaque data
> >>>> to be read or wrote from guest violates the Linux uAPI principle. It
> >>>> will be very hard to maintain uABI or even impossible. It looks to me
> >>>> VFIO is the first subsystem that is trying to do this.
> >>> I think our concepts of uAPI are different. The uAPI of read(2) and
> >>> write(2) does not define the structure of the data buffers. VFIO
> >>> device regions are exactly the same, the structure of the data is not
> >>> defined by the kernel uAPI.
> >>
> >> I think we're talking about different things. It's not about the data
> >> structure, it's about whether to data that reads from kernel can be
> >> understood by userspace.
> >>
> >>
> >>> Maybe microcode and firmware loading is an example we agree on?
> >>
> >> I think not. They are bytecodes that have
> >>
> >> 1) strict ABI definitions
> >> 2) understood by userspace
> > No, they can be proprietary formats that neither the Linux kernel nor
> > userspace can parse. For example, look at linux-firmware
> > (https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/about/)
> > it's just a collection of binary blobs. The format is not necessarily
> > public. The only restriction on that repo is that the binary blob must
> > be redistributable and users must be allowed to run them (i.e.
> > proprietary licenses can be used).
>
>
> I think not. Obviously each firmware should have its own ABI no matter
> whether its public or proprietary. For proprietary firmware, it should
> be understood by the proprietary userspace counterpart.

Userspace does not necessarily need to interpret the contents. The
vendor can ship a binary blob and the driver loads the file onto the
device without interpreting it.

> > Or look at other passthrough device interfaces like /dev/i2c or libusb.
> > They expose data to userspace without requiring a defined format. It's
> > the same as VFIO.
>
>
> Again, it should have an ABI there (either device or spec) no matter
> whether or not it's a transport layer. And there will be an endpoint in
> the userspace know all the format.

VFIO defines how userspace interacts with migration regions, see the
patch series that I linked at the beginning of this discussion.
Userspace has control over pausing/resuming the device and reading
migration blobs.

> > In addition, look at kernel uAPIs where userspace acts simply as a data
> > transport for opaque data (e.g. where a userspace helper facilitates
> > communication but has no visibility of the data). I imagine that memory
> > encryption relies on this because the host kernel and userspace do not
> > have access to encrypted memory or associated state - but they need to
> > help migrate them to other hosts.
>
>
> Which uAPI do you mean here?

Migration of encrypted guests. The host kernel and userspace do not
have access to all guest state. Userspace acts as a transport - same
as VFIO migration.

I'm not sure how much of it is already upstream since it's being
actively developed right now, but it's another example where userspace
does not need to and cannot interpret data.

> > I hope these examples show that such APIs don't pose a problem for the
> > Linux uAPI and are already in use. VFIO device state isn't doing
> > anything new here.
>
>
> I feel that you tried to explain "why it can be" but not "why it must
> be". Trying to find one or two subsystems that have opaque uAPI without
> ABI (though I suspect there will be one) may not convince here.

As I've said from the beginning of the discussion, there are multiple
approaches and they are suited to different use cases.

For passthrough devices I think it's preferable for the VMM not to be
involved in the device state representation. This keeps the VMM
simple, avoids code duplication across VMMs, and allows migration to
work in non-virtualization use cases.

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-11-03 14:26                               ` Stefan Hajnoczi
@ 2020-11-04  6:50                                 ` Gerd Hoffmann
  2020-11-04  7:42                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 40+ messages in thread
From: Gerd Hoffmann @ 2020-11-04  6:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Elena Ufimtseva, John G Johnson, mst@redhat.com, Janosch Frank,
	Jason Wang, qemu-devel, Kirti Wankhede, Yan Vugenfirer,
	Jag Raman, Eugenio Pérez, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Stefan Hajnoczi, Thanos Makatos, Alex Bennée, David Gibson,
	Kevin Wolf, Halil Pasic, Daniel P. Berrange,
	Christophe de Dinechin, Paolo Bonzini, fam

  Hi,

> > I think not. Obviously each firmware should have its own ABI no matter
> > whether its public or proprietary. For proprietary firmware, it should
> > be understood by the proprietary userspace counterpart.
> 
> Userspace does not necessarily need to interpret the contents. The
> vendor can ship a binary blob and the driver loads the file onto the
> device without interpreting it.

Exactly.  Neither userspace nor kernel look at the blob, except maybe
some headers with version, size, checksum etc.  Only the device does
something with the actual content.

Doing the same make sense for migration device state.  The kernel driver
saves and restores the device state.  Userspace doesn't need to look at
it.  Again, with an exception for some header fields.

So requiring userspace being able to interpret the migration data
(except header) for all devices looks rather pointless to me.

Speaking of headers: Defining a common header format makes sense.
For standard devices (virtio, nvme, ...) it makes sense to try define
a standard, cross-vendor migration data format.
For vendor-specific devices (gpus for example) I absolutely don't see
the point.

take care,
  Gerd



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Out-of-Process Device Emulation session at KVM Forum 2020
  2020-11-04  6:50                                 ` Gerd Hoffmann
@ 2020-11-04  7:42                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2020-11-04  7:42 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Elena Ufimtseva, Janosch Frank, John G Johnson, Stefan Hajnoczi,
	Jason Wang, qemu-devel, Kirti Wankhede, Yan Vugenfirer,
	Jag Raman, Eugenio Pérez, Anup Patel, Claudio Imbrenda,
	Christian Borntraeger, Roman Kagan, Felipe Franciosi,
	Marc-André Lureau, Jens Freimann,
	Philippe Mathieu-Daudé,
	Stefano Garzarella, Eduardo Habkost, Sergio Lopez,
	Kashyap Chamarthy, Darren Kenny, Alex Williamson, Liran Alon,
	Stefan Hajnoczi, Thanos Makatos, Alex Bennée, David Gibson,
	Kevin Wolf, Halil Pasic, Daniel P. Berrange,
	Christophe de Dinechin, Paolo Bonzini, fam

On Wed, Nov 04, 2020 at 07:50:52AM +0100, Gerd Hoffmann wrote:
>   Hi,
> 
> > > I think not. Obviously each firmware should have its own ABI no matter
> > > whether its public or proprietary. For proprietary firmware, it should
> > > be understood by the proprietary userspace counterpart.
> > 
> > Userspace does not necessarily need to interpret the contents. The
> > vendor can ship a binary blob and the driver loads the file onto the
> > device without interpreting it.
> 
> Exactly.  Neither userspace nor kernel look at the blob, except maybe
> some headers with version, size, checksum etc.  Only the device does
> something with the actual content.
> 
> Doing the same make sense for migration device state.  The kernel driver
> saves and restores the device state.  Userspace doesn't need to look at
> it.  Again, with an exception for some header fields.
> 
> So requiring userspace being able to interpret the migration data
> (except header) for all devices looks rather pointless to me.

If nothing else we need a good place where vendors can publish this
data.

> Speaking of headers: Defining a common header format makes sense.
> For standard devices (virtio, nvme, ...) it makes sense to try define
> a standard, cross-vendor migration data format.
> For vendor-specific devices (gpus for example) I absolutely don't see
> the point.
> 
> take care,
>   Gerd



^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2020-11-04  7:44 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-27 15:14 Out-of-Process Device Emulation session at KVM Forum 2020 Stefan Hajnoczi
2020-10-28  9:32 ` Thanos Makatos
2020-10-28 10:07   ` Thanos Makatos
2020-10-28 11:09 ` Michael S. Tsirkin
2020-10-29  8:21 ` Stefan Hajnoczi
2020-10-29 12:08 ` Stefan Hajnoczi
2020-10-29 13:02   ` Jason Wang
2020-10-29 13:06     ` Paolo Bonzini
2020-10-29 14:08     ` Stefan Hajnoczi
2020-10-29 14:31     ` Alex Williamson
2020-10-29 15:09       ` Jason Wang
2020-10-29 15:46         ` Alex Williamson
2020-10-29 16:10           ` Paolo Bonzini
2020-10-30  1:11           ` Jason Wang
2020-10-30  3:04             ` Alex Williamson
2020-10-30  6:21               ` Stefan Hajnoczi
2020-10-30  9:45                 ` Jason Wang
2020-10-30 11:13                   ` Stefan Hajnoczi
2020-10-30 12:07                     ` Jason Wang
2020-10-30 13:15                       ` Stefan Hajnoczi
2020-11-02  2:51                         ` Jason Wang
2020-11-02 10:13                           ` Stefan Hajnoczi
2020-11-03  7:52                             ` Jason Wang
2020-11-03 14:26                               ` Stefan Hajnoczi
2020-11-04  6:50                                 ` Gerd Hoffmann
2020-11-04  7:42                                   ` Michael S. Tsirkin
2020-10-31 21:49                     ` Michael S. Tsirkin
2020-11-01  8:26                       ` Paolo Bonzini
2020-11-02  2:54                         ` Jason Wang
2020-11-02  3:00                     ` Jason Wang
2020-11-02 10:27                       ` Stefan Hajnoczi
2020-11-02 10:34                         ` Michael S. Tsirkin
2020-11-02 14:59                           ` Stefan Hajnoczi
2020-10-30  7:51               ` Michael S. Tsirkin
2020-10-30  9:31               ` Jason Wang
2020-10-29 16:15     ` David Edmondson
2020-10-29 16:42       ` Daniel P. Berrangé
2020-10-29 17:47         ` Kirti Wankhede
2020-10-29 18:07           ` Paolo Bonzini
2020-10-30  1:15             ` Jason Wang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.