From: Stefan Hajnoczi <stefanha@gmail.com>
To: Jason Wang <jasowang@redhat.com>
Cc: "Elena Ufimtseva" <elena.ufimtseva@oracle.com>,
	"John G Johnson" <john.g.johnson@oracle.com>,
	"mst@redhat.com" <mtsirkin@redhat.com>,
	"Janosch Frank" <frankja@linux.vnet.ibm.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	"Kirti Wankhede" <kwankhede@nvidia.com>,
	"Gerd Hoffmann" <kraxel@redhat.com>,
	"Yan Vugenfirer" <yan@daynix.com>,
	"Jag Raman" <jag.raman@oracle.com>,
	"Eugenio Pérez" <eperezma@redhat.com>,
	"Anup Patel" <anup@brainfault.org>,
	"Claudio Imbrenda" <imbrenda@linux.vnet.ibm.com>,
	"Christian Borntraeger" <borntraeger@de.ibm.com>,
	"Roman Kagan" <rkagan@virtuozzo.com>,
	"Felipe Franciosi" <felipe@nutanix.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Jens Freimann" <jfreimann@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>,
	"Stefano Garzarella" <sgarzare@redhat.com>,
	"Eduardo Habkost" <ehabkost@redhat.com>,
	"Sergio Lopez" <slp@redhat.com>,
	"Kashyap Chamarthy" <kchamart@redhat.com>,
	"Darren Kenny" <darren.kenny@oracle.com>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Liran Alon" <liran.alon@oracle.com>,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	"Thanos Makatos" <thanos.makatos@nutanix.com>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"David Gibson" <david@gibson.dropbear.id.au>,
	"Kevin Wolf" <kwolf@redhat.com>,
	"Halil Pasic" <pasic@linux.vnet.ibm.com>,
	"Daniel P. Berrange" <berrange@redhat.com>,
	"Christophe de Dinechin" <dinechin@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>, fam <fam@euphon.net>
Subject: Re: Out-of-Process Device Emulation session at KVM Forum 2020
Date: Tue, 3 Nov 2020 14:26:23 +0000
Message-ID: <CAJSP0QXJd-BK60t+efhAt2d6mj9+kgieiyfKm=DSC1z+fDCesA@mail.gmail.com>
In-Reply-To: <c007455d-b9fc-32d5-a58c-fd8d17794996@redhat.com>

On Tue, Nov 3, 2020 at 7:53 AM Jason Wang <jasowang@redhat.com> wrote:
> On 2020/11/2 6:13 PM, Stefan Hajnoczi wrote:
> > On Mon, Nov 02, 2020 at 10:51:18AM +0800, Jason Wang wrote:
> >> On 2020/10/30 9:15 PM, Stefan Hajnoczi wrote:
> >>> On Fri, Oct 30, 2020 at 12:08 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>> On 2020/10/30 7:13 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, Oct 30, 2020 at 9:46 AM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>> On 2020/10/30 2:21 PM, Stefan Hajnoczi wrote:
> >>>>>>> On Fri, Oct 30, 2020 at 3:04 AM Alex Williamson
> >>>>>>> <alex.williamson@redhat.com> wrote:
> >>>>>>>> It's great to revisit ideas, but proclaiming a uAPI is bad solely
> >>>>>>>> because the data transfer is opaque, without defining why that's bad,
> >>>>>>>> evaluating the feasibility and implementation of defining a well
> >>>>>>>> specified data format rather than protocol, including cross-vendor
> >>>>>>>> support, or proposing any sort of alternative is not so helpful imo.
> >>>>>>> The migration approaches in VFIO and vDPA/vhost were designed for
> >>>>>>> different requirements and I think this is why there are different
> >>>>>>> perspectives on this. Here is a comparison and how VFIO could be
> >>>>>>> extended in the future. I see 3 levels of device state compatibility:
> >>>>>>>
> >>>>>>> 1. The device cannot save/load state blobs, instead userspace fetches
> >>>>>>> and restores specific values of the device's runtime state (e.g. last
> >>>>>>> processed ring index). This is the vhost approach.
> >>>>>>>
> >>>>>>> 2. The device can save/load state in a standard format. This is
> >>>>>>> similar to #1 except that there is a single read/write blob interface
> >>>>>>> instead of fine-grained get_FOO()/set_FOO() interfaces. This approach
> >>>>>>> pushes the migration state parsing into the device so that userspace
> >>>>>>> doesn't need knowledge of every device type. With this approach it is
> >>>>>>> possible for a device from vendor A to migrate to a device from vendor
> >>>>>>> B, as long as they both implement the same standard migration format.
> >>>>>>> The limitation of this approach is that vendor-specific state cannot
> >>>>>>> be transferred.
> >>>>>>>
> >>>>>>> 3. The device can save/load opaque blobs. This is the initial VFIO
> >>>>>>> approach.
> >>>>>> I still don't get why it must be opaque.
> >>>>> If the device state format needs to be in the VMM then each device
> >>>>> needs explicit enablement in each VMM (QEMU, cloud-hypervisor, etc).
> >>>>>
> >>>>> Let's invert the question: why does the VMM need to understand the
> >>>>> device state of a _passthrough_ device?
> >>>> For better manageability, compatibility and debuggability. If we depend
> >>>> on an opaque structure, do we encourage each device to implement its own
> >>>> migration protocol? That would be very challenging.
> >>>>
> >>>> For VFIO in the kernel, I suspect that a uAPI which may result in opaque
> >>>> data being read or written from the guest violates the Linux uAPI
> >>>> principle. It will be very hard, or even impossible, to maintain the
> >>>> uABI. It looks to me like VFIO is the first subsystem trying to do this.
> >>> I think our concepts of uAPI are different. The uAPI of read(2) and
> >>> write(2) does not define the structure of the data buffers. VFIO
> >>> device regions are exactly the same, the structure of the data is not
> >>> defined by the kernel uAPI.
> >>
> >> I think we're talking about different things. It's not about the data
> >> structure, it's about whether the data read from the kernel can be
> >> understood by userspace.
> >>
> >>
> >>> Maybe microcode and firmware loading is an example we agree on?
> >>
> >> I think not. They are bytecodes that
> >>
> >> 1) have strict ABI definitions
> >> 2) are understood by userspace
> > No, they can be proprietary formats that neither the Linux kernel nor
> > userspace can parse. For example, look at linux-firmware
> > (https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/about/)
> > it's just a collection of binary blobs. The format is not necessarily
> > public. The only restriction on that repo is that the binary blob must
> > be redistributable and users must be allowed to run them (i.e.
> > proprietary licenses can be used).
>
>
> I think not. Obviously each firmware should have its own ABI no matter
> whether it's public or proprietary. For proprietary firmware, it should
> be understood by the proprietary userspace counterpart.

Userspace does not necessarily need to interpret the contents. The
vendor can ship a binary blob and the driver loads the file onto the
device without interpreting it.

> > Or look at other passthrough device interfaces like /dev/i2c or libusb.
> > They expose data to userspace without requiring a defined format. It's
> > the same as VFIO.
>
>
> Again, there should be an ABI there (either device or spec) no matter
> whether or not it's a transport layer. And there will be an endpoint in
> userspace that knows the whole format.

VFIO defines how userspace interacts with migration regions, see the
patch series that I linked at the beginning of this discussion.
Userspace has control over pausing/resuming the device and reading
migration blobs.
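
As a rough illustration of that control flow, here is a minimal sketch.
It is not the VFIO uAPI: FakeDevice and every method name on it are
hypothetical stand-ins. The point is only that the transport pauses the
device, copies migration data chunk by chunk, and never looks inside
the bytes:

```python
import io


class FakeDevice:
    """Hypothetical stand-in for a device exposing a migration region."""

    def __init__(self, internal_state=b""):
        # Only the device (or its vendor driver) understands this format.
        self._state = internal_state
        self._incoming = bytearray()
        self.running = True

    def pause(self):
        self.running = False

    def resume(self):
        self.running = True

    def read_migration_data(self, offset, size):
        # Returns up to `size` bytes of opaque state, b"" once exhausted.
        return self._state[offset:offset + size]

    def write_migration_data(self, chunk):
        self._incoming += chunk

    def finish_load(self):
        # The device consumes its own blob; the VMM never parses it.
        self._state = bytes(self._incoming)
        self._incoming.clear()


def save_device(dev, stream, chunk_size=4096):
    """Pause the device and copy its opaque state into `stream`."""
    dev.pause()
    offset = 0
    while True:
        chunk = dev.read_migration_data(offset, chunk_size)
        if not chunk:
            break
        stream.write(chunk)
        offset += len(chunk)
    return offset  # total number of bytes transported


def load_device(dev, stream, chunk_size=4096):
    """Feed an opaque blob into the destination device, then resume it."""
    dev.pause()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        dev.write_migration_data(chunk)
    dev.finish_load()
    dev.resume()
```

Nothing in save_device() or load_device() depends on the device type,
which is the property being argued for here.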

> > In addition, look at kernel uAPIs where userspace acts simply as a data
> > transport for opaque data (e.g. where a userspace helper facilitates
> > communication but has no visibility of the data). I imagine that memory
> > encryption relies on this because the host kernel and userspace do not
> > have access to encrypted memory or associated state - but they need to
> > help migrate them to other hosts.
>
>
> Which uAPI do you mean here?

Migration of encrypted guests. The host kernel and userspace do not
have access to all guest state. Userspace acts as a transport - same
as VFIO migration.

I'm not sure how much of it is already upstream since it's being
actively developed right now, but it's another example where userspace
does not need to and cannot interpret data.

> > I hope these examples show that such APIs don't pose a problem for the
> > Linux uAPI and are already in use. VFIO device state isn't doing
> > anything new here.
>
>
> I feel that you tried to explain "why it can be" but not "why it must
> be". Trying to find one or two subsystems that have opaque uAPI without
> ABI (though I suspect there will be one) may not be convincing here.

As I've said from the beginning of the discussion, there are multiple
approaches and they are suited to different use cases.

For passthrough devices I think it's preferable for the VMM not to be
involved in the device state representation. This keeps the VMM
simple, avoids code duplication across VMMs, and allows migration to
work in non-virtualization use cases.
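
To make the trade-off concrete, here is a small sketch contrasting the
fine-grained approach (#1 in the list quoted earlier in this thread)
with the opaque-blob approach (#3). All class and function names are
hypothetical; this is not QEMU, vhost, or VFIO code:

```python
class VhostLikeNet:
    """Hypothetical device exposing individual state fields (approach 1)."""
    device_type = "net"

    def __init__(self, last_avail_idx=0):
        self.last_avail_idx = last_avail_idx

    def get_last_avail_idx(self):
        return self.last_avail_idx

    def set_last_avail_idx(self, value):
        self.last_avail_idx = value


class BlobDevice:
    """Hypothetical passthrough device exposing one opaque blob (approach 3)."""
    device_type = "vendor-x"

    def __init__(self, blob=b""):
        self._blob = blob

    def save_blob(self):
        return self._blob

    def load_blob(self, blob):
        self._blob = blob


# Approach 1: the VMM carries one save/load pair per device type.
def save_net(dev):
    return {"last_avail_idx": dev.get_last_avail_idx()}


def load_net(dev, state):
    dev.set_last_avail_idx(state["last_avail_idx"])


FINE_GRAINED_HANDLERS = {"net": (save_net, load_net)}


def migrate_fine_grained(src, dst):
    # Raises KeyError for a device type this VMM was never taught about.
    save, load = FINE_GRAINED_HANDLERS[src.device_type]
    load(dst, save(src))


# Approach 3: one routine covers any device; the bytes stay opaque.
def migrate_opaque(src, dst):
    dst.load_blob(src.save_blob())
```

In this sketch, supporting a new device type under approach 1 means
adding a handler pair to every VMM, while migrate_opaque() needs no
changes at all.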

Stefan


