* RFC: use VFIO over a UNIX domain socket to implement device offloading
@ 2020-03-26  9:47 Thanos Makatos
  2020-03-27 10:37 ` Thanos Makatos
  2020-04-01  9:17 ` Stefan Hajnoczi
  0 siblings, 2 replies; 31+ messages in thread
From: Thanos Makatos @ 2020-03-26  9:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Harris,  James R,
	Swapnil Ingle, Konrad Rzeszutek Wilk, Kirti Wankhede,
	Raphael Norwitz, Alex Williamson, Kanth Ghatraju,
	Stefan Hajnoczi, Felipe Franciosi, Zhang, Tina, Liu, Changpeng,
	dgilbert

I want to continue the discussion regarding using MUSER
(https://github.com/nutanix/muser) as a device offloading mechanism. The main
drawback of MUSER is that it requires a kernel module, so I've experimented
with a proof of concept of what MUSER would look like if we didn't need a
kernel module. I did this by implementing a wrapper library
(https://github.com/tmakatos/libpathtrap) that intercepts accesses to
VFIO-related paths and forwards them, over a UNIX domain socket, to the MUSER
process providing the device emulation. This requires no changes to QEMU
(4.1.0). Obviously this is a massive hack, done only for the needs of this
PoC.
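
For the curious, the general shape of such an interceptor is the usual
LD_PRELOAD/dlsym trick. The sketch below is illustrative only: it is not the
libpathtrap code, and connect_to_server() is a hypothetical helper standing in
for "connect to the emulation process and return that socket":

    /* Illustrative sketch of an LD_PRELOAD interposer; not libpathtrap. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <string.h>

    /* Hypothetical helper: returns a UNIX domain socket connected to the
     * process emulating the device behind this /dev/vfio/ path. */
    int connect_to_server(const char *path);

    int open(const char *path, int flags, ...)
    {
        static int (*real_open)(const char *, int, ...);
        mode_t mode = 0;

        if (flags & O_CREAT) {
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }
        if (real_open == NULL)
            real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

        if (strncmp(path, "/dev/vfio/", 10) == 0)
            return connect_to_server(path);   /* hand back a socket fd */

        return real_open(path, flags, mode);
    }

(Compiled as a shared object and loaded with LD_PRELOAD; a real wrapper also
has to catch ioctl(), mmap() and friends on the returned fds.)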

The result is a fully working PCI device in QEMU (the gpio sample explained in
https://github.com/nutanix/muser/blob/master/README.md#running-gpio-pci-idio-16),
which is as simple as possible. I've also tested with a much more complicated
device emulation, https://github.com/tmakatos/spdk, which provides NVMe device
emulation and requires accessing guest memory for DMA, allowing BAR0 to be
memory mapped into the guest, using MSI-X interrupts, etc.

The changes required in MUSER are fairly small: all that is needed is to
introduce a new concept of "transport" so that requests are received from a
UNIX domain socket instead of from the kernel (via a character device), and so
that file descriptors can be sent/received for sharing memory and firing
interrupts.
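
The fd passing is just the standard SCM_RIGHTS mechanism on an AF_UNIX socket.
As a minimal, generic sketch (plain POSIX/Linux code, not the actual MUSER
transport), sending a single descriptor, e.g. a memfd backing guest RAM or an
eventfd used to trigger an interrupt, looks like this:

    /* Send a single fd with a 1-byte payload over a connected UNIX socket. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int send_fd(int sock, int fd)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
            struct cmsghdr align;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = u.buf,
            .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;

        memset(u.buf, 0, sizeof(u.buf));
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }

The receiver uses recvmsg() with the same control-buffer layout and gets its
own copy of the descriptor, which it can then mmap() (shared memory) or poll
(eventfd).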

My experience is that VFIO is so intuitive to use for offloading device
emulation from one process to another that it makes this feature quite
straightforward. There's virtually nothing specific to the kernel in the VFIO
API. Therefore I strongly agree with Stefan's suggestion to use it for device
offloading when interacting with QEMU. Using 'muser.ko' is still interesting
when QEMU is not the client, but if everyone is happy to proceed with the
vfio-over-socket alternative the kernel module can become a second-class
citizen. (QEMU is, after all, our first and most relevant client.)

Next I explain how to test the PoC.

Build MUSER with vfio-over-socket:

        git clone --single-branch --branch vfio-over-socket git@github.com:tmakatos/muser.git
        cd muser/
        git submodule update --init
        make

Run device emulation, e.g.

        ./build/dbg/samples/gpio-pci-idio-16 -s <N>

Where <N> is an available IOMMU group (essentially the device ID), which must
not already exist in /dev/vfio/.

Run QEMU using the vfio wrapper library and specifying the MUSER device:

        LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64 \
                ... \
                -device vfio-pci,sysfsdev=/dev/vfio/<N> \
                -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=mem,share=yes,size=1073741824 \
                -numa node,nodeid=0,cpus=0,memdev=ram-node0

Bear in mind that since this is just a PoC lots of things can break, e.g. a
system call that isn't intercepted.



* RE: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-03-26  9:47 RFC: use VFIO over a UNIX domain socket to implement device offloading Thanos Makatos
@ 2020-03-27 10:37 ` Thanos Makatos
  2020-04-01  9:17 ` Stefan Hajnoczi
  1 sibling, 0 replies; 31+ messages in thread
From: Thanos Makatos @ 2020-03-27 10:37 UTC (permalink / raw)
  To: Thanos Makatos, qemu-devel
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Harris,  James R,
	Swapnil Ingle, Konrad Rzeszutek Wilk, Alex Williamson,
	Raphael Norwitz, Kirti Wankhede, Kanth Ghatraju, Stefan Hajnoczi,
	Felipe Franciosi, Zhang, Tina, Liu, Changpeng, dgilbert

>  
> Next I explain how to test the PoC.
> 
> Build MUSER with vfio-over-socket:
> 
>         git clone --single-branch --branch vfio-over-socket
> git@github.com:tmakatos/muser.git
>         cd muser/
>         git submodule update --init
>         make

Yesterday's version had a bug where it didn't build if you didn't have an existing libmuser installation; I've pushed a patch to fix that.



* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-03-26  9:47 RFC: use VFIO over a UNIX domain socket to implement device offloading Thanos Makatos
  2020-03-27 10:37 ` Thanos Makatos
@ 2020-04-01  9:17 ` Stefan Hajnoczi
  2020-04-01 15:49   ` Thanos Makatos
  2020-04-20 11:05   ` Thanos Makatos
  1 sibling, 2 replies; 31+ messages in thread
From: Stefan Hajnoczi @ 2020-04-01  9:17 UTC (permalink / raw)
  To: Thanos Makatos
  Cc: Walker, Benjamin, Elena Ufimtseva, john.g.johnson, Jag Raman,
	Harris, James R, Swapnil Ingle, Konrad Rzeszutek Wilk,
	qemu-devel, Kirti Wankhede, Raphael Norwitz, Alex Williamson,
	Kanth Ghatraju, Felipe Franciosi, Marc-André Lureau, Zhang,
	Tina, Liu, Changpeng, dgilbert


On Thu, Mar 26, 2020 at 09:47:38AM +0000, Thanos Makatos wrote:
> Build MUSER with vfio-over-socket:
> 
>         git clone --single-branch --branch vfio-over-socket git@github.com:tmakatos/muser.git
>         cd muser/
>         git submodule update --init
>         make
> 
> Run device emulation, e.g.
> 
>         ./build/dbg/samples/gpio-pci-idio-16 -s <N>
> 
> Where <N> is an available IOMMU group, essentially the device ID, which must not
> previously exist in /dev/vfio/.
> 
> Run QEMU using the vfio wrapper library and specifying the MUSER device:
> 
>         LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64 \
>                 ... \
>                 -device vfio-pci,sysfsdev=/dev/vfio/<N> \
>                 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=mem,share=yes,size=1073741824 \
>                 -numa node,nodeid=0,cpus=0,memdev=ram-node0
> 
> Bear in mind that since this is just a PoC lots of things can break, e.g. some
> system call not intercepted etc.

Cool, I had a quick look at libvfio and how the transport integrates
into libmuser.  The integration on the libmuser side is nice and small.

It seems likely that there will be several different implementations of
the vfio-over-socket device side (server):
1. libmuser
2. A Rust equivalent to libmuser
3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
   has been investigating this?)

In order to interoperate we'll need to maintain a protocol
specification.  Maybe you and JJ could put that together and CC the vfio,
rust-vmm, and QEMU communities for discussion?

It should cover the UNIX domain socket connection semantics (does a
listen socket only accept 1 connection at a time?  What happens when the
client disconnects?  What happens when the server disconnects?), how
VFIO structs are exchanged, any vfio-over-socket specific protocol
messages, etc.  Basically everything needed to write an implementation
(although it's not necessary to copy the VFIO struct definitions from
the kernel headers into the spec or even document their semantics if
they are identical to kernel VFIO).
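
For example, the kind of framing such a spec has to pin down is roughly the
following; this is a made-up sketch, and none of the names or sizes below are
taken from any existing draft:

    /* Hypothetical wire framing, for illustration only. */
    #include <stdint.h>

    struct example_msg_header {
        uint16_t msg_id;    /* pairs replies with requests */
        uint16_t command;   /* region read/write, DMA map, IRQ setup, ... */
        uint32_t size;      /* total message size, header included */
        uint32_t flags;     /* request vs. reply, error indication, ... */
        uint32_t error;     /* errno-style status in replies */
        /* command-specific payload follows; fds ride along as SCM_RIGHTS */
    };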

The next step beyond the LD_PRELOAD library is a native vfio-over-socket
client implementation in QEMU.  There is a prototype here:
https://github.com/elmarco/qemu/blob/wip/vfio-user/hw/vfio/libvfio-user.c

If there are any volunteers for working on that then this would be a
good time to discuss it.

Finally, has anyone looked at CrosVM's out-of-process device model?  I
wonder if it has any features we should consider...

Looks like a great start to vfio-over-socket!

Stefan



* RE: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-01  9:17 ` Stefan Hajnoczi
@ 2020-04-01 15:49   ` Thanos Makatos
  2020-04-01 16:58     ` Marc-André Lureau
  2020-04-20 11:05   ` Thanos Makatos
  1 sibling, 1 reply; 31+ messages in thread
From: Thanos Makatos @ 2020-04-01 15:49 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Walker, Benjamin, Elena Ufimtseva, john.g.johnson, Jag Raman,
	Harris,  James R, Swapnil Ingle, Konrad Rzeszutek Wilk,
	qemu-devel, Kirti Wankhede, Raphael Norwitz, Alex Williamson,
	Kanth Ghatraju, Felipe Franciosi, Marc-André Lureau, Zhang,
	Tina, Liu, Changpeng, dgilbert

> On Thu, Mar 26, 2020 at 09:47:38AM +0000, Thanos Makatos wrote:
> > Build MUSER with vfio-over-socket:
> >
> >         git clone --single-branch --branch vfio-over-socket
> git@github.com:tmakatos/muser.git
> >         cd muser/
> >         git submodule update --init
> >         make
> >
> > Run device emulation, e.g.
> >
> >         ./build/dbg/samples/gpio-pci-idio-16 -s <N>
> >
> > Where <N> is an available IOMMU group, essentially the device ID, which
> must not
> > previously exist in /dev/vfio/.
> >
> > Run QEMU using the vfio wrapper library and specifying the MUSER device:
> >
> >         LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64
> \
> >                 ... \
> >                 -device vfio-pci,sysfsdev=/dev/vfio/<N> \
> >                 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-
> path=mem,share=yes,size=1073741824 \
> >                 -numa node,nodeid=0,cpus=0,memdev=ram-node0
> >
> > Bear in mind that since this is just a PoC lots of things can break, e.g. some
> > system call not intercepted etc.
> 
> Cool, I had a quick look at libvfio and how the transport integrates
> into libmuser.  The integration on the libmuser side is nice and small.
> 
> It seems likely that there will be several different implementations of
> the vfio-over-socket device side (server):
> 1. libmuser
> 2. A Rust equivalent to libmuser
> 3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
>    has been investigating this?)
> 
> In order to interoperate we'll need to maintain a protocol
> specification.  Mayb You and JJ could put that together and CC the vfio,
> rust-vmm, and QEMU communities for discussion?

Sure, I can start by drafting a design doc and sharing it.

> It should cover the UNIX domain socket connection semantics (does a
> listen socket only accept 1 connection at a time?  What happens when the
> client disconnects?  What happens when the server disconnects?), how
> VFIO structs are exchanged, any vfio-over-socket specific protocol
> messages, etc.  Basically everything needed to write an implementation
> (although it's not necessary to copy the VFIO struct definitions from
> the kernel headers into the spec or even document their semantics if
> they are identical to kernel VFIO).
> 
> The next step beyond the LD_PRELOAD library is a native vfio-over-socket
> client implementation in QEMU.  There is a prototype here:
> https://github.com/elmarco/qemu/blob/wip/vfio-user/hw/vfio/libvfio-
> user.c
> 
> If there are any volunteers for working on that then this would be a
> good time to discuss it.
> 
> Finally, has anyone looked at CrosVM's out-of-process device model?  I
> wonder if it has any features we should consider...
> 
> Looks like a great start to vfio-over-socket!



* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-01 15:49   ` Thanos Makatos
@ 2020-04-01 16:58     ` Marc-André Lureau
  2020-04-02 10:19       ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: Marc-André Lureau @ 2020-04-01 16:58 UTC (permalink / raw)
  To: Thanos Makatos
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, john.g.johnson, Konrad Rzeszutek Wilk,
	qemu-devel, Raphael Norwitz, Kirti Wankhede, Alex Williamson,
	Stefan Hajnoczi, Felipe Franciosi, Kanth Ghatraju, Zhang, Tina,
	Liu, Changpeng, dgilbert

Hi

On Wed, Apr 1, 2020 at 5:51 PM Thanos Makatos
<thanos.makatos@nutanix.com> wrote:
>
> > On Thu, Mar 26, 2020 at 09:47:38AM +0000, Thanos Makatos wrote:
> > > Build MUSER with vfio-over-socket:
> > >
> > >         git clone --single-branch --branch vfio-over-socket
> > git@github.com:tmakatos/muser.git
> > >         cd muser/
> > >         git submodule update --init
> > >         make
> > >
> > > Run device emulation, e.g.
> > >
> > >         ./build/dbg/samples/gpio-pci-idio-16 -s <N>
> > >
> > > Where <N> is an available IOMMU group, essentially the device ID, which
> > must not
> > > previously exist in /dev/vfio/.
> > >
> > > Run QEMU using the vfio wrapper library and specifying the MUSER device:
> > >
> > >         LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64
> > \
> > >                 ... \
> > >                 -device vfio-pci,sysfsdev=/dev/vfio/<N> \
> > >                 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-
> > path=mem,share=yes,size=1073741824 \
> > >                 -numa node,nodeid=0,cpus=0,memdev=ram-node0
> > >

FYI, with QEMU 5.0 you no longer need -numa:

-object memory-backend-memfd,id=mem,size=2G -M memory-backend=mem

(hopefully, we will get something even simpler one day)

> > > Bear in mind that since this is just a PoC lots of things can break, e.g. some
> > > system call not intercepted etc.
> >
> > Cool, I had a quick look at libvfio and how the transport integrates
> > into libmuser.  The integration on the libmuser side is nice and small.
> >
> > It seems likely that there will be several different implementations of
> > the vfio-over-socket device side (server):
> > 1. libmuser
> > 2. A Rust equivalent to libmuser
> > 3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
> >    has been investigating this?)
> >
> > In order to interoperate we'll need to maintain a protocol
> > specification.  Mayb You and JJ could put that together and CC the vfio,
> > rust-vmm, and QEMU communities for discussion?
>
> Sure, I can start by drafting a design doc and share it.

Ok! I am quite amazed you went this far with an LD_PRELOAD hack. This
demonstrates some of the limits of GPL projects, if any demonstration was
needed.

I think with this work, and the muser experience, you have a pretty
good idea of what the protocol could look like. My approach, as I
remember, was a quite straightforward VFIO over socket translation,
while trying to see if it could share some aspects with vhost-user,
such as memory handling etc.

To contrast with the work done on the qemu-mp series, I'd also prefer we
focus our work on a vfio-like protocol before trying to see how qemu
code and interfaces could be split across multiple binaries etc. We
will start with some limitations, similar to the ones that apply to
VFIO: migration, introspection, management etc. are mostly left out
for now. (IOW, qemu-mp is trying to do too many things simultaneously.)

That's the rough ideas/plan I have in mind:
- draft/define a "vfio over unix" protocol
- similar to vhost-user, also define some backend conventions
https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst#backend-program-conventions
- modify qemu vfio code to allow using a socket backend, i.e. something
like "-chardev socket=foo -device vfio-pci,chardev=foo"
- implement some test devices (and outside qemu, in whatever
language/framework - the more the merrier!)
- investigate how the existing qemu binary could expose some devices over
"vfio-unix", for example: "qemu -machine none -chardev socket=foo,server
-device pci-serial,vfio=foo". This would avoid a lot of the proxy and code
churn proposed in qemu-mp.
- think about evolution of QMP, so that commands are dispatched to the
right process. In my book, this is called a bus, and I would go for
DBus (not through qemu) in the long term. But for now, we probably
want to split QMP code to make it more modular (in qemu-mp series,
this isn't stellar either). Later on, perhaps look at bridging QMP
over DBus.
- code refactoring in qemu, to allow smaller binaries, that implement
the minimum for vfio-user devices. (imho, this will be a bit easier
after the meson move, as the build system is simpler)

That should allow some work sharing.

I can't wait for your design draft, and to see how I can help.

>
> > It should cover the UNIX domain socket connection semantics (does a
> > listen socket only accept 1 connection at a time?  What happens when the
> > client disconnects?  What happens when the server disconnects?), how
> > VFIO structs are exchanged, any vfio-over-socket specific protocol
> > messages, etc.  Basically everything needed to write an implementation
> > (although it's not necessary to copy the VFIO struct definitions from
> > the kernel headers into the spec or even document their semantics if
> > they are identical to kernel VFIO).
> >
> > The next step beyond the LD_PRELOAD library is a native vfio-over-socket
> > client implementation in QEMU.  There is a prototype here:
> > https://github.com/elmarco/qemu/blob/wip/vfio-user/hw/vfio/libvfio-
> > user.c
> >
> > If there are any volunteers for working on that then this would be a
> > good time to discuss it.
> >
> > Finally, has anyone looked at CrosVM's out-of-process device model?  I
> > wonder if it has any features we should consider...
> >
> > Looks like a great start to vfio-over-socket!
>


-- 
Marc-André Lureau



* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-01 16:58     ` Marc-André Lureau
@ 2020-04-02 10:19       ` Stefan Hajnoczi
  2020-04-02 10:46         ` Daniel P. Berrangé
  0 siblings, 1 reply; 31+ messages in thread
From: Stefan Hajnoczi @ 2020-04-02 10:19 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, john.g.johnson, Konrad Rzeszutek Wilk,
	qemu-devel, Raphael Norwitz, Kirti Wankhede, Alex Williamson,
	Felipe Franciosi, Thanos Makatos, Liu, Changpeng, Zhang, Tina,
	Kanth Ghatraju, dgilbert


On Wed, Apr 01, 2020 at 06:58:20PM +0200, Marc-André Lureau wrote:
> On Wed, Apr 1, 2020 at 5:51 PM Thanos Makatos
> <thanos.makatos@nutanix.com> wrote:
> > > On Thu, Mar 26, 2020 at 09:47:38AM +0000, Thanos Makatos wrote:
> > > > Build MUSER with vfio-over-socket:
> > > >
> > > >         git clone --single-branch --branch vfio-over-socket
> > > git@github.com:tmakatos/muser.git
> > > >         cd muser/
> > > >         git submodule update --init
> > > >         make
> > > >
> > > > Run device emulation, e.g.
> > > >
> > > >         ./build/dbg/samples/gpio-pci-idio-16 -s <N>
> > > >
> > > > Where <N> is an available IOMMU group, essentially the device ID, which
> > > must not
> > > > previously exist in /dev/vfio/.
> > > >
> > > > Run QEMU using the vfio wrapper library and specifying the MUSER device:
> > > >
> > > >         LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64
> > > \
> > > >                 ... \
> > > >                 -device vfio-pci,sysfsdev=/dev/vfio/<N> \
> > > >                 -object memory-backend-file,id=ram-node0,prealloc=yes,mem-
> > > path=mem,share=yes,size=1073741824 \
> > > >                 -numa node,nodeid=0,cpus=0,memdev=ram-node0
> > > >
> 
> fyi, with 5.0 you no longer need -numa!:
> 
> -object memory-backend-memfd,id=mem,size=2G -M memory-backend=mem
> 
> (hopefully, we will get something even simpler one day)
> 
> > > > Bear in mind that since this is just a PoC lots of things can break, e.g. some
> > > > system call not intercepted etc.
> > >
> > > Cool, I had a quick look at libvfio and how the transport integrates
> > > into libmuser.  The integration on the libmuser side is nice and small.
> > >
> > > It seems likely that there will be several different implementations of
> > > the vfio-over-socket device side (server):
> > > 1. libmuser
> > > 2. A Rust equivalent to libmuser
> > > 3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
> > >    has been investigating this?)
> > >
> > > In order to interoperate we'll need to maintain a protocol
> > > specification.  Mayb You and JJ could put that together and CC the vfio,
> > > rust-vmm, and QEMU communities for discussion?
> >
> > Sure, I can start by drafting a design doc and share it.
> 
> ok! I am quite amazed you went this far with a ldpreload hack. This
> demonstrates some limits of gpl projects, if it was necessary.
> 
> I think with this work, and the muser experience, you have a pretty
> good idea of what the protocol could look like. My approach, as I
> remember, was a quite straightforward VFIO over socket translation,
> while trying to see if it could share some aspects with vhost-user,
> such as memory handling etc.
> 
> To contrast with the work done on qemu-mp series, I'd also prefer we
> focus our work on a vfio-like protocol, before trying to see how qemu
> code and interface could be changed over multiple binaries etc. We
> will start with some limitations, similar to the one that apply to
> VFIO: migration, introspection, managements etc are mostly left out
> for now. (iow, qemu-mp is trying to do too many things simultaneously)

qemu-mp has been cut down significantly in order to make it
non-invasive.  The model is now much cleaner:
1. No monitor command or command-line option forwarding.  The device
   emulation program has its own command-line and monitor that QEMU
   doesn't know about.
2. No per-device proxy objects.  A single RemotePCIDevice is added to
   QEMU.  In the current patch series it only supports the LSI SCSI
   controller but once the socket protocol is changed to
   vfio-over-socket it will be possible to use any PCI device.

We recently agreed on dropping live migration to further reduce the
patch series.  If you have specific suggestions, please post reviews on
the latest patch series.

The RemotePCIDevice and device emulation program infrastructure it puts
in place are intended to be used by vfio-over-socket in the future.  I
see it as complementary to vfio-over-socket rather than as a
replacement.  Elena, Jag, and JJ have been working on it for a long time
and I think we should build on top of it (replacing parts as needed)
rather than propose a new plan that sidelines their work.

> That's the rough ideas/plan I have in mind:
> - draft/define a "vfio over unix" protocol
> - similar to vhost-user, also define some backend conventions
> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst#backend-program-conventions
> - modify qemu vfio code to allow using a socket backend. Ie something
> like "-chardev socket=foo -device vfio-pci,chardev=foo"

I think JJ has been working on this already.  Not sure what the status
is.

> - implement some test devices (and outside qemu, in whatever
> language/framework - the more the merrier!)
> - investigate how existing qemu binary could expose some devices over
> "vfio-unix", for ex: "qemu -machine none -chardev socket=foo,server
> -device pci-serial,vfio=foo". This would avoid a lot of proxy and code
> churn proposed in qemu-mp.

This is similar to the qemu-mp approach.  I think they found that doing
this in practice requires a RemotePCIBus and a
RemoteInterruptController.  Something along these lines:

  qemu -machine none \
       -chardev socket=foo,server \
       -device remote-pci-bus,chardev=foo \
       -device pci-serial # added to the remote-pci-bus

PCI devices you want to instantiate are completely unmodified - no need
to even add a vfio= parameter.  They just happen to be on a RemotePCIBus
instead of a regular PCI bus.  That way they can be accessed via
vfio-over-socket and interrupts are also handled remotely.

Stefan



* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-02 10:19       ` Stefan Hajnoczi
@ 2020-04-02 10:46         ` Daniel P. Berrangé
  2020-04-03 12:03           ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel P. Berrangé @ 2020-04-02 10:46 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, john.g.johnson, Alex Williamson,
	Konrad Rzeszutek Wilk, qemu-devel, Raphael Norwitz,
	Kirti Wankhede, Marc-André Lureau, Felipe Franciosi,
	Kanth Ghatraju, Thanos Makatos, Zhang, Tina, Liu, Changpeng,
	dgilbert

On Thu, Apr 02, 2020 at 11:19:42AM +0100, Stefan Hajnoczi wrote:
> On Wed, Apr 01, 2020 at 06:58:20PM +0200, Marc-André Lureau wrote:
> > On Wed, Apr 1, 2020 at 5:51 PM Thanos Makatos
> > <thanos.makatos@nutanix.com> wrote:
> > > > > Bear in mind that since this is just a PoC lots of things can break, e.g. some
> > > > > system call not intercepted etc.
> > > >
> > > > Cool, I had a quick look at libvfio and how the transport integrates
> > > > into libmuser.  The integration on the libmuser side is nice and small.
> > > >
> > > > It seems likely that there will be several different implementations of
> > > > the vfio-over-socket device side (server):
> > > > 1. libmuser
> > > > 2. A Rust equivalent to libmuser
> > > > 3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
> > > >    has been investigating this?)
> > > >
> > > > In order to interoperate we'll need to maintain a protocol
> > > > specification.  Mayb You and JJ could put that together and CC the vfio,
> > > > rust-vmm, and QEMU communities for discussion?
> > >
> > > Sure, I can start by drafting a design doc and share it.
> > 
> > ok! I am quite amazed you went this far with a ldpreload hack. This
> > demonstrates some limits of gpl projects, if it was necessary.
> > 
> > I think with this work, and the muser experience, you have a pretty
> > good idea of what the protocol could look like. My approach, as I
> > remember, was a quite straightforward VFIO over socket translation,
> > while trying to see if it could share some aspects with vhost-user,
> > such as memory handling etc.
> > 
> > To contrast with the work done on qemu-mp series, I'd also prefer we
> > focus our work on a vfio-like protocol, before trying to see how qemu
> > code and interface could be changed over multiple binaries etc. We
> > will start with some limitations, similar to the one that apply to
> > VFIO: migration, introspection, managements etc are mostly left out
> > for now. (iow, qemu-mp is trying to do too many things simultaneously)
> 
> qemu-mp has been cut down significantly in order to make it
> non-invasive.  The model is now much cleaner:
> 1. No monitor command or command-line option forwarding.  The device
>    emulation program has its own command-line and monitor that QEMU
>    doesn't know about.
> 2. No per-device proxy objects.  A single RemotePCIDevice is added to
>    QEMU.  In the current patch series it only supports the LSI SCSI
>    controller but once the socket protocol is changed to
>    vfio-over-socket it will be possible to use any PCI device.
> 
> We recently agreed on dropping live migration to further reduce the
> patch series.  If you have specific suggestions, please post reviews on
> the latest patch series.

To clarify - the decision was to *temporarily* drop live migration, to
make the initial patch series smaller and thus easier to merge. It does
ultimately need live migration, so there would be followup patch series
to provide migration support, after the initial merge.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-02 10:46         ` Daniel P. Berrangé
@ 2020-04-03 12:03           ` Stefan Hajnoczi
  0 siblings, 0 replies; 31+ messages in thread
From: Stefan Hajnoczi @ 2020-04-03 12:03 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Harris, James R,
	Swapnil Ingle, john.g.johnson, Konrad Rzeszutek Wilk, qemu-devel,
	Kirti Wankhede, Raphael Norwitz, Alex Williamson, Thanos Makatos,
	Marc-André Lureau, Stefan Hajnoczi, Felipe Franciosi,
	Kanth Ghatraju, Zhang, Tina, Liu, Changpeng, dgilbert


On Thu, Apr 02, 2020 at 11:46:45AM +0100, Daniel P. Berrangé wrote:
> On Thu, Apr 02, 2020 at 11:19:42AM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 01, 2020 at 06:58:20PM +0200, Marc-André Lureau wrote:
> > > On Wed, Apr 1, 2020 at 5:51 PM Thanos Makatos
> > > <thanos.makatos@nutanix.com> wrote:
> > > > > > Bear in mind that since this is just a PoC lots of things can break, e.g. some
> > > > > > system call not intercepted etc.
> > > > >
> > > > > Cool, I had a quick look at libvfio and how the transport integrates
> > > > > into libmuser.  The integration on the libmuser side is nice and small.
> > > > >
> > > > > It seems likely that there will be several different implementations of
> > > > > the vfio-over-socket device side (server):
> > > > > 1. libmuser
> > > > > 2. A Rust equivalent to libmuser
> > > > > 3. Maybe a native QEMU implementation for multi-process QEMU (I think JJ
> > > > >    has been investigating this?)
> > > > >
> > > > > In order to interoperate we'll need to maintain a protocol
> > > > > specification.  Mayb You and JJ could put that together and CC the vfio,
> > > > > rust-vmm, and QEMU communities for discussion?
> > > >
> > > > Sure, I can start by drafting a design doc and share it.
> > > 
> > > ok! I am quite amazed you went this far with a ldpreload hack. This
> > > demonstrates some limits of gpl projects, if it was necessary.
> > > 
> > > I think with this work, and the muser experience, you have a pretty
> > > good idea of what the protocol could look like. My approach, as I
> > > remember, was a quite straightforward VFIO over socket translation,
> > > while trying to see if it could share some aspects with vhost-user,
> > > such as memory handling etc.
> > > 
> > > To contrast with the work done on qemu-mp series, I'd also prefer we
> > > focus our work on a vfio-like protocol, before trying to see how qemu
> > > code and interface could be changed over multiple binaries etc. We
> > > will start with some limitations, similar to the one that apply to
> > > VFIO: migration, introspection, managements etc are mostly left out
> > > for now. (iow, qemu-mp is trying to do too many things simultaneously)
> > 
> > qemu-mp has been cut down significantly in order to make it
> > non-invasive.  The model is now much cleaner:
> > 1. No monitor command or command-line option forwarding.  The device
> >    emulation program has its own command-line and monitor that QEMU
> >    doesn't know about.
> > 2. No per-device proxy objects.  A single RemotePCIDevice is added to
> >    QEMU.  In the current patch series it only supports the LSI SCSI
> >    controller but once the socket protocol is changed to
> >    vfio-over-socket it will be possible to use any PCI device.
> > 
> > We recently agreed on dropping live migration to further reduce the
> > patch series.  If you have specific suggestions, please post reviews on
> > the latest patch series.
> 
> To clarify - the decision was to *temporarily* drop live migration, to
> make the initial patch series smaller and thus easier to merge. It does
> ultimately need live migration, so there would be followup patch series
> to provide migration support, after the initial merge.

Yes.  Live migration should come from the VFIO protocol and/or vmstate
DBus.  There is no need to implement it in a qemu-mp-specific way.

Stefan



* RE: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-01  9:17 ` Stefan Hajnoczi
  2020-04-01 15:49   ` Thanos Makatos
@ 2020-04-20 11:05   ` Thanos Makatos
  2020-04-22 15:29     ` Stefan Hajnoczi
  2020-05-14 16:32     ` John G Johnson
  1 sibling, 2 replies; 31+ messages in thread
From: Thanos Makatos @ 2020-04-20 11:05 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Walker, Benjamin, Elena Ufimtseva, john.g.johnson, Jag Raman,
	Harris,  James R, Swapnil Ingle, Konrad Rzeszutek Wilk,
	qemu-devel, Kirti Wankhede, Raphael Norwitz, Alex Williamson,
	Kanth Ghatraju, Felipe Franciosi, Marc-André Lureau, Zhang,
	Tina, Liu, Changpeng, dgilbert

> In order to interoperate we'll need to maintain a protocol
> specification.  Mayb You and JJ could put that together and CC the vfio,
> rust-vmm, and QEMU communities for discussion?
> 
> It should cover the UNIX domain socket connection semantics (does a
> listen socket only accept 1 connection at a time?  What happens when the
> client disconnects?  What happens when the server disconnects?), how
> VFIO structs are exchanged, any vfio-over-socket specific protocol
> messages, etc.  Basically everything needed to write an implementation
> (although it's not necessary to copy the VFIO struct definitions from
> the kernel headers into the spec or even document their semantics if
> they are identical to kernel VFIO).
> 
> The next step beyond the LD_PRELOAD library is a native vfio-over-socket
> client implementation in QEMU.  There is a prototype here:
> https://github.com/elmarco/qemu/blob/wip/vfio-user/hw/vfio/libvfio-
> user.c
> 
> If there are any volunteers for working on that then this would be a
> good time to discuss it.

Hi,

I've just shared with you the Google doc we've been working on with John, where
we've been drafting the protocol specification; we think it's time for some
first comments. Please feel free to comment/edit and to suggest more people to
be on the reviewers list.

You can also find the Google doc here:

https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing

If a Google doc doesn't work for you we're open to suggestions.

Thanks



* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-20 11:05   ` Thanos Makatos
@ 2020-04-22 15:29     ` Stefan Hajnoczi
  2020-04-27 10:58       ` Thanos Makatos
  2020-05-14 16:32     ` John G Johnson
  1 sibling, 1 reply; 31+ messages in thread
From: Stefan Hajnoczi @ 2020-04-22 15:29 UTC (permalink / raw)
  To: Thanos Makatos
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, john.g.johnson, Konrad Rzeszutek Wilk,
	qemu-devel, Raphael Norwitz, Marc-André Lureau,
	Kirti Wankhede, Alex Williamson, Stefan Hajnoczi,
	Felipe Franciosi, Kanth Ghatraju, Zhang, Tina, Liu, Changpeng,
	dgilbert


On Mon, Apr 20, 2020 at 11:05:25AM +0000, Thanos Makatos wrote:
> > In order to interoperate we'll need to maintain a protocol
> > specification.  Mayb You and JJ could put that together and CC the vfio,
> > rust-vmm, and QEMU communities for discussion?
> > 
> > It should cover the UNIX domain socket connection semantics (does a
> > listen socket only accept 1 connection at a time?  What happens when the
> > client disconnects?  What happens when the server disconnects?), how
> > VFIO structs are exchanged, any vfio-over-socket specific protocol
> > messages, etc.  Basically everything needed to write an implementation
> > (although it's not necessary to copy the VFIO struct definitions from
> > the kernel headers into the spec or even document their semantics if
> > they are identical to kernel VFIO).
> > 
> > The next step beyond the LD_PRELOAD library is a native vfio-over-socket
> > client implementation in QEMU.  There is a prototype here:
> > https://github.com/elmarco/qemu/blob/wip/vfio-user/hw/vfio/libvfio-
> > user.c
> > 
> > If there are any volunteers for working on that then this would be a
> > good time to discuss it.
> 
> Hi,
> 
> I've just shared with you the Google doc we've working on with John where we've
> been drafting the protocol specification, we think it's time for some first
> comments. Please feel free to comment/edit and suggest more people to be on the
> reviewers list.
> 
> You can also find the Google doc here:
> 
> https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing
> 
> If a Google doc doesn't work for you we're open to suggestions.

I can't add comments to the document so I've inlined them here:

The spec assumes the reader is already familiar with VFIO and does not
explain concepts like the device lifecycle, regions, interrupts, etc.
We don't need to duplicate detailed VFIO information, but I think the
device model should be explained so that anyone can start from the
VFIO-user spec and begin working on an implementation.  Right now I
think they would have to do some serious investigation of VFIO first in
order to be able to write code.

"only the source header files are used"
I notice the current <linux/vfio.h> header is licensed "GPL-2.0 WITH
Linux-syscall-note".  I'm not a lawyer but I guess this means there are
some restrictions on using this header file.  The <linux/virtio*.h>
header files were explicitly licensed under the BSD license to make it
easy to use the non __KERNEL__ parts.

VFIO-user Command Types: please indicate for each request type whether
it is client->server, server->client, or both.  Also is it a "command"
or "request"?

vfio_user_req_type <-- is this an extension on top of <linux/vfio.h>?
Please make it clear what is part of the base <linux/vfio.h> protocol
and what is specific to vfio-user.

VFIO_USER_READ/WRITE serve completely different purposes depending on
whether they are sent client->server or server->client.  I suggest
defining separate request type constants instead of overloading them.

What is the difference between VFIO_USER_MAP_DMA and VFIO_USER_REG_MEM?
They both seem to be client->server messages for setting up memory but
I'm not sure why two request types are needed.

struct vfio_user_req->data.  Is this really a union so that every
message has the same size, regardless of how many parameters are passed
in the data field?

"a framebuffer where the guest does multiple stores to the virtual
device."  Do you mean in SMP guests?  Or even in a single CPU guest?

Also, is there any concurrency requirement on the client and server
side?  Can I implement a client/server that processes requests
sequentially and completes them before moving on to the next request or
would that deadlock for certain message types?



* RE: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-22 15:29     ` Stefan Hajnoczi
@ 2020-04-27 10:58       ` Thanos Makatos
  2020-04-30 11:23         ` Thanos Makatos
  0 siblings, 1 reply; 31+ messages in thread
From: Thanos Makatos @ 2020-04-27 10:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, john.g.johnson, Konrad Rzeszutek Wilk,
	qemu-devel, Raphael Norwitz, Marc-André Lureau,
	Kirti Wankhede, Alex Williamson, Stefan Hajnoczi,
	Felipe Franciosi, Kanth Ghatraju, Zhang, Tina, Liu, Changpeng,
	dgilbert

> > I've just shared with you the Google doc we've working on with John
> where we've
> > been drafting the protocol specification, we think it's time for some first
> > comments. Please feel free to comment/edit and suggest more people to
> be on the
> > reviewers list.
> >
> > You can also find the Google doc here:
> >
> >
> https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_
> 7HhY471TsVwyK8/edit?usp=sharing
> >
> > If a Google doc doesn't work for you we're open to suggestions.
> 
> I can't add comments to the document so I've inlined them here:
> 
> The spec assumes the reader is already familiar with VFIO and does not
> explain concepts like the device lifecycle, regions, interrupts, etc.
> We don't need to duplicate detailed VFIO information, but I think the
> device model should be explained so that anyone can start from the
> VFIO-user spec and begin working on an implementation.  Right now I
> think they would have to do some serious investigation of VFIO first in
> order to be able to write code.

I've added a high-level overview of how VFIO is used in this context.

> "only the source header files are used"
> I notice the current <linux/vfio.h> header is licensed "GPL-2.0 WITH
> Linux-syscall-note".  I'm not a lawyer but I guess this means there are
> some restrictions on using this header file.  The <linux/virtio*.h>
> header files were explicitly licensed under the BSD license to make it
> easy to use the non __KERNEL__ parts.

My impression is that this note actually relaxes the licensing requirements, so
that proprietary software can use the system call headers and run on Linux
without being considered a derived work. In any case I'll double-check with our
legal team.
 
> VFIO-user Command Types: please indicate for each request type whether
> it is client->server, server->client, or both.  Also is it a "command"
> or "request"?

Will do. It's a command.

 
> vfio_user_req_type <-- is this an extension on top of <linux/vfio.h>?
> Please make it clear what is part of the base <linux/vfio.h> protocol
> and what is specific to vfio-user.

Correct, it's an extension over <linux/vfio.h>. I've clarified the two symbol
namespaces.

 
> VFIO_USER_READ/WRITE serve completely different purposes depending on
> whether they are sent client->server or server->client.  I suggest
> defining separate request type constants instead of overloading them.

Fixed.

> What is the difference between VFIO_USER_MAP_DMA and
> VFIO_USER_REG_MEM?
> They both seem to be client->server messages for setting up memory but
> I'm not sure why two request types are needed.

John will provide more information on this.

> struct vfio_user_req->data.  Is this really a union so that every
> message has the same size, regardless of how many parameters are passed
> in the data field?

Correct, it's a union so that every message has the same length.
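
For illustration, the overall shape is something like the sketch below; the
type, field, and command names are placeholders, not the definitions from the
draft:

    #include <stdint.h>

    struct example_dma_map {
        uint64_t addr;
        uint64_t size;
        uint32_t prot;
    };

    struct example_region_write {
        uint64_t offset;
        uint32_t region;
        uint32_t count;
    };

    struct example_req {
        uint32_t msg_id;
        uint32_t command;
        /* Every request has the same size: sizeof(struct example_req) is
         * fixed by the largest member of the union. */
        union {
            struct example_dma_map      dma;
            struct example_region_write rw;
            uint8_t                     pad[32];
        } data;
    };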

> "a framebuffer where the guest does multiple stores to the virtual
> device."  Do you mean in SMP guests?  Or even in a single CPU guest?

@John

> Also, is there any concurrency requirement on the client and server
> side?  Can I implement a client/server that processes requests
> sequentially and completes them before moving on to the next request or
> would that deadlock for certain message types?

I believe that this might also depend on the device semantics; I will need to
think about it in greater detail.

More importantly, considering:
a) Marc-André's comments about data alignment etc., and
b) the possibility of running the server on another guest or host,
we won't be able to use native VFIO types. If we do want to support that then
we'll have to redefine all data formats, similar to
https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst.

So the protocol will be more like an enhanced version of the Vhost-user protocol
than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
so we need to decide before proceeding as the request format is substantially
different.



* RE: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-27 10:58       ` Thanos Makatos
@ 2020-04-30 11:23         ` Thanos Makatos
  2020-04-30 11:40           ` Daniel P. Berrangé
  0 siblings, 1 reply; 31+ messages in thread
From: Thanos Makatos @ 2020-04-30 11:23 UTC (permalink / raw)
  To: Thanos Makatos, Stefan Hajnoczi, John G Johnson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, Konrad Rzeszutek Wilk, qemu-devel,
	Raphael Norwitz, Kirti Wankhede, Alex Williamson,
	Stefan Hajnoczi, Felipe Franciosi, Marc-André Lureau, Liu,
	Changpeng, Zhang, Tina, Kanth Ghatraju, dgilbert

> > > I've just shared with you the Google doc we've working on with John
> > where we've
> > > been drafting the protocol specification, we think it's time for some first
> > > comments. Please feel free to comment/edit and suggest more people
> to
> > be on the
> > > reviewers list.
> > >
> > > You can also find the Google doc here:
> > >
> > >
> > > https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing
> > >
> > > If a Google doc doesn't work for you we're open to suggestions.
> >
> > I can't add comments to the document so I've inlined them here:
> >
> > The spec assumes the reader is already familiar with VFIO and does not
> > explain concepts like the device lifecycle, regions, interrupts, etc.
> > We don't need to duplicate detailed VFIO information, but I think the
> > device model should be explained so that anyone can start from the
> > VFIO-user spec and begin working on an implementation.  Right now I
> > think they would have to do some serious investigation of VFIO first in
> > order to be able to write code.
> 
> I've added a high-level overview of how VFIO is used in this context.
> 
> > "only the source header files are used"
> > I notice the current <linux/vfio.h> header is licensed "GPL-2.0 WITH
> > Linux-syscall-note".  I'm not a lawyer but I guess this means there are
> > some restrictions on using this header file.  The <linux/virtio*.h>
> > header files were explicitly licensed under the BSD license to make it
> > easy to use the non __KERNEL__ parts.
> 
> My impression is that this note actually relaxes the licensing requirements, so
> that proprietary software can use the system call headers and run on Linux
> without being considered derived work. In any case I'll double check with our
> legal team.
> 
> > VFIO-user Command Types: please indicate for each request type whether
> > it is client->server, server->client, or both.  Also is it a "command"
> > or "request"?
> 
> Will do. It's a command.
> 
> 
> > vfio_user_req_type <-- is this an extension on top of <linux/vfio.h>?
> > Please make it clear what is part of the base <linux/vfio.h> protocol
> > and what is specific to vfio-user.
> 
> Correct, it's an extension over <linux/vfio.h>. I've clarified the two symbol
> namespaces.
> 
> 
> > VFIO_USER_READ/WRITE serve completely different purposes depending
> on
> > whether they are sent client->server or server->client.  I suggest
> > defining separate request type constants instead of overloading them.
> 
> Fixed.
> 
> > What is the difference between VFIO_USER_MAP_DMA and
> > VFIO_USER_REG_MEM?
> > They both seem to be client->server messages for setting up memory but
> > I'm not sure why two request types are needed.
> 
> John will provide more information on this.
> 
> > struct vfio_user_req->data.  Is this really a union so that every
> > message has the same size, regardless of how many parameters are
> passed
> > in the data field?
> 
> Correct, it's a union so that every message has the same length.
> 
> > "a framebuffer where the guest does multiple stores to the virtual
> > device."  Do you mean in SMP guests?  Or even in a single CPU guest?
> 
> @John
> 
> > Also, is there any concurrency requirement on the client and server
> > side?  Can I implement a client/server that processes requests
> > sequentially and completes them before moving on to the next request or
> > would that deadlock for certain message types?
> 
> I believe that this might also depend on the device semantics, will need to
> think about it in greater detail.

I've looked at this but can't provide a definitive answer yet. I believe the
safest thing to do is for the server to process requests in order.

> More importantly, considering:
> a) Marc-André's comments about data alignment etc., and
> b) the possibility to run the server on another guest or host,
> we won't be able to use native VFIO types. If we do want to support that
> then
> we'll have to redefine all data formats, similar to
> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst.
> 
> So the protocol will be more like an enhanced version of the Vhost-user
> protocol
> than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
> so we need to decide before proceeding as the request format is
> substantially
> different.

Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
support this future use case without unnecessarily complicating the protocol by
defining the C structs and stating that data alignment and endianness for the
non-AF_UNIX case must be those used by GCC on an x86_64 machine, or can
be overridden as required.
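
As a sketch of what stating the layout explicitly could look like (names here
are illustrative only; packed structs plus fixed-width, little-endian fields
converted with the glibc htole*() helpers):

    /* Illustrative only: a wire struct with no implicit padding and a
     * stated byte order, so layout does not depend on compiler or arch. */
    #include <endian.h>   /* htole16()/htole32(), glibc */
    #include <stdint.h>

    struct example_wire_hdr {
        uint16_t msg_id;
        uint16_t command;
        uint32_t size;
        uint32_t flags;
    } __attribute__((packed));

    static void example_hdr_to_wire(struct example_wire_hdr *h)
    {
        h->msg_id  = htole16(h->msg_id);
        h->command = htole16(h->command);
        h->size    = htole32(h->size);
        h->flags   = htole32(h->flags);
    }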

We've polished the document a bit more with the help of Marc-André,
Raphael, and Swapnil.

The major outstanding issue is agreeing on having a pair of commands for
registering/unregistering guest memory and another pair of commands to map/unmap
DMA regions, similar to QEMU.




* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-30 11:23         ` Thanos Makatos
@ 2020-04-30 11:40           ` Daniel P. Berrangé
  2020-04-30 15:20             ` Thanos Makatos
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel P. Berrangé @ 2020-04-30 11:40 UTC (permalink / raw)
  To: Thanos Makatos
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Harris, James R,
	Swapnil Ingle, John G Johnson, Stefan Hajnoczi,
	Konrad Rzeszutek Wilk, qemu-devel, Raphael Norwitz,
	Kirti Wankhede, Alex Williamson, Stefan Hajnoczi,
	Felipe Franciosi, Kanth Ghatraju, Marc-André Lureau, Zhang,
	Tina, Liu, Changpeng, dgilbert

On Thu, Apr 30, 2020 at 11:23:34AM +0000, Thanos Makatos wrote:
> > > > I've just shared with you the Google doc we've working on with John
> > > where we've
> > > > been drafting the protocol specification, we think it's time for some first
> > > > comments. Please feel free to comment/edit and suggest more people
> > to
> > > be on the
> > > > reviewers list.
> > > >
> > > > You can also find the Google doc here:
> > > >
> > > >
> > > > https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing
> > > >
> > > > If a Google doc doesn't work for you we're open to suggestions.
> > >
> > > I can't add comments to the document so I've inlined them here:
> > >
> > > The spec assumes the reader is already familiar with VFIO and does not
> > > explain concepts like the device lifecycle, regions, interrupts, etc.
> > > We don't need to duplicate detailed VFIO information, but I think the
> > > device model should be explained so that anyone can start from the
> > > VFIO-user spec and begin working on an implementation.  Right now I
> > > think they would have to do some serious investigation of VFIO first in
> > > order to be able to write code.
> > 
> > I've added a high-level overview of how VFIO is used in this context.
> > 
> > > "only the source header files are used"
> > > I notice the current <linux/vfio.h> header is licensed "GPL-2.0 WITH
> > > Linux-syscall-note".  I'm not a lawyer but I guess this means there are
> > > some restrictions on using this header file.  The <linux/virtio*.h>
> > > header files were explicitly licensed under the BSD license to make it
> > > easy to use the non __KERNEL__ parts.
> > 
> > My impression is that this note actually relaxes the licensing requirements, so
> > that proprietary software can use the system call headers and run on Linux
> > without being considered derived work. In any case I'll double check with our
> > legal team.
> > 
> > > VFIO-user Command Types: please indicate for each request type whether
> > > it is client->server, server->client, or both.  Also is it a "command"
> > > or "request"?
> > 
> > Will do. It's a command.
> > 
> > 
> > > vfio_user_req_type <-- is this an extension on top of <linux/vfio.h>?
> > > Please make it clear what is part of the base <linux/vfio.h> protocol
> > > and what is specific to vfio-user.
> > 
> > Correct, it's an extension over <linux/vfio.h>. I've clarified the two symbol
> > namespaces.
> > 
> > 
> > > VFIO_USER_READ/WRITE serve completely different purposes depending
> > on
> > > whether they are sent client->server or server->client.  I suggest
> > > defining separate request type constants instead of overloading them.
> > 
> > Fixed.
> > 
> > > What is the difference between VFIO_USER_MAP_DMA and
> > > VFIO_USER_REG_MEM?
> > > They both seem to be client->server messages for setting up memory but
> > > I'm not sure why two request types are needed.
> > 
> > John will provide more information on this.
> > 
> > > struct vfio_user_req->data.  Is this really a union so that every
> > > message has the same size, regardless of how many parameters are
> > passed
> > > in the data field?
> > 
> > Correct, it's a union so that every message has the same length.
> > 
> > > "a framebuffer where the guest does multiple stores to the virtual
> > > device."  Do you mean in SMP guests?  Or even in a single CPU guest?
> > 
> > @John
> > 
> > > Also, is there any concurrency requirement on the client and server
> > > side?  Can I implement a client/server that processes requests
> > > sequentially and completes them before moving on to the next request or
> > > would that deadlock for certain message types?
> > 
> > I believe that this might also depend on the device semantics, will need to
> > think about it in greater detail.
> 
> I've looked at this but can't provide a definitive answer yet. I believe the
> safest thing to do is for the server to process requests in order.
> 
> > More importantly, considering:
> > a) Marc-André's comments about data alignment etc., and
> > b) the possibility to run the server on another guest or host,
> > we won't be able to use native VFIO types. If we do want to support that
> > then
> > we'll have to redefine all data formats, similar to
> > https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst.
> > 
> > So the protocol will be more like an enhanced version of the Vhost-user
> > protocol
> > than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
> > so we need to decide before proceeding as the request format is
> > substantially
> > different.
> 
> Regarding the ability to use the protocol on non-AF_UNIX sockets, we can 
> support this future use case without unnecessarily complicating the protocol by
> defining the C structs and stating that data alignment and endianness for the 
> non AF_UNIX case must be the one used by GCC on a x86_64 bit machine, or can 
> be overridden as required.

Defining it to be x86_64 semantics is effectively saying "we're not going
to do anything and it is up to other arch maintainers to fix the inevitable
portability problems that arise".

Since this is a new protocol, should we take the opportunity to model it
explicitly in some common, standard RPC protocol language? This would have
the benefit of allowing implementors to use off-the-shelf APIs for their
wire-protocol marshalling, and would eliminate questions about endianness and
alignment across architectures.



Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-30 11:40           ` Daniel P. Berrangé
@ 2020-04-30 15:20             ` Thanos Makatos
  2020-05-01 15:01               ` Felipe Franciosi
  0 siblings, 1 reply; 31+ messages in thread
From: Thanos Makatos @ 2020-04-30 15:20 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Harris,  James R,
	Swapnil Ingle, John G Johnson, Stefan Hajnoczi,
	Konrad Rzeszutek Wilk, qemu-devel, Raphael Norwitz,
	Kirti Wankhede, Alex Williamson, Stefan Hajnoczi,
	Felipe Franciosi, Kanth Ghatraju, Marc-André Lureau, Zhang,
	Tina, Liu, Changpeng, dgilbert

> > > More importantly, considering:
> > > a) Marc-André's comments about data alignment etc., and
> > > b) the possibility to run the server on another guest or host,
> > > we won't be able to use native VFIO types. If we do want to support that
> > > then
> > > we'll have to redefine all data formats, similar to
> > > https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
> > >
> > > So the protocol will be more like an enhanced version of the Vhost-user
> > > protocol
> > > than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
> > > so we need to decide before proceeding as the request format is
> > > substantially
> > > different.
> >
> > Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
> > support this future use case without unnecessarily complicating the
> protocol by
> > defining the C structs and stating that data alignment and endianness for
> the
> > non AF_UNIX case must be the one used by GCC on a x86_64 bit machine,
> or can
> > be overridden as required.
> 
> Defining it to be x86_64 semantics is effectively saying "we're not going
> to do anything and it is up to other arch maintainers to fix the inevitable
> portability problems that arise".

Pretty much.
 
> Since this is a new protocol should we take the opportunity to model it
> explicitly in some common standard RPC protocol language. This would have
> the benefit of allowing implementors to use off the shelf APIs for their
> wire protocol marshalling, and eliminate questions about endianness and
> alignment across architectures.

The problem is that we haven't defined the scope very well. My initial impression
was that we should use the existing VFIO structs and constants; however, that's
impossible if we're to support non-AF_UNIX transports. We need consensus on this,
and we're open to ideas on how to do it.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-30 15:20             ` Thanos Makatos
@ 2020-05-01 15:01               ` Felipe Franciosi
  2020-05-01 15:28                 ` Daniel P. Berrangé
  0 siblings, 1 reply; 31+ messages in thread
From: Felipe Franciosi @ 2020-05-01 15:01 UTC (permalink / raw)
  To: Thanos Makatos, Daniel P. Berrangé, Stefan Hajnoczi
  Cc: Walker, Benjamin, John G Johnson, Jag Raman, Harris, James R,
	Swapnil Ingle, Konrad Rzeszutek Wilk, qemu-devel,
	Elena Ufimtseva, Raphael Norwitz, Kirti Wankhede,
	Alex Williamson, Stefan Hajnoczi, Kanth Ghatraju,
	Marc-André Lureau, Zhang, Tina, Liu, Changpeng, dgilbert

Hi,

> On Apr 30, 2020, at 4:20 PM, Thanos Makatos <thanos.makatos@nutanix.com> wrote:
> 
>>>> More importantly, considering:
>>>> a) Marc-André's comments about data alignment etc., and
>>>> b) the possibility to run the server on another guest or host,
>>>> we won't be able to use native VFIO types. If we do want to support that
>>>> then
>>>> we'll have to redefine all data formats, similar to
>>>> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
>>>> 
>>>> So the protocol will be more like an enhanced version of the Vhost-user
>>>> protocol
>>>> than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
>>>> so we need to decide before proceeding as the request format is
>>>> substantially
>>>> different.
>>> 
>>> Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
>>> support this future use case without unnecessarily complicating the
>> protocol by
>>> defining the C structs and stating that data alignment and endianness for
>> the
>>> non AF_UNIX case must be the one used by GCC on a x86_64 bit machine,
>> or can
>>> be overridden as required.
>> 
>> Defining it to be x86_64 semantics is effectively saying "we're not going
>> to do anything and it is up to other arch maintainers to fix the inevitable
>> portability problems that arise".
> 
> Pretty much.
> 
>> Since this is a new protocol should we take the opportunity to model it
>> explicitly in some common standard RPC protocol language. This would have
>> the benefit of allowing implementors to use off the shelf APIs for their
>> wire protocol marshalling, and eliminate questions about endianness and
>> alignment across architectures.
> 
> The problem is that we haven't defined the scope very well. My initial impression 
> was that we should use the existing VFIO structs and constants, however that's 
> impossible if we're to support non AF_UNIX. We need consensus on this, we're 
> open to ideas how to do this.

Thanos has a point.

From https://wiki.qemu.org/Features/MultiProcessQEMU, which I believe
was written by Stefan, I read:

> Inventing a new device emulation protocol from scratch has many
> disadvantages. VFIO could be used as the protocol to avoid reinventing
> the wheel ...

At the same time, this appears to be incompatible with the (new?)
requirement of supporting device emulation which may run on non-VFIO-compliant
OSes or even across OSes (i.e., via TCP or similar).

We are happy to support what the community agrees on, but it seems
like there isn't an agreement. Is it worth all of us jumping into
another call to realign?

Cheers,
F.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-05-01 15:01               ` Felipe Franciosi
@ 2020-05-01 15:28                 ` Daniel P. Berrangé
  2020-05-04  9:45                   ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel P. Berrangé @ 2020-05-01 15:28 UTC (permalink / raw)
  To: Felipe Franciosi
  Cc: Walker, Benjamin, John G Johnson, Jag Raman, Harris, James R,
	Swapnil Ingle, Konrad Rzeszutek Wilk, Stefan Hajnoczi,
	qemu-devel, Elena Ufimtseva, Raphael Norwitz, Kirti Wankhede,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	Kanth Ghatraju, Thanos Makatos, Zhang, Tina, Liu, Changpeng,
	dgilbert

On Fri, May 01, 2020 at 03:01:01PM +0000, Felipe Franciosi wrote:
> Hi,
> 
> > On Apr 30, 2020, at 4:20 PM, Thanos Makatos <thanos.makatos@nutanix.com> wrote:
> > 
> >>>> More importantly, considering:
> >>>> a) Marc-André's comments about data alignment etc., and
> >>>> b) the possibility to run the server on another guest or host,
> >>>> we won't be able to use native VFIO types. If we do want to support that
> >>>> then
> >>>> we'll have to redefine all data formats, similar to
> >>>> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
> >>>> 
> >>>> So the protocol will be more like an enhanced version of the Vhost-user
> >>>> protocol
> >>>> than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
> >>>> so we need to decide before proceeding as the request format is
> >>>> substantially
> >>>> different.
> >>> 
> >>> Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
> >>> support this future use case without unnecessarily complicating the
> >> protocol by
> >>> defining the C structs and stating that data alignment and endianness for
> >> the
> >>> non AF_UNIX case must be the one used by GCC on a x86_64 bit machine,
> >> or can
> >>> be overridden as required.
> >> 
> >> Defining it to be x86_64 semantics is effectively saying "we're not going
> >> to do anything and it is up to other arch maintainers to fix the inevitable
> >> portability problems that arise".
> > 
> > Pretty much.
> > 
> >> Since this is a new protocol should we take the opportunity to model it
> >> explicitly in some common standard RPC protocol language. This would have
> >> the benefit of allowing implementors to use off the shelf APIs for their
> >> wire protocol marshalling, and eliminate questions about endianness and
> >> alignment across architectures.
> > 
> > The problem is that we haven't defined the scope very well. My initial impression 
> > was that we should use the existing VFIO structs and constants, however that's 
> > impossible if we're to support non AF_UNIX. We need consensus on this, we're 
> > open to ideas how to do this.
> 
> Thanos has a point.
> 
> From https://wiki.qemu.org/Features/MultiProcessQEMU, which I believe
> was written by Stefan, I read:
> 
> > Inventing a new device emulation protocol from scratch has many
> > disadvantages. VFIO could be used as the protocol to avoid reinventing
> > the wheel ...
> 
> At the same time, this appears to be incompatible with the (new?)
> requirement of supporting device emulation which may run in non-VFIO
> compliant OSs or even across OSs (ie. via TCP or similar).

To be clear, I don't have any opinion on whether we need to support
cross-OS/TCP or not.

I'm merely saying that if we do decide to support cross-OS/TCP, then
I think we need a more explicitly modelled protocol, instead of relying
on serialization of C structs.

There could be benefits to an explicitly modelled protocol, even for
local-only usage, if we want to more easily support non-C languages
doing serialization, but again I don't have a strong opinion on whether
that's necessary to worry about or not.

So I guess largely the question boils down to setting the scope of
what we want to be able to achieve in terms of RPC endpoints.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-05-01 15:28                 ` Daniel P. Berrangé
@ 2020-05-04  9:45                   ` Stefan Hajnoczi
  2020-05-04 17:49                     ` John G Johnson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefan Hajnoczi @ 2020-05-04  9:45 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Walker, Benjamin, John G Johnson, Jag Raman, Harris, James R,
	Swapnil Ingle, Konrad Rzeszutek Wilk, qemu-devel,
	Elena Ufimtseva, Raphael Norwitz, Marc-André Lureau,
	Kirti Wankhede, Alex Williamson, Stefan Hajnoczi,
	Felipe Franciosi, Kanth Ghatraju, Thanos Makatos, Zhang, Tina,
	Liu, Changpeng, dgilbert

[-- Attachment #1: Type: text/plain, Size: 5407 bytes --]

On Fri, May 01, 2020 at 04:28:25PM +0100, Daniel P. Berrangé wrote:
> On Fri, May 01, 2020 at 03:01:01PM +0000, Felipe Franciosi wrote:
> > Hi,
> > 
> > > On Apr 30, 2020, at 4:20 PM, Thanos Makatos <thanos.makatos@nutanix.com> wrote:
> > > 
> > >>>> More importantly, considering:
> > >>>> a) Marc-André's comments about data alignment etc., and
> > >>>> b) the possibility to run the server on another guest or host,
> > >>>> we won't be able to use native VFIO types. If we do want to support that
> > >>>> then
> > >>>> we'll have to redefine all data formats, similar to
> > >>>> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
> > >>>> 
> > >>>> So the protocol will be more like an enhanced version of the Vhost-user
> > >>>> protocol
> > >>>> than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
> > >>>> so we need to decide before proceeding as the request format is
> > >>>> substantially
> > >>>> different.
> > >>> 
> > >>> Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
> > >>> support this future use case without unnecessarily complicating the
> > >> protocol by
> > >>> defining the C structs and stating that data alignment and endianness for
> > >> the
> > >>> non AF_UNIX case must be the one used by GCC on a x86_64 bit machine,
> > >> or can
> > >>> be overridden as required.
> > >> 
> > >> Defining it to be x86_64 semantics is effectively saying "we're not going
> > >> to do anything and it is up to other arch maintainers to fix the inevitable
> > >> portability problems that arise".
> > > 
> > > Pretty much.
> > > 
> > >> Since this is a new protocol should we take the opportunity to model it
> > >> explicitly in some common standard RPC protocol language. This would have
> > >> the benefit of allowing implementors to use off the shelf APIs for their
> > >> wire protocol marshalling, and eliminate questions about endianness and
> > >> alignment across architectures.
> > > 
> > > The problem is that we haven't defined the scope very well. My initial impression 
> > > was that we should use the existing VFIO structs and constants, however that's 
> > > impossible if we're to support non AF_UNIX. We need consensus on this, we're 
> > > open to ideas how to do this.
> > 
> > Thanos has a point.
> > 
> > From https://wiki.qemu.org/Features/MultiProcessQEMU, which I believe
> > was written by Stefan, I read:
> > 
> > > Inventing a new device emulation protocol from scratch has many
> > > disadvantages. VFIO could be used as the protocol to avoid reinventing
> > > the wheel ...
> > 
> > At the same time, this appears to be incompatible with the (new?)
> > requirement of supporting device emulation which may run in non-VFIO
> > compliant OSs or even across OSs (ie. via TCP or similar).
> 
> To be clear, I don't have any opinion on whether we need to support
> cross-OS/TCP or not.
> 
> I'm merely saying that if we do decide to support cross-OS/TCP, then
> I think we need a more explicitly modelled protocol, instead of relying
> on serialization of C structs.
> 
> There could be benefits to an explicitly modelled protocol, even for
> local only usage, if we want to more easily support non-C languages
> doing serialization, but again I don't have a strong opinion on whether
> that's neccessary to worry about or not.
> 
> So I guess largely the question boils down to setting the scope of
> what we want to be able to achieve in terms of RPC endpoints.

The protocol relies on both file descriptor and memory mapping. These
are hard to achieve with networking.
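
(To make this concrete, here is a minimal sketch, not taken from the spec, of
what the server does with a guest-RAM file descriptor once it has been passed
over the AF_UNIX socket: it simply mmap()s it and can then access guest memory
directly for DMA.  The names region_fd/region_size/region_offset are
illustrative only.)

        #include <stddef.h>
        #include <sys/mman.h>
        #include <sys/types.h>

        /* Map a guest memory region the client shared with us over the socket. */
        static void *map_guest_region(int region_fd, size_t region_size,
                                      off_t region_offset)
        {
            void *va = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, region_fd, region_offset);

            return va == MAP_FAILED ? NULL : va;
        }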

I think the closest would be using RDMA to accelerate memory access and
switching to a network notification mechanism instead of eventfd.

Sooner or later someone will probably try this. I don't think it makes
sense to define this transport in detail now if there are no users, but
we should try to make it possible to add it in the future, if necessary.

Another use case that is interesting and not yet directly addressed is:
how can another VM play the role of the device? This is important in
compute cloud environments where everything is a VM and running a
process on the host is not possible.

The virtio-vhost-user prototype showed that it's possible to add this on
top of an existing vhost-user style protocol by terminating the
connection in the device VMM and then communicating with the device
using a new VIRTIO device. Maybe that's the way to do it here too and we
don't need to worry about explicitly designing that into the vfio-user
protocol, but if anyone has other approaches in mind then let's discuss
them now.

Finally, I think the goal of integrating this new protocol into the
existing vfio component of VMMs is a good idea. Sticking closely to the
<linux/vfio.h> interface will help in this regard. The further away we
get, the harder it will be to fit it into the vfio code in existing VMMs
and the harder it will be for users to configure the VMM along the lines
for how vfio works today.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-05-04  9:45                   ` Stefan Hajnoczi
@ 2020-05-04 17:49                     ` John G Johnson
  2020-05-11 14:37                       ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: John G Johnson @ 2020-05-04 17:49 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Harris, James R,
	Swapnil Ingle, Konrad Rzeszutek Wilk, Felipe Franciosi,
	qemu-devel, Raphael Norwitz, Kirti Wankhede, Thanos Makatos,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	"Daniel P. Berrangé",
	Liu, Changpeng, Zhang, Tina, Kanth Ghatraju, dgilbert



> On May 4, 2020, at 2:45 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
> On Fri, May 01, 2020 at 04:28:25PM +0100, Daniel P. Berrangé wrote:
>> On Fri, May 01, 2020 at 03:01:01PM +0000, Felipe Franciosi wrote:
>>> Hi,
>>> 
>>>> On Apr 30, 2020, at 4:20 PM, Thanos Makatos <thanos.makatos@nutanix.com> wrote:
>>>> 
>>>>>>> More importantly, considering:
>>>>>>> a) Marc-André's comments about data alignment etc., and
>>>>>>> b) the possibility to run the server on another guest or host,
>>>>>>> we won't be able to use native VFIO types. If we do want to support that
>>>>>>> then
>>>>>>> we'll have to redefine all data formats, similar to
>>>>>>> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
>>>>>>> 
>>>>>>> So the protocol will be more like an enhanced version of the Vhost-user
>>>>>>> protocol
>>>>>>> than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
>>>>>>> so we need to decide before proceeding as the request format is
>>>>>>> substantially
>>>>>>> different.
>>>>>> 
>>>>>> Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
>>>>>> support this future use case without unnecessarily complicating the
>>>>> protocol by
>>>>>> defining the C structs and stating that data alignment and endianness for
>>>>> the
>>>>>> non AF_UNIX case must be the one used by GCC on a x86_64 bit machine,
>>>>> or can
>>>>>> be overridden as required.
>>>>> 
>>>>> Defining it to be x86_64 semantics is effectively saying "we're not going
>>>>> to do anything and it is up to other arch maintainers to fix the inevitable
>>>>> portability problems that arise".
>>>> 
>>>> Pretty much.
>>>> 
>>>>> Since this is a new protocol should we take the opportunity to model it
>>>>> explicitly in some common standard RPC protocol language. This would have
>>>>> the benefit of allowing implementors to use off the shelf APIs for their
>>>>> wire protocol marshalling, and eliminate questions about endianness and
>>>>> alignment across architectures.
>>>> 
>>>> The problem is that we haven't defined the scope very well. My initial impression 
>>>> was that we should use the existing VFIO structs and constants, however that's 
>>>> impossible if we're to support non AF_UNIX. We need consensus on this, we're 
>>>> open to ideas how to do this.
>>> 
>>> Thanos has a point.
>>> 
>>> From https://wiki.qemu.org/Features/MultiProcessQEMU, which I believe
>>> was written by Stefan, I read:
>>> 
>>>> Inventing a new device emulation protocol from scratch has many
>>>> disadvantages. VFIO could be used as the protocol to avoid reinventing
>>>> the wheel ...
>>> 
>>> At the same time, this appears to be incompatible with the (new?)
>>> requirement of supporting device emulation which may run in non-VFIO
>>> compliant OSs or even across OSs (ie. via TCP or similar).
>> 
>> To be clear, I don't have any opinion on whether we need to support
>> cross-OS/TCP or not.
>> 
>> I'm merely saying that if we do decide to support cross-OS/TCP, then
>> I think we need a more explicitly modelled protocol, instead of relying
>> on serialization of C structs.
>> 
>> There could be benefits to an explicitly modelled protocol, even for
>> local only usage, if we want to more easily support non-C languages
>> doing serialization, but again I don't have a strong opinion on whether
>> that's neccessary to worry about or not.
>> 
>> So I guess largely the question boils down to setting the scope of
>> what we want to be able to achieve in terms of RPC endpoints.
> 
> The protocol relies on both file descriptor and memory mapping. These
> are hard to achieve with networking.
> 
> I think the closest would be using RDMA to accelerate memory access and
> switching to a network notification mechanism instead of eventfd.
> 
> Sooner or later someone will probably try this. I don't think it makes
> sense to define this transport in detail now if there are no users, but
> we should try to make it possible to add it in the future, if necessary.
> 
> Another use case that is interesting and not yet directly addressed is:
> how can another VM play the role of the device? This is important in
> compute cloud environments where everything is a VM and running a
> process on the host is not possible.
> 

	Cross-VM is not a lot different from networking.  You can’t
use AF_UNIX; and AF_VSOCK and AF_INET do not support FD passing.
You’d either have to add FD passing to AF_VSOCK, which will have
some security issues, or fall back to message passing that will
degrade performance.  You can skip the byte ordering issues, however,
when it’s the same host.

							JJ



> The virtio-vhost-user prototype showed that it's possible to add this on
> top of an existing vhost-user style protocol by terminating the
> connection in the device VMM and then communicating with the device
> using a new VIRTIO device. Maybe that's the way to do it here too and we
> don't need to worry about explicitly designing that into the vfio-user
> protocol, but if anyone has other approaches in mind then let's discuss
> them now.
> 
> Finally, I think the goal of integrating this new protocol into the
> existing vfio component of VMMs is a good idea. Sticking closely to the
> <linux/vfio.h> interface will help in this regard. The further away we
> get, the harder it will be to fit it into the vfio code in existing VMMs
> and the harder it will be for users to configure the VMM along the lines
> for how vfio works today.
> 
> Stefan



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-05-04 17:49                     ` John G Johnson
@ 2020-05-11 14:37                       ` Stefan Hajnoczi
  0 siblings, 0 replies; 31+ messages in thread
From: Stefan Hajnoczi @ 2020-05-11 14:37 UTC (permalink / raw)
  To: John G Johnson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Harris, James R,
	Swapnil Ingle, Konrad Rzeszutek Wilk, Felipe Franciosi,
	qemu-devel, Raphael Norwitz, Kirti Wankhede, Thanos Makatos,
	Alex Williamson, Stefan Hajnoczi, Marc-André Lureau,
	"Daniel P. Berrangé",
	Liu, Changpeng, Zhang, Tina, Kanth Ghatraju, dgilbert

[-- Attachment #1: Type: text/plain, Size: 6334 bytes --]

On Mon, May 04, 2020 at 10:49:11AM -0700, John G Johnson wrote:
> 
> 
> > On May 4, 2020, at 2:45 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > 
> > On Fri, May 01, 2020 at 04:28:25PM +0100, Daniel P. Berrangé wrote:
> >> On Fri, May 01, 2020 at 03:01:01PM +0000, Felipe Franciosi wrote:
> >>> Hi,
> >>> 
> >>>> On Apr 30, 2020, at 4:20 PM, Thanos Makatos <thanos.makatos@nutanix.com> wrote:
> >>>> 
> >>>>>>> More importantly, considering:
> >>>>>>> a) Marc-André's comments about data alignment etc., and
> >>>>>>> b) the possibility to run the server on another guest or host,
> >>>>>>> we won't be able to use native VFIO types. If we do want to support that
> >>>>>>> then
> >>>>>>> we'll have to redefine all data formats, similar to
> >>>>>>> https://github.com/qemu/qemu/blob/master/docs/interop/vhost-user.rst .
> >>>>>>> 
> >>>>>>> So the protocol will be more like an enhanced version of the Vhost-user
> >>>>>>> protocol
> >>>>>>> than VFIO. I'm fine with either direction (VFIO vs. enhanced Vhost-user),
> >>>>>>> so we need to decide before proceeding as the request format is
> >>>>>>> substantially
> >>>>>>> different.
> >>>>>> 
> >>>>>> Regarding the ability to use the protocol on non-AF_UNIX sockets, we can
> >>>>>> support this future use case without unnecessarily complicating the
> >>>>> protocol by
> >>>>>> defining the C structs and stating that data alignment and endianness for
> >>>>> the
> >>>>>> non AF_UNIX case must be the one used by GCC on a x86_64 bit machine,
> >>>>> or can
> >>>>>> be overridden as required.
> >>>>> 
> >>>>> Defining it to be x86_64 semantics is effectively saying "we're not going
> >>>>> to do anything and it is up to other arch maintainers to fix the inevitable
> >>>>> portability problems that arise".
> >>>> 
> >>>> Pretty much.
> >>>> 
> >>>>> Since this is a new protocol should we take the opportunity to model it
> >>>>> explicitly in some common standard RPC protocol language. This would have
> >>>>> the benefit of allowing implementors to use off the shelf APIs for their
> >>>>> wire protocol marshalling, and eliminate questions about endianness and
> >>>>> alignment across architectures.
> >>>> 
> >>>> The problem is that we haven't defined the scope very well. My initial impression 
> >>>> was that we should use the existing VFIO structs and constants, however that's 
> >>>> impossible if we're to support non AF_UNIX. We need consensus on this, we're 
> >>>> open to ideas how to do this.
> >>> 
> >>> Thanos has a point.
> >>> 
> >>> From https://wiki.qemu.org/Features/MultiProcessQEMU, which I believe
> >>> was written by Stefan, I read:
> >>> 
> >>>> Inventing a new device emulation protocol from scratch has many
> >>>> disadvantages. VFIO could be used as the protocol to avoid reinventing
> >>>> the wheel ...
> >>> 
> >>> At the same time, this appears to be incompatible with the (new?)
> >>> requirement of supporting device emulation which may run in non-VFIO
> >>> compliant OSs or even across OSs (ie. via TCP or similar).
> >> 
> >> To be clear, I don't have any opinion on whether we need to support
> >> cross-OS/TCP or not.
> >> 
> >> I'm merely saying that if we do decide to support cross-OS/TCP, then
> >> I think we need a more explicitly modelled protocol, instead of relying
> >> on serialization of C structs.
> >> 
> >> There could be benefits to an explicitly modelled protocol, even for
> >> local only usage, if we want to more easily support non-C languages
> >> doing serialization, but again I don't have a strong opinion on whether
> >> that's neccessary to worry about or not.
> >> 
> >> So I guess largely the question boils down to setting the scope of
> >> what we want to be able to achieve in terms of RPC endpoints.
> > 
> > The protocol relies on both file descriptor and memory mapping. These
> > are hard to achieve with networking.
> > 
> > I think the closest would be using RDMA to accelerate memory access and
> > switching to a network notification mechanism instead of eventfd.
> > 
> > Sooner or later someone will probably try this. I don't think it makes
> > sense to define this transport in detail now if there are no users, but
> > we should try to make it possible to add it in the future, if necessary.
> > 
> > Another use case that is interesting and not yet directly addressed is:
> > how can another VM play the role of the device? This is important in
> > compute cloud environments where everything is a VM and running a
> > process on the host is not possible.
> > 
> 
> 	Cross-VM is not a lot different from networking.  You can’t
> use AF_UNIX; and AF_VSOCK and AF_INET do not support FD passing.
> You’d either have to add FD passing to AF_VSOCK, which will have
> some security issues, or fall back to message passing that will
> degrade performance.

In the approach where vfio-user terminates in the device VMM and the
device guest uses a new virtio-vhost-user-style device, we can continue
to use AF_UNIX with file descriptor passing on the host. The vfio-user
protocol doesn't need to be extended like it would for AF_VSOCK/AF_INET.

   Driver guest                              Device guest
        ^                                         ^
	| PCI device             virtio-vfio-user |
	v                                         v
    Driver VMM  <---- vfio-user AF_UNIX ----> Device VMM

It does not require changing the vfio-user protocol because the driver
VMM is talking to a regular vfio-user device process that happens to be
the device VMM.

The trick is that the device VMM makes the shared memory accessible as
VIRTIO shared memory regions (already in the VIRTIO spec) and eventfds
as VIRTIO doorbells/interrupts (proposed but not yet added to the VIRTIO
spec). This allows the device guest to directly access these resources
so it can DMA to the driver guest's RAM, inject interrupts, and receive
doorbell notifications.
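
(For illustration only: with eventfd-based signalling, "injecting an
interrupt" or ringing a doorbell boils down to an 8-byte write to a file
descriptor that was passed over the socket earlier.  A minimal sketch:)

        #include <stdint.h>
        #include <unistd.h>

        /* Signal an interrupt or doorbell by writing to its eventfd. */
        static int signal_eventfd(int efd)
        {
            uint64_t val = 1;

            return write(efd, &val, sizeof(val)) == sizeof(val) ? 0 : -1;
        }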

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-04-20 11:05   ` Thanos Makatos
  2020-04-22 15:29     ` Stefan Hajnoczi
@ 2020-05-14 16:32     ` John G Johnson
  2020-05-14 19:20       ` Alex Williamson
  2020-05-21  0:45       ` John G Johnson
  1 sibling, 2 replies; 31+ messages in thread
From: John G Johnson @ 2020-05-14 16:32 UTC (permalink / raw)
  To: Thanos Makatos, Stefan Hajnoczi, Walker, Benjamin,
	Elena Ufimtseva, Jag Raman, Harris, James R, Swapnil Ingle,
	Konrad Rzeszutek Wilk, qemu-devel, Kirti Wankhede,
	Raphael Norwitz, Alex Williamson, Kanth Ghatraju,
	Felipe Franciosi, Marc-André Lureau, Zhang, Tina, Liu,
	Changpeng, dgilbert


	Thanos and I have made some changes to the doc in response to the
feedback we’ve received.  The biggest difference is that it is less reliant
on the reader being familiar with the current VFIO implementation.  We’d
appreciate any additional feedback you could give on the changes.  Thanks
in advance.

							Thanos and JJ


The link remains the same:

https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing



> On Apr 20, 2020, at 4:05 AM, Thanos Makatos <thanos.makatos@nutanix.com> wrote:
> 
> Hi,
> 
> I've just shared with you the Google doc we've working on with John where we've
> been drafting the protocol specification, we think it's time for some first
> comments. Please feel free to comment/edit and suggest more people to be on the
> reviewers list.
> 
> You can also find the Google doc here:
> 
> https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing
> 
> If a Google doc doesn't work for you we're open to suggestions.
> 
> Thanks
> 



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-05-14 16:32     ` John G Johnson
@ 2020-05-14 19:20       ` Alex Williamson
  2020-05-21  0:45       ` John G Johnson
  1 sibling, 0 replies; 31+ messages in thread
From: Alex Williamson @ 2020-05-14 19:20 UTC (permalink / raw)
  To: John G Johnson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, Konrad Rzeszutek Wilk, qemu-devel,
	Raphael Norwitz, Marc-André Lureau, Kirti Wankhede,
	Kanth Ghatraju, Stefan Hajnoczi, Felipe Franciosi,
	Thanos Makatos, Zhang, Tina, Liu, Changpeng, dgilbert

On Thu, 14 May 2020 09:32:15 -0700
John G Johnson <john.g.johnson@oracle.com> wrote:

> 	Thanos and I have made some changes to the doc in response to the
> feedback we’ve received.  The biggest difference is that it is less reliant
> on the reader being familiar with the current VFIO implementation.  We’d
> appreciate any additional feedback you could give on the changes.  Thanks
> in advance.
> 
> 							Thanos and JJ
> 
> 
> The link remains the same:
> 
> https://docs.google.com/document/d/1FspkL0hVEnZqHbdoqGLUpyC38rSk_7HhY471TsVwyK8/edit?usp=sharing

Hi,

I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
The former seems intended to provide the server with access to the
entire GPA space, while the latter indicates an IOVA to GPA mapping of
those regions.  Doesn't this break the basic isolation of a vIOMMU?
This essentially says to me "here's all the guest memory, but please
only access these regions for which we're providing DMA mappings".
That invites abuse.

Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
"an array of file descriptors will be sent as part of the message
meta-data" works.  Also consider s/SUB/DEL/.  Why is the Device ID in
the table specified as 0?  How does a client learn their Device ID?

VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
capability chain), the cap_offset and next pointers within the chain
need to specify what their offset is relative to (ie. the start of the
packet, the start of the vfio compatible data structure, etc).  I
assume the latter for client compatibility.

Also on REGION_INFO, offset is specified as "the base offset to be
given to the mmap() call for regions with the MMAP attribute".  Base
offset from what?  Is the mmap performed on the socket fd?  Do we not
allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
Why do we specify "MMIO" in those operations versus simply "REGION"?
Are we arbitrarily excluding support for I/O port regions or device
specific regions?  If these commands replace direct read and write to
an fd offset, how is PCI config space handled?

VFIO_USER_MMIO_READ specifies the count field is zero and the reply
will include the count specifying the amount of data read.  How does
the client specify how much data to read?  Via message size?

VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA?  IMO the device
should only ever have access via IOVA, which implies a DMA mapping
exists for the device.  Can you provide an example of why we need these
commands since there seems little point to this interface if a device
cannot directly interact with VM memory.

The IOMMU commands should be unnecessary, a vIOMMU should be
transparent to the server by virtue that the device only knows about
IOVA mappings accessible to the device.  Requiring the client to expose
all memory to the server implies that the server must always be trusted.

Interrupt info format, s/type/index/, s/vector/subindex/

In addition to the unused ioctls, the entire concept of groups and
containers are not found in this specification.  To some degree that
makes sense and even mdevs and typically SR-IOV VFs have a 1:1 device
to group relationship.  However, the container is very much involved in
the development of migration support, where it's the container that
provides dirty bitmaps.  Since we're doing map and unmap without that
container concept here, perhaps we'd equally apply those APIs to this
same socket.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-05-14 16:32     ` John G Johnson
  2020-05-14 19:20       ` Alex Williamson
@ 2020-05-21  0:45       ` John G Johnson
  2020-06-02 15:06         ` Alex Williamson
  1 sibling, 1 reply; 31+ messages in thread
From: John G Johnson @ 2020-05-21  0:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, Konrad Rzeszutek Wilk, Felipe Franciosi,
	qemu-devel, Raphael Norwitz, Kirti Wankhede, Kanth Ghatraju,
	Stefan Hajnoczi, Marc-André Lureau, Thanos Makatos, Zhang,
	Tina, Liu, Changpeng, dgilbert



> I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
> The former seems intended to provide the server with access to the
> entire GPA space, while the latter indicates an IOVA to GPA mapping of
> those regions.  Doesn't this break the basic isolation of a vIOMMU?
> This essentially says to me "here's all the guest memory, but please
> only access these regions for which we're providing DMA mappings".
> That invites abuse.
> 

	The purpose behind separating QEMU into multiple processes is
to provide an additional layer of protection for the infrastructure against
a malign guest, not for the guest against itself, so preventing a server
that has been compromised by a guest from accessing all of guest memory
adds no additional benefit.  We don’t even have an IOMMU in our current
guest model for this reason.

	The implementation was stolen from vhost-user, with the exception
that we push IOTLB translations from client to server like VFIO does, as
opposed to pulling them from server to client like vhost-user does.

	That said, neither the qemu-mp nor MUSER implementation uses an
IOMMU, so if you prefer another IOMMU model, we can consider it.  We
could only send the guest memory file descriptors with IOMMU_MAP_DMA
requests, although that would cost performance since each request would
require the server to execute an mmap() system call.


> Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
> "an array of file descriptors will be sent as part of the message
> meta-data" works.  Also consider s/SUB/DEL/.  Why is the Device ID in
> the table specified as 0?  How does a client learn their Device ID?
> 

	SCM_RIGHTS message controls allow sendmsg() to send an array of
file descriptors over a UNIX domain socket.
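
(A minimal sketch of such a transfer, for illustration only; error handling
is omitted and the one-byte payload stands in for the real vfio-user message
header:)

        #include <string.h>
        #include <sys/socket.h>
        #include <sys/uio.h>

        /* Send an array of file descriptors as SCM_RIGHTS ancillary data. */
        static int send_fds(int sock, const int *fds, int nfds)
        {
            char payload = 0;                      /* placeholder message */
            struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
            char cbuf[CMSG_SPACE(sizeof(int) * nfds)];
            struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
            };
            struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

            memset(cbuf, 0, sizeof(cbuf));
            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;
            cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
            memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);

            return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
        }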

	We’re only supporting one device per socket in this protocol
version, so the device ID will always be 0.  This may change in a future
revision, so we included the field in the header to avoid a major version
change if device multiplexing is added later.


> VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
> capability chain), the cap_offset and next pointers within the chain
> need to specify what their offset is relative to (ie. the start of the
> packet, the start of the vfio compatible data structure, etc).  I
> assume the latter for client compatibility.
> 

	Yes.  We will attempt to make the language clearer.


> Also on REGION_INFO, offset is specified as "the base offset to be
> given to the mmap() call for regions with the MMAP attribute".  Base
> offset from what?  Is the mmap performed on the socket fd?  Do we not
> allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
> Why do we specify "MMIO" in those operations versus simply "REGION"?
> Are we arbitrarily excluding support for I/O port regions or device
> specific regions?  If these commands replace direct read and write to
> an fd offset, how is PCI config space handled?
> 

	The base offset refers to the sparse areas, where the sparse area
offset is added to the base region offset.  We will try to make the text
clearer here as well.

	MMIO was added to distinguish these operations from DMA operations.
I can see how this can cause confusion when the region refers to a port range,
so we can change the name to REGION_READ/WRITE. 


> VFIO_USER_MMIO_READ specifies the count field is zero and the reply
> will include the count specifying the amount of data read.  How does
> the client specify how much data to read?  Via message size?
> 

	This is a bug in the doc.  As you said, the count field in the read
request should specify the amount of data to be read.
	

> VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA?  IMO the device
> should only ever have access via IOVA, which implies a DMA mapping
> exists for the device.  Can you provide an example of why we need these
> commands since there seems little point to this interface if a device
> cannot directly interact with VM memory.
> 

	It is a GPA.  The device emulation code would only handle the DMA
addresses the guest programmed it with; the server infrastructure knows
whether an IOMMU exists, and whether the DMA address needs translation to
GPA or not.


> The IOMMU commands should be unnecessary, a vIOMMU should be
> transparent to the server by virtue that the device only knows about
> IOVA mappings accessible to the device.  Requiring the client to expose
> all memory to the server implies that the server must always be trusted.
> 

	The client and server are equally trusted; the guest is the untrusted
entity.


> Interrupt info format, s/type/index/, s/vector/subindex/
> 

	ok


> In addition to the unused ioctls, the entire concept of groups and
> containers are not found in this specification.  To some degree that
> makes sense and even mdevs and typically SR-IOV VFs have a 1:1 device
> to group relationship.  However, the container is very much involved in
> the development of migration support, where it's the container that
> provides dirty bitmaps.  Since we're doing map and unmap without that
> container concept here, perhaps we'd equally apply those APIs to this
> same socket.  Thanks,

	Groups and containers are host IOMMU concepts, and we don’t
interact with the host here.  The kernel VFIO driver doesn’t even need
to exist for VFIO over socket.  I think it’s fine to assume a 1-1
correspondence between containers, groups, and a VFIO over socket device.

	Thanks for looking this over.

							Thanos & JJ






^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-05-21  0:45       ` John G Johnson
@ 2020-06-02 15:06         ` Alex Williamson
  2020-06-10  6:25           ` John G Johnson
  0 siblings, 1 reply; 31+ messages in thread
From: Alex Williamson @ 2020-06-02 15:06 UTC (permalink / raw)
  To: John G Johnson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, Konrad Rzeszutek Wilk, Felipe Franciosi,
	qemu-devel, Raphael Norwitz, Kirti Wankhede, Kanth Ghatraju,
	Stefan Hajnoczi, Marc-André Lureau, Thanos Makatos, Zhang,
	Tina, Liu, Changpeng, dgilbert

On Wed, 20 May 2020 17:45:13 -0700
John G Johnson <john.g.johnson@oracle.com> wrote:

> > I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
> > The former seems intended to provide the server with access to the
> > entire GPA space, while the latter indicates an IOVA to GPA mapping of
> > those regions.  Doesn't this break the basic isolation of a vIOMMU?
> > This essentially says to me "here's all the guest memory, but please
> > only access these regions for which we're providing DMA mappings".
> > That invites abuse.
> >   
> 
> 	The purpose behind separating QEMU into multiple processes is
> to provide an additional layer protection for the infrastructure against
> a malign guest, not for the guest against itself, so preventing a server
> that has been compromised by a guest from accessing all of guest memory
> adds no additional benefit.  We don’t even have an IOMMU in our current
> guest model for this reason.

One of the use cases we see a lot with vfio is nested assignment, ie.
we assign a device to a VM where the VM includes a vIOMMU, such that
the guest OS can then assign the device to userspace within the guest.
This is safe to do AND provides isolation within the guest exactly
because the device only has access to memory mapped to the device, not
the entire guest address space.  I don't think it's just the hypervisor
you're trying to protect, we can't assume there are always trusted
drivers managing the device.

> 
> 	The implementation was stolen from vhost-user, with the exception
> that we push IOTLB translations from client to server like VFIO does, as
> opposed to pulling them from server to client like vhost-user does.

It seems that vhost has numerous hacks forcing it to know whether a
vIOMMU is present as a result of this; vfio has none.
 
> 	That said, neither the qemu-mp nor MUSER implementation uses an
> IOMMU, so if you prefer another IOMMU model, we can consider it.  We
> could only send the guest memory file descriptors with IOMMU_MAP_DMA
> requests, although that would cost performance since each request would
> require the server to execute an mmap() system call.

It would seem shortsighted to not fully enable a vIOMMU compatible
implementation at this time.

> > Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
> > "an array of file descriptors will be sent as part of the message
> > meta-data" works.  Also consider s/SUB/DEL/.  Why is the Device ID in
> > the table specified as 0?  How does a client learn their Device ID?
> >   
> 
> 	SCM_RIGHTS message controls allow sendmsg() to send an array of
> file descriptors over a UNIX domain socket.
> 
> 	We’re only supporting one device per socket in this protocol
> version, so the device ID will always be 0.  This may change in a future
> revision, so we included the field in the header to avoid a major version
> change if device multiplexing is added later.
> 
> 
> > VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
> > capability chain), the cap_offset and next pointers within the chain
> > need to specify what their offset is relative to (ie. the start of the
> > packet, the start of the vfio compatible data structure, etc).  I
> > assume the latter for client compatibility.
> >   
> 
> 	Yes.  We will attempt to make the language clearer.
> 
> 
> > Also on REGION_INFO, offset is specified as "the base offset to be
> > given to the mmap() call for regions with the MMAP attribute".  Base
> > offset from what?  Is the mmap performed on the socket fd?  Do we not
> > allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
> > Why do we specify "MMIO" in those operations versus simply "REGION"?
> > Are we arbitrarily excluding support for I/O port regions or device
> > specific regions?  If these commands replace direct read and write to
> > an fd offset, how is PCI config space handled?
> >   
> 
> 	The base offset refers to the sparse areas, where the sparse area
> offset is added to the base region offset.  We will try to make the text
> clearer here as well.
> 
> 	MMIO was added to distinguish these operations from DMA operations.
> I can see how this can cause confusion when the region refers to a port range,
> so we can change the name to REGION_READ/WRITE. 
> 
> 
> > VFIO_USER_MMIO_READ specifies the count field is zero and the reply
> > will include the count specifying the amount of data read.  How does
> > the client specify how much data to read?  Via message size?
> >   
> 
> 	This is a bug in the doc.  As you said, the read field should
> be the amount of data to be read.
> 	
> 
> > VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA?  IMO the device
> > should only ever have access via IOVA, which implies a DMA mapping
> > exists for the device.  Can you provide an example of why we need these
> > commands since there seems little point to this interface if a device
> > cannot directly interact with VM memory.
> >   
> 
> 	It is a GPA.  The device emulation code would only handle the DMA
> addresses the guest programmed it with; the server infrastructure knows
> whether an IOMMU exists, and whether the DMA address needs translation to
> GPA or not.

I'll re-iterate, a device should only ever issue DMAs in terms of IOVA.
This is how vfio works.

> > The IOMMU commands should be unnecessary, a vIOMMU should be
> > transparent to the server by virtue that the device only knows about
> > IOVA mappings accessible to the device.  Requiring the client to expose
> > all memory to the server implies that the server must always be trusted.
> >   
> 
> 	The client and server are equally trusted; the guest is the untrusted
> entity.

And therefore the driver is untrusted, and opening the client/server
window to expose all of guest memory presents a larger attack surface.

> > Interrupt info format, s/type/index/, s/vector/subindex/
> >   
> 
> 	ok
> 
> 
> > In addition to the unused ioctls, the entire concept of groups and
> > containers are not found in this specification.  To some degree that
> > makes sense and even mdevs and typically SR-IOV VFs have a 1:1 device
> > to group relationship.  However, the container is very much involved in
> > the development of migration support, where it's the container that
> > provides dirty bitmaps.  Since we're doing map and unmap without that
> > container concept here, perhaps we'd equally apply those APIs to this
> > same socket.  Thanks,  
> 
> 	Groups and containers are host IOMMU concepts, and we don’t
> interact with the host here.  The kernel VFIO driver doesn’t even need
> to exist for VFIO over socket.  I think it’s fine to assume a 1-1
> correspondence between containers, groups, and a VFIO over socket device.

Obviously the kernel driver and host IOMMU are out of the picture here.
The point I was trying to make is that we're building interfaces to
support migration around concepts that don't exist in this model, so
it's not clear how we'd map, for example, dirty page tracking on the
container interface to this API.  This seems more akin to the no-iommu
model of vfio, which is a hack where we allow userspace to have access
to a device using the vfio API, but they're on their own for DMA.  We
don't support that model in QEMU, and without those conceptual
equivalencies, I wonder how much we'll be able to leverage existing
QEMU code or co-develop and support future features.  IOW, is this
really just "a vfio-like device model over unix socket" rather than
"vfio over unix socket"?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-06-02 15:06         ` Alex Williamson
@ 2020-06-10  6:25           ` John G Johnson
  2020-06-15 10:49             ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: John G Johnson @ 2020-06-10  6:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, Konrad Rzeszutek Wilk, qemu-devel,
	Raphael Norwitz, Kirti Wankhede, Thanos Makatos, Kanth Ghatraju,
	Stefan Hajnoczi, Felipe Franciosi, Marc-André Lureau, Zhang,
	Tina, Liu, Changpeng, dgilbert



> On Jun 2, 2020, at 8:06 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> On Wed, 20 May 2020 17:45:13 -0700
> John G Johnson <john.g.johnson@oracle.com> wrote:
> 
>>> I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
>>> The former seems intended to provide the server with access to the
>>> entire GPA space, while the latter indicates an IOVA to GPA mapping of
>>> those regions.  Doesn't this break the basic isolation of a vIOMMU?
>>> This essentially says to me "here's all the guest memory, but please
>>> only access these regions for which we're providing DMA mappings".
>>> That invites abuse.
>>> 
>> 
>> 	The purpose behind separating QEMU into multiple processes is
>> to provide an additional layer protection for the infrastructure against
>> a malign guest, not for the guest against itself, so preventing a server
>> that has been compromised by a guest from accessing all of guest memory
>> adds no additional benefit.  We don’t even have an IOMMU in our current
>> guest model for this reason.
> 
> One of the use cases we see a lot with vfio is nested assignment, ie.
> we assign a device to a VM where the VM includes a vIOMMU, such that
> the guest OS can then assign the device to userspace within the guest.
> This is safe to do AND provides isolation within the guest exactly
> because the device only has access to memory mapped to the device, not
> the entire guest address space.  I don't think it's just the hypervisor
> you're trying to protect, we can't assume there are always trusted
> drivers managing the device.
> 

	We intend to support an IOMMU.  The question seems to be whether
it’s implemented in the server or client.  The current proposal has it
in the server, à la vhost-user, but we are fine with moving it.


>> 
>> 	The implementation was stolen from vhost-user, with the exception
>> that we push IOTLB translations from client to server like VFIO does, as
>> opposed to pulling them from server to client like vhost-user does.
> 
> It seems that vhost has numerous hacks forcing it to know whether a
> vIOMMU is present as a result of this, vfio has none.
> 

	I imagine this decision was driven by performance considerations.
If the IOMMU is implemented on the client side, the server must execute mmap()
or munmap() for every IOMMU MAP/UNMAP message.  If the IOMMU is implemented
on the server side, the server doesn’t need these system calls; it just adds a
SW translation entry to its own table.
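
(A hypothetical sketch of such a server-side software translation table,
purely to illustrate the trade-off described above; the names and layout are
not taken from the spec:)

        #include <stddef.h>
        #include <stdint.h>

        struct iotlb_entry {
            uint64_t iova;      /* start of the IOVA range                 */
            uint64_t size;      /* length of the range                     */
            void    *vaddr;     /* host mapping of the backing guest RAM   */
        };

        /* Translate a device IOVA to a host pointer, or NULL if unmapped.
         * IOMMU_MAP_DMA only has to append an entry to this table; no
         * mmap()/munmap() per message is required. */
        static void *iotlb_translate(const struct iotlb_entry *tbl, size_t n,
                                     uint64_t iova)
        {
            for (size_t i = 0; i < n; i++) {
                if (iova >= tbl[i].iova && iova - tbl[i].iova < tbl[i].size) {
                    return (char *)tbl[i].vaddr + (iova - tbl[i].iova);
                }
            }
            return NULL;
        }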


>> 	That said, neither the qemu-mp nor MUSER implementation uses an
>> IOMMU, so if you prefer another IOMMU model, we can consider it.  We
>> could only send the guest memory file descriptors with IOMMU_MAP_DMA
>> requests, although that would cost performance since each request would
>> require the server to execute an mmap() system call.
> 
> It would seem shortsighted to not fully enable a vIOMMU compatible
> implementation at this time.
> 
>>> Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
>>> "an array of file descriptors will be sent as part of the message
>>> meta-data" works.  Also consider s/SUB/DEL/.  Why is the Device ID in
>>> the table specified as 0?  How does a client learn their Device ID?
>>> 
>> 
>> 	SCM_RIGHTS message controls allow sendmsg() to send an array of
>> file descriptors over a UNIX domain socket.
>> 
>> 	We’re only supporting one device per socket in this protocol
>> version, so the device ID will always be 0.  This may change in a future
>> revision, so we included the field in the header to avoid a major version
>> change if device multiplexing is added later.
>> 
>> 
>>> VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
>>> capability chain), the cap_offset and next pointers within the chain
>>> need to specify what their offset is relative to (ie. the start of the
>>> packet, the start of the vfio compatible data structure, etc).  I
>>> assume the latter for client compatibility.
>>> 
>> 
>> 	Yes.  We will attempt to make the language clearer.
>> 
>> 
>>> Also on REGION_INFO, offset is specified as "the base offset to be
>>> given to the mmap() call for regions with the MMAP attribute".  Base
>>> offset from what?  Is the mmap performed on the socket fd?  Do we not
>>> allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
>>> Why do we specify "MMIO" in those operations versus simply "REGION"?
>>> Are we arbitrarily excluding support for I/O port regions or device
>>> specific regions?  If these commands replace direct read and write to
>>> an fd offset, how is PCI config space handled?
>>> 
>> 
>> 	The base offset refers to the sparse areas, where the sparse area
>> offset is added to the base region offset.  We will try to make the text
>> clearer here as well.
>> 
>> 	MMIO was added to distinguish these operations from DMA operations.
>> I can see how this can cause confusion when the region refers to a port range,
>> so we can change the name to REGION_READ/WRITE. 
>> 
>> 
>>> VFIO_USER_MMIO_READ specifies the count field is zero and the reply
>>> will include the count specifying the amount of data read.  How does
>>> the client specify how much data to read?  Via message size?
>>> 
>> 
>> 	This is a bug in the doc.  As you said, the count field should
>> specify the amount of data to be read.
>> 	
>> 
>>> VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA?  IMO the device
>>> should only ever have access via IOVA, which implies a DMA mapping
>>> exists for the device.  Can you provide an example of why we need these
>>> commands since there seems little point to this interface if a device
>>> cannot directly interact with VM memory.
>>> 
>> 
>> 	It is a GPA.  The device emulation code would only handle the DMA
>> addresses the guest programmed it with; the server infrastructure knows
>> whether an IOMMU exists, and whether the DMA address needs translation to
>> GPA or not.
> 
> I'll reiterate: a device should only ever issue DMAs in terms of IOVA.
> This is how vfio works.
> 
>>> The IOMMU commands should be unnecessary, a vIOMMU should be
>>> transparent to the server by virtue that the device only knows about
>>> IOVA mappings accessible to the device.  Requiring the client to expose
>>> all memory to the server implies that the server must always be trusted.
>>> 
>> 
>> 	The client and server are equally trusted; the guest is the untrusted
>> entity.
> 
> And therefore the driver is untrusted and opening the client/server
> window to expose all of guest memory presents a larger attack surface.
> 
>>> Interrupt info format, s/type/index/, s/vector/subindex/
>>> 
>> 
>> 	ok
>> 
>> 
>>> In addition to the unused ioctls, the entire concept of groups and
>>> containers is not found in this specification.  To some degree that
>>> makes sense, as mdevs and typically SR-IOV VFs have a 1:1 device-to-
>>> group relationship.  However, the container is very much involved in
>>> the development of migration support, where it's the container that
>>> provides dirty bitmaps.  Since we're doing map and unmap without that
>>> container concept here, perhaps we'd equally apply those APIs to this
>>> same socket.  Thanks,  
>> 
>> 	Groups and containers are host IOMMU concepts, and we don’t
>> interact with the host here.  The kernel VFIO driver doesn’t even need
>> to exist for VFIO over socket.  I think it’s fine to assume a 1-1
>> correspondence between containers, groups, and a VFIO over socket device.
> 
> Obviously the kernel driver and host IOMMU are out of the picture here.
> The point I was trying to make is that we're building interfaces to
> support migration around concepts that don't exist in this model, so
> it's not clear how we'd map, for example, dirty page tracking on the
> container interface to this API.  This seems more akin to the no-iommu
> model of vfio, which is a hack where we allow userspace to have access
> to a device using the vfio API, but they're on their own for DMA.  We
> don't support that model in QEMU, and without those conceptual
> equivalencies, I wonder how much we'll be able to leverage existing
> QEMU code or co-develop and support future features.  IOW, is this
> really just "a vfio-like device model over unix socket" rather than
> "vfio over unix socket"?  Thanks,
> 


	In this model, each device is in its own IOMMU group and VFIO container.
We can add a note stating this to the spec.

	VFIO does seem to support a QEMU guest with no IOMMU.  It uses
VFIO_IOMMU_MAP_DMA ioctl()s to establish a 1-1 mapping from device DMA
addresses to guest physical addresses in the host IOMMU.  The current MUSER
prototypes take advantage of this in their guests, which lack an IOMMU.
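
	For reference, the host-side equivalent looks roughly like the
following (the ioctl and structure are the kernel VFIO API; the helper
name is made up):

        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <linux/vfio.h>

        /* Map one range of guest RAM so the device can DMA to it at iova == gpa. */
        static int map_guest_ram(int container_fd, void *vaddr,
                                 uint64_t gpa, uint64_t size)
        {
                struct vfio_iommu_type1_dma_map map = {
                        .argsz = sizeof(map),
                        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                        .vaddr = (uintptr_t)vaddr,  /* QEMU's mapping of guest RAM */
                        .iova  = gpa,               /* 1-1: DMA address == GPA */
                        .size  = size,
                };

                return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
        }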

						Thanos & JJ




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-06-10  6:25           ` John G Johnson
@ 2020-06-15 10:49             ` Stefan Hajnoczi
  2020-06-18 21:38               ` John G Johnson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefan Hajnoczi @ 2020-06-15 10:49 UTC (permalink / raw)
  To: John G Johnson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, Konrad Rzeszutek Wilk, qemu-devel,
	Kirti Wankhede, Raphael Norwitz, Alex Williamson, Thanos Makatos,
	Kanth Ghatraju, Felipe Franciosi, Marc-André Lureau, Zhang,
	Tina, Liu, Changpeng, dgilbert

[-- Attachment #1: Type: text/plain, Size: 2984 bytes --]

On Tue, Jun 09, 2020 at 11:25:41PM -0700, John G Johnson wrote:
> > On Jun 2, 2020, at 8:06 AM, Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > On Wed, 20 May 2020 17:45:13 -0700
> > John G Johnson <john.g.johnson@oracle.com> wrote:
> > 
> >>> I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
> >>> The former seems intended to provide the server with access to the
> >>> entire GPA space, while the latter indicates an IOVA to GPA mapping of
> >>> those regions.  Doesn't this break the basic isolation of a vIOMMU?
> >>> This essentially says to me "here's all the guest memory, but please
> >>> only access these regions for which we're providing DMA mappings".
> >>> That invites abuse.
> >>> 
> >> 
> >> 	The purpose behind separating QEMU into multiple processes is
> >> to provide an additional layer protection for the infrastructure against
> >> a malign guest, not for the guest against itself, so preventing a server
> >> that has been compromised by a guest from accessing all of guest memory
> >> adds no additional benefit.  We don’t even have an IOMMU in our current
> >> guest model for this reason.
> > 
> > One of the use cases we see a lot with vfio is nested assignment, ie.
> > we assign a device to a VM where the VM includes a vIOMMU, such that
> > the guest OS can then assign the device to userspace within the guest.
> > This is safe to do AND provides isolation within the guest exactly
> > because the device only has access to memory mapped to the device, not
> > the entire guest address space.  I don't think it's just the hypervisor
> > you're trying to protect, we can't assume there are always trusted
> > drivers managing the device.
> > 
> 
> 	We intend to support an IOMMU.  The question seems to be whether
> it’s implemented in the server or client.  The current proposal has it
> in the server, ala vhost-user, but we are fine with moving it.

It's challenging to implement a fast and secure IOMMU. The simplest
approach is secure but not fast: add protocol messages for
DMA_READ(iova, length) and DMA_WRITE(iova, buffer, length).
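
For illustration, the messages could be as simple as the following
(hypothetical wire layout, not a spec):

        #include <stdint.h>

        /* The client validates 'iova' against its vIOMMU mappings before
         * touching guest memory on the server's behalf. */
        struct dma_read_request {
                uint64_t iova;
                uint64_t length;        /* reply carries 'length' bytes back */
        };

        struct dma_write_request {
                uint64_t iova;
                uint64_t length;        /* followed by 'length' bytes of data */
        };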

An issue with file descriptor passing is that it's hard to revoke access
once the file descriptor has been passed. memfd supports sealing with
fcntl(F_ADD_SEALS), but it doesn't revoke existing mmap(PROT_WRITE)
mappings in other processes.
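
A minimal sketch of that limitation (Linux memfd sealing;
F_SEAL_FUTURE_WRITE needs Linux 5.1+):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static void seal_demo(void)
        {
                int fd = memfd_create("guest-ram", MFD_ALLOW_SEALING);
                ftruncate(fd, 4096);

                /* A process that already holds a writable mapping keeps it... */
                void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);

                /* ...so F_SEAL_WRITE fails with EBUSY while that mapping exists, */
                fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE);

                /* and F_SEAL_FUTURE_WRITE only blocks *new* writable mappings;
                 * 'p' stays writable. */
                fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE);
                (void)p;
        }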

Memory Protection Keys don't seem to be useful here either and their
availability is limited (see pkeys(7)).

One crazy idea is to use KVM as a sandbox for running the device and let
the vIOMMU control the page tables instead of the device (guest). That
way the hardware MMU provides memory translation, but I think this is
impractical because the guest environment is too different from the
Linux userspace environment.

As a starting point adding DMA_READ/DMA_WRITE messages would provide the
functionality and security. Unfortunately it makes DMA expensive and
performance will suffer.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-06-15 10:49             ` Stefan Hajnoczi
@ 2020-06-18 21:38               ` John G Johnson
  2020-06-23 12:27                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: John G Johnson @ 2020-06-18 21:38 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, Konrad Rzeszutek Wilk, qemu-devel,
	Raphael Norwitz, Marc-André Lureau, Kirti Wankhede,
	Alex Williamson, Felipe Franciosi, Thanos Makatos, Liu,
	Changpeng, Zhang, Tina, Kanth Ghatraju, dgilbert



> On Jun 15, 2020, at 3:49 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> 
> It's challenging to implement a fast and secure IOMMU. The simplest
> approach is secure but not fast: add protocol messages for
> DMA_READ(iova, length) and DMA_WRITE(iova, buffer, length).
> 

	We do have protocol messages for the case where no FD is
associated with the DMA region:  VFIO_USER_DMA_READ/WRITE.


> An issue with file descriptor passing is that it's hard to revoke access
> once the file descriptor has been passed. memfd supports sealing with
> fcntl(F_ADD_SEALS), but it doesn't revoke existing mmap(PROT_WRITE)
> mappings in other processes.
> 
> Memory Protection Keys don't seem to be useful here either and their
> availability is limited (see pkeys(7)).
> 
> One crazy idea is to use KVM as a sandbox for running the device and let
> the vIOMMU control the page tables instead of the device (guest). That
> way the hardware MMU provides memory translation, but I think this is
> impractical because the guest environment is too different from the
> Linux userspace environment.
> 
> As a starting point adding DMA_READ/DMA_WRITE messages would provide the
> functionality and security. Unfortunately it makes DMA expensive and
> performance will suffer.
> 

	Are you advocating for only using VFIO_USER_DMA_READ/WRITE and
not passing FDs at all?  The performance penalty would be large for the
cases where the client and server are equally trusted.  Or are you
advocating for an option where the slower methods are used for cases
where the server is less trusted?

								JJ




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-06-18 21:38               ` John G Johnson
@ 2020-06-23 12:27                 ` Stefan Hajnoczi
  2020-06-26  3:54                   ` John G Johnson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefan Hajnoczi @ 2020-06-23 12:27 UTC (permalink / raw)
  To: John G Johnson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Harris, James R,
	Swapnil Ingle, Konrad Rzeszutek Wilk, qemu-devel,
	Raphael Norwitz, Kirti Wankhede, Thanos Makatos, Alex Williamson,
	Stefan Hajnoczi, Felipe Franciosi, Kanth Ghatraju,
	Marc-André Lureau, Zhang, Tina, Liu, Changpeng, dgilbert

[-- Attachment #1: Type: text/plain, Size: 1493 bytes --]

On Thu, Jun 18, 2020 at 02:38:04PM -0700, John G Johnson wrote:
> > On Jun 15, 2020, at 3:49 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > An issue with file descriptor passing is that it's hard to revoke access
> > once the file descriptor has been passed. memfd supports sealing with
> > fcntl(F_ADD_SEALS), but it doesn't revoke existing mmap(PROT_WRITE)
> > mappings in other processes.
> > 
> > Memory Protection Keys don't seem to be useful here either and their
> > availability is limited (see pkeys(7)).
> > 
> > One crazy idea is to use KVM as a sandbox for running the device and let
> > the vIOMMU control the page tables instead of the device (guest). That
> > way the hardware MMU provides memory translation, but I think this is
> > impractical because the guest environment is too different from the
> > Linux userspace environment.
> > 
> > As a starting point adding DMA_READ/DMA_WRITE messages would provide the
> > functionality and security. Unfortunately it makes DMA expensive and
> > performance will suffer.
> > 
> 
> 	Are you advocating for only using VFIO_USER_DMA_READ/WRITE and
> not passing FDs at all?  The performance penalty would be large for the
> cases where the client and server are equally trusted.  Or are you
> advocating for an option where the slower methods are used for cases
> where the server is less trusted?

I think the enforcing IOMMU should be optional (due to the performance
overhead) but part of the spec from the start.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-06-23 12:27                 ` Stefan Hajnoczi
@ 2020-06-26  3:54                   ` John G Johnson
  2020-06-26 13:30                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: John G Johnson @ 2020-06-26  3:54 UTC (permalink / raw)
  To: Stefan Hajnoczi, Alex Williamson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Harris, James R,
	Swapnil Ingle, Konrad Rzeszutek Wilk, qemu-devel,
	Raphael Norwitz, Marc-André Lureau, Kirti Wankhede,
	Kanth Ghatraju, Stefan Hajnoczi, Felipe Franciosi,
	Thanos Makatos, Zhang, Tina, Liu, Changpeng, dgilbert



> On Jun 23, 2020, at 5:27 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> 
> On Thu, Jun 18, 2020 at 02:38:04PM -0700, John G Johnson wrote:
>>> On Jun 15, 2020, at 3:49 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> An issue with file descriptor passing is that it's hard to revoke access
>>> once the file descriptor has been passed. memfd supports sealing with
>>> fcntl(F_ADD_SEALS), but it doesn't revoke existing mmap(PROT_WRITE)
>>> mappings in other processes.
>>> 
>>> Memory Protection Keys don't seem to be useful here either and their
>>> availability is limited (see pkeys(7)).
>>> 
>>> One crazy idea is to use KVM as a sandbox for running the device and let
>>> the vIOMMU control the page tables instead of the device (guest). That
>>> way the hardware MMU provides memory translation, but I think this is
>>> impractical because the guest environment is too different from the
>>> Linux userspace environment.
>>> 
>>> As a starting point adding DMA_READ/DMA_WRITE messages would provide the
>>> functionality and security. Unfortunately it makes DMA expensive and
>>> performance will suffer.
>>> 
>> 
>> 	Are you advocating for only using VFIO_USER_DMA_READ/WRITE and
>> not passing FDs at all?  The performance penalty would be large for the
>> cases where the client and server are equally trusted.  Or are you
>> advocating for an option where the slower methods are used for cases
>> where the server is less trusted?
> 
> I think the enforcing IOMMU should be optional (due to the performance
> overhead) but part of the spec from the start.
> 


	With this in mind, we will collapse the current memory region
messages (VFIO_USER_ADD_MEMORY_REGION and VFIO_USER_SUB_MEMORY_REGION)
and the IOMMU messages (VFIO_USER_IOMMU_MAP and VFIO_USER_IOMMU_UNMAP)
into new messages (VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP).  Their
contents will be the same as the memory region messages.

	On a system without an IOMMU, the new messages will be used to
export the system physical address space as DMA addresses.  On a system
with an IOMMU they will be used to export the valid device DMA ranges
programmed into the IOMMU by the guest.  This behavior matches how the
existing QEMU VFIO object programs the host IOMMU.  The server will not
be aware of whether the client is using an IOMMU.
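
	As an illustration of the client side under both configurations
(the types and helpers below are hypothetical):

        #include <stdint.h>

        /* Without a vIOMMU: advertise all guest RAM once; DMA address == GPA. */
        static void export_guest_ram(int sock, struct ram_region *r)
        {
                vfio_user_dma_map(sock, r->gpa, r->size, r->fd, r->fd_offset);
        }

        /* With a vIOMMU: forward only the IOVA ranges the guest programs. */
        static void viommu_map_notify(int sock, uint64_t iova, uint64_t gpa,
                                      uint64_t len)
        {
                struct ram_region *r = find_ram_region(gpa);

                vfio_user_dma_map(sock, iova, len, r->fd,
                                  r->fd_offset + (gpa - r->gpa));
        }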

	In the QEMU VFIO implementation, we will add a ‘secure-dma’
option that suppresses exporting mmap()able FDs to the server.  All
DMA will use the slow path to be validated by the client before accessing
guest memory.
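
	A sketch of the resulting server-side DMA read path (hypothetical
helpers and types, not MUSER code):

        #include <stdint.h>
        #include <string.h>

        static int dma_read(struct dma_region *r, uint64_t iova,
                            void *buf, size_t len)
        {
                if (r->vaddr != NULL)   /* fd was shared and mmap()ed: fast path */
                        memcpy(buf, (char *)r->vaddr + (iova - r->iova), len);
                else                    /* 'secure-dma': message round trip */
                        return vfio_user_dma_read(r->sock, iova, buf, len);
                return 0;
        }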

	Is this acceptable to you (and Alex, of course)?

								JJ



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-06-26  3:54                   ` John G Johnson
@ 2020-06-26 13:30                     ` Stefan Hajnoczi
  2020-07-02  6:23                       ` John G Johnson
  0 siblings, 1 reply; 31+ messages in thread
From: Stefan Hajnoczi @ 2020-06-26 13:30 UTC (permalink / raw)
  To: John G Johnson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, Konrad Rzeszutek Wilk, Stefan Hajnoczi,
	qemu-devel, Kirti Wankhede, Raphael Norwitz,
	Marc-André Lureau, Alex Williamson, Kanth Ghatraju,
	Felipe Franciosi, Thanos Makatos, Zhang, Tina, Liu, Changpeng,
	dgilbert

[-- Attachment #1: Type: text/plain, Size: 2819 bytes --]

On Thu, Jun 25, 2020 at 08:54:25PM -0700, John G Johnson wrote:
> 
> 
> > On Jun 23, 2020, at 5:27 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > 
> > On Thu, Jun 18, 2020 at 02:38:04PM -0700, John G Johnson wrote:
> >>> On Jun 15, 2020, at 3:49 AM, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>> An issue with file descriptor passing is that it's hard to revoke access
> >>> once the file descriptor has been passed. memfd supports sealing with
> >>> fcntl(F_ADD_SEALS), but it doesn't revoke existing mmap(PROT_WRITE)
> >>> mappings in other processes.
> >>> 
> >>> Memory Protection Keys don't seem to be useful here either and their
> >>> availability is limited (see pkeys(7)).
> >>> 
> >>> One crazy idea is to use KVM as a sandbox for running the device and let
> >>> the vIOMMU control the page tables instead of the device (guest). That
> >>> way the hardware MMU provides memory translation, but I think this is
> >>> impractical because the guest environment is too different from the
> >>> Linux userspace environment.
> >>> 
> >>> As a starting point adding DMA_READ/DMA_WRITE messages would provide the
> >>> functionality and security. Unfortunately it makes DMA expensive and
> >>> performance will suffer.
> >>> 
> >> 
> >> 	Are you advocating for only using VFIO_USER_DMA_READ/WRITE and
> >> not passing FDs at all?  The performance penalty would be large for the
> >> cases where the client and server are equally trusted.  Or are you
> >> advocating for an option where the slower methods are used for cases
> >> where the server is less trusted?
> > 
> > I think the enforcing IOMMU should be optional (due to the performance
> > overhead) but part of the spec from the start.
> > 
> 
> 
> 	With this in mind, we will collapse the current memory region
> messages (VFIO_USER_ADD_MEMORY_REGION and VFIO_USER_SUB_MEMORY_REGION)
> and the IOMMU messages (VFIO_USER_IOMMU_MAP and VFIO_USER_IOMMU_UNMAP)
> into new messages (VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP).  Their
> contents will be the same as the memory region messages.
> 
> 	On a system without an IOMMU, the new messages will be used to
> export the system physical address space as DMA addresses.  On a system
> with an IOMMU they will be used to export the valid device DMA ranges
> programmed into the IOMMU by the guest.  This behavior matches how the
> existing QEMU VFIO object programs the host IOMMU.  The server will not
> be aware of whether the client is using an IOMMU.
>
> 	In the QEMU VFIO implementation, we will add a ‘secure-dma’
> option that suppresses exporting mmap()able FDs to the server.  All
> DMA will use the slow path to be validated by the client before accessing
> guest memory.
> 
> 	Is this acceptable to you (and Alex, of course)?

Sounds good to me.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-06-26 13:30                     ` Stefan Hajnoczi
@ 2020-07-02  6:23                       ` John G Johnson
  2020-07-15 10:15                         ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: John G Johnson @ 2020-07-02  6:23 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel, Alex Williamson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Swapnil Ingle,
	Harris, James R, Konrad Rzeszutek Wilk, Stefan Hajnoczi,
	dgilbert, Raphael Norwitz, Kirti Wankhede, Thanos Makatos,
	Kanth Ghatraju, Felipe Franciosi, Marc-André Lureau, Zhang,
	Tina, Liu, Changpeng


	We’ve made the review changes to the doc, and moved to RST format,
so the doc can go into the QEMU sources.

						Thanos & JJ
 

https://github.com/tmakatos/qemu/blob/master/docs/devel/vfio-over-socket.rst




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RFC: use VFIO over a UNIX domain socket to implement device offloading
  2020-07-02  6:23                       ` John G Johnson
@ 2020-07-15 10:15                         ` Stefan Hajnoczi
  0 siblings, 0 replies; 31+ messages in thread
From: Stefan Hajnoczi @ 2020-07-15 10:15 UTC (permalink / raw)
  To: John G Johnson
  Cc: Walker, Benjamin, Elena Ufimtseva, Jag Raman, Harris, James R,
	Swapnil Ingle, Konrad Rzeszutek Wilk, Stefan Hajnoczi,
	qemu-devel, Kirti Wankhede, Raphael Norwitz, Alex Williamson,
	Thanos Makatos, Kanth Ghatraju, Felipe Franciosi,
	Marc-André Lureau, Zhang, Tina, Liu, Changpeng, dgilbert

[-- Attachment #1: Type: text/plain, Size: 405 bytes --]

On Wed, Jul 01, 2020 at 11:23:25PM -0700, John G Johnson wrote:
> 
> 	We’ve made the review changes to the doc, and moved to RST format,
> so the doc can go into the QEMU sources.
> 
> 						Thanos & JJ
>  
> 
> https://github.com/tmakatos/qemu/blob/master/docs/devel/vfio-over-socket.rst

Great! Feel free to send a patch to qemu-devel so the proposal can be
discussed in detail.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2020-07-15 10:16 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-26  9:47 RFC: use VFIO over a UNIX domain socket to implement device offloading Thanos Makatos
2020-03-27 10:37 ` Thanos Makatos
2020-04-01  9:17 ` Stefan Hajnoczi
2020-04-01 15:49   ` Thanos Makatos
2020-04-01 16:58     ` Marc-André Lureau
2020-04-02 10:19       ` Stefan Hajnoczi
2020-04-02 10:46         ` Daniel P. Berrangé
2020-04-03 12:03           ` Stefan Hajnoczi
2020-04-20 11:05   ` Thanos Makatos
2020-04-22 15:29     ` Stefan Hajnoczi
2020-04-27 10:58       ` Thanos Makatos
2020-04-30 11:23         ` Thanos Makatos
2020-04-30 11:40           ` Daniel P. Berrangé
2020-04-30 15:20             ` Thanos Makatos
2020-05-01 15:01               ` Felipe Franciosi
2020-05-01 15:28                 ` Daniel P. Berrangé
2020-05-04  9:45                   ` Stefan Hajnoczi
2020-05-04 17:49                     ` John G Johnson
2020-05-11 14:37                       ` Stefan Hajnoczi
2020-05-14 16:32     ` John G Johnson
2020-05-14 19:20       ` Alex Williamson
2020-05-21  0:45       ` John G Johnson
2020-06-02 15:06         ` Alex Williamson
2020-06-10  6:25           ` John G Johnson
2020-06-15 10:49             ` Stefan Hajnoczi
2020-06-18 21:38               ` John G Johnson
2020-06-23 12:27                 ` Stefan Hajnoczi
2020-06-26  3:54                   ` John G Johnson
2020-06-26 13:30                     ` Stefan Hajnoczi
2020-07-02  6:23                       ` John G Johnson
2020-07-15 10:15                         ` Stefan Hajnoczi
