Re: [PATCH v8] introduce vfio-user protocol specification

From: John Levon <john.levon@nutanix.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: "benjamin.walker@intel.com" <benjamin.walker@intel.com>,
	John G Johnson <john.g.johnson@oracle.com>,
	Swapnil Ingle <swapnil.ingle@nutanix.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	John Levon <levon@movementarian.org>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	"tina.zhang@intel.com" <tina.zhang@intel.com>,
	"jag.raman@oracle.com" <jag.raman@oracle.com>,
	"james.r.harris@intel.com" <james.r.harris@intel.com>,
	Elena Ufimtseva <elena.ufimtseva@oracle.com>,
	"Kanth.Ghatraju@oracle.com" <Kanth.Ghatraju@oracle.com>,
	Felipe Franciosi <felipe@nutanix.com>,
	"marcandre.lureau@redhat.com" <marcandre.lureau@redhat.com>,
	Christophe de Dinechin <cdupontd@redhat.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	"konrad.wilk@oracle.com" <konrad.wilk@oracle.com>,
	"yuvalkashtan@gmail.com" <yuvalkashtan@gmail.com>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"ismael@linux.com" <ismael@linux.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"changpeng.liu@intel.com" <changpeng.liu@intel.com>,
	"tomassetti.andrea@gmail.com" <tomassetti.andrea@gmail.com>,
	"mpiszczek@ddn.com" <mpiszczek@ddn.com>,
	Cornelia Huck <cohuck@redhat.com>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	Raphael Norwitz <raphael.norwitz@nutanix.com>,
	"xiuchun.lu@intel.com" <xiuchun.lu@intel.com>,
	Thanos Makatos <thanos.makatos@nutanix.com>
Subject: Re: [PATCH v8] introduce vfio-user protocol specification
Date: Wed, 5 May 2021 16:19:46 +0000	[thread overview]
Message-ID: <20210505161942.GA1195642@sent> (raw)
In-Reply-To: <YJK+8MrRe3ANcu7y@stefanha-x1.localdomain> <YJFRcdvcOlwm2bd7@stefanha-x1.localdomain>

On Tue, May 04, 2021 at 02:51:45PM +0100, Stefan Hajnoczi wrote:

> > +While passing of file descriptors is desirable for performance reasons, it is
> > +not necessary neither for the client nor for the server to support it in order
> 
> Double negative. "not" can be removed.

Done. I also took a `:set spell` pass on the whole spec, which should catch your
other typos.

> > +Device Read and Write
> > +^^^^^^^^^^^^^^^^^^^^^
> > +When the guest executes load or store operations to device memory, the client
> 
> <linux/vfio.h> calls it "device regions", not "device memory".
> s/device memory/unmapped device regions/?

Spec was sloppy here, yes. Fixed up all the instances I noticed.

> Does anything need to be said about mmaps and file descriptors on
> disconnected? I guess they need to be cleaned up and are not retained
> for future reconnection?

Updated:

```
Therefore in order for the protocol to be forward compatible, the server should
respond to a client disconnection as follows:

 - all client memory regions are unmapped and cleaned up (including closing any
   passed file descriptors)
 - all IRQ file descriptors passed from the old client are closed
 - the device state should otherwise be retained

The expectation is that when a client reconnects, it will re-establish IRQ and
client memory mappings.

If anything happens to the client (such as qemu really did exit), the control
stack will know about it and can clean up resources accordingly.
```

> Are there rules for avoiding deadlock between client->server and
> server->client messages? For example, the client sends
> VFIO_USER_REGION_WRITE and the server sends VFIO_USER_VM_INTERRUPT
> before replying to the write message.
> 
> Multi-threaded clients and servers could end up deadlocking if messages
> are processed while polling threads handle other device activity (e.g.
> I/O requests that cause DMA messages).
> 
> Pipelining has the nice effect that the oldest message must complete
> before the next pipelined message starts. It imposes a maximum issue
> depth of 1. Still, it seems like it would be relatively easy to hit
> re-entrancy or deadlock issues since both the client and the server can
> initiate messages and may need to wait for a response.

Filed https://github.com/nutanix/libvfio-user/issues/466

> > +* *Offset* is the file offset of the region with respect to the associated file
> > +  descriptor.
> > +* *Protections* are the region's protection attributes as encoded in
> > +  ``<sys/mman.h>``.
> 
> Please be more specific. Does it only include PROT_READ and PROT_WRITE?
> What about PROT_EXEC?

Updated to just have PROT_READ/WRITE.

> > +If a DMA region being added can be directly mapped by the server, an array of
> > +file descriptors must be sent as part of the message meta-data. Each mappable
> > +region entry must have a corresponding file descriptor. On AF_UNIX sockets, the
> > +file descriptors must be passed as SCM_RIGHTS type ancillary data. Otherwise,
> > +if a DMA region cannot be directly mapped by the server, it can be accessed by
> > +the server using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, explained
> > +in `Read and Write Operations`_. A command to map over an existing region must
> > +be failed by the server with ``EEXIST`` set in error field in the reply.
> 
> Does this mean a vIOMMU update, like a protections bits change requires
> an unmap command followed by a map command? That is not an atomic
> operation but hopefully guests don't try to update a vIOMMU mapping
> while accessing it.

Filed https://github.com/nutanix/libvfio-user/issues/467

> > +This command message is sent by the client to the server to inform it that a
> > +DMA region, previously made available via a VFIO_USER_DMA_MAP command message,
> > +is no longer available for DMA. It typically occurs when memory is subtracted
> > +from the client or if the client uses a vIOMMU. If the client does not expect
> > +the server to perform DMA then it does not need to send to the server
> > +VFIO_USER_DMA_UNMAP commands. If the server does not need to perform DMA then
> > +it can ignore such commands but it must still reply to them. The table is an
> 
> I'm a little confused by the last two sentences about not sending or
> ignoring VFIO_USER_DMA_UNMAP. Does it mean that VFIO_USER_DMA_MAP does
> not need to be sent either when the device is known never to need DMA?

If a device is never going to access client memory (either directly mapped or
DMA_READ/WRITE), there's no need to inform the server of VM memory.  I removed
the sentences as they were just confusing, sort of obvious, and not really
relevant to real systems.

> > +* *Address* is the base DMA address of the region.
> > +* *Size* is the size of the region.
> > +* *Offset* is the file offset of the region with respect to the associated file
> > +  descriptor.
> > +* *Protections* are the region's protection attributes as encoded in
> > +  ``<sys/mman.h>``.
> 
> Why are offset and protections required for the unmap command?

They are now explicitly listed as ignored.

> > +* *Flags* contains the following region attributes:
> > +
> > +  * *VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP* indicates that a dirty page bitmap
> > +    must be populated before unmapping the DMA region. The client must provide
> > +    a ``struct vfio_bitmap`` in the VFIO bitmaps field for each region, with
> > +    the ``vfio_bitmap.pgsize`` and ``vfio_bitmap.size`` fields initialized.
> > +
> > +* *VFIO Bitmaps* contains one ``struct vfio_bitmap`` per region (explained
> > +  below) if ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is set in Flags.
> 
> I'm confused, it's 1 "VFIO Bitmaps" per "Table entry". Why does it
> contain one struct vfio_bitmap per region? What is a "region" in this
> context?

Thanos?

> > +Upon receiving a VFIO_USER_DMA_UNMAP command, if the file descriptor is mapped
> > +then the server must release all references to that DMA region before replying,
> > +which includes potentially in flight DMA transactions. Removing a portion of a
> > +DMA region is possible.
> 
> "Removing a portion of a DMA region is possible"
> -> doing so splits a larger DMA region into one or two smaller remaining regions?

We certainly don't have that implemented. Thanos?

On Wed, May 05, 2021 at 04:51:12PM +0100, Stefan Hajnoczi wrote:

> > > How do potentially large messages work around max_msg_size? It is hard
> > > for the client/server to anticipate the maximum message size that will
> > > be required ahead of time, so they can't really know if they will hit a
> > > situation where max_msg_size is too low.
> > 
> > Are there specific messages you're worried about? would it help to add a
> > specification stipulation as to minimum size that clients and servers must
> > support?
> > 
> > Ultimately the max msg size exists solely to ease implementation: with a
> > reasonable fixed size, we can always consume the entire data in one go, rather
> > than doing partial reads. Obviously that needs a limit to avoid unbounded
> > allocations.
> 
> It came to mind when reading about the dirty bitmap messages. Memory
> dirty bitmaps can become large. An 8 GB memory region has a 1 MB dirty
> bitmap.

Right, yeah. I filed https://github.com/nutanix/libvfio-user/issues/469 to track
it.

regards
john