On Mon, May 10, 2021 at 10:25:41PM +0000, John Levon wrote:
> On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > > +Region IO FD info format
> > > +^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-------------+--------+------+
> > > +| Name        | Offset | Size |
> > > ++=============+========+======+
> > > +| argsz       | 16     | 4    |
> > > ++-------------+--------+------+
> > > +| flags       | 20     | 4    |
> > > ++-------------+--------+------+
> > > +| index       | 24     | 4    |
> > > ++-------------+--------+------+
> > > +| count       | 28     | 4    |
> > > ++-------------+--------+------+
> > > +| sub-regions | 32     | ...  |
> > > ++-------------+--------+------+
> > > +
> > > +* *argsz* is the size of the region IO FD info structure plus the
> > > +  total size of the sub-region array. Thus, each array entry "i" is at offset
> > > +  i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO
> > > +  FD types, but this is not to be relied on.
> > > +* *flags* must be zero
> > > +* *index* is the index of memory region being queried
> > > +* *count* is the number of sub-regions in the array
> > > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > > +
> > > +The client must set ``flags`` to zero and specify the region being queried in
> > > +the ``index``.
> > > +
> > > +The client sets the ``argsz`` field to indicate the maximum size of the response
> > > +that the server can send, which must be at least the size of the response header
> > > +plus space for the sub-region array. If the full response size exceeds ``argsz``,
> > > +then the server must respond only with the response header and the Region IO FD
> > > +info structure, setting in ``argsz`` the buffer size required to store the full
> > > +response. In this case, no file descriptors are passed back. The client then
> > > +retries the operation with a larger receive buffer.
> > > +
> > > +The reply message will additionally include at least one file descriptor in the
> > > +ancillary data. Note that more than one sub-region may share the same file
> > > +descriptor.
> >
> > How does this interact with the maximum number of file descriptors,
> > max_fds? It is possible that there are more sub-regions than max_fds
> > allows...
>
> I think this would just be a matter of the client advertising a reasonably large
> enough size for max_msg_fds. Do we need to worry about this?

vhost-user historically only supported passing 8 fds and it became a
problem there. I can imagine devices having 10s to 100s of sub-regions
(e.g. 64 queue doorbells). Probably not 1000s.

If I was implementing a server I would check the negotiated max_fds and
refuse to start the vfio-user connection if the device has been
configured to require more sub-regions. Failing early and printing an
error would allow users to troubleshoot the issue and re-configure the
client/server.

This seems okay but the spec doesn't mention it explicitly so I wanted
to check what you had in mind.
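
Roughly this is the kind of startup check I have in mind. It is only a
sketch: the function and variable names are made up here and don't come
from the spec or from any existing implementation.

#include <stdint.h>
#include <stdio.h>

/*
 * Sketch: refuse to serve a device whose sub-regions would need more
 * file descriptors per message than the client's negotiated limit.
 */
int check_sub_region_fd_limit(uint32_t negotiated_max_fds,
                              uint32_t required_fds)
{
    if (required_fds > negotiated_max_fds) {
        fprintf(stderr,
                "vfio-user: device needs %u sub-region fds per message "
                "but the client only accepts %u; refusing to start\n",
                required_fds, negotiated_max_fds);
        return -1; /* fail early so the user can re-configure client/server */
    }
    return 0;
}

Failing at connection setup like this keeps the problem visible instead
of surfacing later when the client queries the region IO FDs.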

> > > +Interrupt info format
> > > +^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-----------+--------+------+
> > > +| Name      | Offset | Size |
> > > ++===========+========+======+
> > > +| Sub-index | 16     | 4    |
> > > ++-----------+--------+------+
> > > +
> > > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used in PCI
> > > +  MSI/X type interrupts.
> >
> > Hmm...this is weird. The server tells the client to raise an MSI-X
> > interrupt but does not include the MSI message that resides in the MSI-X
> > table BAR device region? Or should MSI-X interrupts be delivered to the
> > client via VFIO_USER_DMA_WRITE instead?
> >
> > (Basically it's not clear to me how MSI-X interrupts would work with
> > vfio-user. Reading how they work in kernel VFIO might let me infer it,
> > but it's probably worth explaining this clearly in the spec.)
>
> It doesn't. We don't have an implementation, and the qemu patches don't get this
> right either - it treats the sub-index as the IRQ index AKA IRQ type.
>
> I'd be inclined to just remove this for now, until we have an implementation.
> Thoughts?

I don't remember the details of kernel VFIO irqs but it has an interface
where VFIO notifies KVM of configured irqs so that KVM can set up Posted
Interrupts. I think vfio-user would use KVM irqfd eventfds for efficient
interrupt injection instead since we're not trying to map a host
interrupt to a guest interrupt.

Fleshing out irqs sounds like a 1.0 milestone to me. It will definitely
be necessary but for now this can be dropped.

> > > +VFIO_USER_DEVICE_RESET
> > > +----------------------
> > > +
> > > +Message format
> > > +^^^^^^^^^^^^^^
> > > +
> > > ++--------------+------------------------+
> > > +| Name         | Value                  |
> > > ++==============+========================+
> > > +| Message ID   |                        |
> > > ++--------------+------------------------+
> > > +| Command      | 14                     |
> > > ++--------------+------------------------+
> > > +| Message size | 16                     |
> > > ++--------------+------------------------+
> > > +| Flags        | Reply bit set in reply |
> > > ++--------------+------------------------+
> > > +| Error        | 0/errno                |
> > > ++--------------+------------------------+
> > > +
> > > +This command message is sent from the client to the server to reset the device.
> >
> > Any requirements for how long VFIO_USER_DEVICE_RESET takes to complete?
> > In some cases a reset involves the server communicating with other
> > systems or components and this can take an unbounded amount of time.
> > Therefore this message could hang. For example, if a vfio-user NVMe
> > device was accessing data on a hung NFS export and there were I/O
> > requests in flight that need to be aborted.
>
> I'm not sure this is something we could put in the generic spec. Perhaps a
> caveat?

It's up to you whether you want to discuss this in the spec or let
client implementors figure it out themselves. Any vfio-user message can
take an unbounded amount of time and we could assume that readers will
think of this. VFIO_USER_DEVICE_RESET is just particularly likely to be
called by clients from a synchronous code path. QEMU moved the monitor
(RPC interface) fd into a separate thread in order to stay responsive
when the main event loop is blocked for any reason, so the issue came to
mind.
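
For what it's worth, a client that doesn't want to block forever on the
reset reply could wait for it with a timeout. This is only a sketch of
the idea: the command number 14 comes from the table above, everything
else (including the helper name) is made up.

#include <errno.h>
#include <poll.h>

#define VFIO_USER_DEVICE_RESET 14 /* command number from the table above */

/*
 * Sketch: after sending the reset request, poll for the reply with a
 * timeout instead of blocking indefinitely, so a stuck server (e.g. one
 * waiting on a hung NFS export) can't wedge the client's synchronous
 * code path.
 */
int wait_for_reset_reply(int sock_fd, int timeout_ms)
{
    struct pollfd pfd = {
        .fd = sock_fd,
        .events = POLLIN,
    };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret == 0) {
        return -ETIMEDOUT; /* report the hang instead of waiting forever */
    }
    if (ret < 0) {
        return -errno;
    }
    return 0; /* the reply header can now be read from sock_fd */
}

Whether a timeout is even acceptable is of course device- and
client-specific, which is why I'm not sure the spec can say much here
beyond a caveat.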