On Mon, May 10, 2021 at 10:25:41PM +0000, John Levon wrote:
> On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > > +Region IO FD info format
> > > +^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-------------+--------+------+
> > > +| Name        | Offset | Size |
> > > ++=============+========+======+
> > > +| argsz       | 16     | 4    |
> > > ++-------------+--------+------+
> > > +| flags       | 20     | 4    |
> > > ++-------------+--------+------+
> > > +| index       | 24     | 4    |
> > > ++-------------+--------+------+
> > > +| count       | 28     | 4    |
> > > ++-------------+--------+------+
> > > +| sub-regions | 32     | ...  |
> > > ++-------------+--------+------+
> > > +
> > > +* *argsz* is the size of the region IO FD info structure plus the
> > > +  total size of the sub-region array. Thus, each array entry "i" is at offset
> > > +  i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO
> > > +  FD types, but this is not to be relied on.
> > > +* *flags* must be zero
> > > +* *index* is the index of memory region being queried
> > > +* *count* is the number of sub-regions in the array
> > > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > > +
> > > +The client must set ``flags`` to zero and specify the region being queried in
> > > +the ``index``.
> > > +
> > > +The client sets the ``argsz`` field to indicate the maximum size of the response
> > > +that the server can send, which must be at least the size of the response header
> > > +plus space for the sub-region array. If the full response size exceeds ``argsz``,
> > > +then the server must respond only with the response header and the Region IO FD
> > > +info structure, setting in ``argsz`` the buffer size required to store the full
> > > +response. In this case, no file descriptors are passed back. The client then
> > > +retries the operation with a larger receive buffer.
> > > +
> > > +The reply message will additionally include at least one file descriptor in the
> > > +ancillary data. Note that more than one sub-region may share the same file
> > > +descriptor.
> >
> > How does this interact with the maximum number of file descriptors,
> > max_fds? It is possible that there are more sub-regions than max_fds
> > allows...
>
> I think this would just be a matter of the client advertising a reasonably large
> enough size for max_msg_fds. Do we need to worry about this?

vhost-user historically only supported passing 8 fds and it became a
problem there. I can imagine devices having 10s to 100s of sub-regions
(e.g. 64 queue doorbells). Probably not 1000s.

If I was implementing a server I would check the negotiated max_fds and
refuse to start the vfio-user connection if the device has been
configured to require more sub-regions. Failing early and printing an
error would allow users to troubleshoot the issue and re-configure the
client/server.

This seems okay but the spec doesn't mention it explicitly so I wanted
to check what you had in mind.
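
Roughly this is the kind of startup check I have in mind. It is only a
sketch: the function and variable names are made up here and don't come
from the spec or from any existing implementation.

#include <stdint.h>
#include <stdio.h>

/*
 * Sketch: refuse to serve a device whose sub-regions would need more
 * file descriptors per message than the client's negotiated limit.
 */
int check_sub_region_fd_limit(uint32_t negotiated_max_fds,
                              uint32_t required_fds)
{
    if (required_fds > negotiated_max_fds) {
        fprintf(stderr,
                "vfio-user: device needs %u sub-region fds per message "
                "but the client only accepts %u; refusing to start\n",
                required_fds, negotiated_max_fds);
        return -1; /* fail early so the user can re-configure client/server */
    }
    return 0;
}

Failing at connection setup like this keeps the problem visible instead
of surfacing later when the client queries the region IO FDs.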

> > > +Interrupt info format
> > > +^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-----------+--------+------+
> > > +| Name      | Offset | Size |
> > > ++===========+========+======+
> > > +| Sub-index | 16     | 4    |
> > > ++-----------+--------+------+
> > > +
> > > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used in PCI
> > > +  MSI/X type interrupts.
> >
> > Hmm...this is weird. The server tells the client to raise an MSI-X
> > interrupt but does not include the MSI message that resides in the MSI-X
> > table BAR device region? Or should MSI-X interrupts be delivered to the
> > client via VFIO_USER_DMA_WRITE instead?
> >
> > (Basically it's not clear to me how MSI-X interrupts would work with
> > vfio-user. Reading how they work in kernel VFIO might let me infer it,
> > but it's probably worth explaining this clearly in the spec.)
>
> It doesn't. We don't have an implementation, and the qemu patches don't get this
> right either - it treats the sub-index as the IRQ index AKA IRQ type.
>
> I'd be inclined to just remove this for now, until we have an implementation.
> Thoughts?

I don't remember the details of kernel VFIO irqs but it has an interface
where VFIO notifies KVM of configured irqs so that KVM can set up Posted
Interrupts. I think vfio-user would use KVM irqfd eventfds for efficient
interrupt injection instead since we're not trying to map a host
interrupt to a guest interrupt.

Fleshing out irqs sounds like a 1.0 milestone to me. It will definitely
be necessary but for now this can be dropped.

> > > +VFIO_USER_DEVICE_RESET
> > > +----------------------
> > > +
> > > +Message format
> > > +^^^^^^^^^^^^^^
> > > +
> > > ++--------------+------------------------+
> > > +| Name         | Value                  |
> > > ++==============+========================+
> > > +| Message ID   |                        |
> > > ++--------------+------------------------+
> > > +| Command      | 14                     |
> > > ++--------------+------------------------+
> > > +| Message size | 16                     |
> > > ++--------------+------------------------+
> > > +| Flags        | Reply bit set in reply |
> > > ++--------------+------------------------+
> > > +| Error        | 0/errno                |
> > > ++--------------+------------------------+
> > > +
> > > +This command message is sent from the client to the server to reset the device.
> >
> > Any requirements for how long VFIO_USER_DEVICE_RESET takes to complete?
> > In some cases a reset involves the server communicating with other
> > systems or components and this can take an unbounded amount of time.
> > Therefore this message could hang. For example, if a vfio-user NVMe
> > device was accessing data on a hung NFS export and there were I/O
> > requests in flight that need to be aborted.
>
> I'm not sure this is something we could put in the generic spec. Perhaps a
> caveat?

It's up to you whether you want to discuss this in the spec or let
client implementors figure it out themselves. Any vfio-user message can
take an unbounded amount of time and we could assume that readers will
think of this. VFIO_USER_DEVICE_RESET is just particularly likely to be
called by clients from a synchronous code path. QEMU moved the monitor
(RPC interface) fd into a separate thread in order to stay responsive
when the main event loop is blocked for any reason, so the issue came to
mind.
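
For what it's worth, a client that doesn't want to block forever on the
reset reply could wait for it with a timeout. This is only a sketch of
the idea: the command number 14 comes from the table above, everything
else (including the helper name) is made up.

#include <errno.h>
#include <poll.h>

#define VFIO_USER_DEVICE_RESET 14 /* command number from the table above */

/*
 * Sketch: after sending the reset request, poll for the reply with a
 * timeout instead of blocking indefinitely, so a stuck server (e.g. one
 * waiting on a hung NFS export) can't wedge the client's synchronous
 * code path.
 */
int wait_for_reset_reply(int sock_fd, int timeout_ms)
{
    struct pollfd pfd = {
        .fd = sock_fd,
        .events = POLLIN,
    };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret == 0) {
        return -ETIMEDOUT; /* report the hang instead of waiting forever */
    }
    if (ret < 0) {
        return -errno;
    }
    return 0; /* the reply header can now be read from sock_fd */
}

Whether a timeout is even acceptable is of course device- and
client-specific, which is why I'm not sure the spec can say much here
beyond a caveat.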