From: Stefan Hajnoczi <stefanha@redhat.com>
To: John Levon <levon@movementarian.org>
Cc: benjamin.walker@intel.com,
Elena Ufimtseva <elena.ufimtseva@oracle.com>,
Swapnil Ingle <swapnil.ingle@nutanix.com>,
John G Johnson <john.g.johnson@oracle.com>,
Jason Wang <jasowang@redhat.com>,
qemu-devel@nongnu.org,
Christophe de Dinechin <cdupontd@redhat.com>,
Kirti Wankhede <kwankhede@nvidia.com>,
Gerd Hoffmann <kraxel@redhat.com>,
Raphael Norwitz <raphael.norwitz@nutanix.com>,
jag.raman@oracle.com, james.r.harris@intel.com,
John Levon <john.levon@nutanix.com>,
"Michael S . Tsirkin" <mst@redhat.com>,
Kanth.Ghatraju@oracle.com, Felipe Franciosi <felipe@nutanix.com>,
marcandre.lureau@redhat.com, Yan Zhao <yan.y.zhao@intel.com>,
konrad.wilk@oracle.com, yuvalkashtan@gmail.com,
dgilbert@redhat.com, eafanasova@gmail.com, ismael@linux.com,
Paolo Bonzini <pbonzini@redhat.com>,
changpeng.liu@intel.com, tomassetti.andrea@gmail.com,
mpiszczek@ddn.com, Cornelia Huck <cohuck@redhat.com>,
alex.williamson@redhat.com, tina.zhang@intel.com,
xiuchun.lu@intel.com, Thanos Makatos <thanos.makatos@nutanix.com>
Subject: Re: [PATCH v8] introduce vfio-user protocol specification
Date: Tue, 11 May 2021 11:09:53 +0100
Message-ID: <YJpX8XT+WvXYkyMD@stefanha-x1.localdomain>
In-Reply-To: <20210510222541.GA1916565@li1368-133.members.linode.com>
On Mon, May 10, 2021 at 10:25:41PM +0000, John Levon wrote:
> On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > > +Region IO FD info format
> > > +^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-------------+--------+------+
> > > +| Name        | Offset | Size |
> > > ++=============+========+======+
> > > +| argsz       | 16     | 4    |
> > > ++-------------+--------+------+
> > > +| flags       | 20     | 4    |
> > > ++-------------+--------+------+
> > > +| index       | 24     | 4    |
> > > ++-------------+--------+------+
> > > +| count       | 28     | 4    |
> > > ++-------------+--------+------+
> > > +| sub-regions | 32     | ...  |
> > > ++-------------+--------+------+
> > > +
> > > +* *argsz* is the size of the region IO FD info structure plus the
> > > + total size of the sub-region array. Thus, each array entry "i" is at offset
> > > + i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO
> > > + FD types, but this is not to be relied on.
> > > +* *flags* must be zero
> > > +* *index* is the index of memory region being queried
> > > +* *count* is the number of sub-regions in the array
> > > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > > +
> > > +The client must set ``flags`` to zero and specify the region being queried in
> > > +the ``index``.
> > > +
> > > +The client sets the ``argsz`` field to indicate the maximum size of the response
> > > +that the server can send, which must be at least the size of the response header
> > > +plus space for the sub-region array. If the full response size exceeds ``argsz``,
> > > +then the server must respond only with the response header and the Region IO FD
> > > +info structure, setting in ``argsz`` the buffer size required to store the full
> > > +response. In this case, no file descriptors are passed back. The client then
> > > +retries the operation with a larger receive buffer.
> > > +
> > > +The reply message will additionally include at least one file descriptor in the
> > > +ancillary data. Note that more than one sub-region may share the same file
> > > +descriptor.
> >
> > How does this interact with the maximum number of file descriptors,
> > max_fds? It is possible that there are more sub-regions than max_fds
> > allows...
>
> I think this would just be a matter of the client advertising a reasonably large
> enough size for max_msg_fds. Do we need to worry about this?

vhost-user historically only supported passing 8 fds and it became a
problem there.

I can imagine devices having tens to hundreds of sub-regions (e.g. 64
queue doorbells). Probably not thousands.

If I were implementing a server, I would check the negotiated max_fds
and refuse to start the vfio-user connection if the device has been
configured to require more sub-regions than that. Failing early and
printing an error would allow users to troubleshoot the issue and
re-configure the client/server.

This seems okay, but the spec doesn't mention it explicitly, so I wanted
to check what you had in mind.
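
For illustration only, the early check described above could look
roughly like the Python sketch below. The names check_fd_budget and
TooManySubRegions are hypothetical. Counting only distinct fds assumes
that sub-regions sharing a file descriptor need it passed just once,
which the spec text quoted here does not state.

```python
class TooManySubRegions(Exception):
    """Raised when a region needs more fds than one message may carry."""


def check_fd_budget(sub_region_fds, max_fds):
    """Fail early if a single reply would exceed the negotiated fd limit.

    sub_region_fds: the fd backing each sub-region of one region
                    (hypothetical representation, one entry per sub-region).
    max_fds:        the per-message fd limit negotiated with the client.
    Returns the number of distinct fds the reply would carry.
    """
    # Assumption: sub-regions that share an fd only need it passed once.
    needed = len(set(sub_region_fds))
    if needed > max_fds:
        raise TooManySubRegions(
            f"region needs {needed} fds per reply but max_fds is {max_fds}; "
            "raise the limit or configure fewer sub-regions"
        )
    return needed
```

A server would run this once per region before accepting the
connection, so a misconfiguration shows up as a start-up error rather
than as a failed reply later on.
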
> > > +Interrupt info format
> > > +^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-----------+--------+------+
> > > +| Name      | Offset | Size |
> > > ++===========+========+======+
> > > +| Sub-index | 16     | 4    |
> > > ++-----------+--------+------+
> > > +
> > > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used in PCI
> > > + MSI/X type interrupts.
> >
> > Hmm...this is weird. The server tells the client to raise an MSI-X
> > interrupt but does not include the MSI message that resides in the MSI-X
> > table BAR device region? Or should MSI-X interrupts be delivered to the
> > client via VFIO_USER_DMA_WRITE instead?
> >
> > (Basically it's not clear to me how MSI-X interrupts would work with
> > vfio-user. Reading how they work in kernel VFIO might let me infer it,
> > but it's probably worth explaining this clearly in the spec.)
>
> It doesn't. We don't have an implementation, and the qemu patches don't get this
> right either - it treats the sub-index as the IRQ index AKA IRQ type.
>
> I'd be inclined to just remove this for now, until we have an implementation.
> Thoughts?

I don't remember the details of kernel VFIO IRQs, but it has an
interface where VFIO notifies KVM of configured IRQs so that KVM can set
up Posted Interrupts. I think vfio-user would use KVM irqfd eventfds for
efficient interrupt injection instead, since we're not trying to map a
host interrupt to a guest interrupt.

Fleshing out IRQs sounds like a 1.0 milestone to me. It will definitely
be necessary, but for now this can be dropped.

> > > +VFIO_USER_DEVICE_RESET
> > > +----------------------
> > > +
> > > +Message format
> > > +^^^^^^^^^^^^^^
> > > +
> > > ++--------------+------------------------+
> > > +| Name         | Value                  |
> > > ++==============+========================+
> > > +| Message ID   | <ID>                   |
> > > ++--------------+------------------------+
> > > +| Command      | 14                     |
> > > ++--------------+------------------------+
> > > +| Message size | 16                     |
> > > ++--------------+------------------------+
> > > +| Flags        | Reply bit set in reply |
> > > ++--------------+------------------------+
> > > +| Error        | 0/errno                |
> > > ++--------------+------------------------+
> > > +
> > > +This command message is sent from the client to the server to reset the device.
> >
> > Any requirements for how long VFIO_USER_DEVICE_RESET takes to complete?
> > In some cases a reset involves the server communicating with other
> > systems or components and this can take an unbounded amount of time.
> > Therefore this message could hang. For example, if a vfio-user NVMe
> > device was accessing data on a hung NFS export and there were I/O
> > requests in flight that need to be aborted.
>
> I'm not sure this is something we could put in the generic spec. Perhaps a
> caveat?

It's up to you whether you want to discuss this in the spec or let
client implementors figure it out themselves. Any vfio-user message can
take an unbounded amount of time, and we could assume that readers will
think of this.

VFIO_USER_DEVICE_RESET is just particularly likely to be called by
clients from a synchronous code path. QEMU moved the monitor (RPC
interface) fd into a separate thread in order to stay responsive when
the main event loop is blocked for any reason, so the issue came to
mind.
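
As a rough sketch of one way a client could keep a synchronous code
path from hanging on the reset, the blocking call can run on a worker
thread while the caller waits with a deadline. This is not taken from
the spec or from QEMU: reset_with_deadline and send_reset are
hypothetical names, and send_reset stands in for whatever function
sends VFIO_USER_DEVICE_RESET and blocks until the reply arrives.

```python
import concurrent.futures


def reset_with_deadline(send_reset, timeout_s):
    """Run a blocking reset on a worker thread, waiting at most timeout_s.

    send_reset: hypothetical callable that sends VFIO_USER_DEVICE_RESET
                and blocks until the server replies.
    Returns True if the reply arrived in time, False on timeout.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(send_reset)
    try:
        # Re-raises any exception thrown by send_reset itself.
        future.result(timeout=timeout_s)
        return True
    except concurrent.futures.TimeoutError:
        return False
    finally:
        # Do not block on a hung reset; the worker keeps running.
        pool.shutdown(wait=False)
```

On timeout the reset is still outstanding, so the caller has to decide
whether to keep waiting, report the device as unresponsive, or drop the
connection.
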