qemu-devel.nongnu.org archive mirror
From: Stefan Hajnoczi <stefanha@redhat.com>
To: John Levon <levon@movementarian.org>
Cc: benjamin.walker@intel.com,
	Elena Ufimtseva <elena.ufimtseva@oracle.com>,
	Swapnil Ingle <swapnil.ingle@nutanix.com>,
	John G Johnson <john.g.johnson@oracle.com>,
	Jason Wang <jasowang@redhat.com>,
	qemu-devel@nongnu.org,
	Christophe de Dinechin <cdupontd@redhat.com>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	Raphael Norwitz <raphael.norwitz@nutanix.com>,
	jag.raman@oracle.com, james.r.harris@intel.com,
	John Levon <john.levon@nutanix.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Kanth.Ghatraju@oracle.com, Felipe Franciosi <felipe@nutanix.com>,
	marcandre.lureau@redhat.com, Yan Zhao <yan.y.zhao@intel.com>,
	konrad.wilk@oracle.com, yuvalkashtan@gmail.com,
	dgilbert@redhat.com, eafanasova@gmail.com, ismael@linux.com,
	Paolo Bonzini <pbonzini@redhat.com>,
	changpeng.liu@intel.com, tomassetti.andrea@gmail.com,
	mpiszczek@ddn.com, Cornelia Huck <cohuck@redhat.com>,
	alex.williamson@redhat.com, tina.zhang@intel.com,
	xiuchun.lu@intel.com, Thanos Makatos <thanos.makatos@nutanix.com>
Subject: Re: [PATCH v8] introduce vfio-user protocol specification
Date: Tue, 11 May 2021 11:09:53 +0100
Message-ID: <YJpX8XT+WvXYkyMD@stefanha-x1.localdomain>
In-Reply-To: <20210510222541.GA1916565@li1368-133.members.linode.com>

On Mon, May 10, 2021 at 10:25:41PM +0000, John Levon wrote:
> On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > > +Region IO FD info format
> > > +^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-------------+--------+------+
> > > +| Name        | Offset | Size |
> > > ++=============+========+======+
> > > +| argsz       | 16     | 4    |
> > > ++-------------+--------+------+
> > > +| flags       | 20     | 4    |
> > > ++-------------+--------+------+
> > > +| index       | 24     | 4    |
> > > ++-------------+--------+------+
> > > +| count       | 28     | 4    |
> > > ++-------------+--------+------+
> > > +| sub-regions | 32     | ...  |
> > > ++-------------+--------+------+
> > > +
> > > +* *argsz* is the size of the region IO FD info structure plus the
> > > +  total size of the sub-region array. Thus, each array entry "i" is at offset
> > > +  i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO
> > > +  FD types, but this is not to be relied on.
> > > +* *flags* must be zero
> > > +* *index* is the index of memory region being queried
> > > +* *count* is the number of sub-regions in the array
> > > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > > +
> > > +The client must set ``flags`` to zero and specify the region being queried in
> > > +the ``index``.
> > > +
> > > +The client sets the ``argsz`` field to indicate the maximum size of the response
> > > +that the server can send, which must be at least the size of the response header
> > > +plus space for the sub-region array. If the full response size exceeds ``argsz``,
> > > +then the server must respond only with the response header and the Region IO FD
> > > +info structure, setting in ``argsz`` the buffer size required to store the full
> > > +response. In this case, no file descriptors are passed back.  The client then
> > > +retries the operation with a larger receive buffer.
> > > +
> > > +The reply message will additionally include at least one file descriptor in the
> > > +ancillary data. Note that more than one sub-region may share the same file
> > > +descriptor.
> > 
> > How does this interact with the maximum number of file descriptors,
> > max_fds? It is possible that there are more sub-regions than max_fds
> > allows...
> 
> I think this would just be a matter of the client advertising a reasonably large
> enough size for max_msg_fds. Do we need to worry about this?

vhost-user historically only supported passing 8 fds and it became a
problem there.

I can imagine devices having tens to hundreds of sub-regions (e.g. 64
queue doorbells). Probably not thousands.

If I were implementing a server, I would check the negotiated max_fds
and refuse to start the vfio-user connection if the device is configured
with more sub-regions than max_fds allows. Failing early and printing an
error would allow users to troubleshoot the issue and re-configure the
client/server.

This seems okay but the spec doesn't mention it explicitly so I wanted
to check what you had in mind.
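
A minimal sketch of the startup check I have in mind, in Python for
brevity. The names check_fd_limit, sub_region_fds and max_fds are
hypothetical and not taken from the spec or any implementation. It
counts distinct fds because the spec says sub-regions may share a file
descriptor; a stricter server could compare the sub-region count
instead.

```python
def check_fd_limit(sub_region_fds, max_fds):
    """Fail early if one reply would need more fds than were negotiated.

    sub_region_fds: one fd number per sub-region (hypothetical input).
    max_fds: the per-message fd limit negotiated with the client.
    Returns the number of distinct fds the reply would carry.
    """
    # Sub-regions may share a file descriptor, so count unique fds only.
    needed = len(set(sub_region_fds))
    if needed > max_fds:
        raise RuntimeError(
            f"device needs {needed} fds per reply but only "
            f"{max_fds} were negotiated; refusing to start"
        )
    return needed
```

A server would call this once after version negotiation, before
accepting any other commands, and print the error for the user.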

> > > +Interrupt info format
> > > +^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-----------+--------+------+
> > > +| Name      | Offset | Size |
> > > ++===========+========+======+
> > > +| Sub-index | 16     | 4    |
> > > ++-----------+--------+------+
> > > +
> > > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used in PCI
> > > +  MSI/X type interrupts.
> > 
> > Hmm...this is weird. The server tells the client to raise an MSI-X
> > interrupt but does not include the MSI message that resides in the MSI-X
> > table BAR device region? Or should MSI-X interrupts be delivered to the
> > client via VFIO_USER_DMA_WRITE instead?
> > 
> > (Basically it's not clear to me how MSI-X interrupts would work with
> > vfio-user. Reading how they work in kernel VFIO might let me infer it,
> > but it's probably worth explaining this clearly in the spec.)
> 
> It doesn't. We don't have an implementation, and the qemu patches don't get this
> right either - it treats the sub-index as the IRQ index AKA IRQ type.
> 
> I'd be inclined to just remove this for now, until we have an implementation.
> Thoughts?

I don't remember the details of kernel VFIO IRQs, but it has an
interface where VFIO notifies KVM of configured IRQs so that KVM can set
up Posted Interrupts. I think vfio-user would use KVM irqfd eventfds for
efficient interrupt injection instead, since we're not trying to map a
host interrupt to a guest interrupt.

Fleshing out IRQs sounds like a 1.0 milestone to me. It will definitely
be necessary but for now this can be dropped.

> > > +VFIO_USER_DEVICE_RESET
> > > +----------------------
> > > +
> > > +Message format
> > > +^^^^^^^^^^^^^^
> > > +
> > > ++--------------+------------------------+
> > > +| Name         | Value                  |
> > > ++==============+========================+
> > > +| Message ID   | <ID>                   |
> > > ++--------------+------------------------+
> > > +| Command      | 14                     |
> > > ++--------------+------------------------+
> > > +| Message size | 16                     |
> > > ++--------------+------------------------+
> > > +| Flags        | Reply bit set in reply |
> > > ++--------------+------------------------+
> > > +| Error        | 0/errno                |
> > > ++--------------+------------------------+
> > > +
> > > +This command message is sent from the client to the server to reset the device.
> > 
> > Any requirements for how long VFIO_USER_DEVICE_RESET takes to complete?
> > In some cases a reset involves the server communicating with other
> > systems or components and this can take an unbounded amount of time.
> > Therefore this message could hang. For example, if a vfio-user NVMe
> > device was accessing data on a hung NFS export and there were I/O
> > requests in flight that need to be aborted.
> 
> I'm not sure this is something we could put in the generic spec. Perhaps a
> caveat?

It's up to you whether you want to discuss this in the spec or let
client implementors figure it out themselves. Any vfio-user message can
take an unbounded amount of time and we could assume that readers will
think of this.

VFIO_USER_DEVICE_RESET is just particularly likely to be called by
clients from a synchronous code path. QEMU moved the monitor (RPC
interface) fd into a separate thread in order to stay responsive when
the main event loop is blocked for any reason, so the issue came to
mind.
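
To illustrate the concern, here is a rough sketch (Python, for brevity)
of a client that issues the reset from a worker thread and gives up
waiting after a deadline, so the calling code path is not blocked
forever. reset_with_timeout and send_reset are hypothetical names;
send_reset stands in for whatever blocking call sends
VFIO_USER_DEVICE_RESET and returns the Error field of the reply. This is
only one possible approach, not something the spec prescribes.

```python
import threading


def reset_with_timeout(send_reset, timeout_s):
    """Run a blocking reset in a worker thread.

    Returns the reply's error value, or None if no reply arrived within
    timeout_s seconds (the reset may still complete later).
    """
    done = threading.Event()
    result = {}

    def worker():
        try:
            result["error"] = send_reset()
        except Exception as exc:  # hand the failure to the caller
            result["exception"] = exc
        finally:
            done.set()

    # Daemon thread: a hung server must not keep the client process alive.
    threading.Thread(target=worker, daemon=True).start()

    if not done.wait(timeout_s):
        return None  # still pending; caller decides how to recover
    if "exception" in result:
        raise result["exception"]
    return result["error"]
```

What the client does after a timeout (retry, tear down the connection,
report the device as failed) is a policy decision left open here.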