From: Stefan Hajnoczi <stefanha@redhat.com>
To: John Levon <levon@movementarian.org>
Cc: benjamin.walker@intel.com,
	Elena Ufimtseva <elena.ufimtseva@oracle.com>,
	Swapnil Ingle <swapnil.ingle@nutanix.com>,
	John G Johnson <john.g.johnson@oracle.com>,
	Jason Wang <jasowang@redhat.com>,
	qemu-devel@nongnu.org,
	Christophe de Dinechin <cdupontd@redhat.com>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	Raphael Norwitz <raphael.norwitz@nutanix.com>,
	jag.raman@oracle.com, james.r.harris@intel.com,
	John Levon <john.levon@nutanix.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Kanth.Ghatraju@oracle.com, Felipe Franciosi <felipe@nutanix.com>,
	marcandre.lureau@redhat.com, Yan Zhao <yan.y.zhao@intel.com>,
	konrad.wilk@oracle.com, yuvalkashtan@gmail.com,
	dgilbert@redhat.com, eafanasova@gmail.com, ismael@linux.com,
	Paolo Bonzini <pbonzini@redhat.com>,
	changpeng.liu@intel.com, tomassetti.andrea@gmail.com,
	mpiszczek@ddn.com, Cornelia Huck <cohuck@redhat.com>,
	alex.williamson@redhat.com, tina.zhang@intel.com,
	xiuchun.lu@intel.com, Thanos Makatos <thanos.makatos@nutanix.com>
Subject: Re: [PATCH v8] introduce vfio-user protocol specification
Date: Tue, 11 May 2021 11:09:53 +0100
Message-ID: <YJpX8XT+WvXYkyMD@stefanha-x1.localdomain>
In-Reply-To: <20210510222541.GA1916565@li1368-133.members.linode.com>

On Mon, May 10, 2021 at 10:25:41PM +0000, John Levon wrote:
> On Mon, May 10, 2021 at 05:57:37PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > > +Region IO FD info format
> > > +^^^^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-------------+--------+------+
> > > +| Name        | Offset | Size |
> > > ++=============+========+======+
> > > +| argsz       | 16     | 4    |
> > > ++-------------+--------+------+
> > > +| flags       | 20     | 4    |
> > > ++-------------+--------+------+
> > > +| index       | 24     | 4    |
> > > ++-------------+--------+------+
> > > +| count       | 28     | 4    |
> > > ++-------------+--------+------+
> > > +| sub-regions | 32     | ...  |
> > > ++-------------+--------+------+
> > > +
> > > +* *argsz* is the size of the region IO FD info structure plus the
> > > +  total size of the sub-region array. Thus, each array entry "i" is at offset
> > > +  i * ((argsz - 32) / count). Note that currently this is 40 bytes for both IO
> > > +  FD types, but this is not to be relied on.
> > > +* *flags* must be zero
> > > +* *index* is the index of memory region being queried
> > > +* *count* is the number of sub-regions in the array
> > > +* *sub-regions* is the array of Sub-Region IO FD info structures
> > > +
> > > +The client must set ``flags`` to zero and specify the region being queried in
> > > +the ``index``.
> > > +
> > > +The client sets the ``argsz`` field to indicate the maximum size of the response
> > > +that the server can send, which must be at least the size of the response header
> > > +plus space for the sub-region array. If the full response size exceeds ``argsz``,
> > > +then the server must respond only with the response header and the Region IO FD
> > > +info structure, setting in ``argsz`` the buffer size required to store the full
> > > +response. In this case, no file descriptors are passed back.  The client then
> > > +retries the operation with a larger receive buffer.
> > > +
> > > +The reply message will additionally include at least one file descriptor in the
> > > +ancillary data. Note that more than one sub-region may share the same file
> > > +descriptor.
> > 
> > How does this interact with the maximum number of file descriptors,
> > max_fds? It is possible that there are more sub-regions than max_fds
> > allows...
> 
> I think this would just be a matter of the client advertising a reasonably large
> enough size for max_msg_fds. Do we need to worry about this?

vhost-user historically supported passing only 8 fds, and that limit
became a problem there.

I can imagine devices having tens to hundreds of sub-regions (e.g. 64
queue doorbells), but probably not thousands.

If I were implementing a server, I would check the negotiated max_fds
and refuse to start the vfio-user connection if the device has been
configured to require more sub-regions than that limit allows. Failing
early and printing an error would allow users to troubleshoot the issue
and re-configure the client/server.

This seems okay, but the spec doesn't mention it explicitly, so I wanted
to check what you had in mind.
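A minimal sketch of the fail-early check described above, under stated assumptions: the function name `check_fd_budget` and the dictionary of per-region sub-region counts are hypothetical, and the check conservatively counts one fd per sub-region even though the quoted spec text allows sub-regions to share a file descriptor.

```python
def check_fd_budget(subregions_per_region, max_fds):
    """Fail early if any region needs more fds in one reply than were negotiated.

    subregions_per_region maps a region index to its number of sub-regions
    (hypothetical structure). Worst case: every sub-region has its own fd.
    """
    worst = max(subregions_per_region.values(), default=0)
    if worst > max_fds:
        raise RuntimeError(
            f"device needs up to {worst} fds per reply, "
            f"but only {max_fds} were negotiated"
        )
    return worst
```

A server could run such a check once after negotiation and before serving any request, so that a misconfiguration shows up as a clear startup error rather than as a failed reply later on.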

> > > +Interrupt info format
> > > +^^^^^^^^^^^^^^^^^^^^^
> > > +
> > > ++-----------+--------+------+
> > > +| Name      | Offset | Size |
> > > ++===========+========+======+
> > > +| Sub-index | 16     | 4    |
> > > ++-----------+--------+------+
> > > +
> > > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used in PCI
> > > +  MSI/X type interrupts.
> > 
> > Hmm...this is weird. The server tells the client to raise an MSI-X
> > interrupt but does not include the MSI message that resides in the MSI-X
> > table BAR device region? Or should MSI-X interrupts be delivered to the
> > client via VFIO_USER_DMA_WRITE instead?
> > 
> > (Basically it's not clear to me how MSI-X interrupts would work with
> > vfio-user. Reading how they work in kernel VFIO might let me infer it,
> > but it's probably worth explaining this clearly in the spec.)
> 
> It doesn't. We don't have an implementation, and the qemu patches don't get this
> right either - it treats the sub-index as the IRQ index AKA IRQ type.
> 
> I'd be inclined to just remove this for now, until we have an implementation.
> Thoughts?

I don't remember the details of kernel VFIO irqs, but it has an
interface where VFIO notifies KVM of configured irqs so that KVM can set
up Posted Interrupts. I think vfio-user would instead use KVM irqfd
eventfds for efficient interrupt injection, since we're not trying to
map a host interrupt to a guest interrupt.

Fleshing out irqs sounds like a 1.0 milestone to me. It will definitely
be necessary, but for now this can be dropped.

> > > +VFIO_USER_DEVICE_RESET
> > > +----------------------
> > > +
> > > +Message format
> > > +^^^^^^^^^^^^^^
> > > +
> > > ++--------------+------------------------+
> > > +| Name         | Value                  |
> > > ++==============+========================+
> > > +| Message ID   | <ID>                   |
> > > ++--------------+------------------------+
> > > +| Command      | 14                     |
> > > ++--------------+------------------------+
> > > +| Message size | 16                     |
> > > ++--------------+------------------------+
> > > +| Flags        | Reply bit set in reply |
> > > ++--------------+------------------------+
> > > +| Error        | 0/errno                |
> > > ++--------------+------------------------+
> > > +
> > > +This command message is sent from the client to the server to reset the device.
> > 
> > Any requirements for how long VFIO_USER_DEVICE_RESET takes to complete?
> > In some cases a reset involves the server communicating with other
> > systems or components and this can take an unbounded amount of time.
> > Therefore this message could hang. For example, if a vfio-user NVMe
> > device was accessing data on a hung NFS export and there were I/O
> > requests in flight that need to be aborted.
> 
> I'm not sure this is something we could put in the generic spec. Perhaps a
> caveat?

It's up to you whether you want to discuss this in the spec or let
client implementors figure it out themselves. Any vfio-user message can
take an unbounded amount of time and we could assume that readers will
think of this.

VFIO_USER_DEVICE_RESET is just particularly likely to be called by
clients from a synchronous code path. QEMU moved the monitor (RPC
interface) fd into a separate thread in order to stay responsive when
the main event loop is blocked for any reason, so the issue came to
mind.
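One way a client could avoid blocking a synchronous code path on a slow reset is to issue the request from a helper thread and wait with a timeout. The sketch below is only an illustration of that idea, not how QEMU or any vfio-user client does it; `reset_with_timeout` and the `send_reset` callable are hypothetical names.

```python
import threading


def reset_with_timeout(send_reset, timeout_s):
    """Run a blocking reset in a helper thread.

    Returns True if the reset finished within timeout_s seconds and False
    if it is still in flight. send_reset is a hypothetical callable that
    sends VFIO_USER_DEVICE_RESET and blocks until the reply arrives.
    """
    done = threading.Event()
    outcome = {}

    def worker():
        try:
            outcome["value"] = send_reset()
        except Exception as exc:  # hand the failure back to the caller
            outcome["exc"] = exc
        finally:
            done.set()

    # Daemon thread: a hung reset does not keep the process alive on exit.
    threading.Thread(target=worker, daemon=True).start()
    if not done.wait(timeout_s):
        return False
    if "exc" in outcome:
        raise outcome["exc"]
    return True
```

A False return only means the caller stopped waiting; the reset request is still outstanding, so the caller has to decide whether to keep waiting, report an error, or tear down the connection.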
