All of lore.kernel.org
 help / color / mirror / Atom feed
From: John Levon <john.levon@nutanix.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "benjamin.walker@intel.com" <benjamin.walker@intel.com>,
	John G Johnson <john.g.johnson@oracle.com>,
	Swapnil Ingle <swapnil.ingle@nutanix.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	John Levon <levon@movementarian.org>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	"tina.zhang@intel.com" <tina.zhang@intel.com>,
	"jag.raman@oracle.com" <jag.raman@oracle.com>,
	"james.r.harris@intel.com" <james.r.harris@intel.com>,
	Elena Ufimtseva <elena.ufimtseva@oracle.com>,
	"Kanth.Ghatraju@oracle.com" <Kanth.Ghatraju@oracle.com>,
	Felipe Franciosi <felipe@nutanix.com>,
	"marcandre.lureau@redhat.com" <marcandre.lureau@redhat.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	"konrad.wilk@oracle.com" <konrad.wilk@oracle.com>,
	"yuvalkashtan@gmail.com" <yuvalkashtan@gmail.com>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"ismael@linux.com" <ismael@linux.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"changpeng.liu@intel.com" <changpeng.liu@intel.com>,
	"tomassetti.andrea@gmail.com" <tomassetti.andrea@gmail.com>,
	"mpiszczek@ddn.com" <mpiszczek@ddn.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Christophe de Dinechin <cdupontd@redhat.com>,
	Raphael Norwitz <raphael.norwitz@nutanix.com>,
	"xiuchun.lu@intel.com" <xiuchun.lu@intel.com>,
	Thanos Makatos <thanos.makatos@nutanix.com>
Subject: Re: [PATCH v8] introduce vfio-user protocol specification
Date: Wed, 19 May 2021 22:38:48 +0000	[thread overview]
Message-ID: <20210519223844.GA1036405@sent> (raw)
In-Reply-To: <20210519150817.6a90985b.alex.williamson@redhat.com>

On Wed, May 19, 2021 at 03:08:17PM -0600, Alex Williamson wrote:

> > +VFIO_USER_DMA_MAP
> > +-----------------
> > +
> > +Message Format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 2                      |
> > ++--------------+------------------------+
> > +| Message size | 16 + table size        |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Error        | 0/errno                |
> > ++--------------+------------------------+
> > +| Table        | array of table entries |
> > ++--------------+------------------------+
> > +
> > +This command message is sent by the client to the server to inform it of the
> > +memory regions the server can access. It must be sent before the server can
> > +perform any DMA to the client. It is normally sent directly after the version
> > +handshake is completed, but may also occur when memory is added to the client,
> > +or if the client uses a vIOMMU. If the client does not expect the server to
> > +perform DMA then it does not need to send to the server VFIO_USER_DMA_MAP
> > +commands. If the server does not need to perform DMA then it can ignore such
> > +commands but it must still reply to them. The table is an array of the
> > +following structure:
> > +
> > +Table entry format
> > +^^^^^^^^^^^^^^^^^^
> > +
> > ++-------------+--------+-------------+
> > +| Name        | Offset | Size        |
> > ++=============+========+=============+
> > +| Address     | 0      | 8           |
> > ++-------------+--------+-------------+
> > +| Size        | 8      | 8           |
> > ++-------------+--------+-------------+
> > +| Offset      | 16     | 8           |
> > ++-------------+--------+-------------+
> > +| Protections | 24     | 4           |
> > ++-------------+--------+-------------+
> > +| Flags       | 28     | 4           |
> > ++-------------+--------+-------------+
> > +|             | +-----+------------+ |
> > +|             | | Bit | Definition | |
> > +|             | +=====+============+ |
> > +|             | | 0   | Mappable   | |
> > +|             | +-----+------------+ |
> > ++-------------+--------+-------------+
> > +
> > +* *Address* is the base DMA address of the region.
> > +* *Size* is the size of the region.
> > +* *Offset* is the file offset of the region with respect to the associated file
> > +  descriptor.
> 
> It might help to explicitly state this value is ignored by the server
> for non-mappable DMA, otherwise a server might assume a specific value
> is required (ex. zero) for such cases.

Generally we say that unused inputs must be zero, but yes, this should be
clarified, thanks.

> > +* *Protections* are the region's protection attributes as encoded in
> > +  ``<sys/mman.h>``.
> > +* *Flags* contains the following region attributes:
> > +
> > +  * *Mappable* indicates that the region can be mapped via the mmap() system
> > +    call using the file descriptor provided in the message meta-data.
> > +
> > +This structure is 32 bytes in size, so the message size is:
> > +16 + (# of table entries * 32).
> > +
> > +If a DMA region being added can be directly mapped by the server, an array of
> > +file descriptors must be sent as part of the message meta-data. Each mappable
> > +region entry must have a corresponding file descriptor. On AF_UNIX sockets, the
> 
> Is this saying that if the client provides table entries where indexes
> 1, 3, and 5 are indicated as mappable, then there must be an ancillary
> file descriptor array with 3 entries, where fd[0] maps to entry[1],
> fd[1]:entry[3], and fd[2]:entry[5], even if fd[0-2] are all the
> same file descriptor?

Yes. Though we are planning to change these calls to only support single regions
which would make this moot.

> > +file descriptors must be passed as SCM_RIGHTS type ancillary data. Otherwise,
> > +if a DMA region cannot be directly mapped by the server, it can be accessed by
> > +the server using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, explained
> > +in `Read and Write Operations`_. A command to map over an existing region must
> > +be failed by the server with ``EEXIST`` set in error field in the reply.
> > +
> > +Adding multiple DMA regions can partially fail. The response does not indicate
> > +which regions were added and which were not, therefore it is a client
> > +implementation detail how to recover from the failure.
> > +
> > +.. Note::
> > +   The server can optionally remove succesfully added DMA regions making this
> > +   operation atomic.
> > +   The client can recover by attempting to unmap one by one all the DMA regions
> > +   in the VFIO_USER_DMA_MAP command, ignoring failures for regions that do not
> > +   exist.
> 
> What's the benefit of specifying this server behavior as optional?  I'm
> afraid this unspecified error recovery behavior might actually deter
> clients from performing batch mappings.  Servers also have little
> incentive to do their own cleanup if the client has no way to detect
> that behavior.

This may well be moot too.

> > +VFIO_USER_DMA_UNMAP
> > +-------------------
> > +
> > +This command message is sent by the client to the server to inform it that a
> > +DMA region, previously made available via a VFIO_USER_DMA_MAP command message,
> > +is no longer available for DMA. It typically occurs when memory is subtracted
> > +from the client or if the client uses a vIOMMU. If the client does not expect
> > +the server to perform DMA then it does not need to send to the server
> > +VFIO_USER_DMA_UNMAP commands. If the server does not need to perform DMA then
> > +it can ignore such commands but it must still reply to them. The table is an
> 
> I'm confused why expectation of DMA plays a factor here.  For example,
> if QEMU unplugs a DIMM and the server has an mmap of the file descriptor
> related to that DIMM, does it get to retain the mmap if it doesn't
> currently have any DMA queued targeting that address range?  Can QEMU
> skip sending an unmap if the PCI bus master bit is disabled on the
> device preventing further DMA?  How can the associated file descriptor
> get released?  This doesn't feel strongly specified.

I thought we'd removed those sentences actually, as they're just confusing. In
reality, everything is going to both send and handle map/unmap messages.

> Are there any assumptions about address and size of the unmap command
> relative to the original map command or is the client freely allowed to
> bisect, overlap, or overextend previous mappings?

Good question. Filed https://github.com/nutanix/libvfio-user/issues/504 to
track this.

I actually don't know what clients would like to be able to do in this respect.

> > +* *Offset* is the file offset of the region with respect to the associated file
> > +  descriptor.
> > +* *Protections* are the region's protection attributes as encoded in
> > +  ``<sys/mman.h>``.
> > +* *Flags* contains the following region attributes:
> > +
> > +  * *VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP* indicates that a dirty page bitmap
> > +    must be populated before unmapping the DMA region. The client must provide
> > +    a ``struct vfio_bitmap`` in the VFIO bitmaps field for each region, with
> > +    the ``vfio_bitmap.pgsize`` and ``vfio_bitmap.size`` fields initialized.
> > +
> > +* *VFIO Bitmaps* contains one ``struct vfio_bitmap`` per region (explained
> > +  below) if ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is set in Flags.
> 
> How will this be extended when new flags are added to get new data or
> new data formats?  Note for instance that the kernel struct
> vfio_iommu_type1_dma_unmap specifies the data[] as opaque in general and
> only specifies it as struct vfio_bitmap for the case where
> GET_DIRTY_BITMAP is specified.

We're planning to do something like that for the new (actually more like vfio
again) map/unmap format.

> > +Upon receiving a VFIO_USER_DMA_UNMAP command, if the file descriptor is mapped
> > +then the server must release all references to that DMA region before replying,
> > +which includes potentially in flight DMA transactions. Removing a portion of a
> > +DMA region is possible.
> 
> Ah, maybe this answers my question about unmap vs map, but it also seems
> to contradict the description allowing the server to ignore unmap
> requests if no DMA is expected when we state here that the server MUST
> release references.

The text is confusing (which is why I've removed it again). What it's really
trying to say is:

If there is a server implementation (such as the gpio-pci-idio-16 sample) that
never needs to access guest memory at all, then the server can choose to ignore
DMA_MAP/UNMAP - so it would never keep any references around in the first place.

It's not a useful thing to mention in the spec IMHO.

> Is this also a good place to point out that the max messages size of
> 4096 is extremely limiting for returning a dirty bitmap for most use
> cases?  Some discussion of the error codes for such a case might be
> relevant here too.

It's a silly low default. The only implementation so far reports 65536 for what
it's worth.

We are also prototyping implementation changes such that this limit can be
removed altogether; hopefully that will come in a future spec update.

> > +VFIO_USER_VM_INTERRUPT
> > +
> > +Interrupt info format
> > +^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++-----------+--------+------+
> > +| Name      | Offset | Size |
> > ++===========+========+======+
> > +| Sub-index | 16     | 4    |
> > ++-----------+--------+------+
> > +
> > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used in PCI
> > +  MSI/X type interrupts.
> 
> Sorry if I'm blind, but where is the index itself provided?

You were confused because the message makes no sense as it's defined. It's been
removed.

Thanks for taking a look, much appreciated!

regards
john

  reply	other threads:[~2021-05-19 22:40 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-14 11:41 [PATCH v8] introduce vfio-user protocol specification Thanos Makatos
2021-04-26 15:48 ` Stefan Hajnoczi
2021-04-27 12:02   ` Thanos Makatos
2021-04-27 15:01     ` Stefan Hajnoczi
2021-05-04 13:51 ` Stefan Hajnoczi
2021-05-04 14:31   ` John Levon
2021-05-05 15:51     ` Stefan Hajnoczi
2021-06-14  9:57     ` Thanos Makatos
2021-05-05 16:19   ` John Levon
2021-05-06  8:49     ` Stefan Hajnoczi
2021-05-07 16:10     ` Thanos Makatos
2021-06-14 10:07   ` Thanos Makatos
2021-05-10 16:57 ` Stefan Hajnoczi
2021-05-10 22:25   ` John Levon
2021-05-11 10:09     ` Stefan Hajnoczi
2021-05-11 10:43       ` John Levon
2021-05-11 15:40         ` Stefan Hajnoczi
2021-05-12  5:08     ` John Johnson
2021-05-19 21:08 ` Alex Williamson
2021-05-19 22:38   ` John Levon [this message]
2021-06-14  9:47     ` Thanos Makatos

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210519223844.GA1036405@sent \
    --to=john.levon@nutanix.com \
    --cc=Kanth.Ghatraju@oracle.com \
    --cc=alex.williamson@redhat.com \
    --cc=benjamin.walker@intel.com \
    --cc=cdupontd@redhat.com \
    --cc=changpeng.liu@intel.com \
    --cc=cohuck@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=elena.ufimtseva@oracle.com \
    --cc=felipe@nutanix.com \
    --cc=ismael@linux.com \
    --cc=jag.raman@oracle.com \
    --cc=james.r.harris@intel.com \
    --cc=jasowang@redhat.com \
    --cc=john.g.johnson@oracle.com \
    --cc=konrad.wilk@oracle.com \
    --cc=kraxel@redhat.com \
    --cc=kwankhede@nvidia.com \
    --cc=levon@movementarian.org \
    --cc=marcandre.lureau@redhat.com \
    --cc=mpiszczek@ddn.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=raphael.norwitz@nutanix.com \
    --cc=stefanha@redhat.com \
    --cc=swapnil.ingle@nutanix.com \
    --cc=thanos.makatos@nutanix.com \
    --cc=tina.zhang@intel.com \
    --cc=tomassetti.andrea@gmail.com \
    --cc=xiuchun.lu@intel.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yuvalkashtan@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.