From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-doc@vger.kernel.org
Cc: Cornelia Huck <cohuck@redhat.com>,
	kvm@vger.kernel.org, Kirti Wankhede <kwankhede@nvidia.com>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Yishai Hadas <yishaih@nvidia.com>
Subject: [PATCH RFC] vfio: Documentation for the migration region
Date: Mon, 22 Nov 2021 15:53:21 -0400
Message-ID: <0-v1-0ec87874bede+123-vfio_mig_doc_jgg@nvidia.com>

Provide some more complete documentation for the migration region's
behavior, specifically focusing on the device_state bits and the whole
system view from a VMM.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 Documentation/driver-api/vfio.rst | 208 +++++++++++++++++++++++++++++-
 1 file changed, 207 insertions(+), 1 deletion(-)

Alex/Cornelia, here is the first draft of the requested documentation I promised.

We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.

Our thinking is that NDMA would be implemented like this:

   +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)

And an .add_capability op will be used to signal driver support to userspace:

   +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6
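
As a rough illustration only (not part of the patch), the sketch below places
the proposed NDMA bit next to the existing device_state bits and transcribes
the reference save flow from the documentation below into a table. The bit
values are redefined locally so the fragment stands alone; struct mig_step,
save_flow and dma_quiesced() are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Existing device_state bits, redefined here so the sketch stands alone. */
#define VFIO_DEVICE_STATE_RUNNING  (1u << 0)
#define VFIO_DEVICE_STATE_SAVING   (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING (1u << 2)
/* Proposed in this RFC; not part of any released header. */
#define VFIO_DEVICE_STATE_NDMA     (1u << 3)

static_assert((VFIO_DEVICE_STATE_NDMA &
	       (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING |
		VFIO_DEVICE_STATE_RESUMING)) == 0,
	      "NDMA must not alias an existing bit");

/* Hypothetical: one step of the reference save flow. */
struct mig_step {
	uint32_t device_state;	/* value written to device_state */
	bool dirty_tracking;	/* IOMMU DMA logging enabled */
	bool vcpu_running;	/* VCPUs executing */
	const char *what;
};

/* Transcribed from the reference flow for saving device state. */
static const struct mig_step save_flow[] = {
	{ VFIO_DEVICE_STATE_RUNNING, false, true, "normal operation" },
	{ VFIO_DEVICE_STATE_RUNNING, true, true, "log DMAs, stream memory" },
	{ VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING, true, true,
	  "pre-copy" },
	{ VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_NDMA |
	  VFIO_DEVICE_STATE_RUNNING, false, true, "vIOMMU grace state" },
	{ VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_NDMA |
	  VFIO_DEVICE_STATE_RUNNING, false, false, "P2P DMA grace state" },
	{ VFIO_DEVICE_STATE_SAVING, false, false, "stream final state" },
	{ 0, false, false, "device is halted" },
};

#define SAVE_FLOW_LEN (sizeof(save_flow) / sizeof(save_flow[0]))

/* The device may not issue new DMA if NDMA is set; !RUNNING implies NDMA. */
static bool dma_quiesced(uint32_t device_state)
{
	return (device_state & VFIO_DEVICE_STATE_NDMA) ||
	       !(device_state & VFIO_DEVICE_STATE_RUNNING);
}
```

Walking save_flow[] in order and writing each device_state value is one way a
VMM could follow the flow; dma_quiesced() becomes true from the vIOMMU grace
state onward.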

I've described DIRTY TRACKING as a separate concept here. With the current
uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START; with our
change in direction this would be per-tracker control, but with no semantic change.
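
To make the "atomic test and clear" sampling of the DMA log concrete, here is a
software analogy using C11 atomics. It is only a sketch: the patch describes
this as a HW feature of the tracker, and struct dirty_log, dirty_log_mark() and
dirty_log_read_and_clear() are hypothetical names, not an existing interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_WORDS 4

/* Hypothetical in-memory dirty log, one bit per page. */
struct dirty_log {
	_Atomic uint64_t words[LOG_WORDS];
};

/* Writer side: stands in for the tracker logging a DMA write to 'page'. */
static void dirty_log_mark(struct dirty_log *log, size_t page)
{
	assert(page < LOG_WORDS * 64);
	atomic_fetch_or(&log->words[page / 64], UINT64_C(1) << (page % 64));
}

/*
 * Reader side: fetch each word and zero it in one atomic step. A bit set
 * concurrently is either returned by this pass or left for the next one,
 * so sampling never loses a dirty page and never has to stop DMA.
 * Returns the number of dirty pages copied into 'out'.
 */
static size_t dirty_log_read_and_clear(struct dirty_log *log, uint64_t *out)
{
	size_t dirty = 0;

	for (size_t i = 0; i < LOG_WORDS; i++) {
		out[i] = atomic_exchange(&log->words[i], 0);
		for (uint64_t w = out[i]; w; w &= w - 1)
			dirty++;	/* count set bits */
	}
	return dirty;
}
```

A tracker that can only offer separate read and clear steps would need DMA
suspended between the two, which is where NDMA comes in.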

Upon some agreement we'll include this patch in the next iteration of the mlx5 driver
along with the NDMA bits.
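
For the multi-device case in the patch below, where the grace states act as
cross device synchronization points, a VMM-side loop might look roughly like
this. Everything here is a toy model: struct mig_dev, dev_set_state(),
set_all() and stop_all() are hypothetical, and the refusal inside
dev_set_state() is a self-check of the sketch, not real driver behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values redefined locally; NDMA is the value proposed in this RFC. */
#define VFIO_DEVICE_STATE_RUNNING (1u << 0)
#define VFIO_DEVICE_STATE_SAVING  (1u << 1)
#define VFIO_DEVICE_STATE_NDMA    (1u << 3)

/* Hypothetical stand-in for one VFIO device under migration. */
struct mig_dev {
	uint32_t device_state;
};

/* A device can still issue P2P DMA if it is RUNNING without NDMA. */
static int may_issue_dma(const struct mig_dev *d)
{
	return (d->device_state & VFIO_DEVICE_STATE_RUNNING) &&
	       !(d->device_state & VFIO_DEVICE_STATE_NDMA);
}

/*
 * Model of a device_state write. It refuses to clear RUNNING on device i
 * while any peer could still target it with P2P DMA.
 */
static int dev_set_state(struct mig_dev *devs, size_t n, size_t i,
			 uint32_t state)
{
	if (!(state & VFIO_DEVICE_STATE_RUNNING))
		for (size_t j = 0; j < n; j++)
			if (j != i && may_issue_dma(&devs[j]))
				return -1;
	devs[i].device_state = state;
	return 0;
}

static int set_all(struct mig_dev *devs, size_t n, uint32_t state)
{
	for (size_t i = 0; i < n; i++)
		if (dev_set_state(devs, n, i, state))
			return -1;
	return 0;
}

/* Bring every device into the P2P grace state before any device leaves it. */
static int stop_all(struct mig_dev *devs, size_t n)
{
	assert(n > 0);
	if (set_all(devs, n, VFIO_DEVICE_STATE_SAVING |
			     VFIO_DEVICE_STATE_NDMA |
			     VFIO_DEVICE_STATE_RUNNING))
		return -1;
	return set_all(devs, n, VFIO_DEVICE_STATE_SAVING);
}
```

Stopping one device directly while a peer is still RUNNING without NDMA fails
in this model, while the two-pass stop_all() succeeds.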

Thanks,
Jason

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index c663b6f978255b..b28c6fb89ee92f 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -242,7 +242,213 @@ group and can access them as follows::
 VFIO User API
 -------------------------------------------------------------------------------
 
-Please see include/linux/vfio.h for complete API documentation.
+Please see include/uapi/linux/vfio.h for complete API documentation.
+
+-------------------------------------------------------------------------------
+
+VFIO migration driver API
+-------------------------------------------------------------------------------
+
+VFIO drivers that support migration implement a migration control register
+called device_state in the struct vfio_device_migration_info which is in its
+VFIO_REGION_TYPE_MIGRATION region.
+
+Writing device_state triggers one-time device actions when bits are set or
+cleared, and each bit also selects a continuous behavior while it is set. A VMM
+additionally controls whether the VCPUs in a VM are executing (VCPU RUNNING) and
+whether the IOMMU is logging DMAs (DIRTY TRACKING). These two controls are not
+part of the device_state register: KVM controls the VCPUs, and
+VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container controls dirty tracking.
+
+Along with the device_state the migration driver provides a data window which
+allows streaming migration data into or out of the device.
+
+A lot of flexibility is provided to userspace in how it operates these bits. The
+reference flow for saving device state in a live migration, with all features:
+
+  RUNNING, VCPU RUNNING
+     Normal operating state
+  RUNNING, DIRTY TRACKING, VCPU RUNNING
+     Log DMAs
+     Stream all memory
+  SAVING | RUNNING, DIRTY TRACKING, VCPU RUNNING
+     Log internal device changes (pre-copy)
+     Stream device state through the migration window
+
+     While in this state repeat as desired:
+	Atomic Read and Clear DMA Dirty log
+	Stream dirty memory
+  SAVING | NDMA | RUNNING, VCPU RUNNING
+     vIOMMU grace state
+     Complete all in progress IO page faults, idle the vIOMMU
+  SAVING | NDMA | RUNNING
+     Peer to Peer DMA grace state
+     Final snapshot of DMA dirty log (atomic not required)
+  SAVING
+     Stream final device state through the migration window
+     Copy final dirty data
+  0
+     Device is halted
+
+and the reference flow for resuming:
+
+  RUNNING
+     Issue VFIO_DEVICE_RESET to clear the internal device state
+  0
+     Device is halted
+  RESUMING
+     Push in migration data. Data captured during pre-copy should be
+     prepended to data captured during SAVING.
+  NDMA | RUNNING
+     Peer to Peer DMA grace state
+  RUNNING, VCPU RUNNING
+     Normal operating state
+
+If the VMM has multiple VFIO devices undergoing migration then the grace states
+act as cross device synchronization points. The VMM must bring all devices to
+the grace state before advancing past it.
+
+To support these operations the migration driver is required to implement
+specific behaviors around the device_state.
+
+Actions on Set/Clear:
+ - SAVING | RUNNING
+   The device clears the data window and begins streaming 'pre copy' migration
+   data through the window. Devices that cannot log internal state changes
+   return a 0 length migration stream.
+
+ - SAVING | !RUNNING
+   The device captures its internal state and begins streaming migration data
+   through the migration window.
+
+ - RESUMING
+   The data window is opened and can receive the migration data.
+
+ - !RESUMING
+   All the data transferred into the data window is loaded into the device's
+   internal state. The migration driver can rely on userspace issuing a
+   VFIO_DEVICE_RESET prior to starting RESUMING.
+
+ - DIRTY TRACKING
+   On set, clear the DMA log and start logging.
+
+   On clear, freeze the DMA log and allow userspace to read it. Userspace
+   must take care to ensure that DMA is suspended before clearing DIRTY
+   TRACKING, for instance by using NDMA.
+
+   DMA logs should be readable with an "atomic test and clear" to allow
+   continuous non-disruptive sampling of the log.
+
+Continuous Actions:
+  - NDMA
+    The device is not allowed to issue new DMA operations.
+    Before NDMA returns all in progress DMAs must be completed.
+
+  - !RUNNING
+    The device should not change its internal state. Implies NDMA. Any internal
+    state logging can stop.
+
+  - SAVING | !RUNNING
+    RESUMING | !RUNNING
+    The device may assume there are no incoming MMIO operations.
+
+  - RUNNING
+    The device can alter its internal state and must respond to incoming MMIO.
+
+  - SAVING | RUNNING
+    The device is logging changes to the internal state.
+
+  - !VCPU RUNNING
+    The CPU must not generate dirty pages or issue MMIO operations to devices.
+
+  - DIRTY TRACKING
+    DMAs are logged.
+
+  - ERROR
+    The behavior of the device is undefined. The device must be recovered by
+    issuing VFIO_DEVICE_RESET.
+
+In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the device
+back to device_state RUNNING. When a migration driver executes this ioctl it
+should discard the data window and set device_state to RUNNING. This must
+happen even if the device_state has errored. A freshly opened device FD
+should always be in the RUNNING state.
+
+The migration driver has limitations on what device state it can affect. Any
+device state controlled by general kernel subsystems must not be changed during
+RESUMING, and SAVING must tolerate mutation of this state. Changes to
+externally controlled device state can happen at any time, asynchronously to
+the migration (e.g. interrupt rebalancing).
+
+Some examples of externally controlled state:
+ - MSI-X interrupt page
+ - MSI/legacy interrupt configuration
+ - Large parts of the PCI configuration space, e.g. common control bits
+ - PCI power management
+ - Changes via VFIO_DEVICE_SET_IRQS
+
+During !RUNNING, especially during SAVING and RESUMING, the device may have
+limitations on what it can tolerate. An ideal device will discard incoming
+MMIO/PIO writes and return all ones to reads (exclusive of the external state
+above) in !RUNNING. However, devices are free to have undefined behavior if
+they receive MMIOs. This includes corrupting/aborting the migration, dirtying
+pages, and segfaulting userspace.
+
+However, a device may not compromise system integrity if it is subjected to an
+MMIO. It cannot trigger an error TLP, it cannot trigger a Machine Check, and
+it cannot compromise device isolation.
+
+There are several edge cases that userspace should keep in mind when
+implementing migration:
+
+- Device Peer to Peer DMA. In this case devices are able to issue DMAs to each
+  other's MMIO regions. The VMM can permit this if it maps the MMIO memory into
+  the IOMMU.
+
+  As Peer to Peer DMA is an MMIO touch like any other, it is important that
+  userspace suspend these accesses before entering any device_state where MMIO
+  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
+  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
+  device does not support NDMA, and rely on that to guarantee quiet MMIO.
+
+  The P2P Grace States exist so that all devices may reach RUNNING before any
+  device is subjected to an MMIO access.
+
+  Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to violate
+  the no-MMIO restriction during SAVING and corrupt the migration on devices
+  that cannot protect themselves.
+
+- IOMMU Page faults handled in userspace can occur at any time. A migration
+  driver is not required to serialize in-progress page faults. It can assume
+  that all page faults are completed before entering SAVING | !RUNNING. Since
+  the guest VCPU is required to complete page faults the VMM can accomplish this
+  by asserting NDMA | VCPU RUNNING and clearing all pending page faults before
+  clearing VCPU RUNNING.
+
+  Devices that do not support NDMA cannot be configured to generate page faults
+  that require the VCPU to complete.
+
+- pre-copy allows the device to implement a dirty log for its internal state.
+  During the SAVING | RUNNING state the data window should present the device
+  state being logged and during SAVING | !RUNNING the data window should present
+  the unlogged device state as well as the changes from the internal dirty log.
+
+  During RESUMING these two data streams are concatenated together.
+
+  pre-copy is only concerned with internal device state. External DMAs are
+  covered by the DIRTY TRACKING function.
+
+- Atomic Read and Clear of the DMA log is a HW feature. If the tracker
+  cannot support this, then NDMA could be used to synthesize it less
+  efficiently.
+
+- NDMA is optional. If the device does not support it then the NDMA states
+  are pushed down to the next step in the sequence and various behaviors that
+  rely on NDMA cannot be used.
+
+TBD - discoverable feature flag for NDMA
+TBD IMS xlation
+TBD PASID xlation
 
 VFIO bus driver API
 -------------------------------------------------------------------------------

base-commit: ae0351a976d1880cf152de2bc680f1dff14d9049
-- 
2.33.1

