[PATCH RFC] vfio: Documentation for the migration region

From: Jason Gunthorpe @ 2021-11-22 19:53 UTC
  To: Alex Williamson, Jonathan Corbet, linux-doc
  Cc: Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy,
	Shameer Kolothum, Yishai Hadas

Provide some more complete documentation for the migration region's
behavior, specifically focusing on the device_state bits and the whole
system view from a VMM.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 Documentation/driver-api/vfio.rst | 208 +++++++++++++++++++++++++++++-
 1 file changed, 207 insertions(+), 1 deletion(-)

Alex/Cornelia, here is the first draft of the requested documentation I promised.

We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.

Our thinking is that NDMA would be implemented like this:

   +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)

And a .add_capability ops will be used to signal to userspace driver support:

   +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6
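
As a rough illustration only (not part of the patch), the sketch below shows
how the proposed bit could sit next to the existing device_state bits. The
EX_ names and both helper functions are hypothetical. The RUNNING, SAVING and
RESUMING values are assumed to match the current uAPI, and the NDMA value is
the one proposed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical local names. The first three values are assumed to follow
 * the existing device_state bits in include/uapi/linux/vfio.h.
 */
#define EX_STATE_RUNNING  (1u << 0)
#define EX_STATE_SAVING   (1u << 1)
#define EX_STATE_RESUMING (1u << 2)
/* Proposed by this RFC, not part of the uAPI. */
#define EX_STATE_NDMA     (1u << 3)

/* The new bit must not collide with the bits already in use. */
static_assert((EX_STATE_NDMA &
	       (EX_STATE_RUNNING | EX_STATE_SAVING | EX_STATE_RESUMING)) == 0,
	      "NDMA overlaps an existing device_state bit");

/*
 * New DMA is allowed only while RUNNING is set and NDMA is clear,
 * because !RUNNING implies NDMA.
 */
static inline bool ex_may_issue_dma(uint32_t state)
{
	return (state & EX_STATE_RUNNING) && !(state & EX_STATE_NDMA);
}

/* The device must respond to MMIO whenever RUNNING is set, NDMA or not. */
static inline bool ex_must_accept_mmio(uint32_t state)
{
	return (state & EX_STATE_RUNNING) != 0;
}
```

With these helpers, SAVING | NDMA | RUNNING is a state where the device still
answers MMIO but may not start new DMA, which is what the grace states in the
document rely on.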

I've described DIRTY TRACKING as a separate concept here. With the current
uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START, with our
change in direction this would be per-tracker control, but no semantic change.
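
To show what "separate concept" means in practice, here is a minimal sketch
(again not part of the patch) that writes the reference save flow from the
document below as a table. Only the first column is a device_state value.
DIRTY TRACKING and VCPU RUNNING are carried as separate fields because they
are controlled elsewhere. All names are hypothetical, and the rows copy the
flow as listed in the document.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local names for the device_state bits used while saving. */
#define FLOW_RUNNING (1u << 0)
#define FLOW_SAVING  (1u << 1)
#define FLOW_NDMA    (1u << 3)	/* proposed in this RFC */

struct flow_step {
	uint32_t device_state;	/* value written to device_state */
	bool dirty_tracking;	/* controlled on the container */
	bool vcpu_running;	/* controlled through KVM */
	const char *what;
};

/* One row per step of the reference save flow, in order. */
static const struct flow_step save_flow[] = {
	{ FLOW_RUNNING,                           false, true,  "normal operation" },
	{ FLOW_RUNNING,                           true,  true,  "log DMAs, stream memory" },
	{ FLOW_SAVING | FLOW_RUNNING,             true,  true,  "pre-copy" },
	{ FLOW_SAVING | FLOW_NDMA | FLOW_RUNNING, false, true,  "vIOMMU grace state" },
	{ FLOW_SAVING | FLOW_NDMA | FLOW_RUNNING, false, false, "P2P DMA grace state" },
	{ FLOW_SAVING,                            false, false, "stream final state" },
	{ 0,                                      false, false, "device is halted" },
};

#define SAVE_FLOW_LEN (sizeof(save_flow) / sizeof(save_flow[0]))

/*
 * A running VCPU may issue MMIO, and a device is only required to
 * respond to MMIO while RUNNING is set.
 */
static bool flow_step_mmio_safe(const struct flow_step *s)
{
	return !s->vcpu_running || (s->device_state & FLOW_RUNNING);
}

/* Check that every step of the table keeps the VCPU away from a !RUNNING device. */
static bool save_flow_mmio_safe(void)
{
	for (size_t i = 0; i < SAVE_FLOW_LEN; i++)
		if (!flow_step_mmio_safe(&save_flow[i]))
			return false;
	return true;
}
```

The check only covers MMIO from the VCPU. Peer to Peer DMA from other devices
is the reason for the extra NDMA steps and is not modelled here.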

Upon some agreement we'll include this patch in the next iteration of the mlx5 driver
along with the NDMA bits.

Thanks,
Jason

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index c663b6f978255b..b28c6fb89ee92f 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -242,7 +242,213 @@ group and can access them as follows::
 VFIO User API
 -------------------------------------------------------------------------------
 
-Please see include/linux/vfio.h for complete API documentation.
+Please see include/uapi/linux/vfio.h for complete API documentation.
+
+-------------------------------------------------------------------------------
+
+VFIO migration driver API
+-------------------------------------------------------------------------------
+
+VFIO drivers that support migration implement a migration control register
+called device_state in the struct vfio_device_migration_info, which is located
+in the device's VFIO_REGION_TYPE_MIGRATION region.
+
+The device_state triggers device actions when bits are set or cleared, and each
+bit value also implies a continuous behavior while it is held. A VMM additionally
+controls whether the VCPUs in a VM are executing (VCPU RUNNING) and whether the
+IOMMU is logging DMAs (DIRTY TRACKING). These two controls are not part of the
+device_state register: KVM controls the VCPUs, and
+VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container controls dirty tracking.
+
+Along with the device_state the migration driver provides a data window which
+allows streaming migration data into or out of the device.
+
+A lot of flexibility is provided to userspace in how it operates these bits. The
+reference flow for saving device state in a live migration, with all features:
+
+  RUNNING, VCPU RUNNING
+     Normal operating state
+  RUNNING, DIRTY TRACKING, VCPU RUNNING
+     Log DMAs
+     Stream all memory
+  SAVING | RUNNING, DIRTY TRACKING, VCPU RUNNING
+     Log internal device changes (pre-copy)
+     Stream device state through the migration window
+
+     While in this state repeat as desired:
+	Atomic Read and Clear DMA Dirty log
+	Stream dirty memory
+  SAVING | NDMA | RUNNING, VCPU RUNNING
+     vIOMMU grace state
+     Complete all in progress IO page faults, idle the vIOMMU
+  SAVING | NDMA | RUNNING
+     Peer to Peer DMA grace state
+     Final snapshot of DMA dirty log (atomic not required)
+  SAVING
+     Stream final device state through the migration window
+     Copy final dirty data
+  0
+     Device is halted
+
+and the reference flow for resuming:
+
+  RUNNING
+     Issue VFIO_DEVICE_RESET to clear the internal device state
+  0
+     Device is halted
+  RESUMING
+     Push in migration data. Data captured during pre-copy should be
+     prepended to data captured during SAVING.
+  NDMA | RUNNING
+     Peer to Peer DMA grace state
+  RUNNING, VCPU RUNNING
+     Normal operating state
+
+If the VMM has multiple VFIO devices undergoing migration then the grace states
+act as cross device synchronization points. The VMM must bring all devices to
+the grace state before advancing past it.
+
+To support these operations the migration driver is required to implement
+specific behaviors around the device_state.
+
+Actions on Set/Clear:
+ - SAVING | RUNNING
+   The device clears the data window and begins streaming 'pre copy' migration
+   data through the window. Devices that cannot log internal state changes return
+   a 0 length migration stream.
+
+ - SAVING | !RUNNING
+   The device captures its internal state and begins streaming migration data
+   through the migration window.
+
+ - RESUMING
+   The data window is opened and can receive the migration data.
+
+ - !RESUMING
+   All the data transferred into the data window is loaded into the device's
+   internal state. The migration driver can rely on userspace issuing a
+   VFIO_DEVICE_RESET prior to starting RESUMING.
+
+ - DIRTY TRACKING
+   On set, clear the DMA log and start logging.
+
+   On clear, freeze the DMA log and allow userspace to read it. Userspace must
+   take care to ensure that DMA is suspended before clearing DIRTY TRACKING, for
+   instance by using NDMA.
+
+   DMA logs should be readable with an "atomic test and clear" to allow
+   continuous non-disruptive sampling of the log.
+
+Continuous Actions:
+  - NDMA
+    The device is not allowed to issue new DMA operations.
+    Before the transition to NDMA returns, all in-progress DMAs must be completed.
+
+  - !RUNNING
+    The device should not change its internal state. Implies NDMA. Any internal
+    state logging can stop.
+
+  - SAVING | !RUNNING
+    RESUMING | !RUNNING
+    The device may assume there are no incoming MMIO operations.
+
+  - RUNNING
+    The device can alter its internal state and must respond to incoming MMIO.
+
+  - SAVING | RUNNING
+    The device is logging changes to the internal state.
+
+  - !VCPU RUNNING
+    The CPU must not generate dirty pages or issue MMIO operations to devices.
+
+  - DIRTY TRACKING
+    DMAs are logged.
+
+  - ERROR
+    The behavior of the device is undefined. The device must be recovered by
+    issuing VFIO_DEVICE_RESET.
+
+In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the device
+back to device_state RUNNING. When a migration driver executes this ioctl it
+should discard the data window and set device_state to RUNNING. This must
+happen even if the device_state has errored. A freshly opened device FD
+should always be in the RUNNING state.
+
+The migration driver has limitations on what device state it can affect. Any
+device state controlled by general kernel subsystems must not be changed during
+RESUMING, and SAVING must tolerate mutation of this state. Changes to
+externally controlled device state can happen at any time, asynchronously to
+the migration (e.g. interrupt rebalancing).
+
+Some examples of externally controlled state:
+ - MSI-X interrupt page
+ - MSI/legacy interrupt configuration
+ - Large parts of the PCI configuration space, e.g. common control bits
+ - PCI power management
+ - Changes via VFIO_DEVICE_SET_IRQS
+
+During !RUNNING, especially during SAVING and RESUMING, the device may have
+limitations on what it can tolerate. An ideal device will discard/return all
+ones to all incoming MMIO/PIO operations (exclusive of the external state above)
+in !RUNNING. However, devices are free to have undefined behavior if they
+receive MMIOs. This includes corrupting/aborting the migration, dirtying pages,
+and segfaulting userspace.
+
+However, a device may not compromise system integrity if it is subjected to an
+MMIO. It cannot trigger an error TLP, it cannot trigger a Machine Check, and
+it cannot compromise device isolation.
+
+There are several edge cases that userspace should keep in mind when
+implementing migration:
+
+- Device Peer to Peer DMA. In this case devices are able to issue DMAs to each
+  other's MMIO regions. The VMM can permit this if it maps the MMIO memory into
+  the IOMMU.
+
+  As Peer to Peer DMA is a MMIO touch like any other, it is important that
+  userspace suspend these accesses before entering any device_state where MMIO
+  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
+  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
+  device does not support NDMA, and rely on that to guarantee quiet MMIO.
+
+  The P2P Grace States exist so that all devices may reach RUNNING before any
+  device is subjected to a MMIO access.
+
+  Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to violate
+  the no-MMIO restriction during SAVING and corrupt the migration on devices
+  that cannot protect themselves.
+
+- IOMMU Page faults handled in userspace can occur at any time. A migration
+  driver is not required to serialize in-progress page faults. It can assume
+  that all page faults are completed before entering SAVING | !RUNNING. Since
+  the guest VCPU is required to complete page faults the VMM can accomplish this
+  by asserting NDMA, VCPU RUNNING and clearing all pending page faults before
+  clearing VCPU RUNNING.
+
+  Devices that do not support NDMA cannot be configured to generate page faults
+  that require the VCPU to complete.
+
+- pre-copy allows the device to implement a dirty log for its internal state.
+  During the SAVING | RUNNING state the data window should present the device
+  state being logged and during SAVING | !RUNNING the data window should present
+  the unlogged device state as well as the changes from the internal dirty log.
+
+  On RESUMING these two data streams are concatenated together.
+
+  pre-copy is only concerned with internal device state. External DMAs are
+  covered by the DIRTY TRACKING function.
+
+- Atomic Read and Clear of the DMA log is a HW feature. If the tracker
+  cannot support this, then NDMA could be used to synthesize it less
+  efficiently.
+
+- NDMA is optional. If the device does not support it, the NDMA states are
+  pushed down to the next step in the sequence and the behaviors that rely on
+  NDMA cannot be used.
+
+TBD - discoverable feature flag for NDMA
+TBD IMS xlation
+TBD PASID xlation
 
 VFIO bus driver API
 -------------------------------------------------------------------------------

base-commit: ae0351a976d1880cf152de2bc680f1dff14d9049
-- 
2.33.1

