kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
@ 2021-12-09 23:34 Alex Williamson
  2021-12-10  1:25 ` Jason Gunthorpe
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Alex Williamson @ 2021-12-09 23:34 UTC (permalink / raw)
  To: alex.williamson
  Cc: jgg, cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

A new NDMA state is being proposed to support a quiescent state for
contexts containing multiple devices with peer-to-peer DMA support.
Formally define it.

Clarify various aspects of the migration region data fields and
protocol.  Remove QEMU related terminology and flows from the uAPI;
these will be provided in Documentation/ so as not to confuse the
device_state bitfield with a finite state machine with restricted
state transitions.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 include/uapi/linux/vfio.h |  405 ++++++++++++++++++++++++---------------------
 1 file changed, 214 insertions(+), 191 deletions(-)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..1fdbc928f886 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -408,199 +408,211 @@ struct vfio_region_gfx_edid {
 #define VFIO_REGION_SUBTYPE_MIGRATION           (1)
 
 /*
- * The structure vfio_device_migration_info is placed at the 0th offset of
- * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related
- * migration information. Field accesses from this structure are only supported
- * at their native width and alignment. Otherwise, the result is undefined and
- * vendor drivers should return an error.
+ * The structure vfio_device_migration_info is placed at the immediate start of
+ * the per-device VFIO_REGION_SUBTYPE_MIGRATION region to manage the device
+ * state and migration information for the device.  Field accesses for this
+ * structure are only supported using their native width and alignment,
+ * accesses otherwise are undefined and the kernel migration driver should
+ * return an error.
  *
  * device_state: (read/write)
- *      - The user application writes to this field to inform the vendor driver
- *        about the device state to be transitioned to.
- *      - The vendor driver should take the necessary actions to change the
- *        device state. After successful transition to a given state, the
- *        vendor driver should return success on write(device_state, state)
- *        system call. If the device state transition fails, the vendor driver
- *        should return an appropriate -errno for the fault condition.
- *      - On the user application side, if the device state transition fails,
- *	  that is, if write(device_state, state) returns an error, read
- *	  device_state again to determine the current state of the device from
- *	  the vendor driver.
- *      - The vendor driver should return previous state of the device unless
- *        the vendor driver has encountered an internal error, in which case
- *        the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR.
- *      - The user application must use the device reset ioctl to recover the
- *        device from VFIO_DEVICE_STATE_ERROR state. If the device is
- *        indicated to be in a valid device state by reading device_state, the
- *        user application may attempt to transition the device to any valid
- *        state reachable from the current state or terminate itself.
- *
- *      device_state consists of 3 bits:
- *      - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear,
- *        it indicates the _STOP state. When the device state is changed to
- *        _STOP, driver should stop the device before write() returns.
- *      - If bit 1 is set, it indicates the _SAVING state, which means that the
- *        driver should start gathering device state information that will be
- *        provided to the VFIO user application to save the device's state.
- *      - If bit 2 is set, it indicates the _RESUMING state, which means that
- *        the driver should prepare to resume the device. Data provided through
- *        the migration region should be used to resume the device.
- *      Bits 3 - 31 are reserved for future use. To preserve them, the user
- *      application should perform a read-modify-write operation on this
- *      field when modifying the specified bits.
- *
- *  +------- _RESUMING
- *  |+------ _SAVING
- *  ||+----- _RUNNING
- *  |||
- *  000b => Device Stopped, not saving or resuming
- *  001b => Device running, which is the default state
- *  010b => Stop the device & save the device state, stop-and-copy state
- *  011b => Device running and save the device state, pre-copy state
- *  100b => Device stopped and the device state is resuming
- *  101b => Invalid state
- *  110b => Error state
- *  111b => Invalid state
- *
- * State transitions:
- *
- *              _RESUMING  _RUNNING    Pre-copy    Stop-and-copy   _STOP
- *                (100b)     (001b)     (011b)        (010b)       (000b)
- * 0. Running or default state
- *                             |
- *
- * 1. Normal Shutdown (optional)
- *                             |------------------------------------->|
- *
- * 2. Save the state or suspend
- *                             |------------------------->|---------->|
- *
- * 3. Save the state during live migration
- *                             |----------->|------------>|---------->|
- *
- * 4. Resuming
- *                  |<---------|
- *
- * 5. Resumed
- *                  |--------->|
- *
- * 0. Default state of VFIO device is _RUNNING when the user application starts.
- * 1. During normal shutdown of the user application, the user application may
- *    optionally change the VFIO device state from _RUNNING to _STOP. This
- *    transition is optional. The vendor driver must support this transition but
- *    must not require it.
- * 2. When the user application saves state or suspends the application, the
- *    device state transitions from _RUNNING to stop-and-copy and then to _STOP.
- *    On state transition from _RUNNING to stop-and-copy, driver must stop the
- *    device, save the device state and send it to the application through the
- *    migration region. The sequence to be followed for such transition is given
- *    below.
- * 3. In live migration of user application, the state transitions from _RUNNING
- *    to pre-copy, to stop-and-copy, and to _STOP.
- *    On state transition from _RUNNING to pre-copy, the driver should start
- *    gathering the device state while the application is still running and send
- *    the device state data to application through the migration region.
- *    On state transition from pre-copy to stop-and-copy, the driver must stop
- *    the device, save the device state and send it to the user application
- *    through the migration region.
- *    Vendor drivers must support the pre-copy state even for implementations
- *    where no data is provided to the user before the stop-and-copy state. The
- *    user must not be required to consume all migration data before the device
- *    transitions to a new state, including the stop-and-copy state.
- *    The sequence to be followed for above two transitions is given below.
- * 4. To start the resuming phase, the device state should be transitioned from
- *    the _RUNNING to the _RESUMING state.
- *    In the _RESUMING state, the driver should use the device state data
- *    received through the migration region to resume the device.
- * 5. After providing saved device data to the driver, the application should
- *    change the state from _RESUMING to _RUNNING.
+ *   The device_state field is a bitfield written by the user to transition
+ *   the associated device between valid migration states using these rules:
+ *     - The user may read or write the device state register at any time.
+ *     - The kernel migration driver must fully transition the device to the
+ *       new state value before the write(2) operation returns to the user.
+ *     - The user may change multiple bits of the bitfield in the same write
+ *       operation, so long as the resulting state is valid.
+ *     - The kernel migration driver must not generate asynchronous device
+ *       state transitions outside of manipulation by the user or the
+ *       VFIO_DEVICE_RESET ioctl as described below.
+ *     - In the event of a device state transition failure, the kernel
+ *       migration driver must return a write(2) error with appropriate errno
+ *       to the user.
+ *     - Upon such an error, re-reading the device_state field must indicate
+ *       the device migration state as either the same state as prior to the
+ *       failing write or, at the migration driver discretion, indicate the
+ *       device state as VFIO_DEVICE_STATE_ERROR.
+ *     - To continue using a device that has entered VFIO_DEVICE_STATE_ERROR,
+ *       the user must issue a VFIO_DEVICE_RESET ioctl, which must transition
+ *       the migration state to the default value (defined below).
+ *     - Devices supporting migration via this specification must support the
+ *       VFIO_DEVICE_RESET ioctl and any use of that ioctl must return the
+ *       device migration state to the default value.
+ *
+ *   The device_state field defines the following bitfield use:
+ *
+ *     - Bit 0 (RUNNING) [REQUIRED]:
+ *        - Setting this bit indicates the device is fully operational, the
+ *          device may generate interrupts, DMA, respond to MMIO, all vfio
+ *          device regions are functional, and the device may advance its
+ *          internal state.  The default device_state must indicate the device
+ *          in exclusively the RUNNING state, with no other bits in this field
+ *          set.
+ *        - Clearing this bit (ie. !RUNNING) must stop the operation of the
+ *          device.  The device must not generate interrupts, DMA, or advance
+ *          its internal state.  The user should take steps to restrict access
+ *          to vfio device regions other than the migration region while the
+ *          device is !RUNNING or risk corruption of the device migration data
+ *          stream.  The device and kernel migration driver must accept and
+ *          respond to interaction to support external subsystems in the
+ *          !RUNNING state, for example PCI MSI-X and PCI config space.
+ *          Failure by the user to restrict device access while !RUNNING must
+ *          not result in error conditions outside the user context (ex.
+ *          host system faults).
+ *     - Bit 1 (SAVING) [REQUIRED]:
+ *        - Setting this bit enables and initializes the migration region data
+ *          window and associated fields within vfio_device_migration_info for
+ *          capturing the migration data stream for the device.  The migration
+ *          driver may perform actions such as enabling dirty logging of device
+ *          state with this bit.  The SAVING bit is mutually exclusive with the
+ *          RESUMING bit defined below.
+ *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
+ *          data window and indicates the completion or termination of the
+ *          migration data stream for the device.
+ *     - Bit 2 (RESUMING) [REQUIRED]:
+ *        - Setting this bit enables and initializes the migration region data
+ *          window and associated fields within vfio_device_migration_info for
+ *          restoring the device from a migration data stream captured from a
+ *          SAVING session with a compatible device.  The migration driver may
+ *          perform internal device resets as necessary to reinitialize the
+ *          internal device state for the incoming migration data.
+ *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
+ *          region data window and indicates the end of a resuming session for
+ *          the device.  The kernel migration driver should complete the
+ *          incorporation of data written to the migration data window into the
+ *          device internal state and perform final validity and consistency
+ *          checking of the new device state.  If the user provided data is
+ *          found to be incomplete, inconsistent, or otherwise invalid, the
+ *          migration driver must indicate a write(2) error and follow the
+ *          previously described protocol to return either the previous state
+ *          or an error state.
+ *     - Bit 3 (NDMA) [OPTIONAL]:
+ *        The NDMA or "No DMA" state is intended to be a quiescent state for
+ *        the device for the purposes of managing multiple devices within a
+ *        user context where peer-to-peer DMA between devices may be active.
+ *        Support for the NDMA bit is indicated through the presence of the
+ *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
+ *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
+ *        region.
+ *        - Setting this bit must prevent the device from initiating any
+ *          new DMA or interrupt transactions.  The migration driver must
+ *          complete any such outstanding operations prior to completing
+ *          the transition to the NDMA state.  The NDMA device_state
+ *          essentially represents a sub-set of the !RUNNING state for the
+ *          purpose of quiescing the device, therefore the NDMA device_state
+ *          bit is superfluous in combinations including !RUNNING.
+ *        - Clearing this bit (ie. !NDMA) negates the device operational
+ *          restrictions required by the NDMA state.
+ *     - Bits [31:4]:
+ *        Reserved for future use, users should use read-modify-write
+ *        operations to the device_state field for manipulation of the above
+ *        defined bits for optimal compatibility.
+ *
+ *   All combinations for the above defined device_state bits are considered
+ *   valid with the following exceptions:
+ *     - RESUMING and SAVING are mutually exclusive, all combinations of
+ *       (RESUMING | SAVING) are invalid.  Furthermore the specific combination
+ *       (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the
+ *       device error state VFIO_DEVICE_STATE_ERROR.  This variant is
+ *       specifically chosen due to the !RUNNING state of the device as the
+ *       migration driver should do everything possible, including an internal
+ *       reset of the device, to ensure that the device is fully stopped in
+ *       this state.  Other invalid combinations are reserved for future use
+ *       and must not be reachable.
+ *     - Combinations involving (RESUMING | RUNNING) are currently unsupported
+ *       by this revision of the uAPI.
+ *
+ *   Migration drivers should attempt to support any transition between valid
+ *   states.  For further discussion of device_state relative to expected usage
+ *   scenarios, see: Documentation/driver-api/vfio.rst
  *
  * reserved:
- *      Reads on this field return zero and writes are ignored.
+ *   Reads on this field return zero and writes are ignored.
  *
  * pending_bytes: (read only)
- *      The number of pending bytes still to be migrated from the vendor driver.
+ *   The kernel migration driver uses this field to indicate an estimate of
+ *   the remaining data size (in bytes) for the user to copy while SAVING is
+ *   set in the device_state.  The value should be considered volatile,
+ *   especially while RUNNING is still set in the device_state.  Userspace
+ *   uses this field to test whether data is available to be read from the
+ *   data section described below.  Userspace should only consider whether
+ *   the value read is zero or non-zero for the purposes of the protocol
+ *   below.  The user may only consider the migration data stream to be
+ *   completed when pending_bytes reports a zero value while the device is
+ *   !RUNNING.  The kernel migration driver must not require the user to reach
+ *   a zero value for this field to transition to a !RUNNING device_state.
+ *   The value of this field is undefined when !SAVING.
  *
  * data_offset: (read only)
- *      The user application should read data_offset field from the migration
- *      region. The user application should read the device data from this
- *      offset within the migration region during the _SAVING state or write
- *      the device data during the _RESUMING state. See below for details of
- *      sequence to be followed.
+ *   This field indicates the offset relative to the start of the device
+ *   migration region for the user to collect (SAVING) or store (RESUMING)
+ *   migration data for the device following the protocol described below.
+ *   The migration driver may provide sparse mmap support for the migration
+ *   region and use the data_offset field to direct user accesses as
+ *   appropriate, but must not require mmap access when provided.  The value
+ *   of this field is undefined when device_state does not include either
+ *   SAVING or RESUMING.
  *
  * data_size: (read/write)
- *      The user application should read data_size to get the size in bytes of
- *      the data copied in the migration region during the _SAVING state and
- *      write the size in bytes of the data copied in the migration region
- *      during the _RESUMING state.
- *
- * The format of the migration region is as follows:
- *  ------------------------------------------------------------------
- * |vfio_device_migration_info|    data section                      |
- * |                          |     ///////////////////////////////  |
- * ------------------------------------------------------------------
- *   ^                              ^
- *  offset 0-trapped part        data_offset
- *
- * The structure vfio_device_migration_info is always followed by the data
- * section in the region, so data_offset will always be nonzero. The offset
- * from where the data is copied is decided by the kernel driver. The data
- * section can be trapped, mmapped, or partitioned, depending on how the kernel
- * driver defines the data section. The data section partition can be defined
- * as mapped by the sparse mmap capability. If mmapped, data_offset must be
- * page aligned, whereas initial section which contains the
- * vfio_device_migration_info structure, might not end at the offset, which is
- * page aligned. The user is not required to access through mmap regardless
- * of the capabilities of the region mmap.
- * The vendor driver should determine whether and how to partition the data
- * section. The vendor driver should return data_offset accordingly.
- *
- * The sequence to be followed while in pre-copy state and stop-and-copy state
- * is as follows:
- * a. Read pending_bytes, indicating the start of a new iteration to get device
- *    data. Repeated read on pending_bytes at this stage should have no side
- *    effects.
- *    If pending_bytes == 0, the user application should not iterate to get data
- *    for that device.
- *    If pending_bytes > 0, perform the following steps.
- * b. Read data_offset, indicating that the vendor driver should make data
- *    available through the data section. The vendor driver should return this
- *    read operation only after data is available from (region + data_offset)
- *    to (region + data_offset + data_size).
- * c. Read data_size, which is the amount of data in bytes available through
- *    the migration region.
- *    Read on data_offset and data_size should return the offset and size of
- *    the current buffer if the user application reads data_offset and
- *    data_size more than once here.
- * d. Read data_size bytes of data from (region + data_offset) from the
- *    migration region.
- * e. Process the data.
- * f. Read pending_bytes, which indicates that the data from the previous
- *    iteration has been read. If pending_bytes > 0, go to step b.
- *
- * The user application can transition from the _SAVING|_RUNNING
- * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
- * number of pending bytes. The user application should iterate in _SAVING
- * (stop-and-copy) until pending_bytes is 0.
- *
- * The sequence to be followed while _RESUMING device state is as follows:
- * While data for this device is available, repeat the following steps:
- * a. Read data_offset from where the user application should write data.
- * b. Write migration data starting at the migration region + data_offset for
- *    the length determined by data_size from the migration source.
- * c. Write data_size, which indicates to the vendor driver that data is
- *    written in the migration region. Vendor driver must return this write
- *    operations on consuming data. Vendor driver should apply the
- *    user-provided migration region data to the device resume state.
- *
- * If an error occurs during the above sequences, the vendor driver can return
- * an error code for next read() or write() operation, which will terminate the
- * loop. The user application should then take the next necessary action, for
- * example, failing migration or terminating the user application.
- *
- * For the user application, data is opaque. The user application should write
- * data in the same order as the data is received and the data should be of
- * same transaction size at the source.
+ *   This field indicates the length of the current data segment in bytes.
+ *   While SAVING, the kernel migration driver uses this field to indicate to
+ *   the user the length of the migration data stream available at data_offset.
+ *   When RESUMING, the user writes this field with the length of the data
+ *   segment written at the migration driver provided data_offset.  The value
+ *   of this field is undefined when device_state does not include either
+ *   SAVING or RESUMING.
+ *
+ * The following protocol is used while the device is in the SAVING
+ * device_state:
+ *
+ * a) The user reads pending_bytes.  If the read value is zero, no data is
+ *    currently available for the device.  If the device is !RUNNING and a
+ *    zero value is read, this indicates the end of the device migration
+ *    stream and the device must not generate any new migration data.  If
+ *    the read value is non-zero, the user may proceed to collect device
+ *    migration data in step b).  Repeated reads of pending_bytes is allowed
+ *    and must not compromise the migration data stream provided the user does
+ *    not proceed to the following step.
+ * b) The user reads data_offset, which indicates to the migration driver to
+ *    make a segment of device migration data available to the user at the
+ *    provided offset.  This action commits the user to collect the data
+ *    segment.
+ * c) The user reads data_size to determine the extent of the currently
+ *    available migration data segment.
+ * d) The user collects the data_size segment of device migration data at the
+ *    previously provided data_offset using access methods compatible to those
+ *    for the migration region.  The user must not be required to collect the
+ *    data in a single operation.
+ * e) The user re-reads pending_bytes to indicate to the migration driver that
+ *    the provided data has been collected.  Provided the read pending_bytes
+ *    value is non-zero, the user may proceed directly to step b) for another
+ *    iteration.
+ *
+ * The following protocol is used while the device is in the RESUMING
+ * device_state:
+ *
+ * a) The user reads data_offset, which directs the user to the location
+ *    within the migration region to store the migration data segment.
+ * b) The user writes the migration data segment starting at the provided
+ *    data_offset.  The user must preserve the data segment size as used when
+ *    the segment was collected from the device when SAVING.
+ * c) The user writes the data_size field with the number of bytes written to
+ *    the migration region in step b).  The kernel migration driver may use
+ *    this write to indicate the end of the current iteration.
+ * d) User proceeds to step a) so long as the migration data stream is not
+ *    complete.
+ *
+ * The kernel migration driver may indicate an error condition by returning
+ * a fault on read(2) or write(2) for any operation most approximate to the
+ * detection of the error.  Field accesses are provided within the protocol
+ * such that an opportunity exists to return a fault regardless of whether the
+ * data section is directly accessed via an mmap.
+ *
+ * The user must consider the migration data segments to be opaque and
+ * non-fungible.  During RESUMING, the data segments must be written in the
+ * size and order as provided during SAVING, irrespective of other bits within
+ * the device_state bitfield (ex. a transition to !RUNNING).
  */
 
 struct vfio_device_migration_info {
@@ -609,21 +621,25 @@ struct vfio_device_migration_info {
 #define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
 #define VFIO_DEVICE_STATE_SAVING    (1 << 1)
 #define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
-#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
-				     VFIO_DEVICE_STATE_SAVING |  \
+#define VFIO_DEVICE_STATE_NDMA      (1 << 3)
+#define VFIO_DEVICE_STATE_ERROR     (VFIO_DEVICE_STATE_SAVING | \
 				     VFIO_DEVICE_STATE_RESUMING)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING  | \
+				     VFIO_DEVICE_STATE_SAVING   | \
+				     VFIO_DEVICE_STATE_RESUMING | \
+				     VFIO_DEVICE_STATE_NDMA)
 
 #define VFIO_DEVICE_STATE_VALID(state) \
-	(state & VFIO_DEVICE_STATE_RESUMING ? \
-	(state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
+	(!((state & VFIO_DEVICE_STATE_SAVING) && \
+	   (state & VFIO_DEVICE_STATE_RESUMING)) && \
+	 !((state & VFIO_DEVICE_STATE_RESUMING) && \
+	   (state & VFIO_DEVICE_STATE_RUNNING)))
 
 #define VFIO_DEVICE_STATE_IS_ERROR(state) \
-	((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
-					      VFIO_DEVICE_STATE_RESUMING))
+	((state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_ERROR)
 
 #define VFIO_DEVICE_STATE_SET_ERROR(state) \
-	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
-					     VFIO_DEVICE_STATE_RESUMING)
+	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_ERROR)
 
 	__u32 reserved;
 	__u64 pending_bytes;
@@ -631,6 +647,13 @@ struct vfio_device_migration_info {
 	__u64 data_size;
 };
 
+/*
+ * The Migration NDMA capability is exposed on a device Migration region
+ * to indicate support for the VFIO_DEVICE_STATE_NDMA bit of
+ * vfio_device_migration_info.device_state.
+ */
+#define VFIO_REGION_INFO_CAP_MIG_NDMA		6
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2021-12-09 23:34 [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state Alex Williamson
@ 2021-12-10  1:25 ` Jason Gunthorpe
  2021-12-13 20:40   ` Alex Williamson
  2021-12-20 17:38 ` Cornelia Huck
  2022-01-07  8:03 ` Tian, Kevin
  2 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2021-12-10  1:25 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Thu, Dec 09, 2021 at 04:34:29PM -0700, Alex Williamson wrote:
> A new NDMA state is being proposed to support a quiescent state for
> contexts containing multiple devices with peer-to-peer DMA support.
> Formally define it.
> 
> Clarify various aspects of the migration region data fields and
> protocol.  Remove QEMU related terminology and flows from the uAPI;
> these will be provided in Documentation/ so as not to confuse the
> device_state bitfield with a finite state machine with restricted
> state transitions.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>  include/uapi/linux/vfio.h |  405 ++++++++++++++++++++++++---------------------
>  1 file changed, 214 insertions(+), 191 deletions(-)

I need other peope to look this over, so these are just some quick
remarks. Thanks for doing it, it is very good.

Given I'm almost on vacation till Jan I think we will shortly have to
table this discussion to January.

But, if you are happy with this as all that is needed to do mlx5 we
can possibly have the v6 updated in early January, after the next
merge window.

Though lets try to quickly decide on what to do about the "change
multiple bits" below, please.

> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ef33ea002b0b..1fdbc928f886 100644
> +++ b/include/uapi/linux/vfio.h
> @@ -408,199 +408,211 @@ struct vfio_region_gfx_edid {
>  #define VFIO_REGION_SUBTYPE_MIGRATION           (1)
>  
>  /*
> + * The structure vfio_device_migration_info is placed at the immediate start of
> + * the per-device VFIO_REGION_SUBTYPE_MIGRATION region to manage the device
> + * state and migration information for the device.  Field accesses for this
> + * structure are only supported using their native width and alignment,
> + * accesses otherwise are undefined and the kernel migration driver should
> + * return an error.
>   *
>   * device_state: (read/write)
> + *   The device_state field is a bitfield written by the user to transition
> + *   the associated device between valid migration states using these rules:
> + *     - The user may read or write the device state register at any time.
> + *     - The kernel migration driver must fully transition the device to the
> + *       new state value before the write(2) operation returns to the user.
> + *     - The user may change multiple bits of the bitfield in the same write
> + *       operation, so long as the resulting state is valid.

I would like to forbid this. It is really too complicated, and if
there is not a strongly defined device behavior when this is done it
is not inter-operable for userspace to do it.

> + *     - The kernel migration driver must not generate asynchronous device
> + *       state transitions outside of manipulation by the user or the
> + *       VFIO_DEVICE_RESET ioctl as described below.
> + *     - In the event of a device state transition failure, the kernel
> + *       migration driver must return a write(2) error with appropriate errno
> + *       to the user.
> + *     - Upon such an error, re-reading the device_state field must indicate
> + *       the device migration state as either the same state as prior to the
> + *       failing write or, at the migration driver discretion, indicate the
> + *       device state as VFIO_DEVICE_STATE_ERROR.

It is because this is complete nightmare. Let's say the user goes from
0 -> SAVING | RUNNING and SAVING fails after we succeed to do
RUNNING. We have to also backtrack and undo RUNNING, but what if that
fails too? Oh and we have to keep track of all this backtracking while
executing the new state and write a bunch of complicated never tested
error handling code.

Assuming we can even figure out what the precedence of multiple bits
even means for interoperability.

Backed into this is an assumption that any device transition is fully
atomic - that just isn't what I see any of the HW has done.

I thought we could tackled this when you first suggested it (eg copy
the mlx5 logic and be OK), but now I'm very skeptical. The idea that
every driver can do this right in all the corner cases doesn't seem
reasonable given we've made so many errors here already just in mlx5.

> + *     - Bit 1 (SAVING) [REQUIRED]:
> + *        - Setting this bit enables and initializes the migration region data

I would use the word clear instead of initialize - the point of this
is to throw away any data that may be left over in the window from any
prior actions.

> + *          window and associated fields within vfio_device_migration_info for
> + *          capturing the migration data stream for the device.  The migration
> + *          driver may perform actions such as enabling dirty logging of device
> + *          state with this bit.  The SAVING bit is mutually exclusive with the
> + *          RESUMING bit defined below.
> + *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
> + *          data window and indicates the completion or termination of the
> + *          migration data stream for the device.

I don't know what "de-initialized" means as something a device should
do? IMHO there is no need to talk about the migration window here,
SAVING says initialize/clear - and data_offset/etc say their values
are undefined outside SAVING/RUNNING. That is enough.

> + *     - Bit 2 (RESUMING) [REQUIRED]:
> + *        - Setting this bit enables and initializes the migration region data
> + *          window and associated fields within vfio_device_migration_info for
> + *          restoring the device from a migration data stream captured from a
> + *          SAVING session with a compatible device.  The migration driver may
> + *          perform internal device resets as necessary to reinitialize the
> + *          internal device state for the incoming migration data.
> + *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
> + *          region data window and indicates the end of a resuming session for
> + *          the device.  The kernel migration driver should complete the
> + *          incorporation of data written to the migration data window into the
> + *          device internal state and perform final validity and consistency
> + *          checking of the new device state.  If the user provided data is
> + *          found to be incomplete, inconsistent, or otherwise invalid, the
> + *          migration driver must indicate a write(2) error and follow the
> + *          previously described protocol to return either the previous state
> + *          or an error state.

Prefer this is just 'go to an error state' to avoid unnecessary
implementation differences.

> + *     - Bit 3 (NDMA) [OPTIONAL]:
> + *        The NDMA or "No DMA" state is intended to be a quiescent state for
> + *        the device for the purposes of managing multiple devices within a
> + *        user context where peer-to-peer DMA between devices may be active.
> + *        Support for the NDMA bit is indicated through the presence of the
> + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
> + *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
> + *        region.
> + *        - Setting this bit must prevent the device from initiating any
> + *          new DMA or interrupt transactions.  The migration driver must

I'm not sure about interrupts.

> + *          complete any such outstanding operations prior to completing
> + *          the transition to the NDMA state.  The NDMA device_state

Reading this as you wrote it and I suddenly have a doubt about the PRI
use case. Is it reasonable that the kernel driver will block on NDMA
waiting for another userspace thread to resolve any outstanding PRIs?

Can that allow userspace to deadlock the kernel or device? Is there an
alterative?

> + *   All combinations for the above defined device_state bits are considered
> + *   valid with the following exceptions:
> + *     - RESUMING and SAVING are mutually exclusive, all combinations of
> + *       (RESUMING | SAVING) are invalid.  Furthermore the specific combination
> + *       (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the
> + *       device error state VFIO_DEVICE_STATE_ERROR.  This variant is
> + *       specifically chosen due to the !RUNNING state of the device as the
> + *       migration driver should do everything possible, including an internal
> + *       reset of the device, to ensure that the device is fully stopped in
> + *       this state.  

Prefer we don't specify this. ERROR is undefined behavior and
userspace should reset. Any path that leads along to ERROR already
includes possiblities for wild DMAs and what not, so there is nothing
to be gained by this other than causing a lot of driver complexity,
IMHO.

> + *   Migration drivers should attempt to support any transition between valid

should? must, I think.

The whole migration window definition seems quite straightforward now!

> + * a) The user reads pending_bytes.  If the read value is zero, no data is
> + *    currently available for the device.  If the device is !RUNNING and a
> + *    zero value is read, this indicates the end of the device migration
> + *    stream and the device must not generate any new migration data.  If
> + *    the read value is non-zero, the user may proceed to collect device
> + *    migration data in step b).  Repeated reads of pending_bytes is allowed
> + *    and must not compromise the migration data stream provided the user does
> + *    not proceed to the following step.

Add what to do in SAVING|RUNNING if pending bytes is 0?

>  #define VFIO_DEVICE_STATE_SET_ERROR(state) \
> -	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
> -					     VFIO_DEVICE_STATE_RESUMING)
> +	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_ERROR)

We should delete this macro. It only makes sense used in a driver does
not belong in the uapi header.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2021-12-10  1:25 ` Jason Gunthorpe
@ 2021-12-13 20:40   ` Alex Williamson
  2021-12-14 12:08     ` Cornelia Huck
  2021-12-14 16:26     ` Jason Gunthorpe
  0 siblings, 2 replies; 35+ messages in thread
From: Alex Williamson @ 2021-12-13 20:40 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Thu, 9 Dec 2021 21:25:29 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Dec 09, 2021 at 04:34:29PM -0700, Alex Williamson wrote:
> > A new NDMA state is being proposed to support a quiescent state for
> > contexts containing multiple devices with peer-to-peer DMA support.
> > Formally define it.
> > 
> > Clarify various aspects of the migration region data fields and
> > protocol.  Remove QEMU related terminology and flows from the uAPI;
> > these will be provided in Documentation/ so as not to confuse the
> > device_state bitfield with a finite state machine with restricted
> > state transitions.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >  include/uapi/linux/vfio.h |  405 ++++++++++++++++++++++++---------------------
> >  1 file changed, 214 insertions(+), 191 deletions(-)  
> 
> I need other peope to look this over, so these are just some quick
> remarks. Thanks for doing it, it is very good.
> 
> Given I'm almost on vacation till Jan I think we will shortly have to
> table this discussion to January.
> 
> But, if you are happy with this as all that is needed to do mlx5 we
> can possibly have the v6 updated in early January, after the next
> merge window.
> 
> Though lets try to quickly decide on what to do about the "change
> multiple bits" below, please.
> 
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index ef33ea002b0b..1fdbc928f886 100644
> > +++ b/include/uapi/linux/vfio.h
> > @@ -408,199 +408,211 @@ struct vfio_region_gfx_edid {
> >  #define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> >  
> >  /*
> > + * The structure vfio_device_migration_info is placed at the immediate start of
> > + * the per-device VFIO_REGION_SUBTYPE_MIGRATION region to manage the device
> > + * state and migration information for the device.  Field accesses for this
> > + * structure are only supported using their native width and alignment,
> > + * accesses otherwise are undefined and the kernel migration driver should
> > + * return an error.
> >   *
> >   * device_state: (read/write)
> > + *   The device_state field is a bitfield written by the user to transition
> > + *   the associated device between valid migration states using these rules:
> > + *     - The user may read or write the device state register at any time.
> > + *     - The kernel migration driver must fully transition the device to the
> > + *       new state value before the write(2) operation returns to the user.
> > + *     - The user may change multiple bits of the bitfield in the same write
> > + *       operation, so long as the resulting state is valid.  
> 
> I would like to forbid this. It is really too complicated, and if
> there is not a strongly defined device behavior when this is done it
> is not inter-operable for userspace to do it.
> 
> > + *     - The kernel migration driver must not generate asynchronous device
> > + *       state transitions outside of manipulation by the user or the
> > + *       VFIO_DEVICE_RESET ioctl as described below.
> > + *     - In the event of a device state transition failure, the kernel
> > + *       migration driver must return a write(2) error with appropriate errno
> > + *       to the user.
> > + *     - Upon such an error, re-reading the device_state field must indicate
> > + *       the device migration state as either the same state as prior to the
> > + *       failing write or, at the migration driver discretion, indicate the
> > + *       device state as VFIO_DEVICE_STATE_ERROR.  
> 
> It is because this is complete nightmare. Let's say the user goes from
> 0 -> SAVING | RUNNING and SAVING fails after we succeed to do
> RUNNING. We have to also backtrack and undo RUNNING, but what if that
> fails too? Oh and we have to keep track of all this backtracking while
> executing the new state and write a bunch of complicated never tested
> error handling code.

We do specify that a migration driver has discretion in using the error
state for failed transitions, so there are options to simplify error
handling.

If we look at bit flips, we have:

Initial state
|  Resuming
|  |  Saving
|  |  |  Running
|  |  |  |  Next states with multiple bit flips

a) 0, 0, 0  (d)
b) 0, 0, 1  (c) (e)
c) 0, 1, 0  (b) (e)
d) 0, 1, 1  (a) (e)
e) 1, 0, 0  (b) (c) (d)
f) 1, 0, 1 UNSUPPORTED
g) 1, 1, 0 ERROR
h) 1, 1, 1 INVALID

We specify that we cannot pass through any invalid states during
transition, so if I consider each bit to be a discrete operation and
map all these multi-bit changes to a sequence of single bit flips, the
only requirements are:

  1) RESUMING must be cleared before setting SAVING or RUNNING
  2) SAVING or RUNNING must be cleared before setting RESUMING

I think the basis of your priority scheme comes from that.  Ordering
of the remaining items is more subtle though, for instance
0 -> SAVING | RUNNING can be broken down as:

  0 -> SAVING
  SAVING -> SAVING | RUNNING 

  vs

  0 -> RUNNING
  RUNNING -> SAVING | RUNNING

I'd give preference to enabling logging before running and I believe
that holds for transition (e) -> (d) as well.

In the reverse case, SAVING | RUNNING -> 0

  SAVING | RUNNING -> RUNNING
  RUNNING -> 0

  vs

  SAVING | RUNNING -> SAVING
  SAVING -> 0

This one is more arbitrary.  I tend to favor clearing RUNNING to stop
the device first, largely because it creates nice symmetry in the
resulting algorithm and follows the general principle that you
discovered as well, converge towards zero by addressing bit clearing
before setting.  So a valid algorithm would include:

int set_device_state(u32 old, u32 new, bool is_unwind)
{
	if (old.RUNNING && !new.RUNNING) {
		curr.RUNNING = 0;
		if (ERROR) goto unwind;
	}
	if (old.SAVING && !new.SAVING) {
		curr.SAVING = 0;
		if (ERROR) goto unwind;
	}
	if (old.RESUMING && !new.RESUMING) {
		curr.RESUMING = 0;
		if (ERROR) goto unwind;
	}
	if (!old.RESUMING && new.RESUMING) {
		curr.RESUMING = 1;
		if (ERROR) goto unwind;
	}
	if (!old.SAVING && new.SAVING) {
		curr.saving = 1;
		if (ERROR) goto unwind;
	}
	if (!old.RUNNING && new.RUNNING) {
		curr.RUNNING = 1;
		if (ERROR) goto unwind;
        }

	return 0;

unwind:
	if (!is_unwind) {
		ret = set_device_state(curr, old, true);
		if (ret) {
			curr.raw = ERROR;
			return ret;
		}
	}

	return -EIO;
}

And I think that also addresses the claim that we're doomed to untested
and complicated error code handling, we unwind by simply swapping the
args to our set state function and enter the ERROR state should that
recursive call fail.

I think it would be reasonable for documentation to present similar
pseudo code as a recommendation, but ultimately the migration driver
needs to come up with something that fits all the requirements.

(Ignoring NDMA for the moment until we determine if it's even a
synchronous operations)

If we put it in the user's hands and prescribe only single bit flips,
they don't really have device knowledge to optimize further than this
like a migration driver might be able to do.

> Assuming we can even figure out what the precedence of multiple bits
> even means for interoperability.
> 
> Backed into this is an assumption that any device transition is fully
> atomic - that just isn't what I see any of the HW has done.

We only specify that the transition needs to be complete before the
write(2) operation returns, there's no specified atomicity versus any
other event.  "Synchronous" is maybe your concern?

> I thought we could tackled this when you first suggested it (eg copy
> the mlx5 logic and be OK), but now I'm very skeptical. The idea that
> every driver can do this right in all the corner cases doesn't seem
> reasonable given we've made so many errors here already just in mlx5.
> 
> > + *     - Bit 1 (SAVING) [REQUIRED]:
> > + *        - Setting this bit enables and initializes the migration region data  
> 
> I would use the word clear instead of initialize - the point of this
> is to throw away any data that may be left over in the window from any
> prior actions.

"Clear" to me suggests that there's some sort of internal shared buffer
implementation that needs to be wiped between different modes.  I chose
"initialize" because I think it offers more independence to the
implementation.
 
> > + *          window and associated fields within vfio_device_migration_info for
> > + *          capturing the migration data stream for the device.  The migration
> > + *          driver may perform actions such as enabling dirty logging of device
> > + *          state with this bit.  The SAVING bit is mutually exclusive with the
> > + *          RESUMING bit defined below.
> > + *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
> > + *          data window and indicates the completion or termination of the
> > + *          migration data stream for the device.  
> 
> I don't know what "de-initialized" means as something a device should
> do? IMHO there is no need to talk about the migration window here,
> SAVING says initialize/clear - and data_offset/etc say their values
> are undefined outside SAVING/RUNNING. That is enough.

If "initializing" the migration data region puts in place handlers for
pending_bytes and friends, "de-initializing" would undo that operation.
Perhaps I should use "deactivates"?

> > + *     - Bit 2 (RESUMING) [REQUIRED]:
> > + *        - Setting this bit enables and initializes the migration region data
> > + *          window and associated fields within vfio_device_migration_info for
> > + *          restoring the device from a migration data stream captured from a
> > + *          SAVING session with a compatible device.  The migration driver may
> > + *          perform internal device resets as necessary to reinitialize the
> > + *          internal device state for the incoming migration data.
> > + *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
> > + *          region data window and indicates the end of a resuming session for
> > + *          the device.  The kernel migration driver should complete the
> > + *          incorporation of data written to the migration data window into the
> > + *          device internal state and perform final validity and consistency
> > + *          checking of the new device state.  If the user provided data is
> > + *          found to be incomplete, inconsistent, or otherwise invalid, the
> > + *          migration driver must indicate a write(2) error and follow the
> > + *          previously described protocol to return either the previous state
> > + *          or an error state.  
> 
> Prefer this is just 'go to an error state' to avoid unnecessary
> implementation differences.

Then it becomes a special case versus other device_state changes and
we're forcing what you've described as an undefined state into the
protocol.  Use of the error state is at the driver's discretion, but
the spec is written such a driver only needs to make use of it if it
encounters some sort of irrecoverable internal error.

> > + *     - Bit 3 (NDMA) [OPTIONAL]:
> > + *        The NDMA or "No DMA" state is intended to be a quiescent state for
> > + *        the device for the purposes of managing multiple devices within a
> > + *        user context where peer-to-peer DMA between devices may be active.
> > + *        Support for the NDMA bit is indicated through the presence of the
> > + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
> > + *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
> > + *        region.
> > + *        - Setting this bit must prevent the device from initiating any
> > + *          new DMA or interrupt transactions.  The migration driver must  
> 
> I'm not sure about interrupts.

In the common case an interrupt is a DMA write, so the name, if not
intention of this state gets a bit shaky if interrupts are allowed.
 
> > + *          complete any such outstanding operations prior to completing
> > + *          the transition to the NDMA state.  The NDMA device_state  
> 
> Reading this as you wrote it and I suddenly have a doubt about the PRI
> use case. Is it reasonable that the kernel driver will block on NDMA
> waiting for another userspace thread to resolve any outstanding PRIs?
> 
> Can that allow userspace to deadlock the kernel or device? Is there an
> alterative?

I'd hope we could avoid deadlock in the kernel, but it seems trickier
for userspace to be waiting on a write(2) operation to the device while
also handling page request events for that same device.  Is this
something more like a pending transaction bit where userspace asks the
device to go quiescent and polls for that to occur?

> > + *   All combinations for the above defined device_state bits are considered
> > + *   valid with the following exceptions:
> > + *     - RESUMING and SAVING are mutually exclusive, all combinations of
> > + *       (RESUMING | SAVING) are invalid.  Furthermore the specific combination
> > + *       (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the
> > + *       device error state VFIO_DEVICE_STATE_ERROR.  This variant is
> > + *       specifically chosen due to the !RUNNING state of the device as the
> > + *       migration driver should do everything possible, including an internal
> > + *       reset of the device, to ensure that the device is fully stopped in
> > + *       this state.    
> 
> Prefer we don't specify this. ERROR is undefined behavior and
> userspace should reset. Any path that leads along to ERROR already
> includes possiblities for wild DMAs and what not, so there is nothing
> to be gained by this other than causing a lot of driver complexity,
> IMHO.

This seems contrary to your push for consistent, interoperable behavior.
What's the benefit to actually leaving the state undefined or the
drawback to preemptively resetting a device if the migration driver
cannot determine if the device is quiesced, especially when the user
would need to reset the device to enter a new state anyway?  I added
this because language in your doc suggested the error state was far
more undefined that I understood it to be, ie. !RUNNING.

> > + *   Migration drivers should attempt to support any transition between valid  
> 
> should? must, I think.

I think that "must" terminology is a bit contrary to the fact that we
have a defined error state that can be used at the discretion of the
migration driver.  To me, "should" tells the migration drivers that they
ought to make an attempt to support all transitions, but userspace
needs to be be prepared that they might not work.  If a driver fails to
implement some transitions necessary for a given application, the
application should fail gracefully, but migration features may not be
available for the device.

> The whole migration window definition seems quite straightforward now!

Great!
 
> > + * a) The user reads pending_bytes.  If the read value is zero, no data is
> > + *    currently available for the device.  If the device is !RUNNING and a
> > + *    zero value is read, this indicates the end of the device migration
> > + *    stream and the device must not generate any new migration data.  If
> > + *    the read value is non-zero, the user may proceed to collect device
> > + *    migration data in step b).  Repeated reads of pending_bytes is allowed
> > + *    and must not compromise the migration data stream provided the user does
> > + *    not proceed to the following step.  
> 
> Add what to do in SAVING|RUNNING if pending bytes is 0?

Maybe it's too subtle, but that's why I phrased it as "no data is
currently available" and went on to specify the implications in the
!RUNNING state.  "Currently", suggesting that in the RUNNING state the
value is essentially volatile.

> >  #define VFIO_DEVICE_STATE_SET_ERROR(state) \
> > -	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
> > -					     VFIO_DEVICE_STATE_RESUMING)
> > +	((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_ERROR)  
> 
> We should delete this macro. It only makes sense used in a driver does
> not belong in the uapi header.

I may have gotten sloppy here, I thought I was incorporating what had
been proposed in the mlx5 series.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2021-12-13 20:40   ` Alex Williamson
@ 2021-12-14 12:08     ` Cornelia Huck
  2021-12-14 16:26     ` Jason Gunthorpe
  1 sibling, 0 replies; 35+ messages in thread
From: Cornelia Huck @ 2021-12-14 12:08 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: corbet, kvm, linux-doc, farman, mjrosato, pasic

On Mon, Dec 13 2021, Alex Williamson <alex.williamson@redhat.com> wrote:

> We do specify that a migration driver has discretion in using the error
> state for failed transitions, so there are options to simplify error
> handling.
>
> If we look at bit flips, we have:
>
> Initial state
> |  Resuming
> |  |  Saving
> |  |  |  Running
> |  |  |  |  Next states with multiple bit flips
>
> a) 0, 0, 0  (d)
> b) 0, 0, 1  (c) (e)
> c) 0, 1, 0  (b) (e)
> d) 0, 1, 1  (a) (e)
> e) 1, 0, 0  (b) (c) (d)
> f) 1, 0, 1 UNSUPPORTED
> g) 1, 1, 0 ERROR
> h) 1, 1, 1 INVALID
>
> We specify that we cannot pass through any invalid states during
> transition, so if I consider each bit to be a discrete operation and
> map all these multi-bit changes to a sequence of single bit flips, the
> only requirements are:
>
>   1) RESUMING must be cleared before setting SAVING or RUNNING
>   2) SAVING or RUNNING must be cleared before setting RESUMING
>
> I think the basis of your priority scheme comes from that.  Ordering
> of the remaining items is more subtle though, for instance
> 0 -> SAVING | RUNNING can be broken down as:
>
>   0 -> SAVING
>   SAVING -> SAVING | RUNNING 
>
>   vs
>
>   0 -> RUNNING
>   RUNNING -> SAVING | RUNNING
>
> I'd give preference to enabling logging before running and I believe
> that holds for transition (e) -> (d) as well.

Agreed.

>
> In the reverse case, SAVING | RUNNING -> 0
>
>   SAVING | RUNNING -> RUNNING
>   RUNNING -> 0
>
>   vs
>
>   SAVING | RUNNING -> SAVING
>   SAVING -> 0
>
> This one is more arbitrary.  I tend to favor clearing RUNNING to stop
> the device first, largely because it creates nice symmetry in the
> resulting algorithm and follows the general principle that you
> discovered as well, converge towards zero by addressing bit clearing
> before setting.

That also makes sense to me.

> So a valid algorithm would include:
>
> int set_device_state(u32 old, u32 new, bool is_unwind)
> {
> 	if (old.RUNNING && !new.RUNNING) {
> 		curr.RUNNING = 0;
> 		if (ERROR) goto unwind;
> 	}
> 	if (old.SAVING && !new.SAVING) {
> 		curr.SAVING = 0;
> 		if (ERROR) goto unwind;
> 	}
> 	if (old.RESUMING && !new.RESUMING) {
> 		curr.RESUMING = 0;
> 		if (ERROR) goto unwind;
> 	}
> 	if (!old.RESUMING && new.RESUMING) {
> 		curr.RESUMING = 1;
> 		if (ERROR) goto unwind;
> 	}
> 	if (!old.SAVING && new.SAVING) {
> 		curr.saving = 1;
> 		if (ERROR) goto unwind;
> 	}
> 	if (!old.RUNNING && new.RUNNING) {
> 		curr.RUNNING = 1;
> 		if (ERROR) goto unwind;
>         }
>
> 	return 0;
>
> unwind:
> 	if (!is_unwind) {
> 		ret = set_device_state(curr, old, true);
> 		if (ret) {
> 			curr.raw = ERROR;
> 			return ret;
> 		}
> 	}
>
> 	return -EIO;
> }
>

I've stared at this and scribbled a bit on paper and I think this would
work.

> And I think that also addresses the claim that we're doomed to untested
> and complicated error code handling, we unwind by simply swapping the
> args to our set state function and enter the ERROR state should that
> recursive call fail.

Nod. As we clear RUNNING as the first thing and would only set RUNNING
again as the last step, any transition to ERROR should be save.

>
> I think it would be reasonable for documentation to present similar
> pseudo code as a recommendation, but ultimately the migration driver
> needs to come up with something that fits all the requirements.

Yes, something like this should go into Documentation/.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2021-12-13 20:40   ` Alex Williamson
  2021-12-14 12:08     ` Cornelia Huck
@ 2021-12-14 16:26     ` Jason Gunthorpe
  2021-12-20 22:26       ` Alex Williamson
  2022-01-04  3:49       ` Tian, Kevin
  1 sibling, 2 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2021-12-14 16:26 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Mon, Dec 13, 2021 at 01:40:38PM -0700, Alex Williamson wrote:

> We do specify that a migration driver has discretion in using the error
> state for failed transitions, so there are options to simplify error
> handling.

This could be OK if we can agree ERROR has undefined behavior.

> I think the basis of your priority scheme comes from that.  Ordering
> of the remaining items is more subtle though, for instance
> 0 -> SAVING | RUNNING can be broken down as:
> 
>   0 -> SAVING
>   SAVING -> SAVING | RUNNING 
> 
>   vs
> 
>   0 -> RUNNING
>   RUNNING -> SAVING | RUNNING
>
> I'd give preference to enabling logging before running and I believe
> that holds for transition (e) -> (d) as well.

IMHO, any resolution to an arbitary choice should pick an order that
follows the reference flow because we know those are useful sequences
to have today.

Generally we have no use for the sequence SAVING -> SAVING | RUNNING,
while RUNNING -> SAVING | RUNNING is part of the standard flow.

It also raises the question that it seems not well defined what the
sequence:

SAVING -> SAVING | RUNNING

Even does to the migration window?

Nor can I imagine what mlx5 could do beyond fail this or corrupt the
migration..

If we keep the "should implement transitions" language below then I
expect mlx5 to fail this, and then we have a conflict where mlx5
cannot implement these precedence rules.

This is the kind of precedence resolution I was trying to avoid.

As far as I can see the requirements are broadly:
 - Do not transit through an invalid state
 - Do not loose NDMA during the transit, eg NDMA | SAVING | RUNNING -> 0
   should not have a race where a DMA can leak out
 - Do not do transit through things like SAVING -> SAVING | RUNNING,
   and I'm not confident I have a good list of these

> And I think that also addresses the claim that we're doomed to untested
> and complicated error code handling, we unwind by simply swapping the
> args to our set state function and enter the ERROR state should that
> recursive call fail.

I had the same thought the day after I wrote this, it seems workable.

I remain concerned however that we still can't seem to reach to a
working precedence after all this time. This is a very bad sign. 

Even if we work something out adding a new state someday is
terrifying. What if we can't work out any precedence that is
compatible with todays and supports the new state?

IMHO, we should be simplifing this before it becomes permanent API,
not trying to force it to work.

> If we put it in the user's hands and prescribe only single bit flips,
> they don't really have device knowledge to optimize further than this
> like a migration driver might be able to do.

If so this argues we should go back to the enforced FSM that the v1
mlx5 posting had and forget about device_state as a bunch of bits.

Most of things I brought up in this post are resolved by the forced
FSM.

Yes, we give up some flexability, but I think the quest for
flexability is a little misguided. If the drivers don't consistently
implement the flexability then it is just cruft we cannot make use of
from userspace.

eg what practical use is SAVING -> SAVING | RUNNING if today's mlx5
implementation silently corrupts the migration stream? That instantly
makes that a no-go for userspace from an interoperability perspective
and we've accomplished nothing by allowing for it.

Please think about it, it looks like an easy resolution to all this
discussion to simply specify a fixed FSM and be done with it.

> > I thought we could tackled this when you first suggested it (eg copy
> > the mlx5 logic and be OK), but now I'm very skeptical. The idea that
> > every driver can do this right in all the corner cases doesn't seem
> > reasonable given we've made so many errors here already just in mlx5.
> > 
> > > + *     - Bit 1 (SAVING) [REQUIRED]:
> > > + *        - Setting this bit enables and initializes the migration region data  
> > 
> > I would use the word clear instead of initialize - the point of this
> > is to throw away any data that may be left over in the window from any
> > prior actions.
> 
> "Clear" to me suggests that there's some sort of internal shared buffer
> implementation that needs to be wiped between different modes.  I chose
> "initialize" because I think it offers more independence to the
> implementation.

The data window is expressed as a shared buffer in this API, there is
only one data_offset/size and data window for everything.

I think it is fine to rely on that for the description, and like all
abstractions an implementation can do whatever so long as externally
it looks like this shared buffer API.

The requirement here is that anything that pre-existed in the data
window from any prior operation is cleaned and the data window starts
empty before any data related to this SAVING is transfered.

> > > + *          window and associated fields within vfio_device_migration_info for
> > > + *          capturing the migration data stream for the device.  The migration
> > > + *          driver may perform actions such as enabling dirty logging of device
> > > + *          state with this bit.  The SAVING bit is mutually exclusive with the
> > > + *          RESUMING bit defined below.
> > > + *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
> > > + *          data window and indicates the completion or termination of the
> > > + *          migration data stream for the device.  
> > 
> > I don't know what "de-initialized" means as something a device should
> > do? IMHO there is no need to talk about the migration window here,
> > SAVING says initialize/clear - and data_offset/etc say their values
> > are undefined outside SAVING/RUNNING. That is enough.
> 
> If "initializing" the migration data region puts in place handlers for
> pending_bytes and friends, "de-initializing" would undo that operation.
> Perhaps I should use "deactivates"?

And if you don't use "initializing" we don't need to talk about
"de-initializating".

Reading the data window outside SAVING is undefined behavior it seems,
so nothing needs to be said.

> > > + *     - Bit 2 (RESUMING) [REQUIRED]:
> > > + *        - Setting this bit enables and initializes the migration region data
> > > + *          window and associated fields within vfio_device_migration_info for
> > > + *          restoring the device from a migration data stream captured from a
> > > + *          SAVING session with a compatible device.  The migration driver may
> > > + *          perform internal device resets as necessary to reinitialize the
> > > + *          internal device state for the incoming migration data.
> > > + *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
> > > + *          region data window and indicates the end of a resuming session for
> > > + *          the device.  The kernel migration driver should complete the
> > > + *          incorporation of data written to the migration data window into the
> > > + *          device internal state and perform final validity and consistency
> > > + *          checking of the new device state.  If the user provided data is
> > > + *          found to be incomplete, inconsistent, or otherwise invalid, the
> > > + *          migration driver must indicate a write(2) error and follow the
> > > + *          previously described protocol to return either the previous state
> > > + *          or an error state.  
> > 
> > Prefer this is just 'go to an error state' to avoid unnecessary
> > implementation differences.
> 
> Then it becomes a special case versus other device_state changes and
> we're forcing what you've described as an undefined state into the
> protocol.

Lets look at what recovery actions something the VMM would need to
take along the reference flow:

RUNNING -> SAVING | RUNNING
  If this fails and we are still in RUNNING and can continue

 -> SAVING | RUNNING | NDMA
 -> SAVING
  If these fail we need to go to RUNNING
  -> RUNNING
    If this fails we need to RESET

 -> 0
  Migration succeeded? Failure here should RESET

RUNNING -> RESUMING
  If this fails and we are still in RUNNING continue
 -> NDMA | RUNNING
  If this fails RESET
 -> RUNNING
  If this fails RESET, VM could be corrupted.

One view is that what the device does is irrelevant as qemu should
simply unconditionally reset in these case.

Another view is that staying in a useless state is also pointless and
we may as well return ERROR anyhow. Eg if exiting RESUMING failed
there is no other action to take besides RESET, so why didn't we
return ERROR to tell this directly to userspace?

Both are reasonable views, which is why I wrote "prefer".

> > > + *     - Bit 3 (NDMA) [OPTIONAL]:
> > > + *        The NDMA or "No DMA" state is intended to be a quiescent state for
> > > + *        the device for the purposes of managing multiple devices within a
> > > + *        user context where peer-to-peer DMA between devices may be active.
> > > + *        Support for the NDMA bit is indicated through the presence of the
> > > + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
> > > + *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
> > > + *        region.
> > > + *        - Setting this bit must prevent the device from initiating any
> > > + *          new DMA or interrupt transactions.  The migration driver must  
> > 
> > I'm not sure about interrupts.
> 
> In the common case an interrupt is a DMA write, so the name, if not
> intention of this state gets a bit shaky if interrupts are allowed.

Interrupts have their own masking protocol. For instance a device like
the huawei one is halting DMA by manipulating the queue registers, it
may still generate interrupts.

Yes, MSI is a MemWr, but I've never heard anyone call it a DMA - there
is no memory access here since the TLP is routed to the interrupt
controller.

This is why I'm not sure. A hostile VM certainly can corrupt the MSI,
even today and thus turn it into a DMA. As we talked before this may
be OK, but is security risky that it allows the guest to impact the
hypervisor.

Overall it seems like this is more trouble for a device like huawei's
if they want to implement NDMA using the trapping or something. Given
your right concern that NDMA should be implemented as widely as
possible making it more difficult that stricly necessary is perhaps
not good.

Other peope should comment here.

> > > + *          complete any such outstanding operations prior to completing
> > > + *          the transition to the NDMA state.  The NDMA device_state  
> > 
> > Reading this as you wrote it and I suddenly have a doubt about the PRI
> > use case. Is it reasonable that the kernel driver will block on NDMA
> > waiting for another userspace thread to resolve any outstanding PRIs?
> > 
> > Can that allow userspace to deadlock the kernel or device? Is there an
> > alterative?
> 
> I'd hope we could avoid deadlock in the kernel, but it seems trickier
> for userspace to be waiting on a write(2) operation to the device while
> also handling page request events for that same device.  Is this
> something more like a pending transaction bit where userspace asks the
> device to go quiescent and polls for that to occur?

Hum. I'm still looking into this question, but some further thoughts.

PRI doesn't do DMA, it just transfers a physical address into the PCI
device's cache that can be later used with DMA.

PRI also doesn't imply the vPRI Intel is talking about.

For PRI controlled by the hypervisor, it is completely reasonable that
NDMA returns synchronously after the PRI and the DMA that triggered it
completes. The VMM would have to understand this and ensure it doesn't
block the kernel's fault path while going to NDMA eg with userfaultfd
or something else crazy.

The other reasonable option is that NDMA cancels the DMA that
triggered the PRI and simply doesn't care how the PRI is completed
after NDMA returns.

The later is interesting because it is a possible better path to solve
the vPRI problem Intel brought up. Waiting for the VCPU is just asking
for a DOS, if NDMA can cancel the DMAs we can then just directly fail
the open PRI in the hypervisor and we don't need to care about the
VCPU. Some mess to fixup in the vIOMMU protocol on resume, but the
resume'd device simply issues a new DMA with an empty ATS cache and
does a new PRI.

It is uncertain enough that qemu should not support vPRI with
migration until we define protocol(s) and a cap flag to say the device
supports it.

> > > + *   All combinations for the above defined device_state bits are considered
> > > + *   valid with the following exceptions:
> > > + *     - RESUMING and SAVING are mutually exclusive, all combinations of
> > > + *       (RESUMING | SAVING) are invalid.  Furthermore the specific combination
> > > + *       (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the
> > > + *       device error state VFIO_DEVICE_STATE_ERROR.  This variant is
> > > + *       specifically chosen due to the !RUNNING state of the device as the
> > > + *       migration driver should do everything possible, including an internal
> > > + *       reset of the device, to ensure that the device is fully stopped in
> > > + *       this state.    
> > 
> > Prefer we don't specify this. ERROR is undefined behavior and
> > userspace should reset. Any path that leads along to ERROR already
> > includes possiblities for wild DMAs and what not, so there is nothing
> > to be gained by this other than causing a lot of driver complexity,
> > IMHO.
> 
> This seems contrary to your push for consistent, interoperable
> behavior.

Formal "undefined behavior" can be a useful part of a spec, especially
if the spec is 'when you see ERROR you must do RESET', we don't really
need to constrain the device further to continue to have
interoperability.

> What's the benefit to actually leaving the state undefined or the
> drawback to preemptively resetting a device if the migration driver
> cannot determine if the device is quiesced, 

RESET puts the device back to RUNNING, so RESET alone does not remedy
the problem.

RESET followed by !RUNNING can fail, meaning the best mlx5 can do is
"SHOULD", in which case lets omit the RESET since userspace can't rely
on it.

Even if it did work reliably, the requirement is userspace must issue
RESET to exit ERROR and if we say the driver has to issue reset to
enter ERROR we are just doing a pointless double RESET.

> would need to reset the device to enter a new state anyway?  I added
> this because language in your doc suggested the error state was far
> more undefined that I understood it to be, ie. !RUNNING.

Yes it was like that, because the implementation of this strict
requirement is not nice.

Perhaps a middle ground can work:

  For device_state ERROR the device SHOULD have the device
  !RUNNING. If the ERROR arose due to a device_state change and
  if the new and old states have NDMA behavior then the device MUST
  maintain NDMA behavior while processing the device_state and
  continuing while in ERROR. Userspace MUST reset the device to
  recover from ERROR, therefore devices SHOULD NOT do a redundant
  internal reset

> > > + *   Migration drivers should attempt to support any transition between valid  
> > 
> > should? must, I think.
> 
> I think that "must" terminology is a bit contrary to the fact that we
> have a defined error state that can be used at the discretion of the
> migration driver.  To me, "should" tells the migration drivers that they
> ought to make an attempt to support all transitions, but userspace
> needs to be be prepared that they might not work.  

IMHO this is not inter-operable. At a minimum we should expect that a
driver implements a set of standard transitions, or it has to
implement all of them.

Otherwise what is the point?

If you go back to the mlx5 v1 version it did effectively this. It
enforced a FSM and only allowed some transitions. That meets the
language here with "should" but you didn't like it, and I agreed with
you then.

This is when the trouble stated :)

The mlx5 v1 with the FSM didn't have alot of these problems we are
discussing. It didn't have precedence issues, it didn't have problems
executing odd combinations it can't support, it worked and was simple
to understand.

So, if we say should here, then I vote mlx5 goes back to enforcing
its FSM and that becomes the LCD that userspace must implement to.

In which case, why not formally specify the FSM now and avoid a driver
pushing a defacto spec?

If we say MUST here then we need to figure out a precedence and 
say that some transitions are undefined behavior, like SAVING ->
SAVING|RUNNING.

> Maybe it's too subtle, but that's why I phrased it as "no data is
> currently available" and went on to specify the implications in the
> !RUNNING state.  "Currently", suggesting that in the RUNNING state the
> value is essentially volatile.

It is subtle enough to clarify that polling may see more data in
future in that case.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2021-12-09 23:34 [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state Alex Williamson
  2021-12-10  1:25 ` Jason Gunthorpe
@ 2021-12-20 17:38 ` Cornelia Huck
  2021-12-20 22:49   ` Alex Williamson
  2022-01-07  8:03 ` Tian, Kevin
  2 siblings, 1 reply; 35+ messages in thread
From: Cornelia Huck @ 2021-12-20 17:38 UTC (permalink / raw)
  To: Alex Williamson, alex.williamson
  Cc: jgg, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Thu, Dec 09 2021, Alex Williamson <alex.williamson@redhat.com> wrote:

> A new NDMA state is being proposed to support a quiescent state for
> contexts containing multiple devices with peer-to-peer DMA support.
> Formally define it.

[I'm wondering if we would want to use NDMA in other cases as well. Just
thinking out loud below.]

>
> Clarify various aspects of the migration region data fields and
> protocol.  Remove QEMU related terminology and flows from the uAPI;
> these will be provided in Documentation/ so as not to confuse the
> device_state bitfield with a finite state machine with restricted
> state transitions.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  include/uapi/linux/vfio.h |  405 ++++++++++++++++++++++++---------------------
>  1 file changed, 214 insertions(+), 191 deletions(-)
>
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ef33ea002b0b..1fdbc928f886 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h

(...)

> + *   The device_state field defines the following bitfield use:
> + *
> + *     - Bit 0 (RUNNING) [REQUIRED]:
> + *        - Setting this bit indicates the device is fully operational, the
> + *          device may generate interrupts, DMA, respond to MMIO, all vfio
> + *          device regions are functional, and the device may advance its
> + *          internal state.  The default device_state must indicate the device
> + *          in exclusively the RUNNING state, with no other bits in this field
> + *          set.
> + *        - Clearing this bit (ie. !RUNNING) must stop the operation of the
> + *          device.  The device must not generate interrupts, DMA, or advance
> + *          its internal state.  The user should take steps to restrict access
> + *          to vfio device regions other than the migration region while the
> + *          device is !RUNNING or risk corruption of the device migration data
> + *          stream.  The device and kernel migration driver must accept and
> + *          respond to interaction to support external subsystems in the
> + *          !RUNNING state, for example PCI MSI-X and PCI config space.
> + *          Failure by the user to restrict device access while !RUNNING must
> + *          not result in error conditions outside the user context (ex.
> + *          host system faults).

If I consider ccw, this would mean that user space would need to stop
writing to the regions that initiate start/halt/... when RUNNING is
cleared (makes sense) and that the subchannel must be idle or even
disabled (so that it does not become status pending). The question is,
does it make sense to stop new requests and wait for the subchannel to
become idle during the !RUNNING transition (or even forcefully kill
outstanding I/O), or...

> + *     - Bit 1 (SAVING) [REQUIRED]:
> + *        - Setting this bit enables and initializes the migration region data
> + *          window and associated fields within vfio_device_migration_info for
> + *          capturing the migration data stream for the device.  The migration
> + *          driver may perform actions such as enabling dirty logging of device
> + *          state with this bit.  The SAVING bit is mutually exclusive with the
> + *          RESUMING bit defined below.
> + *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
> + *          data window and indicates the completion or termination of the
> + *          migration data stream for the device.
> + *     - Bit 2 (RESUMING) [REQUIRED]:
> + *        - Setting this bit enables and initializes the migration region data
> + *          window and associated fields within vfio_device_migration_info for
> + *          restoring the device from a migration data stream captured from a
> + *          SAVING session with a compatible device.  The migration driver may
> + *          perform internal device resets as necessary to reinitialize the
> + *          internal device state for the incoming migration data.
> + *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
> + *          region data window and indicates the end of a resuming session for
> + *          the device.  The kernel migration driver should complete the
> + *          incorporation of data written to the migration data window into the
> + *          device internal state and perform final validity and consistency
> + *          checking of the new device state.  If the user provided data is
> + *          found to be incomplete, inconsistent, or otherwise invalid, the
> + *          migration driver must indicate a write(2) error and follow the
> + *          previously described protocol to return either the previous state
> + *          or an error state.
> + *     - Bit 3 (NDMA) [OPTIONAL]:
> + *        The NDMA or "No DMA" state is intended to be a quiescent state for
> + *        the device for the purposes of managing multiple devices within a
> + *        user context where peer-to-peer DMA between devices may be active.
> + *        Support for the NDMA bit is indicated through the presence of the
> + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
> + *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
> + *        region.
> + *        - Setting this bit must prevent the device from initiating any
> + *          new DMA or interrupt transactions.  The migration driver must
> + *          complete any such outstanding operations prior to completing
> + *          the transition to the NDMA state.  The NDMA device_state
> + *          essentially represents a sub-set of the !RUNNING state for the
> + *          purpose of quiescing the device, therefore the NDMA device_state
> + *          bit is superfluous in combinations including !RUNNING.
> + *        - Clearing this bit (ie. !NDMA) negates the device operational
> + *          restrictions required by the NDMA state.

...should we use NDMA as the "stop new requests" state, but allow
running channel programs to conclude? I'm not entirely sure whether
that's in the spirit of NDMA (subchannels are independent of each
other), but it would be kind of "quiescing" already.

(We should probably clarify things like that in the Documentation/
file.)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2021-12-14 16:26     ` Jason Gunthorpe
@ 2021-12-20 22:26       ` Alex Williamson
  2022-01-04 20:28         ` Jason Gunthorpe
  2022-01-04  3:49       ` Tian, Kevin
  1 sibling, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2021-12-20 22:26 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Tue, 14 Dec 2021 12:26:54 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Dec 13, 2021 at 01:40:38PM -0700, Alex Williamson wrote:
> 
> > We do specify that a migration driver has discretion in using the error
> > state for failed transitions, so there are options to simplify error
> > handling.  
> 
> This could be OK if we can agree ERROR has undefined behavior.
> 
> > I think the basis of your priority scheme comes from that.  Ordering
> > of the remaining items is more subtle though, for instance
> > 0 -> SAVING | RUNNING can be broken down as:
> > 
> >   0 -> SAVING
> >   SAVING -> SAVING | RUNNING 
> > 
> >   vs
> > 
> >   0 -> RUNNING
> >   RUNNING -> SAVING | RUNNING
> >
> > I'd give preference to enabling logging before running and I believe
> > that holds for transition (e) -> (d) as well.  
> 
> IMHO, any resolution to an arbitary choice should pick an order that
> follows the reference flow because we know those are useful sequences
> to have today.
> 
> Generally we have no use for the sequence SAVING -> SAVING | RUNNING,
> while RUNNING -> SAVING | RUNNING is part of the standard flow.

SAVING -> SAVING | RUNNING or SAVING -> RUNNING could both be used as
recovery sequences should the target VM fail.  The latter might be a
full abort of the migration while the former might be a means to reset
the downtime clock without fully restarting the migration.

> It also raises the question that it seems not well defined what the
> sequence:
> 
> SAVING -> SAVING | RUNNING
> 
> Even does to the migration window?
> 
> Nor can I imagine what mlx5 could do beyond fail this or corrupt the
> migration..

I think this comes down to the robustness of the migration driver.  The
migration data window and control of how userspace is to interact with
it is essentially meant to allow the migration driver to implement its
own transport protocol.  In the case of mlx5 where it expects only to
apply the received migration data on the RESUMING -> RUNNING
transition, a "short" data segment might be reserved for providing
sequencing information.  Each time the device enters SAVING | !RUNNING
the driver might begin by presenting a new sequence header.  On the
target, a new sequence header would cause any previously received
migration data to be discarded.  A similar header would also be
suggested to validate the migration data stream is appropriate for the
target device.

Also, I hope it goes without saying, but I'll say it anyway, the
migration data stream would make for an excellent exploit vector and
each migration driver needs to be responsible to make sure that
userspace cannot use it to break containment of the device.
 
> If we keep the "should implement transitions" language below then I
> expect mlx5 to fail this, and then we have a conflict where mlx5
> cannot implement these precedence rules.

Per above, I think it can.

> This is the kind of precedence resolution I was trying to avoid.
> 
> As far as I can see the requirements are broadly:
>  - Do not transit through an invalid state
>  - Do not loose NDMA during the transit, eg NDMA | SAVING | RUNNING -> 0
>    should not have a race where a DMA can leak out
>  - Do not do transit through things like SAVING -> SAVING | RUNNING,
>    and I'm not confident I have a good list of these

"Things like", yeah, that's not really spec material.  There's nothing
special about that transition.  The migration driver should take into
account management of the migration data stream and support such
states, or error the transition if it isn't sufficient ROI and expect
device specific bug reports at some point in the future.

For NDMA, that's a valid consideration.  I think that means though that
NDMA doesn't simply bookend the pseudo algorithm I provided.  Perhaps
it's nested between RUNNING and SAVING handlers on either end.
 
> > And I think that also addresses the claim that we're doomed to untested
> > and complicated error code handling, we unwind by simply swapping the
> > args to our set state function and enter the ERROR state should that
> > recursive call fail.  
> 
> I had the same thought the day after I wrote this, it seems workable.
> 
> I remain concerned however that we still can't seem to reach to a
> working precedence after all this time. This is a very bad sign. 
> 
> Even if we work something out adding a new state someday is
> terrifying. What if we can't work out any precedence that is
> compatible with todays and supports the new state?
> 
> IMHO, we should be simplifing this before it becomes permanent API,
> not trying to force it to work.

I agree, this is our opportunity to simplify before we're committed,
but I don't see how we can throw out perfectly valid transitions like
SAVING -> SAVING | RUNNING just because the driver hasn't accounted for
managing data in the data stream.

> > If we put it in the user's hands and prescribe only single bit flips,
> > they don't really have device knowledge to optimize further than this
> > like a migration driver might be able to do.  
> 
> If so this argues we should go back to the enforced FSM that the v1
> mlx5 posting had and forget about device_state as a bunch of bits.
> 
> Most of things I brought up in this post are resolved by the forced
> FSM.

Until userspace tries to do something different than exactly what it
does today, and then what?
 
> Yes, we give up some flexability, but I think the quest for
> flexability is a little misguided. If the drivers don't consistently
> implement the flexability then it is just cruft we cannot make use of
> from userspace.
> 
> eg what practical use is SAVING -> SAVING | RUNNING if today's mlx5
> implementation silently corrupts the migration stream? That instantly
> makes that a no-go for userspace from an interoperability perspective
> and we've accomplished nothing by allowing for it.

Failure to support that transition is a deficiency of the driver and
represented by a non-silent error in making that transition.  Silently
corrupting the migration stream is simply a driver bug.
 
> Please think about it, it looks like an easy resolution to all this
> discussion to simply specify a fixed FSM and be done with it.
> 
> > > I thought we could tackled this when you first suggested it (eg copy
> > > the mlx5 logic and be OK), but now I'm very skeptical. The idea that
> > > every driver can do this right in all the corner cases doesn't seem
> > > reasonable given we've made so many errors here already just in mlx5.
> > >   
> > > > + *     - Bit 1 (SAVING) [REQUIRED]:
> > > > + *        - Setting this bit enables and initializes the migration region data    
> > > 
> > > I would use the word clear instead of initialize - the point of this
> > > is to throw away any data that may be left over in the window from any
> > > prior actions.  
> > 
> > "Clear" to me suggests that there's some sort of internal shared buffer
> > implementation that needs to be wiped between different modes.  I chose
> > "initialize" because I think it offers more independence to the
> > implementation.  
> 
> The data window is expressed as a shared buffer in this API, there is
> only one data_offset/size and data window for everything.

Any access to the data window outside of that directed by the driver is
undefined, it's up to the driver where and how to populate the data.  A
driver might make a portion of the data window available as an mmap
that gets zapped and faulted to the correct device backing between
operations.
 
> I think it is fine to rely on that for the description, and like all
> abstractions an implementation can do whatever so long as externally
> it looks like this shared buffer API.
> 
> The requirement here is that anything that pre-existed in the data
> window from any prior operation is cleaned and the data window starts
> empty before any data related to this SAVING is transfered.

IOW, it's initialized.  We're picking out colors for the bike shed at
this point.

> > > > + *          window and associated fields within vfio_device_migration_info for
> > > > + *          capturing the migration data stream for the device.  The migration
> > > > + *          driver may perform actions such as enabling dirty logging of device
> > > > + *          state with this bit.  The SAVING bit is mutually exclusive with the
> > > > + *          RESUMING bit defined below.
> > > > + *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
> > > > + *          data window and indicates the completion or termination of the
> > > > + *          migration data stream for the device.    
> > > 
> > > I don't know what "de-initialized" means as something a device should
> > > do? IMHO there is no need to talk about the migration window here,
> > > SAVING says initialize/clear - and data_offset/etc say their values
> > > are undefined outside SAVING/RUNNING. That is enough.  
> > 
> > If "initializing" the migration data region puts in place handlers for
> > pending_bytes and friends, "de-initializing" would undo that operation.
> > Perhaps I should use "deactivates"?  
> 
> And if you don't use "initializing" we don't need to talk about
> "de-initializating".
> 
> Reading the data window outside SAVING is undefined behavior it seems,
> so nothing needs to be said.

Exactly why I thought simply describing it as the reciprocal of setting
the bit would be sufficient.  Taupe!

> > > > + *     - Bit 2 (RESUMING) [REQUIRED]:
> > > > + *        - Setting this bit enables and initializes the migration region data
> > > > + *          window and associated fields within vfio_device_migration_info for
> > > > + *          restoring the device from a migration data stream captured from a
> > > > + *          SAVING session with a compatible device.  The migration driver may
> > > > + *          perform internal device resets as necessary to reinitialize the
> > > > + *          internal device state for the incoming migration data.
> > > > + *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
> > > > + *          region data window and indicates the end of a resuming session for
> > > > + *          the device.  The kernel migration driver should complete the
> > > > + *          incorporation of data written to the migration data window into the
> > > > + *          device internal state and perform final validity and consistency
> > > > + *          checking of the new device state.  If the user provided data is
> > > > + *          found to be incomplete, inconsistent, or otherwise invalid, the
> > > > + *          migration driver must indicate a write(2) error and follow the
> > > > + *          previously described protocol to return either the previous state
> > > > + *          or an error state.    
> > > 
> > > Prefer this is just 'go to an error state' to avoid unnecessary
> > > implementation differences.  
> > 
> > Then it becomes a special case versus other device_state changes and
> > we're forcing what you've described as an undefined state into the
> > protocol.  
> 
> Lets look at what recovery actions something the VMM would need to
> take along the reference flow:
> 
> RUNNING -> SAVING | RUNNING
>   If this fails and we are still in RUNNING and can continue
> 
>  -> SAVING | RUNNING | NDMA
>  -> SAVING  
>   If these fail we need to go to RUNNING
>   -> RUNNING  
>     If this fails we need to RESET

I won't argue that there aren't transition failures where the next
logical step is likely a reset, but I also don't see the point in
defining special rules for certain cases.  When in doubt, leave policy
decisions to userspace?

> 
>  -> 0  
>   Migration succeeded? Failure here should RESET
>
> RUNNING -> RESUMING
>   If this fails and we are still in RUNNING continue
>  -> NDMA | RUNNING  
>   If this fails RESET
>  -> RUNNING  
>   If this fails RESET, VM could be corrupted.
> 
> One view is that what the device does is irrelevant as qemu should
> simply unconditionally reset in these case.
> 
> Another view is that staying in a useless state is also pointless and
> we may as well return ERROR anyhow. Eg if exiting RESUMING failed
> there is no other action to take besides RESET, so why didn't we
> return ERROR to tell this directly to userspace?

And then the last packet arrives, gets written to the device that's
still in RESUMING, and now can transition to RUNNING.
 
> Both are reasonable views, which is why I wrote "prefer".

There's no going back from the ERROR state, the only path the user has
forward is to reset the device.  Therefore the only case I'm willing to
say it's the preferred next state is in the case of an irrecoverable
internal fault.  I'd also like to avoid another lengthy discussion
trying to define which specific transitions should default to an ERROR
state if they fail versus simply return -errno.  Userspace is free to
define a policy where an -errno is considered fatal for the device.  I
prefer consistent userspace handling, letting userspace define policy,
and robust drivers that avoid forcing unnecessary user decisions.

> > > > + *     - Bit 3 (NDMA) [OPTIONAL]:
> > > > + *        The NDMA or "No DMA" state is intended to be a quiescent state for
> > > > + *        the device for the purposes of managing multiple devices within a
> > > > + *        user context where peer-to-peer DMA between devices may be active.
> > > > + *        Support for the NDMA bit is indicated through the presence of the
> > > > + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
> > > > + *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
> > > > + *        region.
> > > > + *        - Setting this bit must prevent the device from initiating any
> > > > + *          new DMA or interrupt transactions.  The migration driver must    
> > > 
> > > I'm not sure about interrupts.  
> > 
> > In the common case an interrupt is a DMA write, so the name, if not
> > intention of this state gets a bit shaky if interrupts are allowed.  
> 
> Interrupts have their own masking protocol. For instance a device like
> the huawei one is halting DMA by manipulating the queue registers, it
> may still generate interrupts.
> 
> Yes, MSI is a MemWr, but I've never heard anyone call it a DMA - there
> is no memory access here since the TLP is routed to the interrupt
> controller.

That's a pretty subtle distinction.  Can't that controller live in MMIO
space and isn't it then just a peer-to-peer DMA?  I know I'm not the
first to consider MSI to be just another DMA, that seems to be the
basis of it's handling on ARM SMMU.
 
> This is why I'm not sure. A hostile VM certainly can corrupt the MSI,
> even today and thus turn it into a DMA. As we talked before this may
> be OK, but is security risky that it allows the guest to impact the
> hypervisor.
>
> Overall it seems like this is more trouble for a device like huawei's
> if they want to implement NDMA using the trapping or something. Given
> your right concern that NDMA should be implemented as widely as
> possible making it more difficult that stricly necessary is perhaps
> not good.
> 
> Other peope should comment here.

Yeah, I'm not clear on what devices can and cannot do in the NDMA state.
Ultimately the goal is that once all devices are in the NDMA state, we
can safely transition them to the !RUNNING state without concern
regarding access from another device.  Specifically we want to avoid
things like DeviceA moves to !RUNNING while DeviceB initiates a DMA
access to DeviceA which now cannot respond without advancing internal
state which violates the condition of !RUNNING.

But I think we're really only after that p2p behavior and we've
discussed that disabling p2p mappings in the VM would be a sufficient
condition to support multiple devices without NDMA support.  I think
that means DMA to memory is fine and DMA related to MSI is fine... but
how does a device know which DMA is memory and which DMA is another
device?

> > > > + *          complete any such outstanding operations prior to completing
> > > > + *          the transition to the NDMA state.  The NDMA device_state    
> > > 
> > > Reading this as you wrote it and I suddenly have a doubt about the PRI
> > > use case. Is it reasonable that the kernel driver will block on NDMA
> > > waiting for another userspace thread to resolve any outstanding PRIs?
> > > 
> > > Can that allow userspace to deadlock the kernel or device? Is there an
> > > alterative?  
> > 
> > I'd hope we could avoid deadlock in the kernel, but it seems trickier
> > for userspace to be waiting on a write(2) operation to the device while
> > also handling page request events for that same device.  Is this
> > something more like a pending transaction bit where userspace asks the
> > device to go quiescent and polls for that to occur?  
> 
> Hum. I'm still looking into this question, but some further thoughts.
> 
> PRI doesn't do DMA, it just transfers a physical address into the PCI
> device's cache that can be later used with DMA.
> 
> PRI also doesn't imply the vPRI Intel is talking about.
> 
> For PRI controlled by the hypervisor, it is completely reasonable that
> NDMA returns synchronously after the PRI and the DMA that triggered it
> completes. The VMM would have to understand this and ensure it doesn't
> block the kernel's fault path while going to NDMA eg with userfaultfd
> or something else crazy.
> 
> The other reasonable option is that NDMA cancels the DMA that
> triggered the PRI and simply doesn't care how the PRI is completed
> after NDMA returns.
> 
> The later is interesting because it is a possible better path to solve
> the vPRI problem Intel brought up. Waiting for the VCPU is just asking
> for a DOS, if NDMA can cancel the DMAs we can then just directly fail
> the open PRI in the hypervisor and we don't need to care about the
> VCPU. Some mess to fixup in the vIOMMU protocol on resume, but the
> resume'd device simply issues a new DMA with an empty ATS cache and
> does a new PRI.
> 
> It is uncertain enough that qemu should not support vPRI with
> migration until we define protocol(s) and a cap flag to say the device
> supports it.
> 
> > > > + *   All combinations for the above defined device_state bits are considered
> > > > + *   valid with the following exceptions:
> > > > + *     - RESUMING and SAVING are mutually exclusive, all combinations of
> > > > + *       (RESUMING | SAVING) are invalid.  Furthermore the specific combination
> > > > + *       (!NDMA | RESUMING | SAVING | !RUNNING) is reserved to indicate the
> > > > + *       device error state VFIO_DEVICE_STATE_ERROR.  This variant is
> > > > + *       specifically chosen due to the !RUNNING state of the device as the
> > > > + *       migration driver should do everything possible, including an internal
> > > > + *       reset of the device, to ensure that the device is fully stopped in
> > > > + *       this state.      
> > > 
> > > Prefer we don't specify this. ERROR is undefined behavior and
> > > userspace should reset. Any path that leads along to ERROR already
> > > includes possiblities for wild DMAs and what not, so there is nothing
> > > to be gained by this other than causing a lot of driver complexity,
> > > IMHO.  
> > 
> > This seems contrary to your push for consistent, interoperable
> > behavior.  
> 
> Formal "undefined behavior" can be a useful part of a spec, especially
> if the spec is 'when you see ERROR you must do RESET', we don't really
> need to constrain the device further to continue to have
> interoperability.
> 
> > What's the benefit to actually leaving the state undefined or the
> > drawback to preemptively resetting a device if the migration driver
> > cannot determine if the device is quiesced,   
> 
> RESET puts the device back to RUNNING, so RESET alone does not remedy
> the problem.
> 
> RESET followed by !RUNNING can fail, meaning the best mlx5 can do is
> "SHOULD", in which case lets omit the RESET since userspace can't rely
> on it.
> 
> Even if it did work reliably, the requirement is userspace must issue
> RESET to exit ERROR and if we say the driver has to issue reset to
> enter ERROR we are just doing a pointless double RESET.

Please read what I wrote:

    This variant is specifically chosen due to the !RUNNING state of
    the device as the migration driver should do everything possible,
    including an internal reset of the device, to ensure that the
    device is fully stopped in this state.

That does not say that a driver must issue a reset to enter the ERROR
state.  Perhaps it's wrong that I'm equating this so formally to the
!RUNNING state when really we don't care about the internal state of
the device, we just want it to not corrupt memory or generate spurious
interrupts.  I'm thinking the equivalent of clear bus-master for PCI
devices.  Would it be sufficient if I clarified !RUNNING relative to
DMA and interrupt generation?

> > would need to reset the device to enter a new state anyway?  I added
> > this because language in your doc suggested the error state was far
> > more undefined that I understood it to be, ie. !RUNNING.  
> 
> Yes it was like that, because the implementation of this strict
> requirement is not nice.
> 
> Perhaps a middle ground can work:
> 
>   For device_state ERROR the device SHOULD have the device
>   !RUNNING. If the ERROR arose due to a device_state change and
>   if the new and old states have NDMA behavior then the device MUST
>   maintain NDMA behavior while processing the device_state and
>   continuing while in ERROR. Userspace MUST reset the device to
>   recover from ERROR, therefore devices SHOULD NOT do a redundant
>   internal reset

I don't have a problem if the reset is redundant to one the user needs
to do anyway, I'd rather see any externally visible operation of the
device stopped ASAP.  The new and old state NDMA-like properties is
also irrelevant, if a device enters an ERROR state moving from RUNNING
-> SAVING | RUNNING it shouldn't continue manipulating memory and
generating interrupts in the background.  What about:

    The !RUNNING variant is used here specifically to reflect that the
    device should immediately cease all external operations such as DMA
    and interrupts.  The migration driver should do everything
    possible, up to and including an internal reset of the device, to
    ensure that the device is externally quiescent in this state.

> > > > + *   Migration drivers should attempt to support any
> > > >     transition between valid    
> > > 
> > > should? must, I think.  
> > 
> > I think that "must" terminology is a bit contrary to the fact that
> >     we have a defined error state that can be used at the
> >     discretion of the migration driver.  To me, "should" tells the
> >     migration drivers that they ought to make an attempt to support
> >     all transitions, but userspace needs to be be prepared that
> >     they might not work.    
> 
> IMHO this is not inter-operable. At a minimum we should expect that a
> driver implements a set of standard transitions, or it has to
> implement all of them.
> 
> Otherwise what is the point?
> 
> If you go back to the mlx5 v1 version it did effectively this. It
> enforced a FSM and only allowed some transitions. That meets the
> language here with "should" but you didn't like it, and I agreed with
> you then.
> 
> This is when the trouble stated :)
> 
> The mlx5 v1 with the FSM didn't have alot of these problems we are
> discussing. It didn't have precedence issues, it didn't have problems
> executing odd combinations it can't support, it worked and was simple
> to understand.

And an audit of that driver during review found that it grossly failed
to meet the spirit of a "should" requirement.

Using "should" terminology here is meant to give the driver some
leeway, it's not an invitation for abuse.  Even below there's still a
notion that a given state transitions is unsupportable by your device,
what if that was actually true?

> So, if we say should here, then I vote mlx5 goes back to enforcing
> its FSM and that becomes the LCD that userspace must implement to.
> 
> In which case, why not formally specify the FSM now and avoid a driver
> pushing a defacto spec?

It really only takes one driver implementing something like SAVING ->
SAVING | RUNNING and QEMU taking advantage of it as a supported
transition per the uAPI for mlx5 to be left out of the feature that
might provide.

> If we say MUST here then we need to figure out a precedence and 
> say that some transitions are undefined behavior, like SAVING ->
> SAVING|RUNNING.

If we say "should" and don't do those thing, then we're still not
implementing to the spirit of the uAPI.  I'm hearing a lot of "may" in
your interpretation of "should".  And again, nothing wrong with that
transition, manage the migration stream better.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2021-12-20 17:38 ` Cornelia Huck
@ 2021-12-20 22:49   ` Alex Williamson
  2021-12-21 11:24     ` Cornelia Huck
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2021-12-20 22:49 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: jgg, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Mon, 20 Dec 2021 18:38:26 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Thu, Dec 09 2021, Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > A new NDMA state is being proposed to support a quiescent state for
> > contexts containing multiple devices with peer-to-peer DMA support.
> > Formally define it.  
> 
> [I'm wondering if we would want to use NDMA in other cases as well. Just
> thinking out loud below.]
> 
> >
> > Clarify various aspects of the migration region data fields and
> > protocol.  Remove QEMU related terminology and flows from the uAPI;
> > these will be provided in Documentation/ so as not to confuse the
> > device_state bitfield with a finite state machine with restricted
> > state transitions.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  include/uapi/linux/vfio.h |  405 ++++++++++++++++++++++++---------------------
> >  1 file changed, 214 insertions(+), 191 deletions(-)
> >
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index ef33ea002b0b..1fdbc928f886 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h  
> 
> (...)
> 
> > + *   The device_state field defines the following bitfield use:
> > + *
> > + *     - Bit 0 (RUNNING) [REQUIRED]:
> > + *        - Setting this bit indicates the device is fully operational, the
> > + *          device may generate interrupts, DMA, respond to MMIO, all vfio
> > + *          device regions are functional, and the device may advance its
> > + *          internal state.  The default device_state must indicate the device
> > + *          in exclusively the RUNNING state, with no other bits in this field
> > + *          set.
> > + *        - Clearing this bit (ie. !RUNNING) must stop the operation of the
> > + *          device.  The device must not generate interrupts, DMA, or advance
> > + *          its internal state.  The user should take steps to restrict access
> > + *          to vfio device regions other than the migration region while the
> > + *          device is !RUNNING or risk corruption of the device migration data
> > + *          stream.  The device and kernel migration driver must accept and
> > + *          respond to interaction to support external subsystems in the
> > + *          !RUNNING state, for example PCI MSI-X and PCI config space.
> > + *          Failure by the user to restrict device access while !RUNNING must
> > + *          not result in error conditions outside the user context (ex.
> > + *          host system faults).  
> 
> If I consider ccw, this would mean that user space would need to stop
> writing to the regions that initiate start/halt/... when RUNNING is
> cleared (makes sense) and that the subchannel must be idle or even
> disabled (so that it does not become status pending). The question is,
> does it make sense to stop new requests and wait for the subchannel to
> become idle during the !RUNNING transition (or even forcefully kill
> outstanding I/O), or...
> 

> > + *     - Bit 3 (NDMA) [OPTIONAL]:
> > + *        The NDMA or "No DMA" state is intended to be a quiescent state for
> > + *        the device for the purposes of managing multiple devices within a
> > + *        user context where peer-to-peer DMA between devices may be active.
> > + *        Support for the NDMA bit is indicated through the presence of the
> > + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
> > + *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
> > + *        region.
> > + *        - Setting this bit must prevent the device from initiating any
> > + *          new DMA or interrupt transactions.  The migration driver must
> > + *          complete any such outstanding operations prior to completing
> > + *          the transition to the NDMA state.  The NDMA device_state
> > + *          essentially represents a sub-set of the !RUNNING state for the
> > + *          purpose of quiescing the device, therefore the NDMA device_state
> > + *          bit is superfluous in combinations including !RUNNING.
> > + *        - Clearing this bit (ie. !NDMA) negates the device operational
> > + *          restrictions required by the NDMA state.  
> 
> ...should we use NDMA as the "stop new requests" state, but allow
> running channel programs to conclude? I'm not entirely sure whether
> that's in the spirit of NDMA (subchannels are independent of each
> other), but it would be kind of "quiescing" already.
> 
> (We should probably clarify things like that in the Documentation/
> file.)

This bumps into the discussion in my other thread with Jason, we need
to refine what NDMA means.  Based on my reply there and our previous
discussion that QEMU could exclude p2p mappings to support VMs with
multiple devices that don't support NDMA, I think that NDMA is only
quiescing p2p traffic (if so, maybe should be NOP2P).  So this use of
it seems out of scope to me.

Userspace necessarily needs to stop vCPUs before stopping devices,
which should mean that there are no new requests when a ccw device is
transitioning to !RUNNING.  Therefore I'd expect that the transition to
any !RUNNING state would wait from completion of running channel
programs.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2021-12-20 22:49   ` Alex Williamson
@ 2021-12-21 11:24     ` Cornelia Huck
  0 siblings, 0 replies; 35+ messages in thread
From: Cornelia Huck @ 2021-12-21 11:24 UTC (permalink / raw)
  To: Alex Williamson; +Cc: jgg, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Mon, Dec 20 2021, Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 20 Dec 2021 18:38:26 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
>
>> On Thu, Dec 09 2021, Alex Williamson <alex.williamson@redhat.com> wrote:
>> 
>> > A new NDMA state is being proposed to support a quiescent state for
>> > contexts containing multiple devices with peer-to-peer DMA support.
>> > Formally define it.  
>> 
>> [I'm wondering if we would want to use NDMA in other cases as well. Just
>> thinking out loud below.]
>> 
>> >
>> > Clarify various aspects of the migration region data fields and
>> > protocol.  Remove QEMU related terminology and flows from the uAPI;
>> > these will be provided in Documentation/ so as not to confuse the
>> > device_state bitfield with a finite state machine with restricted
>> > state transitions.
>> >
>> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> > ---
>> >  include/uapi/linux/vfio.h |  405 ++++++++++++++++++++++++---------------------
>> >  1 file changed, 214 insertions(+), 191 deletions(-)
>> >
>> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> > index ef33ea002b0b..1fdbc928f886 100644
>> > --- a/include/uapi/linux/vfio.h
>> > +++ b/include/uapi/linux/vfio.h  
>> 
>> (...)
>> 
>> > + *   The device_state field defines the following bitfield use:
>> > + *
>> > + *     - Bit 0 (RUNNING) [REQUIRED]:
>> > + *        - Setting this bit indicates the device is fully operational, the
>> > + *          device may generate interrupts, DMA, respond to MMIO, all vfio
>> > + *          device regions are functional, and the device may advance its
>> > + *          internal state.  The default device_state must indicate the device
>> > + *          in exclusively the RUNNING state, with no other bits in this field
>> > + *          set.
>> > + *        - Clearing this bit (ie. !RUNNING) must stop the operation of the
>> > + *          device.  The device must not generate interrupts, DMA, or advance
>> > + *          its internal state.  The user should take steps to restrict access
>> > + *          to vfio device regions other than the migration region while the
>> > + *          device is !RUNNING or risk corruption of the device migration data
>> > + *          stream.  The device and kernel migration driver must accept and
>> > + *          respond to interaction to support external subsystems in the
>> > + *          !RUNNING state, for example PCI MSI-X and PCI config space.
>> > + *          Failure by the user to restrict device access while !RUNNING must
>> > + *          not result in error conditions outside the user context (ex.
>> > + *          host system faults).  
>> 
>> If I consider ccw, this would mean that user space would need to stop
>> writing to the regions that initiate start/halt/... when RUNNING is
>> cleared (makes sense) and that the subchannel must be idle or even
>> disabled (so that it does not become status pending). The question is,
>> does it make sense to stop new requests and wait for the subchannel to
>> become idle during the !RUNNING transition (or even forcefully kill
>> outstanding I/O), or...
>> 
>
>> > + *     - Bit 3 (NDMA) [OPTIONAL]:
>> > + *        The NDMA or "No DMA" state is intended to be a quiescent state for
>> > + *        the device for the purposes of managing multiple devices within a
>> > + *        user context where peer-to-peer DMA between devices may be active.
>> > + *        Support for the NDMA bit is indicated through the presence of the
>> > + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
>> > + *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
>> > + *        region.
>> > + *        - Setting this bit must prevent the device from initiating any
>> > + *          new DMA or interrupt transactions.  The migration driver must
>> > + *          complete any such outstanding operations prior to completing
>> > + *          the transition to the NDMA state.  The NDMA device_state
>> > + *          essentially represents a sub-set of the !RUNNING state for the
>> > + *          purpose of quiescing the device, therefore the NDMA device_state
>> > + *          bit is superfluous in combinations including !RUNNING.
>> > + *        - Clearing this bit (ie. !NDMA) negates the device operational
>> > + *          restrictions required by the NDMA state.  
>> 
>> ...should we use NDMA as the "stop new requests" state, but allow
>> running channel programs to conclude? I'm not entirely sure whether
>> that's in the spirit of NDMA (subchannels are independent of each
>> other), but it would be kind of "quiescing" already.
>> 
>> (We should probably clarify things like that in the Documentation/
>> file.)
>
> This bumps into the discussion in my other thread with Jason, we need
> to refine what NDMA means.  Based on my reply there and our previous
> discussion that QEMU could exclude p2p mappings to support VMs with
> multiple devices that don't support NDMA, I think that NDMA is only
> quiescing p2p traffic (if so, maybe should be NOP2P).  So this use of
> it seems out of scope to me.

Ok, makes sense. If the scope of this flag is indeed to be supposed
quite narrow, it might make sense to rename it.

>
> Userspace necessarily needs to stop vCPUs before stopping devices,
> which should mean that there are no new requests when a ccw device is
> transitioning to !RUNNING.  Therefore I'd expect that the transition to
> any !RUNNING state would wait from completion of running channel
> programs.

Indeed, it should not be any problem to do this for !RUNNING, I had just
been wondering about possible alternative implementations.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2021-12-14 16:26     ` Jason Gunthorpe
  2021-12-20 22:26       ` Alex Williamson
@ 2022-01-04  3:49       ` Tian, Kevin
  2022-01-04 16:09         ` Jason Gunthorpe
  1 sibling, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2022-01-04  3:49 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic, Lu, Baolu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, December 15, 2021 12:27 AM
> > > > + *          complete any such outstanding operations prior to completing
> > > > + *          the transition to the NDMA state.  The NDMA device_state
> > >
> > > Reading this as you wrote it and I suddenly have a doubt about the PRI
> > > use case. Is it reasonable that the kernel driver will block on NDMA
> > > waiting for another userspace thread to resolve any outstanding PRIs?
> > >
> > > Can that allow userspace to deadlock the kernel or device? Is there an
> > > alterative?
> >
> > I'd hope we could avoid deadlock in the kernel, but it seems trickier
> > for userspace to be waiting on a write(2) operation to the device while
> > also handling page request events for that same device.  Is this
> > something more like a pending transaction bit where userspace asks the
> > device to go quiescent and polls for that to occur?
> 
> Hum. I'm still looking into this question, but some further thoughts.
> 
> PRI doesn't do DMA, it just transfers a physical address into the PCI
> device's cache that can be later used with DMA.
> 
> PRI also doesn't imply the vPRI Intel is talking about.

This is correct. PRI can happen on either kernel-managed page table
or user-managed page table. Only for latter case the PRI needs be
forwarded to userspace for fixup.

> 
> For PRI controlled by the hypervisor, it is completely reasonable that
> NDMA returns synchronously after the PRI and the DMA that triggered it
> completes. The VMM would have to understand this and ensure it doesn't
> block the kernel's fault path while going to NDMA eg with userfaultfd
> or something else crazy.

I don't think there would be any problem on this usage.

> 
> The other reasonable option is that NDMA cancels the DMA that
> triggered the PRI and simply doesn't care how the PRI is completed
> after NDMA returns.
> 
> The later is interesting because it is a possible better path to solve
> the vPRI problem Intel brought up. Waiting for the VCPU is just asking
> for a DOS, if NDMA can cancel the DMAs we can then just directly fail

cancel and save the context so the aborted transaction can be resumed
on the target node.

> the open PRI in the hypervisor and we don't need to care about the
> VCPU. Some mess to fixup in the vIOMMU protocol on resume, but the
> resume'd device simply issues a new DMA with an empty ATS cache and
> does a new PRI.
> 
> It is uncertain enough that qemu should not support vPRI with
> migration until we define protocol(s) and a cap flag to say the device
> supports it.
> 

However this is too restricting. It's an ideal option but in reality it
implies the capability that the device can preempt and recover an
in-fly request in any granularity (given PRI can occur at any time). 
I was clearly told by hardware guys about how challenging to 
achieve this goal on various IPs, which is also the reason why the 
draining operation on most devices today is more-or-less a waiting 
flavor.

btw can you elaborate the DOS concern? The device is assigned
to an user application, which has one thread (migration thread)
blocked on another thread (vcpu thread) when transiting the
device to NDMA state. What service outside of this application
is denied here?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-04  3:49       ` Tian, Kevin
@ 2022-01-04 16:09         ` Jason Gunthorpe
  2022-01-05  1:59           ` Tian, Kevin
  2022-01-05  3:06           ` Tian, Kevin
  0 siblings, 2 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2022-01-04 16:09 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

On Tue, Jan 04, 2022 at 03:49:07AM +0000, Tian, Kevin wrote:
 
> btw can you elaborate the DOS concern? The device is assigned
> to an user application, which has one thread (migration thread)
> blocked on another thread (vcpu thread) when transiting the
> device to NDMA state. What service outside of this application
> is denied here?

The problem is the VM controls when the vPRI is responded and
migration cannot proceed until this is done.

So the basic DOS is for a hostile VM to trigger a vPRI and then never
answer it. Even trivially done from userspace with a vSVA and
userfaultfd, for instance.

This will block the hypervisor from ever migrating the VM in a very
poor way - it will just hang in the middle of a migration request.

Regardless of the complaints of the IP designers, this is a very poor
direction.

Progress in the hypervisor should never be contingent on a guest VM.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2021-12-20 22:26       ` Alex Williamson
@ 2022-01-04 20:28         ` Jason Gunthorpe
  2022-01-06 18:17           ` Alex Williamson
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2022-01-04 20:28 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Mon, Dec 20, 2021 at 03:26:23PM -0700, Alex Williamson wrote:

> > It also raises the question that it seems not well defined what the
> > sequence:
> > 
> > SAVING -> SAVING | RUNNING
> > 
> > Even does to the migration window?
> > 
> > Nor can I imagine what mlx5 could do beyond fail this or corrupt the
> > migration..
> 
> I think this comes down to the robustness of the migration driver.  The
> migration data window and control of how userspace is to interact with
> it is essentially meant to allow the migration driver to implement its
> own transport protocol.  In the case of mlx5 where it expects only to
> apply the received migration data on the RESUMING -> RUNNING
> transition, a "short" data segment might be reserved for providing
> sequencing information.  Each time the device enters SAVING | !RUNNING
> the driver might begin by presenting a new sequence header.  On the
> target, a new sequence header would cause any previously received
> migration data to be discarded.  A similar header would also be
> suggested to validate the migration data stream is appropriate for the
> target device.

Honestly, I have no interest in implementing something so complicated
for what should be a simple operation. We have no use case for this,
no desire to test it, it is just pure kernel cruft and complexity to
do this kind of extra work.

I think it is very telling that we are, what, 10 weeks into this now,
we have seen two HW drivers posted, and NOTHING implements like you
are imagining here. I doubt the closed drivers do any better.

Let's not make up busy work that can't be strongly justified! 

That is substantially what I see as wrong with this whole line of
thinking that the device_state must be independent bits, not a
constrained FSM.

We are actively failing, again and again, to tell if drivers are
implementing this mulit-bit vision correctly, or even specify properly
how it should work in an implementable way.

> > IMHO, we should be simplifing this before it becomes permanent API,
> > not trying to force it to work.
> 
> I agree, this is our opportunity to simplify before we're committed,
> but I don't see how we can throw out perfectly valid transitions like
> SAVING -> SAVING | RUNNING just because the driver hasn't accounted for
> managing data in the data stream.

Don't see? What do you mean? We showed how to do this exactly in the
v1.

We define the meaning of device_state to be actual FSM and write out
the allowed FSM arcs and then properly describe them.

This is what everyone else though this was already.

AFAICT you are the only person to view this as a bunch of bits. It is
why the original header file comment gave names to each of the states
and roughtly sketches out legal state transition arcs with the odd
ascii art.

So I want to stop trying to make the bunch of bits idea work. Let's
make it a FSM, let's define exactly the legal arcs, and then define
the behaviors in each FSM state. It is easy to understand and we have
a hope to get inter-operable implementations.

All the driver postings so far are demonstrating they don't get
oddball transition arcs correct, and we are clearly not able to find
these things during review.

And now you are asking for alot of extra work and complications in the
driver to support arcs that will never be used - that is really too
far, sorry.

> > Most of things I brought up in this post are resolved by the forced
> > FSM.
> 
> Until userspace tries to do something different than exactly what it
> does today, and then what?

It can't. We define the API to be exactly the permited arcs and no
others. That is what simplify means.

If we need to add new FSM arcs and new states later then their support
can be exposed via a cap flag.

This is much better than trying to define all arcs as legal, only
testing/using a small subset and hoping the somehow in the future this
results in extensible interoperability.

> > Another view is that staying in a useless state is also pointless and
> > we may as well return ERROR anyhow. Eg if exiting RESUMING failed
> > there is no other action to take besides RESET, so why didn't we
> > return ERROR to tell this directly to userspace?
> 
> And then the last packet arrives, gets written to the device that's
> still in RESUMING, and now can transition to RUNNING.

Huh? If the device indicated error during RESUMING userspace should
probably stop shoving packets into it or it will possibly corrupt the
migration stream.

> But I think we're really only after that p2p behavior and we've
> discussed that disabling p2p mappings in the VM would be a sufficient
> condition to support multiple devices without NDMA support.  I think
> that means DMA to memory is fine and DMA related to MSI is fine... but
> how does a device know which DMA is memory and which DMA is another
> device?

The device doesn't know if a particular DMA is P2P or not. This is why
the device action is called 'NO DMA'.

MSI is fine to be left unspecified because we currently virtualize all
the MSI register writes and it is impossible for a hostile guest to
corrupt them to point the address to anything but the interrupt
controller. If a MSI triggers or not in NDMA doesn't practically
matter.

It only starts to matter someday if we get the world Thomas is
thinking about where the guest can directly program the MSI registers.

> > Even if it did work reliably, the requirement is userspace must issue
> > RESET to exit ERROR and if we say the driver has to issue reset to
> > enter ERROR we are just doing a pointless double RESET.
> 
> Please read what I wrote:
> 
>     This variant is specifically chosen due to the !RUNNING state of
>     the device as the migration driver should do everything possible,
>     including an internal reset of the device, to ensure that the
>     device is fully stopped in this state.
> 
> That does not say that a driver must issue a reset to enter the ERROR
> state.  

Huh? "everything possible including an internal reset" sure sounds
like a device must issue a reset in some cases. If we keep with your
idea to rarely use ERROR then I think all the mlx5 cases that would
hit it are already so messed up that RESET will be mandatory.

> I don't have a problem if the reset is redundant to one the user needs
> to do anyway, I'd rather see any externally visible operation of the
> device stopped ASAP.  

Why? It was doing all those things before it had an error, why
should it suddenly stop now? What is this extra work helping?

Remember if we choose to return an error code instead of ERROR the
device is still running as it was, I don't see an benifit to making
ERROR different here.

ERROR just means the device has to be reset, we don't need the device
to stop doing what it was doing.

> The new and old state NDMA-like properties is also irrelevant, if a
> device enters an ERROR state moving from RUNNING -> SAVING | RUNNING
> it shouldn't continue manipulating memory and generating interrupts
> in the background.

I prefer a model where the device is allowed to keep doing whatever it
was doing before it hit the error. You are pushing for a model where
upon error we must force the device to stop.

For this view it is why the old state matters, if it was previously
allowed to DMA then it continues to be allowed to do DMA, etc.

> > The mlx5 v1 with the FSM didn't have alot of these problems we are
> > discussing. It didn't have precedence issues, it didn't have problems
> > executing odd combinations it can't support, it worked and was simple
> > to understand.
> 
> And an audit of that driver during review found that it grossly failed
> to meet the spirit of a "should" requirement.

That isn't how I see things.. The v1 driver implemented the uAPI we
all thought existed, was the FSM based uAPI the original patch authors
intended, and implemented only the FSM arcs discussed in the header
file comment.

Your idea that this is not a FSM seems to be unique here. I think
we've explored it to a reasonable conclusion to find it isn't working
out. Let's stop please.

Yishai can prepare a version with the FSM design back in including
NDMA and we can look at it.

> > In which case, why not formally specify the FSM now and avoid a driver
> > pushing a defacto spec?
> 
> It really only takes one driver implementing something like SAVING ->
> SAVING | RUNNING and QEMU taking advantage of it as a supported
> transition per the uAPI for mlx5 to be left out of the feature that
> might provide.

The uAPI spec is irrelavent, qemu can't just suddenly start doing
things that don't work on supported drivers.

What we've seen many times in the kernel is that uapis that don't have
driver interoperability don't succeed.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-04 16:09         ` Jason Gunthorpe
@ 2022-01-05  1:59           ` Tian, Kevin
  2022-01-05 12:45             ` Jason Gunthorpe
  2022-01-05  3:06           ` Tian, Kevin
  1 sibling, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2022-01-05  1:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, January 5, 2022 12:10 AM
> 
> On Tue, Jan 04, 2022 at 03:49:07AM +0000, Tian, Kevin wrote:
> 
> > btw can you elaborate the DOS concern? The device is assigned
> > to an user application, which has one thread (migration thread)
> > blocked on another thread (vcpu thread) when transiting the
> > device to NDMA state. What service outside of this application
> > is denied here?
> 
> The problem is the VM controls when the vPRI is responded and
> migration cannot proceed until this is done.
> 
> So the basic DOS is for a hostile VM to trigger a vPRI and then never
> answer it. Even trivially done from userspace with a vSVA and
> userfaultfd, for instance.
> 
> This will block the hypervisor from ever migrating the VM in a very
> poor way - it will just hang in the middle of a migration request.

it's poor but 'hang' won't happen. PCI spec defines completion timeout
for ATS translation request. If timeout the device will abort the in-fly
request and report error back to software. 

> 
> Regardless of the complaints of the IP designers, this is a very poor
> direction.
> 
> Progress in the hypervisor should never be contingent on a guest VM.
> 

Whether the said DOS is a real concern and how severe it is are usage 
specific things. Why would we want to hardcode such restriction on
an uAPI? Just give the choice to the admin (as long as this restriction is
clearly communicated to userspace clearly)...

IMHO encouraging IP designers to work out better hardware shouldn't
block supporting current hardware which has limitations but also values
in scenarios where those limitations are tolerable.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-04 16:09         ` Jason Gunthorpe
  2022-01-05  1:59           ` Tian, Kevin
@ 2022-01-05  3:06           ` Tian, Kevin
  1 sibling, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2022-01-05  3:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

> From: Tian, Kevin
> Sent: Wednesday, January 5, 2022 9:59 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, January 5, 2022 12:10 AM
> >
> > On Tue, Jan 04, 2022 at 03:49:07AM +0000, Tian, Kevin wrote:
> >
> > > btw can you elaborate the DOS concern? The device is assigned
> > > to an user application, which has one thread (migration thread)
> > > blocked on another thread (vcpu thread) when transiting the
> > > device to NDMA state. What service outside of this application
> > > is denied here?
> >
> > The problem is the VM controls when the vPRI is responded and
> > migration cannot proceed until this is done.
> >
> > So the basic DOS is for a hostile VM to trigger a vPRI and then never
> > answer it. Even trivially done from userspace with a vSVA and
> > userfaultfd, for instance.
> >
> > This will block the hypervisor from ever migrating the VM in a very
> > poor way - it will just hang in the middle of a migration request.
> 
> it's poor but 'hang' won't happen. PCI spec defines completion timeout
> for ATS translation request. If timeout the device will abort the in-fly
> request and report error back to software.
> 
> >
> > Regardless of the complaints of the IP designers, this is a very poor
> > direction.
> >
> > Progress in the hypervisor should never be contingent on a guest VM.
> >
> 
> Whether the said DOS is a real concern and how severe it is are usage
> specific things. Why would we want to hardcode such restriction on
> an uAPI? Just give the choice to the admin (as long as this restriction is
> clearly communicated to userspace clearly)...
> 
> IMHO encouraging IP designers to work out better hardware shouldn't
> block supporting current hardware which has limitations but also values
> in scenarios where those limitations are tolerable.
> 

btw although the uapi is named 'migration', it's really about device
state management. Whether the managed device state is further 
migrated and whether failure to migrate is severe are really not 
the kernel's business.

It's just simple that changing device state could fail. and vPRI here is
just one failure reason due to no response from the user after certain 
timeout (for a user-managed page table).

Then it's Qemu which should document the restriction and provide
options for the admin to decide whether to expose vPRI vs. migration
based on specific usage requirement. The choices could be vPRI-off/
migration-on, vPRI-on/migration-off, or enabling both (migration
failure is tolerable or no 'hostile' VM in the setup)...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-05  1:59           ` Tian, Kevin
@ 2022-01-05 12:45             ` Jason Gunthorpe
  2022-01-06  6:32               ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2022-01-05 12:45 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

On Wed, Jan 05, 2022 at 01:59:31AM +0000, Tian, Kevin wrote:

> > This will block the hypervisor from ever migrating the VM in a very
> > poor way - it will just hang in the middle of a migration request.
> 
> it's poor but 'hang' won't happen. PCI spec defines completion timeout
> for ATS translation request. If timeout the device will abort the in-fly
> request and report error back to software. 

The PRI time outs have to be long enough to handle swap back from
disk, so 'hang' will be a fair amount of time..
 
> > Regardless of the complaints of the IP designers, this is a very poor
> > direction.
> > 
> > Progress in the hypervisor should never be contingent on a guest VM.
> > 
> 
> Whether the said DOS is a real concern and how severe it is are usage 
> specific things. Why would we want to hardcode such restriction on
> an uAPI? Just give the choice to the admin (as long as this restriction is
> clearly communicated to userspace clearly)...

IMHO it is not just DOS, PRI can become dependent on IO which requires
DMA to complete.

You could quickly get yourself into a deadlock situation where the
hypervisor has disabled DMA activities of other devices and the vPRI
simply cannot be completed.

I just don't see how this scheme is generally workable without a lot
of limitations.

While I do agree we should support the HW that exists, we should
recognize this is not a long term workable design and treat it as
such.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-05 12:45             ` Jason Gunthorpe
@ 2022-01-06  6:32               ` Tian, Kevin
  2022-01-06 15:42                 ` Jason Gunthorpe
  0 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2022-01-06  6:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, January 5, 2022 8:46 PM
> 
> On Wed, Jan 05, 2022 at 01:59:31AM +0000, Tian, Kevin wrote:
> 
> > > This will block the hypervisor from ever migrating the VM in a very
> > > poor way - it will just hang in the middle of a migration request.
> >
> > it's poor but 'hang' won't happen. PCI spec defines completion timeout
> > for ATS translation request. If timeout the device will abort the in-fly
> > request and report error back to software.
> 
> The PRI time outs have to be long enough to handle swap back from
> disk, so 'hang' will be a fair amount of time..

This reminds me one interesting point.

Putting PRI aside the time to drain in-fly requests is undefined. It depends
on how many pending requests to be waited for before completing the
draining command on the device. This is IP specific (e.g. whether supports
preemption) and also guest specific (e.g. whether it's actively submitting
workload).

So even without hostile attempts the draining time may exceed what an
user tolerates in live migration.

This suggests certain software timeout mechanism might be necessary 
when transitioning to NDMA state, with the timeout value optionally
configurable by the user. If timeout, then fail the state transition
request.

And once such mechanism is in place, PRI is automatically covered as it
is just one implicit reason which may increase the draining time.

> 
> > > Regardless of the complaints of the IP designers, this is a very poor
> > > direction.
> > >
> > > Progress in the hypervisor should never be contingent on a guest VM.
> > >
> >
> > Whether the said DOS is a real concern and how severe it is are usage
> > specific things. Why would we want to hardcode such restriction on
> > an uAPI? Just give the choice to the admin (as long as this restriction is
> > clearly communicated to userspace clearly)...
> 
> IMHO it is not just DOS, PRI can become dependent on IO which requires
> DMA to complete.
> 
> You could quickly get yourself into a deadlock situation where the
> hypervisor has disabled DMA activities of other devices and the vPRI
> simply cannot be completed.

How is it related to PRI which is only about address translation?

Instead, above is a general p2p problem for any draining operation. How 
to solve it needs to be defined clearly for this NDMA state (which I suppose
is being discussed between you and Alex and I still need time to catch
up).

> 
> I just don't see how this scheme is generally workable without a lot
> of limitations.
> 
> While I do agree we should support the HW that exists, we should
> recognize this is not a long term workable design and treat it as
> such.
> 

Definitely agree with this point. We software people should continue
influencing IP designers toward a long-term software friendly design.
and also bear the fact that it takes time... 😊

Thanks
Kevin 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-06  6:32               ` Tian, Kevin
@ 2022-01-06 15:42                 ` Jason Gunthorpe
  2022-01-07  0:00                   ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2022-01-06 15:42 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

On Thu, Jan 06, 2022 at 06:32:57AM +0000, Tian, Kevin wrote:

> Putting PRI aside the time to drain in-fly requests is undefined. It depends
> on how many pending requests to be waited for before completing the
> draining command on the device. This is IP specific (e.g. whether supports
> preemption) and also guest specific (e.g. whether it's actively submitting
> workload).

You are assuming a model where NDMA has to be implemented by pushing a
command, but I would say that is very poor IP design.

A device is fully in self-control of its own DMA and it should simply
stop it quickly when doing NDMA.

Devices that are poorly designed here will have very long migration
downtime latencies and people simply won't want to use them.
 
> > > Whether the said DOS is a real concern and how severe it is are usage
> > > specific things. Why would we want to hardcode such restriction on
> > > an uAPI? Just give the choice to the admin (as long as this restriction is
> > > clearly communicated to userspace clearly)...
> > 
> > IMHO it is not just DOS, PRI can become dependent on IO which requires
> > DMA to complete.
> > 
> > You could quickly get yourself into a deadlock situation where the
> > hypervisor has disabled DMA activities of other devices and the vPRI
> > simply cannot be completed.
> 
> How is it related to PRI which is only about address translation?

In something like SVA PRI can request a page which is not present and
the OS has to do DMA to load the page back from storage to make it
present and respond to the translation request.

The DMA is not related to the device doing the PRI in the first place,
but if the hypervisor has blocked the DMA already for some other
reason (perhaps that device is also doing PRI) then it all will
deadlock.

> Instead, above is a general p2p problem for any draining operation. 

Nothing to do with p2p.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-04 20:28         ` Jason Gunthorpe
@ 2022-01-06 18:17           ` Alex Williamson
  2022-01-06 21:20             ` Jason Gunthorpe
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2022-01-06 18:17 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Tue, 4 Jan 2022 16:28:34 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Dec 20, 2021 at 03:26:23PM -0700, Alex Williamson wrote:
> 
> > > It also raises the question that it seems not well defined what the
> > > sequence:
> > > 
> > > SAVING -> SAVING | RUNNING
> > > 
> > > Even does to the migration window?
> > > 
> > > Nor can I imagine what mlx5 could do beyond fail this or corrupt the
> > > migration..  
> > 
> > I think this comes down to the robustness of the migration driver.  The
> > migration data window and control of how userspace is to interact with
> > it is essentially meant to allow the migration driver to implement its
> > own transport protocol.  In the case of mlx5 where it expects only to
> > apply the received migration data on the RESUMING -> RUNNING
> > transition, a "short" data segment might be reserved for providing
> > sequencing information.  Each time the device enters SAVING | !RUNNING
> > the driver might begin by presenting a new sequence header.  On the
> > target, a new sequence header would cause any previously received
> > migration data to be discarded.  A similar header would also be
> > suggested to validate the migration data stream is appropriate for the
> > target device.  
> 
> Honestly, I have no interest in implementing something so complicated
> for what should be a simple operation. We have no use case for this,
> no desire to test it, it is just pure kernel cruft and complexity to
> do this kind of extra work.

Migration is not a simple operation, we have a device with a host
kernel driver exporting and importing device state through userspace.
It's a playground for potential exploit vectors.  I assume this also
means that you're not performing any sort of validation that the
incoming data is from a compatible device or providing any versioning
accommodations in the data stream, all features that would generally be
considered best practices.

> I think it is very telling that we are, what, 10 weeks into this now,
> we have seen two HW drivers posted, and NOTHING implements like you
> are imagining here. I doubt the closed drivers do any better.
>
> Let's not make up busy work that can't be strongly justified! 

I think I'm portraying the uAPI as it was designed and envisioned to
work.  Of course I also have doubts whether the closed drivers perform
any sort of validation or consistency checking, and we can only guess
what sort of attack vectors might exist as a result.
 
> That is substantially what I see as wrong with this whole line of
> thinking that the device_state must be independent bits, not a
> constrained FSM.
> 
> We are actively failing, again and again, to tell if drivers are
> implementing this mulit-bit vision correctly, or even specify properly
> how it should work in an implementable way.
> 
> > > IMHO, we should be simplifing this before it becomes permanent API,
> > > not trying to force it to work.  
> > 
> > I agree, this is our opportunity to simplify before we're committed,
> > but I don't see how we can throw out perfectly valid transitions like
> > SAVING -> SAVING | RUNNING just because the driver hasn't accounted for
> > managing data in the data stream.  
> 
> Don't see? What do you mean? We showed how to do this exactly in the
> v1.
> 
> We define the meaning of device_state to be actual FSM and write out
> the allowed FSM arcs and then properly describe them.
> 
> This is what everyone else though this was already.
> 
> AFAICT you are the only person to view this as a bunch of bits. It is
> why the original header file comment gave names to each of the states
> and roughtly sketches out legal state transition arcs with the odd
> ascii art.
> 
> So I want to stop trying to make the bunch of bits idea work. Let's
> make it a FSM, let's define exactly the legal arcs, and then define
> the behaviors in each FSM state. It is easy to understand and we have
> a hope to get inter-operable implementations.
> 
> All the driver postings so far are demonstrating they don't get
> oddball transition arcs correct, and we are clearly not able to find
> these things during review.
> 
> And now you are asking for alot of extra work and complications in the
> driver to support arcs that will never be used - that is really too
> far, sorry.

It's the uAPI as I understand it.  If you want a new uAPI, propose one.
I'm not willing to accept a driver that partially implements the uAPI
with an addendum document vaguely hinting at the ways it might be
limited, with the purpose of subverting the written uAPI by a de facto
standard.

> > > Most of things I brought up in this post are resolved by the forced
> > > FSM.  
> > 
> > Until userspace tries to do something different than exactly what it
> > does today, and then what?  
> 
> It can't. We define the API to be exactly the permited arcs and no
> others. That is what simplify means.

IOW, define the uAPI based on what happens to be the current QEMU
implementation and limitations.  That's exactly what we were trying to
avoid in the uAPI design.
 
> If we need to add new FSM arcs and new states later then their support
> can be exposed via a cap flag.
> 
> This is much better than trying to define all arcs as legal, only
> testing/using a small subset and hoping the somehow in the future this
> results in extensible interoperability.

A proposal of which states transitions you want to keep would be useful
here.  Let's look at all the possibilities:

{}
	-> {RUNNING}
	-> {SAVING}
	-> {RESUMING}
	-> {RUNNING|SAVING}

{RUNNING}
	-> {} (a)*
	-> {SAVING} (a)
	-> {RESUMING} (a)
	-> {RUNNING|SAVING} (a)

{SAVING}
	-> {} (a)*
	-> {RUNNING} (b)
	-> {RESUMING}
	-> {RUNNING|SAVING}

{RESUMING}
	-> {}
	-> {RUNNING} (a)
	-> {SAVING}
	-> {RUNNING|SAVING}

{RUNNING|SAVING}
	-> {}
	-> {RUNNING} (b)
	-> {SAVING} (a)
	-> {RESUMING}

We have 20 possible transitions.  I've marked those available via the
"odd ascii art" diagram as (a), that's 7 transitions.  We could
essentially remove the NULL state as unreachable as there seems little
value in the 2 transitions marked (a)* if we look only at migration,
that would bring us down to 5 of 12 possible transitions.  We need to
give userspace an abort path though, so we minimally need the 2
transitions marked (b) (7/12).

So now we can discuss the remaining 5 transitions:

{SAVING} -> {RESUMING}
	If not supported, user can achieve this via:
		{SAVING}->{RUNNING}->{RESUMING}
		{SAVING}-RESET->{RUNNING}->{RESUMING}
	It would likely be dis-recommended to return a device to
	{RUNNING} for this use case, so the latter would be preferred.

	Potential use case: ping-pong migration

{SAVING} -> {RUNNING|SAVING}
	If not supported, user can achieve this via:
		{SAVING}->{RUNNING}->{RUNNING|SAVING}

	Potential use case: downtime exceeded, avoid full migration
	restart (likely not achieved with the alternative flow).

{RESUMING} -> {SAVING}
	If not supported:
		{RESUMING}->{RUNNING}->{SAVING}
		{RESUMING}-RESET->{RUNNING}->{SAVING}

	Potential use case: validate migration data in = data out (also
	cannot be achieved with alternative flow, as device passes
	through RUNNING)

{RESUMING} -> {RUNNING|SAVING}
	If not supported:
		{RESUMING}->{RUNNING}->{RUNNING|SAVING}

	Potential use case: live ping-pong migration (alternative flow
	is likely sufficient)

{RUNNING|SAVING} -> {RESUMING}
	If not supported:
		{RUNNING|SAVING}->{RUNNING}->{RESUMING}
		{RUNNING|SAVING}-RESET->{RUNNING}->{RESUMING}
	(again former is likely dis-recommended)

	Potential use case: ???

So what's the proposal?  Do we ditch all of these?  Some of these?  If
drivers follow the previously provided pseudo algorithm with the
requirement that they cannot pass through an invalid state, we need to
formally address whether the NULL state is invalid or just not
reachable by the user.

> > > Another view is that staying in a useless state is also pointless and
> > > we may as well return ERROR anyhow. Eg if exiting RESUMING failed
> > > there is no other action to take besides RESET, so why didn't we
> > > return ERROR to tell this directly to userspace?  
> > 
> > And then the last packet arrives, gets written to the device that's
> > still in RESUMING, and now can transition to RUNNING.  
> 
> Huh? If the device indicated error during RESUMING userspace should
> probably stop shoving packets into it or it will possibly corrupt the
> migration stream.

If a {RESUMING}->{RUNNING} transition fails and the device remains in
{RESUMING}, it should be valid for userspace to push data to it.  If
the driver wants to indicate the transition attempt failed AND it won't
accept continuing data or a re-initialized data stream, it probably
should put the device into {ERROR} instead.
 
> > But I think we're really only after that p2p behavior and we've
> > discussed that disabling p2p mappings in the VM would be a sufficient
> > condition to support multiple devices without NDMA support.  I think
> > that means DMA to memory is fine and DMA related to MSI is fine... but
> > how does a device know which DMA is memory and which DMA is another
> > device?  
> 
> The device doesn't know if a particular DMA is P2P or not. This is why
> the device action is called 'NO DMA'.
> 
> MSI is fine to be left unspecified because we currently virtualize all
> the MSI register writes and it is impossible for a hostile guest to
> corrupt them to point the address to anything but the interrupt
> controller. If a MSI triggers or not in NDMA doesn't practically
> matter.

The vector table is directly accessible via the region mmap.  It
previously was not, but that becomes a problem with 64k page sizes, and
even some poorly designed devices on 4k systems when they don't honor
the PCI spec recommended alignments.  But I think that's beside the
point, if the user has vectors pointed at memory or other devices,
they've rather already broken their contract for using the device.

But I think you've identified two classes of DMA, MSI and everything
else.  The device can assume that an MSI is special and not included in
NDMA, but it can't know whether other arbitrary DMAs are p2p or memory.
If we define that the minimum requirement for multi-device migration is
to quiesce p2p DMA, ex. by not allowing it at all, then NDMA is
actually significantly more restrictive while it's enabled.

> It only starts to matter someday if we get the world Thomas is
> thinking about where the guest can directly program the MSI registers.
> 
> > > Even if it did work reliably, the requirement is userspace must issue
> > > RESET to exit ERROR and if we say the driver has to issue reset to
> > > enter ERROR we are just doing a pointless double RESET.  
> > 
> > Please read what I wrote:
> > 
> >     This variant is specifically chosen due to the !RUNNING state of
> >     the device as the migration driver should do everything possible,
> >     including an internal reset of the device, to ensure that the
> >     device is fully stopped in this state.
> > 
> > That does not say that a driver must issue a reset to enter the ERROR
> > state.    
> 
> Huh? "everything possible including an internal reset" sure sounds
> like a device must issue a reset in some cases. If we keep with your
> idea to rarely use ERROR then I think all the mlx5 cases that would
> hit it are already so messed up that RESET will be mandatory.
> 
> > I don't have a problem if the reset is redundant to one the user needs
> > to do anyway, I'd rather see any externally visible operation of the
> > device stopped ASAP.    
> 
> Why? It was doing all those things before it had an error, why
> should it suddenly stop now? What is this extra work helping?
> 
> Remember if we choose to return an error code instead of ERROR the
> device is still running as it was, I don't see an benifit to making
> ERROR different here.
> 
> ERROR just means the device has to be reset, we don't need the device
> to stop doing what it was doing.

Should a device in the ERROR state continue operation or be in a
quiesced state?  It seems obvious to me that since the ERROR state is
essentially undefined, the device should cease operations and be
quiesced by the driver.  If the device is continuing to operate in the
previous state, why would the driver place the device in the ERROR
state?  It should have returned an errno and left the device in the
previous state.
 
> > The new and old state NDMA-like properties is also irrelevant, if a
> > device enters an ERROR state moving from RUNNING -> SAVING | RUNNING
> > it shouldn't continue manipulating memory and generating interrupts
> > in the background.  
> 
> I prefer a model where the device is allowed to keep doing whatever it
> was doing before it hit the error. You are pushing for a model where
> upon error we must force the device to stop.

If the device continues operating in the previous mode then it
shouldn't enter the ERROR state, it should return errno and re-reading
device_state should indicate it's in the previous state.

> For this view it is why the old state matters, if it was previously
> allowed to DMA then it continues to be allowed to do DMA, etc.

If it's still running normally, it shouldn't have been reported in the
ERROR state.

> > > The mlx5 v1 with the FSM didn't have alot of these problems we are
> > > discussing. It didn't have precedence issues, it didn't have problems
> > > executing odd combinations it can't support, it worked and was simple
> > > to understand.  
> > 
> > And an audit of that driver during review found that it grossly failed
> > to meet the spirit of a "should" requirement.  
> 
> That isn't how I see things.. The v1 driver implemented the uAPI we
> all thought existed, was the FSM based uAPI the original patch authors
> intended, and implemented only the FSM arcs discussed in the header
> file comment.
> 
> Your idea that this is not a FSM seems to be unique here. I think
> we've explored it to a reasonable conclusion to find it isn't working
> out. Let's stop please.
> 
> Yishai can prepare a version with the FSM design back in including
> NDMA and we can look at it.

Sorry, but I was actually there and participating in original
development of the uAPI.  If you'd like to propose a different uAPI do
so, but again, I'm not going to accept a driver specifically looking to
compromise what I understand to be the intent of the current
specification in order to create a de facto standard outside of that
specification.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-06 18:17           ` Alex Williamson
@ 2022-01-06 21:20             ` Jason Gunthorpe
  2022-01-10  7:55               ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2022-01-06 21:20 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Thu, Jan 06, 2022 at 11:17:18AM -0700, Alex Williamson wrote:

> > Honestly, I have no interest in implementing something so complicated
> > for what should be a simple operation. We have no use case for this,
> > no desire to test it, it is just pure kernel cruft and complexity to
> > do this kind of extra work.
> 
> Migration is not a simple operation, we have a device with a host
> kernel driver exporting and importing device state through userspace.
> It's a playground for potential exploit vectors.  I assume this also
> means that you're not performing any sort of validation that the
> incoming data is from a compatible device or providing any versioning
> accommodations in the data stream, all features that would generally be
> considered best practices.

Our architecture has FW code performing all validation and versioning.

The driver is a dumb pipe and works on a simple snapshot basis.

The snapshot basis is the same approach the huawei driver and the 3
other drivers we have in internal development use.

The idea of forcing every snapshot style driver into some complicated
streaming scheme is bad, IMHO. We should be having code sharing here,
see below please.
 
> It's the uAPI as I understand it.  If you want a new uAPI, propose
> one.

Oh. I thought that was effectively what we are doing. So let's be
clear about it.

We are proposing a 'new API'. It will be binary and backwards
compatible with current qemu.

> > It can't. We define the API to be exactly the permited arcs and no
> > others. That is what simplify means.
> 
> IOW, define the uAPI based on what happens to be the current QEMU
> implementation and limitations.  That's exactly what we were trying to
> avoid in the uAPI design.

I don't like the characterization. Selecting only the arcs that are
useful, and well defined just happens to overlap with most of the
the arcs that qemu uses exactly because qemu is using the useful
subset.

The stuff beyond that is possibly veering into over-engineering. It
was OK when it looked like those behaviors would fall out naturally,
eg via precedence/etc, but now that we fully understand that
implementing the unused part has a real implementation cost, it is not
appealing.

> > If we need to add new FSM arcs and new states later then their support
> > can be exposed via a cap flag.
> > 
> > This is much better than trying to define all arcs as legal, only
> > testing/using a small subset and hoping the somehow in the future this
> > results in extensible interoperability.
> 
> A proposal of which states transitions you want to keep would be useful
> here.

Yishai is working it out currently in code, but you can go back to the
v1 to see where we started from on this thinking.

It looks like we end up with about 8 possible valid FSM states out of
the 16 combinatins of bits and your analysis looks about about right
on the arcs, but NDMA will be included.

We were also thinking to retain STOP. SAVING -> STOP could possibly
serve as the abort path to avoid a double action, and some of the use
cases you ID'd below are achievable if STOP remains.

> We have 20 possible transitions.  I've marked those available via the
> "odd ascii art" diagram as (a), that's 7 transitions.  We could
> essentially remove the NULL state as unreachable as there seems little
> value in the 2 transitions marked (a)* if we look only at migration,
> that would bring us down to 5 of 12 possible transitions.  We need to
> give userspace an abort path though, so we minimally need the 2
> transitions marked (b) (7/12).

> So now we can discuss the remaining 5 transitions:
> 
> {SAVING} -> {RESUMING}
> 	If not supported, user can achieve this via:
> 		{SAVING}->{RUNNING}->{RESUMING}
> 		{SAVING}-RESET->{RUNNING}->{RESUMING}

This can be:

SAVING -> STOP -> RESUMING

> 	It would likely be dis-recommended to return a device to
> 	{RUNNING} for this use case, so the latter would be preferred.

Yes, I would drop this, there is no advantage compared to manually
going to STOP, or through RESET.

> {SAVING} -> {RUNNING|SAVING}
> 	If not supported, user can achieve this via:
> 		{SAVING}->{RUNNING}->{RUNNING|SAVING}
> 
> 	Potential use case: downtime exceeded, avoid full migration
> 	restart (likely not achieved with the alternative flow).

I'm sympathetic to your use case, but this is not natural, or useful
for the snapshot/non-precopy style drivers to implement. Without
support for PRECOPY there is no functional advantage to return to
RUNNING|SAVING.

It is much better if qemu signals the far side to abort the migration,
move the local device to RUNNING and the far device through RESET ->
RESUMING to start all over again. This becomes common code for all the
snapshot drivers to rely on so we don't have to implement this
recovery logic with tricks in each driver inside the migration stream.

Shared code is better.

If someone does think this usecase is important they can add a cap
flag and implement this arc along with the qemu support/etc for that
driver. To be meaningful it would have to be along with a device that
implements PRECOPY with dirty tracking, and a streaming post copy.

So I would discard this arc, at least from the mandatory set.

> {RESUMING} -> {SAVING}
> 	If not supported:
> 		{RESUMING}->{RUNNING}->{SAVING}
> 		{RESUMING}-RESET->{RUNNING}->{SAVING}

The simplified version would be:

    RESUMING -> STOP -> SAVING

> 	Potential use case: validate migration data in = data out (also
> 	cannot be achieved with alternative flow, as device passes
> 	through RUNNING)

No sure mlx5 will guarantee idempotent migration data. It looks like
the Huawei device will be able to do this, and some of the other
simplish device types we are working on could possibly too.

So I'd drop this, STOP is good enough.

> {RESUMING} -> {RUNNING|SAVING}
> 	If not supported:
> 		{RESUMING}->{RUNNING}->{RUNNING|SAVING}
> 
> 	Potential use case: live ping-pong migration (alternative flow
> 	is likely sufficient)

STOP can be used here too, I would drop this since it is redundant.

> {RUNNING|SAVING} -> {RESUMING}
> 	If not supported:
> 		{RUNNING|SAVING}->{RUNNING}->{RESUMING}
> 		{RUNNING|SAVING}-RESET->{RUNNING}->{RESUMING}
> 	(again former is likely dis-recommended)
> 
> 	Potential use case: ???

Yep, discard.

> So what's the proposal?  Do we ditch all of these?  Some of these?  

Yes, all of them.

Including the NDMA states we add 2 new FSM states and 7 new arcs,
IIRC:

   SAVING | RUNNING -> SAVING | RUNNING | NDMA
   SAVING | RUNNING | NDMA -> SAVING
   SAVING | RUNNING | NDMA -> RUNNING
   SAVING | RUNNING | NDMA -> STOP

and

   RESUMING -> RUNNING | NDMA
   RUNNING | NDMA -> RUNNING
   RUNNING | NDMA -> STOP

> drivers follow the previously provided pseudo algorithm with the
> requirement that they cannot pass through an invalid state, we need to
> formally address whether the NULL state is invalid or just not
> reachable by the user.

What is a NULL state?

We have defined (from memory, forgive me I don't have access to
Yishai's latest code at the moment) 8 formal FSM states:

 RUNNING
 PRECOPY
 PRECOPY_NDMA
 STOP_COPY
 STOP
 RESUMING
 RESUMING_NDMA
 ERROR (perhaps MUST_RESET)

Which matches the state labels already given in the header comment.

Mapped onto the 4 device_state bits in an 'ABI compatible with current
qemu' way as the current header comment describes.

Then the list of allowed arcs between them close to what you have
suggested.

IMHO this substantially conforms to what is written down in the header
comment. Consistently using the state labels in code and documentation
then eliminating the named bits will conclude the change in
specification from bits to a FSM.

Yishai, we should also recast the cap discovery as some ioctl to query
directly if the driver supports an arc. Eg we can discover if NDMA is
supported by querying support for PRECOPY -> PRECOPY_NDMA, and if your
timeout use case is supported by querying support for STOP_COPY ->
PRECOPY.

Any future new states/arcs will cleanly fit into this discovery
scheme.

> > Huh? If the device indicated error during RESUMING userspace should
> > probably stop shoving packets into it or it will possibly corrupt the
> > migration stream.
> 
> If a {RESUMING}->{RUNNING} transition fails and the device remains in
> {RESUMING}, it should be valid for userspace to push data to it.  

IMHO this is not useful. If the userspace did RESUMING -> RUNNING then
that is 'end of stream' and continuning to push data is
nonsensical. It is one of the cases where design wise it is much
clearer to just say any exit from RESUMING either succeeds or
userspace needs to reset the device, eg ERROR.

Again, my perspective here is multi-device interoperability and I just
don't think we can rely on a consistent device behavior in such a
strange corner case.

> The vector table is directly accessible via the region mmap.  It
> previously was not, but that becomes a problem with 64k page sizes, and
> even some poorly designed devices on 4k systems when they don't honor
> the PCI spec recommended alignments.  

Yes, I forgot about that.

> But I think that's beside the point, if the user has vectors pointed
> at memory or other devices, they've rather already broken their
> contract for using the device.

Yes, so long as it doesn't allow to compromise the hypervisor
integrity.
 
> But I think you've identified two classes of DMA, MSI and everything
> else.  The device can assume that an MSI is special and not included in
> NDMA, but it can't know whether other arbitrary DMAs are p2p or memory.
> If we define that the minimum requirement for multi-device migration is
> to quiesce p2p DMA, ex. by not allowing it at all, then NDMA is
> actually significantly more restrictive while it's enabled.

You are right, but in any practical physical device NDMA will be
implemented by halting all DMAs, not just p2p ones.

I don't mind what we label this, so long as we understand that halting
all DMA is a valid device implementation.

Actually, having reflected on this now, one of the things on my list
to fix in iommufd, is that mdevs can get access to P2P pages at all.

This is currently buggy as-is because they cannot DMA map these
things, touch them with the CPU and kmap, or do, really, anything with
them.

So we should be blocking mdev's from accessing P2P, and in that case a
mdev driver can quite rightly say it doesn't support P2P at all and
safely NOP NDMA if NDMA is defined to only impact P2P transactions.

Perhaps that answers the question for the S390 drivers as well.

> Should a device in the ERROR state continue operation or be in a
> quiesced state?  It seems obvious to me that since the ERROR state is
> essentially undefined, the device should cease operations and be
> quiesced by the driver.  If the device is continuing to operate in the
> previous state, why would the driver place the device in the ERROR
> state?  It should have returned an errno and left the device in the
> previous state.

What we found while implementing is the use of ERROR arises when the
driver has been forced to do multiple actions and is unable to fully
unwind the state. So the device_state is not fully the original state
or fully the target state. Thus it is ERROR.

The additional requirement that the driver do another action to
quiesce the device only introduces the possiblity for triple failure.

Since it doesn't bring any value to userspace, I prefer we not define
things in this complicated way.

> > I prefer a model where the device is allowed to keep doing whatever it
> > was doing before it hit the error. You are pushing for a model where
> > upon error we must force the device to stop.
> 
> If the device continues operating in the previous mode then it
> shouldn't enter the ERROR state, it should return errno and re-reading
> device_state should indicate it's in the previous state.

Continues operating in the new/previous state is an 'upper bound' on
what it is allowed to do. For instance if we are going from RUNNING ->
SAVING mlx5 must transit through 'RUNNING|NDMA' as part of its
internal design.

If it then fails it can't continue to pretend it is RUNNING when it is
doing RUNNING|NDMA and a double failure means we can't restore
RUNNING.

Allowing ERROR to continue any behavior allowed by RUNNING allows the
device to be left in RUNNING|NDMA and eliminates the possiblity of
triple failure in a transition to 'STOP'.

Indeed we can simplify the driver by removing failure recoveries for
cases that have a double fault and simply go to ERROR. This is not so
viable if ERROR itself requires an action to enter it as we get back
to having to deal with double and triple faults.

Yishai said he should have something to look at next week. We'll take
a stab at rewording the docs you provided to reflect a FSM approach
too.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-06 15:42                 ` Jason Gunthorpe
@ 2022-01-07  0:00                   ` Tian, Kevin
  2022-01-07  0:29                     ` Jason Gunthorpe
  0 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2022-01-07  0:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, January 6, 2022 11:42 PM
> 
> On Thu, Jan 06, 2022 at 06:32:57AM +0000, Tian, Kevin wrote:
> 
> > Putting PRI aside the time to drain in-fly requests is undefined. It depends
> > on how many pending requests to be waited for before completing the
> > draining command on the device. This is IP specific (e.g. whether supports
> > preemption) and also guest specific (e.g. whether it's actively submitting
> > workload).
> 
> You are assuming a model where NDMA has to be implemented by pushing a
> command, but I would say that is very poor IP design.

I was not assuming a single model. I just wanted to figure out
how this model can be supported in this design, given I saw
many examples of it.

> 
> A device is fully in self-control of its own DMA and it should simply
> stop it quickly when doing NDMA.

simple on some classes, but definitely not so simple on others.

> 
> Devices that are poorly designed here will have very long migration
> downtime latencies and people simply won't want to use them.

Different usages have different latency requirement. Do we just want
people to decide whether to manage state for a device by measurement?
There is always difference between an experimental environment and
final production environment. A timeout mechanism is more robust 
as the last resort than breaking SLA in case of any surprise in the 
production environment.

> 
> > > > Whether the said DOS is a real concern and how severe it is are usage
> > > > specific things. Why would we want to hardcode such restriction on
> > > > an uAPI? Just give the choice to the admin (as long as this restriction is
> > > > clearly communicated to userspace clearly)...
> > >
> > > IMHO it is not just DOS, PRI can become dependent on IO which requires
> > > DMA to complete.
> > >
> > > You could quickly get yourself into a deadlock situation where the
> > > hypervisor has disabled DMA activities of other devices and the vPRI
> > > simply cannot be completed.
> >
> > How is it related to PRI which is only about address translation?
> 
> In something like SVA PRI can request a page which is not present and
> the OS has to do DMA to load the page back from storage to make it
> present and respond to the translation request.
> 
> The DMA is not related to the device doing the PRI in the first place,
> but if the hypervisor has blocked the DMA already for some other
> reason (perhaps that device is also doing PRI) then it all will
> deadlock.

yes, but with timeout the NDMA path doesn't care about whether
a PRI is not responded (due to hostile VM or such block-dma case). It
simply fails the state transition request when timeout is triggered.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-07  0:00                   ` Tian, Kevin
@ 2022-01-07  0:29                     ` Jason Gunthorpe
  2022-01-07  2:01                       ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2022-01-07  0:29 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

On Fri, Jan 07, 2022 at 12:00:13AM +0000, Tian, Kevin wrote:
> > Devices that are poorly designed here will have very long migration
> > downtime latencies and people simply won't want to use them.
> 
> Different usages have different latency requirement. Do we just want
> people to decide whether to manage state for a device by
> measurement?

It doesn't seem unreasonable to allow userspace to set max timer for
NDMA for SLA purposes on devices that have unbounded NDMA times. It
would probably be some new optional ioctl for devices that can
implement it.

However, this basically gives up on the idea that a VM can be migrated
as any migration can timeout and fail under this philosophy. I think
that is still very poor.

Optional migration really can't be sane path forward.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-07  0:29                     ` Jason Gunthorpe
@ 2022-01-07  2:01                       ` Tian, Kevin
  2022-01-07 17:23                         ` Jason Gunthorpe
  0 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2022-01-07  2:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, January 7, 2022 8:30 AM
> 
> On Fri, Jan 07, 2022 at 12:00:13AM +0000, Tian, Kevin wrote:
> > > Devices that are poorly designed here will have very long migration
> > > downtime latencies and people simply won't want to use them.
> >
> > Different usages have different latency requirement. Do we just want
> > people to decide whether to manage state for a device by
> > measurement?
> 
> It doesn't seem unreasonable to allow userspace to set max timer for
> NDMA for SLA purposes on devices that have unbounded NDMA times. It
> would probably be some new optional ioctl for devices that can
> implement it.

Yes, that's my point.

> 
> However, this basically gives up on the idea that a VM can be migrated
> as any migration can timeout and fail under this philosophy. I think
> that is still very poor.
> 
> Optional migration really can't be sane path forward.
> 

How is it different from the scenario where the guest generates a very
high dirty rate so the precopy phase can never converge to a pre-defined
threshold then abort the migration after certain timeout?

IMHO live migration is always a try-and-fail flavor. A previous migration
failure doesn't prevent the orchestration stack to retry at a later point.

In the meantime people do explore various optimizations to increase 
the success rate. Having the device to stop DMA quickly is one such
optimization from the hardware side. 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2021-12-09 23:34 [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state Alex Williamson
  2021-12-10  1:25 ` Jason Gunthorpe
  2021-12-20 17:38 ` Cornelia Huck
@ 2022-01-07  8:03 ` Tian, Kevin
  2022-01-07 16:36   ` Alex Williamson
  2 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2022-01-07  8:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

Hi, Alex,

Thanks for cleaning up this part, which is very helpful! 

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, December 10, 2021 7:34 AM
> 
> + *
> + *   The device_state field defines the following bitfield use:
> + *
> + *     - Bit 0 (RUNNING) [REQUIRED]:
> + *        - Setting this bit indicates the device is fully operational, the
> + *          device may generate interrupts, DMA, respond to MMIO, all vfio
> + *          device regions are functional, and the device may advance its
> + *          internal state.  The default device_state must indicate the device
> + *          in exclusively the RUNNING state, with no other bits in this field
> + *          set.
> + *        - Clearing this bit (ie. !RUNNING) must stop the operation of the
> + *          device.  The device must not generate interrupts, DMA, or advance
> + *          its internal state. 

I'm curious about what it means for the mediated device. I suppose this 
'must not' clause is from user p.o.v i.e. no event delivered to the user, 
no DMA to user memory and no user visible change on mdev state. Physically 
the device resource backing the mdev may still generate interrupt/DMA 
to the host according to the mediation policy.

Is this understanding correct?

> +*           The user should take steps to restrict access
> + *          to vfio device regions other than the migration region while the
> + *          device is !RUNNING or risk corruption of the device migration data
> + *          stream.  The device and kernel migration driver must accept and
> + *          respond to interaction to support external subsystems in the
> + *          !RUNNING state, for example PCI MSI-X and PCI config space.

and also respond to mmio access if some state is saved via reading mmio?

> + *          Failure by the user to restrict device access while !RUNNING must
> + *          not result in error conditions outside the user context (ex.
> + *          host system faults).
> + *     - Bit 1 (SAVING) [REQUIRED]:
> + *        - Setting this bit enables and initializes the migration region data
> + *          window and associated fields within vfio_device_migration_info for
> + *          capturing the migration data stream for the device.  The migration
> + *          driver may perform actions such as enabling dirty logging of device
> + *          state with this bit.  The SAVING bit is mutually exclusive with the
> + *          RESUMING bit defined below.
> + *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
> + *          data window and indicates the completion or termination of the
> + *          migration data stream for the device.
> + *     - Bit 2 (RESUMING) [REQUIRED]:
> + *        - Setting this bit enables and initializes the migration region data
> + *          window and associated fields within vfio_device_migration_info for
> + *          restoring the device from a migration data stream captured from a
> + *          SAVING session with a compatible device.  The migration driver may
> + *          perform internal device resets as necessary to reinitialize the
> + *          internal device state for the incoming migration data.
> + *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
> + *          region data window and indicates the end of a resuming session for
> + *          the device.  The kernel migration driver should complete the
> + *          incorporation of data written to the migration data window into the
> + *          device internal state and perform final validity and consistency
> + *          checking of the new device state.  If the user provided data is
> + *          found to be incomplete, inconsistent, or otherwise invalid, the
> + *          migration driver must indicate a write(2) error and follow the
> + *          previously described protocol to return either the previous state
> + *          or an error state.
> + *     - Bit 3 (NDMA) [OPTIONAL]:
> + *        The NDMA or "No DMA" state is intended to be a quiescent state for
> + *        the device for the purposes of managing multiple devices within a
> + *        user context where peer-to-peer DMA between devices may be active.

As discussed with Jason in another thread, this is also required for vPRI
when stopping DMA involves completing (instead of preempting) in-fly
requests then any vPRI for those requests must be completed when vcpu 
is running. This cannot be done in !RUNNING which is typically transitioned 
to after stopping vcpu.

It is also useful when the time of stopping device DMA is unbound (even
without vPRI). Having a failure path when vcpu is running avoids breaking 
SLA (if only capturing it after stopping vcpu). This further requires certain
interface for the user to specify a timeout value for entering NDMA, though
unclear to me what it will be now.

> + *        Support for the NDMA bit is indicated through the presence of the
> + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
> + *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
> + *        region.
> + *        - Setting this bit must prevent the device from initiating any
> + *          new DMA or interrupt transactions.  The migration driver must

Why also disabling interrupt? vcpu is still running at this point thus interrupt
could be triggered for many reasons other than DMA...

> + *          complete any such outstanding operations prior to completing
> + *          the transition to the NDMA state.  The NDMA device_state
> + *          essentially represents a sub-set of the !RUNNING state for the
> + *          purpose of quiescing the device, therefore the NDMA device_state
> + *          bit is superfluous in combinations including !RUNNING.

'superfluous' means doing so will get a failure, or just not recommended?

> + *        - Clearing this bit (ie. !NDMA) negates the device operational
> + *          restrictions required by the NDMA state.
> + *     - Bits [31:4]:
> + *        Reserved for future use, users should use read-modify-write
> + *        operations to the device_state field for manipulation of the above
> + *        defined bits for optimal compatibility.
> + *

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-07  8:03 ` Tian, Kevin
@ 2022-01-07 16:36   ` Alex Williamson
  2022-01-10  6:01     ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2022-01-07 16:36 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: jgg, cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Fri, 7 Jan 2022 08:03:57 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> Hi, Alex,
> 
> Thanks for cleaning up this part, which is very helpful! 
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, December 10, 2021 7:34 AM
> > 
> > + *
> > + *   The device_state field defines the following bitfield use:
> > + *
> > + *     - Bit 0 (RUNNING) [REQUIRED]:
> > + *        - Setting this bit indicates the device is fully operational, the
> > + *          device may generate interrupts, DMA, respond to MMIO, all vfio
> > + *          device regions are functional, and the device may advance its
> > + *          internal state.  The default device_state must indicate the device
> > + *          in exclusively the RUNNING state, with no other bits in this field
> > + *          set.
> > + *        - Clearing this bit (ie. !RUNNING) must stop the operation of the
> > + *          device.  The device must not generate interrupts, DMA, or advance
> > + *          its internal state.   
> 
> I'm curious about what it means for the mediated device. I suppose this 
> 'must not' clause is from user p.o.v i.e. no event delivered to the user, 
> no DMA to user memory and no user visible change on mdev state. Physically 
> the device resource backing the mdev may still generate interrupt/DMA 
> to the host according to the mediation policy.
> 
> Is this understanding correct?

Yes, one mediated device stopped running can't cause the backing
device to halt, it must continue performing activities for other child
devices as well as any host duties.  The user owned device should
effectively stop.

> > +*           The user should take steps to restrict access
> > + *          to vfio device regions other than the migration region while the
> > + *          device is !RUNNING or risk corruption of the device migration data
> > + *          stream.  The device and kernel migration driver must accept and
> > + *          respond to interaction to support external subsystems in the
> > + *          !RUNNING state, for example PCI MSI-X and PCI config space.  
> 
> and also respond to mmio access if some state is saved via reading mmio?

The device must not generate a host fault, ex. PCIe UR, but the idea
here is that the device stops and preventing further access is the
user's responsibility.  Failure to stop those accesses may result in
corrupting the migration data.

> > + *          Failure by the user to restrict device access while !RUNNING must
> > + *          not result in error conditions outside the user context (ex.
> > + *          host system faults).
> > + *     - Bit 1 (SAVING) [REQUIRED]:
> > + *        - Setting this bit enables and initializes the migration region data
> > + *          window and associated fields within vfio_device_migration_info for
> > + *          capturing the migration data stream for the device.  The migration
> > + *          driver may perform actions such as enabling dirty logging of device
> > + *          state with this bit.  The SAVING bit is mutually exclusive with the
> > + *          RESUMING bit defined below.
> > + *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
> > + *          data window and indicates the completion or termination of the
> > + *          migration data stream for the device.
> > + *     - Bit 2 (RESUMING) [REQUIRED]:
> > + *        - Setting this bit enables and initializes the migration region data
> > + *          window and associated fields within vfio_device_migration_info for
> > + *          restoring the device from a migration data stream captured from a
> > + *          SAVING session with a compatible device.  The migration driver may
> > + *          perform internal device resets as necessary to reinitialize the
> > + *          internal device state for the incoming migration data.
> > + *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
> > + *          region data window and indicates the end of a resuming session for
> > + *          the device.  The kernel migration driver should complete the
> > + *          incorporation of data written to the migration data window into the
> > + *          device internal state and perform final validity and consistency
> > + *          checking of the new device state.  If the user provided data is
> > + *          found to be incomplete, inconsistent, or otherwise invalid, the
> > + *          migration driver must indicate a write(2) error and follow the
> > + *          previously described protocol to return either the previous state
> > + *          or an error state.
> > + *     - Bit 3 (NDMA) [OPTIONAL]:
> > + *        The NDMA or "No DMA" state is intended to be a quiescent state for
> > + *        the device for the purposes of managing multiple devices within a
> > + *        user context where peer-to-peer DMA between devices may be active.  
> 
> As discussed with Jason in another thread, this is also required for vPRI
> when stopping DMA involves completing (instead of preempting) in-fly
> requests then any vPRI for those requests must be completed when vcpu 
> is running. This cannot be done in !RUNNING which is typically transitioned 
> to after stopping vcpu.
> 
> It is also useful when the time of stopping device DMA is unbound (even
> without vPRI). Having a failure path when vcpu is running avoids breaking 
> SLA (if only capturing it after stopping vcpu). This further requires certain
> interface for the user to specify a timeout value for entering NDMA, though
> unclear to me what it will be now.
> 
> > + *        Support for the NDMA bit is indicated through the presence of the
> > + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
> > + *        VFIO_DEVICE_GET_REGION_INFO for the associated device migration
> > + *        region.
> > + *        - Setting this bit must prevent the device from initiating any
> > + *          new DMA or interrupt transactions.  The migration driver must  
> 
> Why also disabling interrupt? vcpu is still running at this point thus interrupt
> could be triggered for many reasons other than DMA...

It's my understanding that the vCPU would be halted for the NDMA use
case, we can't very well have vCPUs demanding requests to devices that
are prevented from completing them.  The NDMA phase is intended to
support completion of outstanding requests without concurrently
accepting new requests, AIUI.

Further conversations in this thread allow for interrupts and deduce
that the primary requirement of NDMA is to restrict P2P DMA, which can
be approximated as all non-MSI DMA.

> > + *          complete any such outstanding operations prior to completing
> > + *          the transition to the NDMA state.  The NDMA device_state
> > + *          essentially represents a sub-set of the !RUNNING state for the
> > + *          purpose of quiescing the device, therefore the NDMA device_state
> > + *          bit is superfluous in combinations including !RUNNING.  
> 
> 'superfluous' means doing so will get a failure, or just not recommended?

Superfluous meaning redundant.  It's allowed, but DMA is already
restricted when !RUNNING, so setting NDMA when !RUNNING is meaningless.
 
> > + *        - Clearing this bit (ie. !NDMA) negates the device operational
> > + *          restrictions required by the NDMA state.
> > + *     - Bits [31:4]:
> > + *        Reserved for future use, users should use read-modify-write
> > + *        operations to the device_state field for manipulation of the above
> > + *        defined bits for optimal compatibility.
> > + *  

FWIW, I'm expecting to see an alternative uAPI propose using a FSM
machine in the near future, so while this clarifies what I believe is
the intention of the existing uAPI, it might be deprecated before we
bother to commit such clarifications.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-07  2:01                       ` Tian, Kevin
@ 2022-01-07 17:23                         ` Jason Gunthorpe
  2022-01-10  3:14                           ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2022-01-07 17:23 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

On Fri, Jan 07, 2022 at 02:01:55AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, January 7, 2022 8:30 AM
> > 
> > On Fri, Jan 07, 2022 at 12:00:13AM +0000, Tian, Kevin wrote:
> > > > Devices that are poorly designed here will have very long migration
> > > > downtime latencies and people simply won't want to use them.
> > >
> > > Different usages have different latency requirement. Do we just want
> > > people to decide whether to manage state for a device by
> > > measurement?
> > 
> > It doesn't seem unreasonable to allow userspace to set max timer for
> > NDMA for SLA purposes on devices that have unbounded NDMA times. It
> > would probably be some new optional ioctl for devices that can
> > implement it.
> 
> Yes, that's my point.
> 
> > 
> > However, this basically gives up on the idea that a VM can be migrated
> > as any migration can timeout and fail under this philosophy. I think
> > that is still very poor.
> > 
> > Optional migration really can't be sane path forward.
> > 
> 
> How is it different from the scenario where the guest generates a very
> high dirty rate so the precopy phase can never converge to a pre-defined
> threshold then abort the migration after certain timeout?

The hypervisor can halt the VCPU and put a stop to this and complete
the migration.

There is a difference between optional migration under a SLA and
mandatory migration with no SLA - I think both must be supported to be
sane.

> IMHO live migration is always a try-and-fail flavor. A previous migration
> failure doesn't prevent the orchestration stack to retry at a later point.

An operator might need to emergency migrate a VM without the
possibility for failure. For instance there is something wrong with
the base HW. SLA ignored, migration must be done.

IMHO it is completely wrong to view migration as optional, that is a
terrible standard to design HW to.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-07 17:23                         ` Jason Gunthorpe
@ 2022-01-10  3:14                           ` Tian, Kevin
  2022-01-10 17:52                             ` Jason Gunthorpe
  0 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2022-01-10  3:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, January 8, 2022 1:23 AM
> 
> On Fri, Jan 07, 2022 at 02:01:55AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, January 7, 2022 8:30 AM
> > >
> > > On Fri, Jan 07, 2022 at 12:00:13AM +0000, Tian, Kevin wrote:
> > > > > Devices that are poorly designed here will have very long migration
> > > > > downtime latencies and people simply won't want to use them.
> > > >
> > > > Different usages have different latency requirement. Do we just want
> > > > people to decide whether to manage state for a device by
> > > > measurement?
> > >
> > > It doesn't seem unreasonable to allow userspace to set max timer for
> > > NDMA for SLA purposes on devices that have unbounded NDMA times. It
> > > would probably be some new optional ioctl for devices that can
> > > implement it.
> >
> > Yes, that's my point.
> >
> > >
> > > However, this basically gives up on the idea that a VM can be migrated
> > > as any migration can timeout and fail under this philosophy. I think
> > > that is still very poor.
> > >
> > > Optional migration really can't be sane path forward.
> > >
> >
> > How is it different from the scenario where the guest generates a very
> > high dirty rate so the precopy phase can never converge to a pre-defined
> > threshold then abort the migration after certain timeout?
> 
> The hypervisor can halt the VCPU and put a stop to this and complete
> the migration.
> 
> There is a difference between optional migration under a SLA and
> mandatory migration with no SLA - I think both must be supported to be
> sane.
> 
> > IMHO live migration is always a try-and-fail flavor. A previous migration
> > failure doesn't prevent the orchestration stack to retry at a later point.
> 
> An operator might need to emergency migrate a VM without the
> possibility for failure. For instance there is something wrong with
> the base HW. SLA ignored, migration must be done.

How is it done today when no assigned device supports migration?
If any constraint is tolerable today, I don't see why supporting only
optional migration cannot be accepted  which removes some 
constraints while still bears a subset in those deployments. This can
be seen as an intermediate step in the transition path toward the 
perfect world where both optional and mandatory migration are 
supported.

> 
> IMHO it is completely wrong to view migration as optional, that is a
> terrible standard to design HW to.
> 

I don't want to argue 'wrong' or 'terrible' here since different person
certainly has different view based their own usages.

But based on the whole discussion I hope we are aligned on:

- It's necessary to support existing HW though it may only supports
  optional migration due to unbounded time of stopping DMA;

- We should influence IP designers to design HW to allow preempting
  in-fly requests and stop DMA quickly (also implying the capability of
  aborting/resuming in-fly PRI requests);

- Specific to the device state management uAPI, it should not assume
  a specific usage and instead allow the user to set a timeout value so
  transitioning to NDMA is failed if the operation cannot be completed
  within the specified timeout value. If the user doesn't set it, the 
  migration driver could conservatively use a default timeout value to
  gate any potentially unbounded operation.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-07 16:36   ` Alex Williamson
@ 2022-01-10  6:01     ` Tian, Kevin
  0 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2022-01-10  6:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, January 8, 2022 12:36 AM
> 
> On Fri, 7 Jan 2022 08:03:57 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > Hi, Alex,
> >
> > Thanks for cleaning up this part, which is very helpful!
> >
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, December 10, 2021 7:34 AM
> > >
> > > + *
> > > + *   The device_state field defines the following bitfield use:
> > > + *
> > > + *     - Bit 0 (RUNNING) [REQUIRED]:
> > > + *        - Setting this bit indicates the device is fully operational, the
> > > + *          device may generate interrupts, DMA, respond to MMIO, all vfio
> > > + *          device regions are functional, and the device may advance its
> > > + *          internal state.  The default device_state must indicate the device
> > > + *          in exclusively the RUNNING state, with no other bits in this field
> > > + *          set.
> > > + *        - Clearing this bit (ie. !RUNNING) must stop the operation of the
> > > + *          device.  The device must not generate interrupts, DMA, or
> advance
> > > + *          its internal state.
> >
> > I'm curious about what it means for the mediated device. I suppose this
> > 'must not' clause is from user p.o.v i.e. no event delivered to the user,
> > no DMA to user memory and no user visible change on mdev state.
> Physically
> > the device resource backing the mdev may still generate interrupt/DMA
> > to the host according to the mediation policy.
> >
> > Is this understanding correct?
> 
> Yes, one mediated device stopped running can't cause the backing
> device to halt, it must continue performing activities for other child
> devices as well as any host duties.  The user owned device should
> effectively stop.
> 
> > > +*           The user should take steps to restrict access
> > > + *          to vfio device regions other than the migration region while the
> > > + *          device is !RUNNING or risk corruption of the device migration
> data
> > > + *          stream.  The device and kernel migration driver must accept and
> > > + *          respond to interaction to support external subsystems in the
> > > + *          !RUNNING state, for example PCI MSI-X and PCI config space.
> >
> > and also respond to mmio access if some state is saved via reading mmio?
> 
> The device must not generate a host fault, ex. PCIe UR, but the idea
> here is that the device stops and preventing further access is the
> user's responsibility.  Failure to stop those accesses may result in
> corrupting the migration data.

ok. With this background I can understand what the last sentence tries
to say. It basically clarifies that while user access to the device is restricted
(by user itself) the kernel access (e.g. pci core, or the mediation driver 
itself) is allowed as long as doing so doesn't advance the to-be-saved state.

> 
> > > + *          Failure by the user to restrict device access while !RUNNING must
> > > + *          not result in error conditions outside the user context (ex.
> > > + *          host system faults).
> > > + *     - Bit 1 (SAVING) [REQUIRED]:
> > > + *        - Setting this bit enables and initializes the migration region data
> > > + *          window and associated fields within vfio_device_migration_info
> for
> > > + *          capturing the migration data stream for the device.  The
> migration
> > > + *          driver may perform actions such as enabling dirty logging of
> device
> > > + *          state with this bit.  The SAVING bit is mutually exclusive with the
> > > + *          RESUMING bit defined below.
> > > + *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
> > > + *          data window and indicates the completion or termination of the
> > > + *          migration data stream for the device.
> > > + *     - Bit 2 (RESUMING) [REQUIRED]:
> > > + *        - Setting this bit enables and initializes the migration region data
> > > + *          window and associated fields within vfio_device_migration_info
> for
> > > + *          restoring the device from a migration data stream captured from
> a
> > > + *          SAVING session with a compatible device.  The migration driver
> may
> > > + *          perform internal device resets as necessary to reinitialize the
> > > + *          internal device state for the incoming migration data.
> > > + *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
> > > + *          region data window and indicates the end of a resuming session
> for
> > > + *          the device.  The kernel migration driver should complete the
> > > + *          incorporation of data written to the migration data window into
> the
> > > + *          device internal state and perform final validity and consistency
> > > + *          checking of the new device state.  If the user provided data is
> > > + *          found to be incomplete, inconsistent, or otherwise invalid, the
> > > + *          migration driver must indicate a write(2) error and follow the
> > > + *          previously described protocol to return either the previous state
> > > + *          or an error state.
> > > + *     - Bit 3 (NDMA) [OPTIONAL]:
> > > + *        The NDMA or "No DMA" state is intended to be a quiescent state
> for
> > > + *        the device for the purposes of managing multiple devices within a
> > > + *        user context where peer-to-peer DMA between devices may be
> active.
> >
> > As discussed with Jason in another thread, this is also required for vPRI
> > when stopping DMA involves completing (instead of preempting) in-fly
> > requests then any vPRI for those requests must be completed when vcpu
> > is running. This cannot be done in !RUNNING which is typically transitioned
> > to after stopping vcpu.
> >
> > It is also useful when the time of stopping device DMA is unbound (even
> > without vPRI). Having a failure path when vcpu is running avoids breaking
> > SLA (if only capturing it after stopping vcpu). This further requires certain
> > interface for the user to specify a timeout value for entering NDMA, though
> > unclear to me what it will be now.
> >
> > > + *        Support for the NDMA bit is indicated through the presence of the
> > > + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
> > > + *        VFIO_DEVICE_GET_REGION_INFO for the associated device
> migration
> > > + *        region.
> > > + *        - Setting this bit must prevent the device from initiating any
> > > + *          new DMA or interrupt transactions.  The migration driver must
> >
> > Why also disabling interrupt? vcpu is still running at this point thus
> interrupt
> > could be triggered for many reasons other than DMA...
> 
> It's my understanding that the vCPU would be halted for the NDMA use
> case, we can't very well have vCPUs demanding requests to devices that
> are prevented from completing them.  The NDMA phase is intended to
> support completion of outstanding requests without concurrently
> accepting new requests, AIUI.

By current definition NDMA doesn't require the user to restrict access
to vfio device regions as for !RUNNING. So it's probably not good to tie it
with vcpu stopped.

As explained above it's also required when vPRI is enabled. At least for
current SVA-capable devices from Intel and Huawei they all require
waiting for vPRI completion before transitioning to NDMA can be done
while completing vPRI requires running vcpu.

New requests between NDMA and !RUNNING need to be mediated/
queued by the migration driver and then re-submitted on the 
destination node. This further requires certain mechanism to allow 
dynamically changing the mediation vs. passthru policy on the submit
portal.

> 
> Further conversations in this thread allow for interrupts and deduce
> that the primary requirement of NDMA is to restrict P2P DMA, which can
> be approximated as all non-MSI DMA.
> 
> > > + *          complete any such outstanding operations prior to completing
> > > + *          the transition to the NDMA state.  The NDMA device_state
> > > + *          essentially represents a sub-set of the !RUNNING state for the
> > > + *          purpose of quiescing the device, therefore the NDMA
> device_state
> > > + *          bit is superfluous in combinations including !RUNNING.
> >
> > 'superfluous' means doing so will get a failure, or just not recommended?
> 
> Superfluous meaning redundant.  It's allowed, but DMA is already
> restricted when !RUNNING, so setting NDMA when !RUNNING is meaningless.
> 
> > > + *        - Clearing this bit (ie. !NDMA) negates the device operational
> > > + *          restrictions required by the NDMA state.
> > > + *     - Bits [31:4]:
> > > + *        Reserved for future use, users should use read-modify-write
> > > + *        operations to the device_state field for manipulation of the above
> > > + *        defined bits for optimal compatibility.
> > > + *
> 
> FWIW, I'm expecting to see an alternative uAPI propose using a FSM
> machine in the near future, so while this clarifies what I believe is
> the intention of the existing uAPI, it might be deprecated before we
> bother to commit such clarifications.  Thanks,
> 

got it.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-06 21:20             ` Jason Gunthorpe
@ 2022-01-10  7:55               ` Tian, Kevin
  2022-01-10 17:34                 ` Alex Williamson
  2022-01-10 18:11                 ` Jason Gunthorpe
  0 siblings, 2 replies; 35+ messages in thread
From: Tian, Kevin @ 2022-01-10  7:55 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, January 7, 2022 5:21 AM
> 
> We were also thinking to retain STOP. SAVING -> STOP could possibly
> serve as the abort path to avoid a double action, and some of the use
> cases you ID'd below are achievable if STOP remains.

what is the exact difference between a null state {} (implying !RUNNING)
and STOP in this context?

If they are different, who (user or driver) should conduct and when do 
we expect transitioning the device into a null state?

> 
> > We have 20 possible transitions.  I've marked those available via the
> > "odd ascii art" diagram as (a), that's 7 transitions.  We could
> > essentially remove the NULL state as unreachable as there seems little
> > value in the 2 transitions marked (a)* if we look only at migration,
> > that would bring us down to 5 of 12 possible transitions.  We need to
> > give userspace an abort path though, so we minimally need the 2
> > transitions marked (b) (7/12).
> 
> > So now we can discuss the remaining 5 transitions:
> >
> > {SAVING} -> {RESUMING}
> > 	If not supported, user can achieve this via:
> > 		{SAVING}->{RUNNING}->{RESUMING}
> > 		{SAVING}-RESET->{RUNNING}->{RESUMING}
> 
> This can be:
> 
> SAVING -> STOP -> RESUMING

From Alex's original description the default device state is RUNNING.
This supposed to be the initial state on the dest machine for the
device assigned to Qemu before Qemu resumes the device state.
Then how do we eliminate the RUNNING state in above flow? Who
makes STOP as the initial state on the dest node?

> > drivers follow the previously provided pseudo algorithm with the
> > requirement that they cannot pass through an invalid state, we need to
> > formally address whether the NULL state is invalid or just not
> > reachable by the user.
> 
> What is a NULL state?

Hah, seems I'm not the only one having this confusion. 😊

> 
> We have defined (from memory, forgive me I don't have access to
> Yishai's latest code at the moment) 8 formal FSM states:
> 
>  RUNNING
>  PRECOPY
>  PRECOPY_NDMA
>  STOP_COPY
>  STOP
>  RESUMING
>  RESUMING_NDMA
>  ERROR (perhaps MUST_RESET)

ERROR->SHUTDOWN? Usually a shutdown state implies reset required...

> 
> > But I think you've identified two classes of DMA, MSI and everything
> > else.  The device can assume that an MSI is special and not included in
> > NDMA, but it can't know whether other arbitrary DMAs are p2p or memory.
> > If we define that the minimum requirement for multi-device migration is
> > to quiesce p2p DMA, ex. by not allowing it at all, then NDMA is
> > actually significantly more restrictive while it's enabled.
> 
> You are right, but in any practical physical device NDMA will be
> implemented by halting all DMAs, not just p2p ones.
> 
> I don't mind what we label this, so long as we understand that halting
> all DMA is a valid device implementation.
> 
> Actually, having reflected on this now, one of the things on my list
> to fix in iommufd, is that mdevs can get access to P2P pages at all.
> 
> This is currently buggy as-is because they cannot DMA map these
> things, touch them with the CPU and kmap, or do, really, anything with
> them.

Can you elaborate why mdev cannot access p2p pages?

> 
> So we should be blocking mdev's from accessing P2P, and in that case a
> mdev driver can quite rightly say it doesn't support P2P at all and
> safely NOP NDMA if NDMA is defined to only impact P2P transactions.
> 
> Perhaps that answers the question for the S390 drivers as well.
> 
> > Should a device in the ERROR state continue operation or be in a
> > quiesced state?  It seems obvious to me that since the ERROR state is
> > essentially undefined, the device should cease operations and be
> > quiesced by the driver.  If the device is continuing to operate in the
> > previous state, why would the driver place the device in the ERROR
> > state?  It should have returned an errno and left the device in the
> > previous state.
> 
> What we found while implementing is the use of ERROR arises when the
> driver has been forced to do multiple actions and is unable to fully
> unwind the state. So the device_state is not fully the original state
> or fully the target state. Thus it is ERROR.
> 
> The additional requirement that the driver do another action to
> quiesce the device only introduces the possiblity for triple failure.
> 
> Since it doesn't bring any value to userspace, I prefer we not define
> things in this complicated way.

So ERROR is really about unrecoverable failures. If recoverable suppose
errno should have been returned and the device is still in the original
state. Is this understanding correct?

btw which errno indicates to the user that the device is back to the
original state or in the ERROR state? or want the user to always check
the device state upon any transition error?

> 
> > > I prefer a model where the device is allowed to keep doing whatever it
> > > was doing before it hit the error. You are pushing for a model where
> > > upon error we must force the device to stop.
> >
> > If the device continues operating in the previous mode then it
> > shouldn't enter the ERROR state, it should return errno and re-reading
> > device_state should indicate it's in the previous state.
> 
> Continues operating in the new/previous state is an 'upper bound' on
> what it is allowed to do. For instance if we are going from RUNNING ->
> SAVING mlx5 must transit through 'RUNNING|NDMA' as part of its
> internal design.
> 
> If it then fails it can't continue to pretend it is RUNNING when it is
> doing RUNNING|NDMA and a double failure means we can't restore
> RUNNING.
> 
> Allowing ERROR to continue any behavior allowed by RUNNING allows the
> device to be left in RUNNING|NDMA and eliminates the possiblity of
> triple failure in a transition to 'STOP'.
> 
> Indeed we can simplify the driver by removing failure recoveries for
> cases that have a double fault and simply go to ERROR. This is not so
> viable if ERROR itself requires an action to enter it as we get back
> to having to deal with double and triple faults.

Or have a way for the user to specify the policy when entering ERROR,
e.g. asking the migration driver to conduct an internal reset if undefined
behavior in this state is a concern and the user doesn't want to further
diagnose the device context?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-10  7:55               ` Tian, Kevin
@ 2022-01-10 17:34                 ` Alex Williamson
  2022-01-11  2:41                   ` Tian, Kevin
  2022-01-10 18:11                 ` Jason Gunthorpe
  1 sibling, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2022-01-10 17:34 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Mon, 10 Jan 2022 07:55:16 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, January 7, 2022 5:21 AM
> > 
> > We were also thinking to retain STOP. SAVING -> STOP could possibly
> > serve as the abort path to avoid a double action, and some of the use
> > cases you ID'd below are achievable if STOP remains.  
> 
> what is the exact difference between a null state {} (implying !RUNNING)
> and STOP in this context?
> 
> If they are different, who (user or driver) should conduct and when do 
> we expect transitioning the device into a null state?

Sorry if I added confusion here, the null, ie. {}, state fit my
notation better.  The null state is simply no bits set in device_state,
it's equivalent to "STOP" without coming up with a new name for every
set of bit combinations.

> > > We have 20 possible transitions.  I've marked those available via the
> > > "odd ascii art" diagram as (a), that's 7 transitions.  We could
> > > essentially remove the NULL state as unreachable as there seems little
> > > value in the 2 transitions marked (a)* if we look only at migration,
> > > that would bring us down to 5 of 12 possible transitions.  We need to
> > > give userspace an abort path though, so we minimally need the 2
> > > transitions marked (b) (7/12).  
> >   
> > > So now we can discuss the remaining 5 transitions:
> > >
> > > {SAVING} -> {RESUMING}
> > > 	If not supported, user can achieve this via:
> > > 		{SAVING}->{RUNNING}->{RESUMING}
> > > 		{SAVING}-RESET->{RUNNING}->{RESUMING}  
> > 
> > This can be:
> > 
> > SAVING -> STOP -> RESUMING  
> 
> From Alex's original description the default device state is RUNNING.
> This supposed to be the initial state on the dest machine for the
> device assigned to Qemu before Qemu resumes the device state.
> Then how do we eliminate the RUNNING state in above flow? Who
> makes STOP as the initial state on the dest node?

The device must be RUNNING by default.  This is a requirement that
introduction of migration support for a device cannot break
compatibility with existing userspace that may no support migration
features.  It would be QEMU's responsibility to transition a migration
target device from the default state to a state to accept a migration.
There's no discussion here of eliminating the {RUNNING}->{RESUMING}
transition.
 
> > > drivers follow the previously provided pseudo algorithm with the
> > > requirement that they cannot pass through an invalid state, we need to
> > > formally address whether the NULL state is invalid or just not
> > > reachable by the user.  
> > 
> > What is a NULL state?  
> 
> Hah, seems I'm not the only one having this confusion. 😊

Sorry again, I thought it could easily be deduced by including it in
the state transitions.

> > We have defined (from memory, forgive me I don't have access to
> > Yishai's latest code at the moment) 8 formal FSM states:
> > 
> >  RUNNING
> >  PRECOPY
> >  PRECOPY_NDMA
> >  STOP_COPY
> >  STOP
> >  RESUMING
> >  RESUMING_NDMA
> >  ERROR (perhaps MUST_RESET)  
> 
> ERROR->SHUTDOWN? Usually a shutdown state implies reset required...

The userspace process can go away at any time, what exactly is a
"SHUTDOWN" state representing?
 
> > > But I think you've identified two classes of DMA, MSI and everything
> > > else.  The device can assume that an MSI is special and not included in
> > > NDMA, but it can't know whether other arbitrary DMAs are p2p or memory.
> > > If we define that the minimum requirement for multi-device migration is
> > > to quiesce p2p DMA, ex. by not allowing it at all, then NDMA is
> > > actually significantly more restrictive while it's enabled.  
> > 
> > You are right, but in any practical physical device NDMA will be
> > implemented by halting all DMAs, not just p2p ones.
> > 
> > I don't mind what we label this, so long as we understand that halting
> > all DMA is a valid device implementation.
> > 
> > Actually, having reflected on this now, one of the things on my list
> > to fix in iommufd, is that mdevs can get access to P2P pages at all.
> > 
> > This is currently buggy as-is because they cannot DMA map these
> > things, touch them with the CPU and kmap, or do, really, anything with
> > them.  
> 
> Can you elaborate why mdev cannot access p2p pages?
> 
> > 
> > So we should be blocking mdev's from accessing P2P, and in that case a
> > mdev driver can quite rightly say it doesn't support P2P at all and
> > safely NOP NDMA if NDMA is defined to only impact P2P transactions.
> > 
> > Perhaps that answers the question for the S390 drivers as well.
> >   
> > > Should a device in the ERROR state continue operation or be in a
> > > quiesced state?  It seems obvious to me that since the ERROR state is
> > > essentially undefined, the device should cease operations and be
> > > quiesced by the driver.  If the device is continuing to operate in the
> > > previous state, why would the driver place the device in the ERROR
> > > state?  It should have returned an errno and left the device in the
> > > previous state.  
> > 
> > What we found while implementing is the use of ERROR arises when the
> > driver has been forced to do multiple actions and is unable to fully
> > unwind the state. So the device_state is not fully the original state
> > or fully the target state. Thus it is ERROR.
> > 
> > The additional requirement that the driver do another action to
> > quiesce the device only introduces the possiblity for triple failure.
> > 
> > Since it doesn't bring any value to userspace, I prefer we not define
> > things in this complicated way.  
> 
> So ERROR is really about unrecoverable failures. If recoverable suppose
> errno should have been returned and the device is still in the original
> state. Is this understanding correct?

Yes, that's how I understand it.  The ERROR state should be used if the
driver can neither provide the requested new state nor remain/return to
the old state.
 
> btw which errno indicates to the user that the device is back to the
> original state or in the ERROR state? or want the user to always check
> the device state upon any transition error?

No such encoding is defined or required, per the existing uAPI the user
is to re-evaluate device_state on any non-zero errno.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-10  3:14                           ` Tian, Kevin
@ 2022-01-10 17:52                             ` Jason Gunthorpe
  2022-01-11  2:57                               ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2022-01-10 17:52 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

On Mon, Jan 10, 2022 at 03:14:44AM +0000, Tian, Kevin wrote:
> > An operator might need to emergency migrate a VM without the
> > possibility for failure. For instance there is something wrong with
> > the base HW. SLA ignored, migration must be done.
> 
> How is it done today when no assigned device supports migration?

That is different, the operator deliberately created a VM that is not
migratable. Operators may simply prefer to never do this.

You are talking about migration which is blockable by the guest -
outside of operator controll this is a totally different thing.

> - It's necessary to support existing HW though it may only supports
>   optional migration due to unbounded time of stopping DMA;

At a minimum a device with optional migration needs to be reported to
userspace and qemu should not blindly adopt it without some opt-in
IMHO.
 
> - We should influence IP designers to design HW to allow preempting
>   in-fly requests and stop DMA quickly (also implying the capability of
>   aborting/resuming in-fly PRI requests);

Yes, I think we need a way to suspend the device then abort its PRIs
with some error. The ATS cache is not something that is migrated so
this seems reasonable.

The only sketchy bit looks like how to resync the VM that still would
have a PRI in its queue and would still want to answer it. That answer
would have to be discarded..

> - Specific to the device state management uAPI, it should not assume
>   a specific usage and instead allow the user to set a timeout value so
>   transitioning to NDMA is failed if the operation cannot be completed
>   within the specified timeout value. If the user doesn't set it, the 
>   migration driver could conservatively use a default timeout value to
>   gate any potentially unbounded operation.

This would need to go along with the flag above, as only optional
drivers should have something like this.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-10  7:55               ` Tian, Kevin
  2022-01-10 17:34                 ` Alex Williamson
@ 2022-01-10 18:11                 ` Jason Gunthorpe
  2022-01-11  3:14                   ` Tian, Kevin
  1 sibling, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2022-01-10 18:11 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Mon, Jan 10, 2022 at 07:55:16AM +0000, Tian, Kevin wrote:

> > > {SAVING} -> {RESUMING}
> > > 	If not supported, user can achieve this via:
> > > 		{SAVING}->{RUNNING}->{RESUMING}
> > > 		{SAVING}-RESET->{RUNNING}->{RESUMING}
> > 
> > This can be:
> > 
> > SAVING -> STOP -> RESUMING
> 
> From Alex's original description the default device state is RUNNING.
> This supposed to be the initial state on the dest machine for the
> device assigned to Qemu before Qemu resumes the device state.
> Then how do we eliminate the RUNNING state in above flow? Who
> makes STOP as the initial state on the dest node?

All of this notation should be read with the idea that the
device_state is already somehow moved away from RESET. Ie the above
notation is about what is possible once qemu has already moved the
device to SAVING.

> > This is currently buggy as-is because they cannot DMA map these
> > things, touch them with the CPU and kmap, or do, really, anything with
> > them.
> 
> Can you elaborate why mdev cannot access p2p pages?

It is just a failure of APIs in the kernel. A p2p page has no 'struct
page' so it cannot be used in a scatter list, and thus cannot be used in
dma_map_sg.

It also cannot be kmap'd, or memcpy'd from.

So, essentially, everything that a current mdev drivers try to do will
crash with a non-struct page pfn.

In principle this could all be made to work someday, but it doesn't
work now.

What I want to do is make these APIs correctly use struct page and
block all non-struct page memory from getting into them.

> > Since it doesn't bring any value to userspace, I prefer we not define
> > things in this complicated way.
> 
> So ERROR is really about unrecoverable failures. If recoverable suppose
> errno should have been returned and the device is still in the original
> state. Is this understanding correct?

Yes
 
> btw which errno indicates to the user that the device is back to the
> original state or in the ERROR state? or want the user to always check
> the device state upon any transition error?

IMHO it is a failing of the API that this cannot be reported back. The
fact that the system call became on-directional is a side effect of
abusing the migration region like this.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-10 17:34                 ` Alex Williamson
@ 2022-01-11  2:41                   ` Tian, Kevin
  0 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2022-01-11  2:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, January 11, 2022 1:34 AM
> 
> On Mon, 10 Jan 2022 07:55:16 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, January 7, 2022 5:21 AM
> > >
> > > We were also thinking to retain STOP. SAVING -> STOP could possibly
> > > serve as the abort path to avoid a double action, and some of the use
> > > cases you ID'd below are achievable if STOP remains.
> >
> > what is the exact difference between a null state {} (implying !RUNNING)
> > and STOP in this context?
> >
> > If they are different, who (user or driver) should conduct and when do
> > we expect transitioning the device into a null state?
> 
> Sorry if I added confusion here, the null, ie. {}, state fit my
> notation better.  The null state is simply no bits set in device_state,
> it's equivalent to "STOP" without coming up with a new name for every
> set of bit combinations.

This matches my thought. Earlier reading this thread I was left with
the impression that Jason means 'STOP' as a new state.

> 
> > > > We have 20 possible transitions.  I've marked those available via the
> > > > "odd ascii art" diagram as (a), that's 7 transitions.  We could
> > > > essentially remove the NULL state as unreachable as there seems little
> > > > value in the 2 transitions marked (a)* if we look only at migration,
> > > > that would bring us down to 5 of 12 possible transitions.  We need to
> > > > give userspace an abort path though, so we minimally need the 2
> > > > transitions marked (b) (7/12).
> > >
> > > > So now we can discuss the remaining 5 transitions:
> > > >
> > > > {SAVING} -> {RESUMING}
> > > > 	If not supported, user can achieve this via:
> > > > 		{SAVING}->{RUNNING}->{RESUMING}
> > > > 		{SAVING}-RESET->{RUNNING}->{RESUMING}
> > >
> > > This can be:
> > >
> > > SAVING -> STOP -> RESUMING
> >
> > From Alex's original description the default device state is RUNNING.
> > This supposed to be the initial state on the dest machine for the
> > device assigned to Qemu before Qemu resumes the device state.
> > Then how do we eliminate the RUNNING state in above flow? Who
> > makes STOP as the initial state on the dest node?
> 
> The device must be RUNNING by default.  This is a requirement that
> introduction of migration support for a device cannot break
> compatibility with existing userspace that may no support migration
> features.  It would be QEMU's responsibility to transition a migration
> target device from the default state to a state to accept a migration.
> There's no discussion here of eliminating the {RUNNING}->{RESUMING}
> transition.

Then having STOP in the flow is meaningless. It actually means:

{SAVING}-RESET->{RUNNING}->{STOP}->{RESUMING}

> 
> > > > drivers follow the previously provided pseudo algorithm with the
> > > > requirement that they cannot pass through an invalid state, we need to
> > > > formally address whether the NULL state is invalid or just not
> > > > reachable by the user.
> > >
> > > What is a NULL state?
> >
> > Hah, seems I'm not the only one having this confusion. 😊
> 
> Sorry again, I thought it could easily be deduced by including it in
> the state transitions.
> 
> > > We have defined (from memory, forgive me I don't have access to
> > > Yishai's latest code at the moment) 8 formal FSM states:
> > >
> > >  RUNNING
> > >  PRECOPY
> > >  PRECOPY_NDMA
> > >  STOP_COPY
> > >  STOP
> > >  RESUMING
> > >  RESUMING_NDMA
> > >  ERROR (perhaps MUST_RESET)
> >
> > ERROR->SHUTDOWN? Usually a shutdown state implies reset required...
> 
> The userspace process can go away at any time, what exactly is a
> "SHUTDOWN" state representing?

I suppose closing fd for previous process transitions the device state
to NULL or STOP state from whatever current state is. Then when 
another process opens the same device its state becomes RUNNING.

All other states including SHUTDOWN only have effect when the
device fd is opened.

> 
> > > > But I think you've identified two classes of DMA, MSI and everything
> > > > else.  The device can assume that an MSI is special and not included in
> > > > NDMA, but it can't know whether other arbitrary DMAs are p2p or
> memory.
> > > > If we define that the minimum requirement for multi-device migration
> is
> > > > to quiesce p2p DMA, ex. by not allowing it at all, then NDMA is
> > > > actually significantly more restrictive while it's enabled.
> > >
> > > You are right, but in any practical physical device NDMA will be
> > > implemented by halting all DMAs, not just p2p ones.
> > >
> > > I don't mind what we label this, so long as we understand that halting
> > > all DMA is a valid device implementation.
> > >
> > > Actually, having reflected on this now, one of the things on my list
> > > to fix in iommufd, is that mdevs can get access to P2P pages at all.
> > >
> > > This is currently buggy as-is because they cannot DMA map these
> > > things, touch them with the CPU and kmap, or do, really, anything with
> > > them.
> >
> > Can you elaborate why mdev cannot access p2p pages?
> >
> > >
> > > So we should be blocking mdev's from accessing P2P, and in that case a
> > > mdev driver can quite rightly say it doesn't support P2P at all and
> > > safely NOP NDMA if NDMA is defined to only impact P2P transactions.
> > >
> > > Perhaps that answers the question for the S390 drivers as well.
> > >
> > > > Should a device in the ERROR state continue operation or be in a
> > > > quiesced state?  It seems obvious to me that since the ERROR state is
> > > > essentially undefined, the device should cease operations and be
> > > > quiesced by the driver.  If the device is continuing to operate in the
> > > > previous state, why would the driver place the device in the ERROR
> > > > state?  It should have returned an errno and left the device in the
> > > > previous state.
> > >
> > > What we found while implementing is the use of ERROR arises when the
> > > driver has been forced to do multiple actions and is unable to fully
> > > unwind the state. So the device_state is not fully the original state
> > > or fully the target state. Thus it is ERROR.
> > >
> > > The additional requirement that the driver do another action to
> > > quiesce the device only introduces the possiblity for triple failure.
> > >
> > > Since it doesn't bring any value to userspace, I prefer we not define
> > > things in this complicated way.
> >
> > So ERROR is really about unrecoverable failures. If recoverable suppose
> > errno should have been returned and the device is still in the original
> > state. Is this understanding correct?
> 
> Yes, that's how I understand it.  The ERROR state should be used if the
> driver can neither provide the requested new state nor remain/return to
> the old state.
> 
> > btw which errno indicates to the user that the device is back to the
> > original state or in the ERROR state? or want the user to always check
> > the device state upon any transition error?
> 
> No such encoding is defined or required, per the existing uAPI the user
> is to re-evaluate device_state on any non-zero errno.  Thanks,
> 

Makes sense.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-10 17:52                             ` Jason Gunthorpe
@ 2022-01-11  2:57                               ` Tian, Kevin
  0 siblings, 0 replies; 35+ messages in thread
From: Tian, Kevin @ 2022-01-11  2:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman,
	mjrosato, pasic, Lu, Baolu

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, January 11, 2022 1:53 AM
> 
> > - It's necessary to support existing HW though it may only supports
> >   optional migration due to unbounded time of stopping DMA;
> 
> At a minimum a device with optional migration needs to be reported to
> userspace and qemu should not blindly adopt it without some opt-in
> IMHO.

yes

> 
> > - We should influence IP designers to design HW to allow preempting
> >   in-fly requests and stop DMA quickly (also implying the capability of
> >   aborting/resuming in-fly PRI requests);
> 
> Yes, I think we need a way to suspend the device then abort its PRIs
> with some error. The ATS cache is not something that is migrated so
> this seems reasonable.

yes, any cache can be abandoned in the migration process. and such
error must be recoverable otherwise it would break the guest functionality.

The key is that PRI can happen at any time which implies that the device 
must be able to preempt and abort a transaction in very fine-grained
granularity. There might be IP-specific factors affecting the final 
preemption granularity (e.g. complex GPUs will certainly have more
challenges), which is what we need involve IP designers to fully understand.

> 
> The only sketchy bit looks like how to resync the VM that still would
> have a PRI in its queue and would still want to answer it. That answer
> would have to be discarded..

Yes. This also suggests that iommu core needs allow the driver to cancel
all previously-queued PRI requests for its device so the guest answer to
that device can be ignored by the iommu core.

> 
> > - Specific to the device state management uAPI, it should not assume
> >   a specific usage and instead allow the user to set a timeout value so
> >   transitioning to NDMA is failed if the operation cannot be completed
> >   within the specified timeout value. If the user doesn't set it, the
> >   migration driver could conservatively use a default timeout value to
> >   gate any potentially unbounded operation.
> 
> This would need to go along with the flag above, as only optional
> drivers should have something like this.
> 

Agree.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-10 18:11                 ` Jason Gunthorpe
@ 2022-01-11  3:14                   ` Tian, Kevin
  2022-01-11 18:19                     ` Jason Gunthorpe
  0 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2022-01-11  3:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, January 11, 2022 2:12 AM
> 
> On Mon, Jan 10, 2022 at 07:55:16AM +0000, Tian, Kevin wrote:
> 
> > > > {SAVING} -> {RESUMING}
> > > > 	If not supported, user can achieve this via:
> > > > 		{SAVING}->{RUNNING}->{RESUMING}
> > > > 		{SAVING}-RESET->{RUNNING}->{RESUMING}
> > >
> > > This can be:
> > >
> > > SAVING -> STOP -> RESUMING
> >
> > From Alex's original description the default device state is RUNNING.
> > This supposed to be the initial state on the dest machine for the
> > device assigned to Qemu before Qemu resumes the device state.
> > Then how do we eliminate the RUNNING state in above flow? Who
> > makes STOP as the initial state on the dest node?
> 
> All of this notation should be read with the idea that the
> device_state is already somehow moved away from RESET. Ie the above
> notation is about what is possible once qemu has already moved the
> device to SAVING.

Qemu moves the device to SAVING on the src node.

On the dest the device is in RUNNING (after reset) which can be directly
transitioned to RESUMING. I didn't see the point of adding a STOP here.

> 
> > > This is currently buggy as-is because they cannot DMA map these
> > > things, touch them with the CPU and kmap, or do, really, anything with
> > > them.
> >
> > Can you elaborate why mdev cannot access p2p pages?
> 
> It is just a failure of APIs in the kernel. A p2p page has no 'struct
> page' so it cannot be used in a scatter list, and thus cannot be used in
> dma_map_sg.
> 
> It also cannot be kmap'd, or memcpy'd from.
> 
> So, essentially, everything that a current mdev drivers try to do will
> crash with a non-struct page pfn.
> 
> In principle this could all be made to work someday, but it doesn't
> work now.
> 
> What I want to do is make these APIs correctly use struct page and
> block all non-struct page memory from getting into them.
> 

I see. 

It does make sense for the current sw mdev drivers.

Later when supporting hw mdev (with pasid granular isolation in
iommu), this restriction can be uplifted as it doesn't use dma api
and is pretty much like a pdev regarding to ioas management.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
  2022-01-11  3:14                   ` Tian, Kevin
@ 2022-01-11 18:19                     ` Jason Gunthorpe
  0 siblings, 0 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2022-01-11 18:19 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, cohuck, corbet, kvm, linux-doc, farman, mjrosato, pasic

On Tue, Jan 11, 2022 at 03:14:04AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, January 11, 2022 2:12 AM
> > 
> > On Mon, Jan 10, 2022 at 07:55:16AM +0000, Tian, Kevin wrote:
> > 
> > > > > {SAVING} -> {RESUMING}
> > > > > 	If not supported, user can achieve this via:
> > > > > 		{SAVING}->{RUNNING}->{RESUMING}
> > > > > 		{SAVING}-RESET->{RUNNING}->{RESUMING}
> > > >
> > > > This can be:
> > > >
> > > > SAVING -> STOP -> RESUMING
> > >
> > > From Alex's original description the default device state is RUNNING.
> > > This supposed to be the initial state on the dest machine for the
> > > device assigned to Qemu before Qemu resumes the device state.
> > > Then how do we eliminate the RUNNING state in above flow? Who
> > > makes STOP as the initial state on the dest node?
> > 
> > All of this notation should be read with the idea that the
> > device_state is already somehow moved away from RESET. Ie the above
> > notation is about what is possible once qemu has already moved the
> > device to SAVING.
> 
> Qemu moves the device to SAVING on the src node.
> 
> On the dest the device is in RUNNING (after reset) which can be directly
> transitioned to RESUMING. I didn't see the point of adding a STOP here.

Alex is talking about the same node case where qemu has put the device
into SAVING and then, for whatever reason, decides it now wants the
device to be in RESUMING.

We are talking about the state space of commands the driver has to
process here. If we can break down things like SAVING -> RESUMING into
two commands:

 SAVING -> STOP
 STOP -> RESUMING

Then the driver has to implement fewer arcs, and the arcs it does
implement are much simpler.

It also resolves the precedence question nicely as we have a core FSM
that is built on the arcs the drivers implement and that in turn gives
a natural answer to the question of how do you transit between any two
states.

Eg using the state names I gave earlier we can look at going from
RESUMING -> PRE_COPY_NDMA and decomposing it into these four steps:

  RESUMING -> STOP -> RUNNING -> PRE_COPY -> PRE_COPY_P2P

In the end the driver needs to implement only about half of the total
arcs and the ones it does need to implement are simpler and have a
more obvious implementation.

> Later when supporting hw mdev (with pasid granular isolation in
> iommu), this restriction can be uplifted as it doesn't use dma api
> and is pretty much like a pdev regarding to ioas management.

When I say 'mdev' I really mean things that use the vfio pinning
interface - which we don't quite have a proper name for yet (though
emulated iommu perhaps is sticking)

Things that use iommu_domain would not be a problem

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2022-01-11 18:19 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-09 23:34 [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state Alex Williamson
2021-12-10  1:25 ` Jason Gunthorpe
2021-12-13 20:40   ` Alex Williamson
2021-12-14 12:08     ` Cornelia Huck
2021-12-14 16:26     ` Jason Gunthorpe
2021-12-20 22:26       ` Alex Williamson
2022-01-04 20:28         ` Jason Gunthorpe
2022-01-06 18:17           ` Alex Williamson
2022-01-06 21:20             ` Jason Gunthorpe
2022-01-10  7:55               ` Tian, Kevin
2022-01-10 17:34                 ` Alex Williamson
2022-01-11  2:41                   ` Tian, Kevin
2022-01-10 18:11                 ` Jason Gunthorpe
2022-01-11  3:14                   ` Tian, Kevin
2022-01-11 18:19                     ` Jason Gunthorpe
2022-01-04  3:49       ` Tian, Kevin
2022-01-04 16:09         ` Jason Gunthorpe
2022-01-05  1:59           ` Tian, Kevin
2022-01-05 12:45             ` Jason Gunthorpe
2022-01-06  6:32               ` Tian, Kevin
2022-01-06 15:42                 ` Jason Gunthorpe
2022-01-07  0:00                   ` Tian, Kevin
2022-01-07  0:29                     ` Jason Gunthorpe
2022-01-07  2:01                       ` Tian, Kevin
2022-01-07 17:23                         ` Jason Gunthorpe
2022-01-10  3:14                           ` Tian, Kevin
2022-01-10 17:52                             ` Jason Gunthorpe
2022-01-11  2:57                               ` Tian, Kevin
2022-01-05  3:06           ` Tian, Kevin
2021-12-20 17:38 ` Cornelia Huck
2021-12-20 22:49   ` Alex Williamson
2021-12-21 11:24     ` Cornelia Huck
2022-01-07  8:03 ` Tian, Kevin
2022-01-07 16:36   ` Alex Williamson
2022-01-10  6:01     ` Tian, Kevin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).