All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC v2] vfio: Documentation for the migration region
@ 2021-11-29 14:45 Jason Gunthorpe
  2021-11-30 17:26 ` Alex Williamson
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2021-11-29 14:45 UTC (permalink / raw)
  To: Alex Williamson, Jonathan Corbet, linux-doc
  Cc: Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy,
	Shameer Kolothum, Yishai Hadas

Provide some more complete documentation for the migration regions
behavior, specifically focusing on the device_state bits and the whole
system view from a VMM.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 Documentation/driver-api/vfio.rst | 277 +++++++++++++++++++++++++++++-
 1 file changed, 276 insertions(+), 1 deletion(-)

Alex/Cornelia, here is the second draft of the requested documentation I promised

We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.

Our thinking is that NDMA would be implemented like this:

   +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)

And a .add_capability ops will be used to signal to userspace driver support:

   +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6

I've described DIRTY TRACKING as a seperate concept here. With the current
uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START, with our
change in direction this would be per-tracker control, but no semantic change.

Upon some agreement we'll include this patch in the next iteration of the mlx5
driver along with the NDMA bits.

v2:
 - RST fixups for sphinx rendering
 - Inclue the priority order for multi-bit-changes
 - Add a small discussion on devices like hns with migration control inside
   the same function as is being migrated.
 - Language cleanups from v1, the diff says almost every line was touched in some way
v1: https://lore.kernel.org/r/0-v1-0ec87874bede+123-vfio_mig_doc_jgg@nvidia.com

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index c663b6f978255b..d9be47570f878c 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -242,7 +242,282 @@ group and can access them as follows::
 VFIO User API
 -------------------------------------------------------------------------------
 
-Please see include/linux/vfio.h for complete API documentation.
+Please see include/uapi/linux/vfio.h for complete API documentation.
+
+-------------------------------------------------------------------------------
+
+VFIO migration driver API
+-------------------------------------------------------------------------------
+
+VFIO drivers that support migration implement a migration control register
+called device_state in the struct vfio_device_migration_info which is in its
+VFIO_REGION_TYPE_MIGRATION region.
+
+The device_state controls both device action and continuous behavior.
+Setting/clearing bit groups triggers device action, and each bit controls a
+continuous device behavior.
+
+Along with the device_state the migration driver provides a data window which
+allows streaming migration data into or out of the device.
+
+A lot of flexibility is provided to user-space in how it operates these
+bits. What follows is a reference flow for saving device state in a live
+migration, with all features, and an illustration how other external non-VFIO
+entities (VCPU_RUNNING and DIRTY_TRACKING) the VMM controls fit in.
+
+  RUNNING, VCPU_RUNNING
+     Normal operating state
+  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
+     Log DMAs
+
+     Stream all memory
+  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
+     Log internal device changes (pre-copy)
+
+     Stream device state through the migration window
+
+     While in this state repeat as desired:
+
+	Atomic Read and Clear DMA Dirty log
+
+	Stream dirty memory
+  SAVING | NDMA | RUNNING, VCPU_RUNNING
+     vIOMMU grace state
+
+     Complete all in progress IO page faults, idle the vIOMMU
+  SAVING | NDMA | RUNNING
+     Peer to Peer DMA grace state
+
+     Final snapshot of DMA dirty log (atomic not required)
+  SAVING
+     Stream final device state through the migration window
+
+     Copy final dirty data
+  0
+     Device is halted
+
+and the reference flow for resuming:
+
+  RUNNING
+     Issue VFIO_DEVICE_RESET to clear the internal device state
+  0
+     Device is halted
+  RESUMING
+     Push in migration data. Data captured during pre-copy should be
+     prepended to data captured during SAVING.
+  NDMA | RUNNING
+     Peer to Peer DMA grace state
+  RUNNING, VCPU_RUNNING
+     Normal operating state
+
+If the VMM has multiple VFIO devices undergoing migration then the grace
+states act as cross device synchronization points. The VMM must bring all
+devices to the grace state before advancing past it.
+
+The above reference flows are built around specific requirements on the
+migration driver for its implementation of the migration_state input.
+
+The migration_state cannot change asynchronously, upon writing the
+migration_state the driver will either keep the current state and return
+failure, return failure and go to ERROR, or succeed and go to the new state.
+
+Event triggered actions happen when user-space requests a new migration_state
+that differs from the current migration_state. Actions happen on a bit group
+basis:
+
+ SAVING | RUNNING
+   The device clears the data window and begins streaming 'pre copy' migration
+   data through the window. Devices that cannot log internal state changes
+   return a 0 length migration stream.
+
+ SAVING | !RUNNING
+   The device captures its internal state that is not covered by internal
+   logging, as well as any logged changes.
+
+   The device clears the data window and begins streaming the captured
+   migration data through the window. Devices that cannot log internal state
+   changes stream all their device state here.
+
+ RESUMING
+   The data window is cleared, opened, and can receive the migration data
+   stream.
+
+ !RESUMING
+   All the data transferred into the data window is loaded into the device's
+   internal state. The migration driver can rely on user-space issuing a
+   VFIO_DEVICE_RESET prior to starting RESUMING.
+
+   To abort a RESUMING issue a VFIO_DEVICE_RESET.
+
+   If the migration data is invalid then the ERROR state must be set.
+
+Continuous actions are in effect when migration_state bit groups are active:
+
+ RUNNING | NDMA
+   The device is not allowed to issue new DMA operations.
+
+   Whenever the kernel returns with a migration_state of NDMA there can be no
+   in progress DMAs.
+
+ !RUNNING
+   The device should not change its internal state. Further implies the NDMA
+   behavior above.
+
+ SAVING | !RUNNING
+   RESUMING | !RUNNING
+   The device may assume there are no incoming MMIO operations.
+
+   Internal state logging can stop.
+
+ RUNNING
+   The device can alter its internal state and must respond to incoming MMIO.
+
+ SAVING | RUNNING
+   The device is logging changes to the internal state.
+
+ ERROR
+   The behavior of the device is largely undefined. The device must be
+   recovered by issuing VFIO_DEVICE_RESET or closing the device file
+   descriptor.
+
+   However, devices supporting NDMA must behave as though NDMA is asserted
+   during ERROR to avoid corrupting other devices or a VM during a failed
+   migration.
+
+When multiple bits change in the migration_state they may describe multiple
+event triggered actions, and multiple changes to continuous actions.  The
+migration driver must process them in a priority order:
+
+ - SAVING | RUNNING
+ - NDMA
+ - !RUNNING
+ - SAVING | !RUNNING
+ - RESUMING
+ - !RESUMING
+ - RUNNING
+ - !NDMA
+
+In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
+device back to device_state RUNNING. When a migration driver executes this
+ioctl it should discard the data window and set migration_state to RUNNING as
+part of resetting the device to a clean state. This must happen even if the
+migration_state has errored. A freshly opened device FD should always be in
+the RUNNING state.
+
+The migration driver has limitations on what device state it can affect. Any
+device state controlled by general kernel subsystems must not be changed
+during RESUME, and SAVING must tolerate mutation of this state. Change to
+externally controlled device state can happen at any time, asynchronously, to
+the migration (ie interrupt rebalancing).
+
+Some examples of externally controlled state:
+ - MSI-X interrupt page
+ - MSI/legacy interrupt configuration
+ - Large parts of the PCI configuration space, ie common control bits
+ - PCI power management
+ - Changes via VFIO_DEVICE_SET_IRQS
+
+During !RUNNING, especially during SAVING and RESUMING, the device may have
+limitations on what it can tolerate. An ideal device will discard/return all
+ones to all incoming MMIO, PIO, or equivalent operations (exclusive of the
+external state above) in !RUNNING. However, devices are free to have undefined
+behavior if they receive incoming operations. This includes
+corrupting/aborting the migration, dirtying pages, and segfaulting user-space.
+
+However, a device may not compromise system integrity if it is subjected to a
+MMIO. It cannot trigger an error TLP, it cannot trigger an x86 Machine Check
+or similar, and it cannot compromise device isolation.
+
+There are several edge cases that user-space should keep in mind when
+implementing migration:
+
+- Device Peer to Peer DMA. In this case devices are able issue DMAs to each
+  other's MMIO regions. The VMM can permit this if it maps the MMIO memory into
+  the IOMMU.
+
+  As Peer to Peer DMA is a MMIO touch like any other, it is important that
+  userspace suspend these accesses before entering any device_state where MMIO
+  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
+  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
+  device does not support NDMA and rely on that to guarantee quiet MMIO.
+
+  The Peer to Peer Grace States exist so that all devices may reach RUNNING
+  before any device is subjected to a MMIO access.
+
+  Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to violate
+  the no-MMIO restriction during SAVING or RESUMING and corrupt the migration
+  on devices that cannot protect themselves.
+
+- IOMMU Page faults handled in user-space can occur at any time. A migration
+  driver is not required to serialize in-progress page faults. It can assume
+  that all page faults are completed before entering SAVING | !RUNNING. Since
+  the guest VCPU is required to complete page faults the VMM can accomplish
+  this by asserting NDMA | VCPU_RUNNING and clearing all pending page faults
+  before clearing VCPU_RUNNING.
+
+  Device that do not support NDMA cannot be configured to generate page faults
+  that require the VCPU to complete.
+
+- pre-copy allows the device to implement a dirty log for its internal state.
+  During the SAVING | RUNNING state the data window should present the device
+  state being logged and during SAVING | !RUNNING the data window should present
+  the unlogged device state as well as the changes from the internal dirty log.
+
+  On RESUME these two data streams are concatenated together.
+
+  pre-copy is only concerned with internal device state. External DMAs are
+  covered by the separate DIRTY_TRACKING function.
+
+- Atomic Read and Clear of the DMA log is a HW feature. If the tracker
+  cannot support this, then NDMA could be used to synthesize it less
+  efficiently.
+
+- NDMA is optional. If the device does not support this then the NDMA States
+  are pushed down to the next step in the sequence and various behaviors that
+  rely on NDMA cannot be used.
+
+- Migration control registers inside the same iommu_group as the VFIO device.
+  This immediately raises a security concern as user-space can use Peer to Peer
+  DMA to manipulate these migration control registers concurrently with
+  any kernel actions.
+
+  A device driver operating such a device must ensure that kernel integrity
+  cannot be broken by hostile user space operating the migration MMIO
+  registers via peer to peer, at any point in the sequence. Notably the kernel
+  cannot use DMA to transfer any migration data.
+
+  However, as discussed above in the "Device Peer to Peer DMA" section, it can
+  assume quiet MMIO as a condition to have a successful and uncorrupted
+  migration.
+
+To elaborate details on the reference flows, they assume the following details
+about the external behaviors:
+
+ !VCPU_RUNNING
+   User-space must not generate dirty pages or issue MMIO, PIO or equivalent
+   operations to devices.  For a VMM this would typically be controlled by
+   KVM.
+
+ DIRTY_TRACKING
+   Clear the DMA log and start DMA logging
+
+   DMA logs should be readable with an "atomic test and clear" to allow
+   continuous non-disruptive sampling of the log.
+
+   This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
+   fd.
+
+ !DIRTY_TRACKING
+   Freeze the DMA log, stop tracking and allow user-space to read it.
+
+   If user-space is going to have any use of the dirty log it must ensure that
+   all DMA is suspended before clearing DIRTY_TRACKING, for instance by using
+   NDMA or !RUNNING on all VFIO devices.
+
+
+TDB - discoverable feature flag for NDMA
+TDB IMS xlation
+TBD PASID xlation
 
 VFIO bus driver API
 -------------------------------------------------------------------------------

base-commit: ae0351a976d1880cf152de2bc680f1dff14d9049
-- 
2.34.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-11-29 14:45 [PATCH RFC v2] vfio: Documentation for the migration region Jason Gunthorpe
@ 2021-11-30 17:26 ` Alex Williamson
  2021-11-30 18:59   ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2021-11-30 17:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Mon, 29 Nov 2021 10:45:52 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> Provide some more complete documentation for the migration regions
> behavior, specifically focusing on the device_state bits and the whole
> system view from a VMM.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  Documentation/driver-api/vfio.rst | 277 +++++++++++++++++++++++++++++-
>  1 file changed, 276 insertions(+), 1 deletion(-)
> 
> Alex/Cornelia, here is the second draft of the requested documentation I promised
> 
> We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.
> 
> Our thinking is that NDMA would be implemented like this:
> 
>    +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)
> 
> And a .add_capability ops will be used to signal to userspace driver support:
> 
>    +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6

So based on this and the discussion later in the doc, NDMA is an
optional device feature, is this specifically to support HNS?  IIRC,
this is a simple queue based device, but is it the fact that the queue
lives in non-device memory that makes it such that the driver cannot
mediate queue entries and simply add them to the to migration stream?
Some discussion of this requirement would be useful in the doc,
otherwise it seems easier to deprecate the v1 migration region
sub-type, and increment to a v2 where NDMA is a required feature.

> I've described DIRTY TRACKING as a seperate concept here. With the current
> uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START, with our
> change in direction this would be per-tracker control, but no semantic change.
> 
> Upon some agreement we'll include this patch in the next iteration of the mlx5
> driver along with the NDMA bits.
> 
> v2:
>  - RST fixups for sphinx rendering
>  - Inclue the priority order for multi-bit-changes
>  - Add a small discussion on devices like hns with migration control inside
>    the same function as is being migrated.
>  - Language cleanups from v1, the diff says almost every line was touched in some way
> v1: https://lore.kernel.org/r/0-v1-0ec87874bede+123-vfio_mig_doc_jgg@nvidia.com
> 
> diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
> index c663b6f978255b..d9be47570f878c 100644
> --- a/Documentation/driver-api/vfio.rst
> +++ b/Documentation/driver-api/vfio.rst
> @@ -242,7 +242,282 @@ group and can access them as follows::
>  VFIO User API
>  -------------------------------------------------------------------------------
>  
> -Please see include/linux/vfio.h for complete API documentation.
> +Please see include/uapi/linux/vfio.h for complete API documentation.
> +
> +-------------------------------------------------------------------------------
> +
> +VFIO migration driver API
> +-------------------------------------------------------------------------------
> +
> +VFIO drivers that support migration implement a migration control register
> +called device_state in the struct vfio_device_migration_info which is in its
> +VFIO_REGION_TYPE_MIGRATION region.
> +
> +The device_state controls both device action and continuous behavior.
> +Setting/clearing bit groups triggers device action, and each bit controls a
> +continuous device behavior.
> +
> +Along with the device_state the migration driver provides a data window which
> +allows streaming migration data into or out of the device.
> +
> +A lot of flexibility is provided to user-space in how it operates these
> +bits. What follows is a reference flow for saving device state in a live
> +migration, with all features, and an illustration how other external non-VFIO
> +entities (VCPU_RUNNING and DIRTY_TRACKING) the VMM controls fit in.
> +
> +  RUNNING, VCPU_RUNNING
> +     Normal operating state
> +  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
> +     Log DMAs
> +
> +     Stream all memory
> +  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
> +     Log internal device changes (pre-copy)
> +
> +     Stream device state through the migration window
> +
> +     While in this state repeat as desired:
> +
> +	Atomic Read and Clear DMA Dirty log
> +
> +	Stream dirty memory
> +  SAVING | NDMA | RUNNING, VCPU_RUNNING
> +     vIOMMU grace state
> +
> +     Complete all in progress IO page faults, idle the vIOMMU
> +  SAVING | NDMA | RUNNING
> +     Peer to Peer DMA grace state
> +
> +     Final snapshot of DMA dirty log (atomic not required)
> +  SAVING
> +     Stream final device state through the migration window
> +
> +     Copy final dirty data
> +  0
> +     Device is halted
> +
> +and the reference flow for resuming:
> +
> +  RUNNING
> +     Issue VFIO_DEVICE_RESET to clear the internal device state
> +  0
> +     Device is halted
> +  RESUMING
> +     Push in migration data. Data captured during pre-copy should be
> +     prepended to data captured during SAVING.
> +  NDMA | RUNNING
> +     Peer to Peer DMA grace state
> +  RUNNING, VCPU_RUNNING
> +     Normal operating state
> +

There are so many implicit concepts here, I'm not sure I'm making a
correct interpretation, let alone someone coming to this document for
understanding.

> +If the VMM has multiple VFIO devices undergoing migration then the grace
> +states act as cross device synchronization points. The VMM must bring all
> +devices to the grace state before advancing past it.
> +
> +The above reference flows are built around specific requirements on the
> +migration driver for its implementation of the migration_state input.

I can't glean any meaning from this sentence.  "device_state" here and
throughout?  We don't have a "migration_state".

> +The migration_state cannot change asynchronously, upon writing the
> +migration_state the driver will either keep the current state and return
> +failure, return failure and go to ERROR, or succeed and go to the new state.
> +
> +Event triggered actions happen when user-space requests a new migration_state
> +that differs from the current migration_state. Actions happen on a bit group
> +basis:
> +
> + SAVING | RUNNING
> +   The device clears the data window and begins streaming 'pre copy' migration
> +   data through the window. Devices that cannot log internal state changes
> +   return a 0 length migration stream.
> +
> + SAVING | !RUNNING
> +   The device captures its internal state that is not covered by internal
> +   logging, as well as any logged changes.
> +
> +   The device clears the data window and begins streaming the captured
> +   migration data through the window. Devices that cannot log internal state
> +   changes stream all their device state here.
> +
> + RESUMING
> +   The data window is cleared, opened, and can receive the migration data
> +   stream.
> +
> + !RESUMING
> +   All the data transferred into the data window is loaded into the device's
> +   internal state. The migration driver can rely on user-space issuing a
> +   VFIO_DEVICE_RESET prior to starting RESUMING.

We can't really rely on userspace to do anything, nor has this sequence
been part of the specified protocol.

As with the previous flows, it seems like there's a ton of implicit
knowledge here.  Why are we documenting these here rather than in the
uAPI header?  I'm having a difficult time trying to understand what are
proposals to modify the uAPI and what are interpretations of the
existing protocol.

> +
> +   To abort a RESUMING issue a VFIO_DEVICE_RESET.

Again, this is not explicitly part of the current protocol.  Any use of
VFIO_DEVICE_RESET should return the device to the default state, but a
user is free to try to use device_state to transition from RESUMING to
any other state.  The driver can choose to fail that transition and
even make use of the error device_state, but there's no expectation
that a VFIO_DEVICE_RESET is the first choice for userspace to abort a
resume session.

> +
> +   If the migration data is invalid then the ERROR state must be set.

This is the only place the uAPI defines intervention via the RESET
ioctl.

> +
> +Continuous actions are in effect when migration_state bit groups are active:
> +
> + RUNNING | NDMA
> +   The device is not allowed to issue new DMA operations.
> +
> +   Whenever the kernel returns with a migration_state of NDMA there can be no
> +   in progress DMAs.
> +

There are certainly event triggered actions based on setting NDMA as
well, ex. completion of outstanding DMA.

> + !RUNNING
> +   The device should not change its internal state. Further implies the NDMA
> +   behavior above.

Does this also imply other device regions cannot be accessed as has
previously been suggested?  Which?

> +
> + SAVING | !RUNNING
> +   RESUMING | !RUNNING
> +   The device may assume there are no incoming MMIO operations.
> +
> +   Internal state logging can stop.
> +
> + RUNNING
> +   The device can alter its internal state and must respond to incoming MMIO.
> +
> + SAVING | RUNNING
> +   The device is logging changes to the internal state.
> +
> + ERROR
> +   The behavior of the device is largely undefined. The device must be
> +   recovered by issuing VFIO_DEVICE_RESET or closing the device file
> +   descriptor.
> +
> +   However, devices supporting NDMA must behave as though NDMA is asserted
> +   during ERROR to avoid corrupting other devices or a VM during a failed
> +   migration.
> +
> +When multiple bits change in the migration_state they may describe multiple
> +event triggered actions, and multiple changes to continuous actions.  The
> +migration driver must process them in a priority order:
> +
> + - SAVING | RUNNING
> + - NDMA
> + - !RUNNING
> + - SAVING | !RUNNING
> + - RESUMING
> + - !RESUMING
> + - RUNNING
> + - !NDMA

Lots of deduction left to the reader...

> +
> +In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
> +device back to device_state RUNNING. When a migration driver executes this
> +ioctl it should discard the data window and set migration_state to RUNNING as
> +part of resetting the device to a clean state. This must happen even if the
> +migration_state has errored. A freshly opened device FD should always be in
> +the RUNNING state.
> +
> +The migration driver has limitations on what device state it can affect. Any
> +device state controlled by general kernel subsystems must not be changed
> +during RESUME, and SAVING must tolerate mutation of this state. Change to
> +externally controlled device state can happen at any time, asynchronously, to
> +the migration (ie interrupt rebalancing).
> +
> +Some examples of externally controlled state:
> + - MSI-X interrupt page

More specifically, vector table and pba, but probably also config
capability.

> + - MSI/legacy interrupt configuration
> + - Large parts of the PCI configuration space, ie common control bits
> + - PCI power management
> + - Changes via VFIO_DEVICE_SET_IRQS
> +
> +During !RUNNING, especially during SAVING and RESUMING, the device may have
> +limitations on what it can tolerate. An ideal device will discard/return all
> +ones to all incoming MMIO, PIO, or equivalent operations (exclusive of the
> +external state above) in !RUNNING. However, devices are free to have undefined
> +behavior if they receive incoming operations. This includes
> +corrupting/aborting the migration, dirtying pages, and segfaulting user-space.
> +
> +However, a device may not compromise system integrity if it is subjected to a
> +MMIO. It cannot trigger an error TLP, it cannot trigger an x86 Machine Check
> +or similar, and it cannot compromise device isolation.
> +
> +There are several edge cases that user-space should keep in mind when
> +implementing migration:
> +
> +- Device Peer to Peer DMA. In this case devices are able issue DMAs to each
> +  other's MMIO regions. The VMM can permit this if it maps the MMIO memory into
> +  the IOMMU.
> +
> +  As Peer to Peer DMA is a MMIO touch like any other, it is important that
> +  userspace suspend these accesses before entering any device_state where MMIO
> +  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
> +  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
> +  device does not support NDMA and rely on that to guarantee quiet MMIO.

Seems that would have its own set of consequences.

> +  The Peer to Peer Grace States exist so that all devices may reach RUNNING
> +  before any device is subjected to a MMIO access.
> +
> +  Failure to guarantee quiet MMIO may allow a hostile VM to use P2P to violate
> +  the no-MMIO restriction during SAVING or RESUMING and corrupt the migration
> +  on devices that cannot protect themselves.
> +
> +- IOMMU Page faults handled in user-space can occur at any time. A migration
> +  driver is not required to serialize in-progress page faults. It can assume
> +  that all page faults are completed before entering SAVING | !RUNNING. Since
> +  the guest VCPU is required to complete page faults the VMM can accomplish
> +  this by asserting NDMA | VCPU_RUNNING and clearing all pending page faults
> +  before clearing VCPU_RUNNING.
> +
> +  Device that do not support NDMA cannot be configured to generate page faults
> +  that require the VCPU to complete.

So the VMM is required to hide features like PRI based on NDMA support?

> +
> +- pre-copy allows the device to implement a dirty log for its internal state.
> +  During the SAVING | RUNNING state the data window should present the device
> +  state being logged and during SAVING | !RUNNING the data window should present
> +  the unlogged device state as well as the changes from the internal dirty log.
> +

This is getting a bit close to specifying an implementation.  The goal
of the pre-copy is to minimize downtime during the stop-and-copy,
therefore the specific data exposed in each phase depends on the usage
model and data size of the device.

> +  On RESUME these two data streams are concatenated together.
> +
> +  pre-copy is only concerned with internal device state. External DMAs are
> +  covered by the separate DIRTY_TRACKING function.
> +
> +- Atomic Read and Clear of the DMA log is a HW feature. If the tracker
> +  cannot support this, then NDMA could be used to synthesize it less
> +  efficiently.
> +
> +- NDMA is optional. If the device does not support this then the NDMA States
> +  are pushed down to the next step in the sequence and various behaviors that
> +  rely on NDMA cannot be used.
> +
> +- Migration control registers inside the same iommu_group as the VFIO device.

Not a complete sentence, is this meant as a topic header?

> +  This immediately raises a security concern as user-space can use Peer to Peer
> +  DMA to manipulate these migration control registers concurrently with
> +  any kernel actions.
> +

We haven't defined "migration control registers" beyond device_state,
which is software defined "register".  What physical registers that are
subject to p2p DMA is this actually referring to?

> +  A device driver operating such a device must ensure that kernel integrity
> +  cannot be broken by hostile user space operating the migration MMIO
> +  registers via peer to peer, at any point in the sequence. Notably the kernel
> +  cannot use DMA to transfer any migration data.
> +
> +  However, as discussed above in the "Device Peer to Peer DMA" section, it can
> +  assume quiet MMIO as a condition to have a successful and uncorrupted
> +  migration.
> +
> +To elaborate details on the reference flows, they assume the following details
> +about the external behaviors:
> +
> + !VCPU_RUNNING
> +   User-space must not generate dirty pages or issue MMIO, PIO or equivalent
> +   operations to devices.  For a VMM this would typically be controlled by
> +   KVM.
> +
> + DIRTY_TRACKING
> +   Clear the DMA log and start DMA logging
> +
> +   DMA logs should be readable with an "atomic test and clear" to allow
> +   continuous non-disruptive sampling of the log.
> +
> +   This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
> +   fd.
> +
> + !DIRTY_TRACKING
> +   Freeze the DMA log, stop tracking and allow user-space to read it.
> +
> +   If user-space is going to have any use of the dirty log it must ensure that
> +   all DMA is suspended before clearing DIRTY_TRACKING, for instance by using
> +   NDMA or !RUNNING on all VFIO devices.
> +
> +
> +TDB - discoverable feature flag for NDMA

Is the goal to release mlx5 support without the NDMA feature and add it
later?  Thanks,

Alex

> +TDB IMS xlation
> +TBD PASID xlation
>  
>  VFIO bus driver API
>  -------------------------------------------------------------------------------
> 
> base-commit: ae0351a976d1880cf152de2bc680f1dff14d9049


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-11-30 17:26 ` Alex Williamson
@ 2021-11-30 18:59   ` Jason Gunthorpe
  2021-11-30 22:35     ` Alex Williamson
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2021-11-30 18:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Nov 30, 2021 at 10:26:11AM -0700, Alex Williamson wrote:
> On Mon, 29 Nov 2021 10:45:52 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > Provide some more complete documentation for the migration regions
> > behavior, specifically focusing on the device_state bits and the whole
> > system view from a VMM.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >  Documentation/driver-api/vfio.rst | 277 +++++++++++++++++++++++++++++-
> >  1 file changed, 276 insertions(+), 1 deletion(-)
> > 
> > Alex/Cornelia, here is the second draft of the requested documentation I promised
> > 
> > We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.
> > 
> > Our thinking is that NDMA would be implemented like this:
> > 
> >    +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)
> > 
> > And a .add_capability ops will be used to signal to userspace driver support:
> > 
> >    +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6
> 
> So based on this and the discussion later in the doc, NDMA is an
> optional device feature, is this specifically to support HNS? 

Yes. It is not trivial to implement NDMA in a device, we already have
HNS as a public existing example that cannot do it, so it is a
permanent optional feature.

> IIRC, this is a simple queue based device, but is it the fact that
> the queue lives in non-device memory that makes it such that the
> driver cannot mediate queue entries and simply add them to the to
> migration stream?

From what HNS said the device driver would have to trap every MMIO to
implement NDMA as it must prevent touches to the physical HW MMIO to
maintain the NDMA state.

The issue is that the HW migration registers can stop processing the
queue and thus enter NDMA but a MMIO touch can resume queue
processing, so NDMA cannot be sustained.

Trapping every MMIO would have a huge negative performance impact.  So
it doesn't make sense to do so for a device that is not intended to be
used in any situation where NDMA is required.

> Some discussion of this requirement would be useful in the doc,
> otherwise it seems easier to deprecate the v1 migration region
> sub-type, and increment to a v2 where NDMA is a required feature.

I can add some words a the bottom, but since NDMA is a completely
transparent optional feature I don't see any reason to have v2.

> There are so many implicit concepts here, I'm not sure I'm making a
> correct interpretation, let alone someone coming to this document for
> understanding.

That isn't really helpful feedback..

The above is something like a summary to give context to the below
which provides a lot of detail to each step.

If you read the above/below together and find stuff is lacking, then
please point to it and let's add it

> > +If the VMM has multiple VFIO devices undergoing migration then the grace
> > +states act as cross device synchronization points. The VMM must bring all
> > +devices to the grace state before advancing past it.
> > +
> > +The above reference flows are built around specific requirements on the
> > +migration driver for its implementation of the migration_state input.
> 
> I can't glean any meaning from this sentence.  "device_state" here and
> throughout?  We don't have a "migration_state".

Yes, migration_state is a spello for device_state, I'll fix them all

> > + !RESUMING
> > +   All the data transferred into the data window is loaded into the device's
> > +   internal state. The migration driver can rely on user-space issuing a
> > +   VFIO_DEVICE_RESET prior to starting RESUMING.
> 
> We can't really rely on userspace to do anything, nor has this sequence
> been part of the specified protocol.

It is exsisting behavior of qemu - which is why we documented it.

Either qemu shouldn't do it as devices must fully self-reset, or we
should have it part of the canonical flow and devices may as well
expect it. It is useful because post VFIO_DEVICE_RESET all DMA is
quiet, no outstanding PRIs exist, etc etc.

> As with the previous flows, it seems like there's a ton of implicit
> knowledge here.  Why are we documenting these here rather than in the
> uAPI header?

Because this is 300 lines already and is too complicated/long to
properly live in a uapi header.

>  I'm having a difficult time trying to understand what are
> proposals to modify the uAPI and what are interpretations of the
> existing protocol.

As far as we know this describes what current qemu does in the success
path with a single VFIO device. ie we think mlx5 conforms to this spec
and we see it works as-is with qemu, up to qemu's limitations:

 - qemu has no support for errors or error recovery, it just calls
   abort()
 - qemu does not stress the device_state and only does a few
   transition patterns
 - qemu doesn't support P2P cases due to the NDMA topic
 - simple devices like HNS will work, but not robustly in the face of
   a hostile VM and multiple VFIO devices.

> > +   To abort a RESUMING issue a VFIO_DEVICE_RESET.
>
> Any use of VFIO_DEVICE_RESET should return the device to the default
> state, but a user is free to try to use device_state to transition
> from RESUMING to any other state.  

Userspace can attempt all transitions, of course. 

However, notice this spec doesn't specify what happens on non-success
RESUMING->RUNNING. So, I'm calling that undefined behavior.

As you say:

> The driver can choose to fail that transition and even make use of
> the error device_state, but there's no expectation that a

Thus, the above advice. To reliably abort RESUMING use DEVICE_RESET,
do not use ->RUNNING.

> > +Continuous actions are in effect when migration_state bit groups are active:
> > +
> > + RUNNING | NDMA
> > +   The device is not allowed to issue new DMA operations.
> > +
> > +   Whenever the kernel returns with a migration_state of NDMA there can be no
> > +   in progress DMAs.
> > +
> 
> There are certainly event triggered actions based on setting NDMA as
> well, ex. completion of outstanding DMA.

I'm leaving it implied that there is always some work required to
begin/end these continuous behaviors

> > + !RUNNING
> > +   The device should not change its internal state. Further implies the NDMA
> > +   behavior above.
> 
> Does this also imply other device regions cannot be accessed as has
> previously been suggested?  Which?

This question is discussed/answered below

> > + - SAVING | RUNNING
> > + - NDMA
> > + - !RUNNING
> > + - SAVING | !RUNNING
> > + - RESUMING
> > + - !RESUMING
> > + - RUNNING
> > + - !NDMA
> 
> Lots of deduction left to the reader...

Do you have an alternative language? This is quite complicated, I
advise people to refer to mlx5's implementation.

> > +  As Peer to Peer DMA is a MMIO touch like any other, it is important that
> > +  userspace suspend these accesses before entering any device_state where MMIO
> > +  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
> > +  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
> > +  device does not support NDMA and rely on that to guarantee quiet MMIO.
> 
> Seems that would have its own set of consequences.

Sure, userspace has to make choices here.

> > +  Device that do not support NDMA cannot be configured to generate page faults
> > +  that require the VCPU to complete.
> 
> So the VMM is required to hide features like PRI based on NDMA support?

Yep, looks like.

> > +- pre-copy allows the device to implement a dirty log for its internal state.
> > +  During the SAVING | RUNNING state the data window should present the device
> > +  state being logged and during SAVING | !RUNNING the data window should present
> > +  the unlogged device state as well as the changes from the internal dirty log.
> 
> This is getting a bit close to specifying an implementation.

Bit close, but not too close :) I think it is clarifying to
describe the general working of pre-copy - at least people here raised
questions about what it is doing.

Overall it must work in this basic way, and devices have freedom about
what internal state they can/will log. There is just a clear division
that every internal state in the first step is either immutable or
logged, and that the second step is a delta over the first.

> > +- Migration control registers inside the same iommu_group as the VFIO device.
> 
> Not a complete sentence, is this meant as a topic header?

Sure
 
> > +  This immediately raises a security concern as user-space can use Peer to Peer
> > +  DMA to manipulate these migration control registers concurrently with
> > +  any kernel actions.
> > +
> 
> We haven't defined "migration control registers" beyond device_state,
> which is software defined "register".  What physical registers that are
> subject to p2p DMA is this actually referring to?

Here this is talking about the device's physical HW control registers.

> > +TDB - discoverable feature flag for NDMA
> 
> Is the goal to release mlx5 support without the NDMA feature and add it
> later?  

Yishai has a patch already to add NDMA to mlx5, it will come in the
next iteration once we can agree on this document. qemu will follow
sometime later.

Below are the changes I made

Thanks,
Jason

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index d9be47570f878c..69d77fb7e5a321 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -315,14 +315,14 @@ states act as cross device synchronization points. The VMM must bring all
 devices to the grace state before advancing past it.
 
 The above reference flows are built around specific requirements on the
-migration driver for its implementation of the migration_state input.
+migration driver for its implementation of the device_state input.
 
-The migration_state cannot change asynchronously, upon writing the
-migration_state the driver will either keep the current state and return
+The device_state cannot change asynchronously, upon writing the
+device_state the driver will either keep the current state and return
 failure, return failure and go to ERROR, or succeed and go to the new state.
 
-Event triggered actions happen when user-space requests a new migration_state
-that differs from the current migration_state. Actions happen on a bit group
+Event triggered actions happen when user-space requests a new device_state
+that differs from the current device_state. Actions happen on a bit group
 basis:
 
  SAVING | RUNNING
@@ -351,12 +351,12 @@ basis:
 
    If the migration data is invalid then the ERROR state must be set.
 
-Continuous actions are in effect when migration_state bit groups are active:
+Continuous actions are in effect when device_state bit groups are active:
 
  RUNNING | NDMA
    The device is not allowed to issue new DMA operations.
 
-   Whenever the kernel returns with a migration_state of NDMA there can be no
+   Whenever the kernel returns with a device_state of NDMA there can be no
    in progress DMAs.
 
  !RUNNING
@@ -384,7 +384,7 @@ Continuous actions are in effect when migration_state bit groups are active:
    during ERROR to avoid corrupting other devices or a VM during a failed
    migration.
 
-When multiple bits change in the migration_state they may describe multiple
+When multiple bits change in the device_state they may describe multiple
 event triggered actions, and multiple changes to continuous actions.  The
 migration driver must process them in a priority order:
 
@@ -399,9 +399,9 @@ migration driver must process them in a priority order:
 
 In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
 device back to device_state RUNNING. When a migration driver executes this
-ioctl it should discard the data window and set migration_state to RUNNING as
+ioctl it should discard the data window and set device_state to RUNNING as
 part of resetting the device to a clean state. This must happen even if the
-migration_state has errored. A freshly opened device FD should always be in
+device_state has errored. A freshly opened device FD should always be in
 the RUNNING state.
 
 The migration driver has limitations on what device state it can affect. Any
@@ -476,14 +476,23 @@ implementing migration:
   are pushed down to the next step in the sequence and various behaviors that
   rely on NDMA cannot be used.
 
-- Migration control registers inside the same iommu_group as the VFIO device.
-  This immediately raises a security concern as user-space can use Peer to Peer
-  DMA to manipulate these migration control registers concurrently with
+  NDMA is made optional to support simple HW implementations that either just
+  cannot do NDMA, or cannot do NDMA without a performance cost. NDMA is only
+  necessary for special features like P2P and PRI, so devices can omit it in
+  exchange for limitations on the guest.
+
+- Devices that have their HW migration control MMIO registers inside the same
+  iommu_group as the VFIO device have some special considerations. In this
+  case a driver will be operating HW registers from kernel space that are also
+  subjected to userspace controlled DMA due to the iommu_group.
+
+  This immediately raises a security concern as user-space can use Peer to
+  Peer DMA to manipulate these migration control registers concurrently with
   any kernel actions.
 
   A device driver operating such a device must ensure that kernel integrity
   cannot be broken by hostile user space operating the migration MMIO
-  registers via peer to peer, at any point in the sequence. Notably the kernel
+  registers via peer to peer, at any point in the sequence. Further the kernel
   cannot use DMA to transfer any migration data.
 
   However, as discussed above in the "Device Peer to Peer DMA" section, it can

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-11-30 18:59   ` Jason Gunthorpe
@ 2021-11-30 22:35     ` Alex Williamson
  2021-12-01  3:14       ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2021-11-30 22:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, 30 Nov 2021 14:59:10 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Nov 30, 2021 at 10:26:11AM -0700, Alex Williamson wrote:
> > On Mon, 29 Nov 2021 10:45:52 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > Provide some more complete documentation for the migration regions
> > > behavior, specifically focusing on the device_state bits and the whole
> > > system view from a VMM.
> > > 
> > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > >  Documentation/driver-api/vfio.rst | 277 +++++++++++++++++++++++++++++-
> > >  1 file changed, 276 insertions(+), 1 deletion(-)
> > > 
> > > Alex/Cornelia, here is the second draft of the requested documentation I promised
> > > 
> > > We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.
> > > 
> > > Our thinking is that NDMA would be implemented like this:
> > > 
> > >    +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)
> > > 
> > > And a .add_capability ops will be used to signal to userspace driver support:
> > > 
> > >    +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6  
> > 
> > So based on this and the discussion later in the doc, NDMA is an
> > optional device feature, is this specifically to support HNS?   
> 
> Yes. It is not trivial to implement NDMA in a device, we already have
> HNS as a public existing example that cannot do it, so it is a
> permanent optional feature.
> 
> > IIRC, this is a simple queue based device, but is it the fact that
> > the queue lives in non-device memory that makes it such that the
> > driver cannot mediate queue entries and simply add them to the to
> > migration stream?  
> 
> From what HNS said the device driver would have to trap every MMIO to
> implement NDMA as it must prevent touches to the physical HW MMIO to
> maintain the NDMA state.
> 
> The issue is that the HW migration registers can stop processing the
> queue and thus enter NDMA but a MMIO touch can resume queue
> processing, so NDMA cannot be sustained.
> 
> Trapping every MMIO would have a huge negative performance impact.  So
> it doesn't make sense to do so for a device that is not intended to be
> used in any situation where NDMA is required.

But migration is a cooperative activity with userspace.  If necessary
we can impose a requirement that mmap access to regions (other than the
migration region itself) are dropped when we're in the NDMA or !RUNNING
device_state.  We can define additional sparse mmap capabilities for
regions for various device states if we need more fine grained control.
There's no reason that mediation while in the NDMA state needs to
impose any performance penalty against the default RUNNING state.  In
fact, it seems rather the migration driver's job to provide such
mediation.

> > Some discussion of this requirement would be useful in the doc,
> > otherwise it seems easier to deprecate the v1 migration region
> > sub-type, and increment to a v2 where NDMA is a required feature.  
> 
> I can add some words a the bottom, but since NDMA is a completely
> transparent optional feature I don't see any reason to have v2.

It's hardly transparent, aiui userspace is going to need to impose a
variety of loosely defined restrictions for devices without NDMA
support.  It would be far easier if we could declare NDMA support to be
a requirement.

> > There are so many implicit concepts here, I'm not sure I'm making a
> > correct interpretation, let alone someone coming to this document for
> > understanding.  
> 
> That isn't really helpful feedback..
> 
> The above is something like a summary to give context to the below
> which provides a lot of detail to each step.
> 
> If you read the above/below together and find stuff is lacking, then
> please point to it and let's add it

As I think Connie also had trouble with, combining device_state with
IOMMU migration features and VMM state, without any preceding context
and visual cues makes the section confusing.  I did gain context as I
read further though the doc, but I also had the advantage of being
rather familiar with the topic.  Maybe a table format would help to
segment the responsibilities?

> > > +If the VMM has multiple VFIO devices undergoing migration then the grace
> > > +states act as cross device synchronization points. The VMM must bring all
> > > +devices to the grace state before advancing past it.
> > > +
> > > +The above reference flows are built around specific requirements on the
> > > +migration driver for its implementation of the migration_state input.  
> > 
> > I can't glean any meaning from this sentence.  "device_state" here and
> > throughout?  We don't have a "migration_state".  
> 
> Yes, migration_state is a spello for device_state, I'll fix them all
> 
> > > + !RESUMING
> > > +   All the data transferred into the data window is loaded into the device's
> > > +   internal state. The migration driver can rely on user-space issuing a
> > > +   VFIO_DEVICE_RESET prior to starting RESUMING.  
> > 
> > We can't really rely on userspace to do anything, nor has this sequence
> > been part of the specified protocol.  
> 
> It is exsisting behavior of qemu - which is why we documented it.

QEMU resets devices as part of initializing the VM, but I don't see
that QEMU specifically resets a device in order to transition it to
the RESUMING device_state. 

> Either qemu shouldn't do it as devices must fully self-reset, or we
> should have it part of the canonical flow and devices may as well
> expect it. It is useful because post VFIO_DEVICE_RESET all DMA is
> quiet, no outstanding PRIs exist, etc etc.

It's valid for QEMU to reset the device any time it wants, saying that
it cannot perform a reset before transitioning to the RESUMING state is
absurd.  Userspace can do redundant things for their own convenience.

We don't currently specify any precondition for a device to enter the
RESUMING state.  The driver can of course nak the state change with an
errno, or hard nak it with an errno and ERROR device_state, which would
require userspace to make use of VFIO_DEVICE_RESET.  I think I would
also consider if valid if the driver internally performed a reset on a
RUNNING -> RESUMING state change, but we've never previously expressed
any linkage of a userspace requirement to reset the device here. 

> > As with the previous flows, it seems like there's a ton of implicit
> > knowledge here.  Why are we documenting these here rather than in the
> > uAPI header?  
> 
> Because this is 300 lines already and is too complicated/long to
> properly live in a uapi header.

Minimally we need to resolve that this document must be consistent with
the uAPI.  I'm not sure that's entirely the case in this draft.

> >  I'm having a difficult time trying to understand what are
> > proposals to modify the uAPI and what are interpretations of the
> > existing protocol.  
> 
> As far as we know this describes what current qemu does in the success
> path with a single VFIO device. ie we think mlx5 conforms to this spec
> and we see it works as-is with qemu, up to qemu's limitations:
> 
>  - qemu has no support for errors or error recovery, it just calls
>    abort()
>  - qemu does not stress the device_state and only does a few
>    transition patterns
>  - qemu doesn't support P2P cases due to the NDMA topic

Or rather QEMU does support p2p cases regardless of the NDMA topic.

>  - simple devices like HNS will work, but not robustly in the face of
>    a hostile VM and multiple VFIO devices.

So what's the goal here, are we trying to make the one currently
implemented and unsupported userspace be the gold standard to which
drivers should base their implementation?  We've tried to define a
specification that's more flexible than a single implementation and by
these standards we seem to be flipping that implementation back into
the specification.

> > > +   To abort a RESUMING issue a VFIO_DEVICE_RESET.  
> >
> > Any use of VFIO_DEVICE_RESET should return the device to the default
> > state, but a user is free to try to use device_state to transition
> > from RESUMING to any other state.    
> 
> Userspace can attempt all transitions, of course. 
> 
> However, notice this spec doesn't specify what happens on non-success
> RESUMING->RUNNING. So, I'm calling that undefined behavior.
> 
> As you say:
> 
> > The driver can choose to fail that transition and even make use of
> > the error device_state, but there's no expectation that a  
> 
> Thus, the above advice. To reliably abort RESUMING use DEVICE_RESET,
> do not use ->RUNNING.

Userspace can attempt RESUMING -> RUNNING regardless of what we specify,
so a driver needs to be prepared for such an attempted state change
either way.  So what's the advantage to telling a driver author that
they can expect a given behavior?

It doesn't make much sense to me to glue two separate userspace
operations together to say these must be done in this sequence, back to
back.  If we want the device to be reset in order to enter RESUMING, the
driver should simply reset the device as necessary during the state
transition.  The outward effect to the user is to specify that device
internal state may not be retained on transition from RUNNING ->
RESUMING.
 
> > > +Continuous actions are in effect when migration_state bit groups are active:
> > > +
> > > + RUNNING | NDMA
> > > +   The device is not allowed to issue new DMA operations.
> > > +
> > > +   Whenever the kernel returns with a migration_state of NDMA there can be no
> > > +   in progress DMAs.
> > > +  
> > 
> > There are certainly event triggered actions based on setting NDMA as
> > well, ex. completion of outstanding DMA.  
> 
> I'm leaving it implied that there is always some work required to
> begin/end these continuous behaviors
> 
> > > + !RUNNING
> > > +   The device should not change its internal state. Further implies the NDMA
> > > +   behavior above.  
> > 
> > Does this also imply other device regions cannot be accessed as has
> > previously been suggested?  Which?  
> 
> This question is discussed/answered below
> 
> > > + - SAVING | RUNNING
> > > + - NDMA
> > > + - !RUNNING
> > > + - SAVING | !RUNNING
> > > + - RESUMING
> > > + - !RESUMING
> > > + - RUNNING
> > > + - !NDMA  
> > 
> > Lots of deduction left to the reader...  
> 
> Do you have an alternative language? This is quite complicated, I
> advise people to refer to mlx5's implementation.

I agree with Connie on this, if the reader of the documentation needs
to look at a specific driver implementation to understand the
reasoning, the documentation has failed.  If it can be worked out by
looking at the device_state write function of the mlx5 driver, then
surely a sentence or two for each priority item can be added here.

Part of the problem is that the nomenclature is unclear, we're listing
bit combinations, but not the changed bit(s) and we need to infer the
state.  The mlx5 driver specifically looks for individual state bit
flips in the presence of an existing state.  I'm not able to obviously
map the listing above to the latest posted version of the mlx5 driver.

> > > +  As Peer to Peer DMA is a MMIO touch like any other, it is important that
> > > +  userspace suspend these accesses before entering any device_state where MMIO
> > > +  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
> > > +  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
> > > +  device does not support NDMA and rely on that to guarantee quiet MMIO.  
> > 
> > Seems that would have its own set of consequences.  
> 
> Sure, userspace has to make choices here.

It seems a bit loaded to suggest an alternative choice if it's not
practical or equivalent.  Maybe it's largely the phrasing, I read
"remove MMIO mappings" as to drop them dynamically, when I think we've
discussed that userspace might actually preclude these mappings for
non-NDMA devices such that p2p DMA cannot exist, ever.

> > > +  Device that do not support NDMA cannot be configured to generate page faults
> > > +  that require the VCPU to complete.  
> > 
> > So the VMM is required to hide features like PRI based on NDMA support?  
> 
> Yep, looks like.
> 
> > > +- pre-copy allows the device to implement a dirty log for its internal state.
> > > +  During the SAVING | RUNNING state the data window should present the device
> > > +  state being logged and during SAVING | !RUNNING the data window should present
> > > +  the unlogged device state as well as the changes from the internal dirty log.  
> > 
> > This is getting a bit close to specifying an implementation.  
> 
> Bit close, but not too close :) I think it is clarifying to
> describe the general working of pre-copy - at least people here raised
> questions about what it is doing.
> 
> Overall it must work in this basic way, and devices have freedom about
> what internal state they can/will log. There is just a clear division
> that every internal state in the first step is either immutable or
> logged, and that the second step is a delta over the first.

I agree that it's a reasonable approach, though as I read the proposed
text, there's no mention of immutable state and no reason a driver
would implement a dirty log for immutable state, therefore we seem to
be suggesting such data for the stop-and-copy phase when it would
actually be preferable to include it in pre-copy.  I think the fact
that a user is not required to run the pre-copy phase until completion
is also noteworthy.

> > > +- Migration control registers inside the same iommu_group as the VFIO device.  
> > 
> > Not a complete sentence, is this meant as a topic header?  
> 
> Sure
>  
> > > +  This immediately raises a security concern as user-space can use Peer to Peer
> > > +  DMA to manipulate these migration control registers concurrently with
> > > +  any kernel actions.
> > > +  
> > 
> > We haven't defined "migration control registers" beyond device_state,
> > which is software defined "register".  What physical registers that are
> > subject to p2p DMA is this actually referring to?  
> 
> Here this is talking about the device's physical HW control registers.
> 
> > > +TDB - discoverable feature flag for NDMA  
> > 
> > Is the goal to release mlx5 support without the NDMA feature and add it
> > later?    
> 
> Yishai has a patch already to add NDMA to mlx5, it will come in the
> next iteration once we can agree on this document. qemu will follow
> sometime later.

So it's not really a TBD, it's resolved in a uAPI update that will be
included with the next revision?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-11-30 22:35     ` Alex Williamson
@ 2021-12-01  3:14       ` Jason Gunthorpe
  2021-12-01  9:54         ` Shameerali Kolothum Thodi
  2021-12-01 20:03         ` Alex Williamson
  0 siblings, 2 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-01  3:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote:

> > From what HNS said the device driver would have to trap every MMIO to
> > implement NDMA as it must prevent touches to the physical HW MMIO to
> > maintain the NDMA state.
> > 
> > The issue is that the HW migration registers can stop processing the
> > queue and thus enter NDMA but a MMIO touch can resume queue
> > processing, so NDMA cannot be sustained.
> > 
> > Trapping every MMIO would have a huge negative performance impact.  So
> > it doesn't make sense to do so for a device that is not intended to be
> > used in any situation where NDMA is required.
> 
> But migration is a cooperative activity with userspace.  If necessary
> we can impose a requirement that mmap access to regions (other than the
> migration region itself) are dropped when we're in the NDMA or !RUNNING
> device_state.  

It is always NDMA|RUNNING, so we can't fully drop access to
MMIO. Userspace would have to transfer from direct MMIO to
trapping. With enough new kernel infrastructure and qemu support it
could be done.

Even so, we can't trap accesses through the IOMMU so such a scheme
would still require removing IOMMU acess to the device. Given that the
basic qemu mitigation for no NDMA support is to eliminate P2P cases by
removing the IOMMU mappings this doesn't seem to advance anything and
only creates complexity.

At least I'm not going to insist that hns do all kinds of work like
this for a edge case they don't care about as a precondition to get a
migration driver.

> There's no reason that mediation while in the NDMA state needs to
> impose any performance penalty against the default RUNNING state. 

Eh? Mitigation of no NDMA support would have to mediate the MMIO on a
a performance doorbell path, there is no escaping a performance
hit. I'm not sure what you mean

> > > Some discussion of this requirement would be useful in the doc,
> > > otherwise it seems easier to deprecate the v1 migration region
> > > sub-type, and increment to a v2 where NDMA is a required feature.  
> > 
> > I can add some words a the bottom, but since NDMA is a completely
> > transparent optional feature I don't see any reason to have v2.
> 
> It's hardly transparent, aiui userspace is going to need to impose a
> variety of loosely defined restrictions for devices without NDMA
> support.  It would be far easier if we could declare NDMA support to be
> a requirement.

It would make userspace a bit simpler at the cost of excluding or
complicating devices like hns for a use case they don't care about.

On the other hand, the simple solution in qemu is when there is no
universal NDMA it simply doesn't include any MMIO ranges in the
IOMMU.

> As I think Connie also had trouble with, combining device_state with
> IOMMU migration features and VMM state, without any preceding context
> and visual cues makes the section confusing.  I did gain context as I
> read further though the doc, but I also had the advantage of being
> rather familiar with the topic.  Maybe a table format would help to
> segment the responsibilities?

I moved the context to the bottom exactly because Connie said it was
confusing at the start. :)

This is a RST document so I not keen to make huge formatting
adventures for minimal readability gain.

I view this as something that probably needs to be read a few times,
along with the code and header files for someone brand new to
understand. I'm Ok with that, it is about consistent with kernel docs
of this level.

What I would like is if userspace focused readers can get their
important bits of information with less work.

> > It is exsisting behavior of qemu - which is why we documented it.
> 
> QEMU resets devices as part of initializing the VM, but I don't see
> that QEMU specifically resets a device in order to transition it to
> the RESUMING device_state. 

We instrumented the kernel and monitored qemu, it showed up on the
resume traces.

> > Either qemu shouldn't do it as devices must fully self-reset, or we
> > should have it part of the canonical flow and devices may as well
> > expect it. It is useful because post VFIO_DEVICE_RESET all DMA is
> > quiet, no outstanding PRIs exist, etc etc.
> 
> It's valid for QEMU to reset the device any time it wants, saying that
> it cannot perform a reset before transitioning to the RESUMING state is
> absurd.  Userspace can do redundant things for their own convenience.

I didn't say cannot, I said it shouldn't.

Since qemu is the only implementation it would be easy for drivers to
rely on the implicit reset it seems to do, it seems an important point
that should be written either way.

I don't have a particular requirement to have the reset, but it does
seem like a good idea. If you feel strongly, then let's say the
opposite that the driver must enter RESUME with no preconditions,
doing an internal reset if required.

> We don't currently specify any precondition for a device to enter the
> RESUMING state.  The driver can of course nak the state change with an
> errno, or hard nak it with an errno and ERROR device_state, which would
> require userspace to make use of VFIO_DEVICE_RESET.

I don't think we should be relying on every driver doing something
totally differnt on the standard path. That is only going to hurt
interoperability.

> > > As with the previous flows, it seems like there's a ton of implicit
> > > knowledge here.  Why are we documenting these here rather than in the
> > > uAPI header?  
> > 
> > Because this is 300 lines already and is too complicated/long to
> > properly live in a uapi header.
> 
> Minimally we need to resolve that this document must be consistent with
> the uAPI.  I'm not sure that's entirely the case in this draft.

Can you point to something please? I can't work with "I'm not sure"

IMO the header file doesn't really say much and can be read in a way
that is consistent with this more specific document.

> >  - qemu doesn't support P2P cases due to the NDMA topic
> 
> Or rather QEMU does support p2p cases regardless of the NDMA topic.

I mean support in a way that is actually usable as without NDMA it
corrupts the VM when it migrates it.

> >  - simple devices like HNS will work, but not robustly in the face of
> >    a hostile VM and multiple VFIO devices.
> 
> So what's the goal here, are we trying to make the one currently
> implemented and unsupported userspace be the gold standard to which
> drivers should base their implementation?  

I have no idea anymore. You asked for docs and complete picture as a
percondition for merging a driver. Here it is.

What do you want?

> We've tried to define a specification that's more flexible than a
> single implementation and by these standards we seem to be flipping
> that implementation back into the specification.

What specification!?! All we have is a couple lines in a header file
that is no where near detailed enough for multi-driver
interoperability with userspace. You have no idea how much effort has
been expended to get this far based on the few breadcrumbs that were
left, and we have access to the team that made the only other
implementation!

*flexible* is not a specification.

> Userspace can attempt RESUMING -> RUNNING regardless of what we specify,
> so a driver needs to be prepared for such an attempted state change
> either way.  So what's the advantage to telling a driver author that
> they can expect a given behavior?

The above didn't tell a driver author to expect a certain behavior, it
tells userspace what to do.

> It doesn't make much sense to me to glue two separate userspace
> operations together to say these must be done in this sequence, back to
> back.  If we want the device to be reset in order to enter RESUMING, the
> driver should simply reset the device as necessary during the state
> transition.  The outward effect to the user is to specify that device
> internal state may not be retained on transition from RUNNING ->
> RESUMING.

Maybe, and I'm happy if you want to specify this instead. It just
doesn't match what we observe qemu to be doing.

> > Do you have an alternative language? This is quite complicated, I
> > advise people to refer to mlx5's implementation.
> 
> I agree with Connie on this, if the reader of the documentation needs
> to look at a specific driver implementation to understand the
> reasoning, the documentation has failed.  

Lets agree on some objective here, this is not trying to be fully
comprehensive, or fully standalone. It is intended to drive agreement,
be informative to userspace, and be supplemental to the actual code.

> If it can be worked out by looking at the device_state write
> function of the mlx5 driver, then surely a sentence or two for each
> priority item can be added here.

Please give me a suggestion then, because I don't know what will help
here?

> Part of the problem is that the nomenclature is unclear, we're listing
> bit combinations, but not the changed bit(s) and we need to infer the
> state.

Each line lists the new state, the changed bits are thus any bits that
make up the new state.

If you look at how mlx5 is constructed each if has a 'did it change'
test followed by 'what state is it in now'

So the document is read as listing the order the driver enters the new
states. I clarified it as ""must process the new device_state bits in
a priority order""

> flips in the presence of an existing state.  I'm not able to obviously
> map the listing above to the latest posted version of the mlx5 driver.

One of the things we've done is align mlx5 more clearly to this. For
instance it no longer has a mixture of state and old state in the if
statements, it always tests the new state so the tests logically
follow what is written here

Stripping away the excess the expressions now look like this:

 !(state & VFIO_DEVICE_STATE_RUNNING)
 ((state & (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING)) == VFIO_DEVICE_STATE_SAVING))
 (state & VFIO_DEVICE_STATE_RESUMING)

Which mirror what is written here.

> > > > +  As Peer to Peer DMA is a MMIO touch like any other, it is important that
> > > > +  userspace suspend these accesses before entering any device_state where MMIO
> > > > +  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
> > > > +  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
> > > > +  device does not support NDMA and rely on that to guarantee quiet MMIO.  
> > > 
> > > Seems that would have its own set of consequences.  
> > 
> > Sure, userspace has to make choices here.
> 
> It seems a bit loaded to suggest an alternative choice if it's not
> practical or equivalent.  Maybe it's largely the phrasing, I read
> "remove MMIO mappings" as to drop them dynamically, when I think we've
> discussed that userspace might actually preclude these mappings for
> non-NDMA devices such that p2p DMA cannot exist, ever.

I mean the latter. How about "never install MMIO mappings" ?

> > Overall it must work in this basic way, and devices have freedom about
> > what internal state they can/will log. There is just a clear division
> > that every internal state in the first step is either immutable or
> > logged, and that the second step is a delta over the first.
> 
> I agree that it's a reasonable approach, though as I read the proposed
> text, there's no mention of immutable state and no reason a driver
> would implement a dirty log for immutable state, therefore we seem to
> be suggesting such data for the stop-and-copy phase when it would
> actually be preferable to include it in pre-copy.

I'd say that is a detail we don't need to discuss/define, it has no
user space visible consequence.

> I think the fact that a user is not required to run the pre-copy
> phase until completion is also noteworthy.

This text doesn't try to detail how the migration window works, that
is a different large task. The intention is that the migration window
must be fully drained to be successful.

I added this for some clarity ""The entire migration data, up to each
end of stream must be transported from the saving to resuming side.""

> > Yishai has a patch already to add NDMA to mlx5, it will come in the
> > next iteration once we can agree on this document. qemu will follow
> > sometime later.
> 
> So it's not really a TBD, it's resolved in a uAPI update that will be
> included with the next revision?  Thanks,

There is a patch yes, the TBD here is to include a few words about how
to detect NDMA

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-01  3:14       ` Jason Gunthorpe
@ 2021-12-01  9:54         ` Shameerali Kolothum Thodi
  2021-12-01 13:49           ` Jason Gunthorpe
  2021-12-01 20:03         ` Alex Williamson
  1 sibling, 1 reply; 29+ messages in thread
From: Shameerali Kolothum Thodi @ 2021-12-01  9:54 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede,
	Max Gurtovoy, Yishai Hadas, Zengtao (B),
	liulongfang



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> Sent: 01 December 2021 03:14
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: Jonathan Corbet <corbet@lwn.net>; linux-doc@vger.kernel.org; Cornelia
> Huck <cohuck@redhat.com>; kvm@vger.kernel.org; Kirti Wankhede
> <kwankhede@nvidia.com>; Max Gurtovoy <mgurtovoy@nvidia.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yishai
> Hadas <yishaih@nvidia.com>
> Subject: Re: [PATCH RFC v2] vfio: Documentation for the migration region
> 
> On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote:
> 
> > > From what HNS said the device driver would have to trap every MMIO to
> > > implement NDMA as it must prevent touches to the physical HW MMIO to
> > > maintain the NDMA state.
> > >
> > > The issue is that the HW migration registers can stop processing the
> > > queue and thus enter NDMA but a MMIO touch can resume queue
> > > processing, so NDMA cannot be sustained.
> > >
> > > Trapping every MMIO would have a huge negative performance impact.
> So
> > > it doesn't make sense to do so for a device that is not intended to be
> > > used in any situation where NDMA is required.
> >
> > But migration is a cooperative activity with userspace.  If necessary
> > we can impose a requirement that mmap access to regions (other than the
> > migration region itself) are dropped when we're in the NDMA or !RUNNING
> > device_state.
> 
> It is always NDMA|RUNNING, so we can't fully drop access to
> MMIO. Userspace would have to transfer from direct MMIO to
> trapping. With enough new kernel infrastructure and qemu support it
> could be done.

As far as our devices are concerned we put the dev queue into a PAUSE state
in the !RUNNUNG state. And since we don't have any P2P support, is it ok
to put the onus on userspace here that it won't try to access the MMIO during
!RUNNUNG state?

So just to make it clear , if a device declares that it doesn't support NDMA
and P2P, is the v1 version of the spec good enough or we still need to take
care the case that a malicious user might try MMIO access in !RUNNING
state and should have kernel infrastructure in place to safe guard that?

> 
> Even so, we can't trap accesses through the IOMMU so such a scheme
> would still require removing IOMMU acess to the device. Given that the
> basic qemu mitigation for no NDMA support is to eliminate P2P cases by
> removing the IOMMU mappings this doesn't seem to advance anything and
> only creates complexity.
> 
> At least I'm not going to insist that hns do all kinds of work like
> this for a edge case they don't care about as a precondition to get a
> migration driver.

Yes. That's our concern too.

(Just a note to clarify that these are not HNS devices per se. HNS actually
stands for HiSilicon Network Subsystem and doesn't currently have live
migration capability. The devices capable of live migration are HiSilicon
Accelerator devices).

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-01  9:54         ` Shameerali Kolothum Thodi
@ 2021-12-01 13:49           ` Jason Gunthorpe
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-01 13:49 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, Cornelia Huck, kvm,
	Kirti Wankhede, Max Gurtovoy, Yishai Hadas, Zengtao (B),
	liulongfang

On Wed, Dec 01, 2021 at 09:54:27AM +0000, Shameerali Kolothum Thodi wrote:

> So just to make it clear , if a device declares that it doesn't support NDMA
> and P2P, is the v1 version of the spec good enough or we still need to take
> care the case that a malicious user might try MMIO access in !RUNNING
> state and should have kernel infrastructure in place to safe guard that?

My thinking is so long as the hostile user space cannot compromise the
kernel it is OK. A corrupted migration is acceptable if userspace is
not following the rules.

From a qemu perspective it should prevent a hostile VM from corrupting
the migration, as that is allowing the VM to attack the infrastructure
even if it hopefully only harms itself.

> (Just a note to clarify that these are not HNS devices per se. HNS actually
> stands for HiSilicon Network Subsystem and doesn't currently have live
> migration capability. The devices capable of live migration are HiSilicon
> Accelerator devices).

Sorry, I mostly talk to the hns team ;)

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-01  3:14       ` Jason Gunthorpe
  2021-12-01  9:54         ` Shameerali Kolothum Thodi
@ 2021-12-01 20:03         ` Alex Williamson
  2021-12-01 23:25           ` Jason Gunthorpe
  1 sibling, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2021-12-01 20:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, 30 Nov 2021 23:14:07 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote:
> 
> > > From what HNS said the device driver would have to trap every MMIO to
> > > implement NDMA as it must prevent touches to the physical HW MMIO to
> > > maintain the NDMA state.
> > > 
> > > The issue is that the HW migration registers can stop processing the
> > > queue and thus enter NDMA but a MMIO touch can resume queue
> > > processing, so NDMA cannot be sustained.
> > > 
> > > Trapping every MMIO would have a huge negative performance impact.  So
> > > it doesn't make sense to do so for a device that is not intended to be
> > > used in any situation where NDMA is required.  
> > 
> > But migration is a cooperative activity with userspace.  If necessary
> > we can impose a requirement that mmap access to regions (other than the
> > migration region itself) are dropped when we're in the NDMA or !RUNNING
> > device_state.    
> 
> It is always NDMA|RUNNING, so we can't fully drop access to
> MMIO. Userspace would have to transfer from direct MMIO to
> trapping. With enough new kernel infrastructure and qemu support it
> could be done.

This is simply toggling whether the mmap MemoryRegion in QEMU is
enabled, when not enabled access falls through to the read/write access
functions.  We already have this functionality for unmasking INTx
interrupts when we don't have EOI support.
 
> Even so, we can't trap accesses through the IOMMU so such a scheme
> would still require removing IOMMU acess to the device. Given that the
> basic qemu mitigation for no NDMA support is to eliminate P2P cases by
> removing the IOMMU mappings this doesn't seem to advance anything and
> only creates complexity.

NDMA only requires that the device cease initiating DMA, so I suppose
you're suggesting that the MMIO of a device that doesn't expect to make
use of p2p DMA could be poked through DMA, which might cause the device
to initiate a DMA and potentially lose sync with mediation.  That would
be bad, but seems like a non-issue for hns.

> At least I'm not going to insist that hns do all kinds of work like
> this for a edge case they don't care about as a precondition to get a
> migration driver.

I wonder if we need a region flag to opt-out of IOMMU mappings for
devices that do not support p2p use cases.  If there were no IOMMU
mappings and mediation handled CPU driven MMIO accesses, then we'd have
a viable NDMA mode for hns.

> > There's no reason that mediation while in the NDMA state needs to
> > impose any performance penalty against the default RUNNING state.   
> 
> Eh? Mitigation of no NDMA support would have to mediate the MMIO on a
> a performance doorbell path, there is no escaping a performance
> hit. I'm not sure what you mean

Read it again, I'm suggesting that mediation during NDMA doesn't need
to carry over any performance penalty to the default run state.  We
don't care if mediation imposes a performance penalty while NDMA is set,
we're specifically attempting to quiesce the device.  The slower it
runs the better ;)

> > > > Some discussion of this requirement would be useful in the doc,
> > > > otherwise it seems easier to deprecate the v1 migration region
> > > > sub-type, and increment to a v2 where NDMA is a required feature.    
> > > 
> > > I can add some words a the bottom, but since NDMA is a completely
> > > transparent optional feature I don't see any reason to have v2.  
> > 
> > It's hardly transparent, aiui userspace is going to need to impose a
> > variety of loosely defined restrictions for devices without NDMA
> > support.  It would be far easier if we could declare NDMA support to be
> > a requirement.  
> 
> It would make userspace a bit simpler at the cost of excluding or
> complicating devices like hns for a use case they don't care about.
> 
> On the other hand, the simple solution in qemu is when there is no
> universal NDMA it simply doesn't include any MMIO ranges in the
> IOMMU.

That leads to mysterious performance penalties when a VM was previously
able to make use of (ex.) GPUDirect, but adds an hns device and suddenly
can no longer use p2p.  Or rejected hotplugs if a device has existing
NDMA capable devices, p2p might be active, but the new device does not
support NDMA.  This all makes it really complicated to get
deterministic behavior for devices.  I don't know how to make QEMU
behavior predictable and supportable in such an environment.

> > As I think Connie also had trouble with, combining device_state with
> > IOMMU migration features and VMM state, without any preceding context
> > and visual cues makes the section confusing.  I did gain context as I
> > read further though the doc, but I also had the advantage of being
> > rather familiar with the topic.  Maybe a table format would help to
> > segment the responsibilities?  
> 
> I moved the context to the bottom exactly because Connie said it was
> confusing at the start. :)
> 
> This is a RST document so I not keen to make huge formatting
> adventures for minimal readability gain.
> 
> I view this as something that probably needs to be read a few times,
> along with the code and header files for someone brand new to
> understand. I'm Ok with that, it is about consistent with kernel docs
> of this level.
> 
> What I would like is if userspace focused readers can get their
> important bits of information with less work.
> 
> > > It is exsisting behavior of qemu - which is why we documented it.  
> > 
> > QEMU resets devices as part of initializing the VM, but I don't see
> > that QEMU specifically resets a device in order to transition it to
> > the RESUMING device_state.   
> 
> We instrumented the kernel and monitored qemu, it showed up on the
> resume traces.

Exactly, it's part of the standard VM initialization sequence, not
specific to the device entering the RESUMING state.

> > > Either qemu shouldn't do it as devices must fully self-reset, or we
> > > should have it part of the canonical flow and devices may as well
> > > expect it. It is useful because post VFIO_DEVICE_RESET all DMA is
> > > quiet, no outstanding PRIs exist, etc etc.  
> > 
> > It's valid for QEMU to reset the device any time it wants, saying that
> > it cannot perform a reset before transitioning to the RESUMING state is
> > absurd.  Userspace can do redundant things for their own convenience.  
> 
> I didn't say cannot, I said it shouldn't.

Redundant perhaps, but not dis-recommended other than for overhead.

> Since qemu is the only implementation it would be easy for drivers to
> rely on the implicit reset it seems to do, it seems an important point
> that should be written either way.
> 
> I don't have a particular requirement to have the reset, but it does
> seem like a good idea. If you feel strongly, then let's say the
> opposite that the driver must enter RESUME with no preconditions,
> doing an internal reset if required.

It seems cleaner to me than unnecessarily requiring userspace to pass
through an ioctl in order to get to the next device_state.

> > We don't currently specify any precondition for a device to enter the
> > RESUMING state.  The driver can of course nak the state change with an
> > errno, or hard nak it with an errno and ERROR device_state, which would
> > require userspace to make use of VFIO_DEVICE_RESET.  
> 
> I don't think we should be relying on every driver doing something
> totally differnt on the standard path. That is only going to hurt
> interoperability.

I agree, the above is the exception path, devices should typically
allow a transition to RESUMING with no precondition, which is why I
suggest an internal reset as necessary.

> > > > As with the previous flows, it seems like there's a ton of implicit
> > > > knowledge here.  Why are we documenting these here rather than in the
> > > > uAPI header?    
> > > 
> > > Because this is 300 lines already and is too complicated/long to
> > > properly live in a uapi header.  
> > 
> > Minimally we need to resolve that this document must be consistent with
> > the uAPI.  I'm not sure that's entirely the case in this draft.  
> 
> Can you point to something please? I can't work with "I'm not sure"

The reset behavior that I'm trying to clarify above is the primary
offender, but "I'm not sure" I understand the bit prioritization enough
to know that there isn't something there as well.  I'm also not sure if
the "end of stream" phrasing below matches the uAPI.

> IMO the header file doesn't really say much and can be read in a way
> that is consistent with this more specific document.

But if this document is suggesting the mlx5/QEMU interpretation is the
only valid interpretations for driver authors, those clarifications
should be pushed back into the uAPI header.

> > >  - qemu doesn't support P2P cases due to the NDMA topic  
> > 
> > Or rather QEMU does support p2p cases regardless of the NDMA topic.  
> 
> I mean support in a way that is actually usable as without NDMA it
> corrupts the VM when it migrates it.

I think we're in agreement, I'm just pointing out that QEMU currently
has no NDMA support nor does anything to prevent p2p with non-NDMA
devices.

> > >  - simple devices like HNS will work, but not robustly in the face of
> > >    a hostile VM and multiple VFIO devices.  
> > 
> > So what's the goal here, are we trying to make the one currently
> > implemented and unsupported userspace be the gold standard to which
> > drivers should base their implementation?    
> 
> I have no idea anymore. You asked for docs and complete picture as a
> percondition for merging a driver. Here it is.
> 
> What do you want?

Looking through past conversations, I definitely asked that we figure
out how NDMA is going to work.  Are you thinking of a different request
from me?

The restriction implied by lack of NDMA support are pretty significant.
Maybe a given device like hns doesn't care because they don't intend to
support p2p, but they should care if it means we can't hot-add their
device to a VM in the presences of devices that can support p2p and if
cold plugging their device into an existing configuration implies loss
of functionality or performance elsewhere.

I'd tend to expect though that we'd incorporate NDMA documentation into
the uAPI with a formal proposal for discovery and outline a number of
those usage implications.

> > We've tried to define a specification that's more flexible than a
> > single implementation and by these standards we seem to be flipping
> > that implementation back into the specification.  
> 
> What specification!?! All we have is a couple lines in a header file
> that is no where near detailed enough for multi-driver
> interoperability with userspace. You have no idea how much effort has
> been expended to get this far based on the few breadcrumbs that were
> left, and we have access to the team that made the only other
> implementation!
> 
> *flexible* is not a specification.

There are approximately 220 lines of the uAPI header file dealing
specifically with the migration region.  A bit more than a couple.

We've tried to define it by looking at the device requirements to
support migration rather than tailor it specifically to the current QEMU
implementation of migration.  Attempting to undo that generality by
suggesting only current usage patterns are relevant is therefore going
to generate friction.

> > Userspace can attempt RESUMING -> RUNNING regardless of what we specify,
> > so a driver needs to be prepared for such an attempted state change
> > either way.  So what's the advantage to telling a driver author that
> > they can expect a given behavior?  
> 
> The above didn't tell a driver author to expect a certain behavior, it
> tells userspace what to do.

  "The migration driver can rely on user-space issuing a
   VFIO_DEVICE_RESET prior to starting RESUMING."

> > It doesn't make much sense to me to glue two separate userspace
> > operations together to say these must be done in this sequence, back to
> > back.  If we want the device to be reset in order to enter RESUMING, the
> > driver should simply reset the device as necessary during the state
> > transition.  The outward effect to the user is to specify that device
> > internal state may not be retained on transition from RUNNING ->
> > RESUMING.  
> 
> Maybe, and I'm happy if you want to specify this instead. It just
> doesn't match what we observe qemu to be doing.

Tracing that shows a reset preceding entering RESUMING doesn't suggest
to me that QEMU is performing a reset for the specific purpose of
entering RESUMING.  Correlation != causation.

> > > Do you have an alternative language? This is quite complicated, I
> > > advise people to refer to mlx5's implementation.  
> > 
> > I agree with Connie on this, if the reader of the documentation needs
> > to look at a specific driver implementation to understand the
> > reasoning, the documentation has failed.    
> 
> Lets agree on some objective here, this is not trying to be fully
> comprehensive, or fully standalone. It is intended to drive agreement,
> be informative to userspace, and be supplemental to the actual code.
> 
> > If it can be worked out by looking at the device_state write
> > function of the mlx5 driver, then surely a sentence or two for each
> > priority item can be added here.  
> 
> Please give me a suggestion then, because I don't know what will help
> here?

I'm trying... thus the suggesting just below

> > Part of the problem is that the nomenclature is unclear, we're listing
> > bit combinations, but not the changed bit(s) and we need to infer the
> > state.  
> 
> Each line lists the new state, the changed bits are thus any bits that
> make up the new state.
> 
> If you look at how mlx5 is constructed each if has a 'did it change'
> test followed by 'what state is it in now'
> 
> So the document is read as listing the order the driver enters the new
> states. I clarified it as ""must process the new device_state bits in
> a priority order""

The order I see in the v5 mlx5 post is:

if RUNNING 1->0
  quiesce + freeze
if RUNNING or SAVING change && state == !RUNNING | SAVING
  save device state
if RESUMING 0->1
  reset device state
if RESUMING 1->0
  load device state
if RUNNING 0->1
  unfreeze + unquiesce

So maybe part of my confusion stems from the fact that the mlx5 driver
doesn't support pre-copy, which by the provided list is the highest
priority.  But what actually makes that the highest priority?  Any
combination of SAVING and RESUMING is invalid, so we can eliminate
those.  We obviously can't have RUNNING and !RUNNING, so we can
eliminate all cases of !RUNNING, so we can shorten the list relativeto
prioritizing SAVING|RUNNING to:

 - SAVING | RUNNING
 - NDMA
 - !RESUMING
 - RUNNING
 - !NDMA

It seems like a transition from RESUMING -> SAVING | RUNNING is valid
and RESUMING(1->0) triggers the device state to be loaded from the
migration area, but SAVING(0->1) clears the data window, so I think
this priority scheme means a non-intuitive thing just happened.  SAVING
| RUNNING would need to be processed after !RESUMING, but maybe before
RUNNING itself.

NDMA only requires that the device cease initiating DMA before the call
returns and it only has meaning if RUNNING, so I'm not sure why setting
NDMA is given any priority.  I guess maybe we're trying
(unnecessarily?) to preempt any DMA that might occur as a result of
setting RUNNING (so why is it above cases including !RUNNING?)?

Obviously presenting a priority scheme without a discussion of the
associativity of the states and a barely sketched out nomenclature is
really not inviting an actual understanding how this is reasoned (imo).

> > flips in the presence of an existing state.  I'm not able to obviously
> > map the listing above to the latest posted version of the mlx5 driver.  
> 
> One of the things we've done is align mlx5 more clearly to this. For
> instance it no longer has a mixture of state and old state in the if
> statements, it always tests the new state so the tests logically
> follow what is written here
> 
> Stripping away the excess the expressions now look like this:
> 
>  !(state & VFIO_DEVICE_STATE_RUNNING)
>  ((state & (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING)) == VFIO_DEVICE_STATE_SAVING))
>  (state & VFIO_DEVICE_STATE_RESUMING)
> 
> Which mirror what is written here.

I'm feeling like there's a bit of a chicken and egg problem here to
decide if this is sufficiently clear documentation before a new posting
of the mlx5 driver where the mlx5 driver is the authoritative source
for the reasoning of the priority scheme documented here (and doesn't
make use of pre-copy).

> > > > > +  As Peer to Peer DMA is a MMIO touch like any other, it is important that
> > > > > +  userspace suspend these accesses before entering any device_state where MMIO
> > > > > +  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
> > > > > +  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
> > > > > +  device does not support NDMA and rely on that to guarantee quiet MMIO.    
> > > > 
> > > > Seems that would have its own set of consequences.    
> > > 
> > > Sure, userspace has to make choices here.  
> > 
> > It seems a bit loaded to suggest an alternative choice if it's not
> > practical or equivalent.  Maybe it's largely the phrasing, I read
> > "remove MMIO mappings" as to drop them dynamically, when I think we've
> > discussed that userspace might actually preclude these mappings for
> > non-NDMA devices such that p2p DMA cannot exist, ever.  
> 
> I mean the latter. How about "never install MMIO mappings" ?

Yup.

> > > Overall it must work in this basic way, and devices have freedom about
> > > what internal state they can/will log. There is just a clear division
> > > that every internal state in the first step is either immutable or
> > > logged, and that the second step is a delta over the first.  
> > 
> > I agree that it's a reasonable approach, though as I read the proposed
> > text, there's no mention of immutable state and no reason a driver
> > would implement a dirty log for immutable state, therefore we seem to
> > be suggesting such data for the stop-and-copy phase when it would
> > actually be preferable to include it in pre-copy.  
> 
> I'd say that is a detail we don't need to discuss/define, it has no
> user space visible consequence.

The migration data stream is entirely opaque to userspace, so what's
the benefit to userspace to suggest anything about the content in each
phase?  This is presented in a userspace edge concerns section, but the
description is much more relevant to a driver author.

> > I think the fact that a user is not required to run the pre-copy
> > phase until completion is also noteworthy.  
> 
> This text doesn't try to detail how the migration window works, that
> is a different large task. The intention is that the migration window
> must be fully drained to be successful.

Clarification, the *!RUNNING* migration window must be fully drained.

> I added this for some clarity ""The entire migration data, up to each
> end of stream must be transported from the saving to resuming side.""

Per the uAPI regarding pre-copy:

  "The user must not be required to consume all migration data before
  the device transitions to a new state, including the stop-and-copy
  state."

If "end of stream" suggests the driver defined end of the data stream
for pre-copy rather than simply the end of the user accumulated data
stream, that conflicts with the uAPI.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-01 20:03         ` Alex Williamson
@ 2021-12-01 23:25           ` Jason Gunthorpe
  2021-12-02 17:05             ` Cornelia Huck
  2021-12-03 18:06             ` Alex Williamson
  0 siblings, 2 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-01 23:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote:
> On Tue, 30 Nov 2021 23:14:07 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote:
> > 
> > > > From what HNS said the device driver would have to trap every MMIO to
> > > > implement NDMA as it must prevent touches to the physical HW MMIO to
> > > > maintain the NDMA state.
> > > > 
> > > > The issue is that the HW migration registers can stop processing the
> > > > queue and thus enter NDMA but a MMIO touch can resume queue
> > > > processing, so NDMA cannot be sustained.
> > > > 
> > > > Trapping every MMIO would have a huge negative performance impact.  So
> > > > it doesn't make sense to do so for a device that is not intended to be
> > > > used in any situation where NDMA is required.  
> > > 
> > > But migration is a cooperative activity with userspace.  If necessary
> > > we can impose a requirement that mmap access to regions (other than the
> > > migration region itself) are dropped when we're in the NDMA or !RUNNING
> > > device_state.    
> > 
> > It is always NDMA|RUNNING, so we can't fully drop access to
> > MMIO. Userspace would have to transfer from direct MMIO to
> > trapping. With enough new kernel infrastructure and qemu support it
> > could be done.
> 
> This is simply toggling whether the mmap MemoryRegion in QEMU is
> enabled, when not enabled access falls through to the read/write access
> functions.  We already have this functionality for unmasking INTx
> interrupts when we don't have EOI support.
>
> > Even so, we can't trap accesses through the IOMMU so such a scheme
> > would still require removing IOMMU acess to the device. Given that the
> > basic qemu mitigation for no NDMA support is to eliminate P2P cases by
> > removing the IOMMU mappings this doesn't seem to advance anything and
> > only creates complexity.
> 
> NDMA only requires that the device cease initiating DMA, so I suppose
> you're suggesting that the MMIO of a device that doesn't expect to make
> use of p2p DMA could be poked through DMA, which might cause the device
> to initiate a DMA and potentially lose sync with mediation.  That would
> be bad, but seems like a non-issue for hns.

Yes, not sure how you get to a non-issue though? If the IOMMU map is
present the huawei device can be attacked by a hostile VM and forced
to exit NDMA. All this takes is any two DMA capable devices to be
plugged in?

> > At least I'm not going to insist that hns do all kinds of work like
> > this for a edge case they don't care about as a precondition to get a
> > migration driver.
> 
> I wonder if we need a region flag to opt-out of IOMMU mappings for
> devices that do not support p2p use cases.  If there were no IOMMU
> mappings and mediation handled CPU driven MMIO accesses, then we'd have
> a viable NDMA mode for hns.

This is a good idea, if we want to make huawei support NDMA then
flagging it to never be in the iommu map in the first place is a great
solution. Then they can use CPU mmio trapping get the rest of the way.

> > > There's no reason that mediation while in the NDMA state needs to
> > > impose any performance penalty against the default RUNNING state.   
> > 
> > Eh? Mitigation of no NDMA support would have to mediate the MMIO on a
> > a performance doorbell path, there is no escaping a performance
> > hit. I'm not sure what you mean
> 
> Read it again, I'm suggesting that mediation during NDMA doesn't need
> to carry over any performance penalty to the default run state.  We
> don't care if mediation imposes a performance penalty while NDMA is set,
> we're specifically attempting to quiesce the device.  The slower it
> runs the better ;)

OK, I don't read it like that. It seems OK to have a performance hit
in NDMA since it is only a short grace state.

> > It would make userspace a bit simpler at the cost of excluding or
> > complicating devices like hns for a use case they don't care about.
> > 
> > On the other hand, the simple solution in qemu is when there is no
> > universal NDMA it simply doesn't include any MMIO ranges in the
> > IOMMU.
> 
> That leads to mysterious performance penalties when a VM was previously
> able to make use of (ex.) GPUDirect, 

Not a penalty it will just explode somehow. There is no way right now
for a VM to know P2P doesn't work. It is one of these annoying things
that leaks to the VMM like the no-snoop mess. A VMM installing a
device combination that is commonly used with P2P, like a GPU and a
NIC, had a better make sure P2P works :)

> but adds an hns device and suddenly can no longer use p2p.  Or
> rejected hotplugs if a device has existing NDMA capable devices, p2p
> might be active, but the new device does not support NDMA.  

I don't think qemu can go back on what it already did, so rejected
hotplug seems the only choice.

> This all makes it really complicated to get deterministic behavior
> for devices.  I don't know how to make QEMU behavior predictable and
> supportable in such an environment.

And this is the thing that gives me pause to think maybe the huawei
device should do the extra work?

On the other hand I suspect their use case is fine with qemu set to
disable P2P completely.

OTOH "supportable" qemu could certainly make the default choice to
require devices for simplicity.

> > Since qemu is the only implementation it would be easy for drivers to
> > rely on the implicit reset it seems to do, it seems an important point
> > that should be written either way.
> > 
> > I don't have a particular requirement to have the reset, but it does
> > seem like a good idea. If you feel strongly, then let's say the
> > opposite that the driver must enter RESUME with no preconditions,
> > doing an internal reset if required.
> 
> It seems cleaner to me than unnecessarily requiring userspace to pass
> through an ioctl in order to get to the next device_state.

Ok, I'll add something like this.

> > Can you point to something please? I can't work with "I'm not sure"
> 
> The reset behavior that I'm trying to clarify above is the primary
> offender, but "I'm not sure" I understand the bit prioritization enough
> to know that there isn't something there as well.  I'm also not sure if
> the "end of stream" phrasing below matches the uAPI.

Realistically I think userspace should not make use of the bit
prioritization. It is more as something driver implementors should
follow for consistent behavior.
 
> > IMO the header file doesn't really say much and can be read in a way
> > that is consistent with this more specific document.
> 
> But if this document is suggesting the mlx5/QEMU interpretation is the
> only valid interpretations for driver authors, those clarifications
> should be pushed back into the uAPI header.

Can we go the other way and move more of the uAPI header text here?

> > I have no idea anymore. You asked for docs and complete picture as a
> > percondition for merging a driver. Here it is.
> > 
> > What do you want?
> 
> Looking through past conversations, I definitely asked that we figure
> out how NDMA is going to work.  Are you thinking of a different request
> from me?

It was part of the whole etherpad thing. This was what we said we'd do
to resolve the discussion.

I expect to come to some agreement with you and Connie on this text
and we will go ahead.
 
> The restriction implied by lack of NDMA support are pretty significant.
> Maybe a given device like hns doesn't care because they don't intend to
> support p2p, but they should care if it means we can't hot-add their
> device to a VM in the presences of devices that can support p2p and if
> cold plugging their device into an existing configuration implies loss
> of functionality or performance elsewhere.
> 
> I'd tend to expect though that we'd incorporate NDMA documentation into
> the uAPI with a formal proposal for discovery and outline a number of
> those usage implications.

Yishai made a patch, but we have put the discussion of NDMA here, not
hidden in a commit message
 
> > > We've tried to define a specification that's more flexible than a
> > > single implementation and by these standards we seem to be flipping
> > > that implementation back into the specification.  
> > 
> > What specification!?! All we have is a couple lines in a header file
> > that is no where near detailed enough for multi-driver
> > interoperability with userspace. You have no idea how much effort has
> > been expended to get this far based on the few breadcrumbs that were
> > left, and we have access to the team that made the only other
> > implementation!
> > 
> > *flexible* is not a specification.
> 
> There are approximately 220 lines of the uAPI header file dealing
> specifically with the migration region.  A bit more than a couple.

Unfortunately more than half of that describes how the data window
works, and half of the rest is kind of obvious statements.

> We've tried to define it by looking at the device requirements to
> support migration rather than tailor it specifically to the current QEMU
> implementation of migration.  Attempting to undo that generality by
> suggesting only current usage patterns are relevant is therefore going
> to generate friction.

In your mind you see generality, in our mind we want to know how to
write an inter operable driver and there is no documention saying how
to do that.

> > > Userspace can attempt RESUMING -> RUNNING regardless of what we specify,
> > > so a driver needs to be prepared for such an attempted state change
> > > either way.  So what's the advantage to telling a driver author that
> > > they can expect a given behavior?  
> > 
> > The above didn't tell a driver author to expect a certain behavior, it
> > tells userspace what to do.
> 
>   "The migration driver can rely on user-space issuing a
>    VFIO_DEVICE_RESET prior to starting RESUMING."

I trimmed too much, the original text you quoted was

"To abort a RESUMING issue a VFIO_DEVICE_RESET."

Which I still think is fine.

> Tracing that shows a reset preceding entering RESUMING doesn't suggest
> to me that QEMU is performing a reset for the specific purpose of
> entering RESUMING.  Correlation != causation.

Kernel doesn't care why qemu did it - it was done. Intent doesn't
matter :)

> The order I see in the v5 mlx5 post is:
> 
> if RUNNING 1->0
>   quiesce + freeze
> if RUNNING or SAVING change && state == !RUNNING | SAVING
>   save device state
> if RESUMING 0->1
>   reset device state
> if RESUMING 1->0
>   load device state
> if RUNNING 0->1
>   unfreeze + unquiesce

Right, which matches the text:

 - !RUNNING
 - SAVING | !RUNNING
 - RESUMING
 - !RESUMING
 - RUNNING
 
> So maybe part of my confusion stems from the fact that the mlx5 driver
> doesn't support pre-copy, which by the provided list is the highest
> priority.  

Right.

> But what actually makes that the highest priority?  Any
> combination of SAVING and RESUMING is invalid, so we can eliminate
> those.  We obviously can't have RUNNING and !RUNNING, so we can
> eliminate all cases of !RUNNING, so we can shorten the list relativeto
> prioritizing SAVING|RUNNING to:

There are several orders that can make sense. What we've found is
following the reference flow order has given something workable for
precedence.

> SAVING | RUNNING would need to be processed after !RESUMING, but
> maybe before RUNNING itself.

It is a good point, it does make more sense after RUNNING as a device
should be already RUNNING before entering pre-copy. I moved it to
before !NDMA

> NDMA only requires that the device cease initiating DMA before the call
> returns and it only has meaning if RUNNING, so I'm not sure why setting
> NDMA is given any priority.  I guess maybe we're trying
> (unnecessarily?) to preempt any DMA that might occur as a result of
> setting RUNNING (so why is it above cases including !RUNNING?)?

Everything was given priority so there is no confusing omission. For
the order, as NDMA has no meaning outside RUNNING, it makes sense you'd
do a NDMA before making it meaningless / after making it meaningful.

This is made concrete by mlx5's scheme that always requires quiesce
(NDMA) to happen before freeze (!RUNNING) and viceversa, so we get to
this order. mlx5 implicitly does NDMA on !RUNNING 

> Obviously presenting a priority scheme without a discussion of the
> associativity of the states and a barely sketched out nomenclature is
> really not inviting an actual understanding how this is reasoned (imo).

Sure, but how we got here isn't really important to the intent of the
document to guide implementors.

Well, you wrote a lot, and found a correction, but I haven't been left
with a way to write this more clearly? Now that you obviously
understand what it is saying, what do you think?

> I'm feeling like there's a bit of a chicken and egg problem here to
> decide if this is sufficiently clear documentation before a new posting
> of the mlx5 driver where the mlx5 driver is the authoritative source
> for the reasoning of the priority scheme documented here (and doesn't
> make use of pre-copy).

I wouldn't fixate on the ordering, it is a small part of the
document..

> The migration data stream is entirely opaque to userspace, so what's
> the benefit to userspace to suggest anything about the content in each
> phase?  This is presented in a userspace edge concerns section, but the
> description is much more relevant to a driver author.

It is informative for the device driver author to understand what
device functionality to map to this.

> > > I think the fact that a user is not required to run the pre-copy
> > > phase until completion is also noteworthy.  
> > 
> > This text doesn't try to detail how the migration window works, that
> > is a different large task. The intention is that the migration window
> > must be fully drained to be successful.
> 
> Clarification, the *!RUNNING* migration window must be fully drained.
> 
> > I added this for some clarity ""The entire migration data, up to each
> > end of stream must be transported from the saving to resuming side.""
> 
> Per the uAPI regarding pre-copy:
> 
>   "The user must not be required to consume all migration data before
>   the device transitions to a new state, including the stop-and-copy
>   state."
> 
> If "end of stream" suggests the driver defined end of the data stream
> for pre-copy rather than simply the end of the user accumulated data
> stream, that conflicts with the uAPI.  Thanks,

Hmm, yes. I can try to clarify how this all works better. We don't
implement pre-copy but it should still be described better than it has
been.

I'm still not sure how this works. 

We are in SAVING|RUNNING and we dump all the dirty data and return end
of stream.

We stay in SAVING|RUNNING and some more device state became dirty. How
does userspace know? Should it poll and see if the stream got longer?

Below is what I collected from your feedback so far

Thanks,
Jason

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index d9be47570f878c..2ff47823a889b4 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -258,7 +258,9 @@ Setting/clearing bit groups triggers device action, and each bit controls a
 continuous device behavior.
 
 Along with the device_state the migration driver provides a data window which
-allows streaming migration data into or out of the device.
+allows streaming migration data into or out of the device. The entire
+migration data, up to the end of stream must be transported from the saving to
+resuming side.
 
 A lot of flexibility is provided to user-space in how it operates these
 bits. What follows is a reference flow for saving device state in a live
@@ -299,12 +301,9 @@ entities (VCPU_RUNNING and DIRTY_TRACKING) the VMM controls fit in.
 and the reference flow for resuming:
 
   RUNNING
-     Issue VFIO_DEVICE_RESET to clear the internal device state
-  0
-     Device is halted
+     Use ioctl(VFIO_GROUP_GET_DEVICE_FD) to obtain a fresh device
   RESUMING
-     Push in migration data. Data captured during pre-copy should be
-     prepended to data captured during SAVING.
+     Push in migration data.
   NDMA | RUNNING
      Peer to Peer DMA grace state
   RUNNING, VCPU_RUNNING
@@ -315,48 +314,73 @@ states act as cross device synchronization points. The VMM must bring all
 devices to the grace state before advancing past it.
 
 The above reference flows are built around specific requirements on the
-migration driver for its implementation of the migration_state input.
+migration driver for its implementation of the device_state input.
 
-The migration_state cannot change asynchronously, upon writing the
-migration_state the driver will either keep the current state and return
+The device_state cannot change asynchronously, upon writing the
+device_state the driver will either keep the current state and return
 failure, return failure and go to ERROR, or succeed and go to the new state.
 
-Event triggered actions happen when user-space requests a new migration_state
-that differs from the current migration_state. Actions happen on a bit group
+Event triggered actions happen when user-space requests a new device_state
+that differs from the current device_state. Actions happen on a bit group
 basis:
 
+ SAVING
+   The device clears the data window and prepares to stream migration data.
+   The entire data from the start of SAVING to the end of stream is transfered
+   to the other side to execute a resume.
+
  SAVING | RUNNING
-   The device clears the data window and begins streaming 'pre copy' migration
-   data through the window. Devices that cannot log internal state changes
-   return a 0 length migration stream.
+   The device beings streaming 'pre copy' migration data through the window.
+
+   A device that does not support internal state logging should return a 0
+   length stream.
+
+   The migration window may reach an end of stream, this can be a permanent or
+   temporary condition.
+
+   User space can do SAVING | !RUNNING at any time, any in progress transfer
+   through the migration window is carried forward.
+
+   This allows the device to implement a dirty log for its internal state.
+   During this state the data window should present the device state being
+   logged and during SAVING | !RUNNING the data window should transfer the
+   dirtied state and conclude the migration data.
+
+   The state is only concerned with internal device state. External DMAs are
+   covered by the separate DIRTY_TRACKING function.
 
  SAVING | !RUNNING
-   The device captures its internal state that is not covered by internal
-   logging, as well as any logged changes.
+   The device captures its internal state and streams it through the
+   migration window.
 
-   The device clears the data window and begins streaming the captured
-   migration data through the window. Devices that cannot log internal state
-   changes stream all their device state here.
+   When the migration window reaches an end of stream the saving is concluded
+   and there is no further data. All of the migration data streamed from the
+   time SAVING starts to this final end of stream is concatenated together
+   and provided to RESUMING.
+
+   Devices that cannot log internal state changes stream all their device
+   state here.
 
  RESUMING
    The data window is cleared, opened, and can receive the migration data
-   stream.
+   stream. The device must always be able to enter resuming and it may reset
+   the device to do so.
 
  !RESUMING
    All the data transferred into the data window is loaded into the device's
-   internal state. The migration driver can rely on user-space issuing a
-   VFIO_DEVICE_RESET prior to starting RESUMING.
+   internal state.
 
-   To abort a RESUMING issue a VFIO_DEVICE_RESET.
+   The internal state of a device is undefined while in RESUMING. To abort a
+   RESUMING and return to a known state issue a VFIO_DEVICE_RESET.
 
    If the migration data is invalid then the ERROR state must be set.
 
-Continuous actions are in effect when migration_state bit groups are active:
+Continuous actions are in effect when device_state bit groups are active:
 
  RUNNING | NDMA
    The device is not allowed to issue new DMA operations.
 
-   Whenever the kernel returns with a migration_state of NDMA there can be no
+   Whenever the kernel returns with a device_state of NDMA there can be no
    in progress DMAs.
 
  !RUNNING
@@ -384,24 +408,24 @@ Continuous actions are in effect when migration_state bit groups are active:
    during ERROR to avoid corrupting other devices or a VM during a failed
    migration.
 
-When multiple bits change in the migration_state they may describe multiple
-event triggered actions, and multiple changes to continuous actions.  The
-migration driver must process them in a priority order:
+When multiple bits change in the device_state they may describe multiple event
+triggered actions, and multiple changes to continuous actions.  The migration
+driver must process the new device_state bits in a priority order:
 
- - SAVING | RUNNING
  - NDMA
  - !RUNNING
  - SAVING | !RUNNING
  - RESUMING
  - !RESUMING
  - RUNNING
+ - SAVING | RUNNING
  - !NDMA
 
 In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
 device back to device_state RUNNING. When a migration driver executes this
-ioctl it should discard the data window and set migration_state to RUNNING as
+ioctl it should discard the data window and set device_state to RUNNING as
 part of resetting the device to a clean state. This must happen even if the
-migration_state has errored. A freshly opened device FD should always be in
+device_state has errored. A freshly opened device FD should always be in
 the RUNNING state.
 
 The migration driver has limitations on what device state it can affect. Any
@@ -438,8 +462,9 @@ implementing migration:
   As Peer to Peer DMA is a MMIO touch like any other, it is important that
   userspace suspend these accesses before entering any device_state where MMIO
   is not permitted, such as !RUNNING. This can be accomplished with the NDMA
-  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
-  device does not support NDMA and rely on that to guarantee quiet MMIO.
+  state. Userspace may also choose to never install MMIO mappings into the
+  IOMMU if devices do not support NDMA and rely on that to guarantee quiet
+  MMIO.
 
   The Peer to Peer Grace States exist so that all devices may reach RUNNING
   before any device is subjected to a MMIO access.
@@ -458,16 +483,6 @@ implementing migration:
   Device that do not support NDMA cannot be configured to generate page faults
   that require the VCPU to complete.
 
-- pre-copy allows the device to implement a dirty log for its internal state.
-  During the SAVING | RUNNING state the data window should present the device
-  state being logged and during SAVING | !RUNNING the data window should present
-  the unlogged device state as well as the changes from the internal dirty log.
-
-  On RESUME these two data streams are concatenated together.
-
-  pre-copy is only concerned with internal device state. External DMAs are
-  covered by the separate DIRTY_TRACKING function.
-
 - Atomic Read and Clear of the DMA log is a HW feature. If the tracker
   cannot support this, then NDMA could be used to synthesize it less
   efficiently.
@@ -476,14 +491,23 @@ implementing migration:
   are pushed down to the next step in the sequence and various behaviors that
   rely on NDMA cannot be used.
 
-- Migration control registers inside the same iommu_group as the VFIO device.
-  This immediately raises a security concern as user-space can use Peer to Peer
-  DMA to manipulate these migration control registers concurrently with
+  NDMA is made optional to support simple HW implementations that either just
+  cannot do NDMA, or cannot do NDMA without a performance cost. NDMA is only
+  necessary for special features like P2P and PRI, so devices can omit it in
+  exchange for limitations on the guest.
+
+- Devices that have their HW migration control MMIO registers inside the same
+  iommu_group as the VFIO device have some special considerations. In this
+  case a driver will be operating HW registers from kernel space that are also
+  subjected to userspace controlled DMA due to the iommu_group.
+
+  This immediately raises a security concern as user-space can use Peer to
+  Peer DMA to manipulate these migration control registers concurrently with
   any kernel actions.
 
   A device driver operating such a device must ensure that kernel integrity
   cannot be broken by hostile user space operating the migration MMIO
-  registers via peer to peer, at any point in the sequence. Notably the kernel
+  registers via peer to peer, at any point in the sequence. Further the kernel
   cannot use DMA to transfer any migration data.
 
   However, as discussed above in the "Device Peer to Peer DMA" section, it can

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-01 23:25           ` Jason Gunthorpe
@ 2021-12-02 17:05             ` Cornelia Huck
  2021-12-02 17:41               ` Jason Gunthorpe
  2021-12-03 18:06             ` Alex Williamson
  1 sibling, 1 reply; 29+ messages in thread
From: Cornelia Huck @ 2021-12-02 17:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy,
	Shameer Kolothum, Yishai Hadas

On Wed, Dec 01 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote:
>> But if this document is suggesting the mlx5/QEMU interpretation is the
>> only valid interpretations for driver authors, those clarifications
>> should be pushed back into the uAPI header.
>
> Can we go the other way and move more of the uAPI header text here?

Where should a userspace author look when they try to implement support
for vfio migration? I think we need to answer that question first.

Maybe we should separate "these are the rules that an implementation
must obey" from "here's a more verbose description of how things work,
and how you can arrive at a working implementation". The former would go
into the header, while the latter can go into this document. (The
generated documentation can be linked from the header file.)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-02 17:05             ` Cornelia Huck
@ 2021-12-02 17:41               ` Jason Gunthorpe
  2021-12-02 17:46                 ` Cornelia Huck
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-02 17:41 UTC (permalink / raw)
  To: Cornelia Huck, Jonathan Corbet, linux-doc
  Cc: Alex Williamson, kvm, Kirti Wankhede, Max Gurtovoy,
	Shameer Kolothum, Yishai Hadas

On Thu, Dec 02, 2021 at 06:05:36PM +0100, Cornelia Huck wrote:
> On Wed, Dec 01 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote:
> >> But if this document is suggesting the mlx5/QEMU interpretation is the
> >> only valid interpretations for driver authors, those clarifications
> >> should be pushed back into the uAPI header.
> >
> > Can we go the other way and move more of the uAPI header text here?
> 
> Where should a userspace author look when they try to implement support
> for vfio migration? I think we need to answer that question first.
> 
> Maybe we should separate "these are the rules that an implementation
> must obey" from "here's a more verbose description of how things work,
> and how you can arrive at a working implementation". The former would go
> into the header, while the latter can go into this document. (The
> generated documentation can be linked from the header file.)

I think the usual kernel expectation now is to find userspace
information either in man pages or in the Documentation/ html pages?

The uapi header is fine to be a terse summary of what the ioctl does
and some important points, but I wouldn't try to write a spec for
anything complicated in a header file.

Maybe Jonathan has an advice?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-02 17:41               ` Jason Gunthorpe
@ 2021-12-02 17:46                 ` Cornelia Huck
  0 siblings, 0 replies; 29+ messages in thread
From: Cornelia Huck @ 2021-12-02 17:46 UTC (permalink / raw)
  To: Jason Gunthorpe, Jonathan Corbet, linux-doc
  Cc: Alex Williamson, kvm, Kirti Wankhede, Max Gurtovoy,
	Shameer Kolothum, Yishai Hadas

On Thu, Dec 02 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Dec 02, 2021 at 06:05:36PM +0100, Cornelia Huck wrote:
>> On Wed, Dec 01 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
>> 
>> > On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote:
>> >> But if this document is suggesting the mlx5/QEMU interpretation is the
>> >> only valid interpretations for driver authors, those clarifications
>> >> should be pushed back into the uAPI header.
>> >
>> > Can we go the other way and move more of the uAPI header text here?
>> 
>> Where should a userspace author look when they try to implement support
>> for vfio migration? I think we need to answer that question first.
>> 
>> Maybe we should separate "these are the rules that an implementation
>> must obey" from "here's a more verbose description of how things work,
>> and how you can arrive at a working implementation". The former would go
>> into the header, while the latter can go into this document. (The
>> generated documentation can be linked from the header file.)
>
> I think the usual kernel expectation now is to find userspace
> information either in man pages or in the Documentation/ html pages?
>
> The uapi header is fine to be a terse summary of what the ioctl does
> and some important points, but I wouldn't try to write a spec for
> anything complicated in a header file.

I was thinking less of a complete spec, more of "these are the fields
with some basic rules, consult $LINK for more information".


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-01 23:25           ` Jason Gunthorpe
  2021-12-02 17:05             ` Cornelia Huck
@ 2021-12-03 18:06             ` Alex Williamson
  2021-12-06 16:03               ` Cornelia Huck
  2021-12-06 19:15               ` Jason Gunthorpe
  1 sibling, 2 replies; 29+ messages in thread
From: Alex Williamson @ 2021-12-03 18:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Wed, 1 Dec 2021 19:25:02 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote:
> > On Tue, 30 Nov 2021 23:14:07 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote:
> > >   
> > > > > From what HNS said the device driver would have to trap every MMIO to
> > > > > implement NDMA as it must prevent touches to the physical HW MMIO to
> > > > > maintain the NDMA state.
> > > > > 
> > > > > The issue is that the HW migration registers can stop processing the
> > > > > queue and thus enter NDMA but a MMIO touch can resume queue
> > > > > processing, so NDMA cannot be sustained.
> > > > > 
> > > > > Trapping every MMIO would have a huge negative performance impact.  So
> > > > > it doesn't make sense to do so for a device that is not intended to be
> > > > > used in any situation where NDMA is required.    
> > > > 
> > > > But migration is a cooperative activity with userspace.  If necessary
> > > > we can impose a requirement that mmap access to regions (other than the
> > > > migration region itself) are dropped when we're in the NDMA or !RUNNING
> > > > device_state.      
> > > 
> > > It is always NDMA|RUNNING, so we can't fully drop access to
> > > MMIO. Userspace would have to transfer from direct MMIO to
> > > trapping. With enough new kernel infrastructure and qemu support it
> > > could be done.  
> > 
> > This is simply toggling whether the mmap MemoryRegion in QEMU is
> > enabled, when not enabled access falls through to the read/write access
> > functions.  We already have this functionality for unmasking INTx
> > interrupts when we don't have EOI support.
> >  
> > > Even so, we can't trap accesses through the IOMMU so such a scheme
> > > would still require removing IOMMU acess to the device. Given that the
> > > basic qemu mitigation for no NDMA support is to eliminate P2P cases by
> > > removing the IOMMU mappings this doesn't seem to advance anything and
> > > only creates complexity.  
> > 
> > NDMA only requires that the device cease initiating DMA, so I suppose
> > you're suggesting that the MMIO of a device that doesn't expect to make
> > use of p2p DMA could be poked through DMA, which might cause the device
> > to initiate a DMA and potentially lose sync with mediation.  That would
> > be bad, but seems like a non-issue for hns.  
> 
> Yes, not sure how you get to a non-issue though? If the IOMMU map is
> present the huawei device can be attacked by a hostile VM and forced
> to exit NDMA. All this takes is any two DMA capable devices to be
> plugged in?

I'm confused by your idea of a "hostile VM".  Devices within a VM can
only DMA to memory owned by the VM.  So the only hostile VM that can
attack this device is the VM that owns the device in the first place.
Meanwhile the VMM is transitioning all devices to the NDMA state, so
the opportunity to poke one device from another device within the same
VM is pretty limited.  And for what gain?  Is this a DoS opportunity
that the guest might trigger the migration to abort or fail?

I suggest this might be a non-issue for hns because I understand it to
not support any sort of p2p, therefore this DMA write to the device
while it's under mediation would never happen in practice.  Skipping
the IOMMU mappings for the device would prevent it, but I'm still not
sure if this goes beyond a VM shooting itself scenario.

> > > At least I'm not going to insist that hns do all kinds of work like
> > > this for a edge case they don't care about as a precondition to get a
> > > migration driver.  
> > 
> > I wonder if we need a region flag to opt-out of IOMMU mappings for
> > devices that do not support p2p use cases.  If there were no IOMMU
> > mappings and mediation handled CPU driven MMIO accesses, then we'd have
> > a viable NDMA mode for hns.  
> 
> This is a good idea, if we want to make huawei support NDMA then
> flagging it to never be in the iommu map in the first place is a great
> solution. Then they can use CPU mmio trapping get the rest of the way.

NB, this flag would only be a hint to the VMM, the vfio IOMMU backend
would need to be involved if we intended to prevent such mappings, but
that seems like overkill if we're expecting userspace to follow the
rules to get working migration and migration drivers have sufficient
robustness to maintain device isolation.

> > > > There's no reason that mediation while in the NDMA state needs to
> > > > impose any performance penalty against the default RUNNING state.     
> > > 
> > > Eh? Mitigation of no NDMA support would have to mediate the MMIO on a
> > > a performance doorbell path, there is no escaping a performance
> > > hit. I'm not sure what you mean  
> > 
> > Read it again, I'm suggesting that mediation during NDMA doesn't need
> > to carry over any performance penalty to the default run state.  We
> > don't care if mediation imposes a performance penalty while NDMA is set,
> > we're specifically attempting to quiesce the device.  The slower it
> > runs the better ;)  
> 
> OK, I don't read it like that. It seems OK to have a performance hit
> in NDMA since it is only a short grace state.

Is it the length of time we're in the NDMA state or the utility of the
NDMA state itself?  If we're telling the device "no more outbound DMA,
no more interrupts", then it doesn't seem super relevant for the
majority of use cases what happens to the MMIO latency. 

> > > It would make userspace a bit simpler at the cost of excluding or
> > > complicating devices like hns for a use case they don't care about.
> > > 
> > > On the other hand, the simple solution in qemu is when there is no
> > > universal NDMA it simply doesn't include any MMIO ranges in the
> > > IOMMU.  
> > 
> > That leads to mysterious performance penalties when a VM was previously
> > able to make use of (ex.) GPUDirect,   
> 
> Not a penalty it will just explode somehow. There is no way right now
> for a VM to know P2P doesn't work. It is one of these annoying things
> that leaks to the VMM like the no-snoop mess. A VMM installing a
> device combination that is commonly used with P2P, like a GPU and a
> NIC, had a better make sure P2P works :)
> 
> > but adds an hns device and suddenly can no longer use p2p.  Or
> > rejected hotplugs if a device has existing NDMA capable devices, p2p
> > might be active, but the new device does not support NDMA.    
> 
> I don't think qemu can go back on what it already did, so rejected
> hotplug seems the only choice.

Agreed

> > This all makes it really complicated to get deterministic behavior
> > for devices.  I don't know how to make QEMU behavior predictable and
> > supportable in such an environment.  
> 
> And this is the thing that gives me pause to think maybe the huawei
> device should do the extra work?
> 
> On the other hand I suspect their use case is fine with qemu set to
> disable P2P completely.

I don't know how we look at that use case in isolation though.

> OTOH "supportable" qemu could certainly make the default choice to
> require devices for simplicity.

I get a bit lost every time I try to sketch out how QEMU would
implement it.  Forgive the stream of consciousness and rhetorical
discussion below...

 - Does it make sense that a device itself can opt-out of p2p mappings?
   This is simply an MMIO access from another device rather than from
   the CPU.  Vendors cannot preclude this use case on bare metal,
   should they be able to in a VM?  We create heterogeneous p2p maps,
   device A can access device B, but device B maybe can't access device
   A.  Seems troublesome.

 - If we can't have a per-device policy, then we'd need a per VM
   policy, likely some way to opt-out of all p2p mappings for vfio
   devices.  We need to support hotplug of devices though, so is it a
   per VM policy or is it a per device policy which needs to be
   consistent among all attached devices?  Perhaps a
   "enable_p2p=on/off" option on the vfio-pci device, default [on] to
   match current behavior.  For any case of this option being set to
   non-default, all devices would need to set it to the same value,
   non-compliant devices rejected.

 - We could possibly allow migration=on,enable_p2p=on for a non-NDMA
   device, but the rules change if additional devices are added, they
   need to be rejected or migration support implicitly disappears.  That
   seems hard to document, non-deterministic as far as a user is
   concerned. So maybe for a non-NDMA device we'd require
   enable_p2p=off in order to set migration=on.  That means we could
   never enable migration on non-NDMA devices by default, which
   probably means we also cannot enable it by default on NDMA devices
   or we get user confusion/inconsistency.

 - Can a user know which devices will require enable_p2p=off in order
   to set migration=on?  "Read the error log" is a poor user experience
   and difficult hurdle for libvirt.

So in order to create a predictable QEMU experience in the face of
optional NDMA per device, I think we preclude being able to enable
migration support for any vfio device by default and we have an
exercise to determine how a user or management tool could easily
determine NDMA compatibility :-\

> > > Since qemu is the only implementation it would be easy for drivers to
> > > rely on the implicit reset it seems to do, it seems an important point
> > > that should be written either way.
> > > 
> > > I don't have a particular requirement to have the reset, but it does
> > > seem like a good idea. If you feel strongly, then let's say the
> > > opposite that the driver must enter RESUME with no preconditions,
> > > doing an internal reset if required.  
> > 
> > It seems cleaner to me than unnecessarily requiring userspace to pass
> > through an ioctl in order to get to the next device_state.  
> 
> Ok, I'll add something like this.
> 
> > > Can you point to something please? I can't work with "I'm not sure"  
> > 
> > The reset behavior that I'm trying to clarify above is the primary
> > offender, but "I'm not sure" I understand the bit prioritization enough
> > to know that there isn't something there as well.  I'm also not sure if
> > the "end of stream" phrasing below matches the uAPI.  
> 
> Realistically I think userspace should not make use of the bit
> prioritization. It is more as something driver implementors should
> follow for consistent behavior.
>  
> > > IMO the header file doesn't really say much and can be read in a way
> > > that is consistent with this more specific document.  
> > 
> > But if this document is suggesting the mlx5/QEMU interpretation is the
> > only valid interpretations for driver authors, those clarifications
> > should be pushed back into the uAPI header.  
> 
> Can we go the other way and move more of the uAPI header text here?

Currently our Documentation/ file describes an overview of vfio, the
high level theory of groups, devices, and iommus, provides a detailed
userspace usage example, an overview of device registration and ops,
and a bunch of random spapr implementation notes largely around
userspace considerations on power platforms.

So I would generalize that our Documentation/ is currently focused on
userspace access and interaction with the uAPI.  The uAPI itself
therefore generally provides the detail regarding how a given interface
is intended to work from an implementation perspective.  I'd say the
proposed document here is mixing both of those in Documentation/, some
aspects relevant to implementation of the uAPI, others to userspace.

Currently I would clearly say that the uAPI header is the canonical
source of truth and I think it should continue to be so for describing
the implementation of the uAPI.

> > > I have no idea anymore. You asked for docs and complete picture as a
> > > percondition for merging a driver. Here it is.
> > > 
> > > What do you want?  
> > 
> > Looking through past conversations, I definitely asked that we figure
> > out how NDMA is going to work.  Are you thinking of a different request
> > from me?  
> 
> It was part of the whole etherpad thing. This was what we said we'd do
> to resolve the discussion.
> 
> I expect to come to some agreement with you and Connie on this text
> and we will go ahead.
>  
> > The restriction implied by lack of NDMA support are pretty significant.
> > Maybe a given device like hns doesn't care because they don't intend to
> > support p2p, but they should care if it means we can't hot-add their
> > device to a VM in the presences of devices that can support p2p and if
> > cold plugging their device into an existing configuration implies loss
> > of functionality or performance elsewhere.
> > 
> > I'd tend to expect though that we'd incorporate NDMA documentation into
> > the uAPI with a formal proposal for discovery and outline a number of
> > those usage implications.  
> 
> Yishai made a patch, but we have put the discussion of NDMA here, not
> hidden in a commit message
>  
> > > > We've tried to define a specification that's more flexible than a
> > > > single implementation and by these standards we seem to be flipping
> > > > that implementation back into the specification.    
> > > 
> > > What specification!?! All we have is a couple lines in a header file
> > > that is no where near detailed enough for multi-driver
> > > interoperability with userspace. You have no idea how much effort has
> > > been expended to get this far based on the few breadcrumbs that were
> > > left, and we have access to the team that made the only other
> > > implementation!
> > > 
> > > *flexible* is not a specification.  
> > 
> > There are approximately 220 lines of the uAPI header file dealing
> > specifically with the migration region.  A bit more than a couple.  
> 
> Unfortunately more than half of that describes how the data window
> works, and half of the rest is kind of obvious statements.

How does that rationalize forking implementation details to userspace
Documentation/ rather than resolving the issues you see with the uAPI
description?

> > We've tried to define it by looking at the device requirements to
> > support migration rather than tailor it specifically to the current QEMU
> > implementation of migration.  Attempting to undo that generality by
> > suggesting only current usage patterns are relevant is therefore going
> > to generate friction.  
> 
> In your mind you see generality, in our mind we want to know how to
> write an inter operable driver and there is no documention saying how
> to do that.

I didn't know there was a hive mind over there, maybe that explains
your quick replies ;)

There's a fine line between writing an inter-operable driver and
writing a driver for the current QEMU implementation.  Obviously we want
to support the current QEMU implementation, but we want an interface
that can accommodate how that implementation might evolve.  Once we
start telling driver authors to expect specific flows rather than
looking at the operation of each bit, then our implementations become
more fragile, less versatile relative to the user.

> > > > Userspace can attempt RESUMING -> RUNNING regardless of what we specify,
> > > > so a driver needs to be prepared for such an attempted state change
> > > > either way.  So what's the advantage to telling a driver author that
> > > > they can expect a given behavior?    
> > > 
> > > The above didn't tell a driver author to expect a certain behavior, it
> > > tells userspace what to do.  
> > 
> >   "The migration driver can rely on user-space issuing a
> >    VFIO_DEVICE_RESET prior to starting RESUMING."  
> 
> I trimmed too much, the original text you quoted was
> 
> "To abort a RESUMING issue a VFIO_DEVICE_RESET."
> 
> Which I still think is fine.

If we're writing a specification, that's really a MAY statement,
userspace MAY issue a reset to abort the RESUMING process and return
the device to RUNNING.  They MAY also write the device_state directly,
which MAY return an error depending on various factors such as whether
data has been written to the migration state and whether that data is
complete.  If a failed transitions results in an ERROR device_state,
the user MUST issue a reset in order to return it to a RUNNING state
without closing the interface.

A recommendation to use reset to skip over potential error conditions
when the goal is simply a new, clean RUNNING state irrespective of data
written to the migration region, is fine.  But drivers shouldn't be
written with only that expectation, just like they shouldn't expect a
reset precondition to entering RESUMING.
 
> > Tracing that shows a reset preceding entering RESUMING doesn't suggest
> > to me that QEMU is performing a reset for the specific purpose of
> > entering RESUMING.  Correlation != causation.  
> 
> Kernel doesn't care why qemu did it - it was done. Intent doesn't
> matter :)

This is exactly the sort of "designed for QEMU implementation"
inter-operability that I want to avoid.  It doesn't take much of a
crystal ball to guess that gratuitous and redundant device resets slow
VM instantiation and are a likely target for optimization.

> > The order I see in the v5 mlx5 post is:
> > 
> > if RUNNING 1->0
> >   quiesce + freeze
> > if RUNNING or SAVING change && state == !RUNNING | SAVING
> >   save device state
> > if RESUMING 0->1
> >   reset device state
> > if RESUMING 1->0
> >   load device state
> > if RUNNING 0->1
> >   unfreeze + unquiesce  
> 
> Right, which matches the text:
> 
>  - !RUNNING
>  - SAVING | !RUNNING
>  - RESUMING
>  - !RESUMING
>  - RUNNING
>  
> > So maybe part of my confusion stems from the fact that the mlx5 driver
> > doesn't support pre-copy, which by the provided list is the highest
> > priority.    
> 
> Right.
> 
> > But what actually makes that the highest priority?  Any
> > combination of SAVING and RESUMING is invalid, so we can eliminate
> > those.  We obviously can't have RUNNING and !RUNNING, so we can
> > eliminate all cases of !RUNNING, so we can shorten the list relativeto
> > prioritizing SAVING|RUNNING to:  
> 
> There are several orders that can make sense. What we've found is
> following the reference flow order has given something workable for
> precedence.

I think though that your reference flow priority depends a lot on your
implementation that resuming state is stored somewhere and only
processed on the transition of the RESUMING bit.  You're sharing a
buffer between SAVING and RESUMING.  That's implementation, not
specification.  A device may choose to validate and incorporate data
written to the migration region as it's written.  A device may choose
to expose actual portions of on device memory through the migration
region during either SAVING or RESTORING.  Therefore setting or
clearing of these bits may not have the data dependencies that drive
the priority scheme of mlx5.

So why is this the one true implementation?

> > SAVING | RUNNING would need to be processed after !RESUMING, but
> > maybe before RUNNING itself.  
> 
> It is a good point, it does make more sense after RUNNING as a device
> should be already RUNNING before entering pre-copy. I moved it to
> before !NDMA
> 
> > NDMA only requires that the device cease initiating DMA before the call
> > returns and it only has meaning if RUNNING, so I'm not sure why setting
> > NDMA is given any priority.  I guess maybe we're trying
> > (unnecessarily?) to preempt any DMA that might occur as a result of
> > setting RUNNING (so why is it above cases including !RUNNING?)?  
> 
> Everything was given priority so there is no confusing omission. For
> the order, as NDMA has no meaning outside RUNNING, it makes sense you'd
> do a NDMA before making it meaningless / after making it meaningful.
> 
> This is made concrete by mlx5's scheme that always requires quiesce
> (NDMA) to happen before freeze (!RUNNING) and viceversa, so we get to
> this order. mlx5 implicitly does NDMA on !RUNNING 
> 
> > Obviously presenting a priority scheme without a discussion of the
> > associativity of the states and a barely sketched out nomenclature is
> > really not inviting an actual understanding how this is reasoned (imo).  
> 
> Sure, but how we got here isn't really important to the intent of the
> document to guide implementors.
> 
> Well, you wrote a lot, and found a correction, but I haven't been left
> with a way to write this more clearly? Now that you obviously
> understand what it is saying, what do you think?

I think implementation and clarification of the actual definition of
the uAPI should exist in the uAPI header, userspace clarification and
examples for the consumption of the uAPI should stay here in
Documentation/, and we should not assume either the QEMU or mlx5
implementations as the canonical reference for either.

> > I'm feeling like there's a bit of a chicken and egg problem here to
> > decide if this is sufficiently clear documentation before a new posting
> > of the mlx5 driver where the mlx5 driver is the authoritative source
> > for the reasoning of the priority scheme documented here (and doesn't
> > make use of pre-copy).  
> 
> I wouldn't fixate on the ordering, it is a small part of the
> document..
> 
> > The migration data stream is entirely opaque to userspace, so what's
> > the benefit to userspace to suggest anything about the content in each
> > phase?  This is presented in a userspace edge concerns section, but the
> > description is much more relevant to a driver author.  
> 
> It is informative for the device driver author to understand what
> device functionality to map to this.

Sounds like we're in agreement, so why does this belong in userspace
Documentation/?

> > > > I think the fact that a user is not required to run the pre-copy
> > > > phase until completion is also noteworthy.    
> > > 
> > > This text doesn't try to detail how the migration window works, that
> > > is a different large task. The intention is that the migration window
> > > must be fully drained to be successful.  
> > 
> > Clarification, the *!RUNNING* migration window must be fully drained.
> >   
> > > I added this for some clarity ""The entire migration data, up to each
> > > end of stream must be transported from the saving to resuming side.""  
> > 
> > Per the uAPI regarding pre-copy:
> > 
> >   "The user must not be required to consume all migration data before
> >   the device transitions to a new state, including the stop-and-copy
> >   state."
> > 
> > If "end of stream" suggests the driver defined end of the data stream
> > for pre-copy rather than simply the end of the user accumulated data
> > stream, that conflicts with the uAPI.  Thanks,  
> 
> Hmm, yes. I can try to clarify how this all works better. We don't
> implement pre-copy but it should still be described better than it has
> been.
> 
> I'm still not sure how this works. 
> 
> We are in SAVING|RUNNING and we dump all the dirty data and return end
> of stream.
> 
> We stay in SAVING|RUNNING and some more device state became dirty. How
> does userspace know? Should it poll and see if the stream got longer?

The uAPI might need some clarification here, but the only viable
scenario would seem to be that yes, userspace should continue to poll
the device so long as it remains in SAVING|RUNNING as internal state
can continue to change asynchronously from userspace polling.  It's
only when pending_bytes returns zero while !RUNNING that the data
stream is complete and the device is precluded from deciding any new
state is available.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-03 18:06             ` Alex Williamson
@ 2021-12-06 16:03               ` Cornelia Huck
  2021-12-06 17:34                 ` Jason Gunthorpe
  2021-12-06 19:15               ` Jason Gunthorpe
  1 sibling, 1 reply; 29+ messages in thread
From: Cornelia Huck @ 2021-12-06 16:03 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy,
	Shameer Kolothum, Yishai Hadas

On Fri, Dec 03 2021, Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 1 Dec 2021 19:25:02 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote:
>> > On Tue, 30 Nov 2021 23:14:07 -0400
>> > Jason Gunthorpe <jgg@nvidia.com> wrote:
>> >   
>> > > On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote:

>> OTOH "supportable" qemu could certainly make the default choice to
>> require devices for simplicity.
>
> I get a bit lost every time I try to sketch out how QEMU would
> implement it.  Forgive the stream of consciousness and rhetorical
> discussion below...
>
>  - Does it make sense that a device itself can opt-out of p2p mappings?
>    This is simply an MMIO access from another device rather than from
>    the CPU.  Vendors cannot preclude this use case on bare metal,
>    should they be able to in a VM?  We create heterogeneous p2p maps,
>    device A can access device B, but device B maybe can't access device
>    A.  Seems troublesome.
>
>  - If we can't have a per-device policy, then we'd need a per VM
>    policy, likely some way to opt-out of all p2p mappings for vfio
>    devices.  We need to support hotplug of devices though, so is it a
>    per VM policy or is it a per device policy which needs to be
>    consistent among all attached devices?  Perhaps a
>    "enable_p2p=on/off" option on the vfio-pci device, default [on] to
>    match current behavior.  For any case of this option being set to
>    non-default, all devices would need to set it to the same value,
>    non-compliant devices rejected.
>
>  - We could possibly allow migration=on,enable_p2p=on for a non-NDMA
>    device, but the rules change if additional devices are added, they
>    need to be rejected or migration support implicitly disappears.  That
>    seems hard to document, non-deterministic as far as a user is
>    concerned. So maybe for a non-NDMA device we'd require
>    enable_p2p=off in order to set migration=on.  That means we could
>    never enable migration on non-NDMA devices by default, which
>    probably means we also cannot enable it by default on NDMA devices
>    or we get user confusion/inconsistency.
>
>  - Can a user know which devices will require enable_p2p=off in order
>    to set migration=on?  "Read the error log" is a poor user experience
>    and difficult hurdle for libvirt.
>
> So in order to create a predictable QEMU experience in the face of
> optional NDMA per device, I think we preclude being able to enable
> migration support for any vfio device by default and we have an
> exercise to determine how a user or management tool could easily
> determine NDMA compatibility :-\

Hm, maybe we can take a page out of the confidential guest support book
and control this like we do for the virtio iommu flag?

I'm not sure whether there is a pressing need to support p2p for
non-NDMA devices?

> There's a fine line between writing an inter-operable driver and
> writing a driver for the current QEMU implementation.  Obviously we want
> to support the current QEMU implementation, but we want an interface
> that can accommodate how that implementation might evolve.  Once we
> start telling driver authors to expect specific flows rather than
> looking at the operation of each bit, then our implementations become
> more fragile, less versatile relative to the user.

What is actually on the table regarding the current QEMU implementation?
The interface is still marked as experimental, so we have some room for
changes, but we probably don't want to throw everything away.

>
>> > > > Userspace can attempt RESUMING -> RUNNING regardless of what we specify,
>> > > > so a driver needs to be prepared for such an attempted state change
>> > > > either way.  So what's the advantage to telling a driver author that
>> > > > they can expect a given behavior?    
>> > > 
>> > > The above didn't tell a driver author to expect a certain behavior, it
>> > > tells userspace what to do.  
>> > 
>> >   "The migration driver can rely on user-space issuing a
>> >    VFIO_DEVICE_RESET prior to starting RESUMING."  
>> 
>> I trimmed too much, the original text you quoted was
>> 
>> "To abort a RESUMING issue a VFIO_DEVICE_RESET."
>> 
>> Which I still think is fine.
>
> If we're writing a specification, that's really a MAY statement,
> userspace MAY issue a reset to abort the RESUMING process and return
> the device to RUNNING.  They MAY also write the device_state directly,
> which MAY return an error depending on various factors such as whether
> data has been written to the migration state and whether that data is
> complete.  If a failed transitions results in an ERROR device_state,
> the user MUST issue a reset in order to return it to a RUNNING state
> without closing the interface.

Are we actually writing a specification? If yes, we need to be more
clear on what is mandatory (MUST), advised (SHOULD), or allowed
(MAY). If I look at the current proposal, I'm not sure into which
category some of the statements fall.

>
> A recommendation to use reset to skip over potential error conditions
> when the goal is simply a new, clean RUNNING state irrespective of data
> written to the migration region, is fine.  But drivers shouldn't be
> written with only that expectation, just like they shouldn't expect a
> reset precondition to entering RESUMING.
>  
>> > Tracing that shows a reset preceding entering RESUMING doesn't suggest
>> > to me that QEMU is performing a reset for the specific purpose of
>> > entering RESUMING.  Correlation != causation.  
>> 
>> Kernel doesn't care why qemu did it - it was done. Intent doesn't
>> matter :)
>
> This is exactly the sort of "designed for QEMU implementation"
> inter-operability that I want to avoid.  It doesn't take much of a
> crystal ball to guess that gratuitous and redundant device resets slow
> VM instantiation and are a likely target for optimization.

Which brings me back to my question above: Are we wedded to all details
of the current QEMU implementation? Should we consider some of them more
as a MAY? Can we change QEMU to do things differently, given the
experimental nature of the support?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-06 16:03               ` Cornelia Huck
@ 2021-12-06 17:34                 ` Jason Gunthorpe
  2021-12-06 18:06                   ` Cornelia Huck
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-06 17:34 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Mon, Dec 06, 2021 at 05:03:00PM +0100, Cornelia Huck wrote:

> > If we're writing a specification, that's really a MAY statement,
> > userspace MAY issue a reset to abort the RESUMING process and return
> > the device to RUNNING.  They MAY also write the device_state directly,
> > which MAY return an error depending on various factors such as whether
> > data has been written to the migration state and whether that data is
> > complete.  If a failed transitions results in an ERROR device_state,
> > the user MUST issue a reset in order to return it to a RUNNING state
> > without closing the interface.
> 
> Are we actually writing a specification? If yes, we need to be more
> clear on what is mandatory (MUST), advised (SHOULD), or allowed
> (MAY). If I look at the current proposal, I'm not sure into which
> category some of the statements fall.

I deliberately didn't use such formal language because this is far
from what I'd consider an acceptable spec. It is more words about how
things work and some kind of basis for agreement between user and
kernel.

Under Linus's "don't break userspace" guideline whatever userspace
ends up doing becomes the spec the kernel is wedded to, regardless of
what we write down here.

Which basically means whatever mlx5 and qemu does after we go forward
is the definitive spec and we cannot change qemu in a way that is
incompatible with mlx5 or introduce a new driver that is incompatible
with qemu.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-06 17:34                 ` Jason Gunthorpe
@ 2021-12-06 18:06                   ` Cornelia Huck
  2021-12-06 19:19                     ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Cornelia Huck @ 2021-12-06 18:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Dec 06, 2021 at 05:03:00PM +0100, Cornelia Huck wrote:
>
>> > If we're writing a specification, that's really a MAY statement,
>> > userspace MAY issue a reset to abort the RESUMING process and return
>> > the device to RUNNING.  They MAY also write the device_state directly,
>> > which MAY return an error depending on various factors such as whether
>> > data has been written to the migration state and whether that data is
>> > complete.  If a failed transitions results in an ERROR device_state,
>> > the user MUST issue a reset in order to return it to a RUNNING state
>> > without closing the interface.
>> 
>> Are we actually writing a specification? If yes, we need to be more
>> clear on what is mandatory (MUST), advised (SHOULD), or allowed
>> (MAY). If I look at the current proposal, I'm not sure into which
>> category some of the statements fall.
>
> I deliberately didn't use such formal language because this is far
> from what I'd consider an acceptable spec. It is more words about how
> things work and some kind of basis for agreement between user and
> kernel.

We don't really need formal language, but there are too many unclear
statements, as the discussion above showed. Therefore my question: What
are we actually writing? Even if it is not a formal specification, it
still needs to be clear.

>
> Under Linus's "don't break userspace" guideline whatever userspace
> ends up doing becomes the spec the kernel is wedded to, regardless of
> what we write down here.

All the more important that we actually agree before this is merged! I
don't want choices hidden deep inside the mlx5 driver dictating what
other drivers should do, it must be reasonably easy to figure out
(including what is mandatory, and what is flexible.)

> Which basically means whatever mlx5 and qemu does after we go forward
> is the definitive spec and we cannot change qemu in a way that is
> incompatible with mlx5 or introduce a new driver that is incompatible
> with qemu.

TBH, I'm not too happy with the current QEMU state, either. We need to
take a long, hard look first and figure out what we need to do to make
the QEMU support non-experimental.

We're discussing a complex topic here, and we really don't want to
perpetuate an unclear uAPI. This is where my push for more precise
statements is coming from.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-03 18:06             ` Alex Williamson
  2021-12-06 16:03               ` Cornelia Huck
@ 2021-12-06 19:15               ` Jason Gunthorpe
  2021-12-07 10:50                 ` Cornelia Huck
  1 sibling, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-06 19:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jonathan Corbet, linux-doc, Cornelia Huck, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Fri, Dec 03, 2021 at 11:06:19AM -0700, Alex Williamson wrote:

> > Yes, not sure how you get to a non-issue though? If the IOMMU map is
> > present the huawei device can be attacked by a hostile VM and forced
> > to exit NDMA. All this takes is any two DMA capable devices to be
> > plugged in?
> 
> I'm confused by your idea of a "hostile VM".  Devices within a VM can
> only DMA to memory owned by the VM.  So the only hostile VM that can
> attack this device is the VM that owns the device in the first place.
> Meanwhile the VMM is transitioning all devices to the NDMA state, so
> the opportunity to poke one device from another device within the same
> VM is pretty limited.  And for what gain?  Is this a DoS opportunity
> that the guest might trigger the migration to abort or fail?

I'm quite worried about what could be possible if the VM is able to
impact the hypervisor. Something like migration is a hypervisor
activity and it should be fully isolated.

If we are certain that the only consequence is that the VM corrupts
itself, then I agree that is OK, but I've seen enough crazy security
stuff to be skeptical we are smart enough to know that :(

> I suggest this might be a non-issue for hns because I understand it to
> not support any sort of p2p, therefore this DMA write to the device
> while it's under mediation would never happen in practice. 

It has nothing to do with the huawei device, any DMA capable device
could be instructed to generate a P2P write to the huawei device.

> > This is a good idea, if we want to make huawei support NDMA then
> > flagging it to never be in the iommu map in the first place is a great
> > solution. Then they can use CPU mmio trapping get the rest of the way.
> 
> NB, this flag would only be a hint to the VMM, the vfio IOMMU backend
> would need to be involved if we intended to prevent such mappings, but
> that seems like overkill if we're expecting userspace to follow the
> rules to get working migration and migration drivers have sufficient
> robustness to maintain device isolation.

I'm fine with hint. Secure userspace needs to follow the security
rules, the kernel doesn't need to nanny a VMM, AFAIK?

> > OK, I don't read it like that. It seems OK to have a performance hit
> > in NDMA since it is only a short grace state.
> 
> Is it the length of time we're in the NDMA state or the utility of the
> NDMA state itself?  If we're telling the device "no more outbound DMA,
> no more interrupts", then it doesn't seem super relevant for the
> majority of use cases what happens to the MMIO latency.

Actually, I wrote down that the P2P grace state should have
!VCPU_RUNNING, so we don't even need to worry about anything else
here. Just remove it from the IOMMU map.

PRI is the only wrinkle, but I think we have to insist on NDMA to do
vPRI, and the huawei device doesn't support PRI anyhow.

> > OTOH "supportable" qemu could certainly make the default choice to
> > require devices for simplicity.
> 
> I get a bit lost every time I try to sketch out how QEMU would
> implement it.  Forgive the stream of consciousness and rhetorical
> discussion below...

IMHO the simple solution without ugly edges:

qemu maintains two IOMMU domains on demand:

1) has all the MMIO maps for all PCI devices.
2) has no MMIO maps for any PCI devices.

A device supporting NDMA gets assigned to #1

A device without NDMA gets assigned to #2

Some qemu flag 'enable_p2p=off' causes all devices to use #2 and
optimizes away the 2nd iommu domain.

Thus all combinations of devices can be hotplugged in/out in any order
and nothing breaks.

>  - Does it make sense that a device itself can opt-out of p2p mappings?

Bit strange, but yes it kind of does make sense. It is more of a
'device strictly requires quiet MMIO when !RUNNING' flag. Which means
the user space process must guarentee not to touch the MMIO through
any means, including CPU, VCPU or IOMMU.

Combined with the above, the special flag would allow such devices to
join the #2 IOMMU domain and reduces cases where #1 != #2

>  - Can a user know which devices will require enable_p2p=off in order
>    to set migration=on?  "Read the error log" is a poor user experience
>    and difficult hurdle for libvirt.

I'm sure we can add something in sysfs. There is a patch series adding
a cdev for the vfio_device and that comes along with a full sysfs
directory that can hold per-vfio-device information for userspace to
access.

> > Can we go the other way and move more of the uAPI header text here?
> 
> Currently our Documentation/ file describes an overview of vfio, the
> high level theory of groups, devices, and iommus, provides a detailed
> userspace usage example, an overview of device registration and ops,
> and a bunch of random spapr implementation notes largely around
> userspace considerations on power platforms.

Currently, but it doesn't have to stay that way.
 
> So I would generalize that our Documentation/ is currently focused on
> userspace access and interaction with the uAPI.  The uAPI itself
> therefore generally provides the detail regarding how a given interface
> is intended to work from an implementation perspective.  I'd say the
> proposed document here is mixing both of those in Documentation/, some
> aspects relevant to implementation of the uAPI, others to userspace.
> 
> Currently I would clearly say that the uAPI header is the canonical
> source of truth and I think it should continue to be so for describing
> the implementation of the uAPI.

Well, I don't want to have a > 1000 line comment in the uAPI
header.

> > Unfortunately more than half of that describes how the data window
> > works, and half of the rest is kind of obvious statements.
> 
> How does that rationalize forking implementation details to userspace
> Documentation/ rather than resolving the issues you see with the uAPI
> description?

I object to putting a enormous comment in a header file and using poor
tools to write documentation. If there are small things to update and
correct then please point to them and we can fix it. Otherwise lets
keep them split and try to work to improve the discussion in the rst
in general for all the uapi.

> > In your mind you see generality, in our mind we want to know how to
> > write an inter operable driver and there is no documention saying how
> > to do that.
> 
> I didn't know there was a hive mind over there, maybe that explains
> your quick replies ;)

No :) It is just been a long topic here. You recall the first version
of the series had a very different interpritation of the header file
comment regarding the device_state. It was rather surprising your
remark that the device_state is not a FSM when the header file
document has FSM-like discussions.
 
> This is exactly the sort of "designed for QEMU implementation"
> inter-operability that I want to avoid.  It doesn't take much of a
> crystal ball to guess that gratuitous and redundant device resets
> slow VM instantiation and are a likely target for optimization.

Sorry, but Linus's "don't break userspace" forces us to this world.

It does not matter what is written in text files, only what userspace
actually does and the kernel must accommodate existing userspace going
forward. So once released qemu forms some definitive spec and the
guardrails that limit what we can do going forward.

In part this document is also an effort to describe to kernel devs
what these guardrails are.

> I think though that your reference flow priority depends a lot on your
> implementation that resuming state is stored somewhere and only
> processed on the transition of the RESUMING bit.  You're sharing a
> buffer between SAVING and RESUMING.  That's implementation, not
> specification.  A device may choose to validate and incorporate data
> written to the migration region as it's written.  A device may choose
> to expose actual portions of on device memory through the migration
> region during either SAVING or RESTORING.  Therefore setting or
> clearing of these bits may not have the data dependencies that drive
> the priority scheme of mlx5.

I've said this before, it doesn't matter what other devices might
do. mlx5 needs to work this way, so least-common-denominator.

The only other option is to declare multi-bit changes as undefined
behavior - that seems dangerous in general.

The reason this is a good ordering is because it matches the ordering
userspace would do normally if it does one bit change at a time.

Therefore, *any* device must be able to implement this order if it
simply decomposes multi-bit changes to single actions, it must already
support, using the priority listed here.

Ideally we'd put this logic in the core code and drivers would just
have callbacks, but I'm not quite sure we see this probelm well enough
yet to make the callbacks. Maybe once we have three drivers..

> So why is this the one true implementation?

Because it is what we are going to write down?

> I think implementation and clarification of the actual definition of
> the uAPI should exist in the uAPI header, userspace clarification and
> examples for the consumption of the uAPI should stay here in
> Documentation/, and we should not assume either the QEMU or mlx5
> implementations as the canonical reference for either.

I'm not even sure anymore what you are looking for in order to
advance.

Can you please out line exactly what things you want from us? Like, as
a task list?

> > It is informative for the device driver author to understand what
> > device functionality to map to this.
> 
> Sounds like we're in agreement, so why does this belong in userspace
> Documentation/?

Because this is where documentation should live, not in a header file.

> The uAPI might need some clarification here, but the only viable
> scenario would seem to be that yes, userspace should continue to poll
> the device so long as it remains in SAVING|RUNNING as internal state
> can continue to change asynchronously from userspace polling.  It's
> only when pending_bytes returns zero while !RUNNING that the data
> stream is complete and the device is precluded from deciding any new
> state is available.  Thanks,

I updated around these lines, thanks

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-06 18:06                   ` Cornelia Huck
@ 2021-12-06 19:19                     ` Jason Gunthorpe
  2021-12-07 11:16                       ` Cornelia Huck
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-06 19:19 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote:

> We're discussing a complex topic here, and we really don't want to
> perpetuate an unclear uAPI. This is where my push for more precise
> statements is coming from.

I appreciate that, and I think we've made a big effort toward that
direction.

Can we have some crisp feedback which statements need SHOULD/MUST/MUST
NOT and come to something?

The world needs to move forward, we can't debate this endlessly
forever. It is already another 6 weeks past since the last mlx5 driver
posting.

I don't feel like this is really converging, you two are the
maintainers, can you please drive to something acceptable, whatever it
is?

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-06 19:15               ` Jason Gunthorpe
@ 2021-12-07 10:50                 ` Cornelia Huck
  2021-12-07 15:37                   ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Cornelia Huck @ 2021-12-07 10:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Jonathan Corbet, linux-doc, kvm, Kirti Wankhede, Max Gurtovoy,
	Shameer Kolothum, Yishai Hadas

On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Dec 03, 2021 at 11:06:19AM -0700, Alex Williamson wrote:

>> This is exactly the sort of "designed for QEMU implementation"
>> inter-operability that I want to avoid.  It doesn't take much of a
>> crystal ball to guess that gratuitous and redundant device resets
>> slow VM instantiation and are a likely target for optimization.
>
> Sorry, but Linus's "don't break userspace" forces us to this world.
>
> It does not matter what is written in text files, only what userspace
> actually does and the kernel must accommodate existing userspace going
> forward. So once released qemu forms some definitive spec and the
> guardrails that limit what we can do going forward.

But QEMU support is *experimental*, i.e. if it breaks, you get to keep
the pieces, things may change in incompatible ways. And it is
experimental for good reason! I don't want something that we don't
support in QEMU lock us into a bad kernel interface, that's just utterly
broken. It would mean that we must never introduce experimental
interfaces in QEMU that may need some rework of the kernel interface,
but need to keep those out of the tree -- and that can't be in the best
interest of implementing things requiring interaction between the kernel
and QEMU.

If you really think we cannot make changes, I vote for declaring the
current interface legacy and not further developed, and introduce a new,
separate one that doesn't carry all of the baggage that this one
does. We could even end up with a QEMU part that looks better!


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-06 19:19                     ` Jason Gunthorpe
@ 2021-12-07 11:16                       ` Cornelia Huck
  2021-12-07 15:51                         ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Cornelia Huck @ 2021-12-07 11:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote:
>
>> We're discussing a complex topic here, and we really don't want to
>> perpetuate an unclear uAPI. This is where my push for more precise
>> statements is coming from.
>
> I appreciate that, and I think we've made a big effort toward that
> direction.
>
> Can we have some crisp feedback which statements need SHOULD/MUST/MUST
> NOT and come to something?

I'm not sure what I should actually comment on, some general remarks:

- If we consider a possible vfio-ccw implementation that will quiesce
  the device and not rely on tracking I/O, we need to make the parts
  that talk about tracking non-mandatory.
- NDMA sounds like something that needs to be non-mandatory as well.
- The discussion regarding bit group changes has me confused. You seem
  to be saying that mlx5 needs that, so it needs to have some mandatory
  component; but are actually all devices able to deal with those bits
  changing as a group?
- In particular, the flow needs definitive markings about what is
  mandatory to implement, what is strongly suggested, and what is
  optional. It is unclear to me what is really expected, and what is
  simply one way to implement it.

>
> The world needs to move forward, we can't debate this endlessly
> forever. It is already another 6 weeks past since the last mlx5 driver
> posting.

6 weeks is already blazingly fast in any vfio migration discussion. /s

Remember that we have other things to do as well, not all of which will
be visible to you.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 10:50                 ` Cornelia Huck
@ 2021-12-07 15:37                   ` Jason Gunthorpe
  2021-12-07 15:56                     ` Cornelia Huck
  2021-12-07 16:22                     ` Alex Williamson
  0 siblings, 2 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-07 15:37 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Dec 07, 2021 at 11:50:47AM +0100, Cornelia Huck wrote:
> On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Fri, Dec 03, 2021 at 11:06:19AM -0700, Alex Williamson wrote:
> 
> >> This is exactly the sort of "designed for QEMU implementation"
> >> inter-operability that I want to avoid.  It doesn't take much of a
> >> crystal ball to guess that gratuitous and redundant device resets
> >> slow VM instantiation and are a likely target for optimization.
> >
> > Sorry, but Linus's "don't break userspace" forces us to this world.
> >
> > It does not matter what is written in text files, only what userspace
> > actually does and the kernel must accommodate existing userspace going
> > forward. So once released qemu forms some definitive spec and the
> > guardrails that limit what we can do going forward.
> 
> But QEMU support is *experimental*, i.e. if it breaks, you get to keep
> the pieces, things may change in incompatible ways. And it is
> experimental for good reason!

And we can probably make an breakage exception for this existing
experimental qemu.

My point was going forward, once we userspace starts to become
deployed, it doesn't matter what we write in these text files and
comments. It only matters what deployed userspace actually does.

> It would mean that we must never introduce experimental interfaces
> in QEMU that may need some rework of the kernel interface, but need
> to keep those out of the tree -- and that can't be in the best
> interest of implementing things requiring interaction between the
> kernel and QEMU.

In general we should not be merging uAPI to the kernel that is so
incomplete as to be unusable. 

I'm sorry, this whole thing from the day the migration stuff was first
merged to include/uapi is a textbook example how not to do things in
the kernel community.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 11:16                       ` Cornelia Huck
@ 2021-12-07 15:51                         ` Jason Gunthorpe
  2021-12-07 16:30                           ` Cornelia Huck
  2021-12-08 16:06                           ` Alex Williamson
  0 siblings, 2 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-07 15:51 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote:
> On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote:
> >
> >> We're discussing a complex topic here, and we really don't want to
> >> perpetuate an unclear uAPI. This is where my push for more precise
> >> statements is coming from.
> >
> > I appreciate that, and I think we've made a big effort toward that
> > direction.
> >
> > Can we have some crisp feedback which statements need SHOULD/MUST/MUST
> > NOT and come to something?
> 
> I'm not sure what I should actually comment on, some general remarks:

You should comment on the paragraphs that prevent you from adding a
reviewed-by.

> - If we consider a possible vfio-ccw implementation that will quiesce
>   the device and not rely on tracking I/O, we need to make the parts
>   that talk about tracking non-mandatory.

I'm not sure what you mean by 'tracking I/O'?

I thought we were good on ccw?

> - NDMA sounds like something that needs to be non-mandatory as well.

I agree, Alex are we agreed now ?

> - The discussion regarding bit group changes has me confused. You seem
>   to be saying that mlx5 needs that, so it needs to have some mandatory
>   component; but are actually all devices able to deal with those bits
>   changing as a group?

Yes, all devices can support this as written.

If you think of the device_state as initiating some action pre bit
group then we have multiple bit group that can change at once and thus
multiple actions that can be triggered.

All devices must support userspace initiating actions one by one in a
manner that supports the reference flow. 

Thus, every driver can decompose a request for multiple actions into
an ordered list of single actions and execute those actions exactly as
if userspace had issued single actions.

The precedence follows the reference flow so that any conflicts
resolve along the path that already has defined behaviors.

I honestly don't know why this is such a discussion point, beyond
being a big oversight of the original design.

> - In particular, the flow needs definitive markings about what is
>   mandatory to implement, what is strongly suggested, and what is
>   optional. It is unclear to me what is really expected, and what is
>   simply one way to implement it.

I'm not sure either, this hasn't been clear at all to me. Alex has
asked for things to be general and left undefined, but we need some
minimum definition to actually implement driver/VMM interoperability
for what we need to do.

Really what qemu does will set the mandatory to implement.

> > The world needs to move forward, we can't debate this endlessly
> > forever. It is already another 6 weeks past since the last mlx5 driver
> > posting.
> 
> 6 weeks is already blazingly fast in any vfio migration discussion. /s

We've invested a lot of engineer months in this project, it is
disrespectful to all of this effort to leave us hanging with no clear
path forward and no actionable review comments after so much
time. This is another kernel cycle lost.

> Remember that we have other things to do as well, not all of which will
> be visible to you.

As do we all, but your name is in the maintainer file, and that comes
with some responsibility.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 15:37                   ` Jason Gunthorpe
@ 2021-12-07 15:56                     ` Cornelia Huck
  2021-12-07 16:13                       ` Jason Gunthorpe
  2021-12-07 16:22                     ` Alex Williamson
  1 sibling, 1 reply; 29+ messages in thread
From: Cornelia Huck @ 2021-12-07 15:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Dec 07 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Dec 07, 2021 at 11:50:47AM +0100, Cornelia Huck wrote:
>> On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
>> 
>> > On Fri, Dec 03, 2021 at 11:06:19AM -0700, Alex Williamson wrote:
>> 
>> >> This is exactly the sort of "designed for QEMU implementation"
>> >> inter-operability that I want to avoid.  It doesn't take much of a
>> >> crystal ball to guess that gratuitous and redundant device resets
>> >> slow VM instantiation and are a likely target for optimization.
>> >
>> > Sorry, but Linus's "don't break userspace" forces us to this world.
>> >
>> > It does not matter what is written in text files, only what userspace
>> > actually does and the kernel must accommodate existing userspace going
>> > forward. So once released qemu forms some definitive spec and the
>> > guardrails that limit what we can do going forward.
>> 
>> But QEMU support is *experimental*, i.e. if it breaks, you get to keep
>> the pieces, things may change in incompatible ways. And it is
>> experimental for good reason!
>
> And we can probably make an breakage exception for this existing
> experimental qemu.
>
> My point was going forward, once we userspace starts to become
> deployed, it doesn't matter what we write in these text files and
> comments. It only matters what deployed userspace actually does.

Absolutely, as soon as QEMU support is made non-experimental, this needs
to be stable.

>
>> It would mean that we must never introduce experimental interfaces
>> in QEMU that may need some rework of the kernel interface, but need
>> to keep those out of the tree -- and that can't be in the best
>> interest of implementing things requiring interaction between the
>> kernel and QEMU.
>
> In general we should not be merging uAPI to the kernel that is so
> incomplete as to be unusable. 
>
> I'm sorry, this whole thing from the day the migration stuff was first
> merged to include/uapi is a textbook example how not to do things in
> the kernel community.

Honestly, I regret ever acking this and the QEMU part. Maybe the history
of this and the archived discussions are instructive to others so that
they don't repeat this :/


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 15:56                     ` Cornelia Huck
@ 2021-12-07 16:13                       ` Jason Gunthorpe
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-07 16:13 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Dec 07, 2021 at 04:56:17PM +0100, Cornelia Huck wrote:

> Honestly, I regret ever acking this and the QEMU part. Maybe the history
> of this and the archived discussions are instructive to others so that
> they don't repeat this :/

The good news is we can fix it :)

Jason
 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 15:37                   ` Jason Gunthorpe
  2021-12-07 15:56                     ` Cornelia Huck
@ 2021-12-07 16:22                     ` Alex Williamson
  1 sibling, 0 replies; 29+ messages in thread
From: Alex Williamson @ 2021-12-07 16:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Cornelia Huck, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, 7 Dec 2021 11:37:43 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Dec 07, 2021 at 11:50:47AM +0100, Cornelia Huck wrote:
> > On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Fri, Dec 03, 2021 at 11:06:19AM -0700, Alex Williamson wrote:  
> >   
> > >> This is exactly the sort of "designed for QEMU implementation"
> > >> inter-operability that I want to avoid.  It doesn't take much of a
> > >> crystal ball to guess that gratuitous and redundant device resets
> > >> slow VM instantiation and are a likely target for optimization.  
> > >
> > > Sorry, but Linus's "don't break userspace" forces us to this world.
> > >
> > > It does not matter what is written in text files, only what userspace
> > > actually does and the kernel must accommodate existing userspace going
> > > forward. So once released qemu forms some definitive spec and the
> > > guardrails that limit what we can do going forward.  
> > 
> > But QEMU support is *experimental*, i.e. if it breaks, you get to keep
> > the pieces, things may change in incompatible ways. And it is
> > experimental for good reason!  
> 
> And we can probably make an breakage exception for this existing
> experimental qemu.
> 
> My point was going forward, once we userspace starts to become
> deployed, it doesn't matter what we write in these text files and
> comments. It only matters what deployed userspace actually does.

I think we're losing sight of my concern in designing for QEMU.  The
document included a statement that migration driver writers could rely
on userspace performing a device reset prior to entering the RESUMING
device_state because of an unfounded correlation that QEMU resets the
VM on the way to loading device state.  Now, if we say QEMU does this
thing and we need to support that usage model, I'm 100% on board.  If
we turn it around and say QEMU does this thing therefore migration
drivers can expect exactly this usage model, full stop, that's the
wrong direction.  That is what I'm trying to avoid.

The obvious way to remove the any question of breaking userspace is to
simply rev the migration region sub-type.  The kernel stops exposing any
v1 sub-types, we don't break any userspaces, userspaces need to be
updated to v2 in order to continue having any functionality.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 15:51                         ` Jason Gunthorpe
@ 2021-12-07 16:30                           ` Cornelia Huck
  2021-12-07 17:00                             ` Jason Gunthorpe
  2021-12-08 16:06                           ` Alex Williamson
  1 sibling, 1 reply; 29+ messages in thread
From: Cornelia Huck @ 2021-12-07 16:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Dec 07 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote:
>> On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
>> 
>> > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote:
>> >
>> >> We're discussing a complex topic here, and we really don't want to
>> >> perpetuate an unclear uAPI. This is where my push for more precise
>> >> statements is coming from.
>> >
>> > I appreciate that, and I think we've made a big effort toward that
>> > direction.
>> >
>> > Can we have some crisp feedback which statements need SHOULD/MUST/MUST
>> > NOT and come to something?
>> 
>> I'm not sure what I should actually comment on, some general remarks:
>
> You should comment on the paragraphs that prevent you from adding a
> reviewed-by.

On which copy? There have been updates, and I haven't found a conchise
email to reply to.

>
>> - If we consider a possible vfio-ccw implementation that will quiesce
>>   the device and not rely on tracking I/O, we need to make the parts
>>   that talk about tracking non-mandatory.
>
> I'm not sure what you mean by 'tracking I/O'?

MMIO.

>
> I thought we were good on ccw?

We are, if we don't make things mandatory that are not needed for
non-MMIO.

>
>> - NDMA sounds like something that needs to be non-mandatory as well.
>
> I agree, Alex are we agreed now ?
>
>> - The discussion regarding bit group changes has me confused. You seem
>>   to be saying that mlx5 needs that, so it needs to have some mandatory
>>   component; but are actually all devices able to deal with those bits
>>   changing as a group?
>
> Yes, all devices can support this as written.
>
> If you think of the device_state as initiating some action pre bit
> group then we have multiple bit group that can change at once and thus
> multiple actions that can be triggered.
>
> All devices must support userspace initiating actions one by one in a
> manner that supports the reference flow. 
>
> Thus, every driver can decompose a request for multiple actions into
> an ordered list of single actions and execute those actions exactly as
> if userspace had issued single actions.
>
> The precedence follows the reference flow so that any conflicts
> resolve along the path that already has defined behaviors.

Well, yes. I'm just wondering where bit groups are coming in
then. That's where I'm confused (by the discussion).

>
> I honestly don't know why this is such a discussion point, beyond
> being a big oversight of the original design.
>
>> - In particular, the flow needs definitive markings about what is
>>   mandatory to implement, what is strongly suggested, and what is
>>   optional. It is unclear to me what is really expected, and what is
>>   simply one way to implement it.
>
> I'm not sure either, this hasn't been clear at all to me. Alex has
> asked for things to be general and left undefined, but we need some
> minimum definition to actually implement driver/VMM interoperability
> for what we need to do.
>
> Really what qemu does will set the mandatory to implement.

We really, really need to revisit QEMU before that. I'm staring at the
code and I'm not quite sure if that really is what we want. We might
have been too tired after years of review cycles when merging that.

>
>> > The world needs to move forward, we can't debate this endlessly
>> > forever. It is already another 6 weeks past since the last mlx5 driver
>> > posting.
>> 
>> 6 weeks is already blazingly fast in any vfio migration discussion. /s
>
> We've invested a lot of engineer months in this project, it is
> disrespectful to all of this effort to leave us hanging with no clear
> path forward and no actionable review comments after so much
> time. This is another kernel cycle lost.

Well... it's not only you who are spending time on this. I'm trying to
follow the discussion, which is not easy, and try to come up with
feedback, which is not easy, either. This is using up a huge chunk of my
time. Compared with the long and tedious discussions that led to the
initial code being merged, we're really going very fast. And expecting
people to drop everything and make a definite desicion quickly when
there are still open questions on a complex topic does not strike me as
particularly respectful, either.

>
>> Remember that we have other things to do as well, not all of which will
>> be visible to you.
>
> As do we all, but your name is in the maintainer file, and that comes
> with some responsibility.

It, however, does not mean that someone listed in MAINTAINERS must
immediately deal with anything that is thrown at them to the detriment
of everything else. It *especially* does not mean that someone listed in
MAINTAINERS is neglecting their responsibilies if things are not going
as well as you'd hope them to go.

[There is a reason why I have dropped out of some maintainership entries
recently, the asymmetry of people requiring feedback and merging and
people actually giving feedback and merging seems to have gotten worse
over the last years. I can certainly delist myself as a vfio reviewer as
well, and while that would certainly help my wellbeing, I'm not sure
whether that is what you want.]


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 16:30                           ` Cornelia Huck
@ 2021-12-07 17:00                             ` Jason Gunthorpe
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-07 17:00 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Dec 07, 2021 at 05:30:29PM +0100, Cornelia Huck wrote:
> On Tue, Dec 07 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote:
> >> On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >> 
> >> > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote:
> >> >
> >> >> We're discussing a complex topic here, and we really don't want to
> >> >> perpetuate an unclear uAPI. This is where my push for more precise
> >> >> statements is coming from.
> >> >
> >> > I appreciate that, and I think we've made a big effort toward that
> >> > direction.
> >> >
> >> > Can we have some crisp feedback which statements need SHOULD/MUST/MUST
> >> > NOT and come to something?
> >> 
> >> I'm not sure what I should actually comment on, some general remarks:
> >
> > You should comment on the paragraphs that prevent you from adding a
> > reviewed-by.
> 
> On which copy? There have been updates, and I haven't found a conchise
> email to reply to.

I'll send a v3

> >> - If we consider a possible vfio-ccw implementation that will quiesce
> >>   the device and not rely on tracking I/O, we need to make the parts
> >>   that talk about tracking non-mandatory.
> >
> > I'm not sure what you mean by 'tracking I/O'?
> 
> MMIO.

If there is no MMIO there is nothing to worry about, right?

S390 should define its own rules around quieting whatever non-MMIO
control path exists between the VM and the mdev.
 
> > The precedence follows the reference flow so that any conflicts
> > resolve along the path that already has defined behaviors.
> 
> Well, yes. I'm just wondering where bit groups are coming in
> then. That's where I'm confused (by the discussion).

Bit groups are things like "SAVING | RUNNING". Bit SAVING does not
have any meaning on its own, it is always qualified by running.

> >> 6 weeks is already blazingly fast in any vfio migration discussion. /s
> >
> > We've invested a lot of engineer months in this project, it is
> > disrespectful to all of this effort to leave us hanging with no clear
> > path forward and no actionable review comments after so much
> > time. This is another kernel cycle lost.
> 
> Well... it's not only you who are spending time on this. I'm trying to
> follow the discussion, which is not easy, and try to come up with
> feedback, which is not easy, either. This is using up a huge chunk of my
> time. Compared with the long and tedious discussions that led to the
> initial code being merged, we're really going very fast. 

Well, yes, we are serious about getting this merged and we are fully
committed to upstream quality and good design. 

In the end we all agree - we are only working through all the edges
and special cases that always existed here but were never fully
discussed.

> And expecting people to drop everything and make a definite desicion
> quickly when there are still open questions on a complex topic does
> not strike me as particularly respectful, either.

I don't even know what is the list open questions anymore! As far as I
know everything in the etherpad is addressed by the documentation I
prepared.

> >> Remember that we have other things to do as well, not all of which will
> >> be visible to you.
> >
> > As do we all, but your name is in the maintainer file, and that comes
> > with some responsibility.
> 
> It, however, does not mean that someone listed in MAINTAINERS must
> immediately deal with anything that is thrown at them to the detriment
> of everything else. 

I don't see this as immediately at all, please don't take it as such a
demand.

We posted v1 in September, it is now Decemeber, we want to know what
we need to do to get this merged. I've asked several times now, and
I'm not seeing a clear answer :(

My suggestion is we agree on the document I wrote as a general
understanding what we are doing (not so much a spec) and we proceed.

We now have a lot of experience with physical device migration here, I
think it is reasonable to extend some trust that we are implementing
something that is reasonable as a uAPI.

> [There is a reason why I have dropped out of some maintainership entries
> recently, the asymmetry of people requiring feedback and merging and
> people actually giving feedback and merging seems to have gotten worse
> over the last years. I can certainly delist myself as a vfio reviewer as
> well, and while that would certainly help my wellbeing, I'm not sure
> whether that is what you want.]

No, I do not want that, and am not suggesting that.

Regards,
Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-07 15:51                         ` Jason Gunthorpe
  2021-12-07 16:30                           ` Cornelia Huck
@ 2021-12-08 16:06                           ` Alex Williamson
  2021-12-08 20:27                             ` Jason Gunthorpe
  1 sibling, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2021-12-08 16:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Cornelia Huck, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, 7 Dec 2021 11:51:45 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote:
> > On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote:
> > >  
> > >> We're discussing a complex topic here, and we really don't want to
> > >> perpetuate an unclear uAPI. This is where my push for more precise
> > >> statements is coming from.  
> > >
> > > I appreciate that, and I think we've made a big effort toward that
> > > direction.
> > >
> > > Can we have some crisp feedback which statements need SHOULD/MUST/MUST
> > > NOT and come to something?  
> > 
> > I'm not sure what I should actually comment on, some general remarks:  
> 
> You should comment on the paragraphs that prevent you from adding a
> reviewed-by.
> 
> > - If we consider a possible vfio-ccw implementation that will quiesce
> >   the device and not rely on tracking I/O, we need to make the parts
> >   that talk about tracking non-mandatory.  
> 
> I'm not sure what you mean by 'tracking I/O'?
> 
> I thought we were good on ccw?
> 
> > - NDMA sounds like something that needs to be non-mandatory as well.  
> 
> I agree, Alex are we agreed now ?

No.  When last we left our thread, you seemed to be suggesting QEMU
maintains two IOMMU domains, ie. containers, the first of which would
include p2p mappings for all PCI devices, the second would include no
p2p mappings.  Device supporting NDMA get attached to the former,
non-NDMA devices the latter.

So some devices can access all devices via p2p DMA, other devices can
access none.  Are there any bare metal systems that expose such
asymmetric p2p constraints?  I'm not inclined to invent new p2p
scenarios that only exist in VMs.

In addition to creating this asymmetric topology, forcing QEMU to
maintain two containers not only increases the overhead, but doubles
the locked memory requirements for QEMU since our current locked memory
accounting is unfortunately per container.  Then we need to also
consider that multi-device groups exist where a group can only be
attached to one container and also vIOMMU cases where presumably we'd
only get these dual-containers when multiple groups are attached to a
container.  Maybe also worth noting that we cannot atomically move a
device between containers, due to both vfio and often IOMMU constraints
afaik.

So it still seems like the only QEMU policy that we could manage to
document and support would require that non-mandatory NDMA support
implies that migration cannot be enabled by default for any vfio device
and that enabling migration sets in motion a binary policy regarding
p2p mappings across the VM.  I'm still not convinced how supportable
that is, but I can at least imagine explicit per device options that
need to align.

I don't know if lack of NDMA on ccw was Connie's reasoning for making
NDMA non-mandatory, but it seems like NDMA is only relevant to buses
that support DMA, so AIUI it would be just as valid for ccw devices to
report NDMA as a no-op.

> > - The discussion regarding bit group changes has me confused. You seem
> >   to be saying that mlx5 needs that, so it needs to have some mandatory
> >   component; but are actually all devices able to deal with those bits
> >   changing as a group?  
> 
> Yes, all devices can support this as written.
> 
> If you think of the device_state as initiating some action pre bit
> group then we have multiple bit group that can change at once and thus
> multiple actions that can be triggered.
> 
> All devices must support userspace initiating actions one by one in a
> manner that supports the reference flow. 
> 
> Thus, every driver can decompose a request for multiple actions into
> an ordered list of single actions and execute those actions exactly as
> if userspace had issued single actions.
> 
> The precedence follows the reference flow so that any conflicts
> resolve along the path that already has defined behaviors.
> 
> I honestly don't know why this is such a discussion point, beyond
> being a big oversight of the original design.

In my case, because it's still not clearly a universal algorithm, yet
it's being proposed as one.  We already discussed that {!}NDMA
placement is fairly arbitrary and looking at v3 I'm wondering how a
RESUMING -> SAVING|!RUNNING transition works.  For an implementation
that shares a buffer between SAVING and RESUMING, the ordering seems to
suggest the SAVING action has precedence over the !RESUMING action,
which is clearly wrong, but for an implementation where migration data
is read or written to the device directly, the ordering is not such a
concern.
 
> > - In particular, the flow needs definitive markings about what is
> >   mandatory to implement, what is strongly suggested, and what is
> >   optional. It is unclear to me what is really expected, and what is
> >   simply one way to implement it.  
> 
> I'm not sure either, this hasn't been clear at all to me. Alex has
> asked for things to be general and left undefined, but we need some
> minimum definition to actually implement driver/VMM interoperability
> for what we need to do.
> 
> Really what qemu does will set the mandatory to implement.

And therefore anything that works with QEMU is correct and how a driver
gets to that correct result can be implementation specific, depending
on factors like whether device data is buffered or the device is
accessed directly.  We can have a valid, interoperable uAPI without
constraining ourselves to a specific implementation.  Largely I think
that trying to impose an implementation as the specification is the
source of our friction.

> > > The world needs to move forward, we can't debate this endlessly
> > > forever. It is already another 6 weeks past since the last mlx5 driver
> > > posting.  
> > 
> > 6 weeks is already blazingly fast in any vfio migration discussion. /s  
> 
> We've invested a lot of engineer months in this project, it is
> disrespectful to all of this effort to leave us hanging with no clear
> path forward and no actionable review comments after so much
> time. This is another kernel cycle lost.
> 
> > Remember that we have other things to do as well, not all of which will
> > be visible to you.  
> 
> As do we all, but your name is in the maintainer file, and that comes
> with some responsibility.

This is a bit infuriating, responding to it at all is probably ill
advised.  We're all investing a lot of time into this.  We're all
disappointed how the open source use case of the previous
implementation fell apart and nobody else stepped up until now.
Rubbing salt in that wound is not helpful or productive.

Regardless, this implementation has highlighted gaps in the initial
design and it's critical that those known gaps are addressed before we
commit to the design with an in-kernel driver.  Referring to the notes
Connie copied from etherpad, those gaps include uAPI clarification
regarding various device states and accesses allowed in those states,
definition of a quiescent (NDMA) device state, discussion of per-device
dirty state, and documentation such as userspace usage and edge cases.
Only the latter items were specifically requested outside of the header
and previously provided comments questioned if we're not actually
creating contradictory documentation to the uAPI and why clarifications
are not applied to the existing uAPI descriptions.

Personally I'm a bit disappointed to see v3 posted where the diffstat
indicates no uAPI updates, so we actually have no formal definition of
this NDMA state, nor does it feel like we've really come to a
conclusion on that discussion and how it affects userspace.  What is
this documenting if NDMA is not formally part of the uAPI?  More so, it
seems we're just trying to push to get a sign-off, demanding specific
actions to get there.  Isn't that how we got into this situation,
approving the uAPI, or in this case documentation, without an in-kernel
implementation and vetted userspace?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC v2] vfio: Documentation for the migration region
  2021-12-08 16:06                           ` Alex Williamson
@ 2021-12-08 20:27                             ` Jason Gunthorpe
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2021-12-08 20:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Wed, Dec 08, 2021 at 09:06:47AM -0700, Alex Williamson wrote:
> On Tue, 7 Dec 2021 11:51:45 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Dec 07, 2021 at 12:16:32PM +0100, Cornelia Huck wrote:
> > > On Mon, Dec 06 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Mon, Dec 06, 2021 at 07:06:35PM +0100, Cornelia Huck wrote:
> > > >  
> > > >> We're discussing a complex topic here, and we really don't want to
> > > >> perpetuate an unclear uAPI. This is where my push for more precise
> > > >> statements is coming from.  
> > > >
> > > > I appreciate that, and I think we've made a big effort toward that
> > > > direction.
> > > >
> > > > Can we have some crisp feedback which statements need SHOULD/MUST/MUST
> > > > NOT and come to something?  
> > > 
> > > I'm not sure what I should actually comment on, some general remarks:  
> > 
> > You should comment on the paragraphs that prevent you from adding a
> > reviewed-by.
> > 
> > > - If we consider a possible vfio-ccw implementation that will quiesce
> > >   the device and not rely on tracking I/O, we need to make the parts
> > >   that talk about tracking non-mandatory.  
> > 
> > I'm not sure what you mean by 'tracking I/O'?
> > 
> > I thought we were good on ccw?
> > 
> > > - NDMA sounds like something that needs to be non-mandatory as well.  
> > 
> > I agree, Alex are we agreed now ?
> 
> No.  When last we left our thread, you seemed to be suggesting QEMU
> maintains two IOMMU domains, ie. containers, the first of which would
> include p2p mappings for all PCI devices, the second would include no
> p2p mappings.  Device supporting NDMA get attached to the former,
> non-NDMA devices the latter.

I think it is another valid solution, yes, As are the ideas you
identified

It is also OK what you said earlier, that qemu can just permit a guest
to mangle its migration and not do anything. I don't like this, but as
policy it is something userspace could reasonably choose to do.

The question here, right now, is what should the kernel do?

 - NDMA mandatory/not? Given we know Huawei cannot implement it, I vote
   for not mandatory
 - Should we expose some other flag, eg you 'fragile mmio' to help
   userspace form a policy? Might be useful, but also can be defined
   when someone creates a userspace policy that requires it.

Either way kerenl is just reflecting HW capability to userspace, I'm
not seeing how this can be the wrong thing for the kernel to do??

> So some devices can access all devices via p2p DMA, other devices can
> access none.  Are there any bare metal systems that expose such
> asymmetric p2p constraints?  I'm not inclined to invent new p2p
> scenarios that only exist in VMs.

Migration on baremetal is a bit odd, but all the rules are the same.
The baremetal must reach the P2P grace state to avoid corruption and
if it can rely on internal knowledge of how it operates the devices it
can avoid things like NDMA or multiple domains entirely.

> In addition to creating this asymmetric topology, forcing QEMU to
> maintain two containers not only increases the overhead, 

Yes, this policy choice has downsides. I outlined a few options where
qemu can avoid the overheads.

Other choices have different downsides.

I think part of the reason we keep going around this topic is I'm
offering options that a VMM could do, and I can't much help you decide
what matches qemu's policy framework. Frankly I think it is very
operator specific.

> but doubles the locked memory requirements for QEMU since our
> current locked memory accounting is unfortunately per container.

You asked for that to be fixed in iommufd and I've got that almost
done now. With iommufd we can now have two domains with only the
overhead of the actual io page table entries under the 2nd
iommu_domain.

When VDPA uses iommufd as well it should allow qemu to be single pin
for everything.

> Then we need to also consider that multi-device groups exist where a
> group can only be attached to one container 

It is a problem, yes, but again qemu could adopt a policy of not
support multi-group migration devices, if it wants.

> and also vIOMMU cases where presumably we'd only get these
> dual-containers when multiple groups are attached to a container.

Not sure where this means? vIOMMU should operate both containers in
unison, I'd think?

> Maybe also worth noting that we cannot atomically move a device
> between containers, due to both vfio and often IOMMU constraints
> afaik.

I don't think we need move, it is a permanent setup?

> So it still seems like the only QEMU policy that we could manage to
> document and support would require that non-mandatory NDMA support
> implies that migration cannot be enabled by default for any vfio device
> and that enabling migration sets in motion a binary policy regarding
> p2p mappings across the VM.  I'm still not convinced how supportable
> that is, but I can at least imagine explicit per device options that
> need to align.

If this is what qemu community wants, OK.
 
> I don't know if lack of NDMA on ccw was Connie's reasoning for making
> NDMA non-mandatory, but it seems like NDMA is only relevant to buses
> that support DMA, so AIUI it would be just as valid for ccw devices to
> report NDMA as a no-op.

ccw seems to support DMA, it calls vfio_pin_pages() after all? That is
DMA in this definition.

NDMA here was defined in terms of *any* DMA, not just DMA to certain
addresses, so I expect ccw will probably say it does not support NDMA
for many of the same reasons Huawei can't do it.

So, I think Connie is right to be worried about a direction to require
NDMA as a pre-condition for a migration driver. I suspect something
like CCW is very much migratable with a method that relies on no mmio
touches much like the Huawei driver does. 

In the S390 case I believe "no mmio touches" means 'whatever that
thing inside kvm that causes the device to work' is.

BTW, it is not just VFIO, but VDPA too.

Currently VDPA cannot participate in the P2P mappings so it acts like
the dual domain solution, but with IOMMUFD I expect VDPA to be able to
access the P2P mappings and so it must also support NDMA somehow.

Future things like the "netgpu" patches make it reasonable to combine
VDPA and P2P in a guest.

> > I honestly don't know why this is such a discussion point, beyond
> > being a big oversight of the original design.
> 
> In my case, because it's still not clearly a universal algorithm, yet
> it's being proposed as one.  We already discussed that {!}NDMA
> placement is fairly arbitrary and looking at v3 I'm wondering how a
> RESUMING -> SAVING|!RUNNING transition works.

Ok, I've been struggling since I wrote this to figure out how to
validate the precedence algorithmically, and I think I see it now.

Minimally, the precedence must be ordered so we never transit through
an invalid state.

Ie if userspace asks for

 RESUMING ->  SAVING | !RUNNING

Then the start and end states are valid.

However, we cannot transit through the invalid state 'RESUMING |
SAVING | !RUNNING' on the way to get there. It means the driver cannot
do RESUMING first, and is really the underlying reason why we found we
needed precedence in the first place.

So, my original design idea is just *completely* wrong.

Since we allow 0->ANY the better solution is to go toward 0 first,
then go away from 0:

 - !NDMA
 - !SAVING
 - !RUNNING
 - !RESUMING
 - RESUMING
 - RUNNING
 - SAVING
 - NDMA

and even this I can't tell by inspection if it is OK or not - need a
program to explore all combinations.

Basically, this is too hard. I can possbily figure it out now, but if
someone wants to add another state someday they *will* screw it up.

How do you feel about simply forbidding this corner case entirely?

Require that new_state ^ old_state is one hot.

I have to check, but I think this is incompatible with current
qemu. Maybe we can allow & document the few combinations qemu uses?

A core function can enforce this consistently for drivers. 

??

(frankly, I have decided I absolutely hate this device_state bit field
as a uAPI choice, the fact we have spent so much time on this is just
*grrrrrr*)

> accessed directly.  We can have a valid, interoperable uAPI without
> constraining ourselves to a specific implementation.  Largely I think
> that trying to impose an implementation as the specification is the
> source of our friction.

I agree, but please accept it is not easy to understand what is spec
and what is qemu. It is part of why I want to write this down.
 
> This is a bit infuriating, responding to it at all is probably ill
> advised.  We're all investing a lot of time into this.  We're all
> disappointed how the open source use case of the previous
> implementation fell apart and nobody else stepped up until now.
> Rubbing salt in that wound is not helpful or productive.

Sorry, I'm not trying to rub salt in a wound - but it is maddening we
are so far down this path and we seem to be back at square one and
debating exactly what the existing uAPI even is.

> Regardless, this implementation has highlighted gaps in the initial
> design and it's critical that those known gaps are addressed before we
> commit to the design with an in-kernel driver.  Referring to the notes
> Connie copied from etherpad, those gaps include uAPI clarification
> regarding various device states and accesses allowed in those states,
> definition of a quiescent (NDMA) device state, discussion of per-device
> dirty state, and documentation such as userspace usage and edge
> cases.

Okay, can summarize your view where we are on all these points? I
really don't know anymore. I think the document addresses all of them.

> Only the latter items were specifically requested outside of the header
> and previously provided comments questioned if we're not actually
> creating contradictory documentation to the uAPI and why clarifications
> are not applied to the existing uAPI descriptions.

As I said, it is simply too much text, there is no simple fix to make
the header file documentation completely clear. If you have some idea
how to fix it please share.

I don't think the new docs are contradictory because the header file
doesn't really say that much.. At least I haven't noticed anything.
Yishai's patch to add NDMA should add it to the header comment a
little bit, I'll check on that.

> Personally I'm a bit disappointed to see v3 posted where the diffstat
> indicates no uAPI updates, so we actually have no formal definition of
> this NDMA state, 

I didn't realize this is something you felt was important for this RFC
patch.

The purpose of the RFC patch is to get some overall mutual agreement
on the uAPI protocol as discussed on the RST text. The non-RFC version
of this RST will come with the next mlx5 driver v6 posting and the
implementation of NDMA for mlx5 and all the cap bits Yishai built for
it. This is already coded - we have only refrained from posting a v6
out of respect for everyone's time to re-look on all the mlx5 code
again if we are still debating big topcis in this text.

If you recall, I said we'd like to go ahead with mlx5 as-is and you
were very concerned about the implication of the NDMA problem. Thus
this document was prepared to explain our view of NDMA and more, plus
we went and updated mlx5 to fully implement it so we will not come
back to this topic later in the kernel.

> get there.  Isn't that how we got into this situation, approving the
> uAPI, or in this case documentation, without an in-kernel
> implementation and vetted userspace?  Thanks,

I'm certainly not proposing we do that, please don't mis-understand.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2021-12-08 20:27 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-29 14:45 [PATCH RFC v2] vfio: Documentation for the migration region Jason Gunthorpe
2021-11-30 17:26 ` Alex Williamson
2021-11-30 18:59   ` Jason Gunthorpe
2021-11-30 22:35     ` Alex Williamson
2021-12-01  3:14       ` Jason Gunthorpe
2021-12-01  9:54         ` Shameerali Kolothum Thodi
2021-12-01 13:49           ` Jason Gunthorpe
2021-12-01 20:03         ` Alex Williamson
2021-12-01 23:25           ` Jason Gunthorpe
2021-12-02 17:05             ` Cornelia Huck
2021-12-02 17:41               ` Jason Gunthorpe
2021-12-02 17:46                 ` Cornelia Huck
2021-12-03 18:06             ` Alex Williamson
2021-12-06 16:03               ` Cornelia Huck
2021-12-06 17:34                 ` Jason Gunthorpe
2021-12-06 18:06                   ` Cornelia Huck
2021-12-06 19:19                     ` Jason Gunthorpe
2021-12-07 11:16                       ` Cornelia Huck
2021-12-07 15:51                         ` Jason Gunthorpe
2021-12-07 16:30                           ` Cornelia Huck
2021-12-07 17:00                             ` Jason Gunthorpe
2021-12-08 16:06                           ` Alex Williamson
2021-12-08 20:27                             ` Jason Gunthorpe
2021-12-06 19:15               ` Jason Gunthorpe
2021-12-07 10:50                 ` Cornelia Huck
2021-12-07 15:37                   ` Jason Gunthorpe
2021-12-07 15:56                     ` Cornelia Huck
2021-12-07 16:13                       ` Jason Gunthorpe
2021-12-07 16:22                     ` Alex Williamson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.