All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC] vfio: Documentation for the migration region
@ 2021-11-22 19:53 Jason Gunthorpe
  2021-11-22 20:31 ` Jonathan Corbet
  2021-11-23 14:21 ` Cornelia Huck
  0 siblings, 2 replies; 13+ messages in thread
From: Jason Gunthorpe @ 2021-11-22 19:53 UTC (permalink / raw)
  To: Alex Williamson, Jonathan Corbet, linux-doc
  Cc: Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy,
	Shameer Kolothum, Yishai Hadas

Provide some more complete documentation for the migration region's
behavior, specifically focusing on the device_state bits and the whole
system view from a VMM.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 Documentation/driver-api/vfio.rst | 208 +++++++++++++++++++++++++++++-
 1 file changed, 207 insertions(+), 1 deletion(-)

Alex/Cornelia, here is the first draft of the requested documentation I promised

We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.

Our thinking is that NDMA would be implemented like this:

   +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)

And a .add_capability ops will be used to signal to userspace driver support:

   +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6

I've described DIRTY TRACKING as a seperate concept here. With the current
uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START, with our
change in direction this would be per-tracker control, but no semantic change.

Upon some agreement we'll include this patch in the next iteration of the mlx5 driver
along with the NDMA bits.

Thanks,
Jason

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index c663b6f978255b..b28c6fb89ee92f 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -242,7 +242,213 @@ group and can access them as follows::
 VFIO User API
 -------------------------------------------------------------------------------
 
-Please see include/linux/vfio.h for complete API documentation.
+Please see include/uapi/linux/vfio.h for complete API documentation.
+
+-------------------------------------------------------------------------------
+
+VFIO migration driver API
+-------------------------------------------------------------------------------
+
+VFIO drivers that support migration implement a migration control register
+called device_state in the struct vfio_device_migration_info which is in its
+VFIO_REGION_TYPE_MIGRATION region.
+
+The device_state triggers device action both when bits are set/cleared and
+continuous behavior for each bit. For VMMs they can also control if the VCPUs in
+a VM are executing (VCPU RUNNING) and if the IOMMU is logging DMAs (DIRTY
+TRACKING). These two controls are not part of the device_state register, KVM
+will be used to control the VCPU and VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the
+container controls dirty tracking.
+
+Along with the device_state the migration driver provides a data window which
+allows streaming migration data into or out of the device.
+
+A lot of flexibility is provided to userspace in how it operates these bits. The
+reference flow for saving device state in a live migration, with all features:
+
+  RUNNING, VCPU_RUNNING
+     Normal operating state
+  RUNNING, DIRTY TRACKING, VCPU RUNNING
+     Log DMAs
+     Stream all memory
+  SAVING | RUNNING, DIRTY TRACKING, VCPU RUNNING
+     Log internal device changes (pre-copy)
+     Stream device state through the migration window
+
+     While in this state repeat as desired:
+	Atomic Read and Clear DMA Dirty log
+	Stream dirty memory
+  SAVING | NDMA | RUNNING, VCPU RUNNING
+     vIOMMU grace state
+     Complete all in progress IO page faults, idle the vIOMMU
+  SAVING | NDMA | RUNNING
+     Peer to Peer DMA grace state
+     Final snapshot of DMA dirty log (atomic not required)
+  SAVING
+     Stream final device state through the migration window
+     Copy final dirty data
+  0
+     Device is halted
+
+and the reference flow for resuming:
+
+  RUNNING
+     Issue VFIO_DEVICE_RESET to clear the internal device state
+  0
+     Device is halted
+  RESUMING
+     Push in migration data. Data captured during pre-copy should be
+     prepended to data captured during SAVING.
+  NDMA | RUNNING
+     Peer to Peer DMA grace state
+  RUNNING, VCPU RUNNING
+     Normal operating state
+
+If the VMM has multiple VFIO devices undergoing migration then the grace states
+act as cross device synchronization points. The VMM must bring all devices to
+the grace state before advancing past it.
+
+To support these operations the migration driver is required to implement
+specific behaviors around the device_state.
+
+Actions on Set/Clear:
+ - SAVING | RUNNING
+   The device clears the data window and begins streaming 'pre copy' migration
+   data through the window. Device that cannot log internal state changes return
+   a 0 length migration stream.
+
+ - SAVING | !RUNNING
+   The device captures its internal state and begins streaming migration data
+   through the migration window
+
+ - RESUMING
+   The data window is opened and can receive the migration data.
+
+ - !RESUMING
+   All the data transferred into the data window is loaded into the device's
+   internal state. The migration driver can rely on userspace issuing a
+   VFIO_DEVICE_RESET prior to starting RESUMING.
+
+ - DIRTY TRACKING
+   On set clear the DMA log and start logging
+
+   On clear freeze the DMA log and allow userspace to read it. Userspace must
+   take care to ensure that DMA is suspended before clearing DIRTY TRACKING, for
+   instance by using NDMA.
+
+   DMA logs should be readable with an "atomic test and clear" to allow
+   continuous non-disruptive sampling of the log.
+
+Continuous Actions:
+  - NDMA
+    The device is not allowed to issue new DMA operations.
+    Before NDMA returns all in progress DMAs must be completed.
+
+  - !RUNNING
+    The device should not change its internal state. Implies NDMA. Any internal
+    state logging can stop.
+
+  - SAVING | !RUNNING
+    RESUMING | !RUNNING
+    The device may assume there are no incoming MMIO operations.
+
+  - RUNNING
+    The device can alter its internal state and must respond to incoming MMIO.
+
+  - SAVING | RUNNING
+    The device is logging changes to the internal state.
+
+  - !VCPU RUNNING
+    The CPU must not generate dirty pages or issue MMIO operations to devices.
+
+  - DIRTY TRACKING
+    DMAs are logged
+
+  - ERROR
+    The behavior of the device is undefined. The device must be recovered by
+    issuing VFIO_DEVICE_RESET.
+
+In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the device
+back to device_state RUNNING. When a migration driver executes this ioctl it
+should discard the data window and set migration_state to RUNNING. This must
+happen even if the migration_state has errored. A freshly opened device FD
+should always be in the RUNNING state.
+
+The migration driver has limitations on what device state it can affect. Any
+device state controlled by general kernel subsystems must not be changed during
+RESUME, and SAVING must tolerate mutation of this state. Change to externally
+controlled device state can happen at any time, asynchronously, to the migration
+(ie interrupt rebalancing).
+
+Some examples of externally controlled state:
+ - MSI-X interrupt page
+ - MSI/legacy interrupt configuration
+ - Large parts of the PCI configuration space, ie common control bits
+ - PCI power management
+ - Changes via VFIO_DEVICE_SET_IRQS
+
+During !RUNNING, especially during SAVING and RESUMING, the device may have
+limitations on what it can tolerate. An ideal device will discard/return all
+ones to all incoming MMIO/PIO operations (exclusive of the external state above)
+in !RUNNING. However, devices are free to have undefined behavior if they
+receive MMIOs. This includes corrupting/aborting the migration, dirtying pages,
+and segfaulting userspace.
+
+However, a device may not compromise system integrity if it is subjected to a
+MMIO. It can not trigger an error TLP, it can not trigger a Machine Check, and
+it can not compromise device isolation.
+
+There are several edge cases that userspace should keep in mind when
+implementing migration:
+
+- Device Peer to Peer DMA. In this case devices are able issue DMAs to each
+  other's MMIO regions. The VMM can permit this if it maps the MMIO memory into
+  the IOMMU.
+
+  As Peer to Peer DMA is a MMIO touch like any other, it is important that
+  userspace suspend these accesses before entering any device_state where MMIO
+  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
+  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
+  device does not support NDMA, and rely on that to guarantee quiet MMIO.
+
+  The P2P Grace States exist so that all devices may reach RUNNING before any
+  device is subjected to a MMIO access.
+
+  Failure to guarentee quiet MMIO may allow a hostile VM to use P2P to violate
+  the no-MMIO restriction during SAVING and corrupt the migration on devices
+  that cannot protect themselves.
+
+- IOMMU Page faults handled in userspace can occur at any time. A migration
+  driver is not required to serialize in-progress page faults. It can assume
+  that all page faults are completed before entering SAVING | !RUNNING. Since
+  the guest VCPU is required to complete page faults the VMM can accomplish this
+  by asserting NDMA | VCPU_RUNNING and clearing all pending page faults before
+  clearing VCPU_RUNNING.
+
+  Device that do not support NDMA cannot be configured to generate page faults
+  that require the VCPU to complete.
+
+- pre-copy allows the device to implement a dirty log for its internal state.
+  During the SAVING | RUNNING state the data window should present the device
+  state being logged and during SAVING | !RUNNING the data window should present
+  the unlogged device state as well as the changes from the internal dirty log.
+
+  On RESUME these two data streams are concatenated together.
+
+  pre-copy is only concerned with internal device state. External DMAs are
+  covered by the DIRTY TRACK function.
+
+- Atomic Read and Clear of the DMA log is a HW feature. If the tracker
+  cannot support this, then NDMA could be used to synthesize it less
+  efficiently.
+
+- NDMA is optional, if the device does not support this then the NDMA States
+  are pushed down to the next step in the sequence and various behaviors that
+  rely on NDMA cannot be used.
+
+TDB - discoverable feature flag for NDMA
+TDB IMS xlation
+TBD PASID xlation
 
 VFIO bus driver API
 -------------------------------------------------------------------------------

base-commit: ae0351a976d1880cf152de2bc680f1dff14d9049
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-22 19:53 [PATCH RFC] vfio: Documentation for the migration region Jason Gunthorpe
@ 2021-11-22 20:31 ` Jonathan Corbet
  2021-11-23  0:20   ` Jason Gunthorpe
  2021-11-23 14:21 ` Cornelia Huck
  1 sibling, 1 reply; 13+ messages in thread
From: Jonathan Corbet @ 2021-11-22 20:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, linux-doc
  Cc: Cornelia Huck, kvm, Kirti Wankhede, Max Gurtovoy,
	Shameer Kolothum, Yishai Hadas

Jason Gunthorpe <jgg@nvidia.com> writes:

> Provide some more complete documentation for the migration region's
> behavior, specifically focusing on the device_state bits and the whole
> system view from a VMM.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  Documentation/driver-api/vfio.rst | 208 +++++++++++++++++++++++++++++-
>  1 file changed, 207 insertions(+), 1 deletion(-)
>
> Alex/Cornelia, here is the first draft of the requested documentation I promised
>
> We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.
>
> Our thinking is that NDMA would be implemented like this:
>
>    +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)
>
> And a .add_capability ops will be used to signal to userspace driver support:
>
>    +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6
>
> I've described DIRTY TRACKING as a seperate concept here. With the current
> uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START, with our
> change in direction this would be per-tracker control, but no semantic change.
>
> Upon some agreement we'll include this patch in the next iteration of the mlx5 driver
> along with the NDMA bits.
>
> Thanks,
> Jason
>
> diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
> index c663b6f978255b..b28c6fb89ee92f 100644
> --- a/Documentation/driver-api/vfio.rst
> +++ b/Documentation/driver-api/vfio.rst
> @@ -242,7 +242,213 @@ group and can access them as follows::
>  VFIO User API
>  -------------------------------------------------------------------------------
>  
> -Please see include/linux/vfio.h for complete API documentation.
> +Please see include/uapi/linux/vfio.h for complete API documentation.
> +
> +-------------------------------------------------------------------------------
> +
> +VFIO migration driver API
> +-------------------------------------------------------------------------------
> +
> +VFIO drivers that support migration implement a migration control register
> +called device_state in the struct vfio_device_migration_info which is in its
> +VFIO_REGION_TYPE_MIGRATION region.
> +
> +The device_state triggers device action both when bits are set/cleared and
> +continuous behavior for each bit. For VMMs they can also control if the VCPUs in
> +a VM are executing (VCPU RUNNING) and if the IOMMU is logging DMAs (DIRTY
> +TRACKING). These two controls are not part of the device_state register, KVM
> +will be used to control the VCPU and VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the
> +container controls dirty tracking.
> +
> +Along with the device_state the migration driver provides a data window which
> +allows streaming migration data into or out of the device.
> +
> +A lot of flexibility is provided to userspace in how it operates these bits. The
> +reference flow for saving device state in a live migration, with all features:
> +
> +  RUNNING, VCPU_RUNNING
> +     Normal operating state
> +  RUNNING, DIRTY TRACKING, VCPU RUNNING
> +     Log DMAs
> +     Stream all memory

So I'd recommend actually building the docs and looking at the result;
this will not render the way you expect it to.  I'd suggest using a
literal block for preformatted sections like this.

Thanks,

jon

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-22 20:31 ` Jonathan Corbet
@ 2021-11-23  0:20   ` Jason Gunthorpe
  2021-11-23  7:22     ` Akira Yokosawa
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2021-11-23  0:20 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Alex Williamson, linux-doc, Cornelia Huck, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Mon, Nov 22, 2021 at 01:31:14PM -0700, Jonathan Corbet wrote:
> Jason Gunthorpe <jgg@nvidia.com> writes:
> 
> > Provide some more complete documentation for the migration region's
> > behavior, specifically focusing on the device_state bits and the whole
> > system view from a VMM.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >  Documentation/driver-api/vfio.rst | 208 +++++++++++++++++++++++++++++-
> >  1 file changed, 207 insertions(+), 1 deletion(-)
> >
> > Alex/Cornelia, here is the first draft of the requested documentation I promised
> >
> > We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.
> >
> > Our thinking is that NDMA would be implemented like this:
> >
> >    +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)
> >
> > And a .add_capability ops will be used to signal to userspace driver support:
> >
> >    +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6
> >
> > I've described DIRTY TRACKING as a seperate concept here. With the current
> > uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START, with our
> > change in direction this would be per-tracker control, but no semantic change.
> >
> > Upon some agreement we'll include this patch in the next iteration of the mlx5 driver
> > along with the NDMA bits.
> >
> > Thanks,
> > Jason
> >
> > diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
> > index c663b6f978255b..b28c6fb89ee92f 100644
> > +++ b/Documentation/driver-api/vfio.rst
> > @@ -242,7 +242,213 @@ group and can access them as follows::
> >  VFIO User API
> >  
> > -Please see include/linux/vfio.h for complete API documentation.
> > +Please see include/uapi/linux/vfio.h for complete API documentation.
> > +
> > +-------------------------------------------------------------------------------
> > +
> > +VFIO migration driver API
> > +-------------------------------------------------------------------------------
> > +
> > +VFIO drivers that support migration implement a migration control register
> > +called device_state in the struct vfio_device_migration_info which is in its
> > +VFIO_REGION_TYPE_MIGRATION region.
> > +
> > +The device_state triggers device action both when bits are set/cleared and
> > +continuous behavior for each bit. For VMMs they can also control if the VCPUs in
> > +a VM are executing (VCPU RUNNING) and if the IOMMU is logging DMAs (DIRTY
> > +TRACKING). These two controls are not part of the device_state register, KVM
> > +will be used to control the VCPU and VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the
> > +container controls dirty tracking.
> > +
> > +Along with the device_state the migration driver provides a data window which
> > +allows streaming migration data into or out of the device.
> > +
> > +A lot of flexibility is provided to userspace in how it operates these bits. The
> > +reference flow for saving device state in a live migration, with all features:
> > +
> > +  RUNNING, VCPU_RUNNING
> > +     Normal operating state
> > +  RUNNING, DIRTY TRACKING, VCPU RUNNING
> > +     Log DMAs
> > +     Stream all memory
> 
> So I'd recommend actually building the docs and looking at the result;
> this will not render the way you expect it to.  

Hum... It is really close to what I'd like, with the state names
bolded and in something like an enumerated list

But on close inspection I see the text fragments have been assembled
together. I'd probably try to make them into sentances than go to a
literal block?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-23  0:20   ` Jason Gunthorpe
@ 2021-11-23  7:22     ` Akira Yokosawa
  0 siblings, 0 replies; 13+ messages in thread
From: Akira Yokosawa @ 2021-11-23  7:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Cornelia Huck, Jonathan Corbet, kvm,
	Kirti Wankhede, linux-doc, Max Gurtovoy, Shameer Kolothum,
	Yishai Hadas

On Mon, 22 Nov 2021 20:20:42 -0400, Jason Gunthorpe wrote:
> On Mon, Nov 22, 2021 at 01:31:14PM -0700, Jonathan Corbet wrote:
>> Jason Gunthorpe <jgg@nvidia.com> writes:
>> 
>> > Provide some more complete documentation for the migration region's
>> > behavior, specifically focusing on the device_state bits and the whole
>> > system view from a VMM.
>> >
>> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> >  Documentation/driver-api/vfio.rst | 208 +++++++++++++++++++++++++++++-
>> >  1 file changed, 207 insertions(+), 1 deletion(-)
>> >
>> > Alex/Cornelia, here is the first draft of the requested documentation I promised
>> >
>> > We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.
>> >
>> > Our thinking is that NDMA would be implemented like this:
>> >
>> >    +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)
>> >
>> > And a .add_capability ops will be used to signal to userspace driver support:
>> >
>> >    +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6
>> >
>> > I've described DIRTY TRACKING as a seperate concept here. With the current
>> > uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START, with our
>> > change in direction this would be per-tracker control, but no semantic change.
>> >
>> > Upon some agreement we'll include this patch in the next iteration of the mlx5 driver
>> > along with the NDMA bits.
>> >
>> > Thanks,
>> > Jason
>> >
>> > diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
>> > index c663b6f978255b..b28c6fb89ee92f 100644
>> > +++ b/Documentation/driver-api/vfio.rst
>> > @@ -242,7 +242,213 @@ group and can access them as follows::
>> >  VFIO User API
>> >  
>> > -Please see include/linux/vfio.h for complete API documentation.
>> > +Please see include/uapi/linux/vfio.h for complete API documentation.
>> > +
>> > +-------------------------------------------------------------------------------
>> > +
>> > +VFIO migration driver API
>> > +-------------------------------------------------------------------------------
>> > +
>> > +VFIO drivers that support migration implement a migration control register
>> > +called device_state in the struct vfio_device_migration_info which is in its
>> > +VFIO_REGION_TYPE_MIGRATION region.
>> > +
>> > +The device_state triggers device action both when bits are set/cleared and
>> > +continuous behavior for each bit. For VMMs they can also control if the VCPUs in
>> > +a VM are executing (VCPU RUNNING) and if the IOMMU is logging DMAs (DIRTY
>> > +TRACKING). These two controls are not part of the device_state register, KVM
>> > +will be used to control the VCPU and VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the
>> > +container controls dirty tracking.
>> > +
>> > +Along with the device_state the migration driver provides a data window which
>> > +allows streaming migration data into or out of the device.
>> > +
>> > +A lot of flexibility is provided to userspace in how it operates these bits. The
>> > +reference flow for saving device state in a live migration, with all features:
>> > +
>> > +  RUNNING, VCPU_RUNNING
>> > +     Normal operating state

Hi.

So if you'd like definition lists of ReST, adding empty lines here

>> > +  RUNNING, DIRTY TRACKING, VCPU RUNNING>> > +     Log DMAs

and here should do the trick.

>> > +     Stream all memory

Ditto for the rest of the lists.

>> 
>> So I'd recommend actually building the docs and looking at the result;
>> this will not render the way you expect it to.  
> 
> Hum... It is really close to what I'd like, with the state names
> bolded and in something like an enumerated list
> 
> But on close inspection I see the text fragments have been assembled
> together. I'd probably try to make them into sentances than go to a
> literal block?

Please see ReST documentation for further info:

    https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#definition-lists

        Thanks, Akira
> 
> Thanks,
> Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-22 19:53 [PATCH RFC] vfio: Documentation for the migration region Jason Gunthorpe
  2021-11-22 20:31 ` Jonathan Corbet
@ 2021-11-23 14:21 ` Cornelia Huck
  2021-11-23 16:53   ` Jason Gunthorpe
  1 sibling, 1 reply; 13+ messages in thread
From: Cornelia Huck @ 2021-11-23 14:21 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Jonathan Corbet, linux-doc
  Cc: kvm, Kirti Wankhede, Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Mon, Nov 22 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> Provide some more complete documentation for the migration region's
> behavior, specifically focusing on the device_state bits and the whole
> system view from a VMM.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  Documentation/driver-api/vfio.rst | 208 +++++++++++++++++++++++++++++-
>  1 file changed, 207 insertions(+), 1 deletion(-)
>
> Alex/Cornelia, here is the first draft of the requested documentation I promised

Thanks, I'm taking a quick first look.

[As it's Thanksgiving week in the US, I don't think Alex will be reading
this right now.]

>
> We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.
>
> Our thinking is that NDMA would be implemented like this:
>
>    +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)
>
> And a .add_capability ops will be used to signal to userspace driver support:
>
>    +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6
>
> I've described DIRTY TRACKING as a seperate concept here. With the current
> uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START, with our
> change in direction this would be per-tracker control, but no semantic change.
>
> Upon some agreement we'll include this patch in the next iteration of the mlx5 driver
> along with the NDMA bits.
>
> Thanks,
> Jason
>
> diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
> index c663b6f978255b..b28c6fb89ee92f 100644
> --- a/Documentation/driver-api/vfio.rst
> +++ b/Documentation/driver-api/vfio.rst
> @@ -242,7 +242,213 @@ group and can access them as follows::
>  VFIO User API
>  -------------------------------------------------------------------------------
>  
> -Please see include/linux/vfio.h for complete API documentation.
> +Please see include/uapi/linux/vfio.h for complete API documentation.
> +
> +-------------------------------------------------------------------------------
> +
> +VFIO migration driver API
> +-------------------------------------------------------------------------------
> +
> +VFIO drivers that support migration implement a migration control register
> +called device_state in the struct vfio_device_migration_info which is in its
> +VFIO_REGION_TYPE_MIGRATION region.
> +
> +The device_state triggers device action both when bits are set/cleared and
> +continuous behavior for each bit.

I had trouble parsing this sentence, until I read further down... maybe
use something like the following:

The device_state controls both device action and continuous behaviour.
Setting/clearing a bit triggers device action, and each bit controls
continuous behaviour.

> For VMMs they can also control if the VCPUs in
> +a VM are executing (VCPU RUNNING) and if the IOMMU is logging DMAs (DIRTY
> +TRACKING). These two controls are not part of the device_state register, KVM
> +will be used to control the VCPU and VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the
> +container controls dirty tracking.

We usually try to keep kvm out of documentation for the vfio
interfaces; better frame that as an example?

> +
> +Along with the device_state the migration driver provides a data window which
> +allows streaming migration data into or out of the device.
> +
> +A lot of flexibility is provided to userspace in how it operates these bits. The
> +reference flow for saving device state in a live migration, with all features:

It may also vary depending on the device being migrated (a subchannel
passed via vfio-ccw will behave differently than a pci device.)

> +
> +  RUNNING, VCPU_RUNNING

Nit: everywhere else you used "VCPU RUNNING".

Also, can we separate device state bits as defined in vfio.h and VMM
state bits visually a bit :) better?

> +     Normal operating state
> +  RUNNING, DIRTY TRACKING, VCPU RUNNING
> +     Log DMAs
> +     Stream all memory

all memory accessed by the device?

> +  SAVING | RUNNING, DIRTY TRACKING, VCPU RUNNING
> +     Log internal device changes (pre-copy)
> +     Stream device state through the migration window
> +
> +     While in this state repeat as desired:
> +	Atomic Read and Clear DMA Dirty log
> +	Stream dirty memory
> +  SAVING | NDMA | RUNNING, VCPU RUNNING
> +     vIOMMU grace state
> +     Complete all in progress IO page faults, idle the vIOMMU
> +  SAVING | NDMA | RUNNING
> +     Peer to Peer DMA grace state
> +     Final snapshot of DMA dirty log (atomic not required)
> +  SAVING
> +     Stream final device state through the migration window
> +     Copy final dirty data
> +  0
> +     Device is halted
> +
> +and the reference flow for resuming:
> +
> +  RUNNING
> +     Issue VFIO_DEVICE_RESET to clear the internal device state
> +  0
> +     Device is halted
> +  RESUMING
> +     Push in migration data. Data captured during pre-copy should be
> +     prepended to data captured during SAVING.
> +  NDMA | RUNNING
> +     Peer to Peer DMA grace state
> +  RUNNING, VCPU RUNNING
> +     Normal operating state
> +
> +If the VMM has multiple VFIO devices undergoing migration then the grace states
> +act as cross device synchronization points. The VMM must bring all devices to
> +the grace state before advancing past it.
> +
> +To support these operations the migration driver is required to implement
> +specific behaviors around the device_state.
> +
> +Actions on Set/Clear:
> + - SAVING | RUNNING
> +   The device clears the data window and begins streaming 'pre copy' migration
> +   data through the window. Device that cannot log internal state changes return
> +   a 0 length migration stream.

Hm. This and the following are "combination states", i.e. not what I'd
expect if I read about setting/clearing bits. What you describe is what
happens if the device has RUNNING set and additionally SAVING is set,
isn't it? What happens if we set SAVING while !RUNNING? The action below
looks like what is happening when RUNNING is cleared while SAVING is set.

> +
> + - SAVING | !RUNNING
> +   The device captures its internal state and begins streaming migration data
> +   through the migration window
> +
> + - RESUMING
> +   The data window is opened and can receive the migration data.
> +
> + - !RESUMING
> +   All the data transferred into the data window is loaded into the device's
> +   internal state. The migration driver can rely on userspace issuing a
> +   VFIO_DEVICE_RESET prior to starting RESUMING.

Can we also fail migration? I.e. clearing RESUMING without setting RUNNING.

> +
> + - DIRTY TRACKING
> +   On set clear the DMA log and start logging
> +
> +   On clear freeze the DMA log and allow userspace to read it. Userspace must
> +   take care to ensure that DMA is suspended before clearing DIRTY TRACKING, for
> +   instance by using NDMA.
> +
> +   DMA logs should be readable with an "atomic test and clear" to allow
> +   continuous non-disruptive sampling of the log.

I'm not sure whether including DIRTY TRACKING with the bits in
device_state could lead to confusion...

> +
> +Continuous Actions:
> +  - NDMA
> +    The device is not allowed to issue new DMA operations.

Doesn't that make it an action trigger as well? I.e. when NDMA is set, a
blocker for DMA operations is in place?

> +    Before NDMA returns all in progress DMAs must be completed.

What does that mean? That the operation setting NDMA in device_state
returns? This would effectively make setting NDMA a trigger to actually
stop doing any DMA.

> +
> +  - !RUNNING
> +    The device should not change its internal state. Implies NDMA. Any internal
> +    state logging can stop.

So we have:
- !RUNNING -- no DMA, regardless whether NDMA is set
- RUNNING|NDMA -- the device can change its internal state, but not do
  DMA

!RUNNING|!NDMA would basically be a valid state if a device is stopped
before RESUMING, but not for outbound migration?

> +
> +  - SAVING | !RUNNING
> +    RESUMING | !RUNNING
> +    The device may assume there are no incoming MMIO operations.
> +
> +  - RUNNING
> +    The device can alter its internal state and must respond to incoming MMIO.
> +
> +  - SAVING | RUNNING
> +    The device is logging changes to the internal state.
> +
> +  - !VCPU RUNNING
> +    The CPU must not generate dirty pages or issue MMIO operations to devices.
> +
> +  - DIRTY TRACKING
> +    DMAs are logged
> +
> +  - ERROR
> +    The behavior of the device is undefined. The device must be recovered by
> +    issuing VFIO_DEVICE_RESET.
> +

I'm wondering whether it would be better to distinguish between
individual bit meanings vs composite states than set/clear actions vs
continuous actions. This could give us a good overview about what a
device can/should do while in a certain state and what flipping a
certain bit implies.

<I still need to read through the rest of the document>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-23 14:21 ` Cornelia Huck
@ 2021-11-23 16:53   ` Jason Gunthorpe
  2021-11-24 16:55     ` Cornelia Huck
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2021-11-23 16:53 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Nov 23, 2021 at 03:21:06PM +0100, Cornelia Huck wrote:
> On Mon, Nov 22 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > Provide some more complete documentation for the migration region's
> > behavior, specifically focusing on the device_state bits and the whole
> > system view from a VMM.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >  Documentation/driver-api/vfio.rst | 208 +++++++++++++++++++++++++++++-
> >  1 file changed, 207 insertions(+), 1 deletion(-)
> >
> > Alex/Cornelia, here is the first draft of the requested documentation I promised
> 
> Thanks, I'm taking a quick first look.
> 
> [As it's Thanksgiving week in the US, I don't think Alex will be reading
> this right now.]

Sure. I've made a lot of changes already so I'll try to post a v2 next
week

> > +The device_state triggers device action both when bits are set/cleared and
> > +continuous behavior for each bit.
> 
> I had trouble parsing this sentence, until I read further down... maybe
> use something like the following:
> 
> The device_state controls both device action and continuous behaviour.
> Setting/clearing a bit triggers device action, and each bit controls
> continuous behaviour.

OK

> > For VMMs they can also control if the VCPUs in
> > +a VM are executing (VCPU RUNNING) and if the IOMMU is logging DMAs (DIRTY
> > +TRACKING). These two controls are not part of the device_state register, KVM
> > +will be used to control the VCPU and VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the
> > +container controls dirty tracking.
> 
> We usually try to keep kvm out of documentation for the vfio
> interfaces; better frame that as an example?

It is important, we can't clearly explain how things like PRI work or
NDMA without talking about a 'VCPU' concept. I think this is a case
where trying to be general is only going to hurt understandability.

Lets add some more text like 'VCPU RUNNING is used to model the
ability of the VFIO userspace to mutate the device. For VMM cases this
would be mapped to a KVM control on a VCPU, but non VMMs must also
similarly suspend their use of the VFIO device in !VCPU_RUNNING'

> > +Along with the device_state the migration driver provides a data window which
> > +allows streaming migration data into or out of the device.
> > +
> > +A lot of flexibility is provided to userspace in how it operates these bits. The
> > +reference flow for saving device state in a live migration, with all features:
> 
> It may also vary depending on the device being migrated (a subchannel
> passed via vfio-ccw will behave differently than a pci device.)

I don't think I like this statement - why/where would the overall flow
differ?
 
> > +  RUNNING, VCPU_RUNNING
> 
> Nit: everywhere else you used "VCPU RUNNING".

Oh, lets stick with the _ version then
 
> Also, can we separate device state bits as defined in vfio.h and VMM
> state bits visually a bit :) better?

Any idea? I used | for the migration_state and , for the externa ones.

> > +     Normal operating state
> > +  RUNNING, DIRTY TRACKING, VCPU RUNNING
> > +     Log DMAs
> > +     Stream all memory
> 
> all memory accessed by the device?

In this reference flow this is all VCPU memory. Ie you start global
dirty tracking both in VFIO and in the VCPU and start copying all VM
memory.

> > +Actions on Set/Clear:
> > + - SAVING | RUNNING
> > +   The device clears the data window and begins streaming 'pre copy' migration
> > +   data through the window. Device that cannot log internal state changes return
> > +   a 0 length migration stream.
> 
> Hm. This and the following are "combination states", i.e. not what I'd
> expect if I read about setting/clearing bits. 

If you examine the mlx5 driver you'll see this is how the logic is
actually implemented. It is actually very subtly complicated to
implement this properly. I also added this text:

 When multiple bits change in the migration_state the migration driver must
 process them in a priority order:

 - SAVING | RUNNING
 - !RUNNING
 - !NDMA
 - SAVING | !RUNNING
 - RESUMING
 - !RESUMING
 - NDMA
 - RUNNING

The combination states are actually two bit states and entry/exit are
defined in terms of both bits.

> What you describe is what happens if the device has RUNNING set and
> additionally SAVING is set, isn't it? 

Any change in SAVING & RUNNING that results in the new value being
SAVING | RUNNING must follow the above description.

So
  SAVING|0 -> SAVING|RUNNING
  0|RUNNING -> SAVING|RUNNING
  0 -> SAVING|RUNNING

Are all valid ways to reach this action.

This is the substantive difference between the actions and the
continuous, here something specific happes only on entry: 'clears the
data window and begins'

vs something like NDMA which is just continuously preventing DMA.

> > + - SAVING | !RUNNING
> > +   The device captures its internal state and begins streaming migration data
> > +   through the migration window
> > +
> > + - RESUMING
> > +   The data window is opened and can receive the migration data.
> > +
> > + - !RESUMING
> > +   All the data transferred into the data window is loaded into the device's
> > +   internal state. The migration driver can rely on userspace issuing a
> > +   VFIO_DEVICE_RESET prior to starting RESUMING.
> 
> Can we also fail migration? I.e. clearing RESUMING without setting RUNNING.

No, once RESUMING clears migration cannot be forced to fail, to abort
userspace should trigger reset.

This deserves some more language:

If the migration data is invalid the device should go to the ERROR state.

> > + - DIRTY TRACKING
> > +   On set clear the DMA log and start logging
> > +
> > +   On clear freeze the DMA log and allow userspace to read it. Userspace must
> > +   take care to ensure that DMA is suspended before clearing DIRTY TRACKING, for
> > +   instance by using NDMA.
> > +
> > +   DMA logs should be readable with an "atomic test and clear" to allow
> > +   continuous non-disruptive sampling of the log.
> 
> I'm not sure whether including DIRTY TRACKING with the bits in
> device_state could lead to confusion...

It is part of the flow and userspace must sequence it properly, just
like VCPU. We can't properly describe everything without talking about
it.

> > +Continuous Actions:
> > +  - NDMA
> > +    The device is not allowed to issue new DMA operations.
> 
> Doesn't that make it an action trigger as well? I.e. when NDMA is set, a
> blocker for DMA operations is in place?

For clarity I didn't split things like that. All the continuous
behaviors start when the given bits begins and stop when the bits
end.

Most of the actions talk about changes in the data window
 
> > +    Before NDMA returns all in progress DMAs must be completed.
> 
> What does that mean? That the operation setting NDMA in device_state
> returns? 

Yes, it must be a synchronous behavior.

> > +  - !RUNNING
> > +    The device should not change its internal state. Implies NDMA. Any internal
> > +    state logging can stop.
> 
> So we have:
> - !RUNNING -- no DMA, regardless whether NDMA is set
> - RUNNING|NDMA -- the device can change its internal state, but not do
>   DMA
> 
> !RUNNING|!NDMA would basically be a valid state if a device is stopped
> before RESUMING, but not for outbound migration?

The reference flows are just examples we can all think on, it is
always valid to go to any of the legal bit patterns, but may not be
useful.
 
This specifically not a FSM so any before/after migration_state is
technically legal and the device should behave as described here.

> > +  - SAVING | !RUNNING
> > +    RESUMING | !RUNNING
> > +    The device may assume there are no incoming MMIO operations.
> > +
> > +  - RUNNING
> > +    The device can alter its internal state and must respond to incoming MMIO.
> > +
> > +  - SAVING | RUNNING
> > +    The device is logging changes to the internal state.
> > +
> > +  - !VCPU RUNNING
> > +    The CPU must not generate dirty pages or issue MMIO operations to devices.
> > +
> > +  - DIRTY TRACKING
> > +    DMAs are logged
> > +
> > +  - ERROR
> > +    The behavior of the device is undefined. The device must be recovered by
> > +    issuing VFIO_DEVICE_RESET.
> > +
> 
> I'm wondering whether it would be better to distinguish between
> individual bit meanings vs composite states than set/clear actions vs
> continuous actions. This could give us a good overview about what a
> device can/should do while in a certain state and what flipping a
> certain bit implies.

Again, refer to the mlx5 implementation, there are not actually
individual bits here controlling specific things. SAVING for instance
has no device behavior meaning when discussed in isolation.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-23 16:53   ` Jason Gunthorpe
@ 2021-11-24 16:55     ` Cornelia Huck
  2021-11-24 18:40       ` Jason Gunthorpe
  0 siblings, 1 reply; 13+ messages in thread
From: Cornelia Huck @ 2021-11-24 16:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Tue, Nov 23 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Nov 23, 2021 at 03:21:06PM +0100, Cornelia Huck wrote:
>> On Mon, Nov 22 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
>> > For VMMs they can also control if the VCPUs in
>> > +a VM are executing (VCPU RUNNING) and if the IOMMU is logging DMAs (DIRTY
>> > +TRACKING). These two controls are not part of the device_state register, KVM
>> > +will be used to control the VCPU and VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the
>> > +container controls dirty tracking.
>> 
>> We usually try to keep kvm out of documentation for the vfio
>> interfaces; better frame that as an example?
>
> It is important, we can't clearly explain how things like PRI work or
> NDMA without talking about a 'VCPU' concept. I think this is a case
> where trying to be general is only going to hurt understandability.
>
> Lets add some more text like 'VCPU RUNNING is used to model the
> ability of the VFIO userspace to mutate the device. For VMM cases this
> would be mapped to a KVM control on a VCPU, but non VMMs must also
> similarly suspend their use of the VFIO device in !VCPU_RUNNING'

Yes, defining what we mean by "VCPU RUNNING" and "DIRTY TRACKING" first
makes the most sense.

(It also imposes some rules on userspace, doesn't it? Whatever it does,
the interaction with vfio needs to be at least somewhat similar to what
QEMU or another VMM would do. I wonder if we need to be more concrete
here; but let's talk about the basic interface first.)

>
>> > +Along with the device_state the migration driver provides a data window which
>> > +allows streaming migration data into or out of the device.
>> > +
>> > +A lot of flexibility is provided to userspace in how it operates these bits. The
>> > +reference flow for saving device state in a live migration, with all features:
>> 
>> It may also vary depending on the device being migrated (a subchannel
>> passed via vfio-ccw will behave differently than a pci device.)
>
> I don't think I like this statement - why/where would the overall flow
> differ?

What I meant to say: If we give userspace the flexibility to operate
this, we also must give different device types some flexibility. While
subchannels will follow the general flow, they'll probably condense/omit
some steps, as I/O is quite different to PCI there.

>  
>> > +  RUNNING, VCPU_RUNNING
>> 
>> Nit: everywhere else you used "VCPU RUNNING".
>
> Oh, lets stick with the _ version then
>  
>> Also, can we separate device state bits as defined in vfio.h and VMM
>> state bits visually a bit :) better?
>
> Any idea? I used | for the migration_state and , for the externa ones.

Not any good one :( Maybe separate by tabs? We could probably coax the
generated documents into something more distinct, but it's not that easy
to do with normal text.

>
>> > +     Normal operating state
>> > +  RUNNING, DIRTY TRACKING, VCPU RUNNING
>> > +     Log DMAs
>> > +     Stream all memory
>> 
>> all memory accessed by the device?
>
> In this reference flow this is all VCPU memory. Ie you start global
> dirty tracking both in VFIO and in the VCPU and start copying all VM
> memory.

So, general migration, not just the vfio specific parts?

>
>> > +Actions on Set/Clear:
>> > + - SAVING | RUNNING
>> > +   The device clears the data window and begins streaming 'pre copy' migration
>> > +   data through the window. Device that cannot log internal state changes return
>> > +   a 0 length migration stream.
>> 
>> Hm. This and the following are "combination states", i.e. not what I'd
>> expect if I read about setting/clearing bits. 
>
> If you examine the mlx5 driver you'll see this is how the logic is
> actually implemented. It is actually very subtly complicated to
> implement this properly. I also added this text:
>
>  When multiple bits change in the migration_state the migration driver must
>  process them in a priority order:
>
>  - SAVING | RUNNING
>  - !RUNNING
>  - !NDMA
>  - SAVING | !RUNNING
>  - RESUMING
>  - !RESUMING
>  - NDMA
>  - RUNNING
>
> The combination states are actually two bit states and entry/exit are
> defined in terms of both bits.

"subtly complicated" captures this well :(

For example, if I interpret your list correctly, the driver should
prioritize clearing RUNNING over setting SAVING | !RUNNING. What does
that mean? If RUNNING is cleared, first deal with whatever action that
triggers, then later check if it is actually a case of setting SAVING |
!RUNNING, and perform the required actions for that?

Also, does e.g. SAVING | RUNNING mean that both SAVING and RUNNING are
getting set, or only one of them, if the other was already set?

>
>> What you describe is what happens if the device has RUNNING set and
>> additionally SAVING is set, isn't it? 
>
> Any change in SAVING & RUNNING that results in the new value being
> SAVING | RUNNING must follow the above description.
>
> So
>   SAVING|0 -> SAVING|RUNNING
>   0|RUNNING -> SAVING|RUNNING
>   0 -> SAVING|RUNNING
>
> Are all valid ways to reach this action.

Yeah, that's what I meant -- this is not very obvious from the
description.

>
> This is the substantive difference between the actions and the
> continuous, here something specific happes only on entry: 'clears the
> data window and begins'
>
> vs something like NDMA which is just continuously preventing DMA.

Hm. I'm still not quite convinced whether that is a good distinction to
make. I'll read on.

>
>> > + - SAVING | !RUNNING
>> > +   The device captures its internal state and begins streaming migration data
>> > +   through the migration window
>> > +
>> > + - RESUMING
>> > +   The data window is opened and can receive the migration data.
>> > +
>> > + - !RESUMING
>> > +   All the data transferred into the data window is loaded into the device's
>> > +   internal state. The migration driver can rely on userspace issuing a
>> > +   VFIO_DEVICE_RESET prior to starting RESUMING.
>> 
>> Can we also fail migration? I.e. clearing RESUMING without setting RUNNING.
>
> No, once RESUMING clears migration cannot be forced to fail, to abort
> userspace should trigger reset.
>
> This deserves some more language:
>
> If the migration data is invalid the device should go to the ERROR state.

Ok.

>
>> > + - DIRTY TRACKING
>> > +   On set clear the DMA log and start logging
>> > +
>> > +   On clear freeze the DMA log and allow userspace to read it. Userspace must
>> > +   take care to ensure that DMA is suspended before clearing DIRTY TRACKING, for
>> > +   instance by using NDMA.
>> > +
>> > +   DMA logs should be readable with an "atomic test and clear" to allow
>> > +   continuous non-disruptive sampling of the log.
>> 
>> I'm not sure whether including DIRTY TRACKING with the bits in
>> device_state could lead to confusion...
>
> It is part of the flow and userspace must sequence it properly, just
> like VCPU. We can't properly describe everything without talking about
> it.

We probably want to clearly distinguish (visually at least) between the
bits in the device_state and VCPU RUNNING/DIRTY TRACKING. Even if both
are needed to implement vfio migration correctly, one is more strictly
defined as an interface, while for the other we rely more on the
functionality.

>
>> > +Continuous Actions:
>> > +  - NDMA
>> > +    The device is not allowed to issue new DMA operations.
>> 
>> Doesn't that make it an action trigger as well? I.e. when NDMA is set, a
>> blocker for DMA operations is in place?
>
> For clarity I didn't split things like that. All the continuous
> behaviors start when the given bits begins and stop when the bits
> end.
>
> Most of the actions talk about changes in the data window

This might need some better terminology, I did not understand the split
like that...

"action trigger" is basically that the driver sets certain bits and a
certain device action happens. "continuous" means that a certain device
action is done as long as certain bits are set. Sounds a bit like edge
triggered vs level triggered to me. What about:

- event-triggered actions: bits get set/unset, an action needs to be
  done
- condition-triggered actions: as long as bits are set/unset, an action
  needs to be done

>  
>> > +    Before NDMA returns all in progress DMAs must be completed.
>> 
>> What does that mean? That the operation setting NDMA in device_state
>> returns? 
>
> Yes, it must be a synchronous behavior.

To be extra clear: the _setting_ action (e.g. a write), not the
condition (NDMA set)? Sorry if that sounds nitpicky, but I think we
should eliminate possible points of confusion early on.

>
>> > +  - !RUNNING
>> > +    The device should not change its internal state. Implies NDMA. Any internal
>> > +    state logging can stop.
>> 
>> So we have:
>> - !RUNNING -- no DMA, regardless whether NDMA is set
>> - RUNNING|NDMA -- the device can change its internal state, but not do
>>   DMA
>> 
>> !RUNNING|!NDMA would basically be a valid state if a device is stopped
>> before RESUMING, but not for outbound migration?
>
> The reference flows are just examples we can all think on, it is
> always valid to go to any of the legal bit patterns, but may not be
> useful.

Maybe we need to be more clear what is part of the reference flows vs
part of the interface rules, not sure.

>  
> This specifically not a FSM so any before/after migration_state is
> technically legal and the device should behave as described here.
>
>> > +  - SAVING | !RUNNING
>> > +    RESUMING | !RUNNING
>> > +    The device may assume there are no incoming MMIO operations.
>> > +
>> > +  - RUNNING
>> > +    The device can alter its internal state and must respond to incoming MMIO.
>> > +
>> > +  - SAVING | RUNNING
>> > +    The device is logging changes to the internal state.
>> > +
>> > +  - !VCPU RUNNING
>> > +    The CPU must not generate dirty pages or issue MMIO operations to devices.
>> > +
>> > +  - DIRTY TRACKING
>> > +    DMAs are logged
>> > +
>> > +  - ERROR
>> > +    The behavior of the device is undefined. The device must be recovered by
>> > +    issuing VFIO_DEVICE_RESET.
>> > +
>> 
>> I'm wondering whether it would be better to distinguish between
>> individual bit meanings vs composite states than set/clear actions vs
>> continuous actions. This could give us a good overview about what a
>> device can/should do while in a certain state and what flipping a
>> certain bit implies.
>
> Again, refer to the mlx5 implementation, there are not actually
> individual bits here controlling specific things. SAVING for instance
> has no device behavior meaning when discussed in isolation.

See my suggestion about revised naming above, I probably simply did not
understand what you were trying to express here.

I'm trying to understand this document without looking at the mlx5
implementation: Somebody using it as a guide needs to be able to
implement a driver without looking at another driver (unless they prefer
to work with examples.) Using the mlx5 driver as the basis for
_writing_ this document makes sense, but it needs to stand on its own.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-24 16:55     ` Cornelia Huck
@ 2021-11-24 18:40       ` Jason Gunthorpe
  2021-11-25 12:27         ` Cornelia Huck
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2021-11-24 18:40 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Wed, Nov 24, 2021 at 05:55:49PM +0100, Cornelia Huck wrote:

> Yes, defining what we mean by "VCPU RUNNING" and "DIRTY TRACKING" first
> makes the most sense.
> 
> (It also imposes some rules on userspace, doesn't it? Whatever it does,
> the interaction with vfio needs to be at least somewhat similar to what
> QEMU or another VMM would do. I wonder if we need to be more concrete
> here; but let's talk about the basic interface first.)

I don't think we need to have excessive precision here. The main
thrust of this as a spec is to define behaviors which starts at the
'Actions on Set/Clear' section.

This part is informative so everyone has the same picture in their
mind about what it is we are trying to accomplish. This can be a bit
imprecise.

> > I don't think I like this statement - why/where would the overall flow
> > differ?
> 
> What I meant to say: If we give userspace the flexibility to operate
> this, we also must give different device types some flexibility. While
> subchannels will follow the general flow, they'll probably condense/omit
> some steps, as I/O is quite different to PCI there.

I would say no - migration is general, no device type should get to
violate this spec.  Did you have something specific in mind? There is
very little PCI specific here already

> >> > +     Normal operating state
> >> > +  RUNNING, DIRTY TRACKING, VCPU RUNNING
> >> > +     Log DMAs
> >> > +     Stream all memory
> >> 
> >> all memory accessed by the device?
> >
> > In this reference flow this is all VCPU memory. Ie you start global
> > dirty tracking both in VFIO and in the VCPU and start copying all VM
> > memory.
> 
> So, general migration, not just the vfio specific parts?

Sure, as above precision isn't important here, the userspace doing
migration should start streaming whatever state it has covered by
dirty logging here.

> "subtly complicated" captures this well :(

Indeed. Frankly, my observation is the team here has invested a lot of
person hours trying to make sense of this and our well-researched take
'this is a FSM' was substantially different from Alex's version 'this
is control bits'. For the 'control bit' model few seem to understand
it at all, and the driver code is short but deceptively complicated.

> For example, if I interpret your list correctly, the driver should
> prioritize clearing RUNNING over setting SAVING | !RUNNING. What does
> that mean? If RUNNING is cleared, first deal with whatever action that
> triggers, then later check if it is actually a case of setting SAVING |
> !RUNNING, and perform the required actions for that?

Yes.

Since this is not a FSM a change from any two valid device_state
values is completely legal. Many of these involve multiple driver
steps. So all drivers must do the actions in the same order to have a
real ABI.

> Also, does e.g. SAVING | RUNNING mean that both SAVING and RUNNING are
> getting set, or only one of them, if the other was already set?

It always refers to the requested migration_state

> >   SAVING|0 -> SAVING|RUNNING
> >   0|RUNNING -> SAVING|RUNNING
> >   0 -> SAVING|RUNNING

Are all described as userspace requesting a migration_state 
of SAVING | RUNNING

> > For clarity I didn't split things like that. All the continuous
> > behaviors start when the given bits begins and stop when the bits
> > end.
> >
> > Most of the actions talk about changes in the data window
> 
> This might need some better terminology, I did not understand the split
> like that...
> 
> "action trigger" is basically that the driver sets certain bits and a
> certain device action happens. "continuous" means that a certain device
> action is done as long as certain bits are set. Sounds a bit like edge
> triggered vs level triggered to me. What about:

Yes

> - event-triggered actions: bits get set/unset, an action needs to be
>   done

"""Event-triggered actions happen when userspace requests a new
migration_state that differs from the current migration_state. Actions
happen on a bit group basis:"""

> - condition-triggered actions: as long as bits are set/unset, an action
>   needs to be done

"""Continuous actions are in effect so long as the below migration_state bit
   group is active:"""
 

> >> What does that mean? That the operation setting NDMA in device_state
> >> returns? 
> >
> > Yes, it must be a synchronous behavior.
> 
> To be extra clear: the _setting_ action (e.g. a write), not the
> condition (NDMA set)? Sorry if that sounds nitpicky, but I think we
> should eliminate possible points of confusion early on.

""Whenever the kernel returns with a migration_state of NDMA there can be no
   in progress DMAs.""
 
> I'm trying to understand this document without looking at the mlx5
> implementation: Somebody using it as a guide needs to be able to
> implement a driver without looking at another driver (unless they prefer
> to work with examples.) Using the mlx5 driver as the basis for
> _writing_ this document makes sense, but it needs to stand on its own.

That may be an ideal that is too hard to reach :(

Thanks,
Jason

Below is where I have things now:

VFIO migration driver API
-------------------------------------------------------------------------------

VFIO drivers that support migration implement a migration control register
called device_state in the struct vfio_device_migration_info which is in its
VFIO_REGION_TYPE_MIGRATION region.

The device_state controls both device action and continuous behaviour.
Setting/clearing bit groups triggers device action, and each bit controls a
continuous device behaviour.

Along with the device_state the migration driver provides a data window which
allows streaming migration data into or out of the device.

A lot of flexibility is provided to userspace in how it operates these
bits. What follows is a reference flow for saving device state in a live
migration, with all features, and an illustration how other external non-VFIO
entities (VCPU_RUNNING and DIRTY_TRACKING) the VMM controls fit in.

  RUNNING, VCPU_RUNNING
     Normal operating state
  RUNNING, DIRTY_TRACKING, VCPU_RUNNING
     Log DMAs

     Stream all memory
  SAVING | RUNNING, DIRTY_TRACKING, VCPU_RUNNING
     Log internal device changes (pre-copy)

     Stream device state through the migration window

     While in this state repeat as desired:

	Atomic Read and Clear DMA Dirty log

	Stream dirty memory
  SAVING | NDMA | RUNNING, VCPU_RUNNING
     vIOMMU grace state

     Complete all in progress IO page faults, idle the vIOMMU
  SAVING | NDMA | RUNNING
     Peer to Peer DMA grace state

     Final snapshot of DMA dirty log (atomic not required)
  SAVING
     Stream final device state through the migration window

     Copy final dirty data
  0
     Device is halted

and the reference flow for resuming:

  RUNNING
     Issue VFIO_DEVICE_RESET to clear the internal device state
  0
     Device is halted
  RESUMING
     Push in migration data. Data captured during pre-copy should be
     prepended to data captured during SAVING.
  NDMA | RUNNING
     Peer to Peer DMA grace state
  RUNNING, VCPU_RUNNING
     Normal operating state

If the VMM has multiple VFIO devices undergoing migration then the grace
states act as cross device synchronization points. The VMM must bring all
devices to the grace state before advancing past it.

The above reference flows are built around specific requirements on the
migration driver for its implementation of the migration_state input.

Event triggered actions happen when userspace requests a new migration_state
that differs from the current migration_state. Actions happen on a bit group
basis:

 - SAVING | RUNNING
   The device clears the data window and begins streaming 'pre copy' migration
   data through the window. Devices that cannot log internal state changes
   return a 0 length migration stream.

 - SAVING | !RUNNING
   The device captures its internal state that is not covered by internal
   logging, as well as any logged changes.

   The device clears the data window and begins streaming the captured
   migration data through the window. Devices that cannot log internal state
   changes stream all of their device state here.

 - RESUMING
   The data window is cleared, opened and can receive the migration data
   stream.

 - !RESUMING
   All the data transferred into the data window is loaded into the device's
   internal state. The migration driver can rely on userspace issuing a
   VFIO_DEVICE_RESET prior to starting RESUMING.

   To abort a RESUMING issue a VFIO_DEVICE_RESET.

   If the migration data is invalid then the ERROR state must be set.

Continuous actions are in effect when migration_state bit groups are active:

 - RUNNING | NDMA
   The device is not allowed to issue new DMA operations.

   Whenever the kernel returns with a migration_state of NDMA there can be no
   in progress DMAs.

 - !RUNNING
   The device should not change its internal state. Further implies the NDMA
   behavior above.

 - SAVING | !RUNNING
   RESUMING | !RUNNING
   The device may assume there are no incoming MMIO operations.

   Internal state logging can stop.

 - RUNNING
   The device can alter its internal state and must respond to incoming MMIO.

 - SAVING | RUNNING
   The device is logging changes to the internal state.

 - ERROR
   The behavior of the device is largely undefined. The device must be
   recovered by issuing VFIO_DEVICE_RESET or closing the device file
   descriptor.

   However, devices supporting NDMA must behave as though NDMA is asserted
   during ERROR to avoid corrupting other devices or a VM during a failed
   migration.

When multiple bits change in the migration_state they may describe multiple
event triggered actions, and multiple changes to continuous actions.  The
migration driver must process them in a priority order:

 - SAVING | RUNNING
 - NDMA
 - !RUNNING
 - SAVING | !RUNNING
 - RESUMING
 - !RESUMING
 - RUNNING
 - !NDMA

In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
device back to device_state RUNNING. When a migration driver executes this
ioctl it should discard the data window and set migration_state to RUNNING as
part of resetting the device to a clean state. This must happen even if the
migration_state has errored. A freshly opened device FD should always be in
the RUNNING state.

The migration driver has limitations on what device state it can affect. Any
device state controlled by general kernel subsystems must not be changed
during RESUME, and SAVING must tolerate mutation of this state. Change to
externally controlled device state can happen at any time, asynchronously, to
the migration (ie interrupt rebalancing).

Some examples of externally controlled state:
 - MSI-X interrupt page
 - MSI/legacy interrupt configuration
 - Large parts of the PCI configuration space, ie common control bits
 - PCI power management
 - Changes via VFIO_DEVICE_SET_IRQS

During !RUNNING, especially during SAVING and RESUMING, the device may have
limitations on what it can tolerate. An ideal device will discard/return all
ones to all incoming MMIO/PIO operations (exclusive of the external state
above) in !RUNNING. However, devices are free to have undefined behavior if
they receive MMIOs. This includes corrupting/aborting the migration, dirtying
pages, and segfaulting userspace.

However, a device may not compromise system integrity if it is subjected to a
MMIO. It can not trigger an error TLP, it can not trigger a Machine Check, and
it can not compromise device isolation.

There are several edge cases that userspace should keep in mind when
implementing migration:

- Device Peer to Peer DMA. In this case devices are able issue DMAs to each
  other's MMIO regions. The VMM can permit this if it maps the MMIO memory into
  the IOMMU.

  As Peer to Peer DMA is a MMIO touch like any other, it is important that
  userspace suspend these accesses before entering any device_state where MMIO
  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
  device does not support NDMA, and rely on that to guarantee quiet MMIO.

  The Peer to Peer Grace States exist so that all devices may reach RUNNING
  before any device is subjected to a MMIO access.

  Failure to guarentee quiet MMIO may allow a hostile VM to use P2P to violate
  the no-MMIO restriction during SAVING or RESUMING and corrupt the migration on
  devices that cannot protect themselves.

- IOMMU Page faults handled in userspace can occur at any time. A migration
  driver is not required to serialize in-progress page faults. It can assume
  that all page faults are completed before entering SAVING | !RUNNING. Since
  the guest VCPU is required to complete page faults the VMM can accomplish this
  by asserting NDMA | VCPU_RUNNING and clearing all pending page faults before
  clearing VCPU_RUNNING.

  Device that do not support NDMA cannot be configured to generate page faults
  that require the VCPU to complete.

- pre-copy allows the device to implement a dirty log for its internal state.
  During the SAVING | RUNNING state the data window should present the device
  state being logged and during SAVING | !RUNNING the data window should present
  the unlogged device state as well as the changes from the internal dirty log.

  On RESUME these two data streams are concatenated together.

  pre-copy is only concerned with internal device state. External DMAs are
  covered by the seperate DIRTY_TRACKING function.

- Atomic Read and Clear of the DMA log is a HW feature. If the tracker
  cannot support this, then NDMA could be used to synthesize it less
  efficiently.

- NDMA is optional, if the device does not support this then the NDMA States
  are pushed down to the next step in the sequence and various behaviors that
  rely on NDMA cannot be used.

- Migration control registers inside the same iommu_group as the VFIO device.
  This immediately raises a security concern as userspace can use Peer to Peer
  DMA to manipulate these migration control registers concurrently with
  any kernel actions.

  A device driver operating such a device must ensure that kernel integrity
  can not be broken by hostile user space operating the migration MMIO
  registers via peer to peer, at any point in the sequence. Notably the kernel
  cannot use DMA to transfer any migration data.

  However, as discussed above in the "Device Peer to Peer DMA" section, it can
  assume quiet MMIO as a condition to have a successful and uncorrupted
  migration.

To elaborate details on the reference flows, they assume the following details
about the external behaviors:

 - !VCPU_RUNNING
   Userspace must not generate dirty pages or issue MMIO operations to devices.
   For a VMM this would typically be a control toward KVM.

 - DIRTY_TRACKING
   Clear the DMA log and start DMA logging

   DMA logs should be readable with an "atomic test and clear" to allow
   continuous non-disruptive sampling of the log.

   This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
   fd.

 - !DIRTY_TRACKING
   Freeze the DMA log, stop tracking and allow userspace to read it.

   If userspace is going to have any use of the dirty log it must ensure ensure
   that all DMA is suspended before clearing DIRTY_TRACKING, for instance by
   using NDMA or !RUNNING on all VFIO devices.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-24 18:40       ` Jason Gunthorpe
@ 2021-11-25 12:27         ` Cornelia Huck
  2021-11-25 16:14           ` Jason Gunthorpe
  0 siblings, 1 reply; 13+ messages in thread
From: Cornelia Huck @ 2021-11-25 12:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Wed, Nov 24 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Nov 24, 2021 at 05:55:49PM +0100, Cornelia Huck wrote:

>> What I meant to say: If we give userspace the flexibility to operate
>> this, we also must give different device types some flexibility. While
>> subchannels will follow the general flow, they'll probably condense/omit
>> some steps, as I/O is quite different to PCI there.
>
> I would say no - migration is general, no device type should get to
> violate this spec.  Did you have something specific in mind? There is
> very little PCI specific here already

I'm not really thinking about violating the spec, but more omitting
things that do not really apply to the hardware. For example, it is
really easy to shut up a subchannel, we don't really need to wait until
nothing happens anymore, and it doesn't even have MMIO. Also, I can't
imagine what e.g. ap would do. Granted, those are outliers, most devices
won't be too different from pci.

>> For example, if I interpret your list correctly, the driver should
>> prioritize clearing RUNNING over setting SAVING | !RUNNING. What does
>> that mean? If RUNNING is cleared, first deal with whatever action that
>> triggers, then later check if it is actually a case of setting SAVING |
>> !RUNNING, and perform the required actions for that?
>
> Yes.
>
> Since this is not a FSM a change from any two valid device_state
> values is completely legal. Many of these involve multiple driver
> steps. So all drivers must do the actions in the same order to have a
> real ABI.

Ok, I understand what you mean.

>
>> Also, does e.g. SAVING | RUNNING mean that both SAVING and RUNNING are
>> getting set, or only one of them, if the other was already set?
>
> It always refers to the requested migration_state
>
>> >   SAVING|0 -> SAVING|RUNNING
>> >   0|RUNNING -> SAVING|RUNNING
>> >   0 -> SAVING|RUNNING
>
> Are all described as userspace requesting a migration_state 
> of SAVING | RUNNING

ok

>> - event-triggered actions: bits get set/unset, an action needs to be
>>   done
>
> """Event-triggered actions happen when userspace requests a new
> migration_state that differs from the current migration_state. Actions
> happen on a bit group basis:"""
>
>> - condition-triggered actions: as long as bits are set/unset, an action
>>   needs to be done
>
> """Continuous actions are in effect so long as the below migration_state bit
>    group is active:"""

Ok, that works for me.

>  
>
>> >> What does that mean? That the operation setting NDMA in device_state
>> >> returns? 
>> >
>> > Yes, it must be a synchronous behavior.
>> 
>> To be extra clear: the _setting_ action (e.g. a write), not the
>> condition (NDMA set)? Sorry if that sounds nitpicky, but I think we
>> should eliminate possible points of confusion early on.
>
> ""Whenever the kernel returns with a migration_state of NDMA there can be no
>    in progress DMAs.""

ok

> Below is where I have things now:

(...)

The parts I had already read seem much clearer now, thanks.

> In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
> device back to device_state RUNNING. When a migration driver executes this
> ioctl it should discard the data window and set migration_state to RUNNING as
> part of resetting the device to a clean state. This must happen even if the
> migration_state has errored. A freshly opened device FD should always be in
> the RUNNING state.

Can the state immediately change from RUNNING to ERROR again?

>
> The migration driver has limitations on what device state it can affect. Any
> device state controlled by general kernel subsystems must not be changed
> during RESUME, and SAVING must tolerate mutation of this state. Change to
> externally controlled device state can happen at any time, asynchronously, to
> the migration (ie interrupt rebalancing).
>
> Some examples of externally controlled state:
>  - MSI-X interrupt page
>  - MSI/legacy interrupt configuration
>  - Large parts of the PCI configuration space, ie common control bits
>  - PCI power management
>  - Changes via VFIO_DEVICE_SET_IRQS
>
> During !RUNNING, especially during SAVING and RESUMING, the device may have
> limitations on what it can tolerate. An ideal device will discard/return all
> ones to all incoming MMIO/PIO operations (exclusive of the external state
> above) in !RUNNING. However, devices are free to have undefined behavior if
> they receive MMIOs. This includes corrupting/aborting the migration, dirtying
> pages, and segfaulting userspace.

Make this "MMIO, PIO, or equivalents"? For example, if we talk
subchannels, we might get an unsolicited state from the device (there's
no way to prevent that, until the subchannel has been completely disabled);
"discard it" would be the right answer here as well.

>
> However, a device may not compromise system integrity if it is subjected to a
> MMIO. It can not trigger an error TLP, it can not trigger a Machine Check, and
> it can not compromise device isolation.

"Machine Check" may be confusing to readers coming from s390; there, the
device does not trigger the machine check, but the channel subsystem
does, and we cannot prevent it. Maybe we can word it more as an example,
so readers get an idea what the limits in this state are?

(...)

> To elaborate details on the reference flows, they assume the following details
> about the external behaviors:
>
>  - !VCPU_RUNNING
>    Userspace must not generate dirty pages or issue MMIO operations to devices.

Generally, I/O operations? (Thinking of subchannels here again; if the
vcpu is not running, it cannot send I/O instructions.)

>    For a VMM this would typically be a control toward KVM.

I'm sorry, but I cannot parse this sentence... do you mean "implemented
by KVM"?

>
>  - DIRTY_TRACKING
>    Clear the DMA log and start DMA logging
>
>    DMA logs should be readable with an "atomic test and clear" to allow
>    continuous non-disruptive sampling of the log.
>
>    This is controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the container
>    fd.
>
>  - !DIRTY_TRACKING
>    Freeze the DMA log, stop tracking and allow userspace to read it.
>
>    If userspace is going to have any use of the dirty log it must ensure ensure
>    that all DMA is suspended before clearing DIRTY_TRACKING, for instance by
>    using NDMA or !RUNNING on all VFIO devices.

Although I would like to see some more feedback from others, I think
this is already a huge step in the right direction.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-25 12:27         ` Cornelia Huck
@ 2021-11-25 16:14           ` Jason Gunthorpe
  2021-11-26 12:56             ` Cornelia Huck
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2021-11-25 16:14 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Thu, Nov 25, 2021 at 01:27:12PM +0100, Cornelia Huck wrote:
> On Wed, Nov 24 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Nov 24, 2021 at 05:55:49PM +0100, Cornelia Huck wrote:
> 
> >> What I meant to say: If we give userspace the flexibility to operate
> >> this, we also must give different device types some flexibility. While
> >> subchannels will follow the general flow, they'll probably condense/omit
> >> some steps, as I/O is quite different to PCI there.
> >
> > I would say no - migration is general, no device type should get to
> > violate this spec.  Did you have something specific in mind? There is
> > very little PCI specific here already
> 
> I'm not really thinking about violating the spec, but more omitting
> things that do not really apply to the hardware. For example, it is
> really easy to shut up a subchannel, we don't really need to wait until
> nothing happens anymore, and it doesn't even have MMIO. 

I've never really looked closely at the s390 mdev drivers..

What does something like AP even do anyhow? The ioctl handler doesn't
do anything, there is no mmap hook, how does the VFIO userspace
interact with this thing?

> > In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
> > device back to device_state RUNNING. When a migration driver executes this
> > ioctl it should discard the data window and set migration_state to RUNNING as
> > part of resetting the device to a clean state. This must happen even if the
> > migration_state has errored. A freshly opened device FD should always be in
> > the RUNNING state.
> 
> Can the state immediately change from RUNNING to ERROR again?

Immediately? State change can only happen in response to the ioctl or
the reset.

""The migration_state cannot change asynchronously, upon writing the
migration_state the driver will either keep the current state and return
failure, return failure and go to ERROR, or succeed and go to the new state.""

> > However, a device may not compromise system integrity if it is subjected to a
> > MMIO. It can not trigger an error TLP, it can not trigger a Machine Check, and
> > it can not compromise device isolation.
> 
> "Machine Check" may be confusing to readers coming from s390; there, the
> device does not trigger the machine check, but the channel subsystem
> does, and we cannot prevent it. Maybe we can word it more as an example,
> so readers get an idea what the limits in this state are?

Lets say x86 machine check then which is a kernel-fatal event.

> Although I would like to see some more feedback from others, I think
> this is already a huge step in the right direction.

Thanks, I made all your other changes

Will send a v2 next week

Jason 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-25 16:14           ` Jason Gunthorpe
@ 2021-11-26 12:56             ` Cornelia Huck
  2021-11-26 13:06               ` Jason Gunthorpe
  0 siblings, 1 reply; 13+ messages in thread
From: Cornelia Huck @ 2021-11-26 12:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Thu, Nov 25 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Nov 25, 2021 at 01:27:12PM +0100, Cornelia Huck wrote:
>> On Wed, Nov 24 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
>> 
>> > On Wed, Nov 24, 2021 at 05:55:49PM +0100, Cornelia Huck wrote:
>> 
>> >> What I meant to say: If we give userspace the flexibility to operate
>> >> this, we also must give different device types some flexibility. While
>> >> subchannels will follow the general flow, they'll probably condense/omit
>> >> some steps, as I/O is quite different to PCI there.
>> >
>> > I would say no - migration is general, no device type should get to
>> > violate this spec.  Did you have something specific in mind? There is
>> > very little PCI specific here already
>> 
>> I'm not really thinking about violating the spec, but more omitting
>> things that do not really apply to the hardware. For example, it is
>> really easy to shut up a subchannel, we don't really need to wait until
>> nothing happens anymore, and it doesn't even have MMIO. 
>
> I've never really looked closely at the s390 mdev drivers..
>
> What does something like AP even do anyhow? The ioctl handler doesn't
> do anything, there is no mmap hook, how does the VFIO userspace
> interact with this thing?

For AP, the magic is in the hardware/firmware; the vfio parts are needed
to configure what is exposed to a given guest, not for operation. Once
it is up, the hardware will handle any instructions directly, the
hypervisor will not see them. (Unfortunately, none of the details have
public documentation.) I have no idea how this would play with migration.

>
>> > In general, userspace can issue a VFIO_DEVICE_RESET ioctl and recover the
>> > device back to device_state RUNNING. When a migration driver executes this
>> > ioctl it should discard the data window and set migration_state to RUNNING as
>> > part of resetting the device to a clean state. This must happen even if the
>> > migration_state has errored. A freshly opened device FD should always be in
>> > the RUNNING state.
>> 
>> Can the state immediately change from RUNNING to ERROR again?
>
> Immediately? State change can only happen in response to the ioctl or
> the reset.
>
> ""The migration_state cannot change asynchronously, upon writing the
> migration_state the driver will either keep the current state and return
> failure, return failure and go to ERROR, or succeed and go to the new state.""

ok

>
>> > However, a device may not compromise system integrity if it is subjected to a
>> > MMIO. It can not trigger an error TLP, it can not trigger a Machine Check, and
>> > it can not compromise device isolation.
>> 
>> "Machine Check" may be confusing to readers coming from s390; there, the
>> device does not trigger the machine check, but the channel subsystem
>> does, and we cannot prevent it. Maybe we can word it more as an example,
>> so readers get an idea what the limits in this state are?
>
> Lets say x86 machine check then which is a kernel-fatal event.

ok

>
>> Although I would like to see some more feedback from others, I think
>> this is already a huge step in the right direction.
>
> Thanks, I made all your other changes
>
> Will send a v2 next week

Thanks, sounds good.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-26 12:56             ` Cornelia Huck
@ 2021-11-26 13:06               ` Jason Gunthorpe
  2021-11-26 15:01                 ` Cornelia Huck
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2021-11-26 13:06 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Fri, Nov 26, 2021 at 01:56:26PM +0100, Cornelia Huck wrote:
> On Thu, Nov 25 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Thu, Nov 25, 2021 at 01:27:12PM +0100, Cornelia Huck wrote:
> >> On Wed, Nov 24 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >> 
> >> > On Wed, Nov 24, 2021 at 05:55:49PM +0100, Cornelia Huck wrote:
> >> 
> >> >> What I meant to say: If we give userspace the flexibility to operate
> >> >> this, we also must give different device types some flexibility. While
> >> >> subchannels will follow the general flow, they'll probably condense/omit
> >> >> some steps, as I/O is quite different to PCI there.
> >> >
> >> > I would say no - migration is general, no device type should get to
> >> > violate this spec.  Did you have something specific in mind? There is
> >> > very little PCI specific here already
> >> 
> >> I'm not really thinking about violating the spec, but more omitting
> >> things that do not really apply to the hardware. For example, it is
> >> really easy to shut up a subchannel, we don't really need to wait until
> >> nothing happens anymore, and it doesn't even have MMIO. 
> >
> > I've never really looked closely at the s390 mdev drivers..
> >
> > What does something like AP even do anyhow? The ioctl handler doesn't
> > do anything, there is no mmap hook, how does the VFIO userspace
> > interact with this thing?
> 
> For AP, the magic is in the hardware/firmware; the vfio parts are needed
> to configure what is exposed to a given guest, not for operation. Once
> it is up, the hardware will handle any instructions directly, the
> hypervisor will not see them. (Unfortunately, none of the details have
> public documentation.) I have no idea how this would play with migration.

That is kind of what I thought..

VFIO is all about exposing a device to userspace control, sounds like
the S390 drivers skipped that step.

KVM is all about taking what userspace can already control and giving
it to a guest, in an accelerated way.

Making a bypass where a KVM guest has more capability than the user
process because VFIO and KVM have been directly coupled completely
upends the whole logical model.

As we talked with Intel's wbinvd stuff you should have a mental model
where the VFIO userspace process can do anything the KVM guest can do
via ioctls on the mdev. KVM is just an accelerated way to do that same
stuff. Maybe S390 doesn't implement those ioctls, but they are
logically part of the model.

So, for the migration doc, imagine some non-accelerated KVM that was
intercepting the guest operations and calling the logical ioctls on
the mdev instead. When we talk about MMIO/PIO/etc it also includes
mdev operation ioctls too, and by extension any ioctl accelerated
inside KVM.

Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC] vfio: Documentation for the migration region
  2021-11-26 13:06               ` Jason Gunthorpe
@ 2021-11-26 15:01                 ` Cornelia Huck
  0 siblings, 0 replies; 13+ messages in thread
From: Cornelia Huck @ 2021-11-26 15:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Jonathan Corbet, linux-doc, kvm, Kirti Wankhede,
	Max Gurtovoy, Shameer Kolothum, Yishai Hadas

On Fri, Nov 26 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Nov 26, 2021 at 01:56:26PM +0100, Cornelia Huck wrote:
>> On Thu, Nov 25 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
>> 
>> > On Thu, Nov 25, 2021 at 01:27:12PM +0100, Cornelia Huck wrote:
>> >> On Wed, Nov 24 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
>> >> 
>> >> > On Wed, Nov 24, 2021 at 05:55:49PM +0100, Cornelia Huck wrote:
>> >> 
>> >> >> What I meant to say: If we give userspace the flexibility to operate
>> >> >> this, we also must give different device types some flexibility. While
>> >> >> subchannels will follow the general flow, they'll probably condense/omit
>> >> >> some steps, as I/O is quite different to PCI there.
>> >> >
>> >> > I would say no - migration is general, no device type should get to
>> >> > violate this spec.  Did you have something specific in mind? There is
>> >> > very little PCI specific here already
>> >> 
>> >> I'm not really thinking about violating the spec, but more omitting
>> >> things that do not really apply to the hardware. For example, it is
>> >> really easy to shut up a subchannel, we don't really need to wait until
>> >> nothing happens anymore, and it doesn't even have MMIO. 
>> >
>> > I've never really looked closely at the s390 mdev drivers..
>> >
>> > What does something like AP even do anyhow? The ioctl handler doesn't
>> > do anything, there is no mmap hook, how does the VFIO userspace
>> > interact with this thing?
>> 
>> For AP, the magic is in the hardware/firmware; the vfio parts are needed
>> to configure what is exposed to a given guest, not for operation. Once
>> it is up, the hardware will handle any instructions directly, the
>> hypervisor will not see them. (Unfortunately, none of the details have
>> public documentation.) I have no idea how this would play with migration.
>
> That is kind of what I thought..
>
> VFIO is all about exposing a device to userspace control, sounds like
> the S390 drivers skipped that step.

Note that what I wrote above is about AP; CCW does indeed trigger
operations like start subchannel from userspace and relays interrupts
back to userspace. AP is just very dissimilar to basically anything
else.

>
> KVM is all about taking what userspace can already control and giving
> it to a guest, in an accelerated way.
>
> Making a bypass where a KVM guest has more capability than the user
> process because VFIO and KVM have been directly coupled completely
> upends the whole logical model.
>
> As we talked with Intel's wbinvd stuff you should have a mental model
> where the VFIO userspace process can do anything the KVM guest can do
> via ioctls on the mdev. KVM is just an accelerated way to do that same
> stuff. Maybe S390 doesn't implement those ioctls, but they are
> logically part of the model.

FWIW, AP had been a pain to model in a way that we could hand the
devices to the guest; if we are supposed to use vfio for this purpose,
the current design is probably the best we can get, at least nobody has
been able to come up with a better way to interact with the interfaces
that we have.

CCW needs a kernel part for translations, as it doesn't have an iommu,
and the I/O instructions are of course privileged (but so are the
instructions for s390 PCI); I think it is quite close to other devices
in other respects, only that it has a more transaction-based model.

> So, for the migration doc, imagine some non-accelerated KVM that was
> intercepting the guest operations and calling the logical ioctls on
> the mdev instead. When we talk about MMIO/PIO/etc it also includes
> mdev operation ioctls too, and by extension any ioctl accelerated
> inside KVM.

I think only AP is the really odd one out here; CCW will likely differ
in some details... I just wanted to make sure that this will not run
counter to the documentation.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-11-26 15:04 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-22 19:53 [PATCH RFC] vfio: Documentation for the migration region Jason Gunthorpe
2021-11-22 20:31 ` Jonathan Corbet
2021-11-23  0:20   ` Jason Gunthorpe
2021-11-23  7:22     ` Akira Yokosawa
2021-11-23 14:21 ` Cornelia Huck
2021-11-23 16:53   ` Jason Gunthorpe
2021-11-24 16:55     ` Cornelia Huck
2021-11-24 18:40       ` Jason Gunthorpe
2021-11-25 12:27         ` Cornelia Huck
2021-11-25 16:14           ` Jason Gunthorpe
2021-11-26 12:56             ` Cornelia Huck
2021-11-26 13:06               ` Jason Gunthorpe
2021-11-26 15:01                 ` Cornelia Huck

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.