* [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
@ 2021-03-10 19:20 Tarun Gupta
  2021-03-11  9:40 ` Daniel P. Berrangé
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Tarun Gupta @ 2021-03-10 19:20 UTC (permalink / raw)
  To: alex.williamson, qemu-devel, cohuck
  Cc: kevin.tian, berrange, quintela, dgilbert, yan.y.zhao, lushenming,
	Kirti Wankhede, dnigam, cjia, philmd, Tarun Gupta

Document interfaces used for VFIO device migration. Added flow of state changes
during live migration with VFIO device. Tested by building docs with the new
vfio-migration.rst file.

v2:
- Included the new vfio-migration.rst file in index.rst
- Updated dirty page tracking section, also added details about
  'pre-copy-dirty-page-tracking' opt-out option.
- Incorporated comments around wording of doc.

Signed-off-by: Tarun Gupta <targupta@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 MAINTAINERS                   |   1 +
 docs/devel/index.rst          |   1 +
 docs/devel/vfio-migration.rst | 135 ++++++++++++++++++++++++++++++++++
 3 files changed, 137 insertions(+)
 create mode 100644 docs/devel/vfio-migration.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 738786146d..a2a80eee59 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1801,6 +1801,7 @@ M: Alex Williamson <alex.williamson@redhat.com>
 S: Supported
 F: hw/vfio/*
 F: include/hw/vfio/
+F: docs/devel/vfio-migration.rst
 
 vfio-ccw
 M: Cornelia Huck <cohuck@redhat.com>
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index ae664da00c..5330f1ca1d 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -39,3 +39,4 @@ Contents:
    qom
    block-coroutine-wrapper
    multi-process
+   vfio-migration
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
new file mode 100644
index 0000000000..6196fb132c
--- /dev/null
+++ b/docs/devel/vfio-migration.rst
@@ -0,0 +1,135 @@
+=====================
+VFIO device Migration
+=====================
+
+VFIO devices use an iterative approach for migration because certain VFIO
+devices (e.g. GPU) have a large amount of data to be transferred. The iterative
+pre-copy phase of migration allows the guest to continue running whilst the
+VFIO device state is transferred to the destination; this helps reduce the
+total downtime of the VM. VFIO devices can choose to skip the pre-copy phase
+of migration by returning pending_bytes as zero during the pre-copy phase.
+
+A detailed description of the UAPI for VFIO device migration can be found in
+the comment for the ``vfio_device_migration_info`` structure in the header
+file linux-headers/linux/vfio.h.
+
+VFIO device hooks for iterative approach:
+
+* A ``save_setup`` function that sets up the migration region, sets _SAVING
+  flag in the VFIO device state and informs the VFIO IOMMU module to start
+  dirty page tracking.
+
+* A ``load_setup`` function that sets up the migration region on the
+  destination and sets _RESUMING flag in the VFIO device state.
+
+* A ``save_live_pending`` function that reads pending_bytes from the vendor
+  driver, which indicates the amount of data that the vendor driver has yet to
+  save for the VFIO device.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver through the migration region during iterative phase.
+
+* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
+  VFIO device state, saves the device config space, if any, and iteratively
+  copies the remaining data for the VFIO device untill the vendor driver
+  indicates that no data remains (pending bytes is zero).
+
+* A ``load_state`` function that loads the config section and the data
+  sections that are generated by the save functions above.
+
+* ``cleanup`` functions for both save and load that perform any migration
+  related cleanup, including unmapping the migration region.
+
+A VM state change handler is registered to change the VFIO device state when
+the VM state changes.
+
+Similarly, a migration state change notifier is registered to get a
+notification on migration state change. These states are translated to VFIO
+device state and conveyed to vendor driver.
+
+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback marks as dirty those system memory
+pages that are used for DMA by the VFIO device. The dirty pages bitmap is
+queried per container. All pages pinned by the vendor driver through the
+vfio_pin_pages() external API have to be marked as dirty during migration.
+When there are CPU writes, CPU dirty page tracking can identify dirtied pages,
+but any page pinned by the vendor driver can also be written by the device.
+There is currently no device which has hardware support for dirty page
+tracking, so all pages pinned by the vendor driver are considered dirty.
+
+By default, dirty pages are tracked when the device is in the pre-copy as
+well as the stop-and-copy phase. So, a page pinned by the vendor driver
+using vfio_pin_pages() will be copied to the destination in both phases.
+Copying dirty pages in the pre-copy phase helps QEMU to predict if it can
+achieve its downtime tolerances.
+
+QEMU also provides a per-device opt-out option ``pre-copy-dirty-page-tracking``
+to disable dirty page tracking during the pre-copy phase. If it is set to off,
+all pinned pages will be copied to the destination in the stop-and-copy phase only.
+
+System memory dirty pages tracking when vIOMMU is enabled
+---------------------------------------------------------
+
+With vIOMMU, an IO virtual address range can get unmapped while in the pre-copy
+phase of migration. In that case, the unmap ioctl returns any pinned pages in
+that range and QEMU reports the corresponding guest physical pages as dirty.
+During the stop-and-copy phase, an IOMMU notifier is used to get a callback for
+mapped pages and then the dirty pages bitmap is fetched from the VFIO IOMMU
+module for those mapped ranges.
+
+Flow of state changes during Live migration
+===========================================
+
+Below is the flow of state changes during live migration.
+The values in the brackets represent the VM state, the migration state, and
+the VFIO device state, respectively.
+
+Live migration save path
+------------------------
+
+::
+
+                        QEMU normal running state
+                        (RUNNING, _NONE, _RUNNING)
+                                  |
+                     migrate_init spawns migration_thread
+                Migration thread then calls each device's .save_setup()
+                    (RUNNING, _SETUP, _RUNNING|_SAVING)
+                                  |
+                    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
+             If device is active, get pending_bytes by .save_live_pending()
+          If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  Data of VFIO device for pre-copy phase is copied
+        Iterate till total pending bytes converge and are less than threshold
+                                  |
+  On migration completion, vCPU stops and calls .save_live_complete_precopy for
+   each active device. The VFIO device is then transitioned into _SAVING state
+                   (FINISH_MIGRATE, _DEVICE, _SAVING)
+                                  |
+     For the VFIO device, iterate in .save_live_complete_precopy until
+                         pending data is 0
+                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
+                                  |
+                 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
+             Migration thread schedules cleanup bottom half and exits
+
+Live migration resume path
+--------------------------
+
+::
+
+              Incoming migration calls .load_setup for each device
+                       (RESTORE_VM, _ACTIVE, _STOPPED)
+                                 |
+       For each device, .load_state is called for that device section data
+                       (RESTORE_VM, _ACTIVE, _RESUMING)
+                                 |
+    At the end, .load_cleanup is called for each device and vCPUs are started
+                       (RUNNING, _NONE, _RUNNING)
+
+Postcopy
+========
+
+Postcopy migration is not supported for VFIO devices.
-- 
2.27.0
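The iterative save path the patch documents can be illustrated with a minimal
simulation. All names below (Device, pending_bytes, the chunk sizes) are
illustrative stand-ins for the documented hooks, not QEMU's actual API:

```python
# Sketch of the documented VFIO save path: _SAVING is set at setup, data is
# copied iteratively while the guest runs, then _RUNNING is cleared and the
# remaining data is drained to zero. Hypothetical names, not QEMU code.

class Device:
    def __init__(self, data_bytes):
        self.pending_bytes = data_bytes
        self.state = {"_RUNNING"}

    def save_setup(self):
        # _SAVING is set alongside _RUNNING; dirty page tracking starts here.
        self.state.add("_SAVING")

    def save_live_iterate(self, chunk):
        # Copy up to `chunk` bytes of device data during the pre-copy phase.
        sent = min(chunk, self.pending_bytes)
        self.pending_bytes -= sent
        return sent

    def save_live_complete_precopy(self):
        # _RUNNING is cleared, then the vendor driver's remaining data is
        # copied until pending_bytes reaches zero.
        self.state.discard("_RUNNING")
        while self.pending_bytes > 0:
            self.save_live_iterate(chunk=1 << 20)
        self.state.discard("_SAVING")
        self.state.add("_STOPPED")

def migrate(dev, threshold=1 << 20):
    dev.save_setup()
    # Pre-copy: iterate while pending data exceeds the downtime threshold.
    while dev.pending_bytes >= threshold:
        dev.save_live_iterate(chunk=4 << 20)
    # Stop-and-copy: guest stopped, remaining device data drained.
    dev.save_live_complete_precopy()
    return dev

dev = migrate(Device(data_bytes=64 << 20))
print(dev.pending_bytes, dev.state)  # 0 {'_STOPPED'}
```

A device that returns pending_bytes as zero from the start simply skips the
pre-copy loop, as the doc text describes.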




* Re: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
  2021-03-10 19:20 [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation Tarun Gupta
@ 2021-03-11  9:40 ` Daniel P. Berrangé
  2021-03-11 19:41   ` Dr. David Alan Gilbert
  2021-03-12  3:13 ` Tian, Kevin
  2021-03-15 17:22 ` Cornelia Huck
  2 siblings, 1 reply; 12+ messages in thread
From: Daniel P. Berrangé @ 2021-03-11  9:40 UTC (permalink / raw)
  To: Tarun Gupta
  Cc: kevin.tian, cjia, quintela, cohuck, qemu-devel, Kirti Wankhede,
	dgilbert, lushenming, alex.williamson, dnigam, yan.y.zhao,
	philmd

On Thu, Mar 11, 2021 at 12:50:09AM +0530, Tarun Gupta wrote:
> Document interfaces used for VFIO device migration. Added flow of state changes
> during live migration with VFIO device. Tested by building docs with the new
> vfio-migration.rst file.
> 
> v2:
> - Included the new vfio-migration.rst file in index.rst
> - Updated dirty page tracking section, also added details about
>   'pre-copy-dirty-page-tracking' opt-out option.
> - Incorporated comments around wording of doc.
> 
> Signed-off-by: Tarun Gupta <targupta@nvidia.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  MAINTAINERS                   |   1 +
>  docs/devel/index.rst          |   1 +
>  docs/devel/vfio-migration.rst | 135 ++++++++++++++++++++++++++++++++++
>  3 files changed, 137 insertions(+)
>  create mode 100644 docs/devel/vfio-migration.rst


> +Postcopy
> +========
> +
> +Postcopy migration is not supported for VFIO devices.

What is the problem here and is there any plan for how to address it ?

Postcopy is essentially the only migration mechanism that can reliably
complete, so it really should be considered the default approach to
migration for all mgmt apps wanting to do migration, except in special
cases.   IOW, if we want VFIO migration to be viable, we need postcopy
support.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
  2021-03-11  9:40 ` Daniel P. Berrangé
@ 2021-03-11 19:41   ` Dr. David Alan Gilbert
  2021-03-12  2:30     ` Tian, Kevin
  0 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2021-03-11 19:41 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: kevin.tian, cjia, quintela, cohuck, qemu-devel, Kirti Wankhede,
	lushenming, alex.williamson, Tarun Gupta, yan.y.zhao, philmd,
	dnigam

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Thu, Mar 11, 2021 at 12:50:09AM +0530, Tarun Gupta wrote:
> > Document interfaces used for VFIO device migration. Added flow of state changes
> > during live migration with VFIO device. Tested by building docs with the new
> > vfio-migration.rst file.
> > 
> > v2:
> > - Included the new vfio-migration.rst file in index.rst
> > - Updated dirty page tracking section, also added details about
> >   'pre-copy-dirty-page-tracking' opt-out option.
> > - Incorporated comments around wording of doc.
> > 
> > Signed-off-by: Tarun Gupta <targupta@nvidia.com>
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > ---
> >  MAINTAINERS                   |   1 +
> >  docs/devel/index.rst          |   1 +
> >  docs/devel/vfio-migration.rst | 135 ++++++++++++++++++++++++++++++++++
> >  3 files changed, 137 insertions(+)
> >  create mode 100644 docs/devel/vfio-migration.rst
> 
> 
> > +Postcopy
> > +========
> > +
> > +Postcopy migration is not supported for VFIO devices.
> 
> What is the problem here and is there any plan for how to address it ?

There's no equivalent to userfaultfd for accesses to RAM made by a
device.
There's some potential for this to be doable with an IOMMU or the like,
but:
  a) IOMMUs and devices aren't currently happy at recovering from
failures
  b) the fragmentation you get during a postcopy probably isn't pretty
when you get to build IOMMU tables.

> Postcopy is essentially the only migration mechanism that can reliably
> complete, so it really should be considered the default approach to
> migration for all mgmt apps wanting to do migration, except in special
> cases.   IOW, if we want VFIO migration to be viable, we need postcopy
> support.

There's lots of other things postcopy doesn't work with; so hmm.

Dave

> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* RE: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
  2021-03-11 19:41   ` Dr. David Alan Gilbert
@ 2021-03-12  2:30     ` Tian, Kevin
  2021-03-16 15:46       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 12+ messages in thread
From: Tian, Kevin @ 2021-03-12  2:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Daniel P. Berrangé
  Cc: alex.williamson, cjia, quintela, cohuck, qemu-devel, lushenming,
	Kirti Wankhede, Tarun Gupta, Zhao, Yan Y, philmd, dnigam

> From: Qemu-devel <qemu-devel-bounces+kevin.tian=intel.com@nongnu.org>
> On Behalf Of Dr. David Alan Gilbert
> 
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Thu, Mar 11, 2021 at 12:50:09AM +0530, Tarun Gupta wrote:
> > > Document interfaces used for VFIO device migration. Added flow of state
> changes
> > > during live migration with VFIO device. Tested by building docs with the
> new
> > > vfio-migration.rst file.
> > >
> > > v2:
> > > - Included the new vfio-migration.rst file in index.rst
> > > - Updated dirty page tracking section, also added details about
> > >   'pre-copy-dirty-page-tracking' opt-out option.
> > > - Incorporated comments around wording of doc.
> > >
> > > Signed-off-by: Tarun Gupta <targupta@nvidia.com>
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > ---
> > >  MAINTAINERS                   |   1 +
> > >  docs/devel/index.rst          |   1 +
> > >  docs/devel/vfio-migration.rst | 135
> ++++++++++++++++++++++++++++++++++
> > >  3 files changed, 137 insertions(+)
> > >  create mode 100644 docs/devel/vfio-migration.rst
> >
> >
> > > +Postcopy
> > > +========
> > > +
> > > +Postcopy migration is not supported for VFIO devices.
> >
> > What is the problem here and is there any plan for how to address it ?
> 
> There's no equivalent to userfaultfd for accesses to RAM made by a
> device.
> There's some potential for this to be doable with an IOMMU or the like,
> but:
>   a) IOMMUs and devices aren't currently happy at recovering from
> failures
>   b) the fragementation you get during a postcopy probably isn't pretty
> when you get to build IOMMU tables.

To overcome such limitations one may adopt a prefault-and-pull scheme if 
the vendor driver has the capability to track pending DMA buffers in the
migration process (with additional uAPI changes in VFIO or userfaultfd), 
as discussed here:

https://static.sched.com/hosted_files/kvmforum2019/7a/kvm-forum-postcopy-final.pdf

> 
> > Postcopy is essentially the only migration mechanism that can reliably
> > complete, so it really should be considered the default approach to
> > migration for all mgmt apps wanting to do migration, except in special
> > cases.   IOW, if we want VFIO migration to be viable, we need postcopy
> > support.
> 
> There's lots of other things postcopy doesn't work with; so hmm.
> 

Agree. Also given the amount of work even for pre-copy migration, it makes 
more sense to do things step-by-step.

Thanks
Kevin
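The prefault-and-pull scheme referred to above can be sketched roughly as
follows. This is purely illustrative; none of these names exist in VFIO's
uAPI or userfaultfd at the time of this thread:

```python
# Rough model of prefault-and-pull postcopy: before the device resumes on
# the destination, pages the vendor driver reports as pending DMA targets
# are pulled eagerly; everything else arrives on demand, userfaultfd-style.
# Entirely hypothetical -- VFIO has no such uAPI today.

def postcopy_resume(pending_dma_pages, pull):
    present = set()
    # Prefault: pull every page with pending DMA against it, so the
    # device never faults on missing memory after it resumes.
    for page in pending_dma_pages:
        present.add(pull(page))

    # Remaining pages are pulled lazily when a vCPU touches them.
    def on_vcpu_fault(page):
        if page not in present:
            present.add(pull(page))
        return page

    return present, on_vcpu_fault

present, fault = postcopy_resume({1, 5}, pull=lambda p: p)
# DMA targets are resident before the device runs; page 3 arrives on fault.
fault(3)
```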



* RE: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
  2021-03-10 19:20 [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation Tarun Gupta
  2021-03-11  9:40 ` Daniel P. Berrangé
@ 2021-03-12  3:13 ` Tian, Kevin
  2021-03-16 13:34   ` Tarun Gupta (SW-GPU)
  2021-03-15 17:22 ` Cornelia Huck
  2 siblings, 1 reply; 12+ messages in thread
From: Tian, Kevin @ 2021-03-12  3:13 UTC (permalink / raw)
  To: Tarun Gupta, alex.williamson, qemu-devel, cohuck
  Cc: berrange, quintela, dgilbert, Zhao, Yan Y, lushenming,
	Kirti Wankhede, dnigam, cjia, philmd

> From: Tarun Gupta <targupta@nvidia.com>
> Sent: Thursday, March 11, 2021 3:20 AM
> 
> Document interfaces used for VFIO device migration. Added flow of state
> changes
> during live migration with VFIO device. Tested by building docs with the new
> vfio-migration.rst file.
> 
> v2:
> - Included the new vfio-migration.rst file in index.rst
> - Updated dirty page tracking section, also added details about
>   'pre-copy-dirty-page-tracking' opt-out option.
> - Incorporated comments around wording of doc.
> 
> Signed-off-by: Tarun Gupta <targupta@nvidia.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  MAINTAINERS                   |   1 +
>  docs/devel/index.rst          |   1 +
>  docs/devel/vfio-migration.rst | 135 ++++++++++++++++++++++++++++++++++
>  3 files changed, 137 insertions(+)
>  create mode 100644 docs/devel/vfio-migration.rst
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 738786146d..a2a80eee59 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1801,6 +1801,7 @@ M: Alex Williamson <alex.williamson@redhat.com>
>  S: Supported
>  F: hw/vfio/*
>  F: include/hw/vfio/
> +F: docs/devel/vfio-migration.rst
> 
>  vfio-ccw
>  M: Cornelia Huck <cohuck@redhat.com>
> diff --git a/docs/devel/index.rst b/docs/devel/index.rst
> index ae664da00c..5330f1ca1d 100644
> --- a/docs/devel/index.rst
> +++ b/docs/devel/index.rst
> @@ -39,3 +39,4 @@ Contents:
>     qom
>     block-coroutine-wrapper
>     multi-process
> +   vfio-migration
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> new file mode 100644
> index 0000000000..6196fb132c
> --- /dev/null
> +++ b/docs/devel/vfio-migration.rst
> @@ -0,0 +1,135 @@
> +=====================
> +VFIO device Migration
> +=====================
> +
> +VFIO devices use an iterative approach for migration because certain VFIO
> +devices (e.g. GPU) have large amount of data to be transfered. The iterative
> +pre-copy phase of migration allows for the guest to continue whilst the VFIO
> +device state is transferred to the destination, this helps to reduce the total
> +downtime of the VM. VFIO devices can choose to skip the pre-copy phase of
> +migration by returning pending_bytes as zero during the pre-copy phase.
> +
> +A detailed description of the UAPI for VFIO device migration can be found in
> +the comment for the ``vfio_device_migration_info`` structure in the header
> +file linux-headers/linux/vfio.h.
> +
> +VFIO device hooks for iterative approach:
> +
> +* A ``save_setup`` function that sets up the migration region, sets _SAVING
> +  flag in the VFIO device state and informs the VFIO IOMMU module to start
> +  dirty page tracking.
> +
> +* A ``load_setup`` function that sets up the migration region on the
> +  destination and sets _RESUMING flag in the VFIO device state.
> +
> +* A ``save_live_pending`` function that reads pending_bytes from the
> vendor
> +  driver, which indicates the amount of data that the vendor driver has yet
> to
> +  save for the VFIO device.
> +
> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> +  vendor driver through the migration region during iterative phase.
> +
> +* A ``save_live_complete_precopy`` function that resets _RUNNING flag
> from the
> +  VFIO device state, saves the device config space, if any, and iteratively

and if any, 

> +  copies the remaining data for the VFIO device untill the vendor driver
> +  indicates that no data remains (pending bytes is zero).
> +
> +* A ``load_state`` function that loads the config section and the data
> +  sections that are generated by the save functions above
> +
> +* ``cleanup`` functions for both save and load that perform any migration
> +  related cleanup, including unmapping the migration region
> +
> +A VM state change handler is registered to change the VFIO device state
> when
> +the VM state changes.
> +
> +Similarly, a migration state change notifier is registered to get a
> +notification on migration state change. These states are translated to VFIO
> +device state and conveyed to vendor driver.
> +
> +System memory dirty pages tracking
> +----------------------------------
> +
> +A ``log_sync`` memory listener callback marks those system memory pages
> +as dirty which are used for DMA by the VFIO device. The dirty pages bitmap
> is
> +queried per container. All pages pinned by the vendor driver through
> +vfio_pin_pages() external API have to be marked as dirty during migration.

why mention kernel internal functions in a userspace doc?

> +When there are CPU writes, CPU dirty page tracking can identify dirtied
> pages,
> +but any page pinned by the vendor driver can also be written by device.
> There
> +is currently no device which has hardware support for dirty page tracking.

no device or IOMMU support

> So
> +all pages which are pinned by vendor driver are considered as dirty.

Similarly, why do we care about how the kernel identifies whether a page is
dirty? It could be dirtied due to pinning, or due to the IOMMU dirty bit, or
due to an IOMMU page fault. Here we'd better just focus on the user-tangible
effect, e.g. a large/non-converging dirty map might be returned, and how to
handle such a situation...

> +
> +By default, dirty pages are tracked when the device is in pre-copy as well as
> +stop-and-copy phase. So, a page pinned by the vendor driver using
> +vfio_pin_pages() will be copied to destination in both the phases. Copying
> +dirty pages in pre-copy phase helps QEMU to predict if it can achieve its
> +downtime tolerances.

worthy of some elaboration on the last sentence.

> +
> +QEMU also provides a per device opt-out option ``pre-copy-dirty-page-
> tracking``
> +to disable dirty page tracking during pre-copy phase. If it is set to off, all

IIUC dirty page tracking is always enabled in vfio_save_setup. What this option
does is to skip syncing the dirty bitmap in vfio_listener_log_sync.

> +pinned pages will be copied to destination in stop-and-copy phase only.
> +
> +System memory dirty pages tracking when vIOMMU is enabled
> +---------------------------------------------------------
> +
> +With vIOMMU, an IO virtual address range can get unmapped while in pre-
> copy
> +phase of migration. In that case, the unmap ioctl returns any pinned pages
> in
> +that range and QEMU reports corresponding guest physical pages dirty.

pinned pages -> dirty pages

> During
> +stop-and-copy phase, an IOMMU notifier is used to get a callback for
> mapped
> +pages and then dirty pages bitmap is fetched from VFIO IOMMU modules
> for those
> +mapped ranges.
> +
> +Flow of state changes during Live migration
> +===========================================
> +
> +Below is the flow of state change during live migration.
> +The values in the brackets represent the VM state, the migration state, and
> +the VFIO device state, respectively.
> +
> +Live migration save path
> +------------------------
> +
> +::
> +
> +                        QEMU normal running state
> +                        (RUNNING, _NONE, _RUNNING)
> +                                  |
> +                     migrate_init spawns migration_thread
> +                Migration thread then calls each device's .save_setup()
> +                    (RUNNING, _SETUP, _RUNNING|_SAVING)
> +                                  |
> +                    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> +             If device is active, get pending_bytes by .save_live_pending()
> +          If total pending_bytes >= threshold_size, call .save_live_iterate()
> +                  Data of VFIO device for pre-copy phase is copied
> +        Iterate till total pending bytes converge and are less than threshold
> +                                  |
> +  On migration completion, vCPU stops and
> calls .save_live_complete_precopy for
> +   each active device. The VFIO device is then transitioned into _SAVING
> state
> +                   (FINISH_MIGRATE, _DEVICE, _SAVING)
> +                                  |
> +     For the VFIO device, iterate in .save_live_complete_precopy until
> +                         pending data is 0
> +                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
> +                                  |
> +                 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
> +             Migraton thread schedules cleanup bottom half and exits
> +
> +Live migration resume path
> +--------------------------
> +
> +::
> +
> +              Incoming migration calls .load_setup for each device
> +                       (RESTORE_VM, _ACTIVE, _STOPPED)
> +                                 |
> +       For each device, .load_state is called for that device section data
> +                       (RESTORE_VM, _ACTIVE, _RESUMING)
> +                                 |
> +    At the end, .load_cleanup is called for each device and vCPUs are started
> +                       (RUNNING, _NONE, _RUNNING)
> +
> +Postcopy
> +========
> +
> +Postcopy migration is not supported for VFIO devices.
> --
> 2.27.0

Thanks
Kevin
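The non-convergence concern above can be seen with a small model (names
illustrative, not QEMU's): because every pinned page is reported dirty on
every sync, the pre-copy pending set never shrinks for pinned memory, which
is what the ``pre-copy-dirty-page-tracking`` opt-out sidesteps:

```python
# Model of the dirty-bitmap behaviour the doc describes: without hardware
# dirty tracking, every page pinned by the vendor driver is reported dirty
# on every log_sync, so pre-copy never converges for pinned memory.

def log_sync(pinned_pages, cpu_dirty):
    # Pinned pages are always assumed device-written; CPU-dirtied pages
    # come from ordinary CPU dirty page tracking.
    return set(pinned_pages) | set(cpu_dirty)

pinned = {0, 1, 2, 3}
rounds = []
cpu_dirty = {7}
for _ in range(3):
    dirty = log_sync(pinned, cpu_dirty)
    rounds.append(len(dirty))
    cpu_dirty = set()  # CPU-dirtied pages converge once copied...
# ...but the pinned set is re-dirtied every round: rounds == [5, 4, 4]
```

With the opt-out set to off, those pinned pages are instead copied a single
time during the stop-and-copy phase.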



* Re: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
  2021-03-10 19:20 [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation Tarun Gupta
  2021-03-11  9:40 ` Daniel P. Berrangé
  2021-03-12  3:13 ` Tian, Kevin
@ 2021-03-15 17:22 ` Cornelia Huck
  2021-03-16 16:18   ` Tarun Gupta (SW-GPU)
  2 siblings, 1 reply; 12+ messages in thread
From: Cornelia Huck @ 2021-03-15 17:22 UTC (permalink / raw)
  To: Tarun Gupta
  Cc: kevin.tian, cjia, quintela, qemu-devel, Kirti Wankhede, dgilbert,
	yan.y.zhao, lushenming, alex.williamson, dnigam, berrange,
	philmd

On Thu, 11 Mar 2021 00:50:09 +0530
Tarun Gupta <targupta@nvidia.com> wrote:

> Document interfaces used for VFIO device migration. Added flow of state changes
> during live migration with VFIO device. Tested by building docs with the new
> vfio-migration.rst file.
> 
> v2:
> - Included the new vfio-migration.rst file in index.rst
> - Updated dirty page tracking section, also added details about
>   'pre-copy-dirty-page-tracking' opt-out option.
> - Incorporated comments around wording of doc.
> 
> Signed-off-by: Tarun Gupta <targupta@nvidia.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  MAINTAINERS                   |   1 +
>  docs/devel/index.rst          |   1 +
>  docs/devel/vfio-migration.rst | 135 ++++++++++++++++++++++++++++++++++
>  3 files changed, 137 insertions(+)
>  create mode 100644 docs/devel/vfio-migration.rst
> 

(...)

> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> new file mode 100644
> index 0000000000..6196fb132c
> --- /dev/null
> +++ b/docs/devel/vfio-migration.rst
> @@ -0,0 +1,135 @@
> +=====================
> +VFIO device Migration
> +=====================

Maybe add an introductory sentence or two describing the general
approach? I.e. we have a general framework, and specific support for
devices needs to be hooked up.

> +
> +VFIO devices use an iterative approach for migration because certain VFIO
> +devices (e.g. GPU) have large amount of data to be transfered. The iterative
> +pre-copy phase of migration allows for the guest to continue whilst the VFIO
> +device state is transferred to the destination, this helps to reduce the total
> +downtime of the VM. VFIO devices can choose to skip the pre-copy phase of
> +migration by returning pending_bytes as zero during the pre-copy phase.

What about something like:

"Migration of VFIO devices consists of two phases: the optional
pre-copy phase, and the stop-and-copy phase. The pre-copy phase is
iterative and allows to accommodate VFIO devices that have a large
amount of data that needs to be transferred. The iterative pre-copy
phase..."

> +
> +A detailed description of the UAPI for VFIO device migration can be found in
> +the comment for the ``vfio_device_migration_info`` structure in the header
> +file linux-headers/linux/vfio.h.
> +
> +VFIO device hooks for iterative approach:
> +
> +* A ``save_setup`` function that sets up the migration region, sets _SAVING
> +  flag in the VFIO device state and informs the VFIO IOMMU module to start
> +  dirty page tracking.
> +
> +* A ``load_setup`` function that sets up the migration region on the
> +  destination and sets _RESUMING flag in the VFIO device state.
> +
> +* A ``save_live_pending`` function that reads pending_bytes from the vendor
> +  driver, which indicates the amount of data that the vendor driver has yet to
> +  save for the VFIO device.
> +
> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> +  vendor driver through the migration region during iterative phase.
> +
> +* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
> +  VFIO device state, saves the device config space, if any, and iteratively
> +  copies the remaining data for the VFIO device untill the vendor driver

s/untill/until/

> +  indicates that no data remains (pending bytes is zero).
> +
> +* A ``load_state`` function that loads the config section and the data
> +  sections that are generated by the save functions above
> +
> +* ``cleanup`` functions for both save and load that perform any migration
> +  related cleanup, including unmapping the migration region
> +
> +A VM state change handler is registered to change the VFIO device state when
> +the VM state changes.
> +
> +Similarly, a migration state change notifier is registered to get a
> +notification on migration state change. These states are translated to VFIO

s/to/to the corresponding/

> +device state and conveyed to vendor driver.

s/to/to the/

(...)

> +Postcopy
> +========
> +
> +Postcopy migration is not supported for VFIO devices.

s/is not/is currently not/ ?




* Re: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
  2021-03-12  3:13 ` Tian, Kevin
@ 2021-03-16 13:34   ` Tarun Gupta (SW-GPU)
  2021-03-17  1:51     ` Tian, Kevin
  0 siblings, 1 reply; 12+ messages in thread
From: Tarun Gupta (SW-GPU) @ 2021-03-16 13:34 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, qemu-devel, cohuck
  Cc: berrange, quintela, dgilbert, Zhao, Yan Y, lushenming,
	Kirti Wankhede, dnigam, cjia, philmd


On 3/12/2021 8:43 AM, Tian, Kevin wrote:
> 
>> From: Tarun Gupta <targupta@nvidia.com>
>> Sent: Thursday, March 11, 2021 3:20 AM
>>
>> Document interfaces used for VFIO device migration. Added flow of state
>> changes
>> during live migration with VFIO device. Tested by building docs with the new
>> vfio-migration.rst file.
>>
>> v2:
>> - Included the new vfio-migration.rst file in index.rst
>> - Updated dirty page tracking section, also added details about
>>    'pre-copy-dirty-page-tracking' opt-out option.
>> - Incorporated comments around wording of doc.
>>
>> Signed-off-by: Tarun Gupta <targupta@nvidia.com>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> ---
>>   MAINTAINERS                   |   1 +
>>   docs/devel/index.rst          |   1 +
>>   docs/devel/vfio-migration.rst | 135 ++++++++++++++++++++++++++++++++++
>>   3 files changed, 137 insertions(+)
>>   create mode 100644 docs/devel/vfio-migration.rst
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 738786146d..a2a80eee59 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1801,6 +1801,7 @@ M: Alex Williamson <alex.williamson@redhat.com>
>>   S: Supported
>>   F: hw/vfio/*
>>   F: include/hw/vfio/
>> +F: docs/devel/vfio-migration.rst
>>
>>   vfio-ccw
>>   M: Cornelia Huck <cohuck@redhat.com>
>> diff --git a/docs/devel/index.rst b/docs/devel/index.rst
>> index ae664da00c..5330f1ca1d 100644
>> --- a/docs/devel/index.rst
>> +++ b/docs/devel/index.rst
>> @@ -39,3 +39,4 @@ Contents:
>>      qom
>>      block-coroutine-wrapper
>>      multi-process
>> +   vfio-migration
>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>> new file mode 100644
>> index 0000000000..6196fb132c
>> --- /dev/null
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -0,0 +1,135 @@
>> +=====================
>> +VFIO device Migration
>> +=====================
>> +
>> +VFIO devices use an iterative approach for migration because certain VFIO
>> +devices (e.g. GPU) have large amount of data to be transfered. The iterative
>> +pre-copy phase of migration allows for the guest to continue whilst the VFIO
>> +device state is transferred to the destination, this helps to reduce the total
>> +downtime of the VM. VFIO devices can choose to skip the pre-copy phase of
>> +migration by returning pending_bytes as zero during the pre-copy phase.
>> +
>> +A detailed description of the UAPI for VFIO device migration can be found in
>> +the comment for the ``vfio_device_migration_info`` structure in the header
>> +file linux-headers/linux/vfio.h.
>> +
>> +VFIO device hooks for iterative approach:
>> +
>> +* A ``save_setup`` function that sets up the migration region, sets _SAVING
>> +  flag in the VFIO device state and informs the VFIO IOMMU module to start
>> +  dirty page tracking.
>> +
>> +* A ``load_setup`` function that sets up the migration region on the
>> +  destination and sets _RESUMING flag in the VFIO device state.
>> +
>> +* A ``save_live_pending`` function that reads pending_bytes from the
>> vendor
>> +  driver, which indicates the amount of data that the vendor driver has yet
>> to
>> +  save for the VFIO device.
>> +
>> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
>> +  vendor driver through the migration region during iterative phase.
>> +
>> +* A ``save_live_complete_precopy`` function that resets _RUNNING flag
>> from the
>> +  VFIO device state, saves the device config space, if any, and iteratively
> 
> and if any,

I didn't get this. I intended to say that it will save the device config 
space only if it is present.
So, I used "saves device config space, if any".

> 
>> +  copies the remaining data for the VFIO device untill the vendor driver
>> +  indicates that no data remains (pending bytes is zero).
>> +
>> +* A ``load_state`` function that loads the config section and the data
>> +  sections that are generated by the save functions above
>> +
>> +* ``cleanup`` functions for both save and load that perform any migration
>> +  related cleanup, including unmapping the migration region
>> +
>> +A VM state change handler is registered to change the VFIO device state
>> when
>> +the VM state changes.
>> +
>> +Similarly, a migration state change notifier is registered to get a
>> +notification on migration state change. These states are translated to VFIO
>> +device state and conveyed to vendor driver.
>> +
>> +System memory dirty pages tracking
>> +----------------------------------
>> +
>> +A ``log_sync`` memory listener callback marks those system memory pages
>> +as dirty which are used for DMA by the VFIO device. The dirty pages bitmap
>> is
>> +queried per container. All pages pinned by the vendor driver through
>> +vfio_pin_pages() external API have to be marked as dirty during migration.
> 
> why mention kernel internal functions in an userspace doc?

I'll remove the mention of vfio_pin_pages() and just mention "external API".

> 
>> +When there are CPU writes, CPU dirty page tracking can identify dirtied
>> pages,
>> +but any page pinned by the vendor driver can also be written by device.
>> There
>> +is currently no device which has hardware support for dirty page tracking.
> 
> no device or IOMMU support

Right, will update it.

> 
>> So
>> +all pages which are pinned by vendor driver are considered as dirty.
> 
> Similarly, why do we care about how the kernel identifies whether a page is
> dirty. It could be dirtied due to pinning, or due to IOMMU dirty bit, or due
> to IOMMU page fault. Here we'd better just focus on user-tangible effect,
> e.g. a large/non-converging dirty map might be returned then how to handle
> such situation...

Since the VFIO migration feature is implemented not just in userspace but 
also in kernel space, I have documented here what is implemented as of 
now.

> 
>> +
>> +By default, dirty pages are tracked when the device is in pre-copy as well as
>> +stop-and-copy phase. So, a page pinned by the vendor driver using
>> +vfio_pin_pages() will be copied to destination in both the phases. Copying
>> +dirty pages in pre-copy phase helps QEMU to predict if it can achieve its
>> +downtime tolerances.
> 
> worthy of some elaboration on the last sentence.

How about below?
"Copying dirty pages in pre-copy phase helps QEMU to predict if it can 
achieve its downtime tolerances. If QEMU during pre-copy phase keeps 
finding dirty pages continuously, then it understands that even in 
stop-and-copy phase, it is likely to find dirty pages and can predict 
the downtime accordingly."

> 
>> +
>> +QEMU also provides a per device opt-out option ``pre-copy-dirty-page-
>> tracking``
>> +to disable dirty page tracking during pre-copy phase. If it is set to off, all
> 
> IIUC dirty page tracking is always enabled in vfio_save_setup. What this option
> does is to skip sync-ing dirty bitmap in vfio_listerner_log_sync.

I'll update it as below?

"QEMU also provides a per device opt-out option 
``pre-copy-dirty-page-tracking`` which disables querying dirty bitmap 
during pre-copy phase.
If it is set to off, all dirty pages will be copied to destination in 
stop-and-copy phase only."
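For reference, the opt-out is requested per device on the QEMU command line, roughly as below. The property spelling follows the patch text; depending on the QEMU version it may carry an experimental ``x-`` prefix, and the host address is of course just an example:

```
qemu-system-x86_64 ... \
    -device vfio-pci,host=0000:65:00.0,pre-copy-dirty-page-tracking=off
```

With this set, pinned pages are copied to the destination during stop-and-copy only.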

> 
>> +pinned pages will be copied to destination in stop-and-copy phase only.
>> +
>> +System memory dirty pages tracking when vIOMMU is enabled
>> +---------------------------------------------------------
>> +
>> +With vIOMMU, an IO virtual address range can get unmapped while in pre-
>> copy
>> +phase of migration. In that case, the unmap ioctl returns any pinned pages
>> in
>> +that range and QEMU reports corresponding guest physical pages dirty.
> 
> pinned pages -> dirty pages

Currently, all pinned pages are dirty pages.
But agreed that "dirty pages" might be more accurate here; will update it.

Thanks,
Tarun

> 
>> During
>> +stop-and-copy phase, an IOMMU notifier is used to get a callback for
>> mapped
>> +pages and then dirty pages bitmap is fetched from VFIO IOMMU modules
>> for those
>> +mapped ranges.
>> +
>> +Flow of state changes during Live migration
>> +===========================================
>> +
>> +Below is the flow of state change during live migration.
>> +The values in the brackets represent the VM state, the migration state, and
>> +the VFIO device state, respectively.
>> +
>> +Live migration save path
>> +------------------------
>> +
>> +::
>> +
>> +                        QEMU normal running state
>> +                        (RUNNING, _NONE, _RUNNING)
>> +                                  |
>> +                     migrate_init spawns migration_thread
>> +                Migration thread then calls each device's .save_setup()
>> +                    (RUNNING, _SETUP, _RUNNING|_SAVING)
>> +                                  |
>> +                    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
>> +             If device is active, get pending_bytes by .save_live_pending()
>> +          If total pending_bytes >= threshold_size, call .save_live_iterate()
>> +                  Data of VFIO device for pre-copy phase is copied
>> +        Iterate till total pending bytes converge and are less than threshold
>> +                                  |
>> +  On migration completion, vCPU stops and
>> calls .save_live_complete_precopy for
>> +   each active device. The VFIO device is then transitioned into _SAVING
>> state
>> +                   (FINISH_MIGRATE, _DEVICE, _SAVING)
>> +                                  |
>> +     For the VFIO device, iterate in .save_live_complete_precopy until
>> +                         pending data is 0
>> +                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
>> +                                  |
>> +                 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
>> +             Migraton thread schedules cleanup bottom half and exits
>> +
>> +Live migration resume path
>> +--------------------------
>> +
>> +::
>> +
>> +              Incoming migration calls .load_setup for each device
>> +                       (RESTORE_VM, _ACTIVE, _STOPPED)
>> +                                 |
>> +       For each device, .load_state is called for that device section data
>> +                       (RESTORE_VM, _ACTIVE, _RESUMING)
>> +                                 |
>> +    At the end, .load_cleanup is called for each device and vCPUs are started
>> +                       (RUNNING, _NONE, _RUNNING)
>> +
>> +Postcopy
>> +========
>> +
>> +Postcopy migration is not supported for VFIO devices.
>> --
>> 2.27.0
> 
> Thanks
> Kevin
> 



* Re: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
  2021-03-12  2:30     ` Tian, Kevin
@ 2021-03-16 15:46       ` Dr. David Alan Gilbert
  2021-03-17  1:44         ` Tian, Kevin
  0 siblings, 1 reply; 12+ messages in thread
From: Dr. David Alan Gilbert @ 2021-03-16 15:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: alex.williamson, cjia, quintela, cohuck, qemu-devel, Zhao, Yan Y,
	lushenming, Kirti Wankhede, Tarun Gupta, Daniel P. Berrangé,
	philmd, dnigam

* Tian, Kevin (kevin.tian@intel.com) wrote:
> > From: Qemu-devel <qemu-devel-bounces+kevin.tian=intel.com@nongnu.org>
> > On Behalf Of Dr. David Alan Gilbert
> > 
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Thu, Mar 11, 2021 at 12:50:09AM +0530, Tarun Gupta wrote:
> > > > Document interfaces used for VFIO device migration. Added flow of state
> > changes
> > > > during live migration with VFIO device. Tested by building docs with the
> > new
> > > > vfio-migration.rst file.
> > > >
> > > > v2:
> > > > - Included the new vfio-migration.rst file in index.rst
> > > > - Updated dirty page tracking section, also added details about
> > > >   'pre-copy-dirty-page-tracking' opt-out option.
> > > > - Incorporated comments around wording of doc.
> > > >
> > > > Signed-off-by: Tarun Gupta <targupta@nvidia.com>
> > > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > > ---
> > > >  MAINTAINERS                   |   1 +
> > > >  docs/devel/index.rst          |   1 +
> > > >  docs/devel/vfio-migration.rst | 135
> > ++++++++++++++++++++++++++++++++++
> > > >  3 files changed, 137 insertions(+)
> > > >  create mode 100644 docs/devel/vfio-migration.rst
> > >
> > >
> > > > +Postcopy
> > > > +========
> > > > +
> > > > +Postcopy migration is not supported for VFIO devices.
> > >
> > > What is the problem here and is there any plan for how to address it ?
> > 
> > There's no equivalent to userfaultfd for accesses to RAM made by a
> > device.
> > There's some potential for this to be doable with an IOMMU or the like,
> > but:
> >   a) IOMMUs and devices aren't currently happy at recovering from
> > failures
> >   b) the fragementation you get during a postcopy probably isn't pretty
> > when you get to build IOMMU tables.
> 
> To overcome such limitations one may adopt a prefault-and-pull scheme if 
> the vendor driver has the capability to track pending DMA buffers in the
> migration process (with additional uAPI changes in VFIO or userfaultfd), 
> as discussed here:
> 
> https://static.sched.com/hosted_files/kvmforum2019/7a/kvm-forum-postcopy-final.pdf

Did that get any further?

I can imagine that might be trickier for a GPU than a network card; the
shaders in a GPU are pretty random as to what they go off and access, so
I can't see how you could prefault.

Dave

> > 
> > > Postcopy is essentially the only migration mechanism that can reliably
> > > complete, so it really should be considered the default approach to
> > > migration for all mgmt apps wanting to do migration, except in special
> > > cases.   IOW, if we want VFIO migration to be viable, we need postcopy
> > > support.
> > 
> > There's lots of other things postcopy doesn't work with; so hmm.
> > 
> 
> Agree. Also given the amount of work even for pre-copy migration, it makes 
> more sense to do things step-by-step.
> 
> Thanks
> Kevin
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
  2021-03-15 17:22 ` Cornelia Huck
@ 2021-03-16 16:18   ` Tarun Gupta (SW-GPU)
  2021-03-18 12:28     ` Cornelia Huck
  0 siblings, 1 reply; 12+ messages in thread
From: Tarun Gupta (SW-GPU) @ 2021-03-16 16:18 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kevin.tian, cjia, quintela, qemu-devel, Kirti Wankhede, dgilbert,
	yan.y.zhao, lushenming, alex.williamson, dnigam, berrange,
	philmd



On 3/15/2021 10:52 PM, Cornelia Huck wrote:
> 
> On Thu, 11 Mar 2021 00:50:09 +0530
> Tarun Gupta <targupta@nvidia.com> wrote:
> 
>> Document interfaces used for VFIO device migration. Added flow of state changes
>> during live migration with VFIO device. Tested by building docs with the new
>> vfio-migration.rst file.
>>
>> v2:
>> - Included the new vfio-migration.rst file in index.rst
>> - Updated dirty page tracking section, also added details about
>>    'pre-copy-dirty-page-tracking' opt-out option.
>> - Incorporated comments around wording of doc.
>>
>> Signed-off-by: Tarun Gupta <targupta@nvidia.com>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> ---
>>   MAINTAINERS                   |   1 +
>>   docs/devel/index.rst          |   1 +
>>   docs/devel/vfio-migration.rst | 135 ++++++++++++++++++++++++++++++++++
>>   3 files changed, 137 insertions(+)
>>   create mode 100644 docs/devel/vfio-migration.rst
>>
> 
> (...)
> 
>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>> new file mode 100644
>> index 0000000000..6196fb132c
>> --- /dev/null
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -0,0 +1,135 @@
>> +=====================
>> +VFIO device Migration
>> +=====================
> 
> Maybe add an introductory sentence or two describing the general
> approach? I.e. we have a general framework, and specific support for
> devices needs to be hooked up.

Ummm, the below paragraph does describe the approach we're using for the 
migration framework, involving the pre-copy and stop-and-copy phases.
Can you help elaborate more on the general approach you'd like to have?

> 
>> +
>> +VFIO devices use an iterative approach for migration because certain VFIO
>> +devices (e.g. GPU) have large amount of data to be transfered. The iterative
>> +pre-copy phase of migration allows for the guest to continue whilst the VFIO
>> +device state is transferred to the destination, this helps to reduce the total
>> +downtime of the VM. VFIO devices can choose to skip the pre-copy phase of
>> +migration by returning pending_bytes as zero during the pre-copy phase.
> 
> What about something like:
> 
> "Migration of VFIO devices consists of two phases: the optional
> pre-copy phase, and the stop-and-copy phase. The pre-copy phase is
> iterative and allows to accommodate VFIO devices that have a large
> amount of data that needs to be transferred. The iterative pre-copy
> phase..."
> 

Thanks, this looks better. I'll update it in the next version, 
incorporating the other comments below as well.

Thanks,
Tarun

>> +
>> +A detailed description of the UAPI for VFIO device migration can be found in
>> +the comment for the ``vfio_device_migration_info`` structure in the header
>> +file linux-headers/linux/vfio.h.
>> +
>> +VFIO device hooks for iterative approach:
>> +
>> +* A ``save_setup`` function that sets up the migration region, sets _SAVING
>> +  flag in the VFIO device state and informs the VFIO IOMMU module to start
>> +  dirty page tracking.
>> +
>> +* A ``load_setup`` function that sets up the migration region on the
>> +  destination and sets _RESUMING flag in the VFIO device state.
>> +
>> +* A ``save_live_pending`` function that reads pending_bytes from the vendor
>> +  driver, which indicates the amount of data that the vendor driver has yet to
>> +  save for the VFIO device.
>> +
>> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
>> +  vendor driver through the migration region during iterative phase.
>> +
>> +* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
>> +  VFIO device state, saves the device config space, if any, and iteratively
>> +  copies the remaining data for the VFIO device untill the vendor driver
> 
> s/untill/until/
> 
>> +  indicates that no data remains (pending bytes is zero).
>> +
>> +* A ``load_state`` function that loads the config section and the data
>> +  sections that are generated by the save functions above
>> +
>> +* ``cleanup`` functions for both save and load that perform any migration
>> +  related cleanup, including unmapping the migration region
>> +
>> +A VM state change handler is registered to change the VFIO device state when
>> +the VM state changes.
>> +
>> +Similarly, a migration state change notifier is registered to get a
>> +notification on migration state change. These states are translated to VFIO
> 
> s/to/to the corresponding/
> 
>> +device state and conveyed to vendor driver.
> 
> s/to/to the/
> 
> (...)
> 
>> +Postcopy
>> +========
>> +
>> +Postcopy migration is not supported for VFIO devices.
> 
> s/is not/is currently not/ ?
> 



* RE: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
  2021-03-16 15:46       ` Dr. David Alan Gilbert
@ 2021-03-17  1:44         ` Tian, Kevin
  0 siblings, 0 replies; 12+ messages in thread
From: Tian, Kevin @ 2021-03-17  1:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: alex.williamson, cjia, quintela, cohuck, qemu-devel, Zhao, Yan Y,
	lushenming, Kirti Wankhede, Tarun Gupta, Daniel P. Berrangé,
	philmd, dnigam

> From: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Sent: Tuesday, March 16, 2021 11:47 PM
> 
> * Tian, Kevin (kevin.tian@intel.com) wrote:
> > > From: Qemu-devel <qemu-devel-
> bounces+kevin.tian=intel.com@nongnu.org>
> > > On Behalf Of Dr. David Alan Gilbert
> > >
> > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > On Thu, Mar 11, 2021 at 12:50:09AM +0530, Tarun Gupta wrote:
> > > > > Document interfaces used for VFIO device migration. Added flow of
> state
> > > changes
> > > > > during live migration with VFIO device. Tested by building docs with
> the
> > > new
> > > > > vfio-migration.rst file.
> > > > >
> > > > > v2:
> > > > > - Included the new vfio-migration.rst file in index.rst
> > > > > - Updated dirty page tracking section, also added details about
> > > > >   'pre-copy-dirty-page-tracking' opt-out option.
> > > > > - Incorporated comments around wording of doc.
> > > > >
> > > > > Signed-off-by: Tarun Gupta <targupta@nvidia.com>
> > > > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > > > ---
> > > > >  MAINTAINERS                   |   1 +
> > > > >  docs/devel/index.rst          |   1 +
> > > > >  docs/devel/vfio-migration.rst | 135
> > > ++++++++++++++++++++++++++++++++++
> > > > >  3 files changed, 137 insertions(+)
> > > > >  create mode 100644 docs/devel/vfio-migration.rst
> > > >
> > > >
> > > > > +Postcopy
> > > > > +========
> > > > > +
> > > > > +Postcopy migration is not supported for VFIO devices.
> > > >
> > > > What is the problem here and is there any plan for how to address it ?
> > >
> > > There's no equivalent to userfaultfd for accesses to RAM made by a
> > > device.
> > > There's some potential for this to be doable with an IOMMU or the like,
> > > but:
> > >   a) IOMMUs and devices aren't currently happy at recovering from
> > > failures
> > >   b) the fragementation you get during a postcopy probably isn't pretty
> > > when you get to build IOMMU tables.
> >
> > To overcome such limitations one may adopt a prefault-and-pull scheme if
> > the vendor driver has the capability to track pending DMA buffers in the
> > migration process (with additional uAPI changes in VFIO or userfaultfd),
> > as discussed here:
> >
> > https://static.sched.com/hosted_files/kvmforum2019/7a/kvm-forum-
> postcopy-final.pdf
> 
> Did that get any further?

Not yet. As you may see in another thread, even the precopy side still needs
some time to land.

> 
> I can imagine that might be tricikier for a GPU than a network card; the
> shaders in a GPU are pretty random as to what they go off and access, so
> I can't see how you could prefault.

Trickier but not impossible. E.g. any page accessed by shaders must be 
mapped into the GPU page table, which could serve as the coarse-grained 
interface to track in-use DMA buffers. However, this set could be large, 
and prefaulting a large set of pages weakens the benefit of postcopy. 
There may also be vendor-specific methods to do fine-grained tracking. 
Nevertheless, it does provide a postcopy option for many devices which 
don't support I/O page faults, though actual feasibility is vendor 
specific.

Thanks
Kevin



* RE: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
  2021-03-16 13:34   ` Tarun Gupta (SW-GPU)
@ 2021-03-17  1:51     ` Tian, Kevin
  0 siblings, 0 replies; 12+ messages in thread
From: Tian, Kevin @ 2021-03-17  1:51 UTC (permalink / raw)
  To: Tarun Gupta (SW-GPU), alex.williamson, qemu-devel, cohuck
  Cc: berrange, quintela, dgilbert, Zhao, Yan Y, lushenming,
	Kirti Wankhede, dnigam, cjia, philmd

> From: Tarun Gupta (SW-GPU) <targupta@nvidia.com>
> Sent: Tuesday, March 16, 2021 9:35 PM
> 
> 
> >> +
> >> +* A ``save_live_iterate`` function that reads the VFIO device's data from
> the
> >> +  vendor driver through the migration region during iterative phase.
> >> +
> >> +* A ``save_live_complete_precopy`` function that resets _RUNNING flag
> >> from the
> >> +  VFIO device state, saves the device config space, if any, and iteratively
> >
> > and if any,
> 
> I didn't get this. I intended to say that it will save the device config
> space only if it is present.
> So, used "saves device config space, if any".

I misread it, with the impression that 'if any' is for 'iteratively copy'.

> >> +
> >> +System memory dirty pages tracking
> >> +----------------------------------
> >> +
> >> +A ``log_sync`` memory listener callback marks those system memory
> pages
> >> +as dirty which are used for DMA by the VFIO device. The dirty pages
> bitmap
> >> is
> >> +queried per container. All pages pinned by the vendor driver through
> >> +vfio_pin_pages() external API have to be marked as dirty during
> migration.
> >
> > why mention kernel internal functions in an userspace doc?
> 
> I'll remove the mention of vfio_pin_pages() and just mention "external API".
> 
> >
> >> +When there are CPU writes, CPU dirty page tracking can identify dirtied
> >> pages,
> >> +but any page pinned by the vendor driver can also be written by device.
> >> There
> >> +is currently no device which has hardware support for dirty page tracking.
> >
> > no device or IOMMU support
> 
> Right, will update it.
> 
> >
> >> So
> >> +all pages which are pinned by vendor driver are considered as dirty.
> >
> > Similarly, why do we care about how the kernel identifies whether a page is
> > dirty. It could be dirtied due to pinning, or due to IOMMU dirty bit, or due
> > to IOMMU page fault. Here we'd better just focus on user-tangible effect,
> > e.g. a large/non-converging dirty map might be returned then how to
> handle
> > such situation...
> 
> Since VFIO migration feature is not just implemented in userspace but
> also involves implementation in kernel space as well, have documented
> here what is implemented as of now.

But userspace only needs to care about the implications of the uAPI, 
instead of the internal details in the kernel. You might take the above 
pinning detail as one example to explain that the size of the dirty page 
set might be big and static, and then explain how userspace should cope 
with this situation, but don't describe it as an assumed fact.

Thanks
Kevin


* Re: [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation
  2021-03-16 16:18   ` Tarun Gupta (SW-GPU)
@ 2021-03-18 12:28     ` Cornelia Huck
  0 siblings, 0 replies; 12+ messages in thread
From: Cornelia Huck @ 2021-03-18 12:28 UTC (permalink / raw)
  To: Tarun Gupta (SW-GPU)
  Cc: kevin.tian, cjia, quintela, qemu-devel, Kirti Wankhede, dgilbert,
	yan.y.zhao, lushenming, alex.williamson, dnigam, berrange,
	philmd

On Tue, 16 Mar 2021 21:48:38 +0530
"Tarun Gupta (SW-GPU)" <targupta@nvidia.com> wrote:

> On 3/15/2021 10:52 PM, Cornelia Huck wrote:
> > 
> > On Thu, 11 Mar 2021 00:50:09 +0530
> > Tarun Gupta <targupta@nvidia.com> wrote:
> >   
> >> Document interfaces used for VFIO device migration. Added flow of state changes
> >> during live migration with VFIO device. Tested by building docs with the new
> >> vfio-migration.rst file.
> >>
> >> v2:
> >> - Included the new vfio-migration.rst file in index.rst
> >> - Updated dirty page tracking section, also added details about
> >>    'pre-copy-dirty-page-tracking' opt-out option.
> >> - Incorporated comments around wording of doc.
> >>
> >> Signed-off-by: Tarun Gupta <targupta@nvidia.com>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> ---
> >>   MAINTAINERS                   |   1 +
> >>   docs/devel/index.rst          |   1 +
> >>   docs/devel/vfio-migration.rst | 135 ++++++++++++++++++++++++++++++++++
> >>   3 files changed, 137 insertions(+)
> >>   create mode 100644 docs/devel/vfio-migration.rst
> >>  
> > 
> > (...)
> >   
> >> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> >> new file mode 100644
> >> index 0000000000..6196fb132c
> >> --- /dev/null
> >> +++ b/docs/devel/vfio-migration.rst
> >> @@ -0,0 +1,135 @@
> >> +=====================
> >> +VFIO device Migration
> >> +=====================  
> > 
> > Maybe add an introductory sentence or two describing the general
> > approach? I.e. we have a general framework, and specific support for
> > devices needs to be hooked up.  
> 
> Ummm, the below paragraph does describe the approach we're using for the 
> migration framework involving pre-copy and stop-and-copy phase.
> Can you help elaborate more on the general approach you'd like to have?

The document dives right in with how vfio devices are using an
iterative approach etc. A quick overview of the general setup (before
you get to the different phases) might be helpful, i.e. who
does what. I'm not sure what we can expect a reader of this document to
know already.

> 
> >   
> >> +
> >> +VFIO devices use an iterative approach for migration because certain VFIO
> >> +devices (e.g. GPU) have large amount of data to be transfered. The iterative
> >> +pre-copy phase of migration allows for the guest to continue whilst the VFIO
> >> +device state is transferred to the destination, this helps to reduce the total
> >> +downtime of the VM. VFIO devices can choose to skip the pre-copy phase of
> >> +migration by returning pending_bytes as zero during the pre-copy phase.  
> > 
> > What about something like:
> > 
> > "Migration of VFIO devices consists of two phases: the optional
> > pre-copy phase, and the stop-and-copy phase. The pre-copy phase is
> > iterative and allows to accommodate VFIO devices that have a large
> > amount of data that needs to be transferred. The iterative pre-copy
> > phase..."
> >   
> 
> Thanks, this looks better. I'll update it in next version incorporating 
> the other comments too below.





Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-10 19:20 [PATCH v2 1/1] docs/devel: Add VFIO device migration documentation Tarun Gupta
2021-03-11  9:40 ` Daniel P. Berrangé
2021-03-11 19:41   ` Dr. David Alan Gilbert
2021-03-12  2:30     ` Tian, Kevin
2021-03-16 15:46       ` Dr. David Alan Gilbert
2021-03-17  1:44         ` Tian, Kevin
2021-03-12  3:13 ` Tian, Kevin
2021-03-16 13:34   ` Tarun Gupta (SW-GPU)
2021-03-17  1:51     ` Tian, Kevin
2021-03-15 17:22 ` Cornelia Huck
2021-03-16 16:18   ` Tarun Gupta (SW-GPU)
2021-03-18 12:28     ` Cornelia Huck
